(This article was first published on **#untitled**, and kindly contributed to R-bloggers)

If you are mining a text source or doing some language modelling, you often need to sample from a text corpus. I was wondering how to write a function that randomly selects chunks (here, lines) of a text file efficiently and does the job fast. I wanted to keep the function as simple as possible, using only base R functions.

###### Solution

I strongly encourage you to use the `scan` function, which is part of base R:

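```
sampleText <- function(filename, total, sampleSize) {
  # Draw sampleSize distinct line numbers from 1..total
  lineNumbers <- sample(total, sampleSize)
  result <- vector("list", sampleSize)
  for (i in seq_along(lineNumbers)) {
    # skip = n - 1 so that line n itself is the one that gets read
    result[[i]] <- scan(filename, what = "character", skip = lineNumbers[i] - 1,
                        nlines = 1, sep = "\n", strip.white = TRUE,
                        fileEncoding = "UTF-8", quiet = TRUE)
  }
  # Flatten the per-line results into a single character vector
  unlist(result)
}
```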

And here’s how to use it:

```
sample <- sampleText('path_to_file.txt', 2000, 10)
```

Inspecting the result:

```
> class(sample)
[1] "character"
> sample
[1] "It's a cloudy day"
[2] "But I did want her to have at least PART of my imaginary Paris experience, so I used her pretty Paris stamp to make her birthday card."
...
[10] "Our smartest friend (Zachary)- was nice enough to study the injection and give us some information to share with everyone..."
```
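The `total` argument (2000 in the call above) should not exceed the number of lines in the file, so it helps to know that number up front. The following is a minimal sketch of one way to count lines without loading the whole file; `countLines` and its `chunkSize` parameter are hypothetical names introduced here for illustration, not part of the original post.

```r
# Hypothetical helper (not from the original post): count the lines of a
# text file by streaming it through a connection in fixed-size chunks.
countLines <- function(filename, chunkSize = 10000L) {
  con <- file(filename, open = "r", encoding = "UTF-8")
  on.exit(close(con))  # always release the connection
  total <- 0L
  repeat {
    chunk <- readLines(con, n = chunkSize, warn = FALSE)
    if (length(chunk) == 0L) break  # end of file reached
    total <- total + length(chunk)
  }
  total
}

# Demo on a throwaway file with 25 numbered lines
tmp <- tempfile(fileext = ".txt")
writeLines(sprintf("line %d", 1:25), tmp)
nLines <- countLines(tmp, chunkSize = 10L)
```

The resulting count could then be passed as `total`, for example `sampleText(tmp, nLines, 5)`.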

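**NOTE**: The function stays light on memory because each `scan` call skips ahead and reads only the one requested line, rather than loading the whole file. If your file uses a different encoding, change `fileEncoding` accordingly. Also, if you want sentences back rather than whole lines, set `sep='.'`; in that case consider `quote=""` as well, since for any separator other than a newline `scan` treats apostrophes and double quotes as quoting characters by default.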
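Because `sampleText` hands `scan` a file name, each iteration reopens the file and skips forward from the top. For large files, one possible alternative (a sketch, not the author's method) is to open a single connection and walk through the file once in chunks, keeping only the sampled lines. The name `sampleTextOnePass` and the `chunkSize` parameter are assumptions made for this example.

```r
# Hypothetical single-pass variant (not from the original post).
# Note: lines come back in file order, not in the order they were drawn.
sampleTextOnePass <- function(filename, total, sampleSize, chunkSize = 10000L) {
  wanted <- sort(sample(total, sampleSize))  # sampled line numbers, ascending
  con <- file(filename, open = "r", encoding = "UTF-8")
  on.exit(close(con))
  picked <- character(0)
  offset <- 0L  # number of lines consumed so far
  while (length(picked) < sampleSize) {
    chunk <- readLines(con, n = chunkSize, warn = FALSE)
    if (length(chunk) == 0L) break  # file ended before all lines were found
    # Positions inside this chunk that correspond to sampled line numbers
    idx <- wanted[wanted > offset & wanted <= offset + length(chunk)] - offset
    picked <- c(picked, chunk[idx])
    offset <- offset + length(chunk)
  }
  picked
}

# Demo on a throwaway file of 100 numbered lines
demoFile <- tempfile(fileext = ".txt")
writeLines(sprintf("line %d", 1:100), demoFile)
set.seed(1)
picked <- sampleTextOnePass(demoFile, total = 100, sampleSize = 10, chunkSize = 30L)
```

If the random order matters, the result can be shuffled afterwards with `sample(picked)`.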

Hope you find this function useful!
