Random Sampling of Plain Text in R

July 11, 2016

(This article was first published on #untitled, and kindly contributed to R-bloggers)

If you are mining a text source or doing some language modelling, you often need to sample from a text corpus. I was wondering how to write a function that randomly selects chunks of a text file and does the job quickly. I wanted to keep the function as simple as possible, using only base R functions.


I strongly encourage you to use the scan function, which is part of base R:

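sampleText <- function(filename, total, sampleSize) {
  # Draw sampleSize distinct line numbers from 1..total
  lineNumbers <- sample(total, sampleSize)
  result <- character(0)
  for (line in lineNumbers) {
    # Skip the line - 1 preceding lines, then read exactly one line
    chunk <- scan(filename, what = "character", skip = line - 1,
                  nlines = 1, sep = "\n", strip.white = TRUE,
                  quiet = TRUE, fileEncoding = "UTF-8")
    result <- c(result, chunk)
  }
  result
}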

And here’s how to use it:

sample <- sampleText('path_to_file.txt', 2000, 10)

Inspecting result:

> class(sample)
[1] "character"
> sample
 [1] "It's a cloudy day"
 [2] "But I did want her to have at least PART of my imaginary Paris experience, so I used her pretty Paris stamp to make her birthday card."
[10] "Our smartest friend (Zachary)- was nice enough to study the injection and give us some information to share with everyone..."
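
The total argument has to be the number of lines in the file, which the post takes as given (2000 in the call above). If you do not know it in advance, one possible way to get it is to count lines in chunks. This is only a sketch: countLines and its chunkSize parameter are hypothetical names that are not part of the original function.

```r
# Hypothetical helper (not from the original post): count the lines of a
# file by reading it in fixed-size chunks, so the whole file is never held
# in memory at once.
countLines <- function(filename, chunkSize = 10000L) {
  con <- file(filename, open = "r", encoding = "UTF-8")
  on.exit(close(con))
  total <- 0L
  repeat {
    chunk <- readLines(con, n = chunkSize, warn = FALSE)
    if (length(chunk) == 0L) break   # end of file reached
    total <- total + length(chunk)
  }
  total
}

# Self-contained demonstration on a temporary three-line file
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), tmp)
total <- countLines(tmp)
```

The resulting value can then be passed as the second argument of the sampling function instead of a hard-coded number.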

NOTE: The function is fast because scan reads directly from the file and keeps only the requested line, so the whole corpus is never loaded into memory. If your file uses a different encoding, change the fileEncoding value accordingly. Also, if you want sentences back rather than whole lines, set sep='.'.
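
To illustrate the last point, here is a small sketch of what scan does to a single line when the separator is a period. The temporary file and its contents are made up for the demonstration, and quote = "" is my own addition (an assumption, not something the original function sets) so that apostrophes in the text are not treated as quoting characters.

```r
# Made-up two-line file for the demonstration
tmp <- tempfile(fileext = ".txt")
writeLines(c("It is a cloudy day. We stayed in", "Second line here"), tmp)

# Read only the first line and split it into fields at each period;
# strip.white trims the blank left after each separator.
sentences <- scan(tmp, what = "character", skip = 0, nlines = 1, sep = ".",
                  quote = "", strip.white = TRUE, quiet = TRUE,
                  fileEncoding = "UTF-8")
print(sentences)
```

Note that nlines = 1 still limits the read to one line of the file, so with sep='.' each sampled line can come back as several sentence fragments rather than a single string.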

Hope you find this function useful! 😀
