How to Create Cross-Validation Partitions with Plain Text in R?

July 11, 2016

(This article was first published on #untitled, and kindly contributed to R-bloggers)

In my last blog post, I talked about Random Sampling for Plain Text in R. When you perform supervised learning and follow the cross-validation principle of dividing the data into train, holdout and test sets, you want to make sure the subsets do not overlap. The great caret package has some neat partitioning techniques; however, for my use I wanted a minimal function that does not need any extra libraries.

Solution
createPartition <- function(filename, trainPortion = 0.6, holdOutPortion = 0.2) {
  # The test set takes whatever is left, so the two portions cannot exceed 1
  stopifnot(trainPortion + holdOutPortion <= 1)

  # Read the file once; each line is one record
  corpus <- readLines(filename, encoding = "UTF-8")
  total <- length(corpus)

  # Partition sizes (avoid naming a variable `sample`, which masks base::sample)
  nTrain <- floor(total * trainPortion)
  nHoldout <- floor(total * holdOutPortion)

  # seq_len() avoids the 1:0 pitfall when a partition is empty
  train <- corpus[seq_len(nTrain)]
  holdout <- corpus[nTrain + seq_len(nHoldout)]
  test <- corpus[nTrain + nHoldout + seq_len(total - nTrain - nHoldout)]

  list(train = train, holdout = holdout, test = test)
}

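The only argument you need to pass is the path to the file. The defaults follow the 60/20/20 rule suggested by the great Andrew Ng. Note that the partitions are taken in file order, so if your file is sorted in some way, shuffle the lines first.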
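To make the split sizes concrete, here is a small worked example of the arithmetic. The line count of 11 is a made-up value, picked so that the portions do not divide evenly. The point is that rounding down the train and holdout sizes pushes any leftover lines into the test set.

```r
# Hypothetical line count, chosen so the portions do not divide evenly
total <- 11

nTrain   <- floor(total * 0.6)          # floor(6.6) = 6
nHoldout <- floor(total * 0.2)          # floor(2.2) = 2
nTest    <- total - nTrain - nHoldout   # the remaining 3 lines

c(train = nTrain, holdout = nHoldout, test = nTest)
```

So the test set can end up slightly larger than its nominal 20% on small files.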

Here’s how to use it:

data <- createPartition("path_to_file.format")  
data$train  
data$holdout  
data$test  
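Since createPartition slices the lines in file order, you may want to shuffle them first when the file is sorted. The following is only a sketch of one way to do that, not part of the function above: the temporary file, the seed value and the variable names are all assumptions made for illustration.

```r
# Write a small throwaway file so the example is self-contained
path <- tempfile(fileext = ".txt")
writeLines(paste("line", 1:10), path)

set.seed(42)  # arbitrary seed, only so the shuffle is reproducible
lines <- readLines(path, encoding = "UTF-8")
shuffled <- sample(lines)  # random permutation of all lines

# Same 60/20/20 sizing as above; the test set gets the remainder
nTrain <- floor(length(shuffled) * 0.6)
nHoldout <- floor(length(shuffled) * 0.2)
nTest <- length(shuffled) - nTrain - nHoldout

parts <- list(
  train   = shuffled[seq_len(nTrain)],
  holdout = shuffled[nTrain + seq_len(nHoldout)],
  test    = tail(shuffled, nTest)
)

lengths(parts)  # number of lines in each partition
unlink(path)    # remove the temporary file
```

Every line lands in exactly one partition, so the subsets stay disjoint while no longer depending on the order of the file.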

I hope this helped! 😀😀😀
