Read Random Rows from A Huge CSV File

[This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Given R data frames stored in the memory, sometimes it is beneficial to sample and examine the data in a large-size csv file before importing into the data frame. To the best of my knowledge, there is no off-shelf R function performing such data sampling with a relatively low computing cost. Therefore, I drafted two utility functions serving this particular purpose, one with the LaF library and the other with the reticulate library by leveraging the power of Python. While the first function is more efficient and samples 3 records out of 336,776 in about 100 milliseconds, the second one is more for fun and a showcase of the reticulate package.


library(LaF)

sample1 <- function(file, n) {
  lf <- laf_open(detect_dm_csv(file, sep = ",", header = TRUE, factor_fraction = -1))
  return(read_lines(lf, sample(1:nrow(lf), n)))
}

sample1("Downloads/nycflights.csv", 3)
#   year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight
# 1 2013     9  15     1323        -6     1506       -23      MQ  N857MQ   3340
# 2 2013     3  18     1657        -4     2019         9      UA  N35271     80
# 3 2013     6   7     1325        -4     1515       -11      9E  N8477R   3867
#   origin dest air_time distance hour minute
# 1    LGA  DTW       82      502   13     23
# 2    EWR  MIA      157     1085   16     57
# 3    EWR  CVG       91      569   13     25

library(reticulate)

sample2 <- function(file, n) {
  rows <- py_eval(paste("sum(1 for line in open('", file, "'))", sep = '')) - 1
  return(import("pandas")$read_csv(file, skiprows = setdiff(1:rows, sample(1:rows, n))))
}

sample2("Downloads/nycflights.csv", 3)
#   year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight
# 1 2013    10   9      812        12     1010       -16      9E  N902XJ   3507
# 2 2013     4  30     1218       -10     1407       -30      EV  N18557   4091
# 3 2013     8  25     1111        -4     1238       -27      MQ  N721MQ   3281
#   origin dest air_time distance hour minute
# 1    JFK  MSY      156     1182    8     12
# 2    EWR  IND       92      645   12     18
# 3    LGA  CMH       66      479   11     11

To leave a comment for the author, please follow the link and comment on their blog: S+/R – Yet Another Blog in Statistical Computing.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)