Splitting a Large CSV File into Separate Smaller Files Based on Values Within a Specific Column

April 3, 2013

(This article was first published on OUseful.Info, the blog... » Rstats, and kindly contributed to R-bloggers)

One of the problems with working with data files containing tens of thousands (or more) rows is that they can become unwieldy, if not impossible, to use with “everyday” desktop tools. When I was Revisiting MPs’ Expenses, the expenses data I downloaded from IPSA (the Independent Parliamentary Standards Authority) came in one large CSV file per year containing expense items for all the sitting MPs.

In many cases, however, we might want to look at the expenses for a specific MP. So how can we easily split the large data file containing expense items for all the MPs into separate files containing expense items for each individual MP? Here’s one way using a handy little R script in RStudio

Load the full expenses data CSV file into RStudio (for example, calling the dataframe it is loaded into mpExpenses2012. Previewing it we see there is a column MP.s.Name identifying which MP each expense claim line item corresponds to:


We can easily pull out the unique values of the MP names using the levels command, and then for each name take a subset of the data containing just the items related to that MP and print it out to a new CSV file in a pre-existing directory:

mpExpenses2012 = read.csv("~/Downloads/DataDownload_2012.csv")
#mpExpenses2012 is the large dataframe containing data for each MP
#Get the list of unique MP names
for (name in levels(mpExpenses2012$MP.s.Name)){
  #Subset the data by MP
  #Create a new filename for each MP - the folder 'mpExpenses2012' should already exist
  fn=paste('mpExpenses2012/',gsub(' ','',name),sep='')
  #Save the CSV file containing separate expenses data for each MP


This technique can be used to split any CSV file into multiple CSV files based on the unique values contained within a particular, specified column.

To leave a comment for the author, please follow the link and comment on their blog: OUseful.Info, the blog... » Rstats.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.


Mango solutions

RStudio homepage

Zero Inflated Models and Generalized Linear Mixed Models with R

Quantide: statistical consulting and training



CRC R books series

Contact us if you wish to help support R-bloggers, and place your banner here.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)