Get Better: loading multiple csv files in R
In a previous post, I described how to run a session to teach R to cell biologists. In this post we’ll look in a bit more detail at one of the steps: how to load data into R.
As a reminder, a typical analysis task in cell biology follows these steps:
- do the experiment(s)
- collect the data – e.g. microscopy images
- analyse and get a plain text (csv) output – e.g. using Fiji
- load the data into R
- crunch some numbers and plot
Loading data into R
In the test dataset from the previous post we have a single folder of 80 csv files: 4 experiments, 2 conditions, and 10 files for each combination. Each file has 20 rows of data. See below for other scenarios.
The goal is to load all of the data into R and assemble it into a single data frame. There are many ways to do this; I'll show three of the most popular:
- base R
- tidyverse
- data.table
# Demonstrate three ways to load all CSV files in Data:
# 1) base R
# 2) tidyverse (readr + dplyr + purrr)
# 3) data.table
data_dir <- "Data"
csv_files <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)
# --- 1) base R ---------------------------------------------------------------
base_list <- lapply(csv_files, function(path) {
df <- read.csv(path)
df$source_file <- basename(path)
df
})
base_all <- do.call(rbind, base_list)
# --- 2) tidyverse ------------------------------------------------------------
if (!requireNamespace("readr", quietly = TRUE) ||
!requireNamespace("dplyr", quietly = TRUE) ||
!requireNamespace("purrr", quietly = TRUE)) {
stop("Please install tidyverse components: readr, dplyr, purrr")
}
tidy_all <- purrr::map_dfr(
csv_files,
~ readr::read_csv(.x, show_col_types = FALSE) |>
dplyr::mutate(source_file = basename(.x))
)
# --- 3) data.table -----------------------------------------------------------
if (!requireNamespace("data.table", quietly = TRUE)) {
stop("Please install data.table")
}
dt_list <- lapply(csv_files, function(path) {
dt <- data.table::fread(path)
dt[, source_file := basename(path)]
dt
})
dt_all <- data.table::rbindlist(dt_list, use.names = TRUE, fill = TRUE)
# Quick checks
cat("Files loaded:", length(csv_files), "\n")
cat("Rows (base):", nrow(base_all), "\n")
cat("Rows (tidyverse):", nrow(tidy_all), "\n")
cat("Rows (data.table):", nrow(dt_all), "\n")
In each case we list the files and then use this list to load each item and assemble into a single large data frame.
We need to know which rows of the large data frame came from which file; this is essential if there is no identifier within the data. So in each case, after loading, we add the name of the file as a new column called source_file before assembling the modified data frames into the single large data frame.
Here is the output from the last part of the code:
> # Quick checks
> cat("Files loaded:", length(csv_files), "\n")
Files loaded: 80
> cat("Rows (base):", nrow(base_all), "\n")
Rows (base): 1600
> cat("Rows (tidyverse):", nrow(tidy_all), "\n")
Rows (tidyverse): 1600
> cat("Rows (data.table):", nrow(dt_all), "\n")
Rows (data.table): 1600
>
The result is equivalent with all three approaches (the tidyverse and data.table versions return a tibble and a data.table respectively, but the contents are the same).
My preferred strategy is to use base R for tasks like this. Generally, it is best to stick to base R rather than relying on libraries. In terms of speed, {data.table} is renowned for being fast. If your data is massive, it is worth using {data.table}. However, for a set of 80 small files, speed is not a concern, and base R performs very well. It could be that you prefer the tidyverse syntax and find it easier to understand, in which case go for it. Otherwise my advice is stick to base R.
Note that, with each of these three approaches, there are several different ways to achieve the same thing. I am only presenting one. For example, with base R you may see examples where a for-loop is used to achieve the same thing as lapply. This approach is slower than the one shown here although it is arguably more readable.
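For illustration, a for-loop version of the base R approach might look like the sketch below. Growing the result with rbind() on every iteration is what makes this slower than the lapply() version shown above.

```r
# For-loop equivalent of the lapply() approach.
# rbind() copies the accumulated data frame on each pass, hence the slowdown.
base_all <- data.frame()
for (path in csv_files) {
  df <- read.csv(path)
  df$source_file <- basename(path)
  base_all <- rbind(base_all, df)
}
```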
We’ll keep going with base R as we look at a few alternative scenarios that you are likely to encounter.
Alternative scenarios
The base case (in more detail)
We had a flat directory (Data/) of 80 files, where the filenames encoded the experimental details. Therefore we used this base R code to load the data in.
data_dir <- "Data"
csv_files <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)
base_list <- lapply(csv_files, function(path) {
df <- read.csv(path)
df$source_file <- basename(path)
df
})
base_all <- do.call(rbind, base_list)
Following this, we need to use the source_file column to add additional columns that signify the experimental details. Our files are called things like control_n2_9.csv or rapa_n3_2.csv – in other words they are of the form condition_experiment_x.csv. The underscores can be used to split the filename with strsplit() and we can then take the 1st or 2nd element of the result and store them in new columns.
# source_file column has name of the file, name is of the form foo_bar_1.csv
# extract foo and bar into two columns
base_all$cond <- sapply(strsplit(base_all$source_file, "_"), "[", 1)
base_all$expt <- sapply(strsplit(base_all$source_file, "_"), "[", 2)
As I said above, there’s always other approaches and here it could be that we do these steps inside the original lapply() call, i.e. before we have assembled the large data frame.
base_list <- lapply(csv_files, function(path) {
df <- read.csv(path)
df$source_file <- basename(path)
df$cond <- sapply(strsplit(df$source_file, "_"), "[", 1)
df$expt <- sapply(strsplit(df$source_file, "_"), "[", 2)
df
})
base_all <- do.call(rbind, base_list)
However, if you leave the column wrangling until you have assembled the large data frame the loading part of the code is more likely to be reusable.
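To make that idea concrete, one way to keep the loading part reusable (a sketch, not code from the original post; load_csv_dir is a hypothetical name) is to wrap it in a function and leave the filename wrangling as a separate step:

```r
# Hypothetical helper: load every csv in a directory into one data frame,
# recording the source filename. Wrangling of source_file happens elsewhere,
# so the same function works for any naming scheme.
load_csv_dir <- function(data_dir) {
  csv_files <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)
  df_list <- lapply(csv_files, function(path) {
    df <- read.csv(path)
    df$source_file <- basename(path)
    df
  })
  do.call(rbind, df_list)
}

base_all <- load_csv_dir("Data")
```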
Let’s look at a few other scenarios.
Only a subset of columns are required
If the files have many columns, you may only require a subset of them in your data frame. More rarely, the csv files may have differing numbers of columns. In the latter case, it isn't possible to use the code above because rbind() needs the same columns in every data frame to assemble the large data frame.
The solution in both of these cases is to specify which columns to load. We can do:
> head(read.csv(csv_files[1]))
  X Area     Mean    StdDev Min Max IntDen RawIntDen
1 1 1248 48.83477  8.864353   0 255  60945     60945
2 2 1248 52.46805 10.564050   0 255  65480     65480
3 3 1248 72.14579  8.947991   0 255  90037     90037
4 4 1248 55.77559  9.542218   0 255  69607     69607
5 5 1248 56.42217  9.749921   0 255  70414     70414
6 6 1248 73.86571  7.626613   0 255  92184     92184
This shows the first part (head()) of the first file. Let's say we only want the Area and Mean columns (plus source_file). We can then do:
data_dir <- "Data"
csv_files <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)
my_columns <- c("Area", "Mean")
base_list <- lapply(csv_files, function(path) {
df <- read.csv(path)
df <- df[, my_columns, drop = FALSE] # drop = FALSE keeps a data frame even if one column is selected
df$source_file <- basename(path)
df
})
base_all <- do.call(rbind, base_list)
From here, we can assemble the cond and expt columns as shown above.
So far, all of the information required is encoded in the filename. If this isn’t the case, it is better at this stage to alter the script that generated the csvs so that the necessary information can be read from the filename or alternatively, from the filepath.
Identical (non-unique) filenames in different folders
In the example above, the condition, the experiment and a differentiator were encoded in the filename. It could be that the csv files are organised like this:
- Data/
  - Control/
    - cell1.csv
    - cell2.csv
  - Drug/
    - cell1.csv
    - cell2.csv
    - cell3.csv
or
- Data/
  - Expt1/
    - Control/
      - cell1.csv
      - cell2.csv
    - Drug/
      - cell1.csv
      - cell2.csv
  - Expt2/
    - Control/
      - cell1.csv
      - cell2.csv
      - cell3.csv
    - Drug/
      - cell1.csv
or any other combination. The point being that the filename is no longer unique. There are several files with the same differentiator, and to access the condition or the experiment, we need to manipulate the filepath rather than the filename.
data_dir <- "Data"
csv_files <- list.files(data_dir,
pattern = "\\.csv$",
full.names = TRUE, # get the full path (folder names)
recursive = TRUE) # ensures we look in subfolders of data_dir
base_list <- lapply(csv_files, function(path) {
df <- read.csv(path)
df$source_path <- path
df
})
base_all <- do.call(rbind, base_list)
This code block will deal with a bunch of subfolders within data_dir and assemble the large data frame. This time, we make a column called source_path and store the full path of each file.
So, a file called cell2.csv in Expt1/Drug/ within the Data/ folder of the project will have a source_path of Data/Expt1/Drug/cell2.csv.
We just need to wrangle this path to extract the condition and experiment information.
base_all$expt <- sapply(strsplit(base_all$source_path, .Platform$file.sep, fixed = TRUE), "[", 2) # "experiment" folder
base_all$cond <- sapply(strsplit(base_all$source_path, .Platform$file.sep, fixed = TRUE), "[", 3) # folder enclosing file
base_all$source_file <- basename(base_all$source_path) # get the filename (differentiator)
This wrangling step will need to be adjusted to your needs. We use .Platform$file.sep rather than "/" or "\" because the file separator differs on Windows.
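An alternative that sidesteps the separator question altogether is to peel folders off the path with basename() and dirname(), which handle the platform separator for you. A sketch, using the Data/Expt1/Drug/cell2.csv example from above:

```r
# basename() gives the last path component, dirname() strips it off,
# so repeated calls walk up the folder hierarchy. Both are vectorized.
base_all$source_file <- basename(base_all$source_path)             # cell2.csv
base_all$cond <- basename(dirname(base_all$source_path))           # Drug
base_all$expt <- basename(dirname(dirname(base_all$source_path)))  # Expt1
```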
Exercises
Here are three scenarios you might encounter. How would you solve them?
- The files are in Data/ and are organised into three condition folders; inside each are four experiment subfolders, each with 10 csv files.
- The files are in Data/ and are organised into four experiment folders; inside each are 20 csv files. They are named like this: microscopy-analysis_Control_IF488_cell3.csv and microscopy-analysis_Drug_IF488_cell2.csv
- The files are located at ~/Desktop/ and are organised into 12 folders called Control-Expt1 or DRUG1-Expt2 (there are three conditions and four experiments). The files inside are called cell1.csv etc.
- The files are in a single folder, using experiment_condition_differentiator labelling, however the user has been inconsistent. Sometimes the Control is called Control, sometimes Ctrl or ctrl.
Answers
Click to reveal.
Problem 1
data_dir <- "Data"
csv_files <- list.files(data_dir,
pattern = "\\.csv$",
full.names = TRUE, # get the full path (folder names)
recursive = TRUE) # ensures we look in subfolders of data_dir
base_list <- lapply(csv_files, function(path) {
df <- read.csv(path)
df$source_path <- path
df
})
base_all <- do.call(rbind, base_list)
base_all$cond <- sapply(strsplit(base_all$source_path, .Platform$file.sep, fixed = TRUE), "[", 2)
base_all$expt <- sapply(strsplit(base_all$source_path, .Platform$file.sep, fixed = TRUE), "[", 3)
base_all$source_file <- basename(base_all$source_path)
Problem 2
data_dir <- "Data"
csv_files <- list.files(data_dir,
pattern = "\\.csv$",
full.names = TRUE, # get the full path (folder names)
recursive = TRUE) # ensures we look in subfolders of data_dir
base_list <- lapply(csv_files, function(path) {
df <- read.csv(path)
df$source_path <- path
df
})
base_all <- do.call(rbind, base_list)
base_all$expt <- sapply(strsplit(base_all$source_path, .Platform$file.sep, fixed = TRUE), "[", 2)
base_all$source_file <- basename(base_all$source_path)
base_all$cond <- sapply(strsplit(base_all$source_path, "_", fixed = TRUE), "[", 2)
Problem 3
# first relocate the files to "Data/" in the project folder and then...
data_dir <- "Data"
csv_files <- list.files(data_dir,
pattern = "\\.csv$",
full.names = TRUE, # get the full path (folder names)
recursive = TRUE) # ensures we look in subfolders of data_dir
base_list <- lapply(csv_files, function(path) {
df <- read.csv(path)
df$source_path <- path
df
})
base_all <- do.call(rbind, base_list)
base_all$condexpt <- sapply(strsplit(base_all$source_path, .Platform$file.sep, fixed = TRUE), "[", 2)
base_all$source_file <- basename(base_all$source_path)
base_all$cond <- sapply(strsplit(base_all$condexpt, "-", fixed = TRUE), "[", 1)
base_all$expt <- sapply(strsplit(base_all$condexpt, "-", fixed = TRUE), "[", 2)
Problem 4
There are a number of ways to deal with this problem. If the labelling is very inconsistent, it is best to rerun the analysis (which generated the csv files) in a way that gives consistent labelling. If this is not possible, e.g. the original filenames themselves are inconsistently named, then you can extract the condition and experiment labels as before and rename them. Given the mix of upper and lower case, it's advisable to run tolower() first, figure out which entries belong to which group, and then assign them accordingly. Another approach is to make a lookup table showing how the entries should be renamed and use that to rename the labels. The bonus of dealing with a problem like this is that once you have been through the pain, it will make you more consistent when naming things in the future!
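As a sketch of the lookup approach (the label values here are assumptions based on the problem description):

```r
# Map each inconsistent label to a canonical one via a named vector.
# Lowercasing first means one entry covers Ctrl and ctrl alike.
fix_cond <- c(control = "Control", ctrl = "Control", drug = "Drug")
base_all$cond <- unname(fix_cond[tolower(base_all$cond)])
```

Any label missing from the lookup comes back as NA, which is a useful check that you have caught every variant.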
—
The post title comes from Get Better by The New Fast Automatic Daffodils.
Part of a series on development of lab members’ skills.