Using R: reading tables that need a little cleaning

[This article was first published on There is grandeur in this view of life » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Sometimes one needs to read tables that are a bit messy, so that read.table doesn’t immediately recognize the content as numerical. Maybe some weird characters are sprinkled in the table (ever been given a table with significance stars in otherwise numerical columns?). Some search and replace is needed. You can do this by hand, and I know this is a task for which many people would turn to a one-liner of perl or awk, but I like to do my scripting in R.

Again, this is in the ”trivial of you only think it out” category of problems. But it gives a starting point for more subtle changes to text files that one might have to do from time to time.

Assume the data look something like this (I made this up):

id;col1;col2
id1; 0,2;0.55*
id2; -,1;,23

Thankfully, R’s conversion from numbers to text is pretty good. It will understand, for instance, that both .5 and 0.5 are numbers. If the problem was just an odd decimal separator, like the Swedish decimal comma, we could just change the dec parameter to read.table. In the above, we have mixed decimal separators. Let’s start out by reading the table as character.

character.data <- read.table("some_data.csv", sep=";",
                   colClasses="character", header=T)
row.names(character.data) <- character.data[,1]
character.data <- character.data[,-1]

This also stores away the sample ids in row names, and removes the first column with the ids. This is optional, but for this example I assume that all columns should be numeric. If that is not the case, the code below can of course be restricted to the numeric columns, and the character (or factor) columns can be added later.

A data.frame is pretty much a list of vectors, so we use plyr to apply over the list and stringr to search and replace in the vectors. After removing characters that aren’t numbers, decimal separators, or the minus sign, we change the decimal separator, and convert the vector to numeric. Finally, we make the list a data.frame, and propagate the row names.

library(stringr)
library(plyr)

data <- data.frame(llply(character.data, function(x) {
  x <- str_replace_all(x, pattern="[^0-9\\.\\,-]", replacement="")
  x <- str_replace(x, pattern="\\,", replacement=".")
  return(as.numeric(x))
}))
row.names(data) <- row.names(character.data)

Postat i:computer stuff Tagged: R, read.table

To leave a comment for the author, please follow the link and comment on their blog: There is grandeur in this view of life » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)