I’ve written before about importing tabular text files into R, and here comes some more. This is because I believe (firmly) that importing data is the major challenge for beginners who want to analyse their data in R. What is the most important thing about using any statistics software? To get your data into it in the first place! Unfortunately, no two datasets are the same and many frustrations await the beginner. Therefore, let me share a few things that I’ve picked up while trying to read data into R.
General tip: keep it as simple as possible. Open your favourite spreadsheet software, remove all the colours and formatting, make sure to have only one header row with short and simple column names (avoid any special characters; R will turn them to full stops), remove empty columns above and to the left of the table and export the sheet in to plain text, tab separated or comma separated. Then use either read.table, read.csv or the Import dataset button in RStudio to read your table, and in case of doubt, begin with the default settings, which are often sensible. If you see any annoying data import errors, below are a few possible causes.
If you don’t get the number of columns you expect
Then it’s often a separator problem. read.table allows you to set the filed separator character (with sep=) and the decimal separator (with dec=). To get a tab character, use ”\t”:
data <- read.table(file="data.txt", sep="\t", dec=",")
In many countries this is not an issue, but the Swedish standard is using a comma as decimal separator, while R uses a decimal point. If you export a comma separated value file on a Swedish computer, you are likely to get a numbers with decimal commas and semicolons as field separators. Then what you want is read.csv2( ), which is a read.csv made for semicolon separated files with decimal commas.
As an additional benefit of using it, RStudio solves those issues for you with the Import dataset button. The dialog box that appears when you click it lets you choose separators, header line (yes/no) and whether there are quotes around fields, and shows a preview of what the table will look like. Then it builds the read.table command for you, so that you can copy it into your script. If you’re not an RStudio user, just look at the actual file in a text editor, and add dec= and sep= instructions to your read.table as needed.
If you see a lot more columns than you expect, usually after the ones you expect, filled with NAs, it’s probably because you’ve happened to enter some character (very likely a whitespace …) somewhere to the right in the spreadsheet. When exported, the software will fill in all the empty cells, which R will interpret as missing values. If this happens, make sure to delete the contents of the spreadsheet after the columns you’re interested in.
Whitespaces can also be a problem if you’ve specified a custom separator, like above. Normally, read.table is very good about skipping over whitespace, should you have happened to put an extra whitespace in before that factor level. That functionality is turned off when you specify a separator, though, so you might need to switch it on again by setting strip.white=T in the function call. So, for instance, ” male” and ”female ” are interpreted as male/female, which is probably what one wants. This cost me several grey hairs and string operations before I came to my senses and read the data import manual.
If columns are not the type you expect
If something is a character when it should be numeric you might see messages such as ”‘x’ must be numeric ” or ”non-numeric argument to binary operator”. If something is a factor when it should be character, some character operations might fail. Regardless, it could still be a separator problem. If R expects to see decimal points and sees a comma, it will make that column a character vector rather then a numeric column. The principle is that read.table it will try to turn the column into numeric or integer (or complex or logical, but they are rarer). Most often, this means that if you feed it numbers it will make them numeric. But if there is anything else in any of the elements, it will assume character — and by default it will convert character vector into factors.
There are several reasons why you might have text among the numbers, making R not interpret the column as numeric. We’ve mentioned wrong separators, but sometimes if you paste spreadsheets from different sources together you end up with inconsistent field separators. I’ve written a post about a way to deal with that in some cases, but you can also search and replace the file with a text editor, or use a command-line tool like sed. Missing values can be another problem. read.table understands that empty fields are missing, but maybe you or your collaborator has used another character to mean missing, like ”-” or ”?”. This is dealt with by specifying your own na.strings:
data <- read.table("data.txt", na.strings=c("NA", "-", "?"))
R’s default of interpreting characters as factors is sometimes a good choice, but often what you really want is characters that you can work with and then, if needed, turn into factors. In particular, if your column of sample ids is turned into a factor, comparing and matching ids will sometimes hurt you. To avoid that you can specify stringsAsFactors=F. This works when you make data frames with the data.frame function as well.
data <- read.table("data.txt", stringsAsFactors=F)
Diagnosing problems with your data frame
Look at the dimensions of your data frame and the column of the classes. dim gives dimensions; the class function gives the type of a column and the str function will give you a summary of the structure of the object.
dim(data) lapply(data, class) str(data)