Fixing aberrant files using R and the shell: a case study

[This article was first published on What You're Doing Is Rather Desperate » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Once in a while, you embark on what looks like a simple computational procedure only to encounter frustration very early on. “I can’t even read my file into R!” you cry.

Step back, take a deep breath and take note of what the software is trying to tell you. Most times, you’ve just missed something very straightforward. Here’s an example.

Recently, I was trying to retrieve some data describing characteristics of microbial genomes from the NCBI FTP site. The file, lproks_0.txt (direct link), looked like a regular tab-delimited file with a couple of header lines:

head lproks_0.txt
## Microbial Organism Information Page
## Columns:	"RefSeq project ID"	"Project ID"	"Taxonomy ID"	"Organism Name"	"Super Kingdom"	"Group"	"Sequence Status"	"Genome Size"	"GC Content"	"Gram Stain"	"Shape"	"Arrangment"	"Endospores"	"Motility"	"Salinity"	"Oxygen Req"	"Habitat"	"Temp. range"	"Optimal temp."	"Pathogenic in"	"Disease"	"Genbank accessions"	"Refseq accessions"
49725	30807	551115	'Nostoc azollae' 0708	Bacteria	Cyanobacteria	complete	5.532	38.3		Filaments	Filaments		Yes		Aerobic	Multiple	Mesophilic	-			CP002059,CP002060,CP002061	-
55729	33011	592010	Abiotrophia defectiva ATCC 49176	Bacteria	Firmicutes	assembly	3.4774	37.1	+	Cocci		NoNo		Facultative			-			ACIN00000000	-

Sharp eyes will notice a problem right there, on the first line of data. Less sharp-eyed users like me will open an R console to read the file, expecting no issues:

genomes <- read.table("lproks_0.txt", header = T, skip = 2, stringsAsFactors = F, sep = "\t")
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  line 376 did not have 23 elements

“I can’t even read my file into R!”

My first mistake: reach for the tool that I know best, not the tool which is appropriate. In this case, my instinct was to count the fields in a line using awk. Since we skipped the first 2 lines in R, we want to examine line 378 in the original file:

awk '{FS="\t"; print NF}' lproks_0.txt | head -n 378 | nl | tail -n1

   378  23

Sure looks like line 378 has 23 elements to me. Replacing “NF” with “$0″, we can examine the line content:

awk '{FS="\t"; print $0}' lproks_0.txt | head -n 378 tail -n1

49969	40633	376219	Arthrospira sp. PCC 8005	Bacteria	Cyanobacteria	unfinished	6.14555	44.7		Spiral			Yes        Facultative	Aquatic	Mesophilic	-			-	-

Looks fine. What is the problem?

As it happens, R comes with several useful functions to examine the structure of files. One of these is count.fields(), which does what it says on the tin. We can combine it with table() to sum the field count for each line:

> table(count.fields("lproks_0.txt", skip = 2, sep = "\t"))

   4    6   21   23
  15    1   15 6753

There we are: indeed, R is seeing lines that do not have 23 fields.

Which lines are causing problems? We can find the indices of the problem lines like this:

> which(count.fields("lproks_0.txt", skip = 2, sep = "\t") != 23)
 [1]  377  684  685  830  831  835  836  837  838  839  840  841  842  916  918
[16] 1995 2709 2710 3248 3254 3255 3256 3257 3262 4086 4105 4679 6125 6293 6294
[31] 6556

Now we can go back to the original file and examine some of the problem lines, remembering to add 2 to the line number:

awk '{FS="\t"; print $0}' lproks_0.txt | head -n 379 | tail -n1

58297	13478	322098	Aster yellows witches'-broom phytoplasma AYWB	Bacteria	Firmicutes	complete	0.727401	26.8		Sphere					Aerobic	Host-associated	Mesophilic	-	Plant	Aster yellows witches'-broom	CP000061,CP000063,CP000064,CP000065,CP000062	-

awk '{FS="\t"; print $0}' lproks_0.txt | head -n 844 | tail -n1

	47787	535329	Brucella abortus bv. 1 str. 8-953#1146	Bacteria	Alphaproteobacteria	unfinished			-	Rod		--	-

Problem solved. Some of the lines contain a single-quote character; read.table() thinks that this indicates a quoted field. Others contain the “#” symbol; read.table() interprets this as a comment character. No wonder that this file could not be read correctly.

How to fix the file? Ideally, you would find a way to read it “as is”. Alternatively, if altering the file contents is not an issue, we can bring sed into play:

# backup file and remove single quotes
cp lproks_0.txt lproks_0.txt.orig
sed -i "s/'//g" lproks_0.txt
# replace # with -
sed -i "s/#/-/g" lproks_0.txt

And back to the beginning:

genomes <- read.table("lproks_0.txt", header = T, skip = 2, stringsAsFactors = F, sep = "\t")
dim(genomes)
[1] 6783   23

All of which illustrates that 80% or more of my time spent on data analysis is spent on simply getting the data into a state that can be analysed.


Filed under: computing, programming, R, statistics Tagged: awk, bash, data, sed, shell

To leave a comment for the author, please follow the link and comment on their blog: What You're Doing Is Rather Desperate » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)