Missing values and column types when reading data into R

November 17, 2011

(This article was first published on indiacrunchin » R, and kindly contributed to R-bloggers)

Reading data into R often means controlling the column types and telling R which values should be treated as NA.

Below are code snippets that introduce a few arguments of the read.csv function in R.

# Create sample data: 1000 random strings, 100 placeholder values, 1000 random numbers
strVals <- sapply(1:1000, function(x) paste(sample(letters, sample(5:20, 1)), collapse = ""))
miscVals <- sample(c("", "999", "—-", "MISS"), 100, replace = TRUE)
numVals <- rnorm(1000)

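# Scenario 1: pure numeric and string columns
dataTemp <- data.frame(numericVals = numVals, stringVals = strVals)
write.csv(dataTemp, file = "inputData.csv", quote = FALSE, row.names = FALSE)
inData <- read.csv("inputData.csv", header = TRUE)
sapply(inData, class)
# Col: stringVals is type factor in R versions before 4.0.0, where stringsAsFactors
# defaulted to TRUE; from R 4.0.0 on it is read as character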

# Setting the argument stringsAsFactors = FALSE prevents character columns
# from being converted to factors
inData <- read.csv("inputData.csv", header = TRUE, stringsAsFactors = FALSE)
sapply(inData, class)

# Use the colClasses argument to
# predefine the column types in the input file
inData <- read.csv("inputData.csv", header = TRUE, colClasses = c("numeric", "character"))
sapply(inData, class)
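
colClasses can also be given as a named vector, in which case only the named columns are forced to a type and the rest are still detected automatically. The sketch below is a minimal illustration under assumptions: the inline CSV and its column names (id, score, label) are made up for the example and are not part of the data above.

```r
# Hypothetical inline CSV standing in for a file on disk
csvText <- "id,score,label
001,3.5,a
002,4.0,b
003,2.25,c"

# Default type detection: id is parsed as integer, so the leading zeros are lost
guessed <- read.csv(text = csvText)
class(guessed$id)

# A named colClasses vector overrides only the columns it names
typed <- read.csv(text = csvText, colClasses = c(id = "character"))
sapply(typed, class)
typed$id
```

This can be handy for identifier-like columns (codes, zip codes) where the digits are labels rather than numbers.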

# If you have data values that need to be treated as NA:
# mix values from miscVals ( "","999","—-","MISS" ) into numVals and strVals
numMiscVals <- sample(c(numVals, miscVals), 1000)
strMiscVals <- sample(c(strVals, miscVals), 1000)

dataTemp <- data.frame(numericVals = numMiscVals, stringVals = strMiscVals)
write.csv(dataTemp, file = "inputData.csv", quote = FALSE, row.names = FALSE)
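inData <- read.csv("inputData.csv", header = TRUE, stringsAsFactors = FALSE)
sapply(inData, class)
# Col: numericVals is read as character because of the placeholder values
sum(c("", "999", "—-", "MISS") %in% inData$numericVals) > 0
# should return TRUE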

# Use the na.strings argument
inData <- read.csv("inputData.csv", header = TRUE, stringsAsFactors = FALSE, na.strings = c("", "999", "—-", "MISS"))
sapply(inData, class)
# The columns now have the right types: numericVals is numeric and stringVals is character
sum(c("", "999", "—-", "MISS") %in% inData$numericVals)
# should return 0
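
The effect of na.strings is easier to see on a tiny data set. The following is a minimal sketch, assuming a made-up four-row CSV (the columns value and label are hypothetical) read through the text argument instead of a file.

```r
# Hypothetical inline CSV with placeholder values mixed into both columns
csvText <- "value,label
1.5,abc
999,MISS
,def
MISS,ghi"

# Without na.strings, the "MISS" entry forces the whole value column to character
raw <- read.csv(text = csvText, stringsAsFactors = FALSE)
class(raw$value)

# With na.strings, the placeholders become NA and value is parsed as numeric
clean <- read.csv(text = csvText, stringsAsFactors = FALSE,
                  na.strings = c("", "999", "MISS"))
class(clean$value)

# Count the missing values per column
colSums(is.na(clean))
```

Note that na.strings applies to every column, so a value such as 999 is turned into NA wherever it appears, including columns where it might be a genuine measurement.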

