Organise your data

April 5, 2013

(This article was first published on DataSurg » Tag » R, and kindly contributed to R-bloggers)

Use R to specify factors, recode variables and begin by-group analyses.

Files

This file contains data on pain scores after laparoscopic vs. open hernia repair. Age, gender and primary/recurrent hernia status are also included. The ultimate aim is to work out which of these factors are associated with more pain after the operation.

lap_hernia

Script

##########################
# Organise your data     #
# Ewen Harrison          #
# April 2013             #
# www.datasurg.net       #
##########################

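# stringsAsFactors=TRUE keeps text columns as factors, as this script assumes.
# It was the default when this was written; R 4.0.0 and later default to FALSE.
data<-read.table("lap_hernia.csv", sep=",", header=TRUE, stringsAsFactors=TRUE)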

# This is how to check your data, recode variables and 
# begin to analyse group differences 

str(data)

# First, check that all your grouped (categorical) data are factors;
# here they are not.
# Check that the continuous data are integer or numeric. 

# The data is in a dataframe we have called data. 
# To access variables within that dataframe, use the "$" sign.

data$recurrent
summary(data$recurrent)

# Recurrent is a variable describing whether a hernia is 
# being repaired for the first time or is recurrent. 
# It is a factor, yes/no, and should be specified as such. 

# Change a variable to a factor
data$recurrent<-factor(data$recurrent)

# Check
summary(data$recurrent)

# Do the same for others.
data$laparoscopic<-factor(data$laparoscopic)
summary(data$laparoscopic)

# Check full dataset again and note what has changed
str(data)
summary(data)

data$gender

# This variable has several different representations of the same thing.
# It needs to be recoded.

# Do this by assigning the new value with "<-" 

data$gender[data$gender=="female"]<-"f"
data$gender[data$gender=="fem "]<-"f"
data$gender[data$gender=="m ale"]<-"m"
data$gender[data$gender=="male"]<-"m"

# This is important. R uses "NA" for missing data.
# All missing data should be specified NA.
# This often happens automatically, but hasn't happened in this case.

data$gender[data$gender==""]<-NA

summary(data$gender)
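
# Aside, not part of the original workflow: a sketch of catching blanks at
# import time instead, using the na.strings argument of read.table.
# The inline CSV text and the toy_ names are made up for illustration.
toy_csv<-"id,gender\n1,f\n2,\n3,m"
toy_in<-read.table(text=toy_csv, sep=",", header=TRUE, na.strings=c("NA", ""))
toy_in$gender          # the blank in row 2 is read as NA
sum(is.na(toy_in$gender))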

# Note that all counts are now under the correct levels, 
# "m" and "f".
# Get rid of unused levels by re-defining as a factor:
data$gender<-factor(data$gender)
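
# Aside, not part of the original workflow: a sketch of the same recode done
# with a named lookup vector, so every messy label is mapped in one place.
# The toy values and the toy_/gender_lookup names are made up for illustration.
toy_gender<-c("female", "fem ", "m ale", "male", "", "f", "m")
gender_lookup<-c("female"="f", "fem "="f", "f"="f",
                 "m ale"="m", "male"="m", "m"="m")
# Labels that are not in the lookup (here the blank) come back as NA.
# For a factor column, wrap it in as.character() before the lookup.
toy_clean<-factor(unname(gender_lookup[toy_gender]))
summary(toy_clean)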

# This may all seem like a drag, but when you have had to import
# your data 7 times (as usually happens) because of errors
# that someone else made, just being able to ctrl-R this whole page
# to get back to where you were is amazing, rather than click-click
# which you have to do in SPSS etc. 
#---------------------------------------------------------------
# Summarise data by subgroup

# There are lots of ways of doing this; here are a couple. 

# By
help(by)

# Use "by" followed by the dependent variable you want to summarise,
# then what you want to summarise by,
# then what you want the summary to be.

by(data$pain.score, data$gender, mean)
by(data$pain.score, data$gender, sd)
by(data$pain.score, data$gender, median)
#etc.
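
# Aside, not part of the original workflow: base R's tapply() and aggregate()
# are two more ways to get by-group summaries. A sketch on a made-up toy
# dataframe (toy_pain is hypothetical, not the hernia data):
toy_pain<-data.frame(pain.score=c(2, 4, 6, 1, 3),
                     gender=factor(c("f", "f", "f", "m", "m")))
toy_means<-tapply(toy_pain$pain.score, toy_pain$gender, mean)
toy_means              # one mean per group, named by level
aggregate(pain.score~gender, data=toy_pain, FUN=mean)  # returns a dataframe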

# Compare groups with a graph. Boxplots are great:
# they show the distribution very well. 

boxplot(data$pain.score~data$gender)

# Split
# This is often taught but I don't use it that much. 
# This splits the dataframe into a list containing one dataframe
# per group (two here, one for each gender)

data2<-split(data, data$gender)
str(data2)
summary(data2$f)
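
# Aside, not part of the original workflow: because split() returns a list,
# sapply() can apply the same summary to every piece. A sketch on a made-up
# toy dataframe (toy_scores is hypothetical, not the hernia data):
toy_scores<-data.frame(pain.score=c(5, 7, 2, 4),
                       gender=factor(c("f", "f", "m", "m")))
toy_pieces<-split(toy_scores, toy_scores$gender)
names(toy_pieces)      # one dataframe per level of gender
toy_group_means<-sapply(toy_pieces, function(d) mean(d$pain.score))
toy_group_means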

# Plyr
# This seems intimidating and is. 
# It will be very useful in the future, especially with large datasets
# Try this. 

# install.packages("plyr") #remove "#" first time to install
library(plyr)
help(package=plyr)

# Plyr functions take an array, dataframe or list and can return any of these. 
# Here the "dd" in ddply means take a dataframe and give me one back. 

ddply(data, .(gender), summarise, mean=mean(pain.score), sd=sd(pain.score))

# Please post questions or anything that is not clear.

 
