Cleaning Data and Graphing in R and Python

February 10, 2014

(This article was first published on Climate Change Ecology » R, and kindly contributed to R-bloggers)

Python has some pretty awesome data-manipulation and graphing capabilities. If you’re a heavy R user who dabbles in Python, like me, you might wonder what the equivalent commands are in Python for dataframe manipulation. I was also curious to see how many lines of code it took me to do the same task (load, clean, and graph data) in both R and Python. (I’d like to head off the arguments about efficiency and which language is better right here: neither my R nor my Python code is super-efficient or optimal. It is, however, how I do things, and to me that’s what matters. I’m also not trying to advocate one language over the other (programmers can be a sensitive bunch); I just wanted to post an example showing how to do equivalent tasks in each language.)

First, R

# load the plotting library
library(ggplot2)

# read data
JapBeet_NoChoice <- read.csv("~/Documents/FIU/Research/JapBeetle_Temp_Herbivory/Data/No_Choice_Assays/JapBeet_NoChoice.csv")
# drop incomplete data (rows where Consumption is missing)
feeding <- subset(JapBeet_NoChoice, !is.na(Consumption))
# refactor (drops unused factor levels) and clean (relabel the 33 degree treatment as 35)
feeding$Food_Type <- factor(feeding$Food_Type)
feeding$Temperature[which(feeding$Temperature==33)] <- 35

# subset
plants <- c('Platanus occidentalis', 'Rubus allegheniensis', 'Acer rubrum', 'Viburnum prunifolium', 'Vitis vulpina')
subDat <- feeding[feeding$Food_Type %in% plants, ]

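# make a standard error function for plotting
seFunc <- function(x){
 # standard error: sd divided by the square root of the number of non-missing values
 se <- sd(x) / sqrt(sum(!is.na(x)))
 # lower limit first, so the values line up with the names below
 lims <- c(mean(x) - se, mean(x) + se)
 names(lims) <- c('ymin', 'ymax')
 return(lims)
}
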
# ggplot!
# (newer ggplot2 releases spell show_guide as show.legend and fun.y as fun)
ggplot(subDat, aes(Temperature, Herb_RGR, fill = Food_Type)) +
 stat_summary(geom = 'errorbar', fun.data = 'seFunc', width = 0, aes(color = Food_Type), show_guide = FALSE) +
 stat_summary(geom = 'point', fun.y = 'mean', size = 3, shape = 21) +
 ylab('Mass Change (g)') +
 xlab(expression('Temperature '*degree*C)) +
 scale_fill_discrete(name = 'Plant Species') +
 theme(
  axis.text = element_text(color = 'black', size = 12),
  axis.title = element_text(size = 14),
  axis.ticks = element_line(color = 'black'),
  legend.key = element_blank(),
  legend.title = element_text(size = 12),
  panel.background = element_rect(color = 'black', fill = NA)
 )
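
As a bridge to the Python section, here is a minimal sketch of the arithmetic behind seFunc, written in plain Python: the standard error is the sample standard deviation divided by the square root of the number of observations, and the error bar runs from mean - SE to mean + SE. The function name se_limits and the sample values are hypothetical, chosen only for illustration. Unlike the R helper, this sketch drops missing values before computing the mean and standard deviation as well as the count.

```python
import math
import statistics


def se_limits(values):
    """Return the error-bar limits (mean - SE, mean + SE) for a list of numbers.

    se_limits is a hypothetical helper for illustration; missing values
    (None or NaN) are dropped before any statistic is computed.
    """
    clean = [v for v in values if v is not None and not math.isnan(v)]
    # Standard error: sample standard deviation over sqrt(n).
    se = statistics.stdev(clean) / math.sqrt(len(clean))
    mean = statistics.mean(clean)
    return {"ymin": mean - se, "ymax": mean + se}


# Invented sample values, including one missing measurement.
limits = se_limits([1.0, 2.0, 3.0, 4.0, float("nan")])
print(limits)
```

With the four non-missing values above, the mean is 2.5 and the standard error is about 0.645, so the limits fall roughly 0.645 either side of 2.5.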


Next, Python!

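# import modules (not shown in the original listing; 'py' is assumed to be matplotlib.pyplot)
import numpy as np
import pandas as pd
import matplotlib.pyplot as py

# read data
JapBeet_NoChoice = pd.read_csv("/Users/Nate/Documents/FIU/Research/JapBeetle_Temp_Herbivory/Data/No_Choice_Assays/JapBeet_NoChoice.csv")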

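# clean up
feeding = JapBeet_NoChoice.dropna(subset = ['Consumption'])
# assign the result back; an inplace replace on a selected column may not update the DataFrame
feeding['Temperature'] = feeding['Temperature'].replace(33, 35)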

# subset out the correct plants
keep = ['Platanus occidentalis', 'Rubus allegheniensis', 'Acer rubrum', 'Viburnum prunifolium', 'Vitis vulpina']
feeding2 = feeding[feeding['Food_Type'].isin(keep)]

# calculate means and SEs
group = feeding2.groupby(['Food_Type', 'Temperature'])
# named aggregation; the dict-renaming form of agg was removed in pandas 1.0
sum_stats = group['Herb_RGR'].agg(mean = 'mean', SE = lambda x: x.std() / np.sqrt(x.count())).reset_index()

# plot mean +/- SE for each plant species
for plant in keep:
    sub = sum_stats[sum_stats['Food_Type'] == plant]
    py.errorbar(sub['Temperature'], sub['mean'], yerr = sub['SE'],
                fmt = 'o', ms = 10, capsize = 0, mew = 1, alpha = 0.75,
                label = plant)

py.xlabel(u'Temperature (\u00B0C)')
py.ylabel('Mass Change')
py.xlim([18, 37])
py.xticks([20, 25, 30, 35])
py.legend(loc = 'upper left', prop = {'size':10}, fancybox = True, markerscale = 0.7)

Snazzy 2!

So, roughly the same number of lines (excluding the importing of modules and libraries), although the Python version is a bit more efficient (barely). For what it’s worth, I showed these two graphs to a friend and asked him which he liked more; he chose Python immediately. Personally, I like them both, and it’s hard for me to pick one over the other. The curious can see my much older, waaayyy less efficient, much more hideous version of this graph in my paper, but I warn you: it isn’t pretty. And the code was a nightmare (it was pre-ggplot2 for me, so it was made with R’s base plotting commands, which are a beast for this kind of graph).
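
The Python cleaning and summary steps can be tried without the original CSV. The following is a minimal sketch, assuming a small made-up table with the same column names as the post (Food_Type, Temperature, Consumption, Herb_RGR); every value in it is invented, and "Other plant" is a placeholder label, not a species from the study. It uses pandas' built-in "sem" aggregation, which computes the same quantity as std() / sqrt(count()).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for JapBeet_NoChoice.csv; all values are invented.
raw = pd.DataFrame({
    "Food_Type": ["Acer rubrum"] * 5 + ["Vitis vulpina"] * 4 + ["Other plant"] * 2,
    "Temperature": [20, 20, 33, 33, 33, 20, 20, 35, 35, 20, 20],
    "Consumption": [0.1, 0.2, 0.3, 0.4, np.nan, 0.2, 0.3, 0.4, 0.5, 0.1, 0.2],
    "Herb_RGR": [1.0, 3.0, 2.0, 4.0, 9.0, 2.0, 4.0, 5.0, 9.0, 1.0, 1.0],
})

keep = ["Acer rubrum", "Vitis vulpina"]

# Drop rows with no consumption measurement (R: subset(..., !is.na(Consumption))).
feeding = raw.dropna(subset=["Consumption"]).copy()

# Relabel the 33 degree treatment as 35 (R: Temperature[Temperature == 33] <- 35).
feeding["Temperature"] = feeding["Temperature"].replace(33, 35)

# Keep only the plants of interest (R: Food_Type %in% plants).
feeding = feeding[feeding["Food_Type"].isin(keep)]

# Mean and standard error of Herb_RGR per plant and temperature.
sum_stats = (
    feeding.groupby(["Food_Type", "Temperature"])["Herb_RGR"]
    .agg(mean="mean", SE="sem")
    .reset_index()
)
print(sum_stats)
```

The result has one row per plant and temperature combination, which is the shape the errorbar loop in the post expects.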
