Cleaning Data and Graphing in R and Python

Nathan Lemoine

8 years ago

[This article was first published on Climate Change Ecology » R, and kindly contributed to R-bloggers].

Python has some pretty awesome data-manipulation and graphing capabilities. If you’re a heavy R user who dabbles in Python like me, you might wonder what the equivalent commands are in Python for dataframe manipulation. I was also curious to see how many lines of code it took me to do the same task (load, clean, and graph data) in both R and Python. (I’d like to head off the arguments about efficiency and which language is better here, because neither my R nor my Python code is the super-efficient, optimal way to program. It is, however, how I do things, and to me that’s what matters. Also, I’m not trying to advocate one language over the other (programmers can be a sensitive bunch); I just wanted to post an example showing how to do equivalent tasks in each language.)

First, R

# read Data
JapBeet_NoChoice <- read.csv("~/Documents/FIU/Research/JapBeetle_Temp_Herbivory/Data/No_Choice_Assays/JapBeet_NoChoice.csv")
# drop incomplete data (rows where Consumption is missing)
feeding <- subset(JapBeet_NoChoice, !is.na(Consumption))
# refactor and clean
feeding$Food_Type <- factor(feeding$Food_Type)
feeding$Temperature[which(feeding$Temperature==33)] <- 35

# subset
plants <- c('Platanus occidentalis', 'Rubus allegheniensis', 'Acer rubrum', 'Viburnum prunifolium', 'Vitis vulpina')
subDat <- feeding[feeding$Food_Type %in% plants, ]

# make a standard error function for plotting
# (returns the lower and upper limits of mean +/- 1 SE)
seFunc <- function(x){
 x <- x[!is.na(x)]
 se <- sd(x) / sqrt(length(x))
 lims <- data.frame(ymin = mean(x) - se, ymax = mean(x) + se)
 return(lims)
}

# ggplot!
# (show.legend and fun replace the older show_guide and fun.y arguments in current ggplot2)
library(ggplot2)
ggplot(subDat, aes(Temperature, Herb_RGR, fill = Food_Type)) +
 stat_summary(geom = 'errorbar', fun.data = 'seFunc', width = 0, aes(color = Food_Type), show.legend = FALSE) +
 stat_summary(geom = 'point', fun = 'mean', size = 3, shape = 21) +
 ylab('Mass Change (g)') +
 xlab(expression('Temperature '*degree*C)) +
 scale_fill_discrete(name = 'Plant Species') +
 theme(
 axis.text = element_text(color = 'black', size = 12),
 axis.title = element_text(size = 14),
 axis.ticks = element_line(color = 'black'),
 legend.key = element_blank(),
 legend.title = element_text(size = 12),
 panel.background = element_rect(color = 'black', fill = NA)
 )

Snazzy!
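
For anyone reading from the Python side, here is a rough sketch of what a helper like seFunc is meant to hand to the error bars: the mean minus and plus one standard error, with missing values ignored. The name se_limits and the numbers are made up for illustration; this is not part of the original analysis.

```python
import numpy as np

def se_limits(x):
    """Return (ymin, ymax) = mean -/+ one standard error, ignoring NaNs."""
    x = np.asarray(x, dtype=float)
    x = x[~np.isnan(x)]                # drop missing values first
    se = x.std(ddof=1) / np.sqrt(x.size)  # ddof=1 matches R's sd()
    m = x.mean()
    return m - se, m + se

# invented values, purely for illustration
ymin, ymax = se_limits([1.0, 2.0, 3.0, 4.0, np.nan])
```

The ddof=1 matters: NumPy's default is the population standard deviation, while R's sd() uses the sample version.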

Next, Python!

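# modules used below (the aliases np, pd and py match the calls in this script)
import numpy as np
import pandas as pd
import matplotlib.pyplot as py

# read data
JapBeet_NoChoice = pd.read_csv("/Users/Nate/Documents/FIU/Research/JapBeetle_Temp_Herbivory/Data/No_Choice_Assays/JapBeet_NoChoice.csv")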

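# clean up (.copy() so the replacement below acts on feeding itself, not on a slice of the raw data)
feeding = JapBeet_NoChoice.dropna(subset = ['Consumption']).copy()
feeding['Temperature'] = feeding['Temperature'].replace(33, 35)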

# subset out the correct plants
keep = ['Platanus occidentalis', 'Rubus allegheniensis', 'Acer rubrum', 'Viburnum prunifolium', 'Vitis vulpina']
feeding2 = feeding[feeding['Food_Type'].isin(keep)]

# calculate means and SEs (named aggregation; the dict-renaming form was removed in pandas 1.0)
group = feeding2.groupby(['Food_Type', 'Temperature'])
sum_stats = group['Herb_RGR'].agg(mean = 'mean', SE = lambda x: x.std() / np.sqrt(x.count())).reset_index()

# PLOT (one errorbar series per plant species)
for plant in keep:
    sub = sum_stats[sum_stats['Food_Type'] == plant]
    py.errorbar(sub['Temperature'], sub['mean'], yerr = sub['SE'],
                fmt = 'o', ms = 10, capsize = 0, mew = 1, alpha = 0.75,
                label = plant)

py.xlabel(u'Temperature (\u00B0C)')
py.ylabel('Mass Change')
py.xlim([18, 37])
py.xticks([20, 25, 30, 35])
py.legend(loc = 'upper left', prop = {'size':10}, fancybox = True, markerscale = 0.7)
py.show()

Snazzy 2!
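
If you don't have my CSV handy, the cleaning and summarising steps can be tried on a toy table. This is only a sketch: every value below is invented, the extra species is there just to have something to filter out, and I've swapped the lambda for pandas' built-in sem, which computes the same thing (sample standard deviation divided by the square root of n).

```python
import numpy as np
import pandas as pd

# Toy stand-in for the beetle data; all values are invented.
toy = pd.DataFrame({
    'Food_Type':   ['Acer rubrum'] * 4 + ['Vitis vulpina'] * 3 + ['Rosa multiflora'],
    'Temperature': [20, 20, 33, 33, 20, 20, 20, 20],
    'Consumption': [0.5, 0.7, 0.9, 0.8, np.nan, 0.4, 0.6, 0.3],
    'Herb_RGR':    [0.10, 0.30, 0.50, 0.70, 0.20, 0.40, 0.60, 0.90],
})

# drop incomplete rows, then recode the 33-degree treatment as 35
clean = toy.dropna(subset=['Consumption']).copy()
clean['Temperature'] = clean['Temperature'].replace(33, 35)

# keep only the species of interest
keep = ['Acer rubrum', 'Vitis vulpina']
sub = clean[clean['Food_Type'].isin(keep)]

# mean and standard error per species and temperature
summary = (sub.groupby(['Food_Type', 'Temperature'])['Herb_RGR']
              .agg(['mean', 'sem'])
              .rename(columns={'sem': 'SE'})
              .reset_index())
print(summary)
```

The resulting table has one row per species and temperature, which is the shape the plotting loop expects.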

So, roughly the same number of lines (excluding the importing of modules and libraries), although the Python version is a bit more compact (barely). For what it’s worth, I showed these two graphs to a friend and asked him which he liked more; he chose Python immediately. Personally, I like them both. It’s hard for me to pick one over the other. I think they’re both great. The curious can see my much older, waaayyy less efficient, much more hideous version of this graph in my paper, but I warn you: it isn’t pretty. And the code was a nightmare (it was pre-ggplot2 for me, so it was made with R’s base plotting commands, which are a beast for this kind of graph).
