**Climate Change Ecology » R**, and kindly contributed to R-bloggers)

I’m interested in Python a lot, mostly because it appears to be wickedly fast. The downside is that I don’t know it nearly as well as R, so any speed gain in computation time is more than offset by Google time trying to figure out how to do what I want. However, I’m curious as to know how much faster Python is for things like simulating data (because I have a particular interest in neutral models and data simulation). So I faked a linear regression and bootstrapped the confidence interval on the slope. Since I’m only interested in the comparison in speed for bootstrapping, that’s all I’m going to show.

First, with R:

set.seed(101) # generate data x <- 0:100 y <- 2*x + rnorm(101, 0, 10) # plot data plot(x, y) # run the regression mod1 <- lm(y ~ x) yHat <- fitted(mod1) # get the residuals errors <- resid(mod1) # make a bootstrapping function boot <- function(n = 10000){ b1 <- numeric(n) b1[1] <- coef(mod1)[2] for(i in 2:n){ residBoot <- sample(errors, replace=F) yBoot <- yHat + residBoot modBoot <- lm(yBoot ~ x) b1[i] <- coef(modBoot)[2] } return(b1) } # Run the bootstrapping function system.time( bootB1 <- boot() ) mean(bootB1)

On my system, it takes about 10.5 seconds to run this bootstrap.

In Python:

import numpy as np import pylab as py import statsmodels.api as sm import time # Create the data x = np.arange(0, 101) y = 2*x + np.random.normal(0, 10, 101) # Add the column of ones for the intercept X = sm.add_constant(x, prepend=False) # Plot the data py.clf() py.plot(x, y, 'bo', markersize=10) # Define the OLS models mod1 = sm.OLS(y, X) # Fit the OLS model results = mod1.fit() # Get the fitted values yHat = results.fittedvalues # Get the residuals resid = results.resid # Set bootstrap size n = 10000 t0 = time.time() # Save the slope b1 = np.zeros( (n) ) b1[0] = results.params[0] for i in np.arange(1, 10000): residBoot = np.random.permutation(resid) yBoot = yHat + residBoot modBoot = sm.OLS(yBoot, X) resultsBoot = modBoot.fit() b1[i] = resultsBoot.params[0] t1 = time.time() elapsedTime = t1 - t0 print elapsedTime np.mean(b1)

On my system, the elapsed time is about 4.7 seconds. That’s a *substantial* increase in speed. I can only assume that for larger simulations, the time difference would be greater. Thus, it seems like Python may be the way to go for permutational analyses, especially when data are simulated.

So what I have learned about Python so far?

**PROS:**

- Super fast
- Language is ‘relatively easy’ to learn
- Has some built in stats capabilities
- Has great symbolic math capabilities

**CONS:**

- Language is different enough that learning is going to be slow
- Statistical capabilities are sparse, and R is an easy statistical language (so far)

Overall, if Python had good stats capabilities, I’d probably switch all together. As it is, I’m considering dropping R for things like modeling and simulations just because Python is so much faster. I’m also reading Numerical Ecology right now, and I’m considering sitting down and putting together the beginnings of a multivariate ecology stats module, but I bet that’s going to be more work than I have time for. It’d be pretty cool though.

**R-bloggers’ editor note:** this post has “open ends” which are thoroughly discussed in the bloggers comment section. For just one quote off that discussion, please notice Hadley’s comment:

I agree with your remarks in general, but it’s particularly problematic in this case because you compare a high-level R function designed to be robust against bad inputs to a low-level python function designed to be fast. To be fair to R, you also need to average in the cost of the one time where you provide numerically unstable input and R complains nicely but python fails silently and you spend two hours trying to track down the problem

**leave a comment**for the author, please follow the link and comment on their blog:

**Climate Change Ecology » R**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...