by Andy Nicholls
When speaking with clients and other R users at events such as LondonR and EARL, I’ve noticed an increasing trend in people looking to learn some python as the next step in their data science journey. At Mango most of our consultants are pretty happy using either language, but as an R user of 12 or so years I’ve only ever dabbled with python. Recently, however, I found myself having to learn quickly, and so I thought I’d share some of my observations.
Before you stop reading I should say that I am fully aware that there are many blog posts covering the high level pros and cons of each language. For this post I thought I’d get down to the nitty gritty. What does an R user really experience when trying to pick up python? In particular what does an R user that comes from a statistics background experience?
Personally I found eight niggles (I wanted 10 but python is too good) and here they are:
- Lack of Hadley. So there is a Wes but there is a lot of duplication in functionality between packages. To start with you import statistics and find the mean function only to find it has been re-written for pandas. Later you find that everyone has their own idea on the best way to implement cross-validation. All very confusing when you start out. This brings me on to:
- Plotting. I had heard a lot of good things about matplotlib and seaborn but ggplot2 is streets ahead (IMHO). I would even go as far as to say that ggplot2 has a shallower learning curve.
- IDEs. Hats off to RStudio for changing the R world when it comes to IDEs. I remember a time before RStudio when the R GUI, StatET and Tinn-R were the norm. How things have improved. Sadly, python is not quite there yet. As an RStudio user I opted for Spyder. It’s OK but the script editor needs some work. The integration in Jupyter Notebook seems much better when I chat with colleagues but I’m just not a big fan of notebooks.
- Namespaces. I’ve lost count of the number of times I’ve told trainees on an intro to R course that masking very rarely trips you up as a user (unless you’re building packages it really doesn’t). Let’s just say that in python you have to be careful. Bring too much in and you’ll overwrite your own objects and cause chaos. This means you bring in things as and when you need them. Having to explicitly import OS utilities in order to change the working directory and so on is frustrating. That said, python’s capabilities are a little better than R’s in this area.
- Object Orientation. I’ve grown to love R’s flexible S3 classes with lines like:
    > x <- 5
    > class(x) <- "just_made_this_up"
    > x
    [1] 5
    attr(,"class")
    [1] "just_made_this_up"
In python I am never quite sure what methods exist for an object and when to just go functional. You also really have to know about classes to work with python effectively whereas a casual R user can get by without even knowing that R has a class system.
- Reliance on R. On my recent project I was using the best of the statistical capabilities in python. First off I should say that it’s basically all there (except for stepwise GLMs for some bizarre reason). However, although I’ve always known that most of the statistical modelling capabilities in python have been ported from R, the documentation is pretty lazy and most of it just points you at the R documentation. The example datasets are even the same! Speaking of the documentation:
- Help documentation. I can only speak for the more popular packages in the two languages but the R documentation is much more plentiful and generally contains a lot more examples.
- Zero-based arrays. I couldn’t write a list without this coming up. I do love it when smug coders that have developed in other languages tell me that R is the exception here by indexing from 1. However, as a human being I count from 1 and this will always make more sense to me. Ending at n-1 is also confusing. Compare:
    # R
    x = seq(2,10, by = 2)
    x[1:3] # Select first 3 elements
    [1] 2 4 6

    # Python
    x = list(range(2,11, 2))
    x[0:3] # Select first 3 elements
    [2, 4, 6]
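For anyone who wants to poke at a few of the niggles above, here is a minimal sketch of three of them in python: a direct import rebinding a name you had already defined, using dir() to see which methods an object offers, and zero-based slices that stop just before the end index. The variable names are made up for illustration and the sketch only uses the standard library, so treat it as a rough demonstration, not a recommended workflow.

```python
import os
import types

# --- Namespaces: a direct import silently rebinds an existing name ---
path = "results.csv"       # our own object (hypothetical file name)
from os import path        # `path` now refers to the os.path module instead
path_was_rebound = isinstance(path, types.ModuleType)

# Keeping the module prefix avoids the clash.
cwd = os.getcwd()          # rough counterpart of R's getwd()

# --- Methods vs functions: dir() lists what an object offers ---
values = [6, 2, 10, 4, 8]
list_methods = [name for name in dir(values) if not name.startswith("_")]
ordered = sorted(values)   # built-in function: returns a new sorted list
values.sort()              # list method: sorts in place and returns None

# --- Zero-based, half-open slices ---
x = list(range(2, 11, 2))  # the even numbers 2 to 10
n = 3
first_n = x[:n]            # indices 0, 1, 2; the stop index n is excluded
rest = x[n:]               # picks up exactly where the first slice stopped

print(path_was_rebound, list_methods, first_n, rest)
```

Because the end index is excluded, x[:n] and x[n:] always split a list cleanly in two with nothing repeated or dropped, which is the usual argument in favour of the convention.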
What I was impressed by was how extensively the statistical capabilities in R have been ported to python (I wasn’t expecting the mixed modelling or survival analysis capabilities to be anything like that in R for example). However, as an existing R user there really is no point in switching to python for statistics. The only benefit would be if you were using python for, say, extensive web-scraping and you wanted to be consistent. If that’s your reason though then let me point you towards Chris Musselle’s blog post, “Integrating Python and R Part II – Executing R from Python and Vice Versa”. And don’t forget that you can also just use rvest.
So my advice would be if you’re going to try to learn python, don’t learn it with the intention of using it to build models. Learn it because it’s a more flexible all-round programming language and you have some heavy lifting to do. Just find something that’s hard to do in R and try using python for that. Otherwise you’ll end up like me, writing a whingy blog post!