by Micheleen Harris
Microsoft Data Scientist
As a Data Scientist, I refuse to choose between R and Python, the top contenders currently fighting for the title of top Data Science programming language. I am not going to argue about which is better or pit Python and R against each other. Rather, I'm simply going to suggest playing to the strengths of each language and using them together in the same pipeline if you don't want to give up the advantages of one for the other. This is not a novel concept. Both languages have packages/modules that allow the other language to be used within them (rpy2 in Python and rPython in R). Even in Jupyter notebooks running the Python kernel, one can use “R magics” to execute native R code (which actually relies on rpy2).
I learned R and Python at about the same time. Being on fairly equal footing in both languages makes pipelining them together, when need be, an attractive option, as I have my favorite aspects of each. R is widely regarded as having crisp, clean, journal-quality graphics as well as an incredible arsenal of statistical packages. Python is a general-purpose language and, in some places, is considered a truly production-ready one. But who says you can't do the heavy statistics, machine learning and/or graphics in R from within Python? This blog is not about comparing the two languages, however; it is simply about options for pipelining them and maybe a bit on why you would want to do so.
First things first. We need to decide on a platform and here I'm focusing on notebooks. We actually could do all of this outside a notebook environment, but in general notebook systems are more sharable, interactive, and completely appropriate for demonstrations. If I were well-funded and wanted much more than a notebook I'd probably go with Sense, a notebook-like-IDE-like pipelining system (you can demo it here). If I wanted to be daring, I'd go with Beaker notebooks, a promising, new open-source polyglot notebook project: beakernotebook. This time around, however, I'm going to go with the more established Jupyter notebook project, running a Python 3.4 kernel and interacting with R 3.2.3 via “R magics” and the rpy2 Python module.
Why notebooks? In a nutshell, I've found them a great place to learn, teach, share, and test code (see my Aside below for further explanation). It took some getting used to, but notebooks are booming right now in university courses and at conferences, both academic and industry. Why Jupyter notebooks? One reason that I particularly like is that kernels for over 50 languages have been developed for the Jupyter notebook system thus far, including ones for Scala and Julia, two increasingly popular languages in the data science arena.
Getting back to R and Python, here are two notebook snippets I created.
The first demonstrates loading the rpy2 IPython extension (which provides the R magics), creating a pandas DataFrame in Python, passing this as input to R, and graphing the data with R's ggplot2 package.
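The notebook snippet itself is not reproduced in this text, so here is a minimal sketch of the same idea written as plain Python, not my exact cell. In a notebook, the equivalent would be `%load_ext rpy2.ipython` followed by a cell that starts with `%%R -i df`. The function names, column names, and output file name are made up for illustration, and the sketch assumes pandas, rpy2, and R with ggplot2 are installed.

```python
def make_columns(n=10):
    """Build toy data in pure Python: x = 0..n-1 and y = x squared."""
    xs = [float(i) for i in range(n)]
    ys = [x * x for x in xs]
    return {"x": xs, "y": ys}


def plot_with_ggplot2(columns, out_path="scatter.png"):
    """Send a pandas DataFrame to R and draw it with ggplot2."""
    # Imported here so the data-preparation code above runs without R.
    import pandas as pd
    import rpy2.robjects as ro

    df = pd.DataFrame(columns)

    # Build an R data.frame column by column from the pandas DataFrame.
    r_df = ro.DataFrame(
        {name: ro.FloatVector(df[name].tolist()) for name in df.columns}
    )

    # Make the data and the output path visible to R under known names.
    ro.globalenv["df"] = r_df
    ro.globalenv["out_path"] = ro.StrVector([out_path])

    # Native R code: a scatter plot saved to disk (hypothetical file name).
    ro.r("""
        library(ggplot2)
        p <- ggplot(df, aes(x = x, y = y)) + geom_point()
        ggsave(out_path, plot = p)
    """)
    return out_path
```

Calling `plot_with_ggplot2(make_columns())` should write the figure to scatter.png. Inside a notebook the R magics handle the DataFrame conversion for you, which is a big part of their appeal.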
The second demonstrates creating some data in Python with numpy, passing this as input to R, performing a linear fit, graphing the results with R's plot, and passing the results of the fit (the coefficients) back to Python for printing.
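Again, the notebook cell is not reproduced here, so the following is a rough sketch of that round trip in plain Python, under the same assumptions (numpy, rpy2, and R installed). In a notebook, a cell starting with `%%R -i x,y -o coefs` would play the role of `fit_in_r`. All function names, the toy data, and the output file name are illustrative, not taken from my notebook.

```python
def make_data(n=50, slope=2.0, intercept=1.0, seed=0):
    """Create noisy points along a straight line with numpy."""
    import numpy as np

    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 10.0, n)
    y = intercept + slope * x + rng.normal(0.0, 1.0, size=n)
    return x, y


def fit_in_r(x, y, out_path="fit.png"):
    """Fit y ~ x with R's lm, plot it in R, and return the coefficients."""
    # Imported here so the rest of the code runs without R installed.
    import rpy2.robjects as ro

    # Pass the Python data to R as numeric vectors.
    ro.globalenv["x"] = ro.FloatVector([float(v) for v in x])
    ro.globalenv["y"] = ro.FloatVector([float(v) for v in y])
    ro.globalenv["out_path"] = ro.StrVector([out_path])

    # Native R code: linear fit plus a base-graphics plot saved to disk.
    ro.r("""
        fit <- lm(y ~ x)
        png(out_path)
        plot(x, y)
        abline(fit)
        dev.off()
    """)

    # Bring the coefficients back to Python: [intercept, slope].
    return [float(c) for c in ro.r("coef(fit)")]


def describe_fit(coefs):
    """Format the coefficients returned by fit_in_r for printing."""
    intercept, slope = coefs
    return "intercept = {:.3f}, slope = {:.3f}".format(intercept, slope)
```

A typical call would be `print(describe_fit(fit_in_r(*make_data())))`. With the default toy data, the printed values should land close to the intercept of 1 and slope of 2 used to generate it.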
Aside: When I first came across notebook systems, I really disliked them. I'd start writing a chunk of code in a cell and end up switching to my favorite IDE instead, abandoning the notebook. So, what happened? I started teaching. Creating modules in notebook format for students to step into, run interactively, modify, test, etc. seemed like an excellent (and fun) way to learn. In fact, I started learning this way myself, looking at notebooks I found online. I also found that workshop presenters at recent conferences are using notebooks to teach. I could grab these notebooks and go back to them over and over on my own notebook server (which is fairly easy to set up). In the end it was a shift in perspective and comfort for me, much like learning a new language. I use notebooks now for teaching, learning, testing, sharing code, and documentation. I might even start blogging in a notebook system soon (this is already a thing people do).
Also, excitingly, Jupyter Notebook had a new release at the time of writing this (4.1 came out January 8, 2016). Check out the announcement. Now, we have multi-cell selection, find-and-replace and “Restart and Run All Cells” and some other nifty stuff.
Note: if you are trying to get the latest rpy2 working on the latest Jupyter notebook on Windows, be warned that you might run into a console-writing issue. That is, print statements might write to the terminal instead of the notebook browser window. If this happens, the bug is probably still open, and it is worth reporting to the rpy2 developers. This happened to me with rpy2 2.7.6 with Jupyter notebook 4.1 and R 3.2.2 on Windows 10.