If you’re new to data science, or your organization is, you’ll need to pick a language to analyze your data and a thoughtful way to make that decision. Full disclosure: While I can write Python, my background is mostly in the R community—but I’ll try my best to be non-partisan.
The good news is that you don’t need to sweat the decision too hard: both Python and R have vast software ecosystems and communities, so either language is suitable for almost any data science task.
The two most commonly used programming language indexes, TIOBE and IEEE Spectrum, rank languages by popularity. They use different criteria for popularity, which explains the differences in their results (TIOBE is entirely based on search engine results; IEEE Spectrum also includes community and social media data sources like Stack Overflow, Reddit, and Twitter). Among the languages on each list that are commonly used for data science, both indexes rank Python first, followed by R. MATLAB and SAS come in third and fourth place, respectively.
Now that we’ve established that Python and R are both good, popular choices, there are a few factors that may sway your decision one way or the other.
What language do your colleagues use?
The most important factor in deciding which programming language to use is the language your colleagues already use. The benefits of sharing code with your colleagues and maintaining a simpler software stack outweigh any advantage one language has over the other.
Who is working with data?
Python was originally developed as a programming language for software development (the data science tools were added later), so people with a computer science or software development background often find Python comes more naturally to them. That is, the transition from other popular programming languages like Java or C++ to Python is easier than the transition from those languages to R.
R has a set of packages known as the Tidyverse, which provide powerful yet easy-to-learn tools for importing, manipulating, visualizing, and reporting on data. Using these tools, people without any programming or data science experience (at least anecdotally) can become productive more quickly than in Python. If you want to test this for yourself, try taking Introduction to the Tidyverse, which introduces R’s dplyr and ggplot2 packages, and Introduction to Data Science in Python, which introduces Python’s pandas and Matplotlib packages, and see which you prefer.
Verdict: If data science in your organization will primarily be conducted by a dedicated team with programming experience, Python has a slight advantage. If you have many employees who don’t have a data science or programming background, but who still need to work with data, R has a slight advantage.
What tasks are you performing?
While both Python and R can handle almost any data science task you can think of, there are some areas where one language is stronger than the other.
| Where Python Excels | Where R Excels |
| --- | --- |
| The majority of deep learning research is done in Python, so tools such as Keras and PyTorch have “Python-first” development. You can learn about these topics in Introduction to Deep Learning in Keras and Introduction to Deep Learning in PyTorch. | A lot of statistical modeling research is conducted in R, so there’s a wider variety of model types to choose from. If you regularly have questions about the best way to model data, R is the better option. DataCamp has a large selection of courses on statistics with R. |
| Another area where Python has an edge over R is deploying models into other pieces of software. Since Python is a general-purpose programming language, you can write the whole application in Python, and including your Python-based model in it is then seamless. We cover deploying models in Designing Machine Learning Workflows in Python and Building Data Engineering Pipelines in Python. | The other big trick up R’s sleeve is easy dashboard creation using Shiny. This enables people without much technical experience to create and publish dashboards to share with their colleagues. Python’s Dash is an alternative, but not as mature. You can learn about Shiny in Building Web Applications with Shiny in R and Building Web Applications with Shiny in R: Case Studies. |

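
As a rough illustration of the deployment point above, the sketch below keeps model fitting and application code in a single Python program. Everything in it is an assumption made for illustration: the `fit_line`, `save_model`, `load_model`, and `handle_request` names, the toy data, and the JSON storage format are hypothetical, and a real project would more likely rely on a modeling library than on a hand-written least-squares fit.

```python
import json


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return {"slope": slope, "intercept": y_mean - slope * x_mean}


def save_model(model):
    """Serialize the fitted parameters (hypothetical storage format)."""
    return json.dumps(model)


def load_model(payload):
    """Restore the fitted parameters inside the application."""
    return json.loads(payload)


def handle_request(payload, x):
    """Hypothetical application code that serves one prediction."""
    model = load_model(payload)
    return {"input": x, "prediction": model["slope"] * x + model["intercept"]}


# Analysis side: fit the model on made-up data that lies on y = 2x + 1.
payload = save_model(fit_line([1, 2, 3, 4], [3, 5, 7, 9]))

# Application side: the same language loads the model and uses it directly.
response = handle_request(payload, 10)
print(response)
```

Because the model and the application share a language here, no cross-language bridge is needed: the fitted parameters travel as plain data and the prediction is an ordinary function call.
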
This list is far from exhaustive and experts endlessly debate which tasks can be done better in one language or another. Again, there is more good news: Python programmers and R programmers borrow good ideas from each other a lot. For example, Python’s plotnine data visualization package was inspired by R’s ggplot2 package, and R’s rvest web scraping package was inspired by Python’s BeautifulSoup package. So eventually the best ideas from either language make their way into the other.
If you’re too impatient to wait for a particular feature in your language of choice, it’s also worth noting that there is excellent language interoperability between Python and R. That is, you can run R code from Python using the rpy2 package, and you can run Python code from R using reticulate. That means that all the features present in one language can be accessed from the other language. For example, the R version of deep learning package Keras actually calls Python. Likewise, rTorch calls PyTorch.
What do your competitors use?
If you work at a business that is growing fast and want to recruit top employees, it’s worth doing some opposition research to see what technologies your competitors are using. After all, your new hires will be productive more quickly if they don’t have to learn a new language.
Programming language wars are mostly excuses for people to promote their favorite language and have fun trolling people who use something else. So I want to be clear that I’m not interested in starting another argument on the internet about Python versus R for data science.
I hope I’ve convinced you that, while both Python and R are good choices for data science, factors like employee background, the problems you work on, and the culture of your industry can guide your decision.