In Data Science from Scratch, a book introducing data science using Python, Joel Grus said the following about R (pg. 302):
Although you can totally get away with not learning R, a lot of data scientists and data science projects use it, so it’s worth getting familiar with it.
In part, this is so that you can understand people’s R-based blog posts and examples and code; in part, this is to help you better appreciate the (comparatively) clean elegance of Python; and in part, this is to help you be a more informed participant in the never-ending “R versus Python” flamewars.
From the wording he uses to the content of his message, Mr. Grus demonstrates how programmers can be passionate about the languages they use. Every year, websites publish articles about the popularity of programming languages and what the current trends are. AppDynamics, an American application performance management and IT operations analytics company, recently published an article on the most popular languages of 2017 on their website, projecting what languages they believe will remain dominant and which up-and-comers people should keep an eye on (in a broad sense). Meanwhile, statisticians and data scientists regularly write articles tracking the horse race between R, Python (the front runners), SAS, and the rest. Bob Muenchen, in a recent article, tracked the popularity of languages in job postings and found that R recently surpassed SAS, but both are less popular than Python. There’s also the “doomsayer” genre that’s regularly published, where people argue that such-and-such languages are “dying.” Quora questions regularly pop up asking “Is [insert language here] dying?” Dice.com has a 2014 article of languages marked for death, with a 2016 follow-up. In my personal life, my brother’s fiance, a computer programmer, continues to insist that Python is a dying language (a statement I disagree with).
Programmers, or those who write programs regularly, love talking about languages. No one would question how important languages are in general, but many love to compare the popularity and merit of languages in use. The Python people believe Python is best for data science; it’s “simpler than R” (cough cough) and, unlike R, is a general-purpose programming language. The R people believe R is better; it is built specifically for data analysis, has a larger universe of packages devoted to data analysis, and supports functional programming better. But wait, the Pythonistas say: R’s approach to OOP is… bizarre, to put it mildly. Well, the R crowd replies, at least we can switch text editors without fearing screwing up all the white space and our script becoming one giant IndentationError for reasons unknown (curly braces for the win!). And so on.
The battle of the languages, though, is no laughing matter. Set aside for a second what language(s) to choose when starting a project. What languages practitioners do and don’t know directly affects their professional success. Failing to keep on top of industry trends can lead to losing a job and never getting another to replace it.
I know this very well.
When I was born, my dad was the editor of a local newspaper, but lost his job for reasons not fully in his control (I hear it amounts to workplace politics). So he went through an intensive two-year education to get an associate degree in computer science. When Dad graduated, I walked up the stage with him, a toddler at the time. (Read my Dad’s blog post, written when I graduated with my BS in Mathematics and HBS in Economics.)
So for most of my life (I’d guess around twenty years), my dad was a computer programmer, writing COBOL code. We had a decent middle-class living through my life. Unfortunately, Dad did not expand his skills. COBOL was losing popularity. Dad thought that knowing a rare language was an asset, but being proficient only in a rare language turned out to be a liability. He did eventually realize this and sought employers who would train him, but none of them, from Discover to Wencor to the State of Utah, trained him in more popular modern languages (even when they promised they would). He tried participating in a training program offered by the Utah Department of Workforce Services, but the company they hired for the training was… well, terrible. He didn’t learn anything. (Why the State of Utah chose to have a private company provide training instead of, say, Salt Lake Community College, is beyond me. I’m convinced that this company has a parasitic deal with the state government, where they get money from the state to provide crappy services to people down on their luck looking to better their lives, fueling my skepticism of education from the private sector. But I digress.)
The conclusion of the story: my parents declared bankruptcy, the house I grew up in from 1996 up until 2015 was foreclosed on, and my dad is a bus driver now. He’s kept looking for computer programming work, and bought a book on Java, but the life of the poor is hard, and erratic bus driving schedules coupled with living paycheck-to-paycheck make learning programming hard, especially without a decent computer.
Choosing What to Use
So yes, programming languages matter. They make or break careers. Furthermore, no one can depend on employers to give their employees all the skills they need to stay relevant in the labor market; one should stay on top of trends and be prepared to take the initiative oneself.
But what language to learn?
I’m not going to provide any data; I’ve linked to at least two good articles that could give a good description of the “lay of the land”. I’m merely going to note what I’ve noticed.
As much as people like to talk about the features of this or that programming language and how that makes them better or worse, one of the key factors that determines what language will be used in a project is its existing user base. One reason why the user base matters is that it determines how easy it is to find answers to the questions that inevitably arise. In the day and age of programming using the Google+StackExchange method, one cannot overstate how important this is. If you’re using a popular programming language in your field, chances are that any problem you encounter has already been solved.
Another reason that plays into the former is that packages are, in many ways, more important than the languages themselves, and the user base determines what packages will be written for the language. This means that not only are users more likely to find existing code that does what they want to do, that code will likely be better supported since there are more eyes looking at it to identify undesirable behavior, meaning those packages will be of better quality. If healthcare analysts prefer R, you will likely find lots of high-quality R packages for healthcare-related applications, while if data scientists prefer Python, you can expect lots of excellent machine learning packages for Python (as a hypothetical example).
I learned this lesson from experience. In my last year as an undergraduate at the University of Utah, I worked for a non-profit policy advocacy group called Voices for Utah Children studying Utah’s gender gap in wages. I wrote two reports. One was a basic study where I sliced and diced the gender gap in different ways. The second was my Honors thesis, where I did a more advanced econometric study of the pay gap.
For the first study, I used R because it was a language I had learned while taking statistics courses, and it’s a free, open-source language. I was an inexperienced programmer and I was working alone (my supervisor at Voices for Utah Children did not do anything programming-related), but using the well-known Google+StackExchange method, I was able to learn a lot, enough to do the job. Granted, these days I don’t even want to look at the scripts I wrote then, they were so terrible, but I still managed to learn a lot and get the job done.
For my Honors thesis, I still wanted to keep using R, but now I needed to work with a faculty member in the Economics department. He used Stata for his work; in fact, the vast majority of econometricians and those doing statistical work in policy or social science use Stata. Relevant data was provided in Stata-friendly formats. Not only were the packages I needed for my project best supported in Stata, the equivalent R packages were not just inflexible and hard to use, they may not have even worked at all (or at least not for how I needed to use them)! Even communicating with my thesis adviser about what was being done was difficult, leading to perhaps months of wasted time and effort. On one fateful day, when there were discrepancies between how R and Stata were subsetting the same data set, I was in my adviser’s office trying to work things out. I was repeatedly producing errors in my code, slowing the process down, trying to work with code that looked unnecessarily complicated, and my adviser eventually said, “I know one thing; I will never use R.”1 The day he said that, I went home and paid $200 for Stata. R may have been the superior language (and I still believe so), but I was swimming upstream trying to use it.
An unfortunate consequence of what I’ve just described is that better tools or languages may not see use simply because they’re not popular (an example of a network effect). Those few brave souls who try to use better tools are in for a rough ride. That said, unless the benefit of using a “better” language surpasses the cost of going against the consensus, it’s better to stick with what’s popular.
Learning to Learn
That said, what’s more important than learning any particular programming language is learning how to learn programming languages. If there are two popular programming languages in a field (say, R and Python), learn both. Learning lots of programming languages is surprisingly easy. Eventually, familiar patterns appear that make learning new languages easier. Feel free to specialize in a few, but a broad skill set is more valuable than a narrow one.
The best reason to keep learning new languages, though, is that technology is always changing. Once upon a time, hardware was more important than software. Low-level programming languages were key since one had to optimize heavily for limited speed and storage space. Later, Moore’s Law led to lots of processing power and hard-drive space, so less efficient languages, starting with C but going so far as Python, became popular; programmer time was more valuable than computer time. Future innovations are likely to render the existing order obsolete as well. One can easily imagine quantum computing revolutionizing software again, bringing in a new set of programming languages that anyone worth her salt will need to learn to stay relevant.
People are welcome to continue debating the merits of this language or that. What matters most, though, is what people are actually using, and the field is always changing, from fads to reactions to truly revolutionary technology. So one must be continually learning to stay relevant today and stay sharp for what’s ahead. I at least enjoy the process.
This blog post was inspired by “The Most Popular Programming Languages for 2017,” by Jordan Bach, which was brought to my attention by Bethany Emerson at Ghergich & Co. If you’re interested in investigating new programming languages to learn, consider the infographic below. Click the image below to read the full article.
1. Granted, I did not know then what I know about R now, and a lot of what I’ve learned since then would likely have led to that experience in my adviser’s office going better. At the time, I was unaware of Hadley Wickham, dplyr, or the tidyverse in general; I was subsetting using the abominable which() function. A few years of following R-Bloggers has done wonders for my R skills. But even if I had known of dplyr, I would eventually have been forced to switch to Stata anyway. I needed to do Oaxaca-Blinder decomposition on CPS survey data, using regressions robust to heteroskedasticity. With Stata, doing this is almost trivially easy, but with R, I would need to use the sandwich, survey, and oaxaca packages and combine them together in a way they refuse to combine. survey is extremely difficult to understand, isn’t very flexible, and does not combine with sandwich to get heteroskedasticity-robust standard errors. oaxaca’s principal function, oaxaca(), has a terrible interface that, first, uses the parameter name weight that the function lm() would need to use in an entirely different way, and second, refuses to allow custom functions to compute regressions in a way that alleviates the first problem. With this in mind, I’m not shocked at all that econometricians use Stata instead of R. The tidyverse revolution has yet to touch R econometrics in a way that would make it remotely usable even for a task as basic as computing a linear regression with heteroskedasticity-robust standard errors on survey data. Oaxaca decompositions are also very common, yet they are not practically doable in R right now without rewriting the function. Someone needs to take a look at this. End of rant.
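For readers who have never seen an Oaxaca-Blinder decomposition, here is a minimal sketch of the twofold version in plain Python. It is not what I ran for my thesis: it uses a single regressor, ignores survey weights, and computes no standard errors (robust or otherwise), and the function names and the numbers are made up purely for illustration. The idea is to fit a separate regression for each group and then split the gap in mean outcomes into a part explained by differences in characteristics and a part left unexplained.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares fit of y = a + b*x; returns (intercept, slope)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def oaxaca_twofold(x_a, y_a, x_b, y_b):
    """Twofold decomposition of mean(y_a) - mean(y_b).

    Group B's coefficients serve as the reference (an assumption of this
    sketch; other reference choices give different splits).
    """
    a_a, b_a = fit_line(x_a, y_a)
    a_b, b_b = fit_line(x_b, y_b)
    gap = mean(y_a) - mean(y_b)
    # Part of the gap due to different average characteristics.
    explained = (mean(x_a) - mean(x_b)) * b_b
    # Part due to different coefficients (intercept and slope).
    unexplained = (a_a - a_b) + mean(x_a) * (b_a - b_b)
    return {"gap": gap, "explained": explained, "unexplained": unexplained}


# Hypothetical data: years of experience (x) and hourly wage (y).
x_men = [2, 4, 6, 8, 10]
y_men = [12.0, 15.0, 17.5, 21.0, 24.5]
x_women = [1, 3, 5, 7, 9]
y_women = [10.0, 12.0, 14.5, 16.0, 18.5]

result = oaxaca_twofold(x_men, y_men, x_women, y_women)
print(result)
```

Because each regression includes an intercept, the explained and unexplained parts add up to the raw gap exactly. The hard part I complained about above is everything this sketch leaves out: weights, multiple regressors, and trustworthy standard errors.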