Harrison – Center for Strategic and Budgetary Assessments, Washington DC
Cara – Department of the Air Force (Studies, Analyses, and Assessments – AF/A9), Washington DC
The views expressed in this article represent the personal views of the author and are not necessarily the views of the Department of Defense (DoD) or the Department of the Air Force.
This post is an effort to condense the ‘buzz’ surrounding the explosion of open source solutions in all facets of analysis – to include those done by Military Operations Research Society (MORS) members and those they support – by describing our experiences with the R programming language.
The impact of R in the statistical world (and by extension, data science) has been big: worthy of an entire issue of SIGNIFICANCE Magazine (RSS). Surprisingly, R is not a new computing language; modeled on S and Scheme, the technology at the core of R is over forty years old. This longevity is, in itself, noteworthy. Additionally, fee-based and for-profit companies have begun to incorporate R into their products. While statistics is the focus of R, with the right packages – and know-how – it can also be used for a much broader spectrum of work, to include machine learning, optimization, and interactive web tools.
In this post, we go back and forth discussing our individual experiences.
Getting Started with R
Harrison: I got started with R in earnest shortly before retiring from the U.S. Navy in 2016. I knew that I was going to need a programming language to take with me into my next career. The reason I chose R was not particularly analytical; the languages that I had done the most work in during grad school – MATLAB and Java – were not attractive in that the first required licensing fees and the second was – to me at the time – too ‘low level’ for the type of analysis I wanted to perform. I had used S-PLUS in my statistics track, but never really ‘took’ to it while in school. Several tools to ‘bridge’ the gap between Excel and R had been recommended to me by a friend, including RStudio and R Commander.
Onboard ship many years ago, I learned to eat with chopsticks by requesting that the wardroom staff stop providing me with utensils, substituting a bag of disposable chopsticks I purchased in Singapore. It turns out that when deprived of other options, you can learn very fast. Learning R basics was the same; instead of taking away the silverware, I removed the shortcuts to my usual tool (Excel) from my home laptop. I simply did every task that I could, from the mundane to the elegant, in R.
Cara: I started dabbling with R in 2017 when I had about a year and a half left in my PhD journey, after I decided to pursue a post-doctoral government career. Sitting comfortably in academia with abundant software licenses for almost a decade, I had no reason until that point to consider abandoning my SAS discipleship (other than abhorrent graphics capability, which I bolstered with use of SigmaPlot). My military background taught me not to expect access to expensive software in a government gig, and I had a number of friends and colleagues already using R and associated tools, so I installed it on my home computer. Although I was mildly intrigued by the software version naming convention, I stubbornly clung to SAS to finish my doctoral research.
How has using R shaped your practice?
Harrison: There is a lot of talk about how various tools perform in the sense of runtime, precision, graphics, etc. These are considerations, but they are completely eclipsed by the following: We don’t talk as a community about how much the tools we use shape our thinking. I frequently tell my colleagues that the fundamental unit in Excel is called a cell because it is your mind prison. There’s actually some truth to that. R is vectorized, so for most functions, passing an array gives an appropriate array output. When you work day-in and day-out with vectors, you stop thinking about individual operations and start to think in terms of sentences. The magrittr %>% operator, which takes the expression on the left as the first argument to the function on the right, makes this possible. Analysis begins to feel more like writing sentences – or even short poems – than writing computing code.
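The two ideas can be shown in a few lines. This is a minimal sketch that assumes the magrittr package is installed; the distances and speeds are invented for illustration and are not drawn from any real analysis.

```r
library(magrittr)  # provides the %>% pipe operator

# Hypothetical transit legs (invented numbers, for illustration only)
distances_nm <- c(120, 340, 75, 510)  # leg distances in nautical miles
speeds_kts   <- c(12, 17, 15, 17)     # planned speeds in knots

# Vectorized arithmetic: one expression divides every pair of elements,
# with no explicit loop and no cell-by-cell formula
transit_hours <- distances_nm / speeds_kts

# Nested calls read from the inside out
total_nested <- round(sum(transit_hours), 1)

# The pipe passes the left-hand side as the first argument of the
# function on the right, so the same steps read left to right
total_piped <- transit_hours %>% sum() %>% round(1)

# Both forms compute the same value
identical(total_nested, total_piped)
```

Read aloud, the piped version is close to a sentence: take the transit hours, sum them, then round to one decimal place.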
Early in my work with R, I was told by a colleague that “R might be good but the graphics are terrible”. This was a bit of a shock, as graphics has been one of the main selling points of the language, and I didn’t want to be making seedy graphs. From that point on, I made it a point to make the best graphics I possibly could, usually – but not always – using methods and extensions found in the ggplot2 package. It is no exaggeration to say that I spend roughly 20% of my analysis time picking colors and other aesthetics for plots. If you are willing to take the time, you can get the graphics to sing; there are color schemes based on The Simpsons and Futurama, and fonts based on xkcd comics.
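As a rough illustration of where that time goes, here is a small sketch that assumes ggplot2 is installed. It uses the mtcars dataset that ships with R; the hex colors, labels, and theme are arbitrary choices made for this example, not recommendations from either author.

```r
library(ggplot2)

# mtcars ships with R; cyl is converted to a factor so that it maps
# to a discrete color scale rather than a continuous gradient
p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 3) +
  # Hand-picked colors, one per cylinder count (arbitrary example palette)
  scale_color_manual(
    values = c("4" = "#1b9e77", "6" = "#d95f02", "8" = "#7570b3"),
    name   = "Cylinders"
  ) +
  labs(
    title = "Fuel economy versus weight",
    x     = "Weight (1,000 lb)",
    y     = "Miles per gallon"
  ) +
  theme_minimal(base_size = 12)

print(p)  # draws the plot on the active graphics device
```

Each aesthetic decision is a separate layer added with `+`, which is why swapping in a different palette or theme is a one-line change.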
Cara: When I began teaching myself R – and using it daily – I thought I was merely learning the syntax of a new programming language. With the analytic capability inherent in R and the flexibility of its development environments, however, it is really more of a way of thinking. Fold in the powerful (and mostly free!) resources and passionate following of analysts and data scientists, and you get an R community that I truly enjoy being a part of.
The R environment can have positive impacts on your workflow, even for a novice user. For example, beyond syntax, my earliest explorations in R taught me that if you are going to do something more than once, write a function. I had never truly internalized that idea, even after a decade of using SAS. Another thing I learned relatively early on – get to know the dplyr package, and use it! I had been coding in R for about 6 months before I was really introduced to functions like dplyr::mutate(); these are powerful functions that can save a ton of code. I’ve been analyzing data for over a decade and I’ve never come across a dataset that was already in the form I needed. Prior to using the dplyr package, however, I was spending a lot of time manipulating data using no functions and a lot of lines of code. Beyond time savings, dplyr helps you think about your data more creatively. As a very basic example, dplyr::summarise() is a more powerful option than mean() used alone, especially for multiple calculations in a single data table. And once you master the Wonder Twin-esque combination of using group_by() and summarise(), you’ll be amazed at what you can (quickly) reveal through exploratory analysis. Data wrangling is (and always will be) a fact of life. The more efficiently you manipulate data, however, the more time you have to spend on the seemingly more exciting aspects of any project.
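A short sketch ties these habits together, assuming dplyr (version 1.0.0 or later) is installed. The sortie table and the `burn_rate()` helper are hypothetical and invented for this example, and pairing `summarise()` with `group_by()` is our choice of illustration.

```r
library(dplyr)

# Hypothetical sortie records (invented numbers, for illustration only)
sorties <- tibble(
  squadron     = c("A", "A", "B", "B", "B"),
  flight_hours = c(2.0, 4.0, 1.5, 3.5, 4.0),
  fuel_lbs     = c(4000, 9000, 3000, 7000, 8000)
)

# "If you are going to do something more than once, write a function."
# Hypothetical helper: pounds of fuel burned per flight hour (vectorized)
burn_rate <- function(fuel, hours) fuel / hours

summary_tbl <- sorties %>%
  # mutate() adds a derived column without touching the original ones
  mutate(lbs_per_hour = burn_rate(fuel_lbs, flight_hours)) %>%
  # group_by() + summarise() computes several statistics per group at once
  group_by(squadron) %>%
  summarise(
    n_sorties  = n(),
    mean_hours = mean(flight_hours),
    max_burn   = max(lbs_per_hour),
    .groups    = "drop"  # return an ungrouped table
  )

print(summary_tbl)
```

A bare `mean(sorties$flight_hours)` would give one number for the whole table; the grouped summary returns a count, a mean, and a maximum for each squadron in a single pass.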
Disadvantages of R
Harrison: This piece is not a ‘sales pitch’ for R, but rather a sober consideration of the tradeoffs an organization needs to weigh when choosing an analytic platform writ large:
Compatibility and Editing. Because R is a computing language, graphics built in R are not editable by non-R users, as opposed to Excel graphs. This can be a challenge in the frequent case where the reviewers are not the same people that created the plots. If you made the plot, you are going to have to be the one who does the editing, unless there is another R user who understands your particular technique in the office.
No license cost does not mean that it’s free. I frequently like to say that I haven’t spent a dime on analytics software since I retired from the Navy; this is strictly true, but also misleading. I have spent considerable time learning the best practices in R over the past 4 years. An organization that is looking to make this choice needs to realize upfront that the savings in fees will be largely eaten up by the extra manpower needed to learn how to make it work. The reward for investing the time in increasing the ability of your people to code is twofold: first, it puts them in closer touch with the actual analysis, and second, it allows for bespoke applications.
Cara: I work in a pretty dynamic world as a government operations research analyst (ORSA); we don’t typically have dedicated statisticians, programmers, data scientists, modelers, or data viz specialists. Most of the time, we are all functioning in some or all of those capacities. As a former engineer with a dynamic background, this suits me well. However, it also means that things change from day to day, from project to project, and as the government analytic world changes (rapidly). I do not have the flexibility to use one software package exclusively. Further, I face challenges within DoD related to systems, software, classification, and computing infrastructure that most people in academia or industry do not. In my organization, there has been a relatively recent and rapid shift in the analytic environment. We formerly leaned heavily on Excel-based regressions and descriptive statistics, usually created by a single analyst to answer a single question, and in many cases these models were not particularly dynamic or scalable. We now focus on using open-source tools in a team construct, sometimes with industry partners, to create robust models that are designed to answer multiple questions from a variety of perspectives; scale easily to mirror operational requirements; fit with other models; and transition well to high performance computing environments.
The two open-source tools we (i.e., my division) currently use most for programming are R and Python. We have had success combining data analysis, statistics, and graphical models to create robust tools coded as RShiny apps. Recently, we chose to code in Python for a project that involved machine learning and high performance computing. I do not propose to debate the strengths and weaknesses of either R or Python in this forum; rather, I challenge you to consider carefully the implications of programming language choice for any project from a cradle-to-grave perspective.
Getting started with R can be daunting. We recommend the following references.
Stack Overflow. This invaluable resource is a bulletin board exchange of programming ideas and tips. The real skill required to use it effectively is knowing how to write an effective question. “I hate ggplot” or “My R code doesn’t work” are not useful; try “Could not subset closure” or “ggplot axis font size” instead.
Vignettes. Well-developed R packages have vignettes, which are very useful because they pair an explanation of the code with a worked example. Two very good references are the ggplot2 gallery and the dplyr vignette. Finally, the RViews blog is a great way to keep up-to-date with practice.
Books. Although I tend to acquire books with reckless abandon, the ones I actually keep and use have withstood careful consideration and have generally pegged the daily utility meter. Try R for Data Science by Wickham and Grolemund (O’Reilly Publishing 2017) and ggplot2: Elegant Graphics for Data Analysis by Wickham (Springer 2016); both are available as print copies and as electronic editions.
Podcasts. For those moments in your life when you need some data science-related enrichment, the producers of DataCamp host an excellent podcast called DataFramed. Fifty-nine episodes have been recorded so far; find them on SoundCloud, Spotify, YouTube, or VFR direct from the creator’s listening notes.
RStudio Cheatsheets. Sometimes you need densely constructed (read: compact yet surprisingly in-depth), straightforward references. RStudio creates (and updates) these two-pagers for the most popular and versatile R packages to be great portable references for programmers – think of them as a combined dictionary and thesaurus for learning R. Fun fact: they can be downloaded in multiple languages.
Forums. (1) Data Science Center of Education (DSCOE) is a CAC-enabled collaboration site that hosts data science tutorials developed by Army analysts, mostly using R, and supports a week-long R immersion course offered at the Center for Army Analysis (CAA) twice a year. The DSCOE forum is managed collaboratively by the CAA, U.S. Army Cyber Command (ARCYBER), Naval Postgraduate School (NPS), and the United States Military Academy (USMA). Contributions are both welcome and encouraged. (2) R-bloggers, created in 2015, is an R-centric forum designed to foster connection, collaboration, and resource sharing within the R community. The utility of this forum lies in its array of technical resources that will benefit both new and practiced users. (3) Data Science DC, for those in the NCR, was formed via the concatenation of numerous meetup groups – including RMeetup DC – and is a major proponent of a number of events, including hackathons and the DCR conference (held annually in the fall).