This post describes the tools I currently use for working with data.
People often ask me to recommend specific tools, and I always
hesitate, because so much boils down to personal preference. I
recently added a workshop to the DSS lineup providing an overview of
popular tools for working with data. The core idea is that researchers
have many options when choosing tools to implement a reproducible
workflow. For example, it doesn’t really
matter whether you choose to learn R or Python; the important thing is
that you write and document code of some kind so that your analysis
can be reproduced. Similarly, it doesn’t matter much whether you
choose to use RStudio or Jupyter notebooks; the important thing is
that you have a development and authoring environment that encourages
good research practices. Still, inquiring minds want to know, what do
I use?
The short answer is as follows:
- Operating system
  - Arch Linux
- Programming language
  - R and Python
- Editor / IDE
  - Emacs
- Markup language
  - Org Mode and LaTeX
- Revision control system
  - Git
Those curious to know why I prefer these tools and how I’ve
customized them to suit my needs and preferences can read on.
I use the Arch Linux operating system
Operating systems are often excluded from “toolkit” discussions,
presumably because most popular tools are cross-platform and abstract
away a lot of the differences across operating systems. Nevertheless,
operating systems are not all created equal, and in my opinion
Linux-based operating systems are currently the best option for
working with data.
Linux is similar in many ways to OS X (they are both UNIX-like), but
without the annoying restrictions and limitations. Linux gives you a
package manager and the freedom to install whatever tools you need,
and to configure them however you like. All this freedom can make it
easier to shoot yourself in the foot, but I find that preferable to
being restricted by technical limitations (Windows) or greedy
corporate policy (Apple OS X).
It is worth noting that the world of Linux is diverse, and that there
can be significant differences among Linux distributions. One frequent
issue is that many Linux distributions are
very conservative about releasing software updates. This might make
sense for embedded devices or servers (though I have my doubts) but it
makes absolutely no sense on a laptop or personal machine. Generally
speaking, I want to run the latest stable release of all the programs
installed on my computers. Arch Linux is one of the few Linux
distributions that makes it easy to keep your applications up-to-date.
If you are new to Linux you may wish to start with Manjaro Linux, a
pre-configured Arch Linux derivative.
I use the Emacs text editor
I use the Emacs text editor because I haven’t yet found anything that
I like better. It has a lot of legacy baggage that makes it feel alien
and intimidating at first, but I got used to these quirks after
using it for a week or two. The defining feature of Emacs is that it
can be customized and extended without limit. It is both a text editor
and a toolkit for building your own editing environment. A large and
robust ecosystem of community-developed packages makes it easy to use
Emacs not only as a text editor, but also as a Git and GitHub
front-end, an email client, and a Stack Overflow browser (to give but a
few examples).
I customize Emacs extensively to improve the user interface, provide
better support for specific programming and markup languages, and to
provide front-ends for reading email and managing Git repositories.
Highlights of my Emacs configuration include
- improved support for running R, Python, or other programming
- support for LaTeX and other markup languages,
- support for literate programming using org-mode or R markdown,
- consistent and familiar code evaluation using
- consistent and familiar indentation and code completion using the
- powerful and simple search-based tools for finding commands, files
and buffers, inserting citations etc.
- more standard select/copy/paste keys and right-click behavior, which
make Emacs more familiar to those new to it,
- more powerful and convenient window management.
I write documents using the Org Mode, Markdown, and LaTeX markup languages
Most of the documents I produce these days are technical or training
materials that include lots of example code. I often use
literate programming techniques to keep the examples and the output
produced by those examples together in a single document. The markup
language I use for this purpose depends on the complexity of the
Markdown is the simplest and most ubiquitous of the markup languages
I use. I often use it for simple documents, or those on which
non-Emacs users are collaborating. Org mode is a more powerful
markup language for which adequate support is available only in Emacs.
I use it for many things, including most of my workshop notes. Finally,
LaTeX is the most powerful, complex, and verbose of the markup
languages I use. It is useful when you need more control over the
appearance of the resulting document.
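As a rough illustration of the literate programming idea mentioned above (keeping examples and their output together in one document), here is a minimal sketch using Python’s standard `doctest` module. It is a stand-in, not how Org mode or R Markdown work internally; the `NOTES` text and the `check_examples` helper are invented for this example.

```python
import doctest

# Hypothetical "document": prose plus an example and its recorded output.
NOTES = """
The mean of a list is its sum divided by its length.

>>> values = [2, 4, 9]
>>> sum(values) / len(values)
5.0
"""


def check_examples(text, name="notes"):
    """Re-run every >>> example in `text` and compare with the recorded output.

    Returns a (failed, attempted) pair of counts.
    """
    parser = doctest.DocTestParser()
    # Arguments: text, globals for the examples, name, filename, line number.
    test = parser.get_doctest(text, {}, name, None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    results = runner.run(test)
    return results.failed, results.attempted


failed, attempted = check_examples(NOTES)
print(f"{attempted} examples run, {failed} failed")
```

If the recorded output ever drifts from what the code actually produces, the failure count becomes non-zero, which is the same guarantee literate programming tools aim for when they regenerate output from the source document.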
I write Markdown using markdown-mode in Emacs. To “typeset” the
documents for printing or posting online I use pandoc to convert
notebook) format. I have written a couple of scripts to make this
process easier in specific cases, e.g., these scripts convert Markdown
documents to Jupyter notebooks and this one converts Markdown to
.html using a custom template.
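The scripts themselves are not reproduced here. As a hedged sketch of what such a wrapper might look like, the following Python builds and runs a pandoc command line. The function names, the file names, and the output-naming rule are assumptions made for illustration; only the pandoc options (`--to`, `--standalone`, `--output`, `--template`) are real.

```python
import shutil
import subprocess
from pathlib import Path


def pandoc_command(source, to_format, template=None):
    """Build a pandoc command that converts `source` to `to_format`."""
    source = Path(source)
    # Assumed naming rule: keep the file stem, use the format as extension.
    output = source.with_suffix("." + to_format)
    command = [
        "pandoc",
        str(source),
        f"--to={to_format}",
        "--standalone",
        f"--output={output}",
    ]
    if template is not None:
        # Optional custom template, e.g. for .html output.
        command.append(f"--template={template}")
    return command


def convert(source, to_format, template=None):
    """Run the conversion; raises if pandoc is missing or the conversion fails."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc not found on PATH")
    subprocess.run(pandoc_command(source, to_format, template), check=True)


# Show the command without running it (file names are hypothetical).
print(" ".join(pandoc_command("notes.md", "html", template="page.html")))
```

Calling `convert("notes.md", "ipynb")` would request notebook output instead, assuming a pandoc version recent enough to include the `ipynb` writer.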
Org mode customization
Although I rarely write directly in LaTeX these days (Markdown and Org
mode are much simpler and I strongly prefer them), LaTeX remains
important as a backend. For example, I may prepare notes for a
presentation in Markdown and then export to LaTeX in order to typeset
the slides using beamer.
- a modern font that includes math symbols,
- simple and clean layout (e.g., only section name and page numbers in
- plenty of IQSS orange!
If you have any questions or difficulties using this theme please open
I use the R and Python programming languages
I use R and Python for working with data because they provide a good
balance of flexibility and convenience. I prefer them to statistics
packages like SPSS or SAS because both R and Python are full-fledged
programming languages that give me the power and flexibility I need to
address unusual or complicated tasks. At the same time, their
substantial standard libraries and huge package repositories make it
easy to accomplish standard or common data management and analysis
tasks.
Notably, I do not choose these languages because of the core language
design; in both cases I don’t particularly like the languages. It
really is the ecosystem of packages that keeps me using these tools
instead of running off to a shiny new thing like Julia or Go, or even
wandering off to an old but more interesting environment like
I find the following R packages especially useful:
- robust graphics package
- structural equation models
- mixed effects models in R
- Amelia, mitools, mice
- multiple imputation
- consistent and clean functional programming tools
- powerful text manipulation tools
- xml2, jsonlite
- powerful tools for manipulating and converting XML and JSON data
- a web client written in R
Other than installing these packages I mostly use the default R configuration.
I use the fish shell in the Terminology emulator
It has been my experience that kids these days don’t really like
shells. Honestly I don’t blame them. Shell technology has been stuck
in the 80’s for far too long. That situation is starting to change,
but bash (a shell from the 80’s!) is still by far the most commonly
used. I think it is time for a change, and fish is the most
well-developed alternative at the moment. It doesn’t make using the
command line
fun exactly, but it feels a lot less like being forced to
time-travel 40 years into the past.
I use the Terminology emulator because unlike other terminal emulators
it can display images and video directly in your terminal. This also
helps avoid the forced-time-travel feeling commonly induced by using
bash in a typical terminal emulator.
One of the nice things about the fish shell is that it has all the
bells and whistles turned on by default. Very little configuration is
needed to have a pleasant environment. My fish configuration is
limited to a handful of convenience functions (AKA aliases), e.g., to
update my system or ssh to a particular computer.
I use Git and GitHub for revision control
I don’t particularly like Git, but since everyone uses it I don’t feel
like I have much of a choice. If it were up to me I would use
something simpler like Mercurial, but it’s not, so I use Git. It is much
more complicated and frustrating than it needs to be, but it doesn’t
suck once you get the hang of it.
Commonalities and alternatives
If you’ve read this far you must be really interested in tools for
working with data! While I hope it was interesting to read about my
choices, I encourage you to try out some alternatives and pick a set
of tools that works well for you.
In reflecting on my own tool choices I notice that customization and
community activity are key values for me. For example, I like R not so
much because of the design of the language, but because it is flexible
and has an active community building and sharing tools in the form of
R packages. Similarly, I value Emacs because it is easy to configure
and because there is an active community developing Emacs packages. I
value these tools not because of their design per se, but because
they are actually platforms that their user and developer
communities have built tools on top of. The downside of this
preference for power and flexibility is that these tools are often
complex. Some people prefer simpler tools, e.g., Stata instead of
R or Jupyter Notebooks instead of Emacs with Org mode, and that
is perfectly reasonable.
My data science tools workshop notes describe some alternative tools
and are a good place to start if you’re not sure what you should use
for a particular task.