My Data Science Tool Box

April 3, 2018

This article was first published on Ista Zahn (Posts about R), and kindly contributed to R-bloggers.

This post describes the tools I currently use for working with data.
People often ask me to recommend specific tools, and I always
hesitate, because so much boils down to personal preference. I
recently added a workshop to the DSS lineup providing an overview of
popular tools for working with data. The core idea is that researchers
have a lot of choices available when it comes to choosing tools to
implement a reproducible workflow. For example, it doesn’t really
matter whether you choose to learn R or Python; the important thing is
that you write and document code of some kind so that your analysis
can be reproduced. Similarly, it doesn’t matter much whether you
choose to use RStudio or Jupyter notebooks; the important thing is
that you have a development and authoring environment that encourages
good research practices. Still, inquiring minds want to know, what do
you use?

The short answer is as follows:

Operating system: Arch Linux
Programming language: R and Python
Editor / IDE: Emacs
Markup language: Org Mode and LaTeX
Revision control system: git
Shell: fish

Those curious to know why I prefer these tools and how I’ve
customized them to suit my needs and preferences can read on.

I use the Arch Linux operating system

Operating systems are often excluded from “toolkit” discussions,
presumably because most popular tools are cross-platform and abstract
away a lot of the differences across operating systems. Nevertheless,
operating systems are not all created equal, and in my opinion
Linux-based operating systems are currently the best option for
working with data.

Linux is similar in many ways to OS X (they are both UNIX-like), but
without the annoying restrictions and limitations. Linux gives you a
package manager and the freedom to install whatever tools you need,
and to configure them however you like. All this freedom can make it
easier to shoot yourself in the foot, but I find that preferable to
being restricted by technical limitations (Windows) or greedy
corporate policy (Apple OS X).

It is worth noting that the world of Linux is diverse and varied, and
that there can be significant differences among different Linux
distributions. One frequent issue is that many Linux distributions are
very conservative about releasing software updates. This might make
sense for embedded devices or servers (though I have my doubts) but it
makes absolutely no sense on a laptop or personal machine. Generally
speaking, I want to run the latest stable release of all the programs
installed on my computers. Arch Linux is one of the few Linux
distributions that makes it easy to keep your applications up-to-date.
If you are new to Linux you may wish to start with Manjaro Linux, a
pre-configured Arch Linux derivative.

I use the Emacs text editor

I use the Emacs text editor because I haven’t yet found anything that
I like better. It has a lot of legacy baggage that makes it feel alien
and intimidating at first, but I got used to these quirks after
using it for a week or two. The defining feature of Emacs is that it
can be customized and extended without limit. It is both a text editor
and a toolkit for building your own editing environment. A large and
robust ecosystem of community-developed packages makes it easy to use
Emacs not only as a text editor, but also as a Git and GitHub
front-end, an email client, and a Stack Overflow browser (to give but a
few examples).

Emacs customization

I customize Emacs extensively to improve the user interface, provide
better support for specific programming and markup languages, and to
provide front-ends for reading email and managing Git repositories.

Highlights of my Emacs configuration include

  • improved support for running R, Python, or other programming
    languages inside Emacs,
  • support for LaTeX and other markup languages,
  • support for literate programming using org-mode or R markdown,
  • consistent and familiar code evaluation using CTRL-RETURN,
  • consistent and familiar indentation and code completion using the
    TAB key,
  • powerful and simple search-based tools for finding commands, files
    and buffers, inserting citations, etc.,
  • more standard select/copy/paste keys and right-click behavior, which
    make Emacs more familiar to newcomers,
  • more powerful and convenient window management.

If you are interested in giving Emacs a try take a look at the
instructions and report any problems you may encounter.

I write documents using the Org Mode, Markdown, and LaTeX markup languages

Most of the documents I produce these days are technical or training
materials that include lots of example code. I often use
literate programming techniques to keep the examples and the output
produced by those examples together in a single document. The markup
language I use for this purpose depends on the complexity of the
document.
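
To make the idea concrete, here is a toy sketch in Python of what a
literate programming tool does: it runs each code chunk and places the
captured output right after the code in the finished document. This is
only an illustration, not how Org mode or R markdown are implemented;
the names `run_and_capture` and `weave` and the chunk format are
made up for this example.

```python
import contextlib
import io

FENCE = "`" * 3  # Markdown code fence, built here to avoid nesting literal fences


def run_and_capture(source):
    """Execute a code snippet and return everything it printed."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(source, {})  # each snippet runs in its own fresh namespace
    return buffer.getvalue()


def weave(chunks):
    """Interleave prose, code, and captured output into one Markdown string.

    `chunks` is a list of ("text", str) or ("code", str) pairs
    (a hypothetical format chosen for this sketch).
    """
    parts = []
    for kind, body in chunks:
        if kind == "text":
            parts.append(body)
        elif kind == "code":
            output = run_and_capture(body)
            parts.append(FENCE + "python\n" + body + "\n" + FENCE)
            if output:
                parts.append(FENCE + "\n" + output.rstrip("\n") + "\n" + FENCE)
        else:
            raise ValueError("unknown chunk kind: " + repr(kind))
    return "\n\n".join(parts)


doc = weave([
    ("text", "The mean of 1, 2, 3 is computed below."),
    ("code", "print(sum([1, 2, 3]) / 3)"),
])
print(doc)
```

Because the output is regenerated every time the document is woven, the
examples and their results cannot drift apart, which is the main reason
I like this approach for training materials.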

Markdown is the simplest and most ubiquitous of the markup languages
I use. I often use it for simple documents, or those on which
non-Emacs users are collaborating. Org mode is a more powerful
markup language for which adequate support is available only in Emacs.
I use it for many things, including most of my workshop notes. Finally,
LaTeX is the most powerful, complex, and verbose of the markup
languages I use. It is useful when you need more control over the
appearance of the resulting document.

Markdown customization

I write Markdown using markdown-mode in Emacs. To “typeset” the
documents for printing or posting online I use pandoc to convert
Markdown to .pdf files (via LaTeX), .html, or .ipynb (Jupyter
notebook) format. I have written a couple of scripts to make this
process easier in specific cases, e.g., these scripts convert Markdown
documents to Jupyter notebooks, and this one converts Markdown to
.html using a custom template.
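
For readers who want a starting point, a minimal sketch of such a
helper in Python might look like the following. It is not one of the
scripts mentioned above; the function names `pandoc_command` and
`convert` and the file names are hypothetical. It assumes pandoc is
installed, that pandoc infers the output format from the output file
extension, and that the installed pandoc version is recent enough to
write .ipynb files.

```python
import subprocess
from pathlib import Path

# Output formats covered by this sketch, mapped to file extensions.
EXTENSIONS = {"html": ".html", "pdf": ".pdf", "ipynb": ".ipynb"}


def pandoc_command(source, target_format, template=None):
    """Build the pandoc argument list for converting `source`.

    The output file sits next to the input, with the extension swapped.
    """
    if target_format not in EXTENSIONS:
        raise ValueError("unsupported format: " + repr(target_format))
    src = Path(source)
    output = src.with_suffix(EXTENSIONS[target_format])
    cmd = ["pandoc", str(src), "--standalone", "-o", str(output)]
    if template is not None:
        # Optional custom template (a hypothetical file in this sketch).
        cmd.append("--template=" + str(template))
    return cmd


def convert(source, target_format, template=None):
    """Run pandoc; raises CalledProcessError if the conversion fails."""
    subprocess.run(pandoc_command(source, target_format, template), check=True)


# Show the command without running it, so this works even without pandoc.
print(" ".join(pandoc_command("notes.md", "html", template="custom.html")))
```

Calling `convert("notes.md", "pdf")` would then produce notes.pdf via
LaTeX, provided a LaTeX installation is available.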

Org mode customization

Since many of the documents I prepare in org-mode include code
examples, I’ve configured org-mode support for bash, R, Python and
Matlab. I also use a custom template to export R workshop notes for
publication on https://tutorials.iq.harvard.edu.

LaTeX customization

Although I rarely write directly in LaTeX these days (Markdown and Org
mode are much simpler and I strongly prefer them) LaTeX remains
important as a backend. For example, I may prepare notes for a
presentation in Markdown and then export to LaTeX in order to typeset
the slides using beamer.

I have developed a custom beamer theme using IQSS colors that might be
useful to you, especially if you are an IQSS affiliate. Highlights
include

  • a modern font that includes math symbols,
  • simple and clean layout (e.g., only section name and page numbers in
    the footer),
  • plenty of IQSS orange!

If you have any questions or difficulties using this theme please open
an issue.

I use the R and Python programming languages

I use R and Python for working with data because they provide a good
balance of flexibility and convenience. I prefer them to statistics
packages like SPSS or SAS because both R and Python are full-fledged
programming languages that give me the power and flexibility I need to
address unusual or complicated tasks. At the same time, their
substantial standard libraries and huge package repositories make it
easy to accomplish standard or common data management and analysis
tasks.

Notably, I do not choose these languages because of their core language
design; in both cases I don’t particularly like the languages. It
really is the ecosystem of packages that keeps me using these tools
instead of running off to a shiny new thing like Julia or Go, or even
wandering off to an old but more interesting environment like
Haskell.

R customization

My customization of R is mostly limited to the installation of
packages. My .Rprofile just sets a default CRAN repository and prints
an amusing quote.

I find the following R packages especially useful:

ggplot2: robust graphics package
lavaan: structural equation models
lme4: mixed effects models in R
Amelia, mitools, mice: multiple imputation
purrr: consistent and clean functional programming tools
stringi: powerful text manipulation tools
xml2, jsonlite: powerful tools for manipulating and converting XML and JSON data
httr: a web client written in R

Other than installing these packages I mostly use the default R configuration.

Python customization

As with R, I don’t really customize Python much. I mostly use Python
from Emacs, sometimes using Org mode with Python code blocks for
literate programming.

I use the fish shell in the Terminology emulator

It has been my experience that kids these days don’t really like
shells. Honestly I don’t blame them. Shell technology has been stuck
in the 80’s for far too long. That situation is starting to change,
but bash (a shell from the 80’s!) is still by far the most commonly
used. I think it is time for a change, and fish is the most
well-developed at the moment. It doesn’t make using the command line
fun exactly, but it feels a lot less like being forced to
time-travel 40 years into the past.

I use the Terminology emulator because unlike other terminal emulators
it can display images and video directly in your terminal. This also
helps avoid the forced-time-travel feeling commonly induced by using
bash in a typical terminal emulator.

Fish customization

One of the nice things about the fish shell is that it has all the
bells and whistles turned on by default. Very little configuration is
needed to have a pleasant environment. My fish configuration is
limited to a handful of convenience functions (AKA aliases) e.g., to
update my system or ssh to a particular computer.

I use Git and GitHub for revision control

I don’t particularly like git, but since everyone uses it I don’t feel
like I have much of a choice. If it were up to me I would use
something simpler like mercurial but it’s not so I use git. It is much
more complicated and frustrating than it needs to be, but it doesn’t
suck once you get the hang of it.

Git customization

I mostly use git from a terminal, but I often launch graphical tools
from the command line, e.g., gitg for viewing history and meld for
viewing and merging diffs.

Commonalities and alternatives

If you’ve read this far you must be really interested in tools for
working with data! While I hope it was interesting to read about my
choices, I encourage you to try out some alternatives and pick a set
of tools that works well for you.

In reflecting on my own tool choices I notice that customization and
community activity are key values for me. For example, I like R not so
much because of the design of the language, but because it is flexible
and has an active community building and sharing tools in the form of
R packages. Similarly, I value Emacs because it is easy to configure
and because there is an active community developing Emacs packages. I
value these tools not because of their design per se, but because
they are actually platforms that their user and developer
communities have built tools on top of. The downside of this
preference for power and flexibility is that these tools are often
complex. Some people prefer simpler tools, e.g., Stata instead of
R or Jupyter Notebooks instead of Emacs with Org mode, and that
is perfectly reasonable.

My data science tools workshop notes describe some alternative tools
and are a good place to start if you’re not sure what you should use
for a particular task.
