It’s an absolute myth that you can send an algorithm over raw data and have insights pop up… Data scientists … spend 50-80% of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets. But if the value comes from combining different data sets, so does the headache… Before a software algorithm can go looking for answers, the data must be cleaned up and converted into a unified form that the algorithm can understand. But practically, because of the diversity of data, you spend a lot of your time being a data janitor, before you can get to the cool, sexy things that got you into the field in the first place.

As data analysis experts, we justify our existence by (accurately) evangelizing that the bottleneck in the discovery process is usually not data generation but data analysis. I clarify that point further with my collaborators: data analysis is usually the easy part. If I give you properly formatted, tidy, and rigorously quality-controlled data, hitting the analysis “button” is usually much easier than the work that went into cleaning, QC’ing, and preparing the data in the first place. To that effect, I’d like to introduce you to a tool that recently made its way into my data analysis toolbox.
- filter() filters rows from the data frame by some criterion.
- arrange() arranges rows ascending or descending based on the value(s) of one or more columns.
- select() allows you to select one or more columns.
- mutate() allows you to add new columns to a data frame that are transformations of other columns.
- group_by() and summarize() are usually used together, and allow you to compute values grouped by some other variable, e.g., the mean, SD, and count of all the values of $y separately for each level of factor variable $group.

dplyr also provides the %>% operator.
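To make these verbs concrete, here’s a minimal sketch using the built-in mtcars dataset (my own example, not from the original post; it assumes only that dplyr is installed):

```r
library(dplyr)

# filter: keep only cars with at least 6 cylinders
filter(mtcars, cyl >= 6)

# arrange: sort rows by descending horsepower
arrange(mtcars, desc(hp))

# select: keep only the mpg and cyl columns
select(mtcars, mpg, cyl)

# mutate: add a new column derived from an existing one
mutate(mtcars, kml = mpg * 0.425)

# group_by + summarize: mean mpg separately for each number of cylinders
summarize(group_by(mtcars, cyl), mean_mpg = mean(mpg))
```

Each verb takes a data frame as its first argument and returns a data frame, which is what makes them so easy to chain together.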
The %>% operator: “then”

The hflights dataset has information about more than 200,000 flights that departed Houston in 2011. Let’s say we want to do the following:
- Take the hflights dataset,
- group_by the Year, Month, and Day,
- select out only the Day, the arrival delay, and the departure delay variables,
- summarize to calculate the mean of the arrival and departure delays, and
- filter the resulting dataset where the arrival delay or the departure delay is more than 30 minutes.

filter(
  summarise(
    select(
      group_by(hflights, Year, Month, DayofMonth),
      Year:DayofMonth, ArrDelay, DepDelay
    ),
    arr = mean(ArrDelay, na.rm = TRUE),
    dep = mean(DepDelay, na.rm = TRUE)
  ),
  arr > 30 | dep > 30
)
Notice that the order in which we write the code in this example is inside-out. We describe our problem as: use hflights, then group_by, then select, then summarize, then filter. But traditionally in R we write the code inside-out by nesting functions 4-deep:

filter(summarize(select(group_by(hflights, ...), ...), ...), ...)
To fix this, dplyr provides the %>% operator (pronounced “then”). x %>% f(y) turns into f(x, y), so you can use it to rewrite multiple operations such that they read left-to-right, top-to-bottom:
hflights %>%
  group_by(Year, Month, DayofMonth) %>%
  select(Year:DayofMonth, ArrDelay, DepDelay) %>%
  summarise(
    arr = mean(ArrDelay, na.rm = TRUE),
    dep = mean(DepDelay, na.rm = TRUE)
  ) %>%
  filter(arr > 30 | dep > 30)
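To convince yourself that the nested and piped forms are equivalent, you can run both on a small made-up data frame (a sketch of my own; the toy data below merely stand in for hflights, and I drop the select step for brevity):

```r
library(dplyr)

# A tiny stand-in for hflights: two days, two flights each
toy <- data.frame(
  Year = 2011, Month = 1, DayofMonth = rep(1:2, each = 2),
  ArrDelay = c(45, 50, 5, 10), DepDelay = c(40, 35, 0, 5)
)

# Nested, inside-out version
nested <- filter(
  summarise(
    group_by(toy, Year, Month, DayofMonth),
    arr = mean(ArrDelay, na.rm = TRUE),
    dep = mean(DepDelay, na.rm = TRUE)
  ),
  arr > 30 | dep > 30
)

# Piped, left-to-right version
piped <- toy %>%
  group_by(Year, Month, DayofMonth) %>%
  summarise(
    arr = mean(ArrDelay, na.rm = TRUE),
    dep = mean(DepDelay, na.rm = TRUE)
  ) %>%
  filter(arr > 30 | dep > 30)

all.equal(as.data.frame(nested), as.data.frame(piped))  # TRUE
```

Only day 1 survives the filter (mean arrival delay 47.5 minutes); day 2’s delays are too small. Both forms return exactly the same result; the pipe changes only how the code reads.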
Writing the code this way actually follows the order in which we think about the problem (use hflights, then group_by, then select, then summarize, then filter it).
You aren’t limited to using %>% with dplyr functions. You can use it with anything. E.g., instead of head(iris, 10), we could write iris %>% head(10) to get the first ten lines of the built-in iris dataset. Furthermore, since the input to ggplot is always a data.frame, we can munge around a dataset and then pipe the whole thing into a plotting function. Here’s a simple example where we take the iris dataset, then group it by Species, then summarize it by calculating the mean of the Sepal.Length, then use ggplot2 to make a simple bar plot.
library(dplyr)
library(ggplot2)

iris %>%
  group_by(Species) %>%
  summarize(meanSepLength = mean(Sepal.Length)) %>%
  ggplot(aes(Species, meanSepLength)) +
  geom_bar(stat = "identity")
Once you start using %>% you’ll wonder why this isn’t a core part of the R language itself rather than add-on functionality provided by a package. It will fundamentally change the way you write R code, making it feel more natural and your code more readable. There’s a lot more dplyr can do with databases that I didn’t even mention, and if you’re interested, you should see the other vignettes on the CRAN package page.
As a side note, I’ve linked to it several times here, but you should really check out Hadley’s Tidy Data paper and the tidyr package, vignette, and blog post.
dplyr package: http://cran.r-project.org/web/packages/dplyr/index.html
dplyr vignette: http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html
dplyr on SO: http://stackoverflow.com/questions/tagged/dplyr
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
by Robert A. Muenchen
Here is my latest update to The Popularity of Data Analysis Software. To save you the trouble of reading all 25 pages of that article, the new section is below. The two most interesting nuggets it contains are:
If you’d like to be alerted to future updates on this topic, you can follow me on Twitter, @BobMuenchen.
Scholarly Articles
The more popular a software package is, the more likely it will appear in scholarly publications as a topic and as a method of analysis. The software that is used in scholarly articles is what the next generation of analysts will graduate knowing, so it’s a good leading indicator of where things are headed. Google Scholar offers a way to measure such activity. However, no search of this magnitude is perfect; it will include some irrelevant articles and reject some relevant ones. The details of the search terms I used are complex enough to move to a companion article, How to Search For Analytics Articles. Since Google regularly improves its search algorithm, I re-collect the data for all years following the protocol described at http://librestats.com/2012/04/12/statistical-software-popularity-on-google-scholar/.
Figure 2a shows the number of articles found for each software package for all the years that Google Scholar can search. SPSS is by far the most dominant package, likely due to its balance between power and ease-of-use. SAS has around half as many, followed by MATLAB and R. Note that the general-purpose software MATLAB, Java and Python are included only when found in combination with analytics terms, so view those as much rougher counts than the rest. Neither C nor C++ is included here because it’s very difficult to focus the search, compared with the search for jobs above, where job descriptions commonly include a clear target of skills in “C/C++” and “C or C++”.
From RapidMiner on down, the counts appear to be zero. That’s not the case, but relative to the others, it might as well be.
Figure 2b shows the number of articles for the most popular six classic statistics packages from 1995 through 2013 (the last complete year of data at the time this graph was made). As in Figure 2a, SPSS has a clear lead, but you can see that its dominance peaked in 2007 and its use is now in sharp decline. SAS never came close to SPSS’ level of dominance, and it peaked in 2008.
Since SAS and SPSS dominate the vertical space in Figure 2a by such a wide margin, I removed those two packages and added the next two most popular statistics packages, Systat and JMP, in Figure 2c. Freeing up so much space in the plot now allows us to see that the use of R is experiencing very rapid growth and is pulling away from the pack, solidifying its position in third place. In fact, extending the downward trend of SPSS and the upward trend of R makes it likely that sometime during the summer of 2014 R became the most dominant package for analytics used in scholarly publications. Due to the lag caused by the publication process, getting articles online, indexing them, etc., we won’t be able to verify that this has happened until well into 2015.
After R, Statistica is in fourth place and growing, but at a much lower rate. Note that in the plots from previous years, Statistica was displayed as a flat line at the very bottom of the graph. That turned out to be a search-related artifact. Many academics who use Statistica don’t mention the package by software name but rather say something like, “we used the statistics package by Statsoft.”
Extrapolating from the trend lines, it is likely that the use of Stata among academics passed that of Statistica fairly early in 2014. The remaining three packages, Minitab, Systat and JMP are all growing but at a much lower rate than either R or Stata.
Courtesy of CRANberries, there is also a diffstat report for the most recent release. As always, more detailed information is on the RcppArmadillo page. Questions, comments etc. should go to the rcpp-devel mailing list off the R-Forge page.

Changes in RcppArmadillo version 0.4.400.0 (2014-08-19)
- Upgraded to Armadillo release Version 4.400 (Winter Shark Alley)
  - added gmm_diag class for statistical modelling using Gaussian Mixture Models; includes multi-threaded implementation of k-means and Expectation-Maximisation for parameter estimation
  - added clamp() for clamping values to be between lower and upper limits
  - expanded batch insertion constructors for sparse matrices to add values at repeated locations
  - faster handling of subvectors by dot()
  - faster handling of aliasing by submatrix views
- Corrected a bug (found by the g++ Address Sanitizer) in sparse matrix initialization where space for a sentinel was allocated, but the sentinel was not set; with extra thanks to Ryan Curtin for help
- Added a few unit tests for sparse matrices
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.
Tinder is a popular matchmaking application that allows users to connect with others with whom they share a physical attraction. New members build their profile by importing their age, gender, geographic information, and photos from their Facebook account. Users are then presented with profiles that meet their search criteria and are able to like or dislike them. Unlike traditional online dating sites, members can only communicate with those individuals with whom they share a common affinity (you liked them and they liked you).
Tinder offers an interesting case study for statisticians and data scientists who want to understand how human beings interact on mobile dating applications. Given that large-scale data collection is near impossible without a large team of interns, I decided to collect data on the profiles that were presented to me over a one-week period. My goal was to extract information on users and their profiles in order to determine whether certain people were more likely to like my Tinder profile. After a couple of days, I realized that receiving likes on Tinder was a difficult proposition, and I was forced to adjust the data in order to have a robust occurrence rate. Using Naive Bayes, I attempted to glean any insights from the data I collected.
> head(dat)
  Hair_Color  Race Text Pictures Age Miles_Away Shared_Interest Overweight Liked_You
1      Black White    Y        5  23      Close               0          N         N
2     Blonde White    N        4  23      Close               1          N         N
3      Black Other    Y        4  28      Close               4          N         N
4     Blonde White    Y        5  23      Close               0          N         N
5     Blonde White    N        4  21      Close               1          N         N
6   Brunette White    Y        6  23      Close               0          N         N
...
Part 1 of this series is simply focused on providing a high level overview of the problem and what I found. In part 2, I’ll offer a review of Naive Bayes classification and provide a worked out example.
library(klaR)  # provides NaiveBayes()

train.ind <- sample(1:nrow(dat), ceiling(nrow(dat) * 2/3), replace = FALSE)
nb.res <- NaiveBayes(Liked_You ~ Hair_Color + Text + Pictures + Age + Miles_Away,
                     data = dat[train.ind, ])
nb.pred <- predict(nb.res, dat[-train.ind, ])
accuracy <- table(nb.pred$class, dat[-train.ind, "Liked_You"])
sum(diag(accuracy)) / sum(accuracy)
mod <- glm(Liked_You ~ Hair_Color + Text + Pictures + Age + Miles_Away + Shared_Interest,
           data = dat[train.ind, ], family = binomial(link = "logit"))

library(effects)
plot(effect("Pictures", mod), rescale.axis = FALSE)
plot(effect("Miles_Away", mod), rescale.axis = FALSE)
plot(effect("Hair_Color", mod), rescale.axis = FALSE)
fit <- fitted(mod)
accuracy <- table(fit > .5, dat[train.ind, "Liked_You"])
sum(diag(accuracy)) / sum(accuracy)
I have uploaded a few papers I have written and presented at some national conferences over the past several years. Currently, all the articles relate to election research.
The jsonlite package is a JSON parser/generator optimized for the web. It implements a bidirectional mapping between JSON data and the most important R data types. This is very powerful for interacting with web APIs, or to build pipelines where data seamlessly flows in and out of R through JSON without any manual serializing, parsing or data munging.
The jsonlite package is one of the pillars of the OpenCPU system, which provides an interoperable API to interact with R over HTTP+JSON. However since its release, jsonlite has been adopted by many other projects as well, mostly to grab JSON data from REST APIs in R.
Version 0.9.10 includes two new vignettes to get you up and running with JSON and R in a few minutes.
These vignettes show how to get started analyzing data from Twitter, NY Times, Github, NYC CitiBike, ProPublica, Sunlight Foundation and much more, with 2 or 3 lines of R code.
There are also a few other improvements, most notably support for parsing escaped JSON unicode sequences, which could be important if you are from a country with a non-Latin alphabet.
This is the 10th CRAN version of jsonlite, and we are getting very close to a 1.0 release. By now the package does what it should do, has been tested by many users and all outstanding issues have been addressed. The mapping between JSON data and R classes is described in detail in the jsonlite paper, and unit tests are available to validate that implementations behave as prescribed for all data and edge cases. Once the version bumps to 1.0, we plan to switch gears and start focussing more on optimizing performance.
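As a quick taste of that bidirectional mapping, here’s a minimal sketch (my own example) that round-trips a data frame through JSON using jsonlite’s toJSON() and fromJSON():

```r
library(jsonlite)

df <- data.frame(name = c("alice", "bob"),
                 score = c(9.5, 7.2),
                 stringsAsFactors = FALSE)

# R data frame -> JSON: by default, a data frame maps to an
# array of objects, one object per row
json <- toJSON(df)

# JSON -> R: fromJSON() simplifies the array of objects back
# into an equivalent data frame
df2 <- fromJSON(json)
all.equal(df, df2)  # TRUE
```

It is exactly this kind of lossless, predictable round-trip that lets data flow in and out of R through JSON without manual serializing or munging.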