It seems dplyr is overtaking correlation heatmaps

March 8, 2017
By

(This article was first published on R – On unicorns and genes, and kindly contributed to R-bloggers)

(… on my blog, that is.)

For a long time, my correlation heatmap with ggplot2 was the most viewed post on this blog. It still leads the overall top list, but by far the most searched and visited post nowadays is this one about dplyr (followed by it’s sibling about plyr).

I fully support this, since data wrangling and reorganization logically comes before plotting (especially in the ggplot2 philosophy).

But it’s also kind of a shame, because it’s not a very good dplyr post, and the one about the correlation heatmap is not a very good ggplot2 post. Thankfully, there is a new edition of the ggplot2 book by Hadley Wickham, and a new book by him and Garrett Grolemund about data analysis with modern R packages. I’m looking forward to reading them.

Personally, I still haven’t made the switch from plyr and reshape2 to dplyr and tidyr. But here is the updated tidyverse-using version of how to quickly calculate summary statistics from a data frame:

library(tidyr)
library(dplyr)
library(magrittr)

data <- data.frame(sex = c(rep(1, 1000), rep(2, 1000)),
                   treatment = rep(c(1, 2), 1000),
                   response1 = rnorm(2000, 0, 1),
                   response2 = rnorm(2000, 0, 1))

gather(data, response1, response2, value = "value", key = "variable") %>%
  group_by(sex, treatment, variable) %>%
  summarise(mean = mean(value), sd = sd(value))

Row by row we:

1-3: Load the packages.

5-8: Simulate some nonsense data.

10: Transform the simulated dataset to long form. This means that the two variables response1 and response2 get collected to one column, which will be called ”value”. The column ”key” will indicate which variable each row belongs to. (gather is tidyr’s version of melt.)

11: Group the resulting dataframe by sex, treatment and variable. (This is like the second argument to d*ply.)

12: Calculate the summary statistics.

Source: local data frame [8 x 5]
Groups: sex, treatment [?]

    sex treatment  variable        mean        sd
  (dbl)     (dbl)     (chr)       (dbl)     (dbl)
1     1         1 response1 -0.02806896 1.0400225
2     1         1 response2 -0.01822188 1.0350210
3     1         2 response1  0.06307962 1.0222481
4     1         2 response2 -0.01388931 0.9407992
5     2         1 response1 -0.06748091 0.9843697
6     2         1 response2  0.01269587 1.0189592
7     2         2 response1 -0.01399262 0.9696955
8     2         2 response2  0.10413442 0.9417059

Postat i:computer stuff, data analysis, med mera Tagged: dplyr, ggplot2, R

To leave a comment for the author, please follow the link and comment on their blog: R – On unicorns and genes.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)