Unveiling the Magic of dcast Function in R’s data.table Package

[This article was first published on Steve's Data Tips and Tricks, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Here is a draft blog post on using the dcast function from the data.table package in R:

Introduction

The dcast function in the data.table package is a handy tool for reshaping data from long to wide format in R. It allows you to easily aggregate values and pivot dataframes. In this post, we’ll explore how to use dcast through several examples.

What is dcast?

The dcast stands for “data cast” and it works similarly to reshape2’s dcast function. The key difference is that dcast from data.table is much faster as it works directly on data.table objects and leverages fast data.table syntax.

To use dcast, you need to have the data.table package installed:

library(data.table)

The main arguments for dcast are:

  • data: the data.table or data.frame to reshape
  • formula: specifies columns to use, in the form value.var ~ key1 + key2
  • fun.aggregate: the aggregation function to apply (default is length)

Reshaping from Long to Wide

Let’s walk through an example with the mtcars dataset. First we convert mtcars to a data.table:

dt <- as.data.table(mtcars)

Say we want to reshape the data from long to wide, aggregating the hp values by cyl. We can use dcast:

dcast(dt, cyl ~ ., value.var="hp", fun.aggregate=mean)
Key: <cyl>
     cyl         .
   <num>     <num>
1:     4  82.63636
2:     6 122.28571
3:     8 209.21429

This aggregates the hp by cyl, casting the other columns as identifiers. The result is a table with one row per cyl, and columns for mean hp and all other variables.

Aggregating Multiple Columns

You can also aggregate multiple value columns in one call. Let’s add aggregating disp by the mean:

dcast(dt, cyl ~ ., value.var=c("hp", "disp"), fun.aggregate=mean)
Key: <cyl>
     cyl        hp     disp
   <num>     <num>    <num>
1:     4  82.63636 105.1364
2:     6 122.28571 183.3143
3:     8 209.21429 353.1000

Now we have mean hp and mean disp aggregated by cyl in the wide format.

Using Multiple Formulas

Another common operation is aggregating over several formulas separately. For example, aggregating hp by cyl and gear.

We can pass a list of formulas to dcast:

dcast(dt, cyl ~ ., value.var="hp", fun.aggregate=mean) + 
  dcast(dt, gear ~ ., value.var="hp", fun.aggregate=mean)
     cyl        .
   <num>    <num>
1:     7 258.7697
2:    10 211.7857
3:    13 404.8143

This outputs two sets of aggregations, by cyl and gear, in a single wide table.

Reshaping from Wide to Long

The melt function from data.table can reshape from wide to long format. For example:

melt(dt, id.vars = "cyl", measure.vars = c("hp", "disp"))
      cyl variable value
    <num>   <fctr> <num>
 1:     6       hp 110.0
 2:     6       hp 110.0
 3:     4       hp  93.0
 4:     6       hp 110.0
 5:     8       hp 175.0
 6:     6       hp 105.0
 7:     8       hp 245.0
 8:     4       hp  62.0
 9:     4       hp  95.0
10:     6       hp 123.0
11:     6       hp 123.0
12:     8       hp 180.0
13:     8       hp 180.0
14:     8       hp 180.0
15:     8       hp 205.0
16:     8       hp 215.0
17:     8       hp 230.0
18:     4       hp  66.0
19:     4       hp  52.0
20:     4       hp  65.0
21:     4       hp  97.0
22:     8       hp 150.0
23:     8       hp 150.0
24:     8       hp 245.0
25:     8       hp 175.0
26:     4       hp  66.0
27:     4       hp  91.0
28:     4       hp 113.0
29:     8       hp 264.0
30:     6       hp 175.0
31:     8       hp 335.0
32:     4       hp 109.0
33:     6     disp 160.0
34:     6     disp 160.0
35:     4     disp 108.0
36:     6     disp 258.0
37:     8     disp 360.0
38:     6     disp 225.0
39:     8     disp 360.0
40:     4     disp 146.7
41:     4     disp 140.8
42:     6     disp 167.6
43:     6     disp 167.6
44:     8     disp 275.8
45:     8     disp 275.8
46:     8     disp 275.8
47:     8     disp 472.0
48:     8     disp 460.0
49:     8     disp 440.0
50:     4     disp  78.7
51:     4     disp  75.7
52:     4     disp  71.1
53:     4     disp 120.1
54:     8     disp 318.0
55:     8     disp 304.0
56:     8     disp 350.0
57:     8     disp 400.0
58:     4     disp  79.0
59:     4     disp 120.3
60:     4     disp  95.1
61:     8     disp 351.0
62:     6     disp 145.0
63:     8     disp 301.0
64:     4     disp 121.0
      cyl variable value

This melts the data to long form based on the id and measure columns.

Additional Tips

  • Use fun.aggregate=length to get counts per group
  • Set fill=NA to output NA for combinations without data instead of 0
  • Use variable.name to set custom column names

Wrapping Up

The dcast function provides a fast way to reshape data and aggregate values in R. It’s perfect for pivoting dataframes and getting data ready for analysis and visualization. The data.table syntax helps make the reshape very fast and efficient. Give it a try on your datasets! Let me know in the comments if you have any other dcast examples to share.

To leave a comment for the author, please follow the link and comment on their blog: Steve's Data Tips and Tricks.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)