ddply in action

[This article was first published on Decisions and R, and kindly contributed to R-bloggers].

Top Batting Averages Over Time

reference:
http://www.baseball-databank.org/

Short
I'm going to use plyr and ggplot2 to look at how top batting averages have changed over time.

First load the data:

options(width = 100)
library(ggplot2)

## Warning message: package 'ggplot2' was built under R version 2.14.2

library(plyr)

data(baseball)
head(baseball)

##            id year stint team lg  g  ab  r  h X2b X3b hr rbi sb cs bb so ibb hbp sh sf gidp
## 4   ansonca01 1871     1  RC1    25 120 29 39  11   3  0  16  6  2  2  1  NA  NA NA NA   NA
## 44  forceda01 1871     1  WS3    32 162 45 45   9   4  0  29  8  0  4  0  NA  NA NA NA   NA
## 68  mathebo01 1871     1  FW1    19  89 15 24   3   1  0  10  2  1  2  0  NA  NA NA NA   NA
## 99  startjo01 1871     1  NY2    33 161 35 58   5   1  1  34  4  2  3  0  NA  NA NA NA   NA
## 102 suttoez01 1871     1  CL1    29 128 35 45   3   7  3  23  3  1  1  0  NA  NA NA NA   NA
## 106 whitede01 1871     1  CL1    29 146 40 47   6   5  1  21  2  2  4  1  NA  NA NA NA   NA

It looks like we've loaded the data successfully.

Next, we'll add a column that is close to a season batting average: hits divided by at-bats, computed for each row (the stint column shows that a player can have more than one row in a year):

baseball$ba = baseball$h/baseball$ab
head(baseball)

##            id year stint team lg  g  ab  r  h X2b X3b hr rbi sb cs bb so ibb hbp sh sf gidp     ba
## 4   ansonca01 1871     1  RC1    25 120 29 39  11   3  0  16  6  2  2  1  NA  NA NA NA   NA 0.3250
## 44  forceda01 1871     1  WS3    32 162 45 45   9   4  0  29  8  0  4  0  NA  NA NA NA   NA 0.2778
## 68  mathebo01 1871     1  FW1    19  89 15 24   3   1  0  10  2  1  2  0  NA  NA NA NA   NA 0.2697
## 99  startjo01 1871     1  NY2    33 161 35 58   5   1  1  34  4  2  3  0  NA  NA NA NA   NA 0.3602
## 102 suttoez01 1871     1  CL1    29 128 35 45   3   7  3  23  3  1  1  0  NA  NA NA NA   NA 0.3516
## 106 whitede01 1871     1  CL1    29 146 40 47   6   5  1  21  2  2  4  1  NA  NA NA NA   NA 0.3219
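One thing worth checking about this division is what happens when at-bats are zero or missing. The sketch below uses made-up counts (not rows from the baseball data) to show that R returns NaN for 0/0 and NA for missing inputs, and that na.rm = TRUE drops both when taking a maximum.

```r
# Hypothetical counts: row 2 has zero at-bats, row 3 is missing
h  <- c(39, 0, NA)
ab <- c(120, 0, NA)

ba <- h / ab      # 0/0 gives NaN, NA/NA gives NA
ba

is.na(ba)         # is.na() is TRUE for NaN as well as NA

# na.rm = TRUE removes both NaN and NA before taking the maximum
top <- max(ba, na.rm = TRUE)
top
```

If every value is dropped, max() returns -Inf with a warning, so a group with no usable rows would show up as -Inf rather than NA.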

Finally, we can use the plyr package to look at how batting averages have changed over time. We'll only consider rows with more than 100 at-bats (ab > 100).

Note: ddply essentially splits the dataset into groups based on the year variable, and then applies the same function to each of the subsets (here, summarise, which computes topBA as the highest ba among the rows with more than 100 at-bats). With the calculation performed on each of the subsets, ddply then collects all of the output into a new data frame.

BA.dat = ddply(baseball, .(year), summarise, topBA = max(ba[ab > 100], na.rm = TRUE))
head(BA.dat, 10)

##    year  topBA
## 1  1871 0.3602
## 2  1872 0.4147
## 3  1873 0.3976
## 4  1874 0.3359
## 5  1875 0.3666
## 6  1876 0.3560
## 7  1877 0.3872
## 8  1878 0.3580
## 9  1879 0.3570
## 10 1880 0.3602
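To make the split, apply, and combine steps from the note above concrete, here is a rough base R equivalent on a tiny made-up data frame. The toy values and the names toy, pieces, summaries, and manual are my own for illustration; this is not how ddply is implemented internally, just the same idea spelled out by hand.

```r
# Toy data (hypothetical values, not taken from the baseball dataset)
toy <- data.frame(
  year = c(1871, 1871, 1872, 1872, 1872),
  ab   = c(120, 90, 150, 200, 101),
  h    = c(39, 40, 45, 70, 20)
)
toy$ba <- toy$h / toy$ab

# Split: one data frame per year
pieces <- split(toy, toy$year)

# Apply: compute the same summary on each piece
summaries <- lapply(pieces, function(d) {
  data.frame(year  = d$year[1],
             topBA = max(d$ba[d$ab > 100], na.rm = TRUE))
})

# Combine: stack the per-year results into one data frame
manual <- do.call(rbind, summaries)
rownames(manual) <- NULL
manual
```

With plyr loaded, ddply(toy, .(year), summarise, topBA = max(ba[ab > 100], na.rm = TRUE)) should give the same two rows in a single call, which is the main convenience ddply offers.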

Now, we're ready to use ggplot2 to visually examine the data:

p = ggplot(BA.dat, aes(x = year, y = topBA)) + geom_point()
p

[Figure: scatter plot of topBA against year]

While it's only a visual judgment at this point, the plot suggests a downward trend in top batting averages over time.
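One simple way to go a step beyond eyeballing the plot would be to fit a straight line with lm and look at the sign of the slope. The sketch below is only an illustration of the mechanics: it uses the ten rows printed by head(BA.dat, 10) above, stored under the made-up name BA.head, so it says nothing about the full range of years. In practice you would pass the full BA.dat instead.

```r
# First ten rows of BA.dat, copied from the printed output above
BA.head <- data.frame(
  year  = 1871:1880,
  topBA = c(0.3602, 0.4147, 0.3976, 0.3359, 0.3666,
            0.3560, 0.3872, 0.3580, 0.3570, 0.3602)
)

# Linear trend of top batting average on year
fit <- lm(topBA ~ year, data = BA.head)

# A negative slope means topBA falls, on average, as year increases
slope <- unname(coef(fit)["year"])
slope
```

A straight line is a strong assumption for a series like this, so I'd treat the slope as a rough summary rather than a test of the trend.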
