ddply in action

March 7, 2013

(This article was first published on Decisions and R, and kindly contributed to R-bloggers)

Top Batting Averages Over Time

reference:
http://www.baseball-databank.org/

Short
I'm going to use the plyr and ggplot2 packages to look at how top batting averages have changed over time.

First load the data:

options(width = 100)
library(ggplot2)
## Warning message: package 'ggplot2' was built under R version 2.14.2
library(plyr)

data(baseball)
head(baseball)
##            id year stint team lg  g  ab  r  h X2b X3b hr rbi sb cs bb so ibb hbp sh sf gidp
## 4   ansonca01 1871     1  RC1    25 120 29 39  11   3  0  16  6  2  2  1  NA  NA NA NA   NA
## 44  forceda01 1871     1  WS3    32 162 45 45   9   4  0  29  8  0  4  0  NA  NA NA NA   NA
## 68  mathebo01 1871     1  FW1    19  89 15 24   3   1  0  10  2  1  2  0  NA  NA NA NA   NA
## 99  startjo01 1871     1  NY2    33 161 35 58   5   1  1  34  4  2  3  0  NA  NA NA NA   NA
## 102 suttoez01 1871     1  CL1    29 128 35 45   3   7  3  23  3  1  1  0  NA  NA NA NA   NA
## 106 whitede01 1871     1  CL1    29 146 40 47   6   5  1  21  2  2  4  1  NA  NA NA NA   NA

It looks like we've loaded the data successfully.

Next, we'll add something close to batting average: hits divided by at-bats, computed for each row (one player, year, and stint):

baseball$ba = baseball$h/baseball$ab
head(baseball)
##            id year stint team lg  g  ab  r  h X2b X3b hr rbi sb cs bb so ibb hbp sh sf gidp     ba
## 4   ansonca01 1871     1  RC1    25 120 29 39  11   3  0  16  6  2  2  1  NA  NA NA NA   NA 0.3250
## 44  forceda01 1871     1  WS3    32 162 45 45   9   4  0  29  8  0  4  0  NA  NA NA NA   NA 0.2778
## 68  mathebo01 1871     1  FW1    19  89 15 24   3   1  0  10  2  1  2  0  NA  NA NA NA   NA 0.2697
## 99  startjo01 1871     1  NY2    33 161 35 58   5   1  1  34  4  2  3  0  NA  NA NA NA   NA 0.3602
## 102 suttoez01 1871     1  CL1    29 128 35 45   3   7  3  23  3  1  1  0  NA  NA NA NA   NA 0.3516
## 106 whitede01 1871     1  CL1    29 146 40 47   6   5  1  21  2  2  4  1  NA  NA NA NA   NA 0.3219
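
A side note on this ratio: in R, dividing zero hits by zero at-bats gives NaN rather than an error. The sketch below uses made-up numbers (not rows from the baseball data, and whether the dataset contains such rows is not shown here) to illustrate how a filter on at-bats and na.rm = TRUE keep such values out of a maximum.

```r
# Made-up hit and at-bat counts, including one row with zero at-bats
ab <- c(120, 0, 250)
h  <- c(39, 0, 80)

# Element-wise division; the middle element is 0/0, which R reports as NaN
ba <- h / ab
nan_flags <- is.nan(ba)

# Keep only entries with more than 100 at-bats, then take the maximum.
# na.rm = TRUE additionally drops any NA or NaN that survives the filter.
top <- max(ba[ab > 100], na.rm = TRUE)
top
```

Here the filter alone already removes the NaN entry; na.rm = TRUE is a second safeguard for rows where the at-bat count itself is missing.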

Finally, we can use the plyr package to look at how batting averages have changed over time. We'll only consider rows with more than 100 at-bats (the code below uses ab > 100).

Note: ddply essentially splits the dataset into groups based on the year variable, and then applies the same function to each of the subsets (here, summarise, which computes the topBA column). With the calculation performed on each subset, ddply then collects all of the output into a new data frame.


BA.dat = ddply(baseball, .(year), summarise, topBA = max(ba[ab > 100], na.rm = TRUE))
head(BA.dat, 10)
##    year  topBA
## 1  1871 0.3602
## 2  1872 0.4147
## 3  1873 0.3976
## 4  1874 0.3359
## 5  1875 0.3666
## 6  1876 0.3560
## 7  1877 0.3872
## 8  1878 0.3580
## 9  1879 0.3570
## 10 1880 0.3602
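
To make the split, apply, combine idea concrete, here is a rough base R sketch of the same steps on a small made-up data frame. The object names (toy, pieces, summaries, toy_top) and all values are hypothetical, and this is only an illustration of the pattern, not how ddply is implemented internally.

```r
# Toy data with made-up values (not taken from the baseball dataset)
toy <- data.frame(
  year = c(2001, 2001, 2001, 2002, 2002),
  ab   = c(150, 90, 300, 120, 400),
  h    = c(45, 40, 99, 30, 140)
)
toy$ba <- toy$h / toy$ab

# 1. Split: a list with one data frame per year
pieces <- split(toy, toy$year)

# 2. Apply: compute the same one-row summary on each piece
summaries <- lapply(pieces, function(d) {
  data.frame(
    year  = d$year[1],
    topBA = max(d$ba[d$ab > 100], na.rm = TRUE)
  )
})

# 3. Combine: stack the one-row summaries into a single data frame
toy_top <- do.call(rbind, summaries)
rownames(toy_top) <- NULL
toy_top
```

In the 2001 group, the row with 90 at-bats has the highest ratio but is excluded by the ab > 100 filter, so the reported maximum comes from the remaining rows.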

Now, we're ready to use ggplot2 to visually examine the data:

p = ggplot(BA.dat, aes(x = year, y = topBA)) + geom_point()
p

[Figure: scatterplot of topBA against year]

While this is only an eyeball judgment at this point, the plot suggests a downward trend over time.
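
One possible way to go a step beyond eyeballing, sketched here as an assumption rather than as part of the original analysis, is to fit a straight line and look at the sign of its slope. The helper name trend_slope and the demo data are hypothetical; the demo values are made up so that the expected slope is known in advance.

```r
# Hypothetical helper: slope of a least-squares line of topBA on year.
# Expects a data frame with numeric columns named year and topBA.
trend_slope <- function(dat) {
  fit <- lm(topBA ~ year, data = dat)
  unname(coef(fit)["year"])
}

# Made-up data lying exactly on a line that falls by 0.01 per year
demo <- data.frame(year = 1:5, topBA = 0.40 - 0.01 * (1:5))
demo_slope <- trend_slope(demo)
demo_slope
```

Calling trend_slope(BA.dat) on the data frame built above would give the corresponding slope for the real data; a negative value would be consistent with the downward trend seen in the plot. A straight line is a crude summary, so this is a sanity check rather than a formal test.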
