# sab-R-metrics: Merging Data Sets

June 22, 2011
By

(This article was first published on The Prince of Slides, and kindly contributed to R-bloggers)

I am finally back from Greece and recovered from jet lag. Fortunately, I did not get tear gassed while in Athens, though there were riot police everywhere the whole time we visited. Today, I'm going to start getting my feet wet again with a shorter sab-R-metrics post to assure everyone I'm not too MIA.

Often times we have lots of data in different files that we want to link together. If you have the information in an SQL database, there are ways to match things up using R. However, I am no database management wizard and prefer to be able to look at my data in a full table format. Unfortunately, this causes problems when I want to make sure to have player names linked to the player ids in my Pitch F/X data. The issue is that the F/X data may have multiple instances or rows with the same player, while the player information file only has player ids and player names once (one per row). Doing this manually can take forever (sometimes almost literally), and we need a quick way to import player names to the correct rows. Pitch F/X tools like Joe Lefkowitz's already do this for you; however, if you have your own F/X database--or any other data with player ids that you would like to merge some data into--this tutorial should come in handy.

Luckily, R has a nice function, 'merge()', which allows for easy merging of files. While I used to use SPSS to do this, once I found the R version I'll never go back. The SPSS version is pretty handy, but extremely slow for large files and the software is outrageously expensive.

First, I want you to download a file of 5,000 pitches here. Once you have it in the correct place, load it into R and take a look at it.

#set working directory
setwd("c:/Users/Millsy/Documents/My Dropbox/Blog Stuff/sab-R-metrics")

As you can see, there are no player names in this file. While you could go through and add them in manually--say in Excel or something like that--this would take way too long. To get an idea of the number of names to be imported just for this small pitch file, use the following code:

##give an idea of hte amount of work that manually merging would take
length(pitches[,1])
length(unique(pitches$batter_id)) length(unique(pitches$pitcher_id))

The first line of code above tells us the number of rows in the data set--or the length of the first column in the data. This comes in handy to make sure R loaded the number of rows you expected to see. The second line of code again uses the 'length()' function, but adds a new function we have not seen yet: 'unique()'. What this does is tells us how many different/unique batter ids there are in the data set. The third line of code does the same for pitcher ids. You can also use the 'unique()' function on its own, and R will print each of the player ids within the data file (you could also assign this list or vector as an object using the assignment operator '<-'). Unique will come in handy when we get into more advanced "for loops" later on.

As you can see, there are 286 unique batter ids and 113 unique pitcher ids. In addition, there are many repeats, as there are 5,000 observations in the data file. Doing this manually would take forever. Luckily, I have a file with the player ids, the player names, player height and weight, player birth dates, and the first year played in pro ball, MLB, and the last year played in MLB. We'll use R to easily merge this into our pitch file so that we can have player names and account for height and age of the player in our analyses using the pitch data.

First, go ahead and download the file with player names and some other information here. Stick that into the same directory as the previous file and load it into R. As always, take a look at the file to make sure it loaded correctly:

Before doing any merging, we'll have to adjust some things with this file. For the 'merge()' function to work, you have to choose a variable that is contained in BOTH data sets to merge on. For our purposes, we'll use the id of the player. Unfortunately, the name of the variable is different in each file. This is an easy fix. While we're at it, it is probably a good idea to discriminate between the batter and pitcher names and information in the file, since both will be displayed in each row. So first thing is first...let's rename the variables. For this, we'll use another new function, 'colnames()'. The following code should rename everything the way we want, and we'll start by merging the new data for batters. Be sure not to omit the names of any columns or you will get an error:

##rename columns for batters
colnames(players) <- c("batter_id", "b_first", "b_last", "b_height", "b_weight", "b_birth_year", "b_pro_played_first", "b_mlb_played_first", "b_mlb_played_last")

Always check to be sure things went correctly. There is actually an option to do this automatically in the 'merge()' function as the command "suffix=". On data sets with a large number of columns, this can save you time. But I found this to be a good time to introduce the "colnames()" function.

Now we have two files with a similar variable to match on. It's time to use the 'merge()' function. The merge function asks first for a 'x' data set (the first one), and then a 'y' data set (the second one). It is important to remember what order you place them in the function, as you will also need to tell R that you want to keep all of the original pitches in this new merged data. To save space in R--once I know things are working right--I simply reassign the merged data set as the original name 'pitches'.

To ensure that R makes a data set using all the pitches in the file, we want to use the option "all.x=T" or "all.y=T". This will tell R that the players data are just a table being used for the pitch data, while we keep all the pitch data in tact in the new merged table. Finally, we need to tell R which variable to match on using by="batter_id". Be sure to put the variable name in quotes. The following code should do this for us:

##do merge for batters
pitches <- merge(pitches, players, by="batter_id", all.x=T)

Notice that it puts the "batter_id" variable in the first row of this new data set. That's okay, and you can always restructure your data if this bothers you. Now let's do the same for the pitchers in the pitch data. Don't forget to rename the variables in your player information table so that they don't overwrite the batter information, and also so that it matches on pitcher id, rather than batter id:

##rename columns for pitchers
colnames(players) <- c("pitcher_id", "p_first", "p_last", "p_height", "p_weight", "p_birth_year", "p_pro_played_first", "p_mlb_played_first", "p_mlb_played_last")

##do merge for pitchers
pitches <- merge(pitches, players, by="pitcher_id", all.x=T)

Now, looking at the data, my first row has the 69 inch, 180 pound Dustin Pedroia against a lanky 72 inch, 160 pound Miguel Bautista. For this pitch, Pedroia gets a hit. You can even double check that the players are correct by looking at the "ab_des" column, which gives a full description of what happened in the at bat. Sure enough, it says, "Dustin Pedroia singles on a line drive to left fielder Ryan Langerhans. J. Drew to 2nd.". Things seemed to have gone well here. Now, you can save the new file so you don't have to worry about merging again with the following code:

##write new table
write.csv(pitches, file="mergedpitches.csv", row.names=F)

Hopefully this will help out some of those looking to merge data together. There is much of this needed with the different data sets (pitch f/x, Retrosheet, Baseball Reference, etc.) around the web. You'll need a full mapping of all player ids. I got mine from the Universal ID Project, here is a link at The Book Blog for last year's version (I can't find the most recent link).

In the end, R's functionality here is better than any other program that I have come across. You always need to double check the data to make sure there aren't any bugs. This is especially true with even larger data. Ultimately, this can make life in R and baseball analytics about a million times easier--just be careful. There are a few things I didn't go over here (like having it automatically sort when merging), so you can always check out how to use the function yourself with the R command "help(merge)". Hope this helps!

Pretty R Code:
############################# ################Sidetrack for Merging of Data Tables #############################   #set working directory setwd("c:/Users/Millsy/Documents/My Dropbox/Blog Stuff/sab-R-metrics")   ##load pitch file pitches <- read.csv(file="PitchesMerging.csv", h=T) head(pitches)   ##give an idea of hte amount of work that manually merging would take length(pitches[,1]) length(unique(pitches$batter_id)) length(unique(pitches$pitcher_id))   ##load player information file players <- read.csv(file="detailedplayers.csv", h=T) head(players)   ##rename columns for batters colnames(players) <- c("batter_id", "b_first", "b_last", "b_height", "b_weight", "b_birth_year", "b_pro_played_first",  "b_mlb_played_first", "b_mlb_played_last") head(players)   ##do merge for batters pitches <- merge(pitches, players, by="batter_id", all.x=T) head(pitches)   ##rename columns for pitchers colnames(players) <- c("pitcher_id", "p_first", "p_last", "p_height", "p_weight", "p_birth_year", "p_pro_played_first",  "p_mlb_played_first", "p_mlb_played_last") head(players)   ##do merge for pitchers pitches <- merge(pitches, players, by="pitcher_id", all.x=T) head(pitches)     ##write new table write.csv(file="mergedpitches.csv", row.names=F)