sab-R-metrics: Introduction to R

January 5, 2011

(This article was first published on The Prince of Slides, and kindly contributed to R-bloggers)

In a recent post, I briefly mentioned that I may turn a majority of the focus of this blog to teaching R commands for use with sabermetric analysis. Only a few days later, Ricky Zanker began a new column at The Hardball Times doing just that. But that's okay. Hopefully both his and mine can complement one another. If there is anything I've learned while developing my skills in statistical programming, it is that everyone has something to offer when learning the material. It is up to you to choose the way you are most comfortable with.

For some great examples, I'd say the top tier analysis in R is done by Dave Allen, Albert Lyu, and Jeremy Greenhouse. Just Google them and you will find some fascinating posts. This won't just be Pitch F/X, though, and I'll try to include some basic regression and multivariate analysis that you may be interested in. There are so many things to be done with R, and I've only begun to scratch the surface in my own work online and in academia. As I explained in my "Introduction to the Prince of Slides", I wanted to provide more visualizations and present some multivariate techniques for data. Baseball statistics are one of the most fun places to do this.

R provides a number of ways to do similar things, so it is a great place to develop your own style of analysis. This is especially true with graphics in R. Because Ricky already has a post on Downloading R, I am going to send you his way to learn how to download it on your machine. It is completely FREE, unlike its counterparts SAS, STATA, SPSS, MATLAB, GAUSS, etc. Each of these programs has its own uses and, in my opinion, sometimes outperforms R (which is why I actually use Excel, STATA, SPSS, and GAUSS in different situations). However, for sabermetric analysis, multivariate analysis, and visualizations, R is certainly top notch. Now, on to the beginning of what I hope to be a useful series for beginner and--later on--more advanced baseball analysts.

I'll start today with loading your data into R. R interfaces fairly well with SQL databases. For SQL help, check out Mike Fast's example here. I am going to begin with a simpler approach for your data: importing a csv file that you made in Excel.

The first thing you'll need to do is create an Excel spreadsheet. Be sure to include variable names at the top of the data table so you know what is what when you import it into R. Some basics on naming your variables are below:

1) While there are many style guides out there, I suggest naming them something you are most comfortable with when you begin. Name it something that makes sense to you, but try to adhere to what Google says about these (in the style guide link).

2) Remember not to use spaces for your variable names. If you want to separate words, it is best to use "." like this: "at.bats".

(Addendum: Tango suggests using the "_" rather than a "." to separate things in the variable names. I in fact like the underscore method much better as well, but the Google style guide warns against it. I'm still not sure why (and I've never had a problem with it in R), but if I find out I'll be sure to post it.)

(Addendum 2: Peter at The Book Blog commented that underscores were used as assignment operators in previous versions of R (before 1.8). Therefore, underscores seem to be fine if you have no issues with limiting backward compatibility for earlier R versions. So I'd suggest using what you are comfortable with.)

3) R is case-sensitive, which means you need to remember that if you name something "Atbats", then "atbats" won't call it. To reduce keystrokes, I try my best to name my variables in ways that keep everything lowercase.

4) You cannot begin your variable name with a number. That means if you import data with a column for Doubles named "2B", it will show up in R as "X2B" automatically. Just avoid it altogether to begin with in order to avoid confusion.

5) Do your best not to give things names that are already used by R. I have made this mistake in the past (e.g. "ump", "data", "for" are all bad ideas). If you have a question as to whether your variable name is an R function, just type "help()" with your variable name in the parentheses (for reserved words such as "for", put the name in quotes inside the parentheses). If it is an R function, a help file will open up in your browser.
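The naming rules above can be checked from the R console before you import anything. The snippet below is only a rough sketch: the object names (`raw_names`, `fixed_names`, `atbats`) are made up for illustration, and it relies on base R's `make.names()`, which is what `read.csv` uses by default to repair column names.

```r
# Column names as they might appear in an Excel header row (illustrative)
raw_names <- c("2B", "at bats", "HR")

# make.names() applies the same repairs read.csv() makes by default:
# a leading digit gets an "X" prefix, and a space becomes a "."
fixed_names <- make.names(raw_names)
print(fixed_names)

# R is case-sensitive: "atbats" and "Atbats" are different names
atbats <- 500
lower_found <- exists("atbats")
upper_found <- exists("Atbats")

# Check whether a name is already taken by an R function
data_is_function <- exists("data", mode = "function")
```

Here `fixed_names` should contain "X2B", "at.bats", and "HR", which matches the behavior described in point 4.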

Okay, now we're ready to import. From here, I assume you can set up a table with rows and columns in Excel that makes sense. Remember, your variables (HR, RBI, H, etc.) should be columns, while your observations (Albert Pujols, Mark McGwire, Hank Aaron) should be the rows. Let's say you have an Excel file set up like this, but notice that I have not adjusted the variable names to interface with R at this point in the file (directly from Baseball Reference's Play Index):


NOTE: I've added a link to a version of the file above directly from Baseball Reference thanks to commenter, JAIME. Hopefully this will help. Just CLICK HERE for the data and you can copy and paste the comma separated data into Excel (to get CSV format on the Baseball Reference page, just click the red "CSV" text just above where the data is displayed).

Now, go to "File -> Save As". In that window, be sure to change the "Save As Type" to "CSV (Comma Delimited)". There are plenty of other options for reading data into R (including an Excel interface, Text files, etc.). I suggest checking them out. You can always Google something like "R to Excel interface" and find something thanks to the public nature of the R network. Finally, be sure to choose a location on your computer that you can find. That way, you know where to tell R to search for the file when you want to import it. I'll say I saved things to my desktop on my home computer.

Lastly, be sure to name your file something simple without spaces in the name. For my .csv file above with the Hall of Fame data in it, I have it named Hall of Fame Hitters as an Excel file. However, for ease of use in R (and to keep from confusing it), I try to simplify these longer names down. So, my file name here will be "hallhitters.csv". It is good to always have an Excel Workbook version of your csv files in case something goes wrong with the file translation (for example, commas in the Pitch F/X Gameday data--bad juju on the part of MLB--that I'll get to later on down the road).


R Time!

Now we're ready to do our R magic. Go ahead and open up R. If you've never bothered with any sort of programming before, I know the initial screen can look daunting. Nowhere to point and click. No directions. Just a blank screen that looks like you have to write nothing but programming code. I was not happy when my graduate statistics class did not allow me to use SPSS, but R really grew on me once I worked out the bugs. You will get frustrated. It will be very hard to pick up on typos in your code at first. But hang in there, and later on it will become easier and easier. I hope these tutorials will minimize frustration and encourage you to widen the landscape of analysis in baseball.

To begin, go up to "File -> Open Script" in the R program. This opens a new text document within R where you can edit your code. If you just type your code directly into the command line, then editing it is not easy. In addition, by typing your working R-code into the Script Editor, you can save it and always have a record of the working code. At the top of this document, type "#" (the number sign indicates that this is text commentary, rather than code) and then name it something. I'll call mine

#sab-R-metrics Introduction

Press enter twice (to leave a space...space is always a good thing for readability later on). Now, we want to set the working directory for R to find our data file and save subsequent work we do. Begin by commenting what the next line does with "#set working directory". From there, press enter and go to the next line. We use the command "setwd()" in order to tell R where to go. You'll need to know where your file is and the full directory name on your computer. If you are in Windows, the best way to figure that out is to go to your file, Right Click, and go to Properties. From there, it tells you the Location of the data. Use that for your reference and type it into your 'setwd' function as follows:

#set working directory
setwd("c:/Users/Millsy/Desktop")

Notice that the slashes are opposite of how Windows represents it in the 'Properties' window (they use "\"). Also, make sure to put quotes around your directory. Finally, put your cursor on the line with the 'setwd' function, and press "CTRL + R" once you think you've got everything typed in correctly. R will automatically run this line in the command window. If there are no typos, your working directory is set and we're ready to load in your data file.
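If you want to double check that the working directory was set the way you expected, something like the following may help. This is just a sketch: the path is the example one from above (yours will differ), and the variable names are made up for illustration.

```r
# The example folder from above; replace with your own location
forward <- "c:/Users/Millsy/Desktop"

# Windows-style backslashes also work, but each one must be doubled
# inside an R string. Both strings describe the same folder.
escaped <- "c:\\Users\\Millsy\\Desktop"
as_forward <- gsub("\\", "/", escaped, fixed = TRUE)
same_path <- identical(as_forward, forward)

# Only change directory if the folder really exists on this machine
if (dir.exists(forward)) {
  setwd(forward)
  # List the csv files R can now see without a full path
  print(list.files(pattern = "\\.csv$"))
}

# Show where R is currently looking for files
current_dir <- getwd()
print(current_dir)
```

If your csv file does not show up in the list, R is probably looking in a different folder than you think.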

Remember that I named my file above "hallhitters.csv". We'll need this information (OF COURSE) to load the file into R. For loading a .csv file into the program, we'll use the R function 'read.csv'. But, we also need to give it a name in our R environment. I'll just call it "hitters". When assigning a name in R, it is important again not to name it something that is already a function in the program (see above). I typed

help(hitters)

and nothing came up. Therefore, I'm probably safe using this name for my data. To assign a name to something, we use a little arrow pointing left. Really, it is just a "<" followed by a "-", or together, "<-". Remember to comment your line as well. Below is the code used to load in the .csv file with my comment as to what is going on:

#load hall of fame hitters data
hitters <- read.csv(file="hallhitters.csv", h=T)

As you can see above, I use the little arrow to assign the data to the name "hitters". Within the function 'read.csv' I ensure that R knows the file name (always in quotes after "file="). Also, be sure to always use the .csv file extension. Finally, the portion that says "h=T" tells R that my data has headers for the columns. Here, the 'h' is short for 'header' and 'T' is short for 'TRUE' (R also accepts the full spelling, "header=TRUE"). So, it is true that there are column headings in my data table. Now put the cursor on the line and press "CTRL + R" again.
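If you don't have the Hall of Fame file handy, here is a small self-contained sketch of the same import step. It writes a tiny stand-in file (the name "toyhitters.csv" and the object names are made up for this example) to R's temporary folder, then reads it back both with the shorthand used above and with the fully spelled-out argument.

```r
# Build a tiny stand-in csv in R's temporary folder (illustrative file name)
toy_file <- file.path(tempdir(), "toyhitters.csv")
writeLines(c("Player,HR",
             "Hank Aaron,755",
             "Babe Ruth,714",
             "Willie Mays,660"),
           toy_file)

# Shorthand, as used in this tutorial
toy_short <- read.csv(file = toy_file, h = T)

# The same call with the argument and value spelled out in full
toy_long <- read.csv(file = toy_file, header = TRUE)

# Both calls should produce the same data frame
same_result <- identical(toy_short, toy_long)
print(same_result)

# The first line of the file became the column names
print(names(toy_long))
```

The spelled-out version is a bit more typing, but it keeps working even if you later create an object named "T" in your session.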

Often, we like to look at our data file to make sure things look right. Unfortunately, we don't really have a spreadsheet viewing option in R like other programs. So, if we have very large data, we cannot just type "hitters" to look at the data. It will take up too much of the command window and be pretty messy. Therefore, knowing about the functions "head()" and "tail()" is always useful. This way, we can make sure that our data loaded in correctly. The "head" function allows you to view the first 6 rows of your data (and all columns) in the command window along with the variable names. "tail" does the same thing, but with the last 6 rows of the data. Let's try it below (remember, put it in your Script, then use "CTRL + R" to run it once the code is correct...if it doesn't work, check for typos, fix them in the Script editor, and try again):

head(hitters) #view first 6 rows to ensure correct data import

tail(hitters) #view last 6 rows to ensure correct data import

Notice that I can also comment the lines using the "#" next to the line of code. I like this less because lines of code are usually a bit longer and it ends up making the line too long to view in the text editor at once without scrolling. Remember to keep it simple!
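A few related checks can be handy alongside "head" and "tail". The sketch below uses "mtcars", a small data set that ships with R, purely as a stand-in so that it runs without the Hall of Fame file; you would swap in "hitters" in your own script. The object names are made up for illustration.

```r
# mtcars is a built-in R data set, used here only as a stand-in for hitters

# head() and tail() take an optional n if 6 rows is too many or too few
first_three <- head(mtcars, n = 3)
last_three <- tail(mtcars, n = 3)
print(first_three)
print(last_three)

# Number of rows and columns, to compare against your spreadsheet
size <- dim(mtcars)
print(size)

# Just the variable names
print(names(mtcars))

# A compact summary of each column and its type
str(mtcars)
```

Comparing the row and column counts against the original spreadsheet is a quick way to catch an import that went wrong.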

Okay, we've got the data in. We're itching to do some analysis. However, we need to know some more things about our data before diving right in. That will be saved for next time. In my next post, I'll talk about some basic data calls, basic functions, and VERY basic ways to work with vectors in R.

Go ahead and save your Script. To do this, make sure that you have the Script selected (just put the cursor in it). Then go to "File -> Save" and save it where you think would be best. Name it something informative. While it saves it as a ".r" file, this text can also be opened using any text editor. Why does that matter? Well, with the ever changing landscape of computing, file types get discarded a lot (remember the mess with Excel 2003 to 2007 when it first came out?). Text files (and .csv files) will likely always be around. Since you have your .csv and your .r or .txt files, hopefully you will be able to open them and run the code again 20 years from now!

Lastly, close your Script once it is saved (be SURE that you in fact had that window selected when you chose Save). Next, close R. R will ask you if you want to "Save Your Workspace". Always say "NO"! If you save your workspace in R, then it will save things you have named, etc. You don't necessarily want this and it can interfere with things you do the next time you open the program. This is why it is good practice to have your working code in the Script editor and save it there. You can always just highlight it, press "CTRL + R" and be right back where you left off last time.

At the end of this first tutorial, your script editor should read something like this (depending on your directory and file name, and not including the color coding, provided here by the Pretty R Tool):

#sab-R-metrics Introduction

#set working directory
setwd("c:/Users/Millsy/Desktop")

#load hall of fame hitters data
hitters <- read.csv(file="hallhitters.csv", h=T)

head(hitters) #view first 6 rows to ensure correct data import
tail(hitters) #view last 6 rows to ensure correct data import

