# Working with data frames

January 5, 2012
By

(This article was first published on pitchR/x, and kindly contributed to R-bloggers)

R, just like other programming languages, has different types of objects. Matrices, arrays, data.frames, lists, vectors, tables, etc. But by far the most important for working with baseball data is going to be dataframes.

I'm not sure of the level of experience of the people reading this, so the first part will be more beginner oriented. If that bores you, skip down to the advanced section.

In the example post (just scroll down), we looked at some Roy Halladay data. It is stored in a data frame. Just as a reminder, to get this data, do the following:

Go here: http://joelefkowitz.com/pitcher_card.php?pid=136880 and click "Download excel file." This is Roy Halladay's data from 2011. Open the file in excel, click "save as", and change the file extension to .csv so it looks like "halladay.csv". This will make it easier to import to R.

Go here, and download R. Then open R. To read in the file, we need to change our working directory. We can do this with the setwd() command. Mine looks like this:

setwd("C:/Users/Josh/baseball_stuff/PITCHRX")

Now read in the data. Type:

We know that this is an object of type data frame by typing in

class(pitcher)

Say I no longer want all of this data. For some reason, I only want the first 10 rows. How do we get a new data frame and store it in a new object?

A data frame, much like a matrix (two-dimensional array), has two dimensions - rows and columns. To index a data frame (or a matrix), we use brackets in R next to the object, like so:

pitcher[i, j]

Where "i" denotes the rows that you want and "j" is the columns. To get the first ten rows we could do this a few ways. Lets assign the numbers 1-10 to an object:

i = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

The command "c" means concatenate, and does just that -- concatenate and bunch of things together. You can use it on many data types, not just numeric data. But typing out all those numbers? Inefficient. We can do the same thing much faster with:

i = 1:10

Now index:

newdf = pitcher[i, ]

Since we want all columns, we just leave in a blank space next to the comma. And yes, you need the comma so R knows which dimensions you are indexing.

But we don't even need to assign anything to i . We can just do:

newdf  = pitcher[1:10, ]

But why would we ever do this? A more powerful method is to index a method based on a condition. Say we only want Roy Halladay's curveballs.

We can look at individual columns of a data frame with the "$" operator. Pitch type data is housed in the "pitch_type" column, which we can access with pitcher$pitch_type

To find which rows were curveballs, we can use the logical operator for equivalency "==" . This is used in many languages. Type:

pitcher$pitch_type == "CU" You actually just get a bunch of boolean values (true or false). But next to the values where it's true is a row index telling us which row these pitches can be located at. We can use these boolean vectors to index data frames. curveballs = pitcher[pitcher$pitch_type == "CU", ]

This will index the pitcher data frame such the new object, curveballs, will only have rows where the conditional is true. This is really, really powerful. You can use multiple conditionals to subset different combinations of rows and columns, such as:

df = pitcher[pitcher$pitch_type=="FT" & pitcher$start_speed > 91, c("pfx_x", "pfx_z", "start_speed")]

This would find all rows where the entire conditional is true (& means and), and only takes the data from those columns which we described with the "c" command. We also could have described which columns using numbers for their indices like so (start_speed is the 26th column, and so on):

df = pitcher[pitcher$pitch_type=="FT" & pitcher$start_speed > 91, c(26, 34, 35)]

You now have a new data frame with only the columns start_speed, pfx_x, pfx_z in it and only with two-seam pitches with an initial velocity greater than 91.

Another way is to use the subset command. You can use the command like this:

df = subset(pitcher, pitch_type == "CU")

And this would be equivalent to:

pitcher = pitcher[pitcher$pitch_type == 'CU', ] But that's pretty fancy isn't it? We didn't have to access the column in the subset command with the$ operator, we just named it. The evaluation of the subset command is controlled such that it occurs within the dataframe. But what we if we define an object in a global environment and try and use it in the subset function?

x = 'CU'

df = subset(pitcher, pitch_type == x)

What happens? Well, it actually works just like what we would want it to. But that means that the subset function will also work with global objects in conditionals. This means that subset has different scoping rules than the normal indexing operations we looked at later. The conditional is first evaluated within the context of the environment, which in this case is "pitcher". But it can still go back a generation to the global environment if it needs to. This is dynamic scoping, which is different than the lexical scoping that is normally used by the language.

We can basically replicate this function like so:

conditional = substitute(pitch_type == "CU")

df = pitcher[eval(conditional, pitcher), ]

The object conditional is here is of a type "call." Evaluating this call in the context of a dataframe gives us a bunch of true/false values - another boolean vector. In fact, this is equivalent to:

pitcher\$pitch_type=='CU'

So we can see how the subset command works, and it's pretty fancy, using some scoping and evaluation tricks -- it's pretty advanced stuff. The subset command is meant to save typing, but I'm not really sure it does unless you have a lot of conditionals. It also is a bit like voodoo magic if you don't how it works.