
Taking a Subset of a Data Frame in R

(This article was first published on R – Quality and Innovation, and kindly contributed to R-bloggers)

I just wrote a new chapter for my students describing how to subset a data frame in R. The full text is available at https://docs.google.com/document/d/1K5U11-IKRkxNmitu_lS71Z6uLTQW_fp6QNbOMMwA5J8/edit?usp=sharing but here’s a preview:

Let’s load ChickWeight, one of R’s built-in datasets. It contains the weights of chicks recorded at 12 different times as they grow. The chicks are on different diets, numbered 1, 2, 3, and 4. Using the str command, we find that this data frame has 578 observations of 4 variables, two of which are categorical: Chick and Diet.


> data(ChickWeight)
> head(ChickWeight)
  weight Time Chick Diet
1     42    0     1    1
2     51    2     1    1
3     59    4     1    1
4     64    6     1    1
5     76    8     1    1
6     93   10     1    1
> str(ChickWeight)
Classes ‘nfnGroupedData’, ‘nfGroupedData’, ‘groupedData’ and 'data.frame':      578 obs. of  4 variables:
 $ weight: num  42 51 59 64 76 93 106 125 149 171 ...
 $ Time  : num  0 2 4 6 8 10 12 14 16 18 ...
 $ Chick : Ord.factor w/ 50 levels "18"<"16"<"15"<..: 15 15 15 15 15 15 15 15 15 15 ...
 $ Diet  : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "formula")=Class 'formula' length 3 weight ~ Time | Chick
  .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
 - attr(*, "outer")=Class 'formula' length 2 ~Diet
  .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
 - attr(*, "labels")=List of 2
  ..$ x: chr "Time"
  ..$ y: chr "Body weight"
 - attr(*, "units")=List of 2
  ..$ x: chr "(days)"
  ..$ y: chr "(gm)"
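
The counts quoted above can also be checked directly instead of being read off the str output. A minimal sketch, using only base R functions on the same dataset:

```r
data(ChickWeight)

nrow(ChickWeight)               # number of observations (rows)
nlevels(ChickWeight$Chick)      # number of distinct chicks
levels(ChickWeight$Diet)        # the diet labels
sort(unique(ChickWeight$Time))  # the distinct measurement times
```

Counting the values in the last result gives the number of different times at which weights were recorded.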

Get One Column: Now that we have a data frame named ChickWeight loaded into R, we can take subsets of these 578 observations. First, let’s assume we just want to pull out the column of weights. There are two ways we can do this: specifying the column by name, or specifying the column by its order of appearance. The general form for pulling information from data frames is data.frame[rows,columns] so you can get the first column in either of these two ways:


ChickWeight[,1]            # get all rows, but only the first column
ChickWeight[,c("weight")]  # get all rows, and only the column named "weight"
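
One detail worth knowing here: when a single column is selected this way, R by default returns a plain vector rather than a data frame. The sketch below checks that both forms give the same result, and shows the standard drop = FALSE argument, which keeps a one-column data frame instead. The variable names (by_position, by_name, as_df) are only illustrative.

```r
data(ChickWeight)

by_position <- ChickWeight[, 1]         # first column, selected by position
by_name     <- ChickWeight[, "weight"]  # same column, selected by name

identical(by_position, by_name)  # do both forms return the same thing?
is.numeric(by_position)          # the result is a vector, not a data frame
length(by_position)              # one value per row

# drop = FALSE keeps the result as a data frame with one column
as_df <- ChickWeight[, "weight", drop = FALSE]
is.data.frame(as_df)
```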

Get Multiple Columns: If you want more than one column, you can specify the column numbers or the names of the variables that you want to extract. If you want to get the weight and diet columns, you would do this:


ChickWeight[,c(1,4)]              # get all rows, but only 1st and 4th columns
ChickWeight[,c("weight","Diet")]  # get all rows, only "weight" & "Diet" columns
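
To confirm that the two forms are interchangeable, the results can be compared. This is a small sketch with illustrative variable names. Selecting by name is usually the safer habit, since it keeps working if the column order ever changes.

```r
data(ChickWeight)

by_position <- ChickWeight[, c(1, 4)]
by_name     <- ChickWeight[, c("weight", "Diet")]

names(by_position)               # column names of the subset
dim(by_position)                 # rows and columns of the subset
identical(by_position, by_name)  # do both forms return the same data frame?
```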

If you want more than one column and those columns are next to each other, you can do this:


ChickWeight[,1:3]   # get all rows, but only columns 1 through 3

Get One Row: You can get the first row similarly to how you got the first column, and any other row the same way:


ChickWeight[1,]   		# get first row, and all columns
ChickWeight[82,]   		# get 82nd row, and all columns
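
Unlike a single column, a single row comes back as a data frame, because its columns can hold different types. The sketch below checks this, and also illustrates why the comma matters: with no comma, the index is treated as a column selection. The variable names are only illustrative.

```r
data(ChickWeight)

first_row <- ChickWeight[1, ]  # comma present: first row, all columns
is.data.frame(first_row)       # a one-row data frame
dim(first_row)                 # 1 row, 4 columns
first_row$weight               # the weight stored in that row

no_comma <- ChickWeight[1]     # comma omitted: first COLUMN, all rows
names(no_comma)
dim(no_comma)
```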

Get Multiple Rows: If you want more than one row, you can specify the row numbers you want like this:


> ChickWeight[c(1:6,15,18,27),] 
   weight Time Chick Diet      
1      42    0     1    1   
2      51    2     1    1 
3      59    4     1    1    
4      64    6     1    1    
5      76    8     1    1 
6      93   10     1    1    
15     58    4     2    1    
18    103   10     2    1 
27     55    4     3    1    
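
Notice in the output above that the rows keep their original row labels (15, 18, 27) instead of being renumbered from 1 to 9. A short sketch, with an illustrative variable name, that stores the subset and inspects it:

```r
data(ChickWeight)

picked <- ChickWeight[c(1:6, 15, 18, 27), ]

nrow(picked)      # how many rows were selected
rownames(picked)  # the original row labels are kept
picked$weight     # weights for the selected rows, in the order requested
```

Rows are returned in the order given in the index vector, so rearranging the numbers inside c() rearranges the rows of the result.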
