Some everyday data tasks: a few hints with R

[This article was first published on R on The broken bridge between biologists and statisticians, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

We all work with data frames and it is important that we know how we can reshape them, as necessary to meet our needs. I think that there are, at least, four routine tasks that we need to be able to accomplish:

  1. subsetting
  2. sorting
  3. casting
  4. melting

Obviously, there is a wide array of possibilities; I’ll just mention a few, which I regularly use.

Subsetting the data

Subsetting means selecting the records (rows) or the variables (columns) which satisfy certain criteria. Let’s take the ‘students.csv’ dataset, which is available on one of my repositories. It is a database of student’s marks in a series of exams for different subjects.

students <- read.csv("https://www.casaonofri.it/_datasets/students.csv", header=T)
head(students)
##   Id  Subject       Date Mark Year HighSchool
## 1  1 AGRONOMY 10/06/2002   30 2001   CLASSICO
## 2  2 AGRONOMY 08/07/2002   24 2001    AGRARIO
## 3  3 AGRONOMY 24/06/2002   30 2001    AGRARIO
## 4  4 AGRONOMY 24/06/2002   26 2001   CLASSICO
## 5  5 AGRONOMY 23/01/2003   30 2001   CLASSICO
## 6  6 AGRONOMY 09/09/2002   28 2001    AGRARIO

Let’s say that we want a new dataset, which contains only the students with marks higher than 28.

subData <- subset(students, Mark >= 28)
head(subData)
##    Id  Subject       Date Mark Year  HighSchool
## 1   1 AGRONOMY 10/06/2002   30 2001    CLASSICO
## 3   3 AGRONOMY 24/06/2002   30 2001     AGRARIO
## 5   5 AGRONOMY 23/01/2003   30 2001    CLASSICO
## 6   6 AGRONOMY 09/09/2002   28 2001     AGRARIO
## 11 11 AGRONOMY 09/09/2002   28 2001 SCIENTIFICO
## 17 17 AGRONOMY 10/06/2002   30 2001    CLASSICO

Let’s make it more difficult and extract the records were mark is ranging from 26 to 28 (margins included. Look at the AND clause):

subData <- subset(students, Mark <= 28 & Mark >=26)
head(subData)
##    Id  Subject       Date Mark Year  HighSchool
## 4   4 AGRONOMY 24/06/2002   26 2001    CLASSICO
## 6   6 AGRONOMY 09/09/2002   28 2001     AGRARIO
## 7   7 AGRONOMY 24/02/2003   26 2001    CLASSICO
## 8   8 AGRONOMY 09/09/2002   26 2001 SCIENTIFICO
## 10 10 AGRONOMY 08/07/2002   27 2001    CLASSICO
## 11 11 AGRONOMY 09/09/2002   28 2001 SCIENTIFICO

Now we are interested in those students who got a mark ranging from 26 to 28 in MATHS (please note the equality relationship written as ‘==’):

subData <- subset(students, Mark <= 28 & Mark >=26 & 
                    Subject == "MATHS")
head(subData)
##      Id Subject       Date Mark Year  HighSchool
## 115 115   MATHS 15/07/2002   26 2001     AGRARIO
## 124 124   MATHS 16/09/2002   26 2001 SCIENTIFICO
## 138 138   MATHS 04/02/2002   27 2001    CLASSICO
## 144 144   MATHS 10/02/2003   27 2001    CLASSICO
## 145 145   MATHS 04/07/2003   27 2002    CLASSICO
## 146 146   MATHS 28/02/2002   28 2001     AGRARIO

Lets’ look for good students either in MATHS or in CHEMISTRY (OR clause; note the ‘|’ operator):

subData <- subset(students, Mark <= 28 & Mark >=26 & 
                    Subject == "MATHS" | 
                    Subject == "CHEMISTRY")
head(subData)
##    Id   Subject       Date Mark Year HighSchool
## 64 64 CHEMISTRY 18/06/2003   20 2002    AGRARIO
## 65 65 CHEMISTRY 06/06/2002   21 2001   CLASSICO
## 66 66 CHEMISTRY 20/02/2003   21 2002   CLASSICO
## 67 67 CHEMISTRY 20/02/2003   18 2002    AGRARIO
## 68 68 CHEMISTRY 04/06/2002   28 2001      ALTRO
## 69 69 CHEMISTRY 26/06/2002   23 2001 RAGIONERIA

We can also select columns; for example we may want to display only the ‘Subject’, ‘Mark’ and ‘HighSchool’ columns:

subData <- subset(students, Mark <= 28 & Mark >=26 & 
                    Subject == "MATHS" | 
                    Subject == "CHEMISTRY",
                  select = c(Subject, Mark, HighSchool))
head(subData)
##      Subject Mark HighSchool
## 64 CHEMISTRY   20    AGRARIO
## 65 CHEMISTRY   21   CLASSICO
## 66 CHEMISTRY   21   CLASSICO
## 67 CHEMISTRY   18    AGRARIO
## 68 CHEMISTRY   28      ALTRO
## 69 CHEMISTRY   23 RAGIONERIA

We can as well drop the unwanted columns:

subData <- subset(students, Mark <= 28 & Mark >=26 & 
                    Subject == "MATHS" | 
                    Subject == "CHEMISTRY",
                  select = c(-Id, 
                             -Date,
                             -Year))
head(subData)
##      Subject Mark HighSchool
## 64 CHEMISTRY   20    AGRARIO
## 65 CHEMISTRY   21   CLASSICO
## 66 CHEMISTRY   21   CLASSICO
## 67 CHEMISTRY   18    AGRARIO
## 68 CHEMISTRY   28      ALTRO
## 69 CHEMISTRY   23 RAGIONERIA

Using the function ‘subset()’ is very easy. However, we might have higher flexibility if we subset by using indices. We already know that the notation ‘dataframe[i,j]’ returns the element in the i-th row and j-th column in a data frame. We can of course replace i and j with some subsetting rules. For example, taking the exams where the mark is comprised between 25 and 29 is done as follows:

subData <- students[(students$Mark <= 29 & students$Mark >=25),]
head(subData)
##    Id  Subject       Date Mark Year  HighSchool
## 4   4 AGRONOMY 24/06/2002   26 2001    CLASSICO
## 6   6 AGRONOMY 09/09/2002   28 2001     AGRARIO
## 7   7 AGRONOMY 24/02/2003   26 2001    CLASSICO
## 8   8 AGRONOMY 09/09/2002   26 2001 SCIENTIFICO
## 10 10 AGRONOMY 08/07/2002   27 2001    CLASSICO
## 11 11 AGRONOMY 09/09/2002   28 2001 SCIENTIFICO

This is useful to quickly edit the data. For example, if we want to replace all marks from 25 to 29 with NAs (Not Available), we can simply do:

subData <- students
subData[(subData$Mark <= 29 & subData$Mark >=25), "Mark"] <- NA
head(subData)
##   Id  Subject       Date Mark Year HighSchool
## 1  1 AGRONOMY 10/06/2002   30 2001   CLASSICO
## 2  2 AGRONOMY 08/07/2002   24 2001    AGRARIO
## 3  3 AGRONOMY 24/06/2002   30 2001    AGRARIO
## 4  4 AGRONOMY 24/06/2002   NA 2001   CLASSICO
## 5  5 AGRONOMY 23/01/2003   30 2001   CLASSICO
## 6  6 AGRONOMY 09/09/2002   NA 2001    AGRARIO

Please note that I created a new dataset to make the replacement, in order not to modify the original dataset. Of course, I can use the ‘is.na()’ function to find the missing data and edit them.

subData[is.na(subData$Mark) == T, "Mark"] <- 0 
head(subData)
##   Id  Subject       Date Mark Year HighSchool
## 1  1 AGRONOMY 10/06/2002   30 2001   CLASSICO
## 2  2 AGRONOMY 08/07/2002   24 2001    AGRARIO
## 3  3 AGRONOMY 24/06/2002   30 2001    AGRARIO
## 4  4 AGRONOMY 24/06/2002    0 2001   CLASSICO
## 5  5 AGRONOMY 23/01/2003   30 2001   CLASSICO
## 6  6 AGRONOMY 09/09/2002    0 2001    AGRARIO

Sorting the data

Sorting is very much like subsetting by indexing. I just need to use the ‘order’ function. For example, let’s sort the students dataset by mark:

sortedData <- students[order(students$Mark), ]
head(sortedData)
##    Id   Subject       Date Mark Year  HighSchool
## 51 51   BIOLOGY 01/03/2002   18 2001    CLASSICO
## 67 67 CHEMISTRY 20/02/2003   18 2002     AGRARIO
## 76 76 CHEMISTRY 24/02/2003   18 2002       ALTRO
## 79 79 CHEMISTRY 18/06/2003   18 2002     AGRARIO
## 82 82 CHEMISTRY 18/07/2002   18 2001     AGRARIO
## 83 83 CHEMISTRY 23/01/2003   18 2001 SCIENTIFICO

We can also sort by decreasing order:

sortedData <- students[order(-students$Mark), ]
head(sortedData)
##    Id  Subject       Date Mark Year HighSchool
## 1   1 AGRONOMY 10/06/2002   30 2001   CLASSICO
## 3   3 AGRONOMY 24/06/2002   30 2001    AGRARIO
## 5   5 AGRONOMY 23/01/2003   30 2001   CLASSICO
## 17 17 AGRONOMY 10/06/2002   30 2001   CLASSICO
## 18 18 AGRONOMY 10/06/2002   30 2001    AGRARIO
## 19 19 AGRONOMY 09/09/2002   30 2001    AGRARIO

We can obviously use multiple keys. For example, let’s sort by subject within marks:

sortedData <- students[order(-students$Mark, students$Subject), ]
head(sortedData)
##    Id  Subject       Date Mark Year HighSchool
## 1   1 AGRONOMY 10/06/2002   30 2001   CLASSICO
## 3   3 AGRONOMY 24/06/2002   30 2001    AGRARIO
## 5   5 AGRONOMY 23/01/2003   30 2001   CLASSICO
## 17 17 AGRONOMY 10/06/2002   30 2001   CLASSICO
## 18 18 AGRONOMY 10/06/2002   30 2001    AGRARIO
## 19 19 AGRONOMY 09/09/2002   30 2001    AGRARIO

If I want to sort in decreasing order on a character variable (such as Subject), I need to use the helper function ‘xtfrm()’:

sortedData <- students[order(-students$Mark, -xtfrm(students$Subject)), ]
head(sortedData)
##      Id Subject       Date Mark Year  HighSchool
## 116 116   MATHS 01/07/2002   30 2001       ALTRO
## 117 117   MATHS 18/06/2002   30 2001  RAGIONERIA
## 118 118   MATHS 09/07/2002   30 2001     AGRARIO
## 121 121   MATHS 18/06/2002   30 2001  RAGIONERIA
## 123 123   MATHS 09/07/2002   30 2001    CLASSICO
## 130 130   MATHS 07/02/2002   30 2001 SCIENTIFICO

Casting the data

When we have a dataset in the LONG format, we might be interested in reshaping it in the WIDE format. This is the same as what the ‘pivot table’ function in Excel does. For example, take the ‘rimsulfuron.csv’ dataset in my repository. This contains the results of a randomised block experiment, where we have 16 herbicides in four blocks. The dataset is in the LONG format, with one row per plot.

rimsulfuron <- read.csv("https://www.casaonofri.it/_datasets/rimsulfuron.csv", header=T)
head(rimsulfuron)
##   Herbicide Block Height     Yield
## 1         A    B1  183.4  93.86601
## 2         B    B1  187.0 103.14767
## 3         C    B1  188.4  92.70994
## 4         D    B1  183.2  88.74208
## 5         E    B1  184.2  90.96575
## 6         F    B1  178.8  98.40900

Lets’put this data frame in the WIDE format, with one row per herbicide and one column per block. To do so, I usually use the ‘cast()’ function in the ‘reshape’ package.

library(reshape)
castData <- cast(Herbicide ~ Block, data = rimsulfuron,
     value = "Yield")
head(castData)
##   Herbicide        B1        B2        B3        B4
## 1         A  93.86601 105.38976  94.13755 111.19865
## 2         B 103.14767  98.48589 106.13508 128.31921
## 3         C  92.70994 106.73288  97.11650 117.76346
## 4         D  88.74208 113.74683 104.51089 126.79076
## 5         E  90.96575 113.00256 104.39530 113.00646
## 6         F  98.40900  99.89433 101.68348  93.90775

Other similar functions are available within the ‘reshape2’ and ‘tidyr’ packages.

Melting the data

In this case we do the reverse: we transform the dataset from WIDE to LONG format. For this task, I like the ‘melt()’ function in the ‘reshape2’ package, which requires a data frame as input. I would like to use the ‘castData’ object which we have just created by using the ‘cast()’ function above. Unfortunately, this object has a ‘cast_df’ class. Therefore, in order to avoid weird results, I need to change ‘castData’ into a data frame, by using the ‘as.data.frame()’ function.

library(reshape2)
castData <- as.data.frame(castData)
mdati <- melt(castData,
              variable.name = "Block",
              value.name = "Yield",
              id=c("Herbicide"))

head(mdati)
##   Herbicide Block     Yield
## 1         A    B1  93.86601
## 2         B    B1 103.14767
## 3         C    B1  92.70994
## 4         D    B1  88.74208
## 5         E    B1  90.96575
## 6         F    B1  98.40900

Have a nice work with these functions!

To leave a comment for the author, please follow the link and comment on their blog: R on The broken bridge between biologists and statisticians.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)