Some everyday data tasks: a few hints with R
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
We all work with data frames and it is important that we know how we can reshape them, as necessary to meet our needs. I think that there are, at least, four routine tasks that we need to be able to accomplish:
- subsetting
- sorting
- casting
- melting
Obviously, there is a wide array of possibilities; I’ll just mention a few, which I regularly use.
Subsetting the data
Subsetting means selecting the records (rows) or the variables (columns) which satisfy certain criteria. Let’s take the ‘students.csv’ dataset, which is available on one of my repositories. It is a database of student’s marks in a series of exams for different subjects.
students <- read.csv("https://www.casaonofri.it/_datasets/students.csv", header=T) head(students) ## Id Subject Date Mark Year HighSchool ## 1 1 AGRONOMY 10/06/2002 30 2001 CLASSICO ## 2 2 AGRONOMY 08/07/2002 24 2001 AGRARIO ## 3 3 AGRONOMY 24/06/2002 30 2001 AGRARIO ## 4 4 AGRONOMY 24/06/2002 26 2001 CLASSICO ## 5 5 AGRONOMY 23/01/2003 30 2001 CLASSICO ## 6 6 AGRONOMY 09/09/2002 28 2001 AGRARIO
Let’s say that we want a new dataset, which contains only the students with marks higher than 28.
subData <- subset(students, Mark >= 28) head(subData) ## Id Subject Date Mark Year HighSchool ## 1 1 AGRONOMY 10/06/2002 30 2001 CLASSICO ## 3 3 AGRONOMY 24/06/2002 30 2001 AGRARIO ## 5 5 AGRONOMY 23/01/2003 30 2001 CLASSICO ## 6 6 AGRONOMY 09/09/2002 28 2001 AGRARIO ## 11 11 AGRONOMY 09/09/2002 28 2001 SCIENTIFICO ## 17 17 AGRONOMY 10/06/2002 30 2001 CLASSICO
Let’s make it more difficult and extract the records were mark is ranging from 26 to 28 (margins included. Look at the AND clause):
subData <- subset(students, Mark <= 28 & Mark >=26) head(subData) ## Id Subject Date Mark Year HighSchool ## 4 4 AGRONOMY 24/06/2002 26 2001 CLASSICO ## 6 6 AGRONOMY 09/09/2002 28 2001 AGRARIO ## 7 7 AGRONOMY 24/02/2003 26 2001 CLASSICO ## 8 8 AGRONOMY 09/09/2002 26 2001 SCIENTIFICO ## 10 10 AGRONOMY 08/07/2002 27 2001 CLASSICO ## 11 11 AGRONOMY 09/09/2002 28 2001 SCIENTIFICO
Now we are interested in those students who got a mark ranging from 26 to 28 in MATHS (please note the equality relationship written as ‘==’):
subData <- subset(students, Mark <= 28 & Mark >=26 & Subject == "MATHS") head(subData) ## Id Subject Date Mark Year HighSchool ## 115 115 MATHS 15/07/2002 26 2001 AGRARIO ## 124 124 MATHS 16/09/2002 26 2001 SCIENTIFICO ## 138 138 MATHS 04/02/2002 27 2001 CLASSICO ## 144 144 MATHS 10/02/2003 27 2001 CLASSICO ## 145 145 MATHS 04/07/2003 27 2002 CLASSICO ## 146 146 MATHS 28/02/2002 28 2001 AGRARIO
Lets’ look for good students either in MATHS or in CHEMISTRY (OR clause; note the ‘|’ operator):
subData <- subset(students, Mark <= 28 & Mark >=26 & Subject == "MATHS" | Subject == "CHEMISTRY") head(subData) ## Id Subject Date Mark Year HighSchool ## 64 64 CHEMISTRY 18/06/2003 20 2002 AGRARIO ## 65 65 CHEMISTRY 06/06/2002 21 2001 CLASSICO ## 66 66 CHEMISTRY 20/02/2003 21 2002 CLASSICO ## 67 67 CHEMISTRY 20/02/2003 18 2002 AGRARIO ## 68 68 CHEMISTRY 04/06/2002 28 2001 ALTRO ## 69 69 CHEMISTRY 26/06/2002 23 2001 RAGIONERIA
We can also select columns; for example we may want to display only the ‘Subject’, ‘Mark’ and ‘HighSchool’ columns:
subData <- subset(students, Mark <= 28 & Mark >=26 & Subject == "MATHS" | Subject == "CHEMISTRY", select = c(Subject, Mark, HighSchool)) head(subData) ## Subject Mark HighSchool ## 64 CHEMISTRY 20 AGRARIO ## 65 CHEMISTRY 21 CLASSICO ## 66 CHEMISTRY 21 CLASSICO ## 67 CHEMISTRY 18 AGRARIO ## 68 CHEMISTRY 28 ALTRO ## 69 CHEMISTRY 23 RAGIONERIA
We can as well drop the unwanted columns:
subData <- subset(students, Mark <= 28 & Mark >=26 & Subject == "MATHS" | Subject == "CHEMISTRY", select = c(-Id, -Date, -Year)) head(subData) ## Subject Mark HighSchool ## 64 CHEMISTRY 20 AGRARIO ## 65 CHEMISTRY 21 CLASSICO ## 66 CHEMISTRY 21 CLASSICO ## 67 CHEMISTRY 18 AGRARIO ## 68 CHEMISTRY 28 ALTRO ## 69 CHEMISTRY 23 RAGIONERIA
Using the function ‘subset()’ is very easy. However, we might have higher flexibility if we subset by using indices. We already know that the notation ‘dataframe[i,j]’ returns the element in the i-th row and j-th column in a data frame. We can of course replace i and j with some subsetting rules. For example, taking the exams where the mark is comprised between 25 and 29 is done as follows:
subData <- students[(students$Mark <= 29 & students$Mark >=25),] head(subData) ## Id Subject Date Mark Year HighSchool ## 4 4 AGRONOMY 24/06/2002 26 2001 CLASSICO ## 6 6 AGRONOMY 09/09/2002 28 2001 AGRARIO ## 7 7 AGRONOMY 24/02/2003 26 2001 CLASSICO ## 8 8 AGRONOMY 09/09/2002 26 2001 SCIENTIFICO ## 10 10 AGRONOMY 08/07/2002 27 2001 CLASSICO ## 11 11 AGRONOMY 09/09/2002 28 2001 SCIENTIFICO
This is useful to quickly edit the data. For example, if we want to replace all marks from 25 to 29 with NAs (Not Available), we can simply do:
subData <- students subData[(subData$Mark <= 29 & subData$Mark >=25), "Mark"] <- NA head(subData) ## Id Subject Date Mark Year HighSchool ## 1 1 AGRONOMY 10/06/2002 30 2001 CLASSICO ## 2 2 AGRONOMY 08/07/2002 24 2001 AGRARIO ## 3 3 AGRONOMY 24/06/2002 30 2001 AGRARIO ## 4 4 AGRONOMY 24/06/2002 NA 2001 CLASSICO ## 5 5 AGRONOMY 23/01/2003 30 2001 CLASSICO ## 6 6 AGRONOMY 09/09/2002 NA 2001 AGRARIO
Please note that I created a new dataset to make the replacement, in order not to modify the original dataset. Of course, I can use the ‘is.na()’ function to find the missing data and edit them.
subData[is.na(subData$Mark) == T, "Mark"] <- 0 head(subData) ## Id Subject Date Mark Year HighSchool ## 1 1 AGRONOMY 10/06/2002 30 2001 CLASSICO ## 2 2 AGRONOMY 08/07/2002 24 2001 AGRARIO ## 3 3 AGRONOMY 24/06/2002 30 2001 AGRARIO ## 4 4 AGRONOMY 24/06/2002 0 2001 CLASSICO ## 5 5 AGRONOMY 23/01/2003 30 2001 CLASSICO ## 6 6 AGRONOMY 09/09/2002 0 2001 AGRARIO
Sorting the data
Sorting is very much like subsetting by indexing. I just need to use the ‘order’ function. For example, let’s sort the students dataset by mark:
sortedData <- students[order(students$Mark), ] head(sortedData) ## Id Subject Date Mark Year HighSchool ## 51 51 BIOLOGY 01/03/2002 18 2001 CLASSICO ## 67 67 CHEMISTRY 20/02/2003 18 2002 AGRARIO ## 76 76 CHEMISTRY 24/02/2003 18 2002 ALTRO ## 79 79 CHEMISTRY 18/06/2003 18 2002 AGRARIO ## 82 82 CHEMISTRY 18/07/2002 18 2001 AGRARIO ## 83 83 CHEMISTRY 23/01/2003 18 2001 SCIENTIFICO
We can also sort by decreasing order:
sortedData <- students[order(-students$Mark), ] head(sortedData) ## Id Subject Date Mark Year HighSchool ## 1 1 AGRONOMY 10/06/2002 30 2001 CLASSICO ## 3 3 AGRONOMY 24/06/2002 30 2001 AGRARIO ## 5 5 AGRONOMY 23/01/2003 30 2001 CLASSICO ## 17 17 AGRONOMY 10/06/2002 30 2001 CLASSICO ## 18 18 AGRONOMY 10/06/2002 30 2001 AGRARIO ## 19 19 AGRONOMY 09/09/2002 30 2001 AGRARIO
We can obviously use multiple keys. For example, let’s sort by subject within marks:
sortedData <- students[order(-students$Mark, students$Subject), ] head(sortedData) ## Id Subject Date Mark Year HighSchool ## 1 1 AGRONOMY 10/06/2002 30 2001 CLASSICO ## 3 3 AGRONOMY 24/06/2002 30 2001 AGRARIO ## 5 5 AGRONOMY 23/01/2003 30 2001 CLASSICO ## 17 17 AGRONOMY 10/06/2002 30 2001 CLASSICO ## 18 18 AGRONOMY 10/06/2002 30 2001 AGRARIO ## 19 19 AGRONOMY 09/09/2002 30 2001 AGRARIO
If I want to sort in decreasing order on a character variable (such as Subject), I need to use the helper function ‘xtfrm()’:
sortedData <- students[order(-students$Mark, -xtfrm(students$Subject)), ] head(sortedData) ## Id Subject Date Mark Year HighSchool ## 116 116 MATHS 01/07/2002 30 2001 ALTRO ## 117 117 MATHS 18/06/2002 30 2001 RAGIONERIA ## 118 118 MATHS 09/07/2002 30 2001 AGRARIO ## 121 121 MATHS 18/06/2002 30 2001 RAGIONERIA ## 123 123 MATHS 09/07/2002 30 2001 CLASSICO ## 130 130 MATHS 07/02/2002 30 2001 SCIENTIFICO
Casting the data
When we have a dataset in the LONG format, we might be interested in reshaping it in the WIDE format. This is the same as what the ‘pivot table’ function in Excel does. For example, take the ‘rimsulfuron.csv’ dataset in my repository. This contains the results of a randomised block experiment, where we have 16 herbicides in four blocks. The dataset is in the LONG format, with one row per plot.
rimsulfuron <- read.csv("https://www.casaonofri.it/_datasets/rimsulfuron.csv", header=T) head(rimsulfuron) ## Herbicide Block Height Yield ## 1 A B1 183.4 93.86601 ## 2 B B1 187.0 103.14767 ## 3 C B1 188.4 92.70994 ## 4 D B1 183.2 88.74208 ## 5 E B1 184.2 90.96575 ## 6 F B1 178.8 98.40900
Lets’put this data frame in the WIDE format, with one row per herbicide and one column per block. To do so, I usually use the ‘cast()’ function in the ‘reshape’ package.
library(reshape) castData <- cast(Herbicide ~ Block, data = rimsulfuron, value = "Yield") head(castData) ## Herbicide B1 B2 B3 B4 ## 1 A 93.86601 105.38976 94.13755 111.19865 ## 2 B 103.14767 98.48589 106.13508 128.31921 ## 3 C 92.70994 106.73288 97.11650 117.76346 ## 4 D 88.74208 113.74683 104.51089 126.79076 ## 5 E 90.96575 113.00256 104.39530 113.00646 ## 6 F 98.40900 99.89433 101.68348 93.90775
Other similar functions are available within the ‘reshape2’ and ‘tidyr’ packages.
Melting the data
In this case we do the reverse: we transform the dataset from WIDE to LONG format. For this task, I like the ‘melt()’ function in the ‘reshape2’ package, which requires a data frame as input. I would like to use the ‘castData’ object which we have just created by using the ‘cast()’ function above. Unfortunately, this object has a ‘cast_df’ class. Therefore, in order to avoid weird results, I need to change ‘castData’ into a data frame, by using the ‘as.data.frame()’ function.
library(reshape2) castData <- as.data.frame(castData) mdati <- melt(castData, variable.name = "Block", value.name = "Yield", id=c("Herbicide")) head(mdati) ## Herbicide Block Yield ## 1 A B1 93.86601 ## 2 B B1 103.14767 ## 3 C B1 92.70994 ## 4 D B1 88.74208 ## 5 E B1 90.96575 ## 6 F B1 98.40900
Have a nice work with these functions!
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.