R Training – Data Manipulation
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
This is the third module of a five-module training on R I conceived and taught to my ex-colleagues back in December 2016. RStudio is the suggested IDE to go through this module. For the raw code, example data and resources visit this repo on GitHub.
Setup
Load the packages in your workspace
library(data.table) library(dplyr)
Let’s work with iris data
data("iris")
Convert to data.table
data.table is an efficient package to perform data manipulation. It is often convenient to convert a data.frame into a data.table to leverage the advanced data manipulation features of this package.
dt <- data.table(iris)
the good thing is that data.frame class is not lost in this way.
class(dt)
Be careful when calling functions expecting one class or the other (e.g. ggplot2)
Removing NAs
Depending on the study we may want to simply exclude missing values.
iris <- iris[complete.cases(iris),]
Coalescing NAs to 0s
Another option is to replace NAs with 0s. This function will make the trick using data.table syntax.
replaceNAs = function(DT) { for (j in seq_len(ncol(DT))) set(DT,which(is.na(DT[[j]])),j,0) } replaceNAs(iris)
data.table presentation
data.table is a package to handle data frames. It is particularly appreciated for its speed in performing operations on large data sets. It also makes it very easy to manipulate data for those knowing the basics of SQL since you can think of a data.table object as being structured this way:
- DT[WHERE, SELECT, GROUP BY]
dplyr presentation
Quoting from CRAN dplyr package is a fast, consistent tool for working with data-frame-like objects. The main methods of this package to manipulate data are the following:
- Filter(), to select specific rows
- Arrange() the rows of your data into some order
- Mutate() your data frame to contain new columns
- Summarise() groups of your data in some way
but there are also lesser known tricks you can do with this package. Here we’ll cover just a few examples to give you an idea of its functioning.
Another awesome thing about dplyr is that it works seamlessly with the pipeline operator.
The pipeline operator from magrittr package (installed along with dplyr) allows you to create pipes. Pipes are chains which take the output from a function and feed it to the first argument of the next function. In this way it becomes very elegant and fast to query data frames.
Filtering rows
data table way
iris_setosa <- dt[Species=="setosa"] # only 'setosa' rows iris_big <- dt[Petal.Width>2] # only rows with petal width greater than 2 iris_big <- dt[Petal.Width>2 | Petal.Length>3] # only rows with petal width greater than 2 or petal length greater than 3 setosaBig <- dt[Species=="setosa" & Petal.Width>0.5] # only rows 'setosa' and with petal width greater than 0.5 iris_sub <- dt[Species %in% c("setosa", "virginica")] # only rows setosa or virginica
dplyr way
setosa <- dt %>% filter(Species == "setosa") # only rows setosa
Subsetting columns
Data.table way
keep <- c("Sepal.Length","Petal.Length","Species") iris_keep <- dt[,keep,with=F]
dplyr way by name of variables
iris_keep_dplyr <- select(dt,Sepal.Length,Petal.Length,Species)
or using the pipeline operator
iris_keep_dplyr <- dt %>% select(Sepal.Length,Petal.Length,Species)
or by position of variables
iris_keep_dplyr <- dt %>% select(1,3,5)
Group by
data frame way
tapply(dt$Petal.Length, INDEX = dt$Species, FUN = sum)
Data.table way
dt[,.(sum_pet_length=sum(Petal.Length)),by=Species]
dplyr way
dt %>% group_by(Species) %>% summarise(sum_pet_length = sum(Petal.Length))
Group by several variables, data.table way
head(dt[, list(avg_pet_length = mean(Petal.Width)), by=.(Species, Sepal.Length)])
Group by several variables, dplyr way
dt %>% group_by(Species, Sepal.Length) %>% summarise(avg_pet_length = mean(Petal.Width)) %>% head()
Merging tables
Typically a complex analysis will involve more than one table and often we will need to merge them to perform our analyses. When the tables have a similar structure by row or by column we may need to bind them.
Binding
Same columns: row binding
# Same columns, row binding dt_1 <- dt dt_total <- rbind(dt, dt_1 )
Same rows: column binding
vector <- 1:nrow(dt) dt_total <- cbind(dt, vector)
Joining
More often what we would like to do is to merge different tables based on the values of some common variable.
- to explore left_outer_join we make up a dataset with some info (totally fake!) on iris species
infoIris <- data.frame(s = levels(iris$Species), discovery = c(1950, 1960, 1961), color = c("red", "yellow", "brown"))
- to explore inner_join we make up another dataset
flowers <- data.frame(Species = c("setosa", "virginica", "tulipan", "margherita"), n = 1:4)
Joining with base functions
- with base merge functions the key variable has to have the same name on both tables
names(infoIris)[1] <- "Species" left_join_base <- merge(dt, infoIris, by = "Species", all.x = T)
- Inner join
inner_join <- merge(dt, flowers, by = "Species")
- Right outer join
right_join_base <- merge(dt, flowers, by = "Species", all.y = T)
Joining, data table way
- Left outer join
setkey(dt, "Species") # We need to assign a key to join as sql way before properly doing it infoIris_dt <- data.table(infoIris) setkey(infoIris_dt, "Species") # event though in this case they already coincide with data.table there's no need for that because we are explicitly setting a common key left_outer_join <- dt[infoIris_dt]
- Inner join
flowers_dt <- data.table(flowers) setkey(flowers_dt, "Species") # no need to make the names coincide here because we are setting a key inner_join <- dt[flowers_dt, nomatch = 0]
- Right outer join
right_outer_join <- flowers_dt[dt]
Joining, dplyr way
- left join
left_join_dyplr <- left_join(dt, infoIris, by = "Species")
- inner join
inner_join_dplyr <- inner_join(dt, flowers, by = "Species")
- right join
right_join_dyplr <- right_join(dt, flowers, by = "Species")
dplyr anti join
Suppose there is one full dataset (‘A’) and another one subset of the first (‘B’). A nice way to get only the data of A that it is not on B is the following:
t <- anti_join(dt, flowers, by = "Species")
That’s it for this module! If you have gone through all this code you should have learnt the basics of two fundamentals package in R for data manipulation, data.table and dplyr.
When you’re ready, go ahead with the fourth module: R training – data visualization.
The post R Training – Data Manipulation appeared first on SLOW DATA.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.