I have found the following commands quite useful during the EDA part of any Data Science project. We will work with the tidyverse package, although we will actually need only dplyr and ggplot2, and with the iris dataset.
select_if | rename_if
The select_if function belongs to dplyr and is very useful when we want to choose columns based on a condition. We can also pass a function that is applied to the names of the selected columns.
Example: Let’s say that I want to choose only the numeric variables and to add the prefix “numeric_” to their column names.
library(tidyverse)
iris %>% select_if(is.numeric, list(~ paste0("numeric_", .))) %>% head()
Notice that we can also use rename_if in the same way. An important note is that rename_if(), rename_at(), and rename_all() have been superseded by rename_with(). The matching select functions have been superseded by the combination of select() + rename_with().
These functions were superseded because mutate_if() and friends were superseded by across(). select_if() and rename_if() already use tidy selection so they can’t be replaced by across() and instead we need a new function.
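As a minimal sketch of the superseding approach, the select_if example can be rewritten with select() + rename_with(). This assumes dplyr 1.0.0 or later, where the where() selection helper is available; the name iris_numeric is only illustrative.

```r
library(dplyr)

# Keep only the numeric columns, then add the prefix to every remaining name.
# where() and rename_with() assume dplyr >= 1.0.0.
iris_numeric <- iris %>%
  select(where(is.numeric)) %>%
  rename_with(~ paste0("numeric_", .x))

head(iris_numeric)
```

To rename the numeric columns while keeping Species, the selection can instead be passed to rename_with() through its .cols argument, for example rename_with(iris, ~ paste0("numeric_", .x), where(is.numeric)).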
In many Data Science projects, we want one particular column (usually the dependent variable y) to appear first or last in the dataset. We can achieve this using the everything() selection helper.
Example: Let’s say that I want the column Species to appear first in my dataset.
mydataset <- iris %>% select(Species, everything())
mydataset %>% head()
Example: Let’s say that I want the column Species to appear last in my dataset.
This is a little trickier. We will work with mydataset, where the Species column appears first, and move it to the last column.
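One way to do this, sketched here under the assumption that mydataset was built as in the previous example, is to drop Species first and then select everything(), which adds it back at the end. The name species_last is only illustrative.

```r
library(dplyr)

# Rebuild the data frame from the earlier example, with Species first.
mydataset <- iris %>% select(Species, everything())

# A leading negative selection starts from all columns except Species;
# everything() then appends the missing column at the end.
species_last <- mydataset %>% select(-Species, everything())

head(species_last)
```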
relocate() is a new addition in dplyr 1.0.0. You can specify exactly where to put the columns with the .before or .after arguments.
Example: Let’s say that I want the Petal.Width column to appear next to Sepal.Width
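A minimal sketch of this move, assuming dplyr 1.0.0 or later (the name iris_moved is only illustrative):

```r
library(dplyr)

# Place Petal.Width immediately after Sepal.Width; other columns keep their order.
iris_moved <- iris %>% relocate(Petal.Width, .after = Sepal.Width)

head(iris_moved)
```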
Notice that we can also set a column to appear after the last column.
Example: Let’s say that I want the Petal.Width to be the last column
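This can be sketched by combining relocate() with the last_col() selection helper, again assuming dplyr 1.0.0 or later (the name iris_last is only illustrative):

```r
library(dplyr)

# Move Petal.Width to the end by placing it after whatever the last column is.
iris_last <- iris %>% relocate(Petal.Width, .after = last_col())

head(iris_last)
```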
You can find more info in the tidyverse documentation
When we work with data frames and select a single column, we sometimes want the output to be a vector rather than a data frame. We can achieve this with pull(), which is part of dplyr.
Example: Let’s say that I want to run a t.test on the Sepal.Length for setosa versus virginica. Note that the t.test function expects numeric vectors.
setosa_sepal_length <- iris %>% filter(Species == 'setosa') %>% select(Sepal.Length) %>% pull()
virginica_sepal_length <- iris %>% filter(Species == 'virginica') %>% select(Sepal.Length) %>% pull()
t.test(setosa_sepal_length, virginica_sepal_length)
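As a slightly shorter variant, pull() also accepts the column name directly, so the select() step can be skipped. The sketch below uses the illustrative name setosa_direct and checks that the result is a plain numeric vector.

```r
library(dplyr)

# pull() extracts the named column as a vector, with no select() needed.
setosa_direct <- iris %>%
  filter(Species == "setosa") %>%
  pull(Sepal.Length)

# Confirm we have a numeric vector rather than a one-column data frame.
is.numeric(setosa_direct)
is.data.frame(setosa_direct)
```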
When you work with ggplot2, it is sometimes frustrating to have to reorder factor levels based on some condition. Let’s say that we want to show the boxplot of the Sepal.Width by Species.
Example: Let’s assume that we want to reorder the boxplot based on the Species’ median.
We can do that easily with reorder() from the stats package.
iris%>%ggplot(aes(x=reorder(Species,Sepal.Width, FUN = median), y=Sepal.Width))+geom_boxplot()+xlab("Species")
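To see what reorder() does before plotting, it may help to inspect the factor levels it produces. This sketch uses base R only, and the name species_by_median is illustrative.

```r
# Reorder the Species levels by the median Sepal.Width within each species.
species_by_median <- reorder(iris$Species, iris$Sepal.Width, FUN = median)

# Levels are sorted from the smallest median to the largest,
# which is the left-to-right order of the boxes in the plot.
levels(species_by_median)

# The per-species medians used for the ordering.
tapply(iris$Sepal.Width, iris$Species, median)
```

To sort in descending order instead, one option is to negate the variable inside the call, for example reorder(Species, -Sepal.Width, FUN = median).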