
I have found the following commands quite useful during the exploratory data analysis (EDA) part of a Data Science project. We will load the tidyverse package, although we only need dplyr and ggplot2 from it, and we will work with the iris dataset.

## select_if | rename_if

The select_if function belongs to dplyr and is very useful when we want to choose columns based on a condition. We can also pass a second function that is applied to the names of the selected columns.

Example: Let’s say that I want to choose only the numeric variables and to add the prefix “numeric_” to their column names.

library(tidyverse)
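A minimal sketch of this example, assuming the `select_if(.tbl, .predicate, .funs)` form in which the second function is applied to the names of the selected columns. The object name `numeric_iris` is only illustrative.

```r
library(dplyr)

# Keep only the numeric columns of iris and prefix their names with "numeric_".
# `numeric_iris` is an illustrative name chosen for this sketch.
numeric_iris <- iris %>%
  select_if(is.numeric, ~ paste0("numeric_", .))

head(numeric_iris)
```

The result should contain the four measurement columns with the new prefix, while Species is dropped because it is a factor.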




Notice that we can use rename_if() in the same way; the difference is that it keeps all columns and only renames the ones that match the condition. An important note is that rename_if(), rename_at(), and rename_all() have been superseded by rename_with(). The matching select functions have been superseded by the combination of select() and rename_with().

These functions were superseded because mutate_if() and friends were superseded by across(). select_if() and rename_if() already use tidy selection so they can’t be replaced by across() and instead we need a new function.
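As a sketch of the replacement described above, the same renaming can be written with select() plus rename_with(), or with rename_with() alone when all columns should be kept. The helper `add_prefix` and the object names are illustrative, not part of dplyr.

```r
library(dplyr)

# Illustrative helper: add the prefix to a vector of column names.
add_prefix <- function(nm) paste0("numeric_", nm)

# select() + rename_with(): keep the numeric columns, then rename all of them.
selected <- iris %>%
  select(where(is.numeric)) %>%
  rename_with(add_prefix)

# rename_with() alone: keep every column, rename only the numeric ones.
renamed <- iris %>%
  rename_with(add_prefix, .cols = where(is.numeric))

names(selected)
names(renamed)
```

The first form mirrors select_if() and the second mirrors rename_if(), so Species survives only in `renamed`.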

## everything

In many Data Science projects, we want one particular column (usually the dependent variable y) to appear first or last in the dataset. We can achieve this using everything() from the dplyr package.

Example: Let’s say that I want the column Species to appear first in my dataset.

mydataset <- iris %>% select(Species, everything())



Example: Let’s say that I want the column Species to appear last in my dataset.

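This is a little bit tricky. Have a look below at how we can do it. We will work with mydataset, where the Species column appears first, and we will move it to the last position. Because the first argument is a negative selection, select() starts from all columns and drops Species; everything() then adds back the one missing column, Species, at the end.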

mydataset %>% select(-Species, everything()) %>% head()
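To see why the order of the arguments matters, the sketch below compares the negative-selection trick with a more obvious attempt. The object names `naive` and `moved` are illustrative.

```r
library(dplyr)

mydataset <- iris %>% select(Species, everything())

# Obvious attempt: everything() already includes Species,
# so listing Species again does not change its position.
naive <- mydataset %>% select(everything(), Species)
names(naive)

# Negative selection first: all columns except Species,
# then everything() appends Species at the end.
moved <- mydataset %>% select(-Species, everything())
names(moved)
```

In `naive` Species stays in the first position, while in `moved` it ends up last.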



## relocate

relocate() was added in dplyr 1.0.0. It lets you specify exactly where to put columns with the .before or .after arguments.

Example: Let’s say that I want the Petal.Width column to appear right after Sepal.Width.

iris %>% relocate(Petal.Width, .after = Sepal.Width) %>% head()


Notice that we can also place a column after the last column.

Example: Let’s say that I want Petal.Width to be the last column.

iris %>% relocate(Petal.Width, .after = last_col()) %>% head()
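The examples above use .after only. As a short sketch of the other direction, the code below moves Species to the front, once with .before and once by relying on the default behaviour of relocate(). The object names are illustrative.

```r
library(dplyr)

# Put Species in front of Sepal.Length using .before.
species_first <- iris %>% relocate(Species, .before = Sepal.Length)
names(species_first)

# With neither .before nor .after, relocate() moves the columns to the front.
also_first <- iris %>% relocate(Species)
names(also_first)
```

Both calls should give the same column order, which makes relocate(Species) a compact alternative to select(Species, everything()).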



## pull

When we work with data frames and we select a single column, sometimes we want the output to be a vector rather than a one-column data frame. We can achieve this with pull(), which is part of dplyr.

Example: Let’s say that I want to run a t.test on Sepal.Length for setosa versus virginica. Note that the t.test function expects numeric vectors.

setosa_sepal_length <- iris %>% filter(Species == 'setosa') %>% select(Sepal.Length) %>% pull()
virginica_sepal_length <- iris %>% filter(Species == 'virginica') %>% select(Sepal.Length) %>% pull()

t.test(setosa_sepal_length, virginica_sepal_length)
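Since pull() accepts a column name directly, the intermediate select() step is optional. A minimal sketch of the same test written that way:

```r
library(dplyr)

# pull(Sepal.Length) extracts the column as a plain numeric vector.
setosa_sepal_length <- iris %>%
  filter(Species == "setosa") %>%
  pull(Sepal.Length)

virginica_sepal_length <- iris %>%
  filter(Species == "virginica") %>%
  pull(Sepal.Length)

# Both objects are numeric vectors, which is what t.test() expects.
is.numeric(setosa_sepal_length)
t.test(setosa_sepal_length, virginica_sepal_length)
```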



## reorder

When you work with ggplot2, it is sometimes frustrating to reorder factor levels based on some condition. Let’s say that we want to show the boxplot of Sepal.Width by Species.

iris %>% ggplot(aes(x = Species, y = Sepal.Width)) + geom_boxplot()



Example: Let’s assume that we want to order the boxes by each species’ median Sepal.Width.

We can do that easily with reorder() from the stats package, which reorders the levels of a factor according to a summary of a second variable.

iris %>% ggplot(aes(x = reorder(Species, Sepal.Width, FUN = median), y = Sepal.Width)) + geom_boxplot() + xlab("Species")
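To check what reorder() does before plotting, it can help to inspect the factor levels directly. The sketch below does that outside ggplot2; the object names are illustrative, and negating the values is one simple way to get descending order.

```r
# reorder() comes from the stats package, which is attached by default.
by_median <- reorder(iris$Species, iris$Sepal.Width, FUN = median)

levels(iris$Species)  # original level order
levels(by_median)     # levels sorted by increasing median Sepal.Width

# Negating the values reverses the order (largest median first).
by_median_desc <- reorder(iris$Species, -iris$Sepal.Width, FUN = median)
levels(by_median_desc)
```

ggplot2 draws a discrete axis in level order, so whichever ordering is used inside aes() is the order of the boxes.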