# Basic Data Analysis with dplyr

First published on **Analytics in R** and kindly contributed to R-bloggers.


**dplyr** is a very useful package in **R** for data manipulation. Created and maintained by Hadley Wickham, it provides a consistent set of functions for data analysis and manipulation. Here, I will show some of the most basic but important functions for analyzing data.

For this exercise, we’ll use the dataset **Cars93**, available in the **R** package **MASS**, and analyze it with the package **dplyr**. To load both packages, we’ll use the following lines of code:

**library(MASS)**

**library(dplyr)**

Now we’ll load the data:

**data(Cars93)**

Before digging in, we’ll first see what the data looks like. **R** has a handy function, **View**, that opens the dataset in the RStudio data viewer (it comes with base R, not with **dplyr**):

**View(Cars93)**

The dataset has 93 rows and 27 columns (although the viewer image only shows 17 rows and 7 columns). It is a data frame of cars, listing their manufacturers, their models and variables such as price, horsepower and engine size. For the rest of this exercise, I’ll only show the first few rows and columns, because showing everything would be tedious.

The head of the dataset looks like this:

**head(Cars93)**
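The dimensions quoted above can also be checked from the console instead of the viewer. This is only a small sketch and assumes nothing beyond the **MASS** package being installed:

```r
library(MASS)   # provides the Cars93 dataset

data(Cars93)

dim(Cars93)          # number of rows, then number of columns
names(Cars93)[1:7]   # names of the first seven columns
```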

Now that we have an idea of what the dataset looks like, let’s get started with the functions.

**Filter**

The **filter** function returns the rows that satisfy certain conditions. It is similar to the base **subset** function. The first argument is the name of the data frame, and the subsequent arguments are the conditions that a row must meet to be kept. For this example, let’s say we only want the cars that are small and whose maximum price is below 30 (prices are in thousands of dollars):

**filter(Cars93, Type == "Small" & Max.Price < 30)**

As you can see, only cars whose **Type** is **Small** are shown. There are 21 such cars in the dataframe.

Similarly, we can use this function to choose only the cars that have airbags for both the driver and the passenger:

**filter(Cars93, AirBags == "Driver & Passenger")**
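As a small sketch of how conditions can be combined: separating conditions with commas should behave the same as joining them with **&**, and **%in%** matches against a set of values. The object names below are my own and purely illustrative.

```r
library(MASS)    # Cars93 dataset
library(dplyr)   # loaded after MASS so that dplyr's verbs take precedence

data(Cars93)

# Conditions joined with & ...
small_and <- filter(Cars93, Type == "Small" & Max.Price < 30)

# ... and the same conditions passed as separate arguments
small_comma <- filter(Cars93, Type == "Small", Max.Price < 30)

# Both calls should select the same rows
identical(small_and, small_comma)

# %in% keeps rows whose value matches any element of the set
small_or_sporty <- filter(Cars93, Type %in% c("Small", "Sporty"))
nrow(small_or_sporty)
```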

**Slice**

The **slice** function selects rows by their position. Let’s say we only want to see the first twenty rows of this dataset. This is similar to the base **head** function, except that **head** returns the first 6 rows by default:

**slice(Cars93, 1:20)**
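Positions passed to **slice** do not have to be a contiguous range. The sketch below shows a few variations; the object names are illustrative only.

```r
library(MASS)
library(dplyr)

data(Cars93)

# Pick scattered rows by position (only the first three columns are printed)
slice(Cars93, c(1, 10, 93))[, 1:3]

# Negative positions drop rows: removing rows 1 to 90 leaves the last three
last_three <- slice(Cars93, -(1:90))

# n() is the number of rows, so this returns the last row
last_row <- slice(Cars93, n())

nrow(last_three)
nrow(last_row)
```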

**Mutate**

The **mutate** function is used to add new variables. What’s cool is that the new variables can be functions of existing variables. Let’s say I want a new variable that measures how wide each car’s price range is. To do this, I’ll create a new variable **ratio** and calculate it by dividing the maximum price by the minimum price:

**mutate(Cars93, ratio = Max.Price/Min.Price)**

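The newly created column is appended as the last column. Note that **mutate** returns a new data frame and leaves **Cars93** untouched, so the result has to be assigned to keep it. To show **ratio** as the third column, I stored the result under a new name and picked the columns by index (the original data has 27 columns, so **ratio** is column 28):

**Cars93_ratio <- mutate(Cars93, ratio = Max.Price/Min.Price)**

**Cars93_ratio[c(1, 2, 28, 3)]**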
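Column indices are easy to get wrong, so here is an alternative sketch that moves the new column by name while keeping every other column. It relies on the **everything()** helper from **dplyr**; the name **cars_ratio** is just an illustrative choice.

```r
library(MASS)
library(dplyr)   # loaded after MASS so that select() refers to dplyr's version

data(Cars93)

# mutate() returns a modified copy, so assign it to keep the new column
cars_ratio <- mutate(Cars93, ratio = Max.Price / Min.Price)

# Put ratio right after Model and keep all remaining columns behind it
cars_ratio <- select(cars_ratio, Manufacturer, Model, ratio, everything())

names(cars_ratio)[1:4]
```

Because no columns are dropped here, later examples that use other variables of the data still work on the result.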

**Select**

The **select** function allows you to select specific columns. The first argument is the dataset, and the subsequent arguments specify which columns you want to keep. For example, if you want only the first five columns, you can use:

**select(Cars93, 1:5)**

Similarly, if you want only the columns Manufacturer, Model, Price, MPG.city and Horsepower, you can use:

**select(Cars93, Manufacturer, Model, Price, MPG.city, Horsepower)**

If you want to see all columns except the columns from Manufacturer to Type, you can use:

**select(Cars93, -(Manufacturer:Type))**
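Columns can also be chosen by matching their names. The following sketch uses the **starts_with** and **contains** helpers that come with **dplyr**; the object names are illustrative.

```r
library(MASS)
library(dplyr)   # loaded after MASS so that select() refers to dplyr's version

data(Cars93)

# Every column whose name starts with "MPG"
mpg_cols <- select(Cars93, Manufacturer, Model, starts_with("MPG"))
names(mpg_cols)

# Every column whose name contains "Price"
price_cols <- select(Cars93, Manufacturer, Model, contains("Price"))
names(price_cols)
```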

**Summarise and Group_by**

The **summarise** function is used to collapse multiple values of a variable into a single summary value. Used in conjunction with other functions, like **group_by**, it can be a very useful tool for analyzing data. Here, I want to group the cars by type and see the mean of their average mileage. To do this, I first store the average of city and highway mileage in a new column and then summarise by group:

**Cars93 <- mutate(Cars93, meanmpg = (MPG.city + MPG.highway)/2)**

**summarise(group_by(Cars93, Type), mean(meanmpg))**
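A slightly extended sketch of the same idea: naming the summary columns makes the output easier to read, and **n()** counts the rows in each group. The names **cars_mpg**, **n_cars** and **mean_mpg** are my own choices.

```r
library(MASS)
library(dplyr)

data(Cars93)

# Average of city and highway mileage for every car
cars_mpg <- mutate(Cars93, meanmpg = (MPG.city + MPG.highway) / 2)

# One row per car type, with a count and a named mean
by_type <- summarise(
  group_by(cars_mpg, Type),
  n_cars   = n(),
  mean_mpg = mean(meanmpg)
)

by_type
```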

**N_Distinct**

The **n_distinct** function returns the number of distinct values in a vector. The code:

**n_distinct(Cars93$Type)**

**[1] 6**

shows that there are six different types of car (Small, Midsize, Compact etc.) in the dataset.
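To see which values are being counted, or to count distinct values within groups, something along these lines should work. This is a sketch; **makers_by_type** and **n_makers** are illustrative names.

```r
library(MASS)
library(dplyr)

data(Cars93)

# n_distinct(x) gives the same count as length(unique(x))
n_distinct(Cars93$Type) == length(unique(Cars93$Type))

# The distinct car types themselves (Type is a factor)
levels(Cars93$Type)

# Number of distinct manufacturers within each car type
makers_by_type <- summarise(
  group_by(Cars93, Type),
  n_makers = n_distinct(Manufacturer)
)

makers_by_type
```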

**Top_n**

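The **top_n** function is similar to the **slice** function in that it returns a specified number of rows, but it picks them by rank rather than by position: it keeps the rows with the largest values of the given variable. Here, it returns the five cars with the highest highway mileage (ties can produce more than five rows):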

**top_n(Cars93, 5, MPG.highway)**
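In recent versions of **dplyr** (1.0.0 and later, which this sketch assumes), **top_n** is superseded by **slice_max**, which also sorts its result. Both should pick the same rows here; the object names are illustrative.

```r
library(MASS)
library(dplyr)

data(Cars93)

# Five highest highway mileages; rows stay in their original order
top5_old <- top_n(Cars93, 5, MPG.highway)

# Same selection with slice_max(); rows come back sorted, highest first
top5_new <- slice_max(Cars93, MPG.highway, n = 5)

top5_new[, c("Manufacturer", "Model", "MPG.highway")]
```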

**Arrange**

The **arrange** function is used to sort rows by one or more variables. Here, I arranged the dataset by price in descending order using the **desc** function, meaning the most expensive cars show up first.

**arrange(Cars93, desc(Price))**
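**arrange** also accepts several sorting variables, with later ones breaking ties in earlier ones. A minimal sketch, with **by_type_price** as an illustrative name:

```r
library(MASS)
library(dplyr)

data(Cars93)

# Sort by car type first, then by price from highest to lowest within each type
by_type_price <- arrange(Cars93, Type, desc(Price))

head(by_type_price[, c("Type", "Manufacturer", "Model", "Price")])
```

Since **Type** is a factor, the rows are ordered by its factor levels.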

**Sample**

The **sample_n** function is used to select random rows from a table:

**sample_n(Cars93, size=10)**

This code gives us 10 randomly chosen cars along with all their variables.
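Because the rows are drawn at random, each run gives a different result. If a repeatable sample is needed, setting a seed first should do it. This is a sketch, and the seed value 42 is arbitrary.

```r
library(MASS)
library(dplyr)

data(Cars93)

# Fixing the seed before sampling makes the draw repeatable
set.seed(42)
sample_a <- sample_n(Cars93, size = 10)

set.seed(42)
sample_b <- sample_n(Cars93, size = 10)

# Both draws should contain exactly the same rows
identical(sample_a, sample_b)
```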

**Piping**

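The pipe, written **%>%**, is an operator that allows the user to chain multiple function calls together: the result on its left is passed as the first argument to the function on its right. This is particularly helpful when an analysis takes several steps and you do not want to store or look at the output of every single step.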

For example, let’s say that from the dataset **Cars93**, I only want car models that have airbags for both the driver and the passenger; then I want to group them by car type and summarise the highway mileage of each group:

**Cars93 %>% filter(AirBags == "Driver & Passenger") %>% group_by(Type) %>% summarise(mean(MPG.highway))**
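To make the benefit concrete, here is a sketch that writes the same analysis once as a pipeline and once as nested calls, with a sorting step added at the end. The names **piped**, **nested** and **mean_highway** are illustrative.

```r
library(MASS)
library(dplyr)

data(Cars93)

# Pipeline: read top to bottom, one step per line
piped <- Cars93 %>%
  filter(AirBags == "Driver & Passenger") %>%
  group_by(Type) %>%
  summarise(mean_highway = mean(MPG.highway)) %>%
  arrange(desc(mean_highway))

# The same steps as nested calls: read from the inside out
nested <- arrange(
  summarise(
    group_by(filter(Cars93, AirBags == "Driver & Passenger"), Type),
    mean_highway = mean(MPG.highway)
  ),
  desc(mean_highway)
)

# Both forms should give the same table
all.equal(piped, nested)
```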

These are very basic examples, but it’s quite easy to see why they’ll be very useful in data analysis.

This brings me to the end of this article. I hope you found it useful. Please feel free to leave a comment below.
