[This article was first published on r – Appsilon | End to End Data Science Solutions, and kindly contributed to R-bloggers.]

## Exploratory Data Analysis With dplyr

When it comes to data analysis in R, you should look no further than the dplyr package. It’s an excellent all-rounder – providing you with extensive drill-down abilities while keeping the coding clean and minimal.

Are you completely new to R? Check out what you can do with the language.

Today you’ll learn how to do exploratory data analysis on the well-known Gapminder dataset. It contains historical data (1952 to 2007, in five-year steps) on life expectancy, population, and GDP per capita for 142 countries.

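The article covers loading and filtering the dataset, data summaries, derived variables and assumption testing, and a more advanced multi-step analysis.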

If you’re following along, you’ll need to have two packages installed – dplyr and gapminder. Once installed, you can import them with the following code:
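A minimal sketch of that loading step, assuming both packages are already installed (for example with `install.packages(c("dplyr", "gapminder"))`):

```r
# Attach dplyr for the data-manipulation verbs and the %>% pipe
library(dplyr)
# Attach gapminder, which exposes the `gapminder` data frame
library(gapminder)

# Quick sanity check: number of rows and columns in the dataset
dim(gapminder)
```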

A call to the head() function will show the first six rows of the dataset:
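As a sketch, with the result stored in a variable named `first_rows` purely for illustration:

```r
library(dplyr)
library(gapminder)

# head() returns the first six rows by default
first_rows <- head(gapminder)
first_rows
```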

Image 1 – First six rows of the Gapminder dataset

You now have everything loaded, which means you can begin with the analysis.

Let’s start with something simple. For example, let’s say you want to retrieve the records for the United States for 1997, 2002, and 2007. To get these, you’ll have to filter the dataset by continent, country, and year. It can all be done in a single filter() call:
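One way to write that filter is sketched below. The variable name `us_recent` is an illustrative choice, and the continent condition is kept only because the text mentions it (the country name alone already identifies the rows):

```r
library(dplyr)
library(gapminder)

us_recent <- gapminder %>%
  filter(
    continent == "Americas",        # exact match on continent
    country == "United States",     # exact match on country
    year %in% c(1997, 2002, 2007)   # keep any of these three years
  )
us_recent
```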

The results are shown in the following image:

Image 2 – United States records for 1997, 2002, and 2007

So, what happened here? The filter() function keeps only the records that satisfy every condition you pass to it. If you need an exact match, use the == operator. If a column may match any one of several values, use the %in% operator. As simple as that.

## Data Summaries

Summary statistics are a great starting point in any exploratory data analysis. They enable you to find a value that best describes a sample of data or a list of values that best represents each subset of the sample.

A simple average is a good place to start. Here’s how you can find the average life expectancy in the United States for 2007:
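A possible version of this step is shown below. The column name `avgLifeExp` is an assumption made for illustration. Since a single country and year leaves exactly one row, the mean here simply equals that row's value:

```r
library(dplyr)
library(gapminder)

us_2007 <- gapminder %>%
  filter(country == "United States", year == 2007) %>%
  # summarize() collapses the filtered rows into a single summary row
  summarize(avgLifeExp = mean(lifeExp))
us_2007
```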

The results are shown below:

Image 3 – Average life expectancy for the United States in 2007

Let’s take this a step further and calculate the average life expectancy per continent in 2007. You’ll need to use the group_by() function to do so:
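A sketch of the grouped summary, again using the illustrative column name `avgLifeExp`:

```r
library(dplyr)
library(gapminder)

continent_life <- gapminder %>%
  filter(year == 2007) %>%
  # group_by() makes summarize() return one row per continent
  group_by(continent) %>%
  summarize(avgLifeExp = mean(lifeExp))
continent_life
```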

The results are shown in the following image:

Image 4 – Average life expectancy per continent in 2007

If you’re anything like me, you’ll find the above information useful but not presented in the best way. We’re dealing with average life expectancy, where higher is better, so it’s good practice to sort the results in descending order.

Let’s see how with a slightly different example. The code below sorts continents by their total population:
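The sketch below assumes the totals are taken for 2007 only, which the text does not state, and uses the illustrative column name `totalPop`. The `pop` column is stored as integers and continent totals can exceed R's 32-bit integer limit, so it is converted to numeric before summing:

```r
library(dplyr)
library(gapminder)

continent_pop <- gapminder %>%
  filter(year == 2007) %>%                       # assumption: a single year
  group_by(continent) %>%
  # as.numeric() avoids integer overflow when adding large populations
  summarize(totalPop = sum(as.numeric(pop))) %>%
  # desc() sorts from the largest total to the smallest
  arrange(desc(totalPop))
continent_pop
```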

The results are shown below:

Image 5 – Total population per continent

You now know how to calculate basic summary statistics – an essential part of any data analysis. Next, you’ll learn how to create derived columns and test assumptions.

## Creating Derived Variables and Testing Assumptions

A derived column is a column that the analyst computes from existing ones, usually by combining values from several different columns. For example, you could calculate the total GDP of a country by multiplying GDP per capita by the country’s population.

Let’s do just that in code. The mutate() function calculates derived columns, using the syntax newColumn = your_calculation:
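A minimal sketch, with `totalGDP` as an illustrative name for the new column:

```r
library(dplyr)
library(gapminder)

with_gdp <- gapminder %>%
  # mutate() adds the new column and keeps all existing ones
  mutate(totalGDP = gdpPercap * pop)
head(with_gdp)
```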

The results are shown in the image below:

Image 6 – Total GDP per country/year combination

Let’s apply this knowledge to something useful – testing assumptions. We assume that higher GDP per capita values lead to higher life expectancy. Keep in mind that we’re not doing formal hypothesis testing here – but instead examining the results and eyeballing if they make sense for our assumption.

To test the assumption, you’ll calculate the percentiles from the lifeExp column. This will tell you what percentage of countries have an identical or lower life expectancy than the current country:
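One way to compute such percentiles is with base R's `ecdf()`, which returns, for each value, the share of observations less than or equal to it. The sketch below assumes the comparison is restricted to 2007 and uses `percentile` as an illustrative column name. Whether the original used `ecdf()` is not confirmed by the text:

```r
library(dplyr)
library(gapminder)

life_pct <- gapminder %>%
  filter(year == 2007) %>%                          # assumption: a single year
  # ecdf(x) builds the empirical CDF; calling it on x gives each row's share
  mutate(percentile = ecdf(lifeExp)(lifeExp)) %>%
  arrange(desc(gdpPercap))                          # richest countries first
head(life_pct, 10)
```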

The results are shown below:

Image 7 – Life expectancy percentile sorted descendingly by GDP per capita

From the above image, you can see countries sorted by GDP per capita and their respective life expectancy percentile on the right. All of the countries are well above the median (50th percentile), with the lowest one being at the 68th percentile.

Before you can “verify” the above claim, you’ll have to look at the other end – are countries with the lowest GDP per capita located near the lowest percentiles?

You’ll only need to sort the dataset in ascending order:
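A sketch of the ascending variant, under the same assumptions (2007 only, percentiles from `ecdf()`, illustrative column name `percentile`):

```r
library(dplyr)
library(gapminder)

life_pct_asc <- gapminder %>%
  filter(year == 2007) %>%
  mutate(percentile = ecdf(lifeExp)(lifeExp)) %>%
  # arrange() sorts in ascending order unless wrapped in desc()
  arrange(gdpPercap)
head(life_pct_asc, 10)
```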

The results are shown in the image below:

Image 8 – Life expectancy percentile sorted ascendingly by GDP per capita

Yes – our claim seems to make perfect sense. Once again, this wasn’t a formal hypothesis test, but instead a test of simple assumptions.

## Advanced Analysis

The term “advanced” is a bit abstract in data analysis, to say the least. If you’re fluent in R and dplyr and have a couple of years of experience, there’s virtually nothing you can’t do, so nothing seems to be advanced. On the other hand, even the most basic filtering and aggregating may seem like a big deal if you’re starting out.

For that reason, this section treats the term “advanced” as providing the complete answer to a more complicated question – so multiple operations are required.

For example, let’s say you have to find the top 10 countries above the 90th percentile for life expectancy in 2007. You can reuse some of the logic from the previous sections, but answering this question requires several filtering and subsetting steps:
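A sketch of how these steps might be chained, assuming percentiles are computed with `ecdf()` and stored in an illustrative `percentile` column:

```r
library(dplyr)
library(gapminder)

top10 <- gapminder %>%
  filter(year == 2007) %>%                          # first filter: the year
  mutate(percentile = ecdf(lifeExp)(lifeExp)) %>%
  filter(percentile > 0.9) %>%                      # second filter: above the 90th percentile
  # top_n() keeps the 10 rows with the highest `wt` values (more if there are ties)
  top_n(10, wt = percentile) %>%
  arrange(desc(percentile))
top10
```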

As you can see, the filter() function was used twice: the first time to select the year, and the second time to remove the records at or below the 90th percentile, since you’re only interested in the top of the distribution. The top_n() function then keeps the n rows with the highest values of the column given in the wt argument. Note that in dplyr 1.0.0 and later, top_n() is superseded by slice_max() and slice_min(), although it still works.

The results are shown below:

Image 9 – Top 10 countries above the 90th percentile (life expectancy)

But what if you had to calculate the opposite, the worst 10 countries below the 10th percentile? The syntax is quite similar, except for the second filter and the top_n() call, where n is prefixed with a minus sign:
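A sketch of the mirrored query, under the same assumptions (`ecdf()` percentiles, illustrative `percentile` column):

```r
library(dplyr)
library(gapminder)

bottom10 <- gapminder %>%
  filter(year == 2007) %>%
  mutate(percentile = ecdf(lifeExp)(lifeExp)) %>%
  filter(percentile < 0.1) %>%                      # below the 10th percentile
  # a negative n makes top_n() keep the rows with the lowest `wt` values
  top_n(-10, wt = percentile) %>%
  arrange(percentile)
bottom10
```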

The minus prefix ensures the bottom 10 records are shown instead of the top 10:

Image 10 – Worst 10 countries below the 10th percentile (life expectancy)

And that’s just enough for today. Let’s wrap things up in the next section.

## Conclusion

Today you’ve learned how to use the dplyr package for exploratory data analysis. The quality of the analysis depends heavily on the quality of your questions, so make sure to ask the right questions first. If you know how to do that, the analysis shouldn’t be too much trouble.

If you want to learn more about data analysis and everything R-related, stay tuned to the Appsilon blog. Also, make sure to subscribe to our newsletter, so you never miss an update.