Explore Your Dataset in R

November 5, 2018
By

(This article was first published on Little Miss Data, and kindly contributed to R-bloggers)

As person who works with data, one of the most exciting activities is to explore a fresh new dataset. You’re looking to understand what variables you have, how many records the data set contains, how many missing values, what is the variable structure, what are the variable relationships and more. While there is a ton you can do to get up and running, I want to show you a few simple commands to help you get a fast overview of the data set you are working with.

dataExplorerGifLg.gif

 

Simple Exploratory Data Analysis (EDA)

Download the data set

Before we get rolling with the EDA, we want to download our data set. For this example, we are going to use the dataset produced by my recent science, technology, art and math (STEAM) project.

#Load the readr library to bring in the dataset
library(readr)

#Download the data set
df= read_csv('https://raw.githubusercontent.com/lgellis/STEM/master/DATA-ART-1/Data/FinalData.csv', col_names = TRUE)

Now that we have the data set all loaded, and it’s time to run some very simple commands to preview the data set and it’s structure.

Head

To begin, we are going to run the head function, which allows us to see the first 6 rows by default. We are going to override the default and ask to preview the first 10 rows.

head(df, 10)

 

glimpse().png

dim and Glimpse

Next, we will run the dim function which displays the dimensions of the table. The output takes the form of row, column.

And then we run the glimpse function from the dplyr package. This will display a vertical preview of the dataset. It allows us to easily preview data type and sample data.

dim(df)

#Displays the type and a preview of all columns as a row so that it's very easy to take in.

library(dplyr)
glimpse(df)

 

Summary

We then run the summary function to show each column, it’s data type and a few other attributes which are especially useful for numeric attributes. We can see that for all the numeric attributes, it also displays min, 1st quartile, median, mean, 3rd quartile and max values.

summary(df)

 

Skim

Next we run the skim function from the skimr package. The skim function is a good addition to the summary function. It displays most of the numerical attributes from summary, but it also displays missing values, more quantile information and an inline histogram for each variable!

library(skimr)
skim(df)

 

create_report in DataExplorer

And finally the pièce de résistance, the main attraction and the reason I wrote this blog; the create_report function in the DataExplorer package. This awesome one line function will pull a full data profile of your data frame. It will produce an html file with the basic statistics, structure, missing data, distribution visualizations, correlation matrix and principal component analysis for your data frame! I recently learned about this function in a workshop given by Stephe Locke hosted by R Ladies Austin. This function is a game changer!

library(DataExplorer)
DataExplorer::create_report(df)

DataExplorer.png

THANK YOU

Thanks for reading along while we explored some simple EDA in R.  Please share your thoughts and creations with me on twitter

Note that the full code is available on my  github repo.  If you have trouble downloading the file from github, go to the main page of the repo and select “Clone or Download” and then “Download Zip”.

To leave a comment for the author, please follow the link and comment on their blog: Little Miss Data.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers


Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)