Inspecting Data Structures

[This article was first published on Quantargo Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The first step of any data related task is to inspect the data we are dealing with. This is crucial for data wrangling as well, since we need to explore the current structure of the data, in order to identify the required transformations.

  • Inspect tabular data interactively with View()
  • Examine the data structure of each object using str()
View(___)
str(___)

Interactive Inspection with View()

Before starting with any kind of data analysis, it is crucial to understand the data we are dealing with. Plotting is a very important tool to get a quick overview of the statistical properties of data and to detect possible outliers. However, visualization might not always be possible, due to the size or complexity of the data set.

As an alternative solution, it might be convenient to interactively dig through the data set. This could be done by a spreadsheet-like interface, similar to Microsoft Excel, which enables to filter, sort and inspect tabular data structures.

R provides the function View(), which shows an interactive data viewer. Depending on the used platform and editor, this viewer might look differently. Below you can see an example of the View() function in RStudio:

View(gapminder)

Quiz: Interactive Inspection with View()

Why should you inspect data sets with View() before starting with your analysis?
  • Get a first impression of the data quality.
  • Find outliers and missing values.
  • Interactively inspect the data set.
  • Create reproducible outputs for reports.
Start Quiz

Exercise: Interactive Inspection with View()

Use the View() function on the gapminder data set and determine the country with the highest life expectancy. Pay also attention to year the projection was made. Set the variables country and year accordingly!

Start Exercise

Examining Data Structures with str()

Sometimes we need to analyze very large and complex data structures. Displaying these data sources may already be overwhelming and simply not possible with interactive tools. In these cases, the str() function comes to the rescue and prints the structure, as well as the first few values of any R object. Even very large and complex data structures can easily be displayed in the console that way.

As an example, let’s take a look at structure of the TitanicSurvival data set:

library(carData)
str(TitanicSurvival)
'data.frame':   1309 obs. of  4 variables:
 $ survived      : Factor w/ 2 levels "no","yes": 2 2 1 1 1 2 2 1 2 1 ...
 $ sex           : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
 $ age           : num  29 0.917 2 30 25 ...
 $ passengerClass: Factor w/ 3 levels "1st","2nd","3rd": 1 1 1 1 1 1 1 1 1 1 ...

It consists of three factor columns (survived, sex and passengerClass) and one numeric column age. Note, that for factor columns both the labels (e.g. "no","yes") as well as the integer values are displayed.

Quiz: Examining Data Structures with str()

In which cases is it benefitial to use the str() function?
  • Get an overview of highly complex data sets.
  • Create summary statistics describing the data set.
  • Plot histograms.
  • Only for data.frames. str() can only handle data.frames and cannot be used for other objects.
Start Quiz

Quiz: Interpret the Output of str()

library(babynames)
str(babynames)
tibble [1,924,665 × 5] (S3: tbl_df/tbl/data.frame)
 $ year: num [1:1924665] 1880 1880 1880 1880 1880 1880 1880 1880 1880 1880 ...
 $ sex : chr [1:1924665] "F" "F" "F" "F" ...
 $ name: chr [1:1924665] "Mary" "Anna" "Emma" "Elizabeth" ...
 $ n   : int [1:1924665] 7065 2604 2003 1939 1746 1578 1472 1414 1320 1288 ...
 $ prop: num [1:1924665] 0.0724 0.0267 0.0205 0.0199 0.0179 ...
Examine the output of the str() function with the babynames dataset above. Which statements about the data set are correct?
  • The data set has five rows.
  • The data set has five columns.
  • The prop column is of type numeric.
  • The column sex is of type factor.
Start Quiz

Inspecting Data Structures is an excerpt from the course Advanced Data Transformation, which is available at quantargo.com

VIEW FULL COURSE

To leave a comment for the author, please follow the link and comment on their blog: Quantargo Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)