# First steps of data exploration and visualization with Tidyverse

October 8, 2018

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

In this post, I will show how to use visualization and transformation to explore your data in R, using several functions that come with the tidyverse packages.

In general, there are two types of variables: categorical and continuous. In this section, I will show a suitable way to examine the distribution of each, using data from NHANES.

### Load the library and data:

``````
library(tidyverse)
library(RNHANES)

# Demographics (DEMO_H) joined with body measures (BMX_H) for the 2013-2014 cycle
d13 = nhanes_load_data("DEMO_H", "2013-2014") %>%
  transmute(SEQN = SEQN, wave = cycle, INDFMIN2, RIDRETH1) %>%
  left_join(nhanes_load_data("BMX_H", "2013-2014"), by = "SEQN") %>%
  select(SEQN, wave, INDFMIN2, RIDRETH1, BMXBMI) %>%
  mutate(
    # Collapse the INDFMIN2 income codes into five groups
    annincome = recode_factor(INDFMIN2,
      '1' = "lowest",
      '2' = "lowest",
      '3' = "lowest",
      '4' = "low",
      '5' = "low",
      '6' = "low",
      '7' = "medium",
      '8' = "medium",
      '9' = "medium",
      '10' = "high",
      '12' = "high",
      '13' = "high",
      '14' = "highest",
      '15' = "highest")) %>%
  # Keep only rows with both a BMI value and an income group
  filter(!is.na(BMXBMI), !is.na(annincome))
``````
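One detail of `recode_factor()` matters for the plots that follow: the factor levels come out in the order the replacements are listed, not alphabetically, so the income groups appear on the axes from lowest to highest. A minimal sketch on a small made-up vector of codes (the `toy` tibble and its `code` column are hypothetical, not NHANES data):

```r
library(dplyr)

# Hypothetical income codes in arbitrary order; not real NHANES rows
toy <- tibble(code = c(7, 1, 15, 4, 10, 4))

toy_recoded <- toy %>%
  mutate(income = recode_factor(code,
    `1`  = "lowest",
    `4`  = "low",
    `7`  = "medium",
    `10` = "high",
    `15` = "highest"))

# Levels follow the replacement order, not the order of the data
levels(toy_recoded$income)
```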

With the dataset created, I will visualize the distribution of annual income, a categorical variable, using a bar chart.

``````
ggplot(data = d13) +
  geom_bar(aes(annincome))
``````

To see the exact number of observations in each category, I can also calculate these values with `count()`.

``````
d13 %>%
  count(annincome)
``````
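Counts are easier to compare once they are turned into shares. As a small sketch, assuming simulated income groups in place of `d13` (the `toy` tibble is made up for illustration), a `mutate()` after `count()` adds the proportion of each category:

```r
library(dplyr)

# Simulated stand-in for d13$annincome; not NHANES data
set.seed(1)
groups <- c("lowest", "low", "medium", "high", "highest")
toy <- tibble(
  annincome = factor(sample(groups, 200, replace = TRUE), levels = groups)
)

income_share <- toy %>%
  count(annincome) %>%
  mutate(prop = n / sum(n))   # share of all observations in each group

income_share
```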

For a continuous variable, a histogram is the usual choice. Here I look at how BMI is distributed in the NHANES 2013-2014 sample, with `binwidth = 5`, which cuts the variable into bins 5 units wide.

``````
ggplot(data = d13) +
  geom_histogram(aes(BMXBMI), binwidth = 5)
``````

Combining 'ggplot2' and 'dplyr', I can get the count behind each bar by passing `cut_width()`, with a width of 5 units, to `count()`.

``````
d13 %>%
  count(cut_width(BMXBMI, 5))
``````
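The bin labels may look odd at first (edges such as 12.5 and 17.5), because by default the bins are centred on multiples of the width. If round edges are easier to read, `cut_width()` accepts a `boundary` argument. The sketch below uses simulated values, not NHANES BMI; the `toy` tibble and its `bmi` column are assumptions for illustration:

```r
library(dplyr)
library(ggplot2)

# Simulated BMI-like values; not NHANES data
set.seed(42)
toy <- tibble(bmi = rnorm(500, mean = 27, sd = 6))

# boundary = 0 places the bin edges on multiples of 5 (..., 20, 25, 30, ...)
bmi_bins <- toy %>%
  count(bin = cut_width(bmi, width = 5, boundary = 0))

bmi_bins
```

To keep a histogram consistent with such a table, the same `boundary = 0` can be passed to `geom_histogram()` alongside `binwidth = 5`.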

To combine the information shown previously, BMI and annual income, in the same plot, I will use `geom_freqpoly()`, which overlays one line per income group in place of multiple histograms.

``````
ggplot(data = d13, aes(BMXBMI, color = annincome)) +
  geom_freqpoly(binwidth = 1)
``````

### A categorical and a continuous variable
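Frequency polygons coloured by group already compare a continuous variable across the levels of a categorical one, but raw counts are hard to compare when the groups differ in size. One common adjustment, sketched here on simulated data rather than NHANES, is to plot density instead of count so that the area under each line is one. The `toy` data frame and its group names are made up for illustration, and `after_stat()` assumes ggplot2 3.3.0 or later.

```r
library(ggplot2)

# Two simulated groups of very different size; not NHANES data
set.seed(7)
toy <- data.frame(
  bmi   = c(rnorm(300, mean = 26, sd = 5), rnorm(60, mean = 29, sd = 6)),
  group = rep(c("large group", "small group"), times = c(300, 60))
)

# y = after_stat(density) rescales each line to unit area,
# so the shapes can be compared regardless of group size
p <- ggplot(toy, aes(bmi, colour = group)) +
  geom_freqpoly(aes(y = after_stat(density)), binwidth = 1)

p
```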

Now I will show how a continuous variable is distributed within each level of a categorical variable, using a boxplot.

``````
ggplot(data = d13, aes(annincome, BMXBMI)) +
  geom_boxplot()
``````

For each box, the middle line is the median (50th percentile) of that category. In my case, for the medium annual income category, the median BMI is ~27. The upper and lower edges of the box show the 75th percentile (BMI=31) and the 25th percentile (BMI=20), and the distance between them is called the interquartile range.
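The values read off the boxes can also be computed directly. The following is a minimal sketch on simulated data (the `toy` tibble is an assumption, so its numbers will not match the NHANES figures quoted above), using `group_by()` and `summarise()` to get the quartiles and the interquartile range per category:

```r
library(dplyr)

# Simulated stand-in for d13; not NHANES data
set.seed(3)
toy <- tibble(
  annincome = rep(c("low", "medium", "high"), each = 100),
  BMXBMI    = rnorm(300, mean = 27, sd = 6)
)

box_stats <- toy %>%
  group_by(annincome) %>%
  summarise(
    q25    = quantile(BMXBMI, 0.25),  # lower edge of the box
    median = median(BMXBMI),          # middle line
    q75    = quantile(BMXBMI, 0.75),  # upper edge of the box
    iqr    = IQR(BMXBMI)              # height of the box
  )

box_stats
```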

### Two categorical variables

For two categorical variables, I want to visualize the relation between them and also know the number of observations in each combination, so I will use `geom_tile()` with the `fill` aesthetic to produce the graph below.

``````
d13 %>%
  mutate(race = recode_factor(RIDRETH1,
    `1` = "Mexican American",
    `2` = "Hispanic",
    `3` = "Non-Hispanic, White",
    `4` = "Non-Hispanic, Black",
    `5` = "Others")) %>%
  count(race, annincome) %>%
  ggplot(aes(race, annincome)) +
  geom_tile(aes(fill = n))
``````

### Two continuous variables

Below, I will look at how BMI and cholesterol relate to each other in a scatterplot.

``````
data13 = d13 %>%
  left_join(nhanes_load_data("TCHOL_H", "2013-2014"), by = "SEQN") %>%
  select(SEQN, wave, INDFMIN2, RIDRETH1, BMXBMI, LBXTC)

ggplot(data = data13) +
  geom_point(aes(BMXBMI, LBXTC))
``````

Because the points overplot in the previous scatterplot, I can use the `alpha` aesthetic to make the graph more useful.

``````
ggplot(data = data13) +
  geom_point(aes(BMXBMI, LBXTC),
    alpha = 1/20)
``````

Another way to visualize the relationship between two continuous variables is to bin one of them and treat it as categorical. Using `cut_number()` makes the comparison fairer, as each bin contains approximately the same number of points.
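To see what "the same number of points in each bin" means in practice, here is a small sketch on a simulated right-skewed variable (the vector `x` is made up, not NHANES data) that compares the bin counts from `cut_width()` and `cut_number()`:

```r
library(ggplot2)

# Simulated right-skewed values; not NHANES data
set.seed(11)
x <- rexp(1000, rate = 1 / 10)

# Equal-width bins: most points fall into the first few bins
width_counts <- table(cut_width(x, width = 10, boundary = 0))

# Equal-count bins: the edges follow the quantiles of x
number_counts <- table(cut_number(x, n = 5))

width_counts
number_counts
```

Equal-width bins leave few points in the sparse tail, while `cut_number()` keeps the bins comparably filled. The plot below applies the same idea to BMI.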

``````
ggplot(data = data13, aes(BMXBMI, LBXTC)) +
  geom_boxplot(aes(group = cut_number(BMXBMI, 20)))
``````

I hope this post helps you choose the right way to illustrate distributions and relations within and between variables.
