Exploratory Data Analysis Using R (Part-I)

[This article was first published on R Language in Datazar on Medium, and kindly contributed to R-bloggers].
The greatest value of a picture is when it forces us to notice what we never expected to see. — John W. Tukey. Exploratory Data Analysis.

Why do we use exploratory graphs in data analysis?

Data – We will use the airquality dataset that ships with R for our analysis. The entire project can be found here. You can try it for yourself by running it on Datazar.

library(datasets)
head(airquality)
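Beyond the first few rows, it can help to check the column types and how many values are missing before summarising anything. The following is a minimal sketch using only base R; the variable name na_counts is chosen here for illustration and is not part of the original project.

```r
# First look at the built-in airquality data frame
str(airquality)                          # column types and dimensions

# Count missing values per column (name na_counts is illustrative)
na_counts <- colSums(is.na(airquality))
print(na_counts)
```

Columns with missing values need care later on, for example by passing na.rm = TRUE to functions such as mean() or range().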

Summaries of Data

One dimensional Data – Univariate EDA for a quantitative variable is a way to make preliminary assessments about the population distribution of the variable using the data of the observed sample.

When we are dealing with a single variable, say temperature, wind speed, or age, the following techniques are used for the initial exploratory data analysis.

summary(airquality$Wind)
Summary Of Windspeed
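IQR (interquartile range) = Q3 - Q1 (the height of the box in the plot)
whiskers extend to the most extreme data points that lie within 1.5 * IQR of the box; points beyond them are drawn individually as outliers. The quantity ±1.58 * IQR / √n, where n is the number of data points, is the extent of the notches R draws when notch = TRUE, not the whisker length.
boxplot(Wind ~ Month, data = airquality, col = "purple")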
Wind Speed by Month
hist(airquality$Wind, col = "gold")
rug(airquality$Wind) # (optional) marks each observation as a tick below the histogram
barplot(table(chickwts$feed), col = "wheat", main = "Number of Chickens by Diet Type")
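To make the box and whisker definitions concrete, the whisker ends for wind speed can be computed by hand and compared with what R reports. This is a sketch in base R; the variable names are illustrative. It uses fivenum() because boxplot() works with Tukey's hinges, which can differ slightly from the quartiles returned by quantile() or IQR().

```r
wind <- airquality$Wind

# Tukey's five-number summary: min, lower hinge, median, upper hinge, max
five <- fivenum(wind)
q1 <- five[2]
q3 <- five[4]
iqr_hinge <- q3 - q1               # height of the box

# Fences: whiskers may not extend past these values
lower_fence <- q1 - 1.5 * iqr_hinge
upper_fence <- q3 + 1.5 * iqr_hinge

# Whisker ends are the most extreme observations inside the fences
whisker_low  <- min(wind[wind >= lower_fence])
whisker_high <- max(wind[wind <= upper_fence])

# Cross-check against R's own computation
bs <- boxplot.stats(wind)
print(c(whisker_low, whisker_high))
print(bs$stats[c(1, 5)])           # whisker ends as computed by R
print(bs$out)                      # observations drawn as individual points
```

The two printed pairs should agree, which confirms that the whiskers follow the 1.5 * IQR rule rather than the notch formula.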

Two dimensional Data – Multivariate non-graphical EDA techniques generally show the relationship between two or more variables in the form of either cross-tabulation or statistics.
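As a rough illustration of both forms, the sketch below builds a cross-tabulation and a correlation from the airquality data. The "hot day" indicator and its 80 degree cut-off are assumptions made here purely for the example; they do not come from the original project.

```r
# Cross-tabulation: month versus a derived "hot day" indicator.
# The 80 degree F cut-off is an arbitrary choice for illustration.
hot <- ifelse(airquality$Temp > 80, "hot", "not hot")
month_by_hot <- table(Month = airquality$Month, Day = hot)
print(month_by_hot)

# Statistic: Pearson correlation between two quantitative variables
wind_temp_cor <- cor(airquality$Wind, airquality$Temp)
print(wind_temp_cor)
```

The table summarises two categorical variables, while cor() condenses the linear relationship between two quantitative variables into a single number between -1 and 1.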

For two quantitative variables, the basic graphical EDA technique is the scatterplot which has one variable on the x-axis, one on the y-axis and a point for each case in your dataset. If one variable is explanatory and the other is outcome, it is a very, very strong convention to put the outcome on the y (vertical) axis.

One or two additional categorical variables can be accommodated on the scatterplot by encoding the additional information in the symbol type and/or color.
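A minimal sketch of this idea with the airquality data: ozone is treated as the outcome on the y-axis, wind speed as the explanatory variable on the x-axis, and month is encoded as color. Treating ozone as the outcome is an assumption made for the example.

```r
# Month is stored as a number (5 to 9); make it a factor so it can drive color
month <- factor(airquality$Month)

# Outcome (Ozone) on the y-axis, explanatory variable (Wind) on the x-axis
plot(airquality$Wind, airquality$Ozone,
     col = as.integer(month),      # one palette color per month
     pch = 19,
     xlab = "Wind (mph)", ylab = "Ozone (ppb)")

# Legend mapping each color back to its month
legend("topright", legend = levels(month),
       col = seq_along(levels(month)), pch = 19, title = "Month")
```

Rows with a missing Ozone value are silently skipped by plot(), so fewer points appear than there are rows in the data.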

We will use the Males.csv dataset (present in the project on Datazar) to check whether being part of a union impacts the salaries of young American males.

males <- read.csv("dataset0.csv")
head(males)
samplemales <- males[1:100, ] # use only the first 100 rows
with(samplemales, plot(exper, wage, col = as.integer(factor(union))))
# union is a categorical variable represented by color; factor() is needed because read.csv() in R 4.0 and later keeps strings as character
Scatter plot of wage vs experience (the color represents whether the employee is part of a union)

We can also use multiple scatter plots to better understand whether being part of a union impacts an employee's salary.
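One way to draw such side-by-side plots is sketched below. Because the Males data is not bundled with R, the sketch uses airquality instead, with a derived two-level "hot day" indicator standing in for the union variable; that indicator and its 80 degree cut-off are assumptions made only for this example.

```r
# Two-level grouping variable (stand-in for union membership)
hot <- ifelse(airquality$Temp > 80, "hot", "not hot")
groups <- split(airquality, hot)

# Shared axis limits so the panels are directly comparable
x_lim <- range(airquality$Wind)
y_lim <- range(airquality$Ozone, na.rm = TRUE)

# One panel per group, arranged in a single row
old_par <- par(mfrow = c(1, length(groups)))
for (g in names(groups)) {
  plot(groups[[g]]$Wind, groups[[g]]$Ozone,
       xlim = x_lim, ylim = y_lim,
       xlab = "Wind (mph)", ylab = "Ozone (ppb)",
       main = g)
}
par(old_par)                       # restore the previous layout
```

Keeping the axis limits identical across panels matters: without it, each panel rescales to its own data and differences between groups are easy to misread.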

We can see that most employees are not part of a union, and that they tend to earn more than employees who are. Correlation does not imply causation, though: it might be, for example, that high-paying industries do not allow their employees to form unions.

In a nutshell: you should always perform appropriate EDA before further analysis of your data.

Lastly, I wish you all a merry Christmas and a very happy new year. I will come back with the next edition of EDA in the New Year. Till then, happy modeling!


