Exploratory data analysis techniques

[This article was first published on Data Perspective, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In my previous blog post I have explained the steps needed to solve a data analysis problem. Going further, I will be discussing in-detail each and every step of Data Analysis. In this post, we shall discuss about exploratory Analysis.

What is Exploratory Analysis?
“Understanding data visually”

Exploratory Analysis means analyzing the datasets to summarize their main characteristics, often visually. This is the first step of any data analysis.
Objectives:
  • Know the data types of the dataset – whether continuous/discreet/categorical
  • Understand how the data is distributed
  • Extract important input variables for the analysis
  • Identify outliers
  • To identity patterns, if exists

Exploratory Analysis Techniques:
  • Box-Plot
  • Histogram
  • Trend analysis
  • Scatter Plots

Let us understand the exploratory analysis by considering a data analysis problem.

Problem statement:
to analyze the incidents/events occurred over past 3 years and try to predict the event occurring in the future.

Solution: 
After understanding the problem statement and gaining the sufficient domain knowledgeIdentify the data sources & download the data into the programming environment.

The next step is to perform an Exploratory analysis as explained here. in today’s post we shall look how  exploratory analysis can be done.

Types of Exploratory analysis:

Type1: Understanding the data – variable names, dimensions of the dataset, data types of each and every variable. 

data = read.csv(“datasource.csv”) #load data 

dim(data) #know the dimensions of the data
[1] 839 50

Colnames(data) #know the column names 
[1] “Incidents” “Year of Occurance” “Location.of.Occurrence” “Date.of.Occurence” [5]”Time.of.Occurrence” “Operational Phase”

Str(data) # know the data types of each of the variable – continuous/descrete/categorical
$ Incidents: int 41505 41537 41539 41565 41589 41596 41598
$ Vehicle.Type : Factor w/ 7 levels “”,”Volvo(all series)”,..: 6 2 2 2 6 6
$ Location.of.Occurrence: Factor w/ 101 levels “”,”Abidjan”,”Accra”,..: 53 35 35 35 96
$ Date.of.Occurence: Factor w/ 520 levels “1/1/2010″,”1/1/2012” ..: 1 32 37 

Sum(!is.na(data$Date.of.Occurance) # counting the number of missing values in the column

Type2:Creating new varaibles/data type conversion suitable for the analysis – like factor variables into numerical,dates into year/month/day,time into hour of the day, etc. according to our convineince. 

#Extracting year/month from Date of occurrence and creating new variables
xn = as.POSIXct(data$”Date.of.Occurence”,format=”%m/%d/%Y”)
data[“year”] = as.numeric(format(xn,”%Y”))
data[“month”] = as.numeric(format(xn,”%m”))
str(data$”year”) 
num [1:839] 2010 2010 2010 2010 2010 2010
#extracting hour of the day and creating new variable TimeOfOccurance
(TOC) data[“TOC”] = sub(“:.*”, “”, data$”Time.of.Occurrence”)
str(data$”TOC”)
chr [1:839] “18” “6” “21” “13” “16” “13” “11” “11” “15” “1” “13” “6”

Type3: Observe the summary of each and every variable to understand the variables. summary(data$”Vehicle.Type”)
Volvo (all series) ASHOK (all series) FIAT (all series) Maruthi (all series)
       210                    49                     71                      39 
Type4: Decide which variables are good for analysis by using trends, boxplots, histograms etc. 

boxplot(formula=as.numeric(data$”operational.Phase”)~data$”year”,col=”blue”)


Box plot distribution of incidents occurring over the years.

hist(data[which(data$”year” == 2011),]$”month”,breaks = “Sturges”,col=c(‘blue’,’red’,’green’),labels=T) 
The above histogram depicts the month wise distribution of incidents occurred in 2011 Trend analysis
In the above graph, we can bring out the below inferences:
Sharp fall in the data in 2012 might be not capturing of the incidents
An average of 30 incidents occurring monthly
In the month of Feb there is sharp fall in the incidents

In the above trend image with graph in red color is plotted against Number of people in the deck and number of Incidents.
This clearly says that there is no relation between the Incident occurring and number of people in the deck.

Hope the above post gives you a very good understanding of how exploratory analysis can be done. In my next post we shall learn how to do forecasting using Linear regression. 



To leave a comment for the author, please follow the link and comment on their blog: Data Perspective.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)