Exploratory Data Analysis (EDA) for Journal Submissions
Key points
- Exploratory data analysis (EDA) is crucial in any data analysis project. It involves exploring, summarizing, and visualizing your data to gain insights, identify patterns, and detect outliers.
- EDA can also help you formulate hypotheses, choose appropriate statistical tests, and communicate your findings effectively.
- In this article, I will explain how I perform EDA in R using the tidyverse, a collection of packages for data manipulation, visualization, and modeling, as I did for my article in an impact-factor journal.
- I will use a generated dataset for this tutorial that contains information about 1000 students from different countries, their academic performance, and their satisfaction with their university.
- You will learn how to load and view the data in R, summarize the data using descriptive statistics, visualize the data using charts and graphs, identify missing values and outliers, transform and filter the data, perform hypothesis testing and correlation analysis, and generate an EDA report using R Markdown.
Packages, Functions, and Their Descriptions
| Package | Function | Description |
| --- | --- | --- |
| utils | data() | Load a built-in dataset |
| utils | head() | View the first six rows of a dataset (by default) |
| base | summary() | Summarize a dataset using descriptive statistics |
| ggplot2 (tidyverse) | ggplot() | Create a plot using the grammar of graphics |
| ggplot2 (tidyverse) | geom_bar() | Add a bar chart layer to a plot |
| ggplot2 (tidyverse) | geom_histogram() | Add a histogram layer to a plot |
| ggplot2 (tidyverse) | geom_boxplot() | Add a boxplot layer to a plot |
| ggplot2 (tidyverse) | geom_point() | Add a scatterplot layer to a plot |
| ggplot2 (tidyverse) | geom_smooth() | Add a smoothed line layer to a plot |
| ggplot2 (tidyverse) | facet_wrap() | Wrap a plot into multiple panels based on a factor |
| ggplot2 (tidyverse) | aes() | Define the aesthetic mapping of a plot |
| ggplot2 (tidyverse) | labs() | Modify the labels of a plot |
| ggplot2 (tidyverse) | theme() | Modify the theme of a plot |
| dplyr (tidyverse) | filter() | Filter rows of a dataset based on a condition |
| dplyr (tidyverse) | select() | Select columns of a dataset |
| dplyr (tidyverse) | mutate() | Create or modify columns of a dataset |
| dplyr (tidyverse) | group_by() | Group a dataset by one or more variables |
| dplyr (tidyverse) | summarize() | Summarize a dataset by applying a function to each group |
| dplyr (tidyverse) | arrange() | Arrange rows of a dataset by one or more variables |
| stats | na.omit() | Remove rows with missing values from a dataset |
| base | is.na() | Check if a value is missing |
| stats | t.test() | Perform a t-test |
| stats | cor.test() | Perform a correlation test |
| rmarkdown | rmarkdown::render() | Render an R Markdown document |
Hi, I’m Zubair Goraya, a PhD scholar and a certified freelance data analyst with 5 years of experience. I’m also a contributor to Data Analysis, a website that provides tutorials related to RStudio. I am passionate about data science and statistics and enjoy sharing my knowledge and skills with others. I have published several papers in international journals and helped many students and researchers with their data analysis projects.
In this article, I will share my insights on exploratory data analysis (EDA) in R and how it can help you prepare your data for international journal publication.
Exploratory Data Analysis (EDA) for Journal-Ready Data
Data is everywhere. We live in a world where we can collect, store, and analyze massive amounts of data from various sources and domains. Data can help us understand the world better, make informed decisions, and solve complex problems. However, data alone is not enough. We must process, transform, and interpret the data to extract meaningful information and insights. This is where data analysis comes in. Data analysis is commonly divided into two phases:
- Exploratory data analysis (EDA) is the first phase, where we explore, summarize, and visualize the data to gain insights, identify patterns, and detect outliers.
- Confirmatory data analysis (CDA) is the second phase, where we confirm, validate, and generalize the findings from EDA using statistical tests and models.
My Journey with EDA in R
EDA is a crucial step in any data analysis project. It helps us understand the characteristics, distributions, and relationships of the variables in our data. It also helps us formulate hypotheses, choose appropriate statistical tests, and communicate our findings effectively. EDA can also reveal problems with the data, such as
- Missing values,
- Outliers, or
- Errors,
and it helps us fix them before proceeding to the next phase.
Data
The first step of EDA is to generate and load the data in R. I will use random data generated using R to create a dataset with the variables and values I want. Alternatively, you can use any other tool of your choice or use a real dataset that you have. I generate this data set by using the following code:
    # Set the seed for reproducibility
    set.seed(123)

    # Generate the dataset
    student_data <- data.frame(
      id = 1:1000, # Unique identifier
      country = sample(c("China", "India", "USA", "UK", "Canada", "Brazil"), 1000,
                       replace = TRUE,
                       prob = c(0.2, 0.2, 0.15, 0.15, 0.15, 0.15)), # Country of origin
      gender = sample(c("Male", "Female"), 1000,
                      replace = TRUE, prob = c(0.5, 0.5)), # Gender
      age = sample(18:25, 1000, replace = TRUE), # Age
      major = sample(c("Math", "CS", "Econ", "Eng", "Bio", "Art"), 1000,
                     replace = TRUE,
                     prob = c(0.2, 0.2, 0.15, 0.15, 0.15, 0.15)), # Major field of study
      gpa = round(runif(1000, min = 2, max = 4), 1), # Grade point average
      sat = sample(seq(1000, 1600, by = 50), 1000, replace = TRUE), # SAT score
      toefl = sample(seq(80, 120, by = 5), 1000, replace = TRUE), # TOEFL score
      ielts = round(runif(1000, min = 5, max = 9), 1), # IELTS score
      gre = sample(seq(260, 340, by = 10), 1000, replace = TRUE), # GRE score
      satisfaction = sample(1:5, 1000, replace = TRUE) # Satisfaction level
    )
This code creates a data frame called student_data with 1000 rows and 11 columns. Each row represents a student, and each column represents a variable. The variables are:
- id: A unique identifier for each student.
- country: The country of origin of the student, with six possible values: China, India, USA, UK, Canada, and Brazil. China and India are each drawn with probability 0.2, and the other four countries with probability 0.15 each.
- gender: The student's gender, with two possible values: Male and Female. Each value is drawn with probability 0.5, so the dataset contains roughly equal numbers of male and female students.
- age: The student's age, an integer from 18 to 25, sampled uniformly.
- major: The major field of study of the student, with six possible values: Math, CS, Econ, Eng, Bio, and Art. Math and CS are each drawn with probability 0.2, and the other four majors with probability 0.15 each.
- gpa: The student's grade point average, from 2 to 4. Each value is drawn from a uniform distribution and rounded to one decimal place.
- sat: The student's score on the SAT test, from 1000 to 1600. Each value is sampled uniformly in steps of 50.
- toefl: The student's score on the TOEFL test, from 80 to 120. Each value is sampled uniformly in steps of 5.
- ielts: The student's score on the IELTS test, from 5 to 9. Each value is drawn from a uniform distribution and rounded to one decimal place.
- gre: The student's score on the GRE test, from 260 to 340. Each value is sampled uniformly in steps of 10.
- satisfaction: The student's level of satisfaction with their university, on a scale from 1 (very dissatisfied) to 5 (very satisfied). Each value is sampled uniformly.
In a real dataset, some students may not have taken the TOEFL, IELTS, or GRE test, or may not have reported their score; such values would appear as NA, which means "not available". The code above, as written, generates complete data with no missing values.
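If you want the test-score columns to contain missing values, one option is to blank out a random share of entries after generating the data. The following is a minimal sketch under stated assumptions, not part of the original code: the helper add_missing(), the 20% rate, and the stand-in data frame scores are all hypothetical and chosen only so the example runs on its own.

```r
set.seed(123)

# Small stand-in for student_data (same column names, fewer columns)
scores <- data.frame(
  id    = 1:1000,
  toefl = sample(seq(80, 120, by = 5), 1000, replace = TRUE),
  ielts = round(runif(1000, min = 5, max = 9), 1),
  gre   = sample(seq(260, 340, by = 10), 1000, replace = TRUE)
)

# Hypothetical helper: set a random share `prop` of a vector's entries to NA
add_missing <- function(x, prop = 0.2) {
  n_missing <- round(length(x) * prop)
  x[sample(length(x), n_missing)] <- NA
  x
}

# Work on a copy so the complete data stays available
scores_na <- scores
scores_na$toefl <- add_missing(scores_na$toefl)
scores_na$ielts <- add_missing(scores_na$ielts)
scores_na$gre   <- add_missing(scores_na$gre)

# Count the missing values in each column
colSums(is.na(scores_na))
```

Because the helper removes a fixed share of entries rather than flipping a coin per row, each of the three score columns ends up with exactly 200 NA values out of 1000, while id stays complete.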
Overview of the data
To view the variable names, the number of rows and columns, the structure of the data, and the top five rows, I use the following code:
    # Names of the variables
    names(student_data)

    # Dimensions of the dataset (rows, columns)
    dim(student_data)

    # Structure of the data
    str(student_data)

    # Top five rows of the data
    head(student_data, 5)
The output should look like this:
Summarizing the data using Descriptive Statistics
The next step of EDA is to summarize the data using descriptive statistics. Descriptive statistics are numerical measures that describe the characteristics of the data, such as the mean, median, mode, standard deviation, range, frequency, and percentage. Descriptive statistics can help us understand the data's central tendency, variability, and distribution. Before computing them, we need one data transformation: character variables should be converted into factor variables. This can be done with base R functions or with the mutate() function from the dplyr package, which is part of the tidyverse.
I will use the summary() function to summarize the data using descriptive statistics, which returns a summary of each variable in the dataset, including the minimum, maximum, mean, median, first quartile, third quartile, and number of missing values. I will use the following code:
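    library(dplyr)

    # Convert every character column to a factor
    # (across() with where() is the current replacement for mutate_if())
    student_data <- student_data %>%
      mutate(across(where(is.character), as.factor))

    summary(student_data)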
The output I get
From this output, I can see the descriptive statistics of each variable in the dataset. For example, I can see that the mean age of the students is 21.49, the mean GPA is 3.002, and the mean satisfaction is 3.01. For the factor variables (country, gender, and major), summary() reports the count of each level in alphabetical order, for example Brazil, Canada, China, and India for country, so the most common level must be read from the counts and not from the order in which the levels are listed. I will deal with the missing values later in this article.
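summary() does not report the standard deviation, and it gives counts but not percentages for factors. A minimal sketch of how these could be computed with dplyr is shown below; the stand-in data frame demo and the result names gpa_by_major and major_freq are assumptions made for illustration, not objects from the article.

```r
library(dplyr)

set.seed(123)

# Small stand-in dataset with one factor-like column and one numeric column
demo <- data.frame(
  major = sample(c("Math", "CS", "Econ"), 300, replace = TRUE),
  gpa   = round(runif(300, min = 2, max = 4), 1)
)

# Central tendency and variability of gpa within each major
gpa_by_major <- demo %>%
  group_by(major) %>%
  summarize(
    n      = n(),
    mean   = mean(gpa),
    median = median(gpa),
    sd     = sd(gpa),
    min    = min(gpa),
    max    = max(gpa)
  )

# Frequency and percentage of each major
major_freq <- demo %>%
  count(major, name = "n") %>%
  mutate(percent = 100 * n / sum(n))

print(gpa_by_major)
print(major_freq)
```

The same pattern should carry over to student_data by replacing demo with student_data and choosing the grouping and numeric columns of interest.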
Visualizing the Data using Graphs
The next step of EDA is to visualize the data using charts and graphs. Charts and graphs are graphical representations of the data that can help us see the data's patterns, trends, and outliers. Charts and graphs can also help us compare the variables and their distributions and explore their relationships.
I will use the ggplot() function from the ggplot2 package, which is part of the tidyverse, to visualize the data. The ggplot() function allows us to create a plot using the grammar of graphics, a system for describing and building graphs using layers. Each component can specify a different aspect of the plot, such as the data, the aesthetic mapping, the geometric object, the statistical transformation, the scale, the coordinate system, the facet, the label, and the theme.
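To make the components of the grammar of graphics concrete, here is a minimal sketch that names each one in a single plot. The stand-in data frame demo, the bin width, and the choice of theme are assumptions made for illustration.

```r
library(ggplot2)

set.seed(123)

# Small stand-in dataset
demo <- data.frame(
  gender = sample(c("Male", "Female"), 200, replace = TRUE),
  gpa    = round(runif(200, min = 2, max = 4), 1)
)

p <- ggplot(demo, aes(x = gpa)) +                    # data and aesthetic mapping
  geom_histogram(binwidth = 0.25, boundary = 2) +    # geometric object; its default stat bins the data
  scale_x_continuous(breaks = seq(2, 4, by = 0.5)) + # scale
  coord_cartesian() +                                # coordinate system (the default, written out)
  facet_wrap(~ gender) +                             # facet: one panel per gender
  labs(title = "GPA by gender",                      # labels
       x = "GPA",
       y = "Count") +
  theme_minimal()                                    # theme

print(p)
```

Each line after ggplot() is added with +, so any component can be removed or swapped without touching the others.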
In this article, I will use the following types of charts and graphs to visualize the data:
Bar chart
A bar chart is a graph that uses rectangular bars to show the frequency or proportion of a categorical variable. A bar chart can help us see the distribution and comparison of a categorical variable across different levels or groups.
    library(ggplot2)

    # Bar chart of country
    ggplot(student_data, aes(x = country)) +  # Define the data and the x-axis variable
      geom_bar() +                            # Add a bar chart layer
      labs(title = "Bar chart of country",    # Add a title
           x = "Country",                     # Add a label for the x-axis
           y = "Count")                       # Add a label for the y-axis
Bar Chart of Gender
    # Bar chart of gender
    ggplot(student_data, aes(x = gender)) +  # Define the data and the x-axis variable
      geom_bar() +                           # Add a bar chart layer
      labs(title = "Bar chart of gender",    # Add a title
           x = "Gender",                     # Add a label for the x-axis
           y = "Count")                      # Add a label for the y-axis
Bar Chart of Major
    # Bar chart of major
    ggplot(student_data, aes(x = major, fill = major)) +  # Define the data, the x-axis variable, and the fill color
      geom_bar() +                                        # Add a bar chart layer
      labs(title = "Bar chart of major",                  # Add a title
           x = "Major",                                   # Add a label for the x-axis
           y = "Count")                                   # Add a label for the y-axis
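The bar charts above show frequencies. To show proportions instead, one approach is to compute the share of each level first and plot it with geom_col(). This is a sketch under assumptions: the stand-in data frame demo and the name country_share are hypothetical.

```r
library(dplyr)
library(ggplot2)

set.seed(123)

# Small stand-in dataset
demo <- data.frame(
  country = sample(c("China", "India", "USA"), 300,
                   replace = TRUE, prob = c(0.4, 0.4, 0.2))
)

# Share of each country among all rows
country_share <- demo %>%
  count(country) %>%
  mutate(share = n / sum(n))

# geom_col() plots the precomputed shares; reorder() sorts bars from largest to smallest
p <- ggplot(country_share, aes(x = reorder(country, -share), y = share)) +
  geom_col() +
  labs(title = "Proportion of students by country",
       x = "Country",
       y = "Proportion")

print(p)
```

geom_bar() counts rows itself, while geom_col() takes the bar heights from the data, which is why the shares are computed before plotting.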
Histogram
A histogram is a graph that uses rectangular bars to show the frequency or density of a numerical variable. A histogram can help us see the shape and spread of a numerical variable and identify any outliers or gaps in the data.
    # Histogram of age
    ggplot(student_data, aes(x = age)) +  # Define the data and the x-axis variable
      geom_histogram(bins = 8) +          # Add a histogram layer with 8 bins
      labs(title = "Histogram of age",    # Add a title
           x = "Age",                     # Add a label for the x-axis
           y = "Count")                   # Add a label for the y-axis
Histogram of gpa
    # Histogram of gpa
    ggplot(student_data, aes(x = gpa)) +  # Define the data and the x-axis variable
      geom_histogram(bins = 10) +         # Add a histogram layer with 10 bins
      labs(title = "Histogram of gpa",    # Add a title
           x = "GPA",                     # Add a label for the x-axis
           y = "Count")                   # Add a label for the y-axis
Histogram of sat
    # Histogram of sat
    ggplot(student_data, aes(x = sat)) +  # Define the data and the x-axis variable
      geom_histogram(bins = 10) +         # Add a histogram layer with 10 bins
      labs(title = "Histogram of sat",    # Add a title
           x = "SAT",                     # Add a label for the x-axis
           y = "Count")                   # Add a label for the y-axis

Histogram of toefl
    # Histogram of toefl
    ggplot(student_data, aes(x = toefl)) +  # Define the data and the x-axis variable
      geom_histogram(bins = 10) +           # Add a histogram layer with 10 bins
      labs(title = "Histogram of toefl",    # Add a title
           x = "TOEFL",                     # Add a label for the x-axis
           y = "Count")                     # Add a label for the y-axis

Histogram of ielts
    # Histogram of ielts
    ggplot(student_data, aes(x = ielts)) +  # Define the data and the x-axis variable
      geom_histogram(bins = 10) +           # Add a histogram layer with 10 bins
      labs(title = "Histogram of ielts",    # Add a title
           x = "IELTS",                     # Add a label for the x-axis
           y = "Count")                     # Add a label for the y-axis
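For scores that only take values in fixed steps, such as sat in steps of 50, a fixed number of bins can place two distinct scores in some bins and one in others, which produces uneven bars that are not in the data. A possible alternative, sketched here under assumptions, is to set the bin width to the step size and center the bins on the scores. The same sketch shows a density histogram using after_stat(density). The stand-in data frame demo and the plot names are hypothetical.

```r
library(ggplot2)

set.seed(123)

# Small stand-in dataset: scores in steps of 50
demo <- data.frame(
  sat = sample(seq(1000, 1600, by = 50), 500, replace = TRUE)
)

# Frequency histogram: one bin per possible score, centered on the score
p_count <- ggplot(demo, aes(x = sat)) +
  geom_histogram(binwidth = 50, center = 1000) +
  labs(title = "Histogram of sat (one bin per score)",
       x = "SAT",
       y = "Count")

# Density histogram: bar areas sum to 1 instead of bar heights summing to n
p_density <- ggplot(demo, aes(x = sat)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 50, center = 1000) +
  labs(title = "Density histogram of sat",
       x = "SAT",
       y = "Density")

print(p_count)
print(p_density)
```

The density version is useful when comparing groups of different sizes, because the bar areas are on the same scale regardless of the number of rows.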

Boxplot
A boxplot is a graph that uses a box and whiskers to show the summary statistics of a numerical variable. A boxplot can help us see a numerical variable's median, quartiles, range, and outliers, and compare them across different levels or groups of a categorical variable.
    # Boxplot of age by country
    ggplot(student_data, aes(x = country, y = age, fill = country)) +  # Define the data and the x-axis and y-axis variables
      geom_boxplot() +                                                 # Add a boxplot layer
      labs(title = "Boxplot of age by country",                        # Add a title
           x = "Country",                                              # Add a label for the x-axis
           y = "Age")                                                  # Add a label for the y-axis
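By default, geom_boxplot() draws as outlier points the values lying more than 1.5 times the interquartile range beyond the box. The same rule can be applied in a table to list those rows. This is a minimal sketch under assumptions: the stand-in data frame demo, the planted value of 40, and the column names are hypothetical, and quantile() may give slightly different quartiles from the hinges used by the boxplot.

```r
library(dplyr)

set.seed(123)

# Small stand-in dataset: two countries, ages 18 to 25
demo <- data.frame(
  country = rep(c("China", "India"), each = 50),
  age     = sample(18:25, 100, replace = TRUE)
)

# Plant one extreme value, for illustration only
demo$age[1] <- 40

# Flag values beyond 1.5 * IQR from the quartiles, within each country
flagged <- demo %>%
  group_by(country) %>%
  mutate(
    q1         = quantile(age, 0.25),
    q3         = quantile(age, 0.75),
    iqr        = q3 - q1,
    is_outlier = age < q1 - 1.5 * iqr | age > q3 + 1.5 * iqr
  ) %>%
  ungroup()

# List the flagged rows
filter(flagged, is_outlier)
```

A flagged value is a candidate for inspection, not necessarily an error; whether to keep, correct, or remove it depends on how the data were collected.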

Boxplot of Satisfaction by Country