Exploratory Data Analysis (EDA) for Journal Submissions

Posted on November 8, 2023 by Zubair Goraya in R bloggers | 0 Comments

[This article was first published on RStudioDataLab, and kindly contributed to R-bloggers].

Key points

Exploratory data analysis (EDA) is crucial in any data analysis project. It involves exploring, summarizing, and visualizing your data to gain insights, identify patterns, and detect outliers.
EDA can also help you formulate hypotheses, choose appropriate statistical tests, and communicate your findings effectively.
In this article, I explain how I perform EDA in R using the tidyverse, a collection of packages for data manipulation, visualization, and modeling, as I do for my own articles in impact-factor journals.
For this tutorial I use a generated dataset with information about 1000 students from different countries, their academic performance, and their satisfaction with their university.
You will learn how to load and view the data in R, summarize the data using descriptive statistics, visualize the data using charts and graphs, identify missing values and outliers, transform and filter the data, perform hypothesis testing and correlation analysis, and generate an EDA report using R Markdown.

Packages and Functions with Descriptions

The following functions, from base R, the tidyverse packages, and rmarkdown, are used in this article.

Function	Description
data()	Load a built-in dataset
head()	View the first six rows of a dataset
summary()	Summarize a dataset using descriptive statistics
ggplot()	Create a plot using the grammar of graphics
geom_bar()	Add a bar chart layer to a plot
geom_histogram()	Add a histogram layer to a plot
geom_boxplot()	Add a boxplot layer to a plot
geom_point()	Add a scatterplot layer to a plot
geom_smooth()	Add a smoothed line layer to a plot
facet_wrap()	Wrap a plot into multiple panels based on a factor
aes()	Define the aesthetic mapping of a plot
labs()	Modify the labels of a plot
theme()	Modify the theme of the plot
filter()	Filter rows of a dataset based on a condition
select()	Select columns of a dataset
mutate()	Create or modify columns of a dataset
group_by()	Group a dataset by one or more variables
summarize()	Summarize a dataset by applying a function to each group
arrange()	Arrange rows of a dataset by one or more variables
na.omit()	Remove rows with missing values from a dataset
is.na()	Check if a value is missing
t.test()	Perform a t-test
cor.test()	Perform a correlation test
rmarkdown::render()	Render an R Markdown document
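
As a rough illustration of how the dplyr verbs in the table chain together, here is a minimal sketch. The demo data frame and the honours cut-off of 3.5 are made up for this example and are not part of the dataset used in the article.

```r
library(dplyr)

set.seed(123)
# Hypothetical stand-in data; column names mirror the article's dataset
demo <- data.frame(
  major = sample(c("Math", "CS", "Art"), 100, replace = TRUE),
  age   = sample(18:25, 100, replace = TRUE),
  gpa   = round(runif(100, min = 2, max = 4), 1)
)

major_summary <- demo %>%
  filter(age >= 20) %>%                  # keep students aged 20 or older
  select(major, gpa) %>%                 # keep only the columns we need
  mutate(honours = gpa >= 3.5) %>%       # assumed cut-off, for illustration only
  group_by(major) %>%                    # one group per major
  summarize(
    n             = n(),                 # students per major
    mean_gpa      = mean(gpa),           # average GPA per major
    share_honours = mean(honours),       # proportion at or above the cut-off
    .groups       = "drop"
  ) %>%
  arrange(desc(mean_gpa))                # highest average GPA first

major_summary
```

Each verb takes a data frame and returns a data frame, which is what makes the pipe-based style readable from top to bottom.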

Hi, I’m Zubair Goraya, a PhD scholar and a certified freelance data analyst with 5 years of experience. I’m also a contributor to Data Analysis, a website that provides tutorials related to RStudio. I am passionate about data science and statistics and enjoy sharing my knowledge and skills with others. I have published several papers in international journals and helped many students and researchers with their data analysis projects.

In this article, I will share my insights on exploratory data analysis (EDA) in R and how it can help you prepare your data for international journal publication.

Exploratory Data Analysis (EDA) for Journal-Ready Data

Data is everywhere. We live in a world where we can collect, store, and analyze massive amounts of data from various sources and domains. Data can help us understand the world better, make informed decisions, and solve complex problems. However, data alone is not enough. We must process, transform, and interpret the data to extract meaningful information and insights. This is where data analysis comes in.

Data analysis is the application of statistical and computational methods to data in order to answer questions, test hypotheses, and discover patterns. It can be divided into two main phases:

Exploratory data analysis (EDA) is the first phase, where we explore, summarize, and visualize the data to gain insights, identify patterns, and detect outliers.
Confirmatory data analysis (CDA) is the second phase, where we confirm, validate, and generalize the findings from EDA using statistical tests and models.

My Journey with EDA in R

EDA is a crucial step in any data analysis project. It helps us understand the characteristics, distributions, and relationships of the variables in our data. It also helps us formulate hypotheses, choose appropriate statistical tests, and communicate our findings effectively. EDA can also reveal problems with the data, such as missing values, outliers, or errors, and helps us fix them before proceeding to the next phase.

In this article, I will show you how to perform EDA in R using tidyverse packages, a collection of data manipulation, visualization, and modeling tools.

Before we start, you must have an idea of these things:

Data

The first step of EDA is to get the data into R. For this tutorial I generate a random dataset in R so that it has exactly the variables and values I want. Alternatively, you can use any other tool of your choice or a real dataset that you have. I generate the dataset with the following code:

# Set the seed for reproducibility
set.seed(123)
# Generate the dataset
student_data <- data.frame(
  id = 1:1000, # Unique identifier
  country = sample(c("China", "India", "USA", "UK", "Canada", "Brazil"), 1000, replace = TRUE, prob = c(0.2, 0.2, 0.15, 0.15, 0.15, 0.15)), # Country of origin
  gender = sample(c("Male", "Female"), 1000, replace = TRUE, prob = c(0.5, 0.5)), # Gender
  age = sample(18:25, 1000, replace = TRUE), # Age
  major = sample(c("Math", "CS", "Econ", "Eng", "Bio", "Art"), 1000, replace = TRUE, prob = c(0.2, 0.2, 0.15, 0.15, 0.15, 0.15)), # Major field of study
  gpa = round(runif(1000, min = 2, max = 4), 1), # Grade point average
  sat = sample(seq(1000, 1600, by = 50), 1000, replace = TRUE), # SAT score
  toefl = sample(seq(80, 120, by = 5), 1000, replace = TRUE), # TOEFL score
  ielts = round(runif(1000, min = 5, max = 9), 1), # IELTS score
  gre = sample(seq(260, 340, by = 10), 1000, replace = TRUE), # GRE score
  satisfaction = sample(1:5, 1000, replace = TRUE) # Satisfaction level
)

This code will create a dataset called student_data, with 1000 rows and 11 columns. Each row represents a student, and each column represents a variable. The variables are:

id: A unique identifier for each student.
country: The student's country of origin, with six possible values: China, India, USA, UK, Canada, and Brazil. China and India are each drawn with probability 0.2 and the other four countries with probability 0.15 each. These weights are arbitrary and do not reflect real populations.
gender: The student's gender, with two possible values: Male and Female. Each value is drawn with probability 0.5, so the two groups are roughly, but not exactly, equal in size.
age: The student's age, an integer from 18 to 25. Each age is equally likely.
major: The student's major field of study, with six possible values: Math, CS, Econ, Eng, Bio, and Art. Math and CS are each drawn with probability 0.2 and the other four majors with probability 0.15 each.
gpa: The student's grade point average, from 2 to 4. Each value is drawn from a uniform distribution with runif() and rounded to one decimal place.
sat: The student's score on the SAT test, from 1000 to 1600. Each value is sampled with equal probability from the multiples of 50 in that range.
toefl: The student's score on the TOEFL test, from 80 to 120. Each value is sampled with equal probability from the multiples of 5 in that range. In real data some students may not have taken the test or reported their score, which would appear as NA, but the code above does not generate any missing values.
ielts: The student's score on the IELTS test, from 5 to 9. Each value is drawn from a uniform distribution with runif() and rounded to one decimal place. As with toefl, the code above does not generate any missing values.
gre: The student's score on the GRE test, from 260 to 340. Each value is sampled with equal probability from the multiples of 10 in that range. As with toefl, the code above does not generate any missing values.
satisfaction: The student's level of satisfaction with their university, on a scale from 1 (very dissatisfied) to 5 (very satisfied). Each level is equally likely.
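
The sample() and runif() calls in the generating code always return complete vectors, so they do not produce NA values on their own. If you want test-score columns with roughly 20% missing values to practise on, one possible approach is sketched below. The add_missing() helper and the scores stand-in data frame are hypothetical names introduced for this example.

```r
set.seed(123)

# Stand-in for the three test-score columns of student_data
scores <- data.frame(
  toefl = sample(seq(80, 120, by = 5), 1000, replace = TRUE),
  ielts = round(runif(1000, min = 5, max = 9), 1),
  gre   = sample(seq(260, 340, by = 10), 1000, replace = TRUE)
)

# Hypothetical helper: set each value to NA independently with probability p
add_missing <- function(x, p = 0.2) {
  x[runif(length(x)) < p] <- NA
  x
}

# Apply the helper to every column and rebuild the data frame
scores_na <- as.data.frame(lapply(scores, add_missing, p = 0.2))

# Share of missing values per column (should be close to 0.2)
colMeans(is.na(scores_na))
```

Because each value is blanked out independently, the observed share of NA values varies slightly around 20% from column to column.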

Overview of the data

The following code shows the variable names, the number of rows and columns, the structure of the data, and the first five rows.

# Names of the variables
names(student_data)
# Dimensions of the dataset (rows, columns)
dim(student_data)
# Structure of the data
str(student_data)
# First five rows of the data
head(student_data, 5)

Summarizing the data using Descriptive Statistics

The next step of EDA is to summarize the data using descriptive statistics. Descriptive statistics are numerical measures that describe the characteristics of the data, such as the mean, median, mode, standard deviation, range, frequency, and percentage. They help us understand the data's central tendency, variability, and distribution. Before computing them, the character variables should be converted into factors so that R treats them as categorical. This can be done with base R functions or with the mutate() function from the dplyr package, which is part of the tidyverse.

I will use the summary() function, which returns a summary of each variable in the dataset: the minimum, maximum, mean, median, and first and third quartiles for numeric variables, the count of each level for factors, and the number of missing values where there are any. I will use the following code:

library(dplyr)
# Convert every character column to a factor, then summarize
student_data <- student_data %>%
  mutate(across(where(is.character), as.factor))
summary(student_data)

From this output, I can see the descriptive statistics of each variable in the dataset. For example, the mean age of the students is 21.49, the mean GPA is 3.002, and the mean satisfaction is 3.01. For the factor variables country, gender, and major, summary() reports the number of students at each level, listed in level order (alphabetical by default) rather than by frequency, so the counts have to be compared to find the most common category. I will deal with the missing values later in this article.
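
summary() does not report the standard deviation, and it gives counts rather than percentages for factors. A minimal sketch of how these could be computed with dplyr follows. It uses a small stand-in data frame called demo (an assumption for this example) rather than the full student_data.

```r
library(dplyr)

set.seed(123)
# Hypothetical stand-in with two of the article's column names
demo <- data.frame(
  country = sample(c("China", "India", "USA"), 200, replace = TRUE),
  gpa     = round(runif(200, min = 2, max = 4), 1)
)

# Frequency and percentage of each country
country_freq <- demo %>%
  count(country, name = "n") %>%
  mutate(percent = round(100 * n / sum(n), 1))

# Mean, median, and standard deviation of GPA per country
gpa_stats <- demo %>%
  group_by(country) %>%
  summarise(
    mean_gpa   = mean(gpa),
    median_gpa = median(gpa),
    sd_gpa     = sd(gpa),
    .groups    = "drop"
  )

country_freq
gpa_stats
```

The same pattern should work on student_data by replacing demo with it and picking whichever grouping and numeric columns are of interest.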

Visualizing the Data using Graphs

The next step of EDA is to visualize the data using charts and graphs. Charts and graphs are graphical representations of the data that can help us see the data's patterns, trends, and outliers. Charts and graphs can also help us compare the variables and their distributions and explore their relationships.

I will use the ggplot() function from the ggplot2 package, which is part of the tidyverse, to visualize the data. ggplot() builds a plot using the grammar of graphics, a system for describing and building graphs in layers. Each component specifies a different aspect of the plot, such as the data, the aesthetic mapping, the geometric object, the statistical transformation, the scale, the coordinate system, the facets, the labels, and the theme.
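
To make the idea of components concrete, here is a small sketch in which each line adds one of them. The demo data frame is a made-up stand-in, and theme_minimal() is just one of the built-in themes chosen for illustration.

```r
library(ggplot2)

set.seed(123)
# Hypothetical stand-in data with two of the article's column names
demo <- data.frame(
  gender = sample(c("Male", "Female"), 200, replace = TRUE),
  gpa    = round(runif(200, min = 2, max = 4), 1)
)

p_layers <- ggplot(demo, aes(x = gpa)) +   # data and aesthetic mapping
  geom_histogram(bins = 10) +              # geometric object (bars from binned counts)
  facet_wrap(~ gender) +                   # facets: one panel per gender
  labs(title = "GPA by gender",            # labels
       x = "GPA",
       y = "Count") +
  theme_minimal()                          # theme

p_layers
```

Removing or swapping any single line changes only that aspect of the plot, which is the main practical benefit of building plots this way.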

In this article, I will use the following types of charts and graphs to visualize the data:

Bar chart

A bar chart is a graph that uses rectangular bars to show the frequency or proportion of a categorical variable. A bar chart can help us see the distribution and comparison of a categorical variable across different levels or groups.

Bar chart of country

library(ggplot2)
ggplot(student_data, aes(x = country)) + # Define the data and the x-axis variable
  geom_bar() + # Add a bar chart layer
  labs(title = "Bar chart of country", # Add a title
       x = "Country", # Add a label for the x-axis
       y = "Count") # Add a label for the y-axis

Bar Chart of Gender

# Bar chart of gender
ggplot(student_data, aes(x = gender)) + # Define the data and the x-axis variable
  geom_bar() + # Add a bar chart layer
  labs(title = "Bar chart of gender", # Add a title
       x = "Gender", # Add a label for the x-axis
       y = "Count") # Add a label for the y-axis

Bar chart of Major

# Bar chart of major
ggplot(student_data, aes(x = major, fill = major)) + # Define the data, the x-axis variable, and the fill color
  geom_bar() + # Add a bar chart layer
  labs(title = "Bar chart of major", # Add a title
       x = "Major", # Add a label for the x-axis
       y = "Count") # Add a label for the y-axis
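
The bar charts above show counts. To show proportions instead, one option is to map the y-axis to the prop statistic that geom_bar() computes. The sketch below assumes ggplot2 3.3.0 or later (for after_stat()) and uses a small stand-in data frame.

```r
library(ggplot2)

set.seed(123)
# Hypothetical stand-in data with the article's country column
demo <- data.frame(
  country = sample(c("China", "India", "USA", "UK"), 200, replace = TRUE)
)

p_prop <- ggplot(demo, aes(x = country)) +
  # group = 1 makes the proportions sum to 1 across all bars
  geom_bar(aes(y = after_stat(prop), group = 1)) +
  labs(title = "Proportion of students by country",
       x = "Country",
       y = "Proportion")

p_prop
```

Without group = 1, each bar is treated as its own group and every proportion comes out as 1.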

Histogram

A histogram is a graph that uses rectangular bars to show the frequency or density of a numerical variable. A histogram can help us see the shape and spread of a numerical variable and identify any outliers or gaps in the data.

Histogram of age

# Histogram of age
ggplot(student_data, aes(x = age)) + # Define the data and the x-axis variable
  geom_histogram(bins = 8) + # Add a histogram layer with 8 bins
  labs(title = "Histogram of age", # Add a title
       x = "Age", # Add a label for the x-axis
       y = "Count") # Add a label for the y-axis

Histogram of gpa

ggplot(student_data, aes(x = gpa)) + # Define the data and the x-axis variable
  geom_histogram(bins = 10) + # Add a histogram layer with 10 bins
  labs(title = "Histogram of gpa", # Add a title
       x = "GPA", # Add a label for the x-axis
       y = "Count") # Add a label for the y-axis

Histogram of sat

ggplot(student_data, aes(x = sat)) + # Define the data and the x-axis variable
  geom_histogram(bins = 10) + # Add a histogram layer with 10 bins
  labs(title = "Histogram of sat", # Add a title
       x = "SAT", # Add a label for the x-axis
       y = "Count") # Add a label for the y-axis

Histogram of toefl

# Histogram of toefl
ggplot(student_data, aes(x = toefl)) + # Define the data and the x-axis variable
  geom_histogram(bins = 10) + # Add a histogram layer with 10 bins
  labs(title = "Histogram of toefl", # Add a title
       x = "TOEFL", # Add a label for the x-axis
       y = "Count") # Add a label for the y-axis

Histogram of ielts

# Histogram of ielts
ggplot(student_data, aes(x = ielts)) + # Define the data and the x-axis variable
  geom_histogram(bins = 10) + # Add a histogram layer with 10 bins
  labs(title = "Histogram of ielts", # Add a title
       x = "IELTS", # Add a label for the x-axis
       y = "Count") # Add a label for the y-axis
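
The histogram blocks above repeat the same code for each variable. As an alternative sketch, the numeric columns can be reshaped to long format with tidyr and drawn in one call with facet_wrap(). The demo data frame is a stand-in with three of the article's columns.

```r
library(ggplot2)
library(tidyr)

set.seed(123)
# Hypothetical stand-in with three numeric columns from the article
demo <- data.frame(
  age = sample(18:25, 300, replace = TRUE),
  gpa = round(runif(300, min = 2, max = 4), 1),
  sat = sample(seq(1000, 1600, by = 50), 300, replace = TRUE)
)

# One row per (variable, value) pair
demo_long <- pivot_longer(demo, cols = everything(),
                          names_to = "variable", values_to = "value")

p_hist <- ggplot(demo_long, aes(x = value)) +
  geom_histogram(bins = 10) +
  facet_wrap(~ variable, scales = "free") +  # each variable gets its own axes
  labs(title = "Histograms of numeric variables",
       x = "Value",
       y = "Count")

p_hist
```

The trade-off is that all panels share the same number of bins, so a variable that needs a different bin count (such as age with 8 bins) is better plotted on its own.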

Boxplot

A boxplot is a graph that uses a box and whiskers to show the summary statistics of a numerical variable. A boxplot can help us see a numerical variable's median, quartiles, range, and outliers, and compare them across different levels or groups of a categorical variable.

Boxplot of age by Country

# Boxplot of age by country
ggplot(student_data, aes(x = country, y = age, fill = country)) + # Define the data, the x- and y-axis variables, and the fill color
  geom_boxplot() + # Add a boxplot layer
  labs(title = "Boxplot of age by country", # Add a title
       x = "Country", # Add a label for the x-axis
       y = "Age") # Add a label for the y-axis
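
By default, geom_boxplot() draws as outliers the points that lie more than 1.5 times the interquartile range beyond the box. The sketch below reproduces that rule by hand in base R. The age vector is made up, and the value 40 is added deliberately as an artificial outlier. Note that quantile() may differ slightly from the hinges a boxplot uses, so treat this as an approximation.

```r
set.seed(123)
# Hypothetical ages, plus one artificial extreme value
age <- c(sample(18:25, 200, replace = TRUE), 40)

# Quartiles and interquartile range
q   <- unname(quantile(age, probs = c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Fences: 1.5 * IQR below the first and above the third quartile
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

# Values outside the fences are flagged as outliers
outliers <- age[age < lower_fence | age > upper_fence]
outliers
```

Whether a flagged value is an error or a genuine extreme case is a judgment that the rule itself cannot make.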

Boxplot of Satisfaction by Country
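
The code for this plot is not included here. Following the same pattern as the boxplot of age by country, a sketch might look like the one below. It is my reconstruction by analogy, not necessarily the original code, and it uses a stand-in data frame so that it runs on its own.

```r
library(ggplot2)

set.seed(123)
# Hypothetical stand-in with the article's country and satisfaction columns
demo <- data.frame(
  country      = sample(c("China", "India", "USA", "UK", "Canada", "Brazil"),
                        300, replace = TRUE),
  satisfaction = sample(1:5, 300, replace = TRUE)
)

# Boxplot of satisfaction by country
p_sat <- ggplot(demo, aes(x = country, y = satisfaction, fill = country)) +
  geom_boxplot() +                                   # one box per country
  labs(title = "Boxplot of satisfaction by country",
       x = "Country",
       y = "Satisfaction")

p_sat
```

Since satisfaction only takes the values 1 to 5, the boxes will look coarse; a bar chart of satisfaction faceted by country may be a useful complement.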
