In this tutorial, you’ll learn about descriptive statistics, one strategy you can use in exploratory data analysis.
Before you invest time building intricate models, you should first study your data. Calculating descriptive statistics is an easy way to do so.
Descriptive Statistics in R
Descriptive statistical analysis aids in describing the fundamental characteristics of a dataset and gives a brief description of the sample and data measurements.
One way to do this is to use the summarise() function from dplyr, which is part of the tidyverse.
The summarise() function is frequently used together with group_by() to reduce each group to a single-row summary.
The names of the columns that hold the categorical variables you want to summarise by are passed as arguments to group_by().
Let’s use an airline dataset for the analysis:
library(tidyverse)  # also attaches dplyr and ggplot2
library(dplyr)
library(ggplot2)

# Read the airline data; the first row holds the column names
data <- read.csv("D:/RStudio/Airlinedata.csv", header = TRUE)

# Mean and standard deviation of arrival delay for each airline
summary <- data %>%
  group_by(Reporting_Airline) %>%
  summarize(mean = mean(ArrDelayMinutes, na.rm = TRUE),
            std_dev = sd(ArrDelayMinutes, na.rm = TRUE))
summary

  Reporting_Airline     mean   std_dev
1                AA 10.12226  24.95855
2                AS 12.91111  25.59903
3                B6 18.55039  47.14758
4                DL 13.83650  48.08742
5                HP 19.21429  25.31700
6            PA (1) 33.54545 118.73265
7                TW 15.59459  36.82041
8                UA 11.73462  27.09474
9                VX 14.93798  31.14125
The mean, total number of data points, standard deviation, quartiles, and extreme values can all be displayed in summary statistics.
Note that mean() and sd() do not skip NA values on their own; the na.rm=TRUE argument in the code above tells them to ignore missing values. The summarize() function helps you understand how your variables are distributed.
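As a minimal sketch of how several of these statistics can be collected in a single summarise() call, the example below uses a small made-up data frame. The `toy` data, its column names, and its values are assumptions for illustration only, not the airline data. It also shows what happens to the mean when na.rm is left out.

```r
library(dplyr)

# Hypothetical toy data standing in for the airline dataset
toy <- data.frame(
  airline = c("AA", "AA", "AA", "DL", "DL", "DL"),
  delay   = c(0, 10, NA, 5, 15, 40)
)

toy_summary <- toy %>%
  group_by(airline) %>%
  summarise(
    n_rows       = n(),                                  # rows in the group, NA included
    n_missing    = sum(is.na(delay)),                    # number of missing delays
    mean_naive   = mean(delay),                          # NA if the group contains any NA
    mean_delay   = mean(delay, na.rm = TRUE),            # missing values dropped first
    sd_delay     = sd(delay, na.rm = TRUE),
    q1_delay     = quantile(delay, 0.25, na.rm = TRUE),  # lower quartile
    median_delay = median(delay, na.rm = TRUE),
    q3_delay     = quantile(delay, 0.75, na.rm = TRUE),  # upper quartile
    min_delay    = min(delay, na.rm = TRUE),
    max_delay    = max(delay, na.rm = TRUE)
  )

print(toy_summary)
```

For the AA group, which contains one NA, mean_naive comes back as NA while mean_delay is computed from the two remaining values.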
Your dataset can also include categorical variables. These are variables with discrete values that can be classified into different categories or groups.
The reporting airline, for example, is a categorical variable in this dataset, with the following categories: UA, AS, DL, and six others.
Using the count() function is one technique to summarise categorical data.
data %>% count(Reporting_Airline)

  Reporting_Airline    n
1                AA 1096
2                AS   45
3                B6  258
4                DL  526
5                HP   14
6            PA (1)   33
7                TW  185
8                UA  569
9                VX  129
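Counts are often easier to read when they are sorted and shown alongside proportions. The sketch below is one possible way to do that, again on a made-up data frame (`toy` and its `airline` column are hypothetical, not the airline data).

```r
library(dplyr)

# Hypothetical toy data standing in for the airline dataset
toy <- data.frame(airline = c("AA", "AA", "AA", "DL", "DL", "UA"))

airline_counts <- toy %>%
  count(airline, sort = TRUE) %>%   # one row per category, largest first
  mutate(share = n / sum(n))        # proportion of all rows in each category

print(airline_counts)
```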
Boxplots are a good way to visualize numeric data because they show how the values are distributed. The line inside the box marks the median, which is the location of the middle data point.
Consider a typical boxplot for illustration.
The 75th percentile is represented by the Upper Quartile, whereas the 25th percentile is represented by the Lower Quartile.
The Interquartile Range (IQR) is the distance between the Upper and Lower Quartiles; the box spans the data that falls between them.
The Lower and Upper Extremes come next.
The upper extreme lies 1.5 times the IQR above the 75th percentile, and the lower extreme lies 1.5 times the IQR below the 25th percentile. The whiskers are usually drawn to the most extreme data points that still fall within these limits.
Finally, outliers appear on boxplots as separate dots beyond the upper and lower extremes.
Outliers, as well as the distribution and skewness of the data, can be clearly identified using boxplots.
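The quartile and extreme calculations described above can be sketched in a few lines of R. The `delays` vector is made up for illustration, and the names `lower_limit` and `upper_limit` are our own labels for the lower and upper extremes.

```r
library(ggplot2)

# Hypothetical arrival delays in minutes, made up for illustration
delays <- c(0, 2, 3, 5, 7, 8, 10, 12, 15, 60)

q1  <- unname(quantile(delays, 0.25))  # lower quartile (25th percentile)
q3  <- unname(quantile(delays, 0.75))  # upper quartile (75th percentile)
iqr <- q3 - q1                         # interquartile range

lower_limit <- q1 - 1.5 * iqr          # lower extreme
upper_limit <- q3 + 1.5 * iqr          # upper extreme

# Values outside the limits are the ones a boxplot draws as separate dots
outliers <- delays[delays < lower_limit | delays > upper_limit]
print(outliers)

# Draw the corresponding boxplot
p <- ggplot(data.frame(delay = delays), aes(y = delay)) +
  geom_boxplot() +
  labs(y = "Delay (minutes)")
print(p)
```

With these values, the only point outside the limits is the 60-minute delay, so it is the single dot drawn beyond the whisker.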
A scatter plot is an excellent way to show the relationship between two variables.
Each observation is represented by a point whose position is given by its values on the two variables.
Consider a typical scatter plot for illustration.
The predictor variable is the one that you’re using to make a prediction.
The variable you’re trying to forecast is known as the target variable.
In a scatter plot, the predictor variable is usually plotted on the x-axis (horizontal axis), while the target variable is plotted on the y-axis (vertical axis).
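A minimal ggplot2 sketch of this layout follows. The `flights` data frame and its `dep_delay` and `arr_delay` columns are hypothetical values invented for the example, with departure delay playing the role of the predictor and arrival delay the target.

```r
library(ggplot2)

# Hypothetical data: departure delay (predictor) and arrival delay (target)
flights <- data.frame(
  dep_delay = c(0, 5, 10, 20, 30, 45, 60),
  arr_delay = c(2, 4, 12, 18, 33, 41, 65)
)

# Predictor on the x-axis, target on the y-axis, one point per observation
p <- ggplot(flights, aes(x = dep_delay, y = arr_delay)) +
  geom_point() +
  labs(x = "Departure delay (minutes)", y = "Arrival delay (minutes)")
print(p)

# Pearson correlation as a single-number summary of the linear relationship
r <- cor(flights$dep_delay, flights$arr_delay)
print(r)
```

The correlation coefficient complements the plot: the scatter plot shows the shape of the relationship, while cor() condenses its linear strength into one number between -1 and 1.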
In this tutorial, you learned that descriptive statistics describe the basic characteristics of a dataset and provide brief summaries of the sample and data measurements.
You can use boxplots to visualize how data is distributed and scatter plots to visualize the relationship between two variables.