Data Format in R: in this tutorial, you’ll learn about data formats and why reformatting data can improve your data analysis.
Data is typically collected from a variety of sources, by different people, and stored in a variety of formats.
Data formatting is the process of transforming data into a standardized format that allows you to make meaningful comparisons.
Data formatting is an important part of dataset cleaning because it helps keep the data consistent and easy to understand.
For example, in a dataset containing a column of cities, “Bangalore,” “Bengaluru,” and “Bnglr” are all different spellings used for the same city, Bangalore.
In the majority of cases, you’ll want to consider them all as a single unit, or format, to make statistical analysis easier later on.
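As a minimal sketch of this idea, the example below maps the alternative spellings onto one standard value. The data frame cities, its column city, and the list of variants are hypothetical and chosen only for illustration; they are not part of the Airline dataset.

```r
library(dplyr)

# Hypothetical toy data: three spellings of the same city plus one other city
cities <- data.frame(
  city = c("Bangalore", "Bengaluru", "Bnglr", "Mumbai"),
  stringsAsFactors = FALSE
)

# Spellings that should be treated as "Bangalore" (an assumed list)
bangalore_variants <- c("Bengaluru", "Bnglr")

# Replace every variant with the standard spelling; leave other values untouched
cities_clean <- cities %>%
  mutate(city = if_else(city %in% bangalore_variants, "Bangalore", city))

# Count rows per city after standardization
table(cities_clean$city)
```

Keeping the variants in a named vector makes it easy to extend the list when new spellings turn up in the data.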
Data Format in R
We will use the same Airline dataset here that we used in one of our earlier posts.
# tidyverse attaches dplyr, ggplot2, and tidyr
library(tidyverse)
# header = TRUE: the first row holds the column names
data <- read.csv("D:/RStudio/Airlinedata.csv", header = TRUE)
head(data)
There is a column called “FlightDate” in the Airline dataset. The “FlightDate” field is formatted as “year-month-day,” so in a value such as 2003-03-28, 2003 is the year, 03 is the month (March), and 28 is the day.
The “FlightDate” field can be separated into three columns: “year,” “month,” and “day.”
Reformatting the date with tidyverse takes a single line of code. Other packages can do the same, but here we concentrate on tidyverse, which we covered in our earlier post on important packages for data science.
This example reformats the column with the separate() function, splitting the date at each “-” and naming the three new columns “year,” “month,” and “day.”
# Split FlightDate at each "-" into three new character columns
data1 <- data %>%
  separate(FlightDate, sep = "-", into = c("year", "month", "day"))
head(data1)
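For comparison, the same split can be sketched in base R without tidyverse. This is only one possible approach, shown on a hypothetical toy data frame named flights rather than on the Airline file, and it assumes every value follows the “year-month-day” pattern.

```r
# Hypothetical toy data frame with dates in "year-month-day" form
flights <- data.frame(
  FlightDate = c("2003-03-28", "2003-03-29"),
  stringsAsFactors = FALSE
)

# Parse the text as Date objects, then extract each component as text
dates <- as.Date(flights$FlightDate, format = "%Y-%m-%d")
flights$year  <- format(dates, "%Y")
flights$month <- format(dates, "%m")
flights$day   <- format(dates, "%d")

head(flights)
```

Unlike separate() with its default settings, this version keeps the original FlightDate column next to the three new ones.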
The data type may be wrongly determined for a variety of reasons, including when importing a dataset into R or processing a variable.
For example, the “year,” “month,” and “day” columns produced from the flight date are stored as “character,” even though the desired data type is numeric.
It’s critical to investigate the column’s data type and convert it to the correct data type for further analysis; otherwise, the models you later construct may act strangely, and valid data may be interpreted as missing data.
The sapply() function in R can be used together with class() to check the data type of each column in a dataset.
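A minimal sketch of this check is shown below. The toy data frame flights and its delay column are hypothetical stand-ins for the Airline data after the date has been split.

```r
# Hypothetical toy data frame standing in for the data after separate()
flights <- data.frame(
  year  = c("2003", "2003"),
  month = c("03", "03"),
  day   = c("28", "29"),
  delay = c(5, -3),
  stringsAsFactors = FALSE
)

# Apply class() to every column; the result is a named character vector
col_types <- sapply(flights, class)
print(col_types)
```

Here the three date columns are reported as character while delay is numeric, which is the kind of mismatch worth fixing before modelling.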
If the detected data types are wrong, you can convert the columns with the mutate functions from dplyr.
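# Keep the three date columns and convert them from character to numeric types
data2 <- data1 %>%
  select(year, month, day) %>%
  mutate_all(type.convert, as.is = TRUE) %>%  # as.is = TRUE keeps text as character, not factor
  mutate_if(is.character, as.numeric)         # convert any column still stored as text
str(data2)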
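As an alternative sketch, separate() has a convert argument that runs the type conversion on the new columns while splitting, so the separate conversion step may not be needed. The toy data frame flights is hypothetical and used only to keep the example self-contained.

```r
library(tidyr)
library(dplyr)

# Hypothetical toy data frame with dates in "year-month-day" form
flights <- data.frame(
  FlightDate = c("2003-03-28", "2003-03-29"),
  stringsAsFactors = FALSE
)

# convert = TRUE asks separate() to convert the new columns to suitable types
flights_split <- flights %>%
  separate(FlightDate, sep = "-", into = c("year", "month", "day"), convert = TRUE)

str(flights_split)
```

With this option the new columns come back as integers, so a month written as “03” is stored as 3. It is still worth checking the result with str() or sapply(), because the conversion is inferred from the values.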
You learned in this tutorial that reformatting data is a method of bringing information into a common standard of expression, which allows you to make meaningful comparisons.