From excel to R: Part 1

[This article was first published on Zahier Nasrudin, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Welcome welcome.

In this blog post, we will be exploring:

  1. How to efficiently read data from multiple sheets in an Excel workbook and combine them into a single dataframe using R.
  2. Dealing with mixed date formats using the janitor package

By the end of this post, you & me should have a solid understanding of how to read and combine data from multiple sheets in an Excel workbook using R, and how to handle mixed date formats to ensure consistent data analysis.

Load library

Show code
library(readxl)
library(dplyr)
library(purrr)
library(janitor)

Define path

First, let’s define the path to the Excel file that we want to read in. In this example, we’ll be using the here package to make the path relative to the current project directory. This data set can be obtained from my github repository

Show code
## Define path
path <- here::here("_posts/2023-03-08-excel-to-r/input/store-sales.xlsx")

Reading all sheets

To read all the sheets:

Show code
df_excel <- path %>% 
  excel_sheets() %>%
  map_df(~read_excel(.x, path = path) %>%
           mutate(sheet_name = .x)) 
  1. The excel_sheets() function from the readxl package is used to extract the names of all the sheets in the Excel file.
  2. The map_df() function from the purrr package meanwhile is used to apply a function to each element of a vector and combine the results into a data frame
  3. In this case, we are using map_df() to read in each sheet of the Excel file; using the read_excel() function from the readxl package.
  4. Finally, we add a new column. This new column is called sheet_name and contains the name of the sheet that the data came from.

The dataframe:

Show code
df_excel %>%
  knitr::kable()
Year Date Store Sales sheet_name
2012 1/26/2012 A 2000 Year1
2012 1/26/2012 B 2500 Year1
2012 1/28/2012 C 3400 Year1
2012 Jan 21, 2012 D 6000 Year1
2013 1/26/2013 A 2000 Year2
2013 1/21/2013 B 2500 Year2
2013 1/28/2013 C 3400 Year2
2013 Jan 25, 2013 D 6000 Year2
2013 41302 E 5500 Year2
2014 1/26/2014 A 2000 Year3
2014 1/28/2014 B 2500 Year3
2014 2/28/2014 C 3400 Year3
2014 March 25, 2014 D 6000 Year3
2014 41789 E 5500 Year3

Mixed date formats

We have successfully loaded the information into one dataframe. However, it appears that this dataframe contains a column (Date) with mixed date formats that have been imported from Excel. This can often be a problem when working with date data, as it can make it difficult to perform consistent operations and analyses on the data. To fix this issue, we will need to standardize the date format across the entire column, so that all dates can be treated consistently. One way to accomplish this is to use a function like convert_to_date() from the janitor package, which can convert the dates to a consistent format while also handling any errors or inconsistencies in the data:

Show code
df_excel <- df_excel %>%
  mutate(Date_change = convert_to_date(Date, string_conversion_failure = "warning",
                                       character_fun = lubridate::mdy))
  1. To explain,convert_to_date() function from the janitor package is used to convert the Date column to a consistent date format.
  2. The string_conversion_failure parameter is set to “warning”, to indicate that any non-date strings in the Date column will be converted to NA values and a warning message will be displayed, just to be safe.
  3. The character_fun parameter is set to lubridate::mdy, which indicates the expected order of month, day, and year in the date string.

Please refer to Date_change for the standardized date column:

Show code
## Show dataset


df_excel %>%
  knitr::kable()
Year Date Store Sales sheet_name Date_change
2012 1/26/2012 A 2000 Year1 2012-01-26
2012 1/26/2012 B 2500 Year1 2012-01-26
2012 1/28/2012 C 3400 Year1 2012-01-28
2012 Jan 21, 2012 D 6000 Year1 2012-01-21
2013 1/26/2013 A 2000 Year2 2013-01-26
2013 1/21/2013 B 2500 Year2 2013-01-21
2013 1/28/2013 C 3400 Year2 2013-01-28
2013 Jan 25, 2013 D 6000 Year2 2013-01-25
2013 41302 E 5500 Year2 2013-01-28
2014 1/26/2014 A 2000 Year3 2014-01-26
2014 1/28/2014 B 2500 Year3 2014-01-28
2014 2/28/2014 C 3400 Year3 2014-02-28
2014 March 25, 2014 D 6000 Year3 2014-03-25
2014 41789 E 5500 Year3 2014-05-30
To leave a comment for the author, please follow the link and comment on their blog: Zahier Nasrudin.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)