Why should I use R: The Excel R Data Wrangling comparison: Part 1

[This article was first published on The Jumping Rivers Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This is part 1 of an ongoing series on why you should use R. Future blogs will be linked here as they are released.

The era of data manipulation and analysis using programming languages has arrived. But it can be tough to find the time and the right resources to fully switch over from more manual, time-consuming solutions, such as Excel. In this blog we will show a comparison between Excel and R to get you started!

When choosing between R and Excel, it is important to understand how both solutions can get you the results you need. However, one can make it an easy, reputable, convenient process, whereas the other can make it an extremely frustrating, time-consuming process prone to human errors.

R and Excel

When opening Excel and applying data manipulation techniques to your data, are you easily able to tell what manipulations have been made without clicking on the column or cells? If you were to share these Excel sheets with colleagues are they easily able to replicate your analyses without you telling them where to click or which formulas were applied?

With R all of these are possible. You automatically have all the code visible and in front of you in the form of scripts. Reading and understanding the code is possible because of its easy-to-use, easy-to-read syntax which allows you to track what the code is doing without having to be concerned about any hidden functions or modifications happening in the background.

Most people already learned the basics of Microsoft Excel in school. Once the data has been imported into an Excel sheet, using a point-and-click technique we can easily create basic graphs and charts. R, on the other hand, is a programming language with a steeper learning curve. It will take at most two weeks to become familiar with the basics of the language and the RStudio user interface. Luckily using R can easily become second-nature with practice.


Data comes in all shapes and sizes. It can often be difficult to know where to start. Whatever your problem, Jumping Rivers can help.


Replicating Analysis

R, while having a slightly steep learning curve, has the ability to reproduce analyses repeatedly and with different data sets. This is very helpful for large projects containing multiple data sets as it keeps our processes clean and consistent. Excel however, because of the point-and-click interface, allows us to rely frequently on memory and repetition, so we would have to repeat the same analyses multiple times by either copying and pasting or simply repeating the point-and-click process, which can be time-consuming, messy, and prone to human errors.

Unlike Excel, R is completely free and benefits from a large community of open-source contributors. To install R and the IDE (RStudio Desktop) to work with R, download and install the relevant versions for your operating system. Once you have successfully installed the IDE, the following user interface will be visible

The Posit IDE.

The area on the left is where you will write R code in scripts, use terminals and run jobs. The right hand side of the IDE is comprised of two sections. The top is the environment that stores a list of defined variables and data sets, view the history, and connect to other database. The area below contains five different tabs: the Files tab which lists all of the folders within this project, the Plots tab, which displays any plots that have been generated; the Packages tab which allows you to manage packages within your environment; the Help tab which provides a manual; and the Viewer tab which allows you to view generated interactive content.

Loading the data sets

Excel

The data import steps in Excel are quite straightforward to a day-to-day Excel user, however, it is certainly not reproducible.

Steps:

1. Click the Data tab on the Ribbon

2. Click the Get Data button

3. Select From File

4. Select from TEXT/CSV

5. Select the file and click Import

6. Click Load

R

There are various ways to import data sets such as local files, online datasets and even through database connections. We will use the read_csv() function from the {readr} package to import our csv files. But first, what are packages? R packages are a collection of R functions, compiled code and sample data that can be installed by R users. Before using an R function such as read_csv() to import the data, we are required to install and load the {readr} package. Packages are great because rather than having to have a huge programme containing everything you could possibly need, the different packages specialise in different things, and can be loaded in as and when you need them, saving a lot of space.

# Installing the package
install.packages("readr")

# Loading the package
library(readr)

# Importing the data
movies_data <- read_csv("blog data.csv")

Exploring our data

Before getting started with any data manipulation, let’s explore our data.

Excel

Excel has one basic data structure, which is the cell. These Excel cells are extremely flexible as they store data of various types (numeric, logical and characters). To obtain an overview of the data we could simply just scroll through the Excel data sheet. Now, let’s imagine a data set of 1 million rows and 200 columns, would it still be as easy to scroll through the data sheet to obtain an overview of data? Could we quickly and reliably view all the column names? To me, manually scrolling seems like a very time consuming, unreliable and messy process.

Screenshot of the data in an Excel spreadsheet.

R

To view our data in R, we could simply click on it in the environment or we could call the name of the data set in the script. If we are working with a large data set, we can also view a subset of this data by using functions like head() and tail(). We could also use the colnames function to programmatically display the variable names within our data.

movies_data

## # A tibble: 26 × 5
##    Country      Year Highest_profit Number_movies no_employees
##    <chr>       <dbl>          <dbl>         <dbl>        <dbl>
##  1 England      2011            100             3         1500
##  2 America      2012            150             2         2000
##  3 America      2013            300             4         4000
##  4 England      2013            130             2         4020
##  5 South Korea  2013            177             3         5300
##  6 America      2014            350             1         3150
##  7 South Korea  2015            700             6         6000
##  8 England      2016            650             2         5000
##  9 America      2016            230             1         1420
## 10 China        2016            440             3         5000
## # … with 16 more rows

str(movies_data) # Displays the structure of the data

## spc_tbl_ [26 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Country       : chr [1:26] "England" "America" "America" "England" ...
##  $ Year          : num [1:26] 2011 2012 2013 2013 2013 ...
##  $ Highest_profit: num [1:26] 100 150 300 130 177 350 700 650 230 440 ...
##  $ Number_movies : num [1:26] 3 2 4 2 3 1 6 2 1 3 ...
##  $ no_employees  : num [1:26] 1500 2000 4000 4020 5300 3150 6000 5000 1420 5000 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Country = col_character(),
##   ..   Year = col_double(),
##   ..   Highest_profit = col_double(),
##   ..   Number_movies = col_double(),
##   ..   no_employees = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

head(movies_data) # Displays the first six rows of the data

## # A tibble: 6 × 5
##   Country      Year Highest_profit Number_movies no_employees
##   <chr>       <dbl>          <dbl>         <dbl>        <dbl>
## 1 England      2011            100             3         1500
## 2 America      2012            150             2         2000
## 3 America      2013            300             4         4000
## 4 England      2013            130             2         4020
## 5 South Korea  2013            177             3         5300
## 6 America      2014            350             1         3150

tail(movies_data) # Displays the last six rows of data

## # A tibble: 6 × 5
##   Country      Year Highest_profit Number_movies no_employees
##   <chr>       <dbl>          <dbl>         <dbl>        <dbl>
## 1 England      2021            120             1         1325
## 2 America      2021            800             3         6800
## 3 America      2022            400             2         7200
## 4 China        2021            230             2         3101
## 5 South Korea  2022            320             1         4300
## 6 England      2022            221             2         3999

colnames(movies_data) # Displays all the variable names

## [1] "Country"        "Year"           "Highest_profit" "Number_movies" 
## [5] "no_employees"

The movies data is comprised of five columns: country, year, highest profit gained per movie, number of movies produced and number of employees on set during production. It is clear that R programmatically displays the output of our data whereas Excel requires of a lot of eye-balling and manual scrolling. If we were interested in displaying a subset of our data, in a report for example, using R we could simply use the functions above. To do this in Excel we would have to copy and paste the first 6 rows of the data and manually add it to the report document.

Summary Statistics

Now, let’s apply some summary statistics on our data. Summary statistics provide a quick summary of data and are particularly useful for comparing one project to another, or before and after.

Excel

It is very well known that Excel has a data storage limitation per spreadsheet. It can have a very limited amount of columns and rows, while R is made to handle larger data sets. Excel files are also known to crash when they exceed 20 tabs of data. Excel is able to handle a good chunk of data, but not much. This becomes very risky when you unknowingly start to lose data because the file has become too big and is unable to save. To generate summary statistics (such as the minimum and maximum values) of our data in Excel, we followed a few steps:

  1. Scroll to the Home tab

  2. In the Editing group, click the arrow next to AutoSum

  3. Click Min

  4. Click Max

  5. Press Enter

These steps were quite easy to follow, however, I often forget where to click or which tab to select. After discussing this workflow with a colleague, we also discovered slight differences in the steps for different versions of Excel. This did not seem very effective or reproducible to us.

R

summary(movies_data)

##    Country               Year      Highest_profit  Number_movies  
##  Length:26          Min.   :2011   Min.   : 11.0   Min.   :1.000  
##  Class :character   1st Qu.:2013   1st Qu.:156.8   1st Qu.:2.000  
##  Mode  :character   Median :2017   Median :320.0   Median :2.500  
##                     Mean   :2017   Mean   :350.0   Mean   :2.654  
##                     3rd Qu.:2021   3rd Qu.:485.0   3rd Qu.:3.000  
##                     Max.   :2022   Max.   :800.0   Max.   :6.000  
##   no_employees 
##  Min.   :1325  
##  1st Qu.:2275  
##  Median :4401  
##  Mean   :4338  
##  3rd Qu.:6375  
##  Max.   :7200

# Stardard deviation
sd(movies_data$Highest_profit) 

## [1] 224.3471

# Highest value of the Highest profit column
min(movies_data$Highest_profit) 

## [1] 11

# Highest value of the Highest profit column
max(movies_data$Highest_profit) 

## [1] 800

The dollar symbol, $, used here simply dictates which data set and column we are using for the analysis. It is evident that the source code of R can be used repeatedly and with different data sets in ways that Excel formulas cannot. R clearly shows the code (instructions), data and columns used for an analysis in ways that Excel does not. If I were to share this script with a colleague they would have a complete understanding on how the summary statistics were generated because of R’s human readable syntax.

Data Wrangling

Data manipulation tools assist us with modifying our data to make it easier to read and organise. For example, one of the easiest data manipulation tools in Excel is inserting columns and rows. The purpose of data manipulation is to create a consistent, organised and clean data set. With this in mind, let’s apply the following data manipulations in Excel and then R:

  1. Rename the columns into a consistent format
  2. Arrange the year column in ascending order
  3. Select and create a new column
  4. Remove a column from the data
  5. Select only the entries for the year 2014
  6. Remove only the entries from rows 4-11

1. Renaming columns in R and Excel

Excel

Renaming columns in R is a completely manual process, which makes it an extremely time-consuming and risky process especially if you are working between multiple messy Excel sheets.

A gif displaying the manual process of renaming columns inexcel.

R

For data manipulation in R, we use a powerful package in R called dplyr. Let’s load and install the package.

# Installing the packages
install.packages("dplyr")

# Loading the packages
library(dplyr)

To rename the columns, there is a handy function called rename(). We simply pass this function the name of our data set (movies_data), and then rename each of the columns. There are other methods available in other packages which can automatically make everything lower case, for example, but for the purposes of this blog, we will stick with dplyr.

# Renaming the column into a consistent format
movies_data <- rename_with(movies_data, tolower)

2. Arrange the year column in ascending order

Excel

To change column to ascending order, we first had to:

  1. Select the year column

  2. Direct to the Sort and Filter tab

  3. Select the option to sort from the largest to the smallest value

A screenshot of the Excel Tab used to sort the data accordingly

R

arrange(movies_data, year)

## # A tibble: 26 × 5
##    country      year highest_profit no_movies no_employees
##    <chr>       <dbl>          <dbl>     <dbl>        <dbl>
##  1 England      2011            100         3         1500
##  2 America      2011            100         3         1500
##  3 America      2012            150         2         2000
##  4 South Korea  2012             11         5         1333
##  5 America      2013            300         4         4000
##  6 England      2013            130         2         4020
##  7 South Korea  2013            177         3         5300
##  8 America      2014            350         1         3150
##  9 South Korea  2015            700         6         6000
## 10 England      2016            650         2         5000
## # … with 16 more rows

Again, with Excel representing a point-and-click nature, it is impossible to identify. by looking at a column, how the data was modified. If I were to replicate these steps in two years time I would likely have forgotten where to point and click. With R however, we have our code which clearly shows each step used to manipulate the data. If I were to return to my script in two years time, I would easily be able to replicate the analysis.

3. Selecting and adding a new column

Let’s reduce our data set by first selecting the country, year, no_movies and highest_profit columns. Then we will generate a new column called complete_profit. The complete_profit column should be generated from taking the highest_profit column divided by the no_movies column.

Excel

Manually creating a new column using Excel.

R

movies_data %>% 
  select(country, year, no_movies, highest_profit) %>% 
  mutate(complete_profit = highest_profit/no_movies)

## # A tibble: 26 × 5
##    country      year no_movies highest_profit complete_profit
##    <chr>       <dbl>     <dbl>          <dbl>           <dbl>
##  1 England      2011         3            100            33.3
##  2 America      2012         2            150            75  
##  3 America      2013         4            300            75  
##  4 England      2013         2            130            65  
##  5 South Korea  2013         3            177            59  
##  6 America      2014         1            350           350  
##  7 South Korea  2015         6            700           117. 
##  8 England      2016         2            650           325  
##  9 America      2016         1            230           230  
## 10 China        2016         3            440           147. 
## # … with 16 more rows

4. Removing a column

Excel

In Excel, inserting or deleting a column is a manual process. First, we select the column then right-click at the top of a column and then select the Delete option.

R

select(movies_data, -year)

## # A tibble: 26 × 4
##    country     highest_profit no_movies no_employees
##    <chr>                <dbl>     <dbl>        <dbl>
##  1 England                100         3         1500
##  2 America                150         2         2000
##  3 America                300         4         4000
##  4 England                130         2         4020
##  5 South Korea            177         3         5300
##  6 America                350         1         3150
##  7 South Korea            700         6         6000
##  8 England                650         2         5000
##  9 America                230         1         1420
## 10 China                  440         3         5000
## # … with 16 more rows

In R, we simply used the select function from the {dplyr} package to select a column of our data frame. To remove a column we put a - in front of the variable to exclude it from our data.

5. Select only the entries for a particular year

Excel

Here we are interested in extracting the data collected only during the year 2021. Using Excel software, we first sort the year column and then manually select the years that we are interested in. While applying this manual technique of selecting pieces of data that we are interested in, it is very easy to select the wrong data or even accidentally delete data.

A gif to manually retain all the data collected during the year2021.

R

filter(movies_data, year == 2021)

## # A tibble: 4 × 5
##   country  year highest_profit no_movies no_employees
##   <chr>   <dbl>          <dbl>     <dbl>        <dbl>
## 1 America  2021            800         3         6800
## 2 England  2021            120         1         1325
## 3 America  2021            800         3         6800
## 4 China    2021            230         2         3101

6. Remove only the row entries from 2-4

Excel

Removing rows in Excel is once again a manual process. We select the rows that we do not want to keep, then right click and delete those rows. These rows are now permanently deleted from the data sheet. If we were interested in adding them back into the sheet, we would have to find it (if we had a back up Excel sheet) and copy and paste it back into our data analysis Excel sheet. If we did not have a back up of the data that we had deleted, then this data would be completely lost.

R

In R we can use the slice function to return a subset of rows based on their position. If you want to remove rows using slice instead of retaining them you can just add a - in front of the row indices you’re passing into the function. So, to remove rows 2, 3, and 4:

slice(movies_data, -(2:4))

## # A tibble: 26 × 2
##    country     no_employees
##    <chr>              <dbl>
##  1 England             1500
##  2 America             2000
##  3 America             4000
##  4 England             4020
##  5 South Korea         5300
##  6 America             3150
##  7 South Korea         6000
##  8 England             5000
##  9 America             1420
## 10 China               5000
## # … with 16 more rows

Using R and Excel

There are multiple ways in which data manipulation is used efficiently in data science. Data formatting is important and must be organised to be read by the various software programs, be it in R or Excel.

Excel is an excellent tool and is easy to use and at times it is the most appropriate tool. Excel is often used for data processing work under general and basic office requirements. However, Excel is limiting in that the data file itself can hold only approximately 1 million rows without the aid of other tools. The basic built in statistical analysis is too simple and has very little practical value. If you are an aspiring data analyst, you will need to expand your toolset and start thinking beyond the rows and columns of a spreadsheet. R functions cover almost any area where data is needed. Getting started with R is very simple especially because of the easy-to-use and understandable syntax. Most importantly, R facilitates reproducible analyses.

A hammer is great for driving nails, but it’s not the only tool out there.

If you’re interested in learning R, then attend our Introduction to R course.

For updates and revisions to this article, see the original post

To leave a comment for the author, please follow the link and comment on their blog: The Jumping Rivers Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)