Import Data into R – Part 2

August 10, 2018
By

(This article was first published on Rsquared Academy Blog, and kindly contributed to R-bloggers)

Introduction

This is the second post in the series Importing Data into R. In the
previous post, we explored reading data from flat/delimited files. In this
post, we will:

  • list sheets in an excel file
  • read data from an excel sheet
  • read specific cells from an excel sheet
  • read specific rows
  • read specific columns
  • read data from
    – SAS
    – SPSS
    – STATA

Libraries, Data & Code

We will use the readxl package. It has
no external dependencies as compared to other packages available for reading
data from Excel. The data sets can be downloaded from
here and the codes from
here.

library(readxl)

List Sheets

An excel file may contain several sheets. Let us see how many sheets are
present in sample.xls file and their respective names using excel_sheets().

excel_sheets('sample.xls')
## [1] "ecom"

Read Sheet

Now that we know the number of sheets and their names, let us read data from
the ecom sheet of the sample.xls file using read_excel(). We will specify
the file name, and the sheet name or sheet number.

Case 1: Specify the sheet number

read_excel('sample.xls', sheet = 1)
## # A tibble: 7 x 5
##   channel        users new_users sessions bounce_rate
##                             
## 1 Organic Search 43296     40238    50810 48.72%     
## 2 Direct         12916     12311    16419 49.27%     
## 3 Referral       10983      7636    18105 22.26%     
## 4 Social         10346     10029    11101 61.92%     
## 5 Display         5564      4790     7220 83.30%     
## 6 Paid Search     2687      2205     3438 38.02%     
## 7 Affiliates      1773      1585     2167 55.75%

Case 2: Specify the sheet name

read_excel('sample.xls', sheet = 'ecom')
## # A tibble: 7 x 5
##   channel        users new_users sessions bounce_rate
##                             
## 1 Organic Search 43296     40238    50810 48.72%     
## 2 Direct         12916     12311    16419 49.27%     
## 3 Referral       10983      7636    18105 22.26%     
## 4 Social         10346     10029    11101 61.92%     
## 5 Display         5564      4790     7220 83.30%     
## 6 Paid Search     2687      2205     3438 38.02%     
## 7 Affiliates      1773      1585     2167 55.75%

Notice when you use the sheet name, the name should be enclosed in
single/double quotes.

Read Specific Cells

You may not always want to read all the columns or rows from the excel sheet.
In such cases, you can specify the cells from which the data must be read
which can be achieved using the range argument. So how do we specify the
cells from which to read data? There are different ways of specifying the
cell range and we will look at them one by one:

Method 1

The first method uses the cell names along with : to specify the cell range.
For example, to read data from first 4 rows of columns B and C, we will
specify the range as "B1:C4".

read_excel('sample.xls', sheet = 1, range = "B1:C4")
## # A tibble: 3 x 2
##   users new_users
##        
## 1 43296     40238
## 2 12916     12311
## 3 10983      7636

To read data from first 5 rows of columns A, B and C, we will specify the range as "A1:C5".

read_excel('sample.xls', sheet = 1, range = "A1:C5")
## # A tibble: 4 x 3
##   channel        users new_users
##                  
## 1 Organic Search 43296     40238
## 2 Direct         12916     12311
## 3 Referral       10983      7636
## 4 Social         10346     10029

Method 2

In the second method, we start from a particular cell and specify the number
of rows and columns to be covered keeping the initial cell as anchorage.
In the below example, we want to read 3 rows and 2 columns starting from the
cell A4.

read_excel('sample.xls', sheet = 1, col_names = FALSE,
  range = anchored("A4", dim = c(3, 2)))
## # A tibble: 3 x 2
##   X__1      X__2
##       
## 1 Referral 10983
## 2 Social   10346
## 3 Display   5564

Method 3

In this method, we use the cell_limit() and specify the location of two ends
of a rectangle covering the cells we want to read. For example, to read data
from the first 6 rows and 4 columns, we will specify the range as following:

  • start from the first row of the first column
  • cover all cells upto the 6th row of the 4th column

read_excel('sample.xls', sheet = 1,
  range = cell_limits(c(1, 1), c(6, 4)))
## # A tibble: 5 x 4
##   channel        users new_users sessions
##                      
## 1 Organic Search 43296     40238    50810
## 2 Direct         12916     12311    16419
## 3 Referral       10983      7636    18105
## 4 Social         10346     10029    11101
## 5 Display         5564      4790     7220

You can use NA to indicate the first and last row/column. For example, to
read data from all the rows from the second column onwards:

read_excel('sample.xls', sheet = 1,
  range = cell_limits(c(1, 2), c(NA, NA)))
## # A tibble: 7 x 4
##   users new_users sessions bounce_rate
##                   
## 1 43296     40238    50810 48.72%     
## 2 12916     12311    16419 49.27%     
## 3 10983      7636    18105 22.26%     
## 4 10346     10029    11101 61.92%     
## 5  5564      4790     7220 83.30%     
## 6  2687      2205     3438 38.02%     
## 7  1773      1585     2167 55.75%

Let us quickly look at how we will specify range of cells using the above 3
methods when we want to read data from the first 4 rows of columns B and C:

Method 1

read_excel('sample.xls', sheet = 1,
  range = "B1:C4")
## # A tibble: 3 x 2
##   users new_users
##        
## 1 43296     40238
## 2 12916     12311
## 3 10983      7636

Method 2

read_excel('sample.xls', sheet = 1,
  range = anchored("B1", dim = c(4, 2)))
## # A tibble: 3 x 2
##   users new_users
##        
## 1 43296     40238
## 2 12916     12311
## 3 10983      7636

Method 3

read_excel('sample.xls', sheet = 1,
  range = cell_limits(c(1, 2), c(4, 3)))
## # A tibble: 3 x 2
##   users new_users
##        
## 1 43296     40238
## 2 12916     12311
## 3 10983      7636

Read Specific Rows

When you want to read a subset of rows from the data, use cell_rows() and
specify the row numbers or the range. In the below example, we want to read
the first 4 rows of data from the file.

read_excel('sample.xls', sheet = 1, range = cell_rows(1:4))
## # A tibble: 3 x 5
##   channel        users new_users sessions bounce_rate
##                             
## 1 Organic Search 43296     40238    50810 48.72%     
## 2 Direct         12916     12311    16419 49.27%     
## 3 Referral       10983      7636    18105 22.26%

Read Single Column

If you want to read a single column from the data, use cell_cols() and
specify the column number. In the below example, we read the second column
from the sample.xls file.

read_excel('sample.xls', sheet = 1, range = cell_cols(2))
## # A tibble: 7 x 1
##   users
##   
## 1 43296
## 2 12916
## 3 10983
## 4 10346
## 5  5564
## 6  2687
## 7  1773

Read Multiple Columns

In case of multiple columns, we need to specify the column numbers or the
column range. In the below example, we want to read the 2nd, 4th and 6th column from
the sample.xls file.

read_excel('sample.xls', sheet = 1, range = cell_cols(c(2, 4, 6)))
## # A tibble: 7 x 5
##   users new_users sessions bounce_rate X__1 
##                    
## 1 43296     40238    50810 48.72%      NA   
## 2 12916     12311    16419 49.27%      NA   
## 3 10983      7636    18105 22.26%      NA   
## 4 10346     10029    11101 61.92%      NA   
## 5  5564      4790     7220 83.30%      NA   
## 6  2687      2205     3438 38.02%      NA   
## 7  1773      1585     2167 55.75%      NA

In the next example, we want to read data from the 2nd column upto and
including the 6th column.

read_excel('sample.xls', sheet = 1, range = cell_cols(c(2:6)))
## # A tibble: 7 x 5
##   users new_users sessions bounce_rate X__1 
##                    
## 1 43296     40238    50810 48.72%      NA   
## 2 12916     12311    16419 49.27%      NA   
## 3 10983      7636    18105 22.26%      NA   
## 4 10346     10029    11101 61.92%      NA   
## 5  5564      4790     7220 83.30%      NA   
## 6  2687      2205     3438 38.02%      NA   
## 7  1773      1585     2167 55.75%      NA

Summary

Statistical Softwares

We will use the haven package to read data from files of other statistical softwares such as:

  • SAS
  • SPSS
  • STATA

Library

library(haven)
## Warning: package 'haven' was built under R version 3.5.2

STATA

read_stata('airline.dta')  
## # A tibble: 32 x 6
##     year     y     w     r     l     k
##         
##  1  1948  1.21 0.243 0.145  1.41 0.612
##  2  1949  1.35 0.260 0.218  1.38 0.559
##  3  1950  1.57 0.278 0.316  1.39 0.573
##  4  1951  1.95 0.297 0.394  1.55 0.564
##  5  1952  2.27 0.310 0.356  1.80 0.574
##  6  1953  2.73 0.322 0.359  1.93 0.711
##  7  1954  3.03 0.335 0.403  1.96 0.776
##  8  1955  3.56 0.350 0.396  2.12 0.827
##  9  1956  3.98 0.361 0.382  2.43 0.800
## 10  1957  4.42 0.379 0.305  2.71 0.921
## # ... with 22 more rows

SPSS

read_spss('employee.sav')  
## # A tibble: 474 x 9
##       id gender  educ    jobcat  salary  salbegin  jobtime prevexp minority
##            
##  1     1 m       15      3       57000   27000     98      144     0       
##  2     2 m       16      1       40200   18750     98       36     0       
##  3     3 f       12      1       21450   12000     98      381     0       
##  4     4 f        8      1       21900   13200     98      190     0       
##  5     5 m       15      1       45000   21000     98      138     0       
##  6     6 m       15      1       32100   13500     98       67     0       
##  7     7 m       15      1       36000   18750     98      114     0       
##  8     8 f       12      1       21900    9750     98        0     0       
##  9     9 f       15      1       27900   12750     98      115     0       
## 10    10 f       12      1       24000   13500     98      244     0       
## # ... with 464 more rows

SAS

read_sas('airline.sas7bdat')
## # A tibble: 32 x 6
##     YEAR     Y     W     R     L     K
##         
##  1  1948  1.21 0.243 0.145  1.41 0.612
##  2  1949  1.35 0.260 0.218  1.38 0.559
##  3  1950  1.57 0.278 0.316  1.39 0.573
##  4  1951  1.95 0.297 0.394  1.55 0.564
##  5  1952  2.27 0.310 0.356  1.80 0.574
##  6  1953  2.73 0.322 0.359  1.93 0.711
##  7  1954  3.03 0.335 0.403  1.96 0.776
##  8  1955  3.56 0.350 0.396  2.12 0.827
##  9  1956  3.98 0.361 0.382  2.43 0.800
## 10  1957  4.42 0.379 0.305  2.71 0.921
## # ... with 22 more rows

Summary

To leave a comment for the author, please follow the link and comment on their blog: Rsquared Academy Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)