How to prepare data for analysis in r

Posted on August 18, 2020 by dataenq. in R bloggers | 0 Comments

[This article was first published on dataENQ, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Welcome to our first article. This article aims to bestow the audience with commands that R offers to prepare the data for analysis in R.

Why prepare data?

If you are lucky, you may not have to prepare data, though real-life data has gaps, errors, missing values, etc. The quality of the analysis depends on the quality of the data. Remember the golden rule, garbage in, garbage out.

In the majority of cases, we visualize data in tabular form, though tabular data is somewhat easy to work with, yet it’s not always the case. Sometimes data for analysis is available as a combination of different files and requires multiple operations before it is ready for use.

In this article, we aim to look at:

Prerequisite and Setup

To practice along with this article, you need:

R installed on your machine
R Studio (optional)
Data files

Details of the R environment used in this article are as follows:

sessionInfo()
## R version 4.0.1 (2020-06-06)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.5
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] data.table_1.12.8 kableExtra_1.1.0  forcats_0.5.0     stringr_1.4.0    
##  [5] dplyr_1.0.0       purrr_0.3.4       readr_1.3.1       tidyr_1.1.0      
##  [9] tibble_3.0.1      ggplot2_3.3.1     tidyverse_1.3.0  
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.1.0  xfun_0.14         haven_2.3.1       lattice_0.20-41  
##  [5] colorspace_1.4-1  vctrs_0.3.1       generics_0.0.2    viridisLite_0.3.0
##  [9] htmltools_0.5.0   yaml_2.2.1        blob_1.2.1        rlang_0.4.6      
## [13] pillar_1.4.4      glue_1.4.1        withr_2.2.0       DBI_1.1.0        
## [17] dbplyr_1.4.4      modelr_0.1.8      readxl_1.3.1      lifecycle_0.2.0  
## [21] munsell_0.5.0     gtable_0.3.0      cellranger_1.1.0  rvest_0.3.5      
## [25] evaluate_0.14     knitr_1.28        fansi_0.4.1       broom_0.5.6      
## [29] Rcpp_1.0.4.6      scales_1.1.1      backports_1.1.7   webshot_0.5.2    
## [33] jsonlite_1.7.0    fs_1.4.1          hms_0.5.3         digest_0.6.25    
## [37] stringi_1.4.6     grid_4.0.1        cli_2.0.2         tools_4.0.1      
## [41] magrittr_1.5      crayon_1.3.4      pkgconfig_2.0.3   ellipsis_0.3.1   
## [45] xml2_1.3.2        reprex_0.3.0      lubridate_1.7.9   assertthat_0.2.1 
## [49] rmarkdown_2.2     httr_1.4.1        rstudioapi_0.11   R6_2.4.1         
## [53] nlme_3.1-148      compiler_4.0.1

Data

Data used in the article is about s and p 500 companies financials. This data is available under Open Data Commons Public Domain Dedication and License.

Original “s and p 500 companies financials” data is available as one CSV file but for this exercise, we have split the data into five different files. These five files are as follows:

constituents-financials-observations-1.csv
constituents-financials-observations-2.csv
constituents-financials-sectors.csv
constituents-financials-symbol-descriptions.csv
constituents-financials-variables.csv

If you wish, above five files are available here for you to download.

Strategy

The strategy to prepare the data for analysis is to read files into R. Collect all observations into one data table followed by variable name assignment. Merge result data table with sector and symbols data tables to bring all variables together.

We will use the tidyverse package throughout this article.

Reading multiple CSV files into R

R presents several functions to read data. In this article to keep things simple and clear, we will use read.csv to read data into R. As per our strategy, let’s read the observation files and create data tables.

observatiion.files <- list.files(pattern = "^.*observations.*.csv$")
observations.list <- lapply(observatiion.files, read.csv, header = FALSE)

all.observations <- rbindlist(observations.list)

The first statement above creates a character vector of length two. Function list.files search the working directory for any CSV file that has the word “observations” into its filename.

The second statement creates a list object with the help of lapply function. It can take a vector or list as an input and applies a function over the elements. The function we have chosen here is read.csv where the header option is set to false so that the first row is not skipped.

The third and final statement all.observations <- rbindlist(observations.list) creates a data table object by binding all the rows of elements of the list observations.list.

Let’s read other files using fread function.

all.sectors <- fread("constituents-financials-sectors.csv")
all.symbols <- fread("constituents-financials-symbol-descriptions.csv")

variable.names <- fread("constituents-financials-variables.csv", header = FALSE)

Now look at the imported data:

str(all.observations)
## Classes 'data.table' and 'data.frame':   505 obs. of  12 variables:
##  $ V1 : chr  "MMM" "AOS" "ABT" "ABBV" ...
##  $ V2 : num  222.9 60.2 56.3 108.5 150.5 ...
##  $ V3 : num  24.3 27.8 22.5 19.4 25.5 ...
##  $ V4 : num  2.33 1.15 1.91 2.5 1.71 ...
##  $ V5 : num  7.92 1.7 0.26 3.29 5.44 1.28 7.43 3.39 6.19 0.03 ...
##  $ V6 : num  259.8 68.4 64.6 125.9 162.6 ...
##  $ V7 : num  175.5 48.9 42.3 60 114.8 ...
##  $ V8 : num  1.39e+11 1.08e+10 1.02e+11 1.81e+11 9.88e+10 ...
##  $ V9 : num  9.05e+09 6.01e+08 5.74e+09 1.03e+10 5.64e+09 ...
##  $ V10: num  4.39 3.58 3.74 6.29 2.6 ...
##  $ V11: num  11.34 6.35 3.19 26.14 10.62 ...
##  $ V12: chr  "http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=MMM" "http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=AOS" "http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=ABT" "http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=ABBV" ...
##  - attr(*, ".internal.selfref")=<externalptr>
str(all.sectors)
## Classes 'data.table' and 'data.frame':   505 obs. of  2 variables:
##  $ Symbol: chr  "MMM" "AOS" "ABT" "ABBV" ...
##  $ Sector: chr  "Industrials" "Industrials" "Health Care" "Health Care" ...
##  - attr(*, ".internal.selfref")=<externalptr>
str(all.symbols)
## Classes 'data.table' and 'data.frame':   505 obs. of  2 variables:
##  $ Symbol: chr  "MMM" "AOS" "ABT" "ABBV" ...
##  $ Name  : chr  "3M Company" "A.O. Smith Corp" "Abbott Laboratories" "AbbVie Inc." ...
##  - attr(*, ".internal.selfref")=<externalptr>
str(variable.names)
## Classes 'data.table' and 'data.frame':   1 obs. of  12 variables:
##  $ V1 : chr "Symbol"
##  $ V2 : chr "Price"
##  $ V3 : chr "Price/Earnings"
##  $ V4 : chr "Dividend Yield"
##  $ V5 : chr "Earnings/Share"
##  $ V6 : chr "52 Week Low"
##  $ V7 : chr "52 Week High"
##  $ V8 : chr "Market Cap"
##  $ V9 : chr "EBITDA"
##  $ V10: chr "Price/Sales"
##  $ V11: chr "Price/Book"
##  $ V12: chr "SEC Filings"
##  - attr(*, ".internal.selfref")=<externalptr>

All four commands above show the structure of four data.table that we have created using fread function. Let’s look at the first and last data.table above. Both of these data tables are missing meaningful variable (column) names. The column names for first data tables all.observations are available in variable.names. Let’s assign variable names to all.observations using variable.names data table.

Assigning variable names

To assign the variable name, we would extract the first row from the data table variable.names. The output should be converted into a character vector. As a result, all column values of variable.names data table would become elements of a new character vector. We would use this new character vector to assign the column name to our all.observations data table.

colnames(all.observations) <- as.character(variable.names[1,])
str(all.observations)
## Classes 'data.table' and 'data.frame':   505 obs. of  12 variables:
##  $ Symbol        : chr  "MMM" "AOS" "ABT" "ABBV" ...
##  $ Price         : num  222.9 60.2 56.3 108.5 150.5 ...
##  $ Price/Earnings: num  24.3 27.8 22.5 19.4 25.5 ...
##  $ Dividend Yield: num  2.33 1.15 1.91 2.5 1.71 ...
##  $ Earnings/Share: num  7.92 1.7 0.26 3.29 5.44 1.28 7.43 3.39 6.19 0.03 ...
##  $ 52 Week Low   : num  259.8 68.4 64.6 125.9 162.6 ...
##  $ 52 Week High  : num  175.5 48.9 42.3 60 114.8 ...
##  $ Market Cap    : num  1.39e+11 1.08e+10 1.02e+11 1.81e+11 9.88e+10 ...
##  $ EBITDA        : num  9.05e+09 6.01e+08 5.74e+09 1.03e+10 5.64e+09 ...
##  $ Price/Sales   : num  4.39 3.58 3.74 6.29 2.6 ...
##  $ Price/Book    : num  11.34 6.35 3.19 26.14 10.62 ...
##  $ SEC Filings   : chr  "http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=MMM" "http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=AOS" "http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=ABT" "http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=ABBV" ...
##  - attr(*, ".internal.selfref")=<externalptr>

Command str(all.observations) above shows variable names are all assigned to 12 different columns respectively.

Understand the structure of imported data

At this step, we know that all imported files are available to us as data tables. Structure of all.observation data table shows that there are only two character variables and all other variables are of type numeric. On the flip side, the other two data tables have all variables of type character.

Before we start merging data tables, let’s familiarize ourselves with the common variable names between data tables. This will help us join the two data tables. Let’s look at the first 5 rows of each of the data tables and find out commonality.

names(all.observations)
##  [1] "Symbol"         "Price"          "Price/Earnings" "Dividend Yield"
##  [5] "Earnings/Share" "52 Week Low"    "52 Week High"   "Market Cap"    
##  [9] "EBITDA"         "Price/Sales"    "Price/Book"     "SEC Filings"
names(all.sectors)
## [1] "Symbol" "Sector"
names(all.symbols)
## [1] "Symbol" "Name"

The result above shows that there are common variables names to merge the data tables. To bring all the data into one data table, we should start merging all.observations with all.symbols using the Symbol column and again with all.sectors on Symbol column name. Let’s merge the data tables.

Merge data frames

complete.observations <- merge(merge(all.observations, all.symbols, by = "Symbol"), all.sectors , by = "Symbol")

Rearrange and Prepare final data frame

Now when all the data tables are merged, we will rearrange the variables within the final data table to keep all qualitative and quantitative columns together.

complete.observations <- setcolorder(complete.observations, c("Symbol","Name","Sector","Price","Price/Earnings","Dividend Yield","Earnings/Share","52 Week Low","52 Week High","Market Cap","EBITDA","Price/Sales","Price/Book","SEC Filings"))

head(complete.observations)
##    Symbol                     Name                 Sector  Price Price/Earnings
## 1:      A Agilent Technologies Inc            Health Care  65.05          27.45
## 2:    AAL  American Airlines Group            Industrials  48.60           9.92
## 3:    AAP       Advance Auto Parts Consumer Discretionary 109.63          19.54
## 4:   AAPL               Apple Inc. Information Technology 155.15          16.86
## 5:   ABBV              AbbVie Inc.            Health Care 108.48          19.41
## 6:    ABC   AmerisourceBergen Corp            Health Care  91.55          15.54
##    Dividend Yield Earnings/Share 52 Week Low 52 Week High   Market Cap
## 1:      0.8756979           2.10       75.00        49.23  21984606918
## 2:      0.7782101           3.91       59.08        39.21  24594852352
## 3:      0.2183207           6.19      169.55        78.81   8123611867
## 4:      1.5795412           9.20      180.10       131.12 809508034020
## 5:      2.4995599           3.29      125.86        60.05 181386347059
## 6:      1.6132456           1.64      106.27        71.90  20587704101
##         EBITDA Price/Sales Price/Book
## 1:  1094000000   6.4935630       4.56
## 2:  5761000000   0.5802257       6.03
## 3:   853941000   1.1301061       2.51
## 4: 79386000000   3.4586093       5.66
## 5: 10310000000   6.2915710      26.14
## 6:   991884000   0.1739698       9.73
##                                                           SEC Filings
## 1:    http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=A
## 2:  http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=AAL
## 3:  http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=AAP
## 4: http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=AAPL
## 5: http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=ABBV
## 6:  http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=ABC

Eight steps above explain to us how to prepare data if the data is presented to us in multiple header-less CSV files.

Finally, when we have reached towards the end of the article, let’s compare the file we have prepared with the original file. This test will confirm that the operations we have performed are correct or not. Let’s read the original CSV file into R and compare it with the final data table we have prepared above.

original.file <- fread("constituents-financials_csv.csv") %>% arrange(Symbol,Name,Sector)

complete.observations <- complete.observations %>% 
arrange(Symbol,Name,Sector)

result <- colSums(original.file != complete.observations)
        which(!!result)
## named integer(0)

The result above confirms that the files are identical.

Lastly, the complete code is as follows:

### ----------------------------------------------------------------------------
### Reading multiple header-less CSV files into R and preparing data to analyze
### Copyright - DataENQ.com
### Version - V1.0
### ----------------------------------------------------------------------------

# finding all observation files into working directory
observatiion.files <- list.files(pattern = "^.*observations.*.csv$")

#reading all files into R using read.csv function
observations.list <- lapply(observatiion.files, read.csv, header = FALSE)

# creating one data table by appending all rows of list elements
all.observations <- rbindlist(observations.list)
        
# reading all other CSV files into R
all.sectors <- fread("constituents-financials-sectors.csv")
all.symbols <- fread("constituents-financials-symbol-descriptions.csv")
variable.names <- fread("constituents-financials-variables.csv", header = FALSE)
        
# assigning columns names by converting the contents of first row from variable.names data table. 
colnames(all.observations) <- as.character(variable.names[1,])
complete.observations <- merge(merge(all.observations, all.symbols, by = "Symbol"), all.sectors , by = "Symbol")
complete.observations <- setcolorder(complete.observations, c("Symbol","Name","Sector","Price","Price/Earnings","Dividend Yield","Earnings/Share","52 Week Low","52 Week High","Market Cap","EBITDA","Price/Sales","Price/Book","SEC Filings"))

original.file <- fread("constituents-financials_csv.csv") %>% arrange(Symbol,Name,Sector)
complete.observations <- complete.observations %>% arrange(Symbol,Name,Sector)
result <- colSums(original.file != complete.observations)
which(!!result)

Image Credit unsplash.com

To leave a comment for the author, please follow the link and comment on their blog: dataENQ.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers