Introduction to Scraping and Wrangling Tables from Research Articles

[This article was first published on Rstats on LIBD rstats club, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

By Steve Semick.

What do you do when you want to use results from the literature to anchor your own analysis? When these results are in the form of an easily accessible table, such as a .csv file or .xlsx file, then it is simple enough to download them and incorporate them into your research. Often times, however, published findings are not so easy to handle. Today, we’ll go through a practical scenario on scraping an html table from a Nature Genetics article into R and wrangling the data into a useful format. This is what the online table we want to scrape looks like:

Example 1A: Scraping a html table from a webpage

Sometimes a table is online as part of a research article but can’t be easily coerced into a useful format. You can’t copy and paste the table into excel and its not stored elsewhere. In these situations you can use the handy R package rvest to scrape it into R from the webpage.

First load the rvest package to scrape the table.

library("rvest")
library("knitr")

Next, get the url of the webpage where the table is stored.

url <- "https://www.nature.com/articles/ng.2802/tables/2"

Now, the trickiest part of the process. We need find where the table lives on this webpage. An excellent guide to finding out how to do this can be found on Cory Nissen’s blogpost– this is also where I learned about using rvest to scrape html tables. Once you have copied the table location, the rest is easy!

table_path='//*[@id="content"]/div/div/figure/div[1]/div/div[1]/table'
nature_genetics_table2 <- url %>%
  read_html() %>%
  html_nodes(xpath=table_path) %>%
  html_table(fill=T)
nature_genetics_table2 = nature_genetics_table2[[1]]

This is what the first few lines of our scraped product looks like:

kable(nature_genetics_table2[1:4,])
SNPa Chr. Positionb Closest genec Major/minor alleles MAFd Stage 1 Stage 1 Stage 2 Stage 2 Overall Overall Overall
SNPa Chr. Positionb Closest genec Major/minor alleles MAFd OR (95% CI)e Meta P value OR (95% CI)e Meta P value OR (95% CI)e Meta P value I2 (%), P valuef
Known GWAS-defined associated genes Known GWAS-defined associated genes Known GWAS-defined associated genes Known GWAS-defined associated genes Known GWAS-defined associated genes Known GWAS-defined associated genes Known GWAS-defined associated genes Known GWAS-defined associated genes Known GWAS-defined associated genes Known GWAS-defined associated genes Known GWAS-defined associated genes Known GWAS-defined associated genes Known GWAS-defined associated genes
rs6656401 1 207692049 CR1 G/A 0.197 1.17 (1.12–1.22) 7.7 × 10−15 1.21 (1.14–1.28) 7.9 × 10−11 1.18 (1.14–1.22) 5.7 × 10−24 0, 7.8 × 10−1
rs6733839 2 127892810 BIN1 C/T 0.409 1.21 (1.17–1.25) 1.7 × 10−26 1.24 (1.18–1.29) 3.4 × 10−19 1.22 (1.18–1.25) 6.9 × 10−44 28, 6.1 × 10−2

While this table has the information we want, it is clearly still a mess. Which brings us to…

Example 1B: Making messy data useful

Fortunately, there is the right number of columns, but there are lines of text breaking the table and stretching through at least one row of the table. There are two others (not shown), so getting these rows into better data format will be our first task.

Cleaning up the rows

We could look at the table and see these lines are on rows 2, 12, and 18. But let’s do this instead using some R code. The trick here is to note that all the elements of these rows contain the exact same text.

v=which(apply(nature_genetics_table2,1, function(x) length(unique(unlist(x))) )==1)

Great, now let’s get rid of these rows but retain the information. We are going to do this in three steps. First, we will split the data.frame into a list based on the location of these descriptions (rows 2,12, and 18). Then, we will clean this list up by keeping only the list elements with data. We will move the text taking up entire rows to an additional column titled “Description”. Lastly, we will concatenate this cleaned list back into a data.frame.

nature_genetics_table2_list = split(nature_genetics_table2, cumsum(1:nrow(nature_genetics_table2) %in% v))
nature_genetics_table2_list = lapply(nature_genetics_table2_list[2:4], function(y){ y$Description= unique(as.character(y[1,]) ) 
y[-1,] } )

nature_genetics_table2_clean = do.call("rbind",nature_genetics_table2_list)

Now let’s look at our data.

kable(nature_genetics_table2_clean[1:3,])
SNPa Chr. Positionb Closest genec Major/minor alleles MAFd Stage 1 Stage 1 Stage 2 Stage 2 Overall Overall Overall Description
1.3 rs6656401 1 207692049 CR1 G/A 0.197 1.17 (1.12–1.22) 7.7 × 10−15 1.21 (1.14–1.28) 7.9 × 10−11 1.18 (1.14–1.22) 5.7 × 10−24 0, 7.8 × 10−1 Known GWAS-defined associated genes
1.4 rs6733839 2 127892810 BIN1 C/T 0.409 1.21 (1.17–1.25) 1.7 × 10−26 1.24 (1.18–1.29) 3.4 × 10−19 1.22 (1.18–1.25) 6.9 × 10−44 28, 6.1 × 10−2 Known GWAS-defined associated genes
1.5 rs10948363 6 47487762 CD2AP A/G 0.266 1.10 (1.07–1.14) 3.1 × 10−8 1.09 (1.04–1.15) 4.1 × 10−4 1.10 (1.07–1.13) 5.2 × 10−11 0, 9 × 10−1 Known GWAS-defined associated genes

Fixing column names

It’s getting better but is still messy. Let’s clean up those columns names. This part we will do by hand.

colnames(nature_genetics_table2_clean) <- c("SNP", "Chr", "Position", "Closest gene", "Major/minor alleles", "MAF", "Stage1_OR", "Stage1_MetaP", "Stage2_OR","Stage2_MetaP",    "Overall_OR", "Overall_MetaP", "I2_Percent/P","Description")
colnames(nature_genetics_table2_clean)
##  [1] "SNP"                 "Chr"                 "Position"           
##  [4] "Closest gene"        "Major/minor alleles" "MAF"                
##  [7] "Stage1_OR"           "Stage1_MetaP"        "Stage2_OR"          
## [10] "Stage2_MetaP"        "Overall_OR"          "Overall_MetaP"      
## [13] "I2_Percent/P"        "Description"

Making a character variable into a numeric variable

It’s coming along.. Next, we need to make the numbers into, well numbers. This will be more useful in R than a character class of data. To do this, try using the as.numeric function as shown.

as.numeric(nature_genetics_table2_clean$Stage1_MetaP)
## Warning: NAs introduced by coercion
##  [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

It doesn’t work because there are weird symbols that R doesn’t understand. Get rid of them with the gsub command and replace them an E (scientific notation).

nature_genetics_table2_clean$Stage1_MetaP = gsub(" × 10","E",nature_genetics_table2_clean$Stage1_MetaP)

Now try converting to a numeric.

as.numeric(nature_genetics_table2_clean$Stage1_MetaP)
## Warning: NAs introduced by coercion
##  [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

It still doesn’t work!! Take a second closer look at the data. Can you discern why the code failed?

nature_genetics_table2_clean$Stage1_MetaP
##  [1] "7.7E-15" "1.7E-26" "3.1E-8"  "8.8E-10" "9.6E-17" "2.8E-11" "6.5E-16"
##  [8] "1.7E-9"  "5.1E-8"  "1.6E-8"  "3.3E-9"  "5.0E-11" "1.5E-7"  "4.6E-8" 
## [15] "9.6E-5"  "2.5E-6"  "1.3E-5"  "7.4E-6"  "6.7E-6"  "1.0E-5"  "1.6E-6"

Yep, that’s right the symbol is not in fact the same as a minus symbol -! We need to replace it. We’ll use the fact that that symbol always appears immediately after a capital E to our advantage.

Split the string on the E using strsplit.

s = strsplit(nature_genetics_table2_clean$Stage1_MetaP, "E")

Get the first and second half of each string.

firstStr = lapply(s, `[[`, 1 )
secondStr=lapply(s, `[[`, 2 )

Knock off that first character: which the symbol we don’t want and slap a minus sign back on .

secondStr= lapply(secondStr,function(x) {paste0("E-",substring(x,2))})

Finally, stitch the two parts of the string back together now that the minus sign has been corrected and convert it to numeric.

mapply( function(firstStr, secondStr) {as.numeric(paste0(firstStr,secondStr) )}, firstStr, secondStr )
##  [1] 7.7e-15 1.7e-26 3.1e-08 8.8e-10 9.6e-17 2.8e-11 6.5e-16 1.7e-09
##  [9] 5.1e-08 1.6e-08 3.3e-09 5.0e-11 1.5e-07 4.6e-08 9.6e-05 2.5e-06
## [17] 1.3e-05 7.4e-06 6.7e-06 1.0e-05 1.6e-06

It works! Make sure to replace the column in the data.frame.

nature_genetics_table2_clean$Stage1_MetaP= mapply( function(firstStr, secondStr) {as.numeric(paste0(firstStr,secondStr) )}, firstStr, secondStr )

See how it appears in the table now as a numeric? Try wrangling the rest of these columns into a useful format on your own and let me know how it goes.

kable(nature_genetics_table2_clean[1:3,])
SNP Chr Position Closest gene Major/minor alleles MAF Stage1_OR Stage1_MetaP Stage2_OR Stage2_MetaP Overall_OR Overall_MetaP I2_Percent/P Description
1.3 rs6656401 1 207692049 CR1 G/A 0.197 1.17 (1.12–1.22) 0 1.21 (1.14–1.28) 7.9 × 10−11 1.18 (1.14–1.22) 5.7 × 10−24 0, 7.8 × 10−1 Known GWAS-defined associated genes
1.4 rs6733839 2 127892810 BIN1 C/T 0.409 1.21 (1.17–1.25) 0 1.24 (1.18–1.29) 3.4 × 10−19 1.22 (1.18–1.25) 6.9 × 10−44 28, 6.1 × 10−2 Known GWAS-defined associated genes
1.5 rs10948363 6 47487762 CD2AP A/G 0.266 1.10 (1.07–1.14) 0 1.09 (1.04–1.15) 4.1 × 10−4 1.10 (1.07–1.13) 5.2 × 10−11 0, 9 × 10−1 Known GWAS-defined associated genes

Conclusions

Today we went through getting data directly into R from an html table (a table on a webpage) and demonstrated how to get the data into a useful format. There are a couple advantages of doing this work in R instead of using excel– if that’s even an option. First, R is more reproducible. Second, once you have code for wrangling one table, wrangling the next one will be much faster. At the end of the day though, it is always easiest when the table is shared as a csv or excel file; something to keep in mind when preparing your own papers.

Reproducibility

## Session info ----------------------------------------------------------------------------------------------------------
##  setting  value                       
##  version  R version 3.4.4 (2018-03-15)
##  system   x86_64, mingw32             
##  ui       RTerm                       
##  language (EN)                        
##  collate  English_United States.1252  
##  tz       America/New_York            
##  date     2018-04-20
## Packages --------------------------------------------------------------------------------------------------------------
##  package   * version date       source                           
##  backports   1.1.2   2017-12-13 CRAN (R 3.4.3)                   
##  base      * 3.4.4   2018-03-15 local                            
##  blogdown    0.5.12  2018-03-23 Github (rstudio/blogdown@21f14af)
##  bookdown    0.7     2018-02-18 CRAN (R 3.4.3)                   
##  compiler    3.4.4   2018-03-15 local                            
##  curl        3.1     2017-12-12 CRAN (R 3.4.3)                   
##  datasets  * 3.4.4   2018-03-15 local                            
##  devtools  * 1.13.5  2018-02-18 CRAN (R 3.4.3)                   
##  digest      0.6.15  2018-01-28 CRAN (R 3.4.3)                   
##  evaluate    0.10.1  2017-06-24 CRAN (R 3.4.3)                   
##  graphics  * 3.4.4   2018-03-15 local                            
##  grDevices * 3.4.4   2018-03-15 local                            
##  highr       0.6     2016-05-09 CRAN (R 3.4.3)                   
##  htmltools   0.3.6   2017-04-28 CRAN (R 3.4.3)                   
##  httr        1.3.1   2017-08-20 CRAN (R 3.4.3)                   
##  knitr     * 1.20    2018-02-20 CRAN (R 3.4.3)                   
##  magrittr    1.5     2014-11-22 CRAN (R 3.4.3)                   
##  memoise     1.1.0   2017-04-21 CRAN (R 3.4.3)                   
##  methods   * 3.4.4   2018-03-15 local                            
##  R6          2.2.2   2017-06-17 CRAN (R 3.4.3)                   
##  Rcpp        0.12.16 2018-03-13 CRAN (R 3.4.4)                   
##  rmarkdown   1.9     2018-03-01 CRAN (R 3.4.3)                   
##  rprojroot   1.3-2   2018-01-03 CRAN (R 3.4.3)                   
##  rvest     * 0.3.2   2016-06-17 CRAN (R 3.4.3)                   
##  selectr     0.3-2   2018-03-05 CRAN (R 3.4.3)                   
##  stats     * 3.4.4   2018-03-15 local                            
##  stringi     1.1.7   2018-03-12 CRAN (R 3.4.4)                   
##  stringr     1.3.0   2018-02-19 CRAN (R 3.4.3)                   
##  tools       3.4.4   2018-03-15 local                            
##  utils     * 3.4.4   2018-03-15 local                            
##  withr       2.1.2   2018-03-15 CRAN (R 3.4.4)                   
##  xfun        0.1     2018-01-22 CRAN (R 3.4.3)                   
##  xml2      * 1.2.0   2018-01-24 CRAN (R 3.4.3)                   
##  yaml        2.1.18  2018-03-08 CRAN (R 3.4.3)

To leave a comment for the author, please follow the link and comment on their blog: Rstats on LIBD rstats club.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)