Tutorial: Web Scraping of Multiple Pages using R

[This article was first published on R Blogs, and kindly contributed to R-bloggers.]

Introduction

In today’s world, data is being generated at an exponential rate. This massive amount of data and information is valuable to individuals and tech giants alike.

Having access to plenty of accurate data helps in any field, both for gaining insights and for further analysis. That is why web scraping has become a must-have skill, especially if you are a data scientist.

A great deal of data is available on the Internet today. But how do you scrape the data that might be useful to you? With today’s tools and programming languages, scraping data from the web is a fairly easy job.

Let’s dive straight to the point.

Web Scraping?

Web Scraping is just a technique to convert unorganized data that is usually available on the internet to an organized format so that it can be useful to us.

The most basic way of scraping data is the old-school method of COPY AND PASTE. To be honest, this method might sound easy, but it is taxing, monotonous, time-consuming and not at all fascinating.

With a few lines of code, the same job can be automated. So, let’s see how we can scrape data.

Web Scraping using R

Assuming you have a basic knowledge of how R works and of its syntax, let’s get straight to this short tutorial, where I’ll show you how to scrape data from multiple pages at once using R.

For general text data scraping, you can visit: Basic Web Scraping

About the Data

The Numbers hosts a complete list of movies with their release dates, production budgets and gross revenue information. The profit and loss figures are very rough estimates based on domestic and international box office earnings and domestic video sales, extrapolated to estimate worldwide income to the studio, after deducting retail costs.

Note: The movie data is in tabular format.

Here are the steps to follow:

  • Open RStudio. Then, in a new file:

Package Installation

Install the required packages if you do not have them yet (for example with install.packages()), then load them with library().

  • xml2: xml2 is a wrapper around the comprehensive libxml2 C library that makes it easier to work with XML and HTML in R.

  • rvest: rvest helps you scrape information from web pages.

  • tibble: The tibble package provides utilities for handling tibbles, where “tibble” is a colloquial term for the S3 tbl_df class. The tbl_df class is a special case of the base data.frame.

library(xml2)
library(rvest)   ##very important
library(tibble)
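The library() calls above fail if a package has not been installed yet. A minimal sketch for checking this first, assuming a hypothetical helper named missing_packages() (not part of any package) that uses base R's requireNamespace():

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("xml2", "rvest", "tibble"))

if (length(to_install) > 0) {
  # Install them once with: install.packages(to_install)
  message("Not installed yet: ", paste(to_install, collapse = ", "))
}
```

The sketch only reports what is missing, so running it never changes your library; the actual install.packages() call is left to you.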


  • Storing the URL of the first page of the table, which holds data for about 100 movies, in base_url:
base_url <- "https://www.the-numbers.com/movie/budgets/all"


  • Scraping html content from the stored url:
base_webpage <- read_html(base_url)


  • Now, look at the URL of the next page of the table: /101 is present after all. Similarly, there are many more pages, each with 100 movies in the table and each with its own URL.

So, should we store 100 URLs for 100 pages for 10,000 movies? Of course not! R supports string formatting for exactly this. See the sprintf() documentation for details.

Hence, for strings, we use %s: sprintf() replaces the %s placeholder with the value we pass in.

new_urls<- "https://www.the-numbers.com/movie/budgets/all/%s"
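To see what the template produces, here is a small sketch using only base R. The name offsets and the range 101 to 501 are illustrative assumptions; sprintf() is vectorised, so it fills the %s placeholder once per value:

```r
# URL template from the tutorial: %s is replaced by the page offset
new_urls <- "https://www.the-numbers.com/movie/budgets/all/%s"

# Illustrative offsets for the first few follow-up pages (assumed range)
offsets <- seq(from = 101, to = 501, by = 100)

# One URL per offset
page_urls <- sprintf(new_urls, offsets)
print(page_urls)
```

Nothing is downloaded here; the sketch only builds the strings that read_html() would later receive.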


  • Creating dataframe of the first 100 movies:
    • html_table(): converts html tables into dataframes.
table_base <- rvest::html_table(base_webpage)[[1]] %>% 
  tibble::as_tibble(.name_repair = "unique") # repair the repeated columns
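If you want to try read_html() and html_table() without hitting the website, the following sketch parses a tiny HTML table written inline. The table contents and the variable names are made up for illustration:

```r
library(xml2)
library(rvest)

# A made-up HTML page containing one small table
html <- "<html><body><table>
<tr><th>Movie</th><th>Budget</th></tr>
<tr><td>Film A</td><td>$10</td></tr>
<tr><td>Film B</td><td>$20</td></tr>
</table></body></html>"

# read_html() also accepts a literal HTML string instead of a URL
page <- read_html(html)

# html_table() returns a list with one element per <table> on the page
tables <- html_table(page)
first <- tables[[1]]
print(first)
```

The [[1]] index is the same one used in the tutorial code: it picks the first table found on the page.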


  • Creating dataframe of the next set of movies:
# creating two empty dataframes: one for the current page, one to accumulate all pages
table_new <- data.frame()
df <- data.frame()

# iterator: the offset that appears at the end of each page URL
i <- 101

# i takes the values 101, 201, ..., 5501, so the loop runs 55 times,
# fetching one page of up to 100 movies each time and appending it to df
while (i < 5502) {
  new_webpage <- read_html(sprintf(new_urls, i))
  table_new <- rvest::html_table(new_webpage)[[1]] %>% 
    tibble::as_tibble(.name_repair = "unique") # repair the repeated columns
  df <- rbind(df, table_new)
  i <- i + 100
}
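The same iteration pattern can be checked offline. In the sketch below, fake_fetch() is a hypothetical stand-in for the read_html() and html_table() calls, returning 100 made-up rows per page. It also collects the pages in a list and binds them once at the end, which is an alternative to growing df inside the loop, not the method used above:

```r
# Offsets visited by the loop: 101, 201, ..., 5501
offsets <- seq(from = 101, to = 5501, by = 100)
n_pages <- length(offsets)

# Hypothetical stand-in for downloading and parsing one page
fake_fetch <- function(offset) {
  data.frame(rank = offset:(offset + 99), page_start = offset)
}

# Fetch every page into a list, then bind all pages in one step
pages <- lapply(offsets, fake_fetch)
all_rows <- do.call(rbind, pages)

print(n_pages)
print(nrow(all_rows))
```

Binding once at the end avoids copying the accumulated dataframe on every iteration, which matters more as the number of pages grows.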


  • Merge table_base and df:
df_movies <- merge(table_base, df, all = TRUE)
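A note on this step: merge() with all = TRUE keeps every row from both tables, and by default it also sorts the result on the columns the tables share. When the rank column is text, that sort is alphabetical rather than numeric. The sketch below uses toy data (all names and values are hypothetical) to compare merge() with rbind(), which simply stacks the rows; rbind() is shown as an alternative, not as the method of this tutorial:

```r
# Toy stand-ins for table_base and df (hypothetical values)
first_page <- data.frame(Rank = c("1", "2"),
                         Movie = c("Film A", "Film B"),
                         stringsAsFactors = FALSE)
later_pages <- data.frame(Rank = c("1,000", "1,001"),
                          Movie = c("Film C", "Film D"),
                          stringsAsFactors = FALSE)

# rbind() stacks the rows and keeps the scraped order
stacked <- rbind(first_page, later_pages)

# merge() with all = TRUE also keeps every row, but sorts on the shared columns
merged <- merge(first_page, later_pages, all = TRUE)

print(stacked$Rank)
print(merged$Rank)
```

In the merged result, the text sort places "1,000" ahead of "2", so the rows are no longer in rank order.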


  • Let us see what our dataframe looks like:
head(df_movies)
##    ...1  ReleaseDate                              Movie ProductionBudget
## 1     1 Apr 23, 2019                  Avengers: Endgame     $400,000,000
## 2 1,000 Apr 28, 2000 The Flintstones in Viva Rock Vegas      $58,000,000
## 3 1,001  Apr 4, 2008                       Leatherheads      $58,000,000
## 4 1,002 Mar 22, 2017                               Life      $58,000,000
## 5 1,003 Dec 18, 2009    Did You Hear About the Morgans?      $58,000,000
## 6 1,004 Dec 12, 2008         Che, Part 1: The Argentine      $58,000,000
##   DomesticGross WorldwideGross
## 1  $858,373,000 $2,797,800,564
## 2   $35,231,365    $59,431,365
## 3   $31,373,938    $41,348,628
## 4   $30,234,022   $100,929,666
## 5   $29,580,087    $80,480,566
## 6    $1,802,521    $31,627,370


Voilà! We have accomplished our task.



  • Now, if you want, you can write this dataframe to a csv file to save it on your system:
write.csv(df_movies,"moviesData_tutorial.csv")
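By default, write.csv() also writes the row names as an extra first column. A small sketch on toy data (the values and the temporary file path are assumptions) that turns this off with row.names = FALSE and reads the file back to check the round trip:

```r
# Toy dataframe standing in for df_movies (made-up values)
toy <- data.frame(Movie = c("Film A", "Film B"),
                  ProductionBudget = c("$1,000", "$2,000"),
                  stringsAsFactors = FALSE)

# Write to a temporary file so nothing is left behind in the working directory
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)

# Read it back; the currency strings come back unchanged as text
restored <- read.csv(csv_path, stringsAsFactors = FALSE)
print(restored)
```

Whether to keep the row names is a matter of taste; dropping them avoids an unnamed index column when the file is opened elsewhere.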

Conclusion

See, it is done. With a few lines of code, we were able to extract data from multiple pages using a single loop. The key idea in this tutorial is string formatting: one URL template with a %s placeholder covers every page.

Stay tuned for more tutorials!
Thank You!
