Web Scraping Exercises

December 20, 2016

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

[Before starting these exercises, read the help pages for the rvest package and for SelectorGadget.]

Answers to the exercises are available here.

Exercise 1

Consider the url ‘http://statbel.fgov.be/en/statistics/figures/economy/indicators/prix_prod_con/’
Extract all the information contained in the table ‘Third Quarter 2016’.
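As a hint for the general technique (not the published answer), the sketch below shows how an HTML table can be turned into a data frame with rvest. The inline HTML, the `q3` id and the column names are invented stand-ins for the real page, and the code assumes rvest 1.0 or later.

```r
library(rvest)

# Hypothetical stand-in for the real page: a heading followed by one table.
page <- read_html('
<html><body>
  <h3>Third Quarter 2016</h3>
  <table id="q3">
    <tr><th>Indicator</th><th>Value</th></tr>
    <tr><td>A</td><td>101.2</td></tr>
    <tr><td>B</td><td>98.7</td></tr>
  </table>
</body></html>')

# Select the table node by its (assumed) id, then convert it to a data frame.
q3 <- page %>%
  html_element("table#q3") %>%
  html_table()

print(q3)
```

On a live page with several tables, `html_elements("table")` returns all of them, and the right one can be picked by position or with a selector found via SelectorGadget.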

Exercise 2

Consider the url ‘http://www2.sas.com/proceedings/sugi30/toc.html’
Extract the names of all the papers, from 001-30 to 268-30.

Exercise 3

Consider the url ‘http://www.gibbon.se/Retailer/Map.aspx?SectionId=832’
Extract all the options (countries) available in the select (drop-down) element.
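A minimal sketch of reading the entries of a drop-down list, assuming the countries are ordinary `option` nodes inside a `select`. The `country` id and the option values below are hypothetical and would need to be replaced by whatever the real page uses.

```r
library(rvest)

# Hypothetical markup for a country drop-down.
page <- read_html('
<select id="country">
  <option value="">Choose country</option>
  <option value="SE">Sweden</option>
  <option value="NO">Norway</option>
</select>')

# Every <option> inside the select is one selectable entry.
opts <- page %>% html_elements("select#country option")

countries <- data.frame(
  value = html_attr(opts, "value"),
  label = html_text(opts, trim = TRUE),
  stringsAsFactors = FALSE
)

# Drop the placeholder entry, which carries an empty value.
countries <- countries[countries$value != "", ]
print(countries)
```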

Exercise 4

Consider the url ‘http://r-exercises.com/start-here-to-learn-r/’
Extract all the topics listed on that page.

Exercise 5

Consider the url ‘http://www.immobiliare.it/Roma/agenzie_immobiliari_provincia-Roma.html’
Extract the names of all the real estate agencies (agenzie immobiliari) listed on the first page.

Exercise 6

Consider the url ‘http://www.gibbon.se/Retailer/Map.aspx?SectionId=832’.
Extract the links to the detailed information for each row of the table.
For example, for the first address, Karlbergsvägen 32, 113 27 stockholm, the details are:
A.E.N HUND I STAN AB
ADRESS OCH ÖPPETTIDER
Karlbergsvägen 32
113 27 STOCKHOLM
Öppettider:
Telefon: 08-313058
Mail-adress: [email protected]
Hemsida:
The link to those details (reached by clicking on Karlbergsvägen 32, 113 27 stockholm) is http://www.gibbon.se/Retailer/Retailer.aspx?ItemId=45128.
You have to extract all the links available, one per row.
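One possible approach to collecting such links is to read the `href` attribute of each anchor and resolve it against the page address. In the sketch below the table markup, the `retailers` class and the second `ItemId` are hypothetical; only the first link and the base url come from the exercise text.

```r
library(rvest)

base_url <- "http://www.gibbon.se/Retailer/Map.aspx?SectionId=832"

# Hypothetical markup: one anchor per table row, with a relative href.
page <- read_html('
<table class="retailers">
  <tr><td><a href="Retailer.aspx?ItemId=45128">Row 1</a></td></tr>
  <tr><td><a href="Retailer.aspx?ItemId=99999">Row 2</a></td></tr>
</table>')

# Pull the href attributes and turn relative paths into absolute urls.
hrefs <- page %>%
  html_elements("table.retailers a") %>%
  html_attr("href")
links <- xml2::url_absolute(hrefs, base_url)

print(links)
```

Resolving against the base url matters because many pages store only relative paths in `href`.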

Exercise 7

Consider the url ‘https://www.bkk-klinikfinder.de/suche/suchergebnis.php?next=1’
Extract the links to the detailed information of each hospital. For example, for the hospital
Krankenhaus Dresden-Friedrichstadt Städtisches Klinikum, the details are available on the link:
https://www.bkk-klinikfinder.de/krankenhaus/index.php?id=26140094900

Exercise 8

Consider the url scraped in Exercise 7.
Extract the links to ‘Details’ for each hospital displayed on the first 4 pages.
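Scraping several result pages usually comes down to generating one url per page and applying the same extraction function to each. The sketch below assumes that the page number is carried by the `next` query parameter and that each hospital has an anchor whose text is exactly "Details"; both are assumptions to verify on the live site, and the demo page is invented.

```r
library(rvest)

# Assumption: pages 1 to 4 differ only in the value of `next`.
page_urls <- paste0(
  "https://www.bkk-klinikfinder.de/suche/suchergebnis.php?next=", 1:4
)

# Keep only anchors labelled "Details" and return their absolute urls.
extract_detail_links <- function(page, base_url) {
  anchors <- html_elements(page, "a")
  is_details <- html_text(anchors, trim = TRUE) == "Details"
  xml2::url_absolute(html_attr(anchors[is_details], "href"), base_url)
}

# Offline demo on a hypothetical fragment of one result page.
demo_page <- read_html('
<div>
  <a href="/krankenhaus/index.php?id=26140094900">Details</a>
  <a href="/impressum.php">Impressum</a>
</div>')
demo_links <- extract_detail_links(demo_page, page_urls[1])
print(demo_links)

# Against the live site, the same function could be applied to every page:
# all_links <- unlist(lapply(page_urls, function(u) {
#   Sys.sleep(1)  # pause between requests
#   extract_detail_links(read_html(u), u)
# }))
```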

Exercise 9

Consider the url ‘http://www.dictionary.com/browse/’ and the words ‘handy’, ‘whisper’, ‘lovely’ and ‘scrape’.
Build a data frame in which the first variable is “Word” and the second variable is “definitions”. Scrape the definitions from the url.
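The overall shape of such a solution might look like the sketch below: build one url per word, extract the definition text from each page, and bind the results into a data frame. The `.def-content` selector and the `fake_page()` helper are hypothetical; the helper stands in for `read_html(url)` so that the example runs offline.

```r
library(rvest)

base_url <- "http://www.dictionary.com/browse/"
words <- c("handy", "whisper", "lovely", "scrape")
word_urls <- paste0(base_url, words)

# Hypothetical selector: inspect the real page to find the right one.
get_definitions <- function(page, css = ".def-content") {
  defs <- html_text(html_elements(page, css), trim = TRUE)
  paste(defs, collapse = "; ")
}

# Offline stand-in for read_html(url): a tiny fake page per word.
fake_page <- function(word) {
  read_html(sprintf('<div class="def-content">definition of %s</div>', word))
}

result <- data.frame(
  Word = words,
  definitions = vapply(
    words,
    function(w) get_definitions(fake_page(w)),
    character(1),
    USE.NAMES = FALSE
  ),
  stringsAsFactors = FALSE
)
print(result)
```

For the live site, `fake_page(w)` would be replaced by `read_html(paste0(base_url, w))`.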

Exercise 10

Consider the url ‘http://www.gibbon.se/Retailer/Map.aspx?SectionId=832’.
Build a data frame with all the information available for each row.
For example, for the first address, Karlbergsvägen 32, 113 27 stockholm, the details are:
A.E.N HUND I STAN AB

ADRESS OCH ÖPPETTIDER
Karlbergsvägen 32
113 27 STOCKHOLM
Öppettider:
Telefon: 08-313058
Mail-adress: [email protected]
Hemsida:
For the second row, Inedalsgatan 5, 112 33 stockholm, the details are
ARKENZOO KUNGSHOLMEN A
ADRESS OCH ÖPPETTIDER
Kungs Zoo AB
Inedalsgatan 5
112 33 STOCKHOLM
Öppettider:
Telefon: 08-7248110
Mail-adress: [email protected]
Hemsida: www.arkenzoo.se

These details will be saved as the first two rows of the data.frame:

  Website address  Name of store           Phone Number  Email address     City       Country
1                  A.E.N Hund i Stan AB    08-313058     [email protected]   Stockholm  Sweden
2 www.arkenzoo.se  ArkenZoo Kungsholmen A  08-7248110    [email protected]   Stockholm  Sweden
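Once the detail text of a store has been scraped, the labelled lines still have to be split into columns. The base R sketch below shows one way to do that; the `field()` helper and the shortened column names are hypothetical, and only a few lines of the second store's details are used as input.

```r
# Detail lines as quoted in the exercise (shortened to three lines).
details <- c(
  "ARKENZOO KUNGSHOLMEN A",
  "Telefon: 08-7248110",
  "Hemsida: www.arkenzoo.se"
)

# Return the text after "<label>:", or NA when the label is missing or empty.
field <- function(lines, label) {
  pattern <- paste0("^", label, ":")
  hit <- grep(pattern, lines, value = TRUE)
  if (length(hit) == 0) return(NA_character_)
  value <- trimws(sub(pattern, "", hit[1]))
  if (nzchar(value)) value else NA_character_
}

store_row <- data.frame(
  Website = field(details, "Hemsida"),
  Name = details[1],
  Phone = field(details, "Telefon"),
  stringsAsFactors = FALSE
)
print(store_row)
```

Returning `NA` for empty fields keeps rows such as the first store, whose Hemsida entry is blank, aligned with the others when the rows are combined with `rbind()`.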
