Data wrangling : I/O (Part-2)

Posted on June 7, 2017 by Vasileios Tsakalos in R bloggers

[This article was first published on R-exercises, and kindly contributed to R-bloggers].


Data wrangling is a task of great importance in data analysis. It is the process of importing, cleaning and transforming raw data into actionable information for analysis. It is a time-consuming process, estimated to take about 60-80% of an analyst’s time. In this series we will go through this process. It will be a brief series whose goal is to sharpen the reader’s data wrangling skills. This is the second part of the series, and it covers importing data from the web. Downloading data before processing it can be time-consuming, so being able to import data straight from the web is a ‘nice-to-have’ skill. Moreover, data is not always saved in structured files; it is often on the web in the form of text and tables. In this set of exercises we will go through the latter case. If you want me to go through the former case as well, please let me know in the comment section.

Before proceeding, it might be helpful to look over the help pages for getURL, fromJSON, ldply, xmlToList, read_html, html_nodes, html_table, readHTMLTable and htmltab.

Moreover, please install (if needed) and load the following libraries.
install.packages("RCurl")
library(RCurl)
install.packages("rjson")
library(rjson)
install.packages("XML")
library(XML)
install.packages("plyr")
library(plyr)
install.packages("rvest")
library(rvest)
install.packages("htmltab")
library(htmltab)

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Retrieve the source of the web page “https://raw.githubusercontent.com/VasTsak/r-exercises-dw/master/part-1/data.csv” and assign it to the object “url”

Exercise 2

Read the csv file and assign it to the “csv_file” object.
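
As a small illustration of the idea behind Exercises 1 and 2: getURL returns the page source as a character string, and read.csv can parse a string through its text argument. The sketch below uses a made-up inline string in place of the download so that it runs offline; the column names and values are invented for illustration and are not the contents of the exercise file.

```r
# Stand-in for the string that getURL() would return; the values are made up.
url <- "id,city,population\n1,Alpha,100\n2,Beta,200\n"

# read.csv() can parse a character string passed through its `text` argument.
csv_file <- read.csv(text = url, stringsAsFactors = FALSE)

# Inspect the structure of the resulting data frame.
str(csv_file)
```

With the real page, the string would come from getURL instead of being typed in by hand.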

Exercise 3

Do the same as in Exercise 1, but with the URL “https://raw.githubusercontent.com/VasTsak/r-exercises-dw/master/part-2/data.txt”, and then assign it to the “txt_file” object.
Note: it is a txt file, so you should use the appropriate function to import it.

Exercise 4

Do the same as in Exercise 1, but with the URL “https://raw.githubusercontent.com/VasTsak/r-exercises-dw/master/part-2/data.json”, and then assign it to the “json_file” object.
Note: it is a JSON file, so you should use the appropriate function to import it.
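
To give a feel for how JSON text turns into R objects, here is a minimal sketch using fromJSON from rjson and ldply from plyr. The JSON string is a made-up example, not the exercise file, and the structure of the real file may well differ, so treat this as one possible approach rather than the solution.

```r
library(rjson)
library(plyr)

# Made-up JSON text: an array of two objects (hypothetical data).
json_text <- '[{"city": "Alpha", "population": 100},
               {"city": "Beta",  "population": 200}]'

# fromJSON() parses the string into a list with one named list per object.
parsed <- fromJSON(json_text)

# ldply() applies data.frame() to each element and row-binds the results.
json_file <- ldply(parsed, data.frame)

print(json_file)
```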


Exercise 5

Do the same as in Exercise 1, but with the URL “https://raw.githubusercontent.com/VasTsak/r-exercises-dw/master/part-2/data.xml”, and then assign it to the “xml_file” object.
Note: it is an XML file, so you should use the appropriate function to import it.
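
The same idea can be sketched for XML with the XML package: parse the text, then convert the tree into a nested list with xmlToList. The document below is invented for illustration (the tag names and values are assumptions, not the exercise data), and the text is parsed from a string so the example runs offline.

```r
library(XML)

# Made-up XML text (hypothetical tags and values).
xml_text <- "<cities>
  <city><name>Alpha</name><population>100</population></city>
  <city><name>Beta</name><population>200</population></city>
</cities>"

# asText = TRUE tells xmlParse() that the input is XML content, not a file name.
doc <- xmlParse(xml_text, asText = TRUE)

# xmlToList() turns the tree into a nested list; leaf values stay character strings.
xml_file <- xmlToList(doc)

str(xml_file)
```

Note that the values come back as text, so numeric fields need an explicit conversion such as as.numeric before any calculation.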

Exercise 6

We will go through web scraping now. Read the HTML file “http://www.worldatlas.com/articles/largest-cities-in-europe-by-population.html” and assign it to the object “url”.
Hint: consider using read_html

Exercise 7

Select the “table” nodes from the HTML document you retrieved before.
Hint: consider using html_nodes

Exercise 8

Convert the nodes you retrieved in Exercise 7 into an actionable list for processing.
Hint: consider using html_table
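
The three scraping steps of Exercises 6 to 8 can be sketched on a tiny, made-up HTML page instead of the real website, so the example runs offline. The table contents are invented, and the object names (page_source, page, table_nodes, tables) are only for illustration.

```r
library(rvest)

# Made-up HTML source with a single table (hypothetical data).
page_source <- "<html><body><table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Alpha</td><td>100</td></tr>
  <tr><td>Beta</td><td>200</td></tr>
</table></body></html>"

# Step 1: parse the HTML (read_html() also accepts a URL).
page <- read_html(page_source)

# Step 2: select every <table> node with a CSS selector.
table_nodes <- html_nodes(page, "table")

# Step 3: convert the selected nodes into a list of tables.
tables <- html_table(table_nodes)

print(tables[[1]])
```

A real page usually contains several tables, so it is worth checking the length of the list before picking an element.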

Exercise 9

Let’s move on to a faster and more straightforward function. Retrieve the HTML document as you did in Exercise 6 and make it an actionable list using the function readHTMLTable.
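
For comparison, here is a rough sketch of readHTMLTable from the XML package on a made-up page. The HTML is parsed from a string with htmlParse so the example runs offline; the data are invented, and the real page will return more tables than this toy one.

```r
library(XML)

# Made-up HTML source with a single table (hypothetical data).
page_source <- "<html><body><table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Alpha</td><td>100</td></tr>
  <tr><td>Beta</td><td>200</td></tr>
</table></body></html>"

# Parse the HTML text; asText = TRUE marks the input as content, not a file name.
doc <- htmlParse(page_source, asText = TRUE)

# readHTMLTable() returns a list with one data frame per <table> in the document.
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)

print(tables[[1]])
```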

Exercise 10

This may be a bit tricky, but give it a try. Retrieve the HTML document as you did in Exercise 6 and turn it into an actionable data frame using the function htmltab.

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.
