Data wrangling: I/O (Part-2)

June 7, 2017

(This article was first published on R-exercises, and kindly contributed to R-bloggers)


Data wrangling is a task of great importance in data analysis. It is the process of importing, cleaning and transforming raw data into actionable information for analysis. It is a time-consuming process, estimated to take about 60-80% of an analyst’s time. In this series we will go through this process. It will be a brief series with the goal of sharpening the reader’s data wrangling skills. This is the second part of the series, and it aims to cover the importing of data from the web. In many cases, downloading data in order to process it can be time-consuming, so being able to import data straight from the web is a ‘nice-to-have’ skill. Moreover, data is not always saved in structured files; it is often found on the web in the form of text and tables. In this set of exercises we will go through the latter case. If you want me to go through the former case as well, please let me know in the comment section.

Before proceeding, it might be helpful to look over the help pages for the functions getURL, fromJSON, ldply, xmlToList, read_html, html_nodes, html_table, readHTMLTable and htmltab.

Moreover, please install and load the following libraries.
install.packages("RCurl")
library(RCurl)
install.packages("rjson")
library(rjson)
install.packages("XML")
library(XML)
install.packages("plyr")
library(plyr)
install.packages("rvest")
library(rvest)
install.packages("htmltab")
library(htmltab)

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Retrieve the source of the web page “https://raw.githubusercontent.com/VasTsak/r-exercises-dw/master/part-1/data.csv” and assign it to the object “url”

Exercise 2

Read the csv file and assign it to the “csv_file” object.
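
As a minimal sketch of the idea behind exercises 1 and 2: once the page source is held in a character string (for example, the value returned by getURL), base R's read.csv can parse it through its text argument. The sample string below is made up for illustration and merely stands in for the downloaded source, so the sketch runs without network access.

```r
# Made-up text standing in for the string that getURL() would return.
csv_text <- "id,name,score\n1,a,9.5\n2,b,7.0\n"

# read.csv() parses raw text when it is supplied through the `text` argument.
csv_file <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Inspect the structure of the resulting data frame.
str(csv_file)
```

The same text argument is available in read.table and read.delim, which may be a better fit for other delimited text formats.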

Exercise 3

Do the same as exercise 1, but with the url: “https://raw.githubusercontent.com/VasTsak/r-exercises-dw/master/part-2/data.txt” and then assign it to the “txt_file” object.
Note: it is a txt file, so you should use the appropriate function to import it.

Exercise 4

Do the same as exercise 1, but with the url: “https://raw.githubusercontent.com/VasTsak/r-exercises-dw/master/part-2/data.json” and then assign it to the “json_file” object.
Note: it is a JSON file, so you should use the appropriate function to import it.
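
A rough sketch of how JSON text can become a data frame, assuming the source is an array of flat records. The sample string and its field names are invented for illustration; the real file may be shaped differently and need other handling.

```r
library(rjson)

# Made-up JSON text standing in for the downloaded source.
json_text <- '[{"name":"a","value":1},{"name":"b","value":2}]'

# fromJSON() parses the text into a list with one element per record.
json_list <- fromJSON(json_text)

# Turn each record into a one-row data frame and stack the rows.
json_file <- do.call(
  rbind,
  lapply(json_list, function(rec) as.data.frame(rec, stringsAsFactors = FALSE))
)

json_file
```

plyr's ldply(json_list, as.data.frame) would be one alternative to the do.call/lapply step; base R is used here only to keep the sketch small.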

Learn more about Data Pre-Processing in the online course R Data Pre-Processing & Data Management – Shape your Data!. In this course you will learn how to:

  • import data into R in several ways while also being able to identify a suitable import tool
  • use SQL code within R
  • And much more

Exercise 5

Do the same as exercise 1, but with the url: “https://raw.githubusercontent.com/VasTsak/r-exercises-dw/master/part-2/data.xml” and then assign it to the “xml_file” object.
Note: it is an XML file, so you should use the appropriate function to import it.
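
One possible way to go from XML text to a data frame combines xmlToList and ldply, both mentioned at the top of this post. The snippet below is only a sketch: the XML string, its tag names and the assumption of one flat element per record are all invented for illustration.

```r
library(XML)
library(plyr)

# Made-up XML text standing in for the downloaded source.
xml_text <- paste0(
  "<records>",
  "<record><name>a</name><value>1</value></record>",
  "<record><name>b</name><value>2</value></record>",
  "</records>"
)

# asText = TRUE tells xmlParse() that the input is XML content, not a file name.
doc <- xmlParse(xml_text, asText = TRUE)

# Convert the tree into a nested list with one element per <record>.
xml_list <- xmlToList(doc)

# Stack the records into a data frame; values arrive as character strings.
xml_file <- ldply(xml_list, function(rec) as.data.frame(rec, stringsAsFactors = FALSE))

xml_file
```

Because the list elements are named, ldply also adds an .id column holding the tag name of each record.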

Exercise 6

We will go through web scraping now. Read the html file “http://www.worldatlas.com/articles/largest-cities-in-europe-by-population.html” and assign it to the object “url”.
hint: consider using read_html

Exercise 7

Select the “table” nodes from the html document you retrieved before.
hint: consider using html_nodes

Exercise 8

Convert the nodes you retrieved in exercise 7 to an actionable list for processing.
hint: consider using html_table
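
The three rvest steps from exercises 6 to 8 can be sketched on a small inline page. The HTML string below is made up for illustration; read_html accepts literal markup as well as a URL, so the sketch runs offline.

```r
library(rvest)

# Made-up HTML standing in for the page that would be fetched from the web.
html_text <- paste0(
  "<html><body><table>",
  "<tr><th>City</th><th>Rank</th></tr>",
  "<tr><td>Alpha</td><td>1</td></tr>",
  "<tr><td>Beta</td><td>2</td></tr>",
  "</table></body></html>"
)

# Step 1: parse the document.
page <- read_html(html_text)

# Step 2: select every <table> node with a CSS selector.
tbls <- html_nodes(page, "table")

# Step 3: convert the selected nodes into a list of data frames.
tbl_list <- html_table(tbls)

tbl_list[[1]]
```

On a real page the list usually holds several tables, so it is worth checking length(tbl_list) before picking one by index.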

Exercise 9

Let’s move on to a faster and more straightforward function. Retrieve the html document as you did in exercise 6 and make it an actionable list using the function readHTMLTable.
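
A minimal sketch of readHTMLTable on a small inline page. Parsing the markup first with htmlParse from the XML package is an assumption about how to prepare the input, and the HTML string itself is made up for illustration.

```r
library(XML)

# Made-up HTML standing in for the page that would be fetched from the web.
html_text <- paste0(
  "<html><body><table>",
  "<tr><th>City</th><th>Rank</th></tr>",
  "<tr><td>Alpha</td><td>1</td></tr>",
  "<tr><td>Beta</td><td>2</td></tr>",
  "</table></body></html>"
)

# Parse the markup; asText = TRUE marks the input as content, not a file name.
doc <- htmlParse(html_text, asText = TRUE)

# Extract every table in the document as a list of data frames.
tables <- readHTMLTable(doc, header = TRUE, stringsAsFactors = FALSE)

tables[[1]]
```

Cell values come back as character strings here, so numeric columns may still need converting afterwards.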

Exercise 10

This may be a bit tricky, but give it a try. Retrieve the html document as you did in exercise 6 and make it an actionable data frame using the function htmltab.

