Big Data Manipulation in R Exercises

June 9, 2017

(This article was first published on R-exercises, and kindly contributed to R-bloggers)


Sometimes it is necessary to download very large CSV files in order to run an analysis. Once file sizes reach the gigabyte range, R is a more practical tool than a spreadsheet. These exercises show how to manipulate files of this kind.

Answers to the exercises are available here.

Exercise 1
Create a directory canada immigration/Work/Income, put all files related to income in it, then load dplyr.
Download the data set from here.
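
A minimal sketch of the setup step. It assumes the directory can live anywhere, so a temporary folder stands in for your project location; the names base_dir and income_dir are illustrative and not part of the exercise.

```r
# Hypothetical base location; in practice this would be your project folder.
base_dir <- tempdir()

# Build the path named in the exercise and create it with its parent folders.
income_dir <- file.path(base_dir, "canada immigration", "Work", "Income")
dir.create(income_dir, recursive = TRUE, showWarnings = FALSE)

# Load dplyr if it is installed (run install.packages("dplyr") otherwise).
if (requireNamespace("dplyr", quietly = TRUE)) {
  suppressPackageStartupMessages(library(dplyr))
}

# Confirm that the directory now exists.
dir.exists(income_dir)
```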

Exercise 2
Create a string vector with file names: 00540002-eng, 00540005-eng, 00540007-eng, 00540009-eng, 00540011-eng, 00540013-eng, 00540015-eng, and 00540017-eng.
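
One way to write this down, as a sketch: the file names are taken from the exercise, while the .csv extension and the name file_paths are assumptions about how the downloaded files are stored.

```r
# File names exactly as listed in the exercise.
file_names <- c("00540002-eng", "00540005-eng", "00540007-eng",
                "00540009-eng", "00540011-eng", "00540013-eng",
                "00540015-eng", "00540017-eng")

# Assumption: the downloaded files carry a .csv extension.
file_paths <- paste0(file_names, ".csv")

length(file_names)
```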

Exercise 3
Create a list of data frames and put the data from each file in its own list position. For example, data[[1]] will contain the first file. To reduce the data size, keep only the 2014 data from each data set.
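
The following sketch shows the read-and-filter pattern on two small toy files, so that it runs without the real download. The column name Ref_Date for the year, the toy file names, and the folder income_demo are all assumptions; check the header of the real files before reusing the filter.

```r
# Toy stand-ins for the downloaded files, written to a temporary folder.
# Assumption: each file has a Ref_Date column holding the reference year.
data_dir <- file.path(tempdir(), "income_demo")
dir.create(data_dir, showWarnings = FALSE)
file_names <- c("toy-a", "toy-b")
for (name in file_names) {
  toy <- data.frame(
    Ref_Date = c(2013, 2014, 2014),
    SEX = c("Both sexes", "Males", "Females"),
    Value = c(10, 4, 6)
  )
  write.csv(toy, file.path(data_dir, paste0(name, ".csv")), row.names = FALSE)
}

# Read each file into its own list position, keeping only rows from 2014.
data <- lapply(file_names, function(name) {
  df <- read.csv(file.path(data_dir, paste0(name, ".csv")),
                 stringsAsFactors = FALSE)
  df[df$Ref_Date == 2014, ]
})

length(data)
```

Filtering inside the loop means the full file is discarded as soon as its 2014 rows are extracted, which keeps memory use down. With dplyr loaded, filter(df, Ref_Date == 2014) expresses the same row selection.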

Exercise 4
Clean up the first data set in the list (data[[1]]) and exclude records that summarize other records, such as “Both sexes”, to avoid double counting when summarizing.
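
As an illustration of why the summary rows have to go, here is a sketch on a toy version of data[[1]]. The column name SEX and the values in it are assumptions; the real files may label this column differently and may contain further summary categories.

```r
# Toy version of data[[1]]; the SEX column name and values are assumptions.
data <- list(data.frame(
  Ref_Date = c(2014, 2014, 2014),
  SEX = c("Both sexes", "Males", "Females"),
  Value = c(10, 4, 6),
  stringsAsFactors = FALSE
))

# Drop the rows that aggregate the others, so totals are not counted twice.
data[[1]] <- data[[1]][data[[1]]$SEX != "Both sexes", ]

# The remaining rows now add up to the total once, not twice.
sum(data[[1]]$Value)
```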

Exercise 5
Clean up all the other data sets in the list, excluding records in the same way as described in exercise 4. Then stack all the data into a single data set.
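
A sketch of the clean-and-stack step on toy data. It applies the filter to every list element, which is harmless for an element that was already cleaned. The columns GEO and SEX, the helper make_toy, and the result name income are assumptions for illustration.

```r
# Toy list of two data sets with identical columns (an assumption; real
# files may need their columns aligned before they can be stacked).
make_toy <- function(geo) {
  data.frame(GEO = geo,
             SEX = c("Both sexes", "Males", "Females"),
             Value = c(10, 4, 6),
             stringsAsFactors = FALSE)
}
data <- list(make_toy("Canada"), make_toy("Ontario"))

# Apply the same filter to every data set in the list.
cleaned <- lapply(data, function(df) df[df$SEX != "Both sexes", ])

# Stack all cleaned data sets into a single data frame.
income <- do.call(rbind, cleaned)

nrow(income)
```

With dplyr loaded, bind_rows(cleaned) is an alternative to do.call(rbind, cleaned) that also tolerates data sets whose columns do not match exactly.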

Learn more about Data Pre-Processing in the online course R Data Pre-Processing & Data Management – Shape your Data!. In this course you will learn how to:

  • import data into R in several ways while also being able to identify a suitable import tool
  • use SQL code within R
  • And much more

Exercise 6
Write the newly created data set to a CSV file.
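
A short sketch of the export, using a toy data frame and a temporary file. The names income, out_file and income_2014.csv are assumptions; substitute the stacked data set and a path inside your own directory.

```r
# Toy stand-in for the stacked data set from the previous exercise.
income <- data.frame(SEX = c("Males", "Females"), Value = c(4, 6))

# Hypothetical output location and file name.
out_file <- file.path(tempdir(), "income_2014.csv")

# row.names = FALSE avoids writing an extra column of row numbers.
write.csv(income, out_file, row.names = FALSE)

# Read the file back to confirm that it round-trips.
check <- read.csv(out_file)
```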

Exercise 7
Create a directory canada immigration/Work/Income, put all files related to income in it, then load dplyr.
Download the data set from here.
Create a string vector with file names: 00540018-eng, 00540019-eng, 00540020-eng, 00540021-eng, 00540022-eng, 00540023-eng, 00540024-eng, and 00540025-eng.
Create a list of data frames and put the data from each file in its own list position. For example, data[[1]] will contain the first file. To reduce the data size, keep only the 2014 data from each data set.
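
Because this exercise repeats the loading steps on a second set of files, one option is to wrap them in a small function. This is only a sketch: read_year is a hypothetical helper, the Ref_Date column name is an assumption, and a toy file stands in for the real tables.

```r
# Hypothetical helper: read every file in `paths`, keep rows from `year`.
# Assumption: the year is stored in a column named Ref_Date.
read_year <- function(paths, year = 2014) {
  lapply(paths, function(path) {
    df <- read.csv(path, stringsAsFactors = FALSE)
    df[df$Ref_Date == year, ]
  })
}

# Toy file standing in for one of the downloaded tables.
toy_path <- file.path(tempdir(), "toy-table.csv")
write.csv(data.frame(Ref_Date = c(2013, 2014), Value = c(1, 2)),
          toy_path, row.names = FALSE)

# One list element per input file, each reduced to the requested year.
data <- read_year(toy_path)
```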

Exercise 8
Clean up the first data set in the list (data[[1]]) and exclude records that summarize other records, such as “Both sexes”, to avoid double counting when summarizing.

Exercise 9
Clean up all the other data sets in the list, excluding records in the same way as described in exercise 8. Then stack all the data into a single data set.

Exercise 10
Write the newly created data set to a CSV file.
