Data table exercises: keys and subsetting

March 21, 2016

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

The data.table package is a popular R package that facilitates fast selections, aggregations and joins on large data sets. It is well documented through several vignettes, and even has its own interactive course, offered by DataCamp. For those who want to build some mileage practising the use of data.table, there's good news! In the coming weeks, we'll dive into the package with several exercise sets. We'll start with the first set today, focusing on creating data.tables, defining keys and subsetting. Before proceeding, make sure you have installed the data.table package from CRAN and studied the vignettes.

Answers to the exercises are available here. For the other (upcoming) exercise sets on data.table, check back next week here. If there are particular topics or problems related to data.table that you'd like to see included in subsequent exercise sets, please post them as a comment below.

Exercise 1
Setup: Read the white wine quality dataset from the UCI repository (available for download from: http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv) as a data.table into an object named df. To demonstrate the speed of data.table, we're going to make this dataset much bigger, with:
df
Check that the resulting data.table has roughly 4.8 million rows and 12 variables.
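
As a warm-up for this kind of setup, here is a minimal sketch of reading delimited text with fread() and checking the dimensions of the result. The inline text, its values and the object names toy and big are made up for illustration, and the row-replication line is only one possible way to enlarge a table; it is an assumption, not necessarily the step intended in the exercise.

```r
library(data.table)

# Tiny stand-in for a semicolon-separated file (values are made up).
csv_text <- "fixed acidity;residual sugar;pH;quality
7.0;20.7;3.00;6
6.3;1.6;3.30;6
8.1;6.9;3.26;9"

# fread() detects the ';' separator and the header row by itself.
# For real data, a URL or a local file path would be passed instead of text=.
toy <- fread(text = csv_text)

# Hypothetical enlargement: repeat the full set of row indices 1000 times.
big <- toy[rep(seq_len(nrow(toy)), 1000)]

dim(big)            # number of rows, then number of columns
is.data.table(big)  # TRUE
```

dim() returns rows first and columns second, so it covers both parts of the check in one call.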

Exercise 2
Check if df contains any keys. If no keys are present, create a key for the quality variable. Confirm that the key has been set.
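
The functions involved in inspecting and setting keys can be sketched on a small made-up table. Only the column name quality comes from the exercise; the table dt and its values are hypothetical.

```r
library(data.table)

# Made-up toy table for illustration.
dt <- data.table(quality = c(6L, 9L, 7L, 9L, 5L),
                 pH      = c(3.0, 3.3, 3.2, 3.1, 2.9))

haskey(dt)   # FALSE: no key has been set yet
key(dt)      # NULL

# setkey() sorts dt by quality in place (by reference) and marks it as the key.
setkey(dt, quality)

haskey(dt)   # TRUE
key(dt)      # "quality"
```

Because setkey() works by reference, there is no need to assign its result back to dt.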

Exercise 3
Create a new data.table df2, containing the subset of df with quality equal to 9.
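
A keyed subset looks slightly different from an ordinary logical subset. The sketch below shows both forms side by side on a made-up table (dt, id and the values are hypothetical).

```r
library(data.table)

dt <- data.table(quality = c(6L, 9L, 7L, 9L, 5L), id = 1:5)
setkey(dt, quality)

# Keyed lookup: .() builds a one-row table that is joined against the key,
# so the matching rows are located by binary search.
sub_keyed <- dt[.(9L)]

# Ordinary logical subset, which works with or without a key.
sub_scan <- dt[quality == 9L]
```

Both forms return the same rows here; the difference lies in how the rows are found.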

Exercise 4
Remove the key from df, and repeat exercise 3. How much slower is this?
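
One way to compare the two approaches is system.time(), sketched here on a synthetic table of one million rows. The table, the seed and the object names are assumptions for illustration, and the timings will differ per machine.

```r
library(data.table)

set.seed(1)
n  <- 1e6
dt <- data.table(quality = sample(3:9, n, replace = TRUE), x = runif(n))

# Timed lookup through the key.
setkey(dt, quality)
t_keyed <- system.time(a <- dt[.(9L)])

# setkey(dt, NULL) removes the key; the rows keep their current order.
setkey(dt, NULL)
haskey(dt)   # FALSE

# Timed subset without a key.
t_scan <- system.time(b <- dt[quality == 9L])

# Elapsed seconds for each approach.
c(keyed = t_keyed[["elapsed"]], scan = t_scan[["elapsed"]])
```

Recent data.table versions build a secondary index automatically for == and %in% subsets by default, so the gap measured today may be smaller than it was when this post was written.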

Exercise 5
Create a new data.table df2, containing the subset of df with quality equal to 7, 8 or 9. Do this first without a key, then with a key, and compare the run times.
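
Selecting several key values at once follows the same pattern as selecting one. A minimal sketch on a made-up table (dt, id and the values are hypothetical):

```r
library(data.table)

dt <- data.table(quality = c(6L, 9L, 7L, 8L, 5L, 7L), id = 1:6)

# Without a key: logical subset with %in%.
sub_scan <- dt[quality %in% 7:9]

# With a key: pass all wanted key values in a single .() call.
setkey(dt, quality)
sub_keyed <- dt[.(7:9)]
```

The keyed result is ordered by key value, whereas the %in% result keeps the original row order. If a requested key value has no match, the keyed form returns a row of NAs unless nomatch = 0L is supplied.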

Exercise 6
Create a new data.table df3 containing the subset of observations from df with fixed acidity < 8, residual sugar < 5 and pH < 3. Do this first without keys, then with keys, and compare the run times. Explain why the differences are small.
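
Several conditions can be combined with & inside the brackets. The sketch below uses made-up values; the column names follow the wine file, and because they contain spaces they need backticks.

```r
library(data.table)

# Made-up values for illustration.
dt <- data.table(
  `fixed acidity`  = c(7.0, 8.5, 6.2, 7.9),
  `residual sugar` = c(4.0, 1.2, 6.0, 2.5),
  pH               = c(2.9, 2.8, 3.1, 2.95)
)

# Each condition yields a logical vector over all rows; & combines them.
sub <- dt[`fixed acidity` < 8 & `residual sugar` < 5 & pH < 3]

nrow(sub)   # 2: only the first and last rows pass all three conditions
```

Written this way, every inequality is evaluated over the full column, whether or not a key is present. A keyed lookup matches exact key values, which is worth keeping in mind when interpreting the timings.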

Exercise 7
Take a bootstrap sample (i.e., with replacement) of the full df data.table without keys, and record the run time. Then convert df to a regular data frame and repeat. What is the difference in speed? Is there any (speed) benefit in creating a new variable id equal to the row number, creating a key for this variable, and using this key to select the bootstrap sample?
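
The three variants of row sampling mentioned here can be sketched on a small synthetic table. The table, the seed and all object names are assumptions for illustration; on a table this small the timings are negligible, so the sketch mainly shows the syntax.

```r
library(data.table)

set.seed(42)
dt  <- data.table(id = 1:1000, x = rnorm(1000))

# Row numbers drawn with replacement, reused by all three variants.
idx <- sample(nrow(dt), replace = TRUE)

# 1. Bootstrap on the data.table by row number.
t_dt <- system.time(boot_dt <- dt[idx])

# 2. The same selection on a plain data frame (note the trailing comma).
df_plain <- as.data.frame(dt)
t_df <- system.time(boot_df <- df_plain[idx, ])

# 3. Keyed variant: look the sampled ids up through the key on id.
setkey(dt, id)
t_key <- system.time(boot_key <- dt[.(idx)])
```

Using the same idx for every variant means the three results contain identical rows, so any difference in run time comes from the selection mechanism alone.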
