# Data table exercises: keys and subsetting

March 21, 2016
(This article was first published on R-exercises, and kindly contributed to R-bloggers)

The data.table package is a popular R package that facilitates fast selections, aggregations and joins on large data sets. It is well documented through several vignettes, and even has its own interactive course, offered by DataCamp. For those who want to build some mileage practising the use of data.table, there’s good news! In the coming weeks, we’ll dive into the package with several exercise sets. We’ll start with the first set today, focusing on creating data.tables, defining keys and subsetting. Before proceeding, make sure you have installed the data.table package from CRAN and studied the vignettes.

Answers to the exercises are available here. For the other (upcoming) exercise sets on data.table, check back next week. If there are any particular topics or problems related to data.table that you’d like to see included in subsequent exercise sets, please post a comment below.

Exercise 1
Setup: Read the wine quality dataset from the UCI repository as a data.table (available for download from: http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv) into an object named `df`. To demonstrate the speed of data.table, we’re going to make this dataset much bigger, with:
```
df
```
Check that the resulting data.table has 4.8 mln. rows and 12 variables.
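The enlargement command above survives only as a fragment. As a minimal sketch of one way to enlarge a table, the following repeats every row by indexing with a repeated vector of row numbers. It runs on a small toy table rather than the real download; the names `toy` and `big`, the toy values, and the repetition factor of 1000 are assumptions, not taken from the original post.

```r
library(data.table)

# Toy stand-in for the wine data (hypothetical values, not the real dataset).
toy <- data.table(quality = c(5L, 6L, 7L, 9L),
                  pH      = c(3.0, 3.2, 2.9, 3.3))

# Repeat the whole table 1000 times by indexing with repeated row numbers.
# The factor 1000 is an assumption made for this illustration.
big <- toy[rep(seq_len(nrow(toy)), 1000)]

# Number of rows and columns of the enlarged table.
dim(big)
```

With the real file, `fread()` on the URL given in the exercise should return a data.table directly, and the same indexing idea can then be applied to `df`.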

Exercise 2
Check if `df` contains any keys. If no keys are present, create a key for the `quality` variable. Confirm that the key has been set.
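As an illustration of the functions involved, here is a small sketch on a toy table (the table `toy` and its values are made up for this example). It uses `haskey()` and `key()` to inspect keys and `setkey()` to create one.

```r
library(data.table)

# Toy table with made-up values; quality is deliberately unsorted.
toy <- data.table(quality = c(7L, 5L, 9L, 6L),
                  pH      = c(3.0, 3.2, 2.9, 3.3))

haskey(toy)           # FALSE: no key has been set yet
key(toy)              # NULL

# setkey() sorts the table by the key column, by reference (no copy is made).
setkey(toy, quality)

haskey(toy)           # TRUE
key(toy)              # "quality"
```

Note that `setkey()` reorders the rows of the table in place, so the original row order is not preserved.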

Exercise 3
Create a new data.table `df2`, containing the subset of `df` with quality equal to 9.
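One way to subset on a keyed column is to pass the value wrapped in `.()` as the first argument, which lets data.table look up the rows by binary search on the key. A minimal sketch on a toy table (names and values are hypothetical):

```r
library(data.table)

# Toy table with made-up values.
toy <- data.table(quality = c(7L, 9L, 5L, 9L),
                  pH      = c(3.0, 3.2, 2.9, 3.3))
setkey(toy, quality)

# Keyed subset: all rows whose key column equals 9.
sub9 <- toy[.(9L)]
sub9
```

The value is written as `9L` so that its type matches the integer key column.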

Exercise 4
Remove the key from `df`, and repeat Exercise 3. How much slower is this?
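A sketch of how such a comparison could be timed, using `setkey(x, NULL)` to drop the key and `system.time()` to measure each subset. The table `toy` is randomly generated for illustration, and actual timings depend on the machine and the data.table version.

```r
library(data.table)

set.seed(1)
# Randomly generated toy data: one million integer quality scores from 3 to 9.
toy <- data.table(quality = sample(3:9, 1e6, replace = TRUE))

# Keyed subset (binary search on the key).
setkey(toy, quality)
t_key <- system.time(a <- toy[.(9L)])

# Remove the key, then subset with an ordinary logical condition.
setkey(toy, NULL)
t_scan <- system.time(b <- toy[quality == 9L])

# Elapsed seconds for both approaches.
c(keyed = t_key[["elapsed"]], scan = t_scan[["elapsed"]])
```

Depending on the data.table version, an automatically created index may speed up repeated `==` subsets even without a key, so the measured difference can be smaller than expected.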

Exercise 5
Create a new data.table `df2`, containing the subset of `df` with quality equal to 7, 8 or 9. Do this first without setting keys, then with keys, and compare the run-times.
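The two approaches could look roughly as follows on a toy table (hypothetical values): `%in%` for the unkeyed version, and a keyed lookup with several values for the keyed one. The argument `nomatch = 0L` drops lookup values that have no matching rows instead of returning rows filled with `NA`.

```r
library(data.table)

# Toy table with made-up values.
toy <- data.table(quality = c(7L, 9L, 5L, 8L, 6L, 7L))

# Without a key: logical vector scan.
by_scan <- toy[quality %in% 7:9]

# With a key: look up the three values by binary search.
setkey(toy, quality)
by_key <- toy[.(7:9), nomatch = 0L]

nrow(by_scan)
nrow(by_key)
```

Both versions select the same rows; only the row order may differ, because `setkey()` sorts the table.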

Exercise 6
Create a new data.table `df3` containing the subset of observations from `df` with fixed acidity < 8, residual sugar < 5 and pH < 3. Do this first without setting keys, then with keys, and compare the run-times. Explain why the differences are small.
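A sketch of the subsetting syntax on a toy table (made-up values). Column names that contain spaces need backticks inside `[ ]`. A likely explanation for the small difference is that keys help with equality lookups via binary search, while range conditions with `<` written this way are evaluated as an ordinary vector scan, with or without a key.

```r
library(data.table)

# Toy table with made-up values; names with spaces mimic the wine data headers.
toy <- data.table(`fixed acidity`  = c(7.0, 8.1, 6.3, 7.2),
                  `residual sugar` = c(20.7, 1.6, 4.9, 8.5),
                  pH               = c(3.00, 3.30, 2.95, 3.19))

# Without keys: combine the three range conditions with &.
sub_scan <- toy[`fixed acidity` < 8 & `residual sugar` < 5 & pH < 3]

# With keys on all three columns: the same expression is still a vector scan.
setkeyv(toy, c("fixed acidity", "residual sugar", "pH"))
sub_key <- toy[`fixed acidity` < 8 & `residual sugar` < 5 & pH < 3]

sub_scan
```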

Exercise 7
Take a bootstrap sample (i.e., with replacement) of the full `df` data.table without keys, and record the run-time. Then convert it to a regular data frame, and repeat. What is the difference in speed? Is there any (speed) benefit in creating a new variable `id` equal to the row number, creating a key for this variable, and using this key to select the bootstrap sample?
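One possible way to set up the three variants, sketched on a small randomly generated table (`toy`, `idx` and the other names are hypothetical). The bootstrap is drawn once as a vector of row numbers, so all three approaches select exactly the same rows and only the selection mechanism differs.

```r
library(data.table)

set.seed(42)
# Randomly generated toy data with an id column equal to the row number.
toy <- data.table(id = 1:1000,
                  quality = sample(3:9, 1000, replace = TRUE))

# Row numbers drawn with replacement: the bootstrap sample.
idx <- sample(nrow(toy), replace = TRUE)

# 1. data.table without keys: index rows directly.
t_dt <- system.time(boot_dt <- toy[idx])

# 2. Regular data frame: same indexing, note the comma.
toy_df <- as.data.frame(toy)
t_df <- system.time(boot_df <- toy_df[idx, ])

# 3. Keyed on id: look the sampled ids up through the key.
setkey(toy, id)
t_key <- system.time(boot_key <- toy[.(idx)])

# Elapsed seconds for the three approaches.
c(dt = t_dt[["elapsed"]], df = t_df[["elapsed"]], key = t_key[["elapsed"]])
```

On a table this small the timings are close to zero; the comparison only becomes meaningful on the enlarged `df`.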
