Working with Statistics Canada Data in R, Part 1: What is CANSIM?

[This article was first published on Data Enthusiast's Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

CANSIM and cansim

CANSIM stands for Canadian Socioeconomic Information Management System. Not long ago it was renamed to “Statistics Canada Data,” but here I’ll be using the repository’s legacy name to avoid confusion with other kinds of data available from Statistics Canada. At the time of writing this, there were over 8,700 data tables in CANSIM, with new tables being added virtually every day. It is the most current source of publicly available socioeconomic data collected by the Government of Canada.

Although CANSIM can be directly accessed online, a much more convenient and reproducible way to access it, is through Census Mapper API using the excellent cansim package developed by Dmitry Shkolnik and Jens von Bergmann. This package allows to search for relevant data tables quickly and precisely, and to read data directly into R without the intermediate steps of manually downloading and unpacking multiple archives, followed by reading each dataset separately into statistical software.

To install and load cansim, run:

install.packages("cansim")
library("cansim")

Let’s also install and load tidyverse, as we’ll be using it a lot:

install.packages("tidyverse")
library("tidyverse")

What to Expect

So why do we have to use tidyverse in addition to cansim (apart from the fact that tidyverse is probably the best data processing software in existence)? Well, data in Statistics Canada CANSIM repository has certain bugs features that one needs to be aware of, and which can best be fixed with the instruments included in tidyverse:

First, it is not always easy to find and retrieve the data manually. After all, you’ll have to search through thousands of data tables looking for the data you need. Online search tool often produces results many pages long, which are not well sorted by relevance (or at least that is my impression).

Second, StatCan data is not in the tidy format, and usually needs to be transformed as such, which is important for convenience and error prevention, as well as for plotting data.

Third, don’t expect the main principles of organizing data into datasets to be observed. For example, multiple variables can be presented as values in the same column of a dataset, with corresponding values stored in the other column, instead of each variable assigned to its own column with values stored in that column (as it should be). If this sounds confusing, the code snippet below will give you a clear example.

Next, the datasets contain multiple irrelevant variables (columns), that result form how Statistics Canada processes and structures data, and do not have much substantive information.

Finally, CANSIM datasets contain a lot of text, i.e. strings, which often are unnecessarily long and cumbersome, and are sometimes prone to typos. Moreover, numeric variables are often incorrectly stored as class “character”.

If you’d like an example of a dataset exhibiting most of these issues, let’s looks at the responses from unemployed Aboriginal Canadians about why they experience difficulties in finding a job. To reduce the size of the dataset, let’s limit it to one province. Run line-by-line:

# Get data
jobdif_0014 <- get_cansim("41-10-0014") %>% 
  filter(GEO == "Saskatchewan")

# Examine the dataset - lots of redundant variables.
glimpse(jobdif_0014, width = 120)

# All variables except VALUE are of class "character",
map(jobdif_0014, class)

# although some contain numbers, not text.
map(jobdif_0014, head)

# Column "Statistics" contains variables’ names instead of values,
unique(jobdif_0014$Statistics)

# while corresponding values are in a totally different column.
head(jobdif_0014$VALUE, 10)

Before You Start

This is an optional step. You can skip it, although if you are planning to use cansim often, you probably shouldn’t.

Before you start working with the package, cansim authors recommend to set up cansim cache to substantially increase the speed of repeated data retrieval from StatCan repositories and to minimize web scraping. cansim cache is not persistent between sessions, so do this either at the start of each session, or set the cache path permanently in your Rprofile file (more on editing Rprofile in Part 4 of these series).

Run (or add to Rprofile):

options(cansim.cache_path = "your cache path")

If your code is going to be executed on different machines, keep in mind that Linux and Windows paths to cache will not be the same:

# Linux path example:
options(cansim.cache_path = "/home/username/Documents/R/.cansim_cache")

# Windows path example:
options(cansim.cache_path = "C:\Users\username\Documents\R\cansim_cache")

To leave a comment for the author, please follow the link and comment on their blog: Data Enthusiast's Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)