Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

## The quest for income microdata

For a separate project, I’ve been looking for source data on income and wealth inequality. Not aggregate data like Gini coefficients or the percentage of income earned by the bottom 20% or top 1%, but the sources used to calculate those things. Because it’s sensitve personal financial data either from surveys or tax data, it’s not easy to get hold of, particularly if you want to do it in the comfort of your own home and don’t have time / motivation to fill in application forms. So far, what I’ve got is a simulated unit record file (SURF) from the New Zealand Income Survey, published by Statistics New Zealand for educational and instructional purposes. It’s a simplified, simulated version of the original survey and it lets me make graphics like this one:

This plot shows the density of income from all sources for four of the most prominent ethnicity types in New Zealand. New Zealand official statistics allow people to identify with more than one ethnicity, which means there is some double counting (more on that below). Three things leap out at me from this chart:

1. The density curves of the different ethnic groups are very similar visually
2. People of Maori and Pacific Peoples ethnicity have proportionately more \$300-\$400 per week earners than Europeans and Asians, leading to an overall noticeable lean to the lower numbers
3. Weekly income is bimodal, bunching up at \$345 per week and \$820 per week. Actually, this image is misleading in that respect; in reality it is trimodal, with a huge bunch of people with zero income (and some negative), who aren’t shown on this plot because of the logarithmic scale. So you could say that, for New Zealanders getting any income at all, there is a bimodal distribution.

Where’s that bimodal distribution coming from? The obvious candidate is part time workers, and this is seen when we plot income versus hours worked in the next image:

(That interesting diagonal line below which there are very few points is the minimum wage per hour)

Statistics New Zealand warn that while this simulated data is an accurate representation of the actual New Zealand population, it shouldn’t be used for precise statistics, so for now I won’t draw particularly strong conclusions on anything. A simulated unit record file is a great way of solving confidentiality purposes, but in the end it has been created by a statistical model. There’s a risk that interesting inferences might be just picking up something implicit in the model that wasn’t taken into account when it was first put together. That’s not likely to be the case for the basic distribution of the income values, but we’ll note the exploratory finding for now and move on.

A box plot is better for examining the full range of reported ethnicities but not so good for picking up the bi-modal subtlety. It should also be noted that both these graphs delete people who made losses (negative income) in a given week – the data show income from all sources:

Here’s how I drew the graphs, and estimated the two modes. I think I’ll want to re-use this data quite often, so it’s worth while putting in a database that’s accessible from different projects without having to move a bunch of data files around in Windows Explorer. The way Statistics New Zealand have released the file, with codings rather than values for dimensions like ethnicity and region, also makes a database a good way to make the data analysis ready.

After importing the data, the first significant job is to deal with that pesky ethnicity variable. In the released version of the data respondents with two ethnicities have both code numbers joined together eg 12 means both European (1) and Maori (2). To get around this I split the data into two fact tables, one with a single row per respondent with most of the data; and a second just for ethnicity with either one or two rows for each respondent. Here’s how I do that with a combination of {dplyr} and {tidyr}:

```library(dplyr)
library(RODBC)
library(tidyr)
library(mbie) # for AskCreds, which is alternatively directly available
# https://github.com/nz-mbie/mbie-r-package/blob/master/pkg/R/Creds.R

# imports, clean up, and save to database the data from
# http://www.stats.govt.nz/tools_and_services/microdata-access/nzis-2011-cart-surf.aspx

url <- "http://www.stats.govt.nz/~/media/Statistics/services/microdata-access/nzis11-cart-surf/nzis11-cart-surf.csv"

#-----------------fact tables------------------
# Create a main table with a primary key

mutate(survey_id = 1:nrow(nzis))

# we need to normalise the multiple ethnicities, currently concatenated into a single variable
cat("max number of ethnicities is", max(nchar(nzis\$ethnicity)), "n")

select(ethnicity, survey_id) %>%
mutate(First = substring(ethnicity, 1, 1),
Second = substring(ethnicity, 2, 2)) %>%
select(-ethnicity) %>%
gather(ethnicity_type, ethnicity_id, -survey_id) %>%
filter(ethnicity_id != "")

# drop the original messy ethnicity variable and tidy up names on main header
select(-ethnicity) %>%
rename(region_id = lgr,
sex_id = sex,
agegrp_id = agegrp,
qualification_id = qualification,
occupation_id = occupation)```

Second step is to re-create the dimension tables that turn the codes (eg 1 and 2) into meaningful values (European and Maori). Statistics New Zealand provide these, but unfortunately in an Excel workbook that’s easier for humans than computers to link up to the data. There’s not too many of so it’s easy enough to code them by hand, which the next set of code does:

```#-----------------dimension tables------------------
# all drawn from the data dictionary available at the first link given above
d_sex <- data_frame(sex_id = 1:2, sex = c("male", "female"))

d_agegrp <- data_frame(
agegrp_id = seq(from = 15, to = 65)) %>%
mutate(agegrp = ifelse(agegrp_id == 65, "65+", paste0(agegrp_id, "-", agegrp_id + 4)))

d_ethnicity <- data_frame(ethnicity_id = c(1,2,3,4,5,6,9),
ethnicity = c(
"European",
"Maori",
"Pacific Peoples",
"Asian",
"Middle Eastern/Latin American/African",
"Other Ethnicity",
"Residual Categories"))

d_occupation <- data_frame(occupation_id = 1:10,
occupation = c(
"Managers",
"Professionals",
"Community and Personal Service Workers",
"Sales Workers",
"Machinery Operators and Drivers",
"Labourers",
"Residual Categories",
"No occupation"
))

d_qualification <- data_frame(qualification_id = 1:5,
qualification = c(
"None",
"School",
"Bachelor or Higher",
"Other"
))

d_region <- data_frame(region_id =1:12,
region = c("Northland", "Auckland", "Waikato", "Bay of Plenty", "Gisborne / Hawke's Bay",
"Taranaki", "Manawatu-Wanganui", "Wellington",
"Nelson/Tasman/Marlborough/West Coast", "Canterbury", "Otago", "Southland"))```

The final step in the data cleaning is to save all of our tables to a database, create some indexes so they work nice and fast, and join them up in an analysis-ready view. In the below I use an ODBC (open database connectivity) connection to a MySQL server called "PlayPen". R plays nicely with databases; set it up and forget about it.

```#---------------save to database---------------
creds <- AskCreds("Credentials for someone who can create databases")

PlayPen <- odbcConnect("PlayPen_prod", uid = creds\$uid, pwd = creds\$pwd)
try(sqlQuery(PlayPen, "create database nzis11") )
sqlQuery(PlayPen, "use nzis11")

# fact tables.  These take a long time to load up with sqlSave (which adds one row at a time)
# but it's easier (quick and dirty) than creating a table and doing a bulk upload from a temp
# file.  Any bigger than this you'd want to bulk upload though - took 20 minutes or more.
sqlSave(PlayPen, f_ethnicity, addPK = TRUE, rownames = FALSE)
# add a primary key on the fly in this case.  All other tables
# have their own already created by R.

# dimension tables
sqlSave(PlayPen, d_sex, addPK = FALSE, rownames = FALSE)
sqlSave(PlayPen, d_agegrp, addPK = FALSE, rownames = FALSE)
sqlSave(PlayPen, d_ethnicity, addPK = FALSE, rownames = FALSE)
sqlSave(PlayPen, d_occupation, addPK = FALSE, rownames = FALSE)
sqlSave(PlayPen, d_qualification, addPK = FALSE, rownames = FALSE)
sqlSave(PlayPen, d_region, addPK = FALSE, rownames = FALSE)

#----------------indexing----------------------

sqlQuery(PlayPen, "ALTER TABLE d_sex ADD PRIMARY KEY(sex_id)")
sqlQuery(PlayPen, "ALTER TABLE d_agegrp ADD PRIMARY KEY(agegrp_id)")
sqlQuery(PlayPen, "ALTER TABLE d_ethnicity ADD PRIMARY KEY(ethnicity_id)")
sqlQuery(PlayPen, "ALTER TABLE d_occupation ADD PRIMARY KEY(occupation_id)")
sqlQuery(PlayPen, "ALTER TABLE d_qualification ADD PRIMARY KEY(qualification_id)")
sqlQuery(PlayPen, "ALTER TABLE d_region ADD PRIMARY KEY(region_id)")

# In Oracle we'd use a materialized view, which MySQL can't do.  But
# the below is fast enough anyway:

sql1 <-
"CREATE VIEW vw_mainheader AS SELECT sex, agegrp, occupation, qualification, region, hours, income FROM
d_sex b          on a.sex_id = b.sex_id JOIN
d_agegrp c       on a.agegrp_id = c.agegrp_id JOIN
d_occupation e   on a.occupation_id = e.occupation_id JOIN
d_qualification f on a.qualification_id = f.qualification_id JOIN
d_region g       on a.region_id = g.region_id"

sqlQuery(PlayPen, sql1)```

## Average weekly income in NZ 2011 by various slices and dices

Whew, that's out of the way. Next post that I use this data I can go straight to the database. We're now in a position to check our data matches the summary totals provided by Statistics New Zealand. Statistics New Zealand say this SURF can be treated as a simple random sample, which means each point can get an identical individual weight, which we can estimate from the summary tables in their data dictionary. Each person in the sample represents 1,174 in the population (in the below I have population figures in thousands, to match the Statistics New Zealand summaries.

Statistics New Zealand doesn't provide region and occupation summary statistics, and the qualification summaries they provide use a more detailed classification than is in the actual SURF. But for the other categories - sex, age group, and the trick ethnicity - my results match theirs, so I know I haven't munched the data.

sex Mean Sample Population
female 611 15217 1787.10
male 779 14254 1674.00
```tab1 <- sqlQuery(PlayPen,
"SELECT
sex,
ROUND(AVG(income))          as Mean,
COUNT(1)                    as Sample,
ROUND(COUNT(1) * .11744, 1) as Population
GROUP BY sex")```
agegrp Mean Sample Population
15-19 198 2632 309.10
20-24 567 2739 321.70
25-29 715 2564 301.10
30-34 796 2349 275.90
35-39 899 2442 286.80
40-44 883 2625 308.30
45-49 871 2745 322.40
50-54 911 2522 296.20
55-59 844 2140 251.30
60-64 816 1994 234.20
65+ 421 4719 554.20
```tab2 <- sqlQuery(PlayPen,
"SELECT
agegrp,
ROUND(AVG(income))          as Mean,
COUNT(1)                    as Sample,
ROUND(COUNT(1) * .11744, 1) as Population
GROUP BY agegrp")```
qualification Mean Sample Population
School 564 7064 829.60
None 565 6891 809.30
Other 725 1858 218.20
Bachelor or Higher 955 5223 613.40
```# qualification summary in data dictionary uses a different classification to that in data
tab3 <- sqlQuery(PlayPen,
"SELECT
qualification,
ROUND(AVG(income))          as Mean,
COUNT(1)                    as Sample,
ROUND(COUNT(1) * .11744, 1) as Population
GROUP BY qualification
ORDER BY Mean")```
occupation Mean Sample Population
No occupation 257 10617 1246.90
Labourers 705 2154 253.00
Residual Categories 726 24 2.80
Community and Personal Service Workers 745 1734 203.60
Sales Workers 800 1688 198.20
Clerical and Adminsitrative Workers 811 2126 249.70
Technicians and Trades Workers 886 2377 279.20
Machinery Operators and Drivers 917 1049 123.20
Professionals 1105 4540 533.20
Managers 1164 3162 371.30
```# occupation summary not given in data dictionary
tab4 <- sqlQuery(PlayPen,
"SELECT
occupation,
ROUND(AVG(income))          as Mean,
COUNT(1)                    as Sample,
ROUND(COUNT(1) * .11744, 1) as Population
GROUP BY occupation
ORDER BY Mean")```
region Mean Sample Population
Bay of Plenty 620 1701 199.80
Taranaki 634 728 85.50
Waikato 648 2619 307.60
Southland 648 637 74.80
Manawatu-Wanganui 656 1564 183.70
Northland 667 1095 128.60
Nelson/Tasman/Marlborough/West Coast 680 1253 147.20
Otago 686 1556 182.70
Gisborne / Hawke's Bay 693 1418 166.50
Canterbury 701 4373 513.60
Auckland 720 9063 1064.40
Wellington 729 3464 406.80
```# region summary not given in data dictionary
tab5 <- sqlQuery(PlayPen,
"SELECT
region,
ROUND(AVG(income))          as Mean,
COUNT(1)                    as Sample,
ROUND(COUNT(1) * .11744, 1) as Population
GROUP BY region
ORDER BY Mean ")```
ethnicity Mean Sample Population
Residual Categories 555 85 10.00
Maori 590 3652 428.90
Pacific Peoples 606 1566 183.90
Middle Eastern/Latin American/African 658 343 40.30
Asian 678 3110 365.20
European 706 22011 2585.00
Other Ethnicity 756 611 71.80
```tab6 <- sqlQuery(PlayPen,
"SELECT
ethnicity,
ROUND(AVG(income))          as Mean,
COUNT(1)                    as Sample,
ROUND(COUNT(1) * .11744, 1) as Population
JOIN f_ethnicity e ON m.survey_id = e.survey_id
JOIN d_ethnicity d ON e.ethnicity_id = d.ethnicity_id
GROUP BY ethnicity
ORDER BY Mean")```

## Graphics showing density of income

Finally, here's the code that drew the charts we started with, showing the distribution of the weekly income New Zealanders of different ethnicity.

```library(showtext)
library(RODBC)
library(ggplot2)
library(dplyr)
showtext.auto()

dtf <- sqlQuery(PlayPen,
"SELECT
ethnicity,
income
JOIN f_ethnicity e ON m.survey_id = e.survey_id
JOIN d_ethnicity d ON e.ethnicity_id = d.ethnicity_id")

# density plot of income by density
dtf %>%
filter(ethnicity %in% c("Asian", "European", "Maori", "Pacific Peoples")) %>%
ggplot(aes(x = income, colour = ethnicity)) +
geom_density(size = 1.1) +
scale_x_log10("Weekly income from all sources", label = dollar, breaks = c(10, 100, 345, 825, 10000)) +
theme_minimal(base_family = "myfont") +
theme(legend.position = "bottom") +
scale_colour_brewer("", palette = "Set1")

# boxplot of income by ethnicity
dtf %>%
ggplot(aes(y = income, x = ethnicity, colour = ethnicity)) +
geom_boxplot() +
geom_rug() +
scale_y_log10("Weekly income from all sources", label = dollar, breaks = c(10, 100, 1000, 10000)) +
theme_minimal() +
labs(x = "") +
coord_flip() +
scale_colour_brewer(palette = "Set2") +
theme(legend.position = "none")

# scatter plot of joint density of hours and income
dtf2 <- sqlQuery(PlayPen, "select hours, income from vw_mainheader")
ggplot(dtf2, aes(x = hours, y = income)) +
geom_jitter(alpha = 0.05) +
scale_x_log10("Hours worked", breaks = c(1, 10, 20, 40, 80)) +
scale_y_log10("Weekly income from all sources", label = dollar, breaks = c(10, 100, 345, 825, 10000)) +
theme_minimal(base_family = "myfont")

# how did I choose to mark \$345 and \$825 on the scale?
# ripped off (and improved) from http://stackoverflow.com/questions/27418461/calculate-the-modes-in-a-multimodal-distribution-in-r
find_modes<- function(data, ...) {
dens <- density(data, ...)
y <- dens\$y
modes <- NULL
for ( i in 2:(length(y) - 1) ){
if ( (y[i] > y[i - 1]) & (y[i] > y[i + 1]) ) {
modes <- c(modes,i)
}
}
if ( length(modes) == 0 ) {
modes = 'This is a monotonic distribution'
}
return(dens\$x[modes])
}

x <- dtf\$income
x[x < 1] <- 1 # not interested in negative income just now

# where are those modes?
exp(find_modes(log(x)))```

Edited 18 August 2015 to add the scatter plot of the joint density of hours and income.