Weighted versus unweighted percentiles by @ellis2013nz
This interesting paper came out recently: A test of the predictive validity of relative versus absolute income for self-reported health and well-being in the United States, by David Brady, Michaela Curran and Richard Carpiano. It uses a large sample of longitudinal data to exploit between-individual and within-individual (across time) variation in absolute and relative income to explore which one is more associated with well-being. Seems that in the USA, relative income is more important.
“Relative income” here is defined as your position (percentile rank) in the national distribution of incomes in a particular year. The average relative income should be 50 by definition. I noticed that the mean percentile reported in table A.1 was 57, not 50, which led to this slightly confusing (for me, anyway) exchange on Twitter. It sparked my interest in a broader question, not really related to the paper itself – what is the relationship between weighted and unweighted percentile rank when using survey data?
Weighted percentile rank
I was surprised to find (well, actually I had discovered this before, so really should say "to be reminded") that there doesn't seem to be a weighted version of `percent_rank()` in dplyr or other R packages. There are several packages with functions that will estimate quantiles from weighted data, but if you want to then turn the original vector of continuous values into percentiles, you need to take those quantiles and `cut()` the original variable using them as breaks.
I started by making my own convenience function to do this. To calculate the breaks, I am using the `wtd.quantile()` function from Mark Handcock's `reldist` package, which was notably faster than its competition (more on this later).
Post continues after R code
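The original code isn't reproduced here, but below is a minimal sketch of the approach described. The function name `wt_percent_rank`, its arguments and the boundary handling are my reconstruction and may differ in detail from the original; it assumes continuous data with no tied values.

```r
# A reconstruction sketch, not the original post's code. Assumes continuous
# data with no ties; names and defaults are my own.
library(dplyr)
library(reldist)

wt_percent_rank <- function(x, wt = rep(1, length(x))) {
  n <- length(x)
  # weighted quantiles at n evenly spaced probabilities, used as breaks:
  breaks <- reldist::wtd.quantile(x, q = seq(0, 1, length.out = n), weight = wt)
  # cut() the original values at those breaks; labels = FALSE is much faster
  # and returns the integer bin directly:
  bins <- cut(x, breaks = c(-Inf, breaks), labels = FALSE)
  # rescale to [0, 1] so unweighted data matches dplyr::percent_rank():
  (bins - 1) / (n - 1)
}

# check against dplyr::percent_rank() with unweighted data:
set.seed(42)
x <- rlnorm(1000)
tibble(homemade = wt_percent_rank(x), dplyr = percent_rank(x)) |>
  mutate(difference = homemade - dplyr)
```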
That little test of whether I get the same result as `dplyr::percent_rank()` with unweighted data turns out ok:
```
   homemade dplyr difference
      <dbl> <dbl>      <dbl>
 1    0.189 0.189          0
 2    0.557 0.557          0
 3    0.869 0.869          0
 4    0.944 0.944          0
 5    0.704 0.704          0
 6    0.448 0.448          0
 7    0.411 0.411          0
 8    0.121 0.121          0
 9    0.584 0.584          0
10    0.307 0.307          0
# ℹ 990 more rows
```
Simulating a population and drawing a sample
So, to actually explore the difference between weighted and unweighted percentiles, I wanted to simulate a complex survey with unequal probabilities of observations being in the sample, and a skewed variable (income is a good example) that was also related, directly or indirectly, to the probability of an observation being in the sample. I did this by creating an education variable with 7 different levels, and said:
- the higher the education the higher the mean income
- the higher the education, the lower the chance of being sampled (or, which is effectively the same for my purpose, the lower the chance of completing a response to the survey)
Once I have drawn the sample, I am going to use standard post-stratification methods to calculate weights so that the points in the sample, between them, fully represent the original population.
Whether or not this sampling or non-response effect is realistic doesn’t matter much. Obviously in reality things will be much more complicated. I just wanted something where survey weights were going to vary in a known way (that could be inferred from the data) that correlates with the variable of interest.
Actual income is simulated as coming from a log-normal distribution.
Post continues after R code
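Here's a sketch of that simulation. All the specific numbers (population size, coefficients, sampling probabilities) are my illustrative assumptions, chosen only to be roughly consistent with the summary tables later in the post.

```r
# Illustrative reconstruction of the simulated population; every parameter
# value here is an assumption, not the original post's numbers.
library(dplyr)

set.seed(123)
N <- 1e7  # population size, roughly consistent with the weight sums later

population <- tibble(educ = sample(1:7, N, replace = TRUE)) |>
  mutate(
    # higher education, higher mean income; income itself is log-normal:
    income = rlnorm(N, meanlog = 9.4 + 0.2 * educ, sdlog = 0.5),
    # higher education, lower chance of ending up in the sample:
    prob_sampled = 0.0016 - 0.00015 * educ
  )

# marginal population totals by education, for post-stratification later:
marginal_totals <- count(population, educ, name = "pop_total")

# mean income for the different education levels:
population |>
  group_by(educ) |>
  summarise(mean(income))
```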
This is what the mean income looks like for my different education levels:
```
   educ `mean(income)`
  <int>          <dbl>
1     1         16307.
2     2         19982.
3     3         24369.
4     4         29761.
5     5         36302.
6     6         44355.
7     7         54212.
```
OK, now I need to draw a sample. I originally did this with `dplyr::slice_sample()` but found it slow. The `strata()` function in Yves Tillé and Alina Matei's `sampling` package is about twice as fast (benchmarking later in this post), so I used that instead. The code below draws the sample, does the post-stratification weighting (by simply joining the sample to the marginal totals data frame calculated earlier and creating weights forced to add up to those marginal totals), and then calculates the percentile ranks, both weighted and unweighted.
BTW, the reason I was particularly attuned to speed in this overall exercise is that I had heard that Stata was very slow in calculating the weighted percentile ranks. But apart from drawing the sample, everything I was doing in R turned out to be very fast; even calculating percentile ranks on 200,000 observations grouped by year (so similar to the actual data in the original paper) only takes a few seconds.
Post continues after R code
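A sketch of those steps, building on the `population`, `marginal_totals` and `wt_percent_rank` objects from the earlier sketches; the per-stratum sample sizes are illustrative, not the original values.

```r
# Illustrative reconstruction of the sampling, weighting and percentile
# calculation, continuing from the earlier sketches.
library(sampling)

# strata() needs a data frame sorted by the stratification variable:
population <- as.data.frame(arrange(population, educ))

# per-stratum sample sizes, decreasing with education (illustrative values,
# roughly matching the summary table below):
sizes <- c(2070, 1870, 1700, 1445, 1215, 950, 750)

selected <- sampling::strata(population, stratanames = "educ",
                             size = sizes, method = "srswor")

survey_sample <- getdata(population, selected) |>
  # post-stratification: join to the marginal totals and force each
  # education group's weights to add up to its population total:
  left_join(marginal_totals, by = "educ") |>
  group_by(educ) |>
  mutate(wt = pop_total / n()) |>
  ungroup() |>
  # percentile ranks on a 0-100 scale, unweighted and weighted:
  mutate(uw_perc = percent_rank(income) * 100,
         w_perc  = wt_percent_rank(income, wt) * 100)
```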
Comparing the weighted and unweighted percentiles
OK, time to do the actual comparisons. Here’s a scatter plot of the weighted and unweighted percentiles. Note the very very high correlation – 0.999! – but also the visually clear slight curvature:
That curvature is more obvious if we plot the difference between the two estimates:
Why do we get this pattern? It’s simple enough if you consider three people – at the bottom, the top and the middle of the sample’s income distribution. For the person at the bottom, whether we are weighting the observations or not, we know that this person has the lowest observed income. They’re in the bottom percentile either way, and the weights of all the people above them simply don’t matter. This applies in reverse for the person at the top. So the weighted and unweighted percentiles are identical in these extremes.
For the person in the middle however it does matter. We look at the unweighted sample and say “right, you’re bang in the middle, you’re the 50th percentile”. But then we look at the weights and say “hold on, the half of the sample that has incomes below you has relatively low weights (because, in my simulation, people with lower education were more likely to be in the survey so get lower weights to compensate). So while it looks like you’ve got higher income than 50% of people, that 50% of the sample actually only represents 45% of the population.” So we get a material difference between weighted and unweighted percentile estimates in the middle, but not at the ends, of the distribution.
There's nothing inherent that says the weighted percentiles in the middle will always be lower than the unweighted ones, as in my simulation – that's only the case because the lower income people on average had lower weights. With other sampling designs or non-response rates, this could be reversed. But however the weights work out, there will be more discrepancy between the weighted and unweighted percentiles in the middle of the distribution than at the ends.
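To make this concrete, here's a toy example (mine, not from the original post): five incomes, with the lower incomes given lower weights as in the simulation. The direct weighted formula used here is just one reasonable analogue of `percent_rank()`.

```r
# Toy illustration (not from the original post): ends agree, middle diverges.
x  <- c(10, 20, 30, 40, 50)
wt <- c(1, 1, 1, 3, 3)  # lower incomes get lower weights

# unweighted percent rank:
dplyr::percent_rank(x)
#> [1] 0.00 0.25 0.50 0.75 1.00

# weighted analogue: share of the other observations' weight below each value:
sapply(seq_along(x), function(i) sum(wt[x < x[i]]) / (sum(wt) - wt[i]))
#> [1] 0.000 0.125 0.250 0.500 1.000
```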
Here’s the simple code for drawing those plots.
Post continues after R code
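A sketch of those plots, assuming ggplot2 and the `survey_sample` data frame from the earlier sketches; the original's exact aesthetics may differ.

```r
# Comparison plots; aesthetics are my choices, not the original post's.
library(ggplot2)

# weighted v unweighted percentile, with a 45-degree reference line:
ggplot(survey_sample, aes(x = uw_perc, y = w_perc)) +
  geom_point(alpha = 0.1) +
  geom_abline(slope = 1, intercept = 0, colour = "steelblue") +
  labs(x = "Unweighted percentile", y = "Weighted percentile")

# the curvature is clearer as a difference:
ggplot(survey_sample, aes(x = uw_perc, y = w_perc - uw_perc)) +
  geom_point(alpha = 0.1) +
  geom_hline(yintercept = 0, colour = "steelblue") +
  labs(x = "Unweighted percentile",
       y = "Weighted minus unweighted percentile")
```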
A couple of other interesting observations. The unweighted mean of the unweighted percentile is 50; and so is the weighted mean of the weighted percentile. But the unweighted mean of the weighted percentile is low (about 47) and the weighted mean of the unweighted percentile is high (about 53). This all makes sense based on the above reasoning.
Here are some summary stats on the sample:
| educ | mean weight | mean income | mean unweighted percentile | mean weighted percentile | n | sum of weights |
|---|---|---|---|---|---|---|
| 1 | 690.4403 | 16980.99 | 37.66090 | 34.70156 | 2067 | 1427140 |
| 2 | 762.0085 | 21147.84 | 42.89254 | 39.78324 | 1873 | 1427242 |
| 3 | 842.4838 | 25146.92 | 48.18424 | 45.01756 | 1697 | 1429695 |
| 4 | 989.3329 | 31281.51 | 54.45006 | 51.25665 | 1445 | 1429586 |
| 5 | 1175.7494 | 35873.40 | 58.81986 | 55.67827 | 1217 | 1430887 |
| 6 | 1500.4732 | 42648.17 | 62.77626 | 59.67482 | 951 | 1426950 |
| 7 | 1904.6667 | 50135.01 | 66.77896 | 63.82556 | 750 | 1428500 |
And here's the code to make those summary stats and compare the different means of the different percentiles.
Post continues after R code
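Something like this, continuing from the `survey_sample` sketch (the column names are my reconstruction):

```r
# Summary stats by education group:
survey_sample |>
  group_by(educ) |>
  summarise(mean(wt), mean(income), mean(uw_perc), mean(w_perc), n(), sum(wt))

# the four different means of the two percentile variables:
with(survey_sample, c(
  unweighted_mean_of_unweighted = mean(uw_perc),              # 50
  weighted_mean_of_weighted     = weighted.mean(w_perc, wt),  # 50
  unweighted_mean_of_weighted   = mean(w_perc),               # about 47
  weighted_mean_of_unweighted   = weighted.mean(uw_perc, wt)  # about 53
))
```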
Some speed benchmarking
Finally, here are some speed tests that were behind some of the design choices above. As it turns out, none of it matters very much. As mentioned above, R can calculate the weighted percentiles for hundreds of thousands of observations grouped by year in only a couple of seconds, so I didn't need to worry about that. But in case it matters, here are my core conclusions from running the benchmarking below:
- `reldist::wtd.quantile` is about twice as fast as `DescTools::Quantile` at calculating weighted quantiles, and gives near-identical results.
- Setting `labels = FALSE` in a call to `cut()` makes the function about four times as fast.
- `dplyr::slice_sample` takes about the same time to do a sample with unequal probability of selection as a home-made function using `base::sample`, but the `sampling::strata` function is about twice as fast.
- My final homemade function `wt_percent_rank` for weighted percentile ranks is about four times slower than the unweighted `dplyr::percent_rank`, but is still plenty fast enough (<5 seconds for 200,000 observations).
Benchmarking code:
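The original benchmarking code isn't reproduced here; below is a sketch of the sort of comparisons described. The choice of the microbenchmark package and the test data sizes are my assumptions.

```r
# A sketch of the benchmarking comparisons; data sizes and the choice of
# microbenchmark are my assumptions, not the original post's code.
library(microbenchmark)
library(DescTools)

x <- rlnorm(1e5)
wt <- runif(1e5, 0.5, 2)

# weighted quantiles: reldist v DescTools
microbenchmark(
  reldist   = reldist::wtd.quantile(x, q = 0:100 / 100, weight = wt),
  DescTools = DescTools::Quantile(x, weights = wt, probs = 0:100 / 100),
  times = 20
)

# cut() with and without labels:
breaks <- quantile(x, probs = seq(0, 1, length.out = 101))
microbenchmark(
  with_labels    = cut(x, breaks = breaks, include.lowest = TRUE),
  without_labels = cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE),
  times = 20
)

# weighted v unweighted percentile ranks:
microbenchmark(
  weighted   = wt_percent_rank(x, wt),
  unweighted = dplyr::percent_rank(x),
  times = 20
)
```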
That’s all, cheerio.
Versioning of this blog post
An earlier version of this post had some speculation about the original referenced paper without me adequately checking what was actually happening. As it's not relevant to my main point, that paragraph has been removed.