Scatter plots in survey sampling

[This article was first published on Data Literacy - The blog of Andrés Gutiérrez, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

You can find this post in Blogdown format by clicking here

When it comes to analyzing survey data, you have to take into account the stochastic structure of the sample that was selected to obtain the data. Plots and graphics should not be an exception. The main aim of such studies is to try to infer about how the behavior of the outcomes of interest in the finite population.

For example, you may want to know how many people are in poverty. How about counting the poor people in the sample? I know, it is not a good idea. You have to use the sampling weights to estimate the whole number of poor people in the population. The total error paradigm follows the same approach when estimating any parameter of the finite population: you have to use the sampling weights in order to obtain unbiased estimators.

The idea behind this paradigm is the representative principle. If a person $k$ is included in the sample with a sampling weight $w_k$, she/he represents to himself and $w_k-1$ remaining people. That way, you obtain design-unbiasedness.

I am not a big fan of using descriptive sample statistics to analyze survey data because it can mask the reality, and the fact is that person $k$ is included in the sample, but he/she is not alone. Behind person $k$ there are other people, not included in the sample, and you have to realize that.

So, let’s apply that principle to scatter plots. I am using the Lucy population from the TeachingSampling package to recreate my idea. The following code is used to draw a $\pi PS$ sample.

library(TeachingSampling)
data(Lucy)
# The inclusion probability is proportional to Income 
# The selected sample of size n=400 
n <- 400
set.seed(123)
res <- S.piPS(n, Lucy$Income)
sam <- res[,1]
# The sample is stored in an data.sample 
data.sample <- Lucy[sam, ]
attach(data.sample)

The sampling weights will be stored in the data.sample data in the column wk. They will be useful to reproduce our finite popultaion from the sample data.

# Pik.s is the inclusion probability of units in the sample
data.sample$Pik.s <- res[,2]
# wk is the sampling weight
data.sample$wk <- 1/data.sample$Pik.s

Now, let`s make a plot of the sampling data (with just 400 observations). I recall that this scenario is somehow misleading, because we want to know the behavior of the variables in the finite population.

library(ggplot2)
ggplot(data = data.sample, aes(x = Income, y = Employees)) +
geom_point()

The first option that comes to mind is to include the sampling weights in the points of the scatter plot. However, this approach is not appealing to me, because it is not straightforward from this plot to visuzlize the entire finite population.

ggplot(data = data.sample, aes(x = Income, y = Employees, size = wk)) + 
geom_point()

In order to make the finite population scatter plot from the survey sample, I will replicate the rows of the data.sample object as many times as the sampling weight wk. I am using the mefa::rep function to achieve this goal. So, the newLucy object is an intent to mimic the finite population by using the selected sample.

library(mefa)
newLucy <- NULL

for(i in 1:nrow(data.sample)){
newLucy <- rbind(newLucy,
rep(data.sample[i, ],
round(data.sample$wk[i])))
}

newLucy <- as.data.frame(newLucy)

Now, with the newLucy population, I will make a scatter plot. Now, as I am replicating the rows of the sample data, I will add a jitter to avoid overplotting of the points in the scatter plot. This way, this plot (with 2396 observations) looks as if it would come from the finite population.

ggplot(data = newLucy, aes(x = Income, y = Employees)) +
geom_point() + geom_jitter(width = 15, height = 15)

To leave a comment for the author, please follow the link and comment on their blog: Data Literacy - The blog of Andrés Gutiérrez.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)