This article was first published on **Data Literacy - The blog of Andrés Gutiérrez**, and kindly contributed to R-bloggers.

When it comes to analyzing survey data, you have to take the stochastic structure of the selected sample into account, and plots and graphics should be no exception. The main aim of such studies is to infer the behavior of the outcomes of interest in the finite population.

For example, you may want to know how many people are in poverty. How about counting the poor people in the sample? I know, it is not a good idea. You have to use the sampling weights to estimate the total number of poor people in the population. The total error paradigm follows the same approach when estimating any parameter of the finite population: you have to use the sampling weights in order to obtain unbiased estimators.

The idea behind this paradigm is the representativity principle: if a person $k$ is included in the sample with a sampling weight $w_k$, he or she represents himself or herself and $w_k - 1$ remaining people. That way, you obtain design-unbiasedness.
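As a minimal illustration of this principle (toy simulated data, not the Lucy population used below), the Horvitz-Thompson estimator weights each sampled value by $w_k = 1/\pi_k$, so the weighted sample total targets the population total:

```r
# Toy sketch of the representative principle (hypothetical data):
# each sampled unit stands in for wk units of the finite population.
set.seed(1)
N <- 1000                  # population size
y <- rpois(N, lambda = 5)  # study variable for every population unit

n <- 100
sam <- sample(N, n)        # simple random sample, so pik = n / N
wk <- rep(N / n, n)        # sampling weight = 1 / inclusion probability

# Horvitz-Thompson estimator of the population total
ht_total <- sum(wk * y[sam])

sum(y[sam])  # the raw sample total falls far short of the population total
ht_total     # the weighted estimate targets sum(y)
```

Counting only the sampled units grossly understates the population figure; weighting by $w_k$ corrects this by design.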

I am not a big fan of using plain descriptive sample statistics to analyze survey data, because they can mask the reality: person $k$ is included in the sample, but he or she is not alone. Behind person $k$ there are other people, not included in the sample, and you have to take them into account.

So, let’s apply that principle to scatter plots. I am using the `Lucy` population from the `TeachingSampling` package to illustrate the idea. The following code draws a $\pi PS$ sample.

```r
library(TeachingSampling)
data(Lucy)

# The inclusion probability is proportional to Income
# The selected sample is of size n = 400
n <- 400
set.seed(123)
res <- S.piPS(n, Lucy$Income)
sam <- res[, 1]

# The sample is stored in data.sample
data.sample <- Lucy[sam, ]
attach(data.sample)
```

The sampling weights will be stored in the `data.sample` data frame in the column `wk`. They will be useful to reproduce our finite population from the sample data.

```r
# Pik.s is the inclusion probability of the units in the sample
data.sample$Pik.s <- res[, 2]

# wk is the sampling weight
data.sample$wk <- 1 / data.sample$Pik.s
```
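A quick way to sanity-check weights like these: the sum of the sampling weights is itself an estimator of the population size $N$, since each sampled unit represents $w_k$ units (for `data.sample`, `sum(data.sample$wk)` should land close to `nrow(Lucy)`). A standalone sketch with toy data:

```r
# Standalone sketch (toy data, not Lucy): the sum of the sampling
# weights estimates the population size N.
set.seed(42)
N <- 500
x <- runif(N, 1, 10)            # auxiliary size variable, like Income

n <- 50
pik <- n * x / sum(x)           # target inclusion probabilities, prop. to x
sam <- sample(N, n, prob = x)   # rough pps-style draw, for illustration only
wk <- 1 / pik[sam]              # sampling weights

sum(wk)  # fluctuates around N = 500
```

Note that base `sample()` with `prob = x` only approximates the target $\pi_k$; `S.piPS` handles the exact $\pi PS$ design.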

Now, let’s make a plot of the sample data (with just 400 observations). Recall that this scenario is somewhat misleading, because we want to know the behavior of the variables in the finite population.

```r
library(ggplot2)

ggplot(data = data.sample, aes(x = Income, y = Employees)) +
  geom_point()
```

The first option that comes to mind is to map the sampling weights to the size of the points in the scatter plot. However, this approach is not appealing to me, because it is not straightforward to visualize the entire finite population from this plot.

```r
ggplot(data = data.sample, aes(x = Income, y = Employees, size = wk)) +
  geom_point()
```

In order to make the finite population scatter plot from the survey sample, I will replicate the rows of the `data.sample` object as many times as the sampling weight `wk` indicates. I am using the `mefa::rep` function to achieve this goal. So, the `newLucy` object is an attempt to mimic the finite population by using the selected sample.

```r
library(mefa)

# Replicate each sampled row round(wk) times
newLucy <- NULL
for (i in 1:nrow(data.sample)) {
  newLucy <- rbind(newLucy,
                   rep(data.sample[i, ],
                       round(data.sample$wk[i])))
}
newLucy <- as.data.frame(newLucy)
```
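As a side note, the same row replication can be done in base R without a loop or `mefa`, by repeating row indices and subsetting in one step. A sketch on a toy data frame standing in for `data.sample` (hypothetical values):

```r
# Vectorized row replication: repeat each row index round(wk) times,
# then subset once (toy data frame, hypothetical values).
df <- data.frame(Income = c(100, 250, 400), wk = c(2.2, 1.0, 3.6))
idx <- rep(seq_len(nrow(df)), times = round(df$wk))
expanded <- df[idx, ]
nrow(expanded)  # 2 + 1 + 4 = 7 replicated rows
```

This avoids the quadratic cost of growing `newLucy` with `rbind` inside the loop.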

Now, with the `newLucy` pseudo-population, I will make a scatter plot. As I am replicating the rows of the sample data, I will add a jitter to avoid overplotting of the points. This way, the plot (with 2396 observations) looks as if it came from the finite population.

```r
ggplot(data = newLucy, aes(x = Income, y = Employees)) +
  geom_point() +
  geom_jitter(width = 15, height = 15)
```
