Introducing the Kernelheaping Package

INWT-Blog-RBloggers

4 years ago

[This article was first published on INWT-Blog-RBloggers, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In this blog article I’d like to introduce the univariate kernel density estimation for heaped (i.e. rounded or interval censored) data with the Kernelheaping package.

It is not unusual to have interval censored data such as in income surveys due to anonymisation or simplification issues. However, a simple task like plotting a density may fail completely for rounded or interval censored data. When using class centers as input, the density estimate might get bumpy or spiky around the center points. This problem gets even worse for larger sample sizes with decreasing bandwidth. One might try to manually increase the bandwidth until one obtains a sufficiently smooth estimate. However, this will result in oversmoothing as we will see in our example. The Kernelheaping package, available on CRAN, delivers an algorithm to obtain nonparametric density estimates for interval censored data.

Example: Kernel density estimation for interval data without correction

In the following chunk I’ll simulate 5000 observations from a log-normal distribution. Then I plot the histogram together with a conventional kernel density estimate.

<pre class ="r"><code>library("ggplot2") library("dplyr") </code></pre>
<pre class ="r"><code>set.seed(0) dat &lt;- data.frame(x = rlnorm(5000, meanlog = 7.5, sdlog = 0.5)) ggplot(dat, aes(x, y = ..density..)) +   geom_histogram(position = "identity", color = "black", fill = "white") +   geom_density(fill = "blue", alpha = 0.7, linetype = 0) +   ylim(c(0, 0.00055)) +   xlim(c(0, 10000)) </code></pre>

Now let’s split our data into 14 classes imitating a typical income survey.

<pre class="r"><code>classes &lt;- c(0, 500, 1000, 1500, 2000, 2500, 3000, 4000, 5000, 6000, 8000,              10000, 15000, Inf) dat$xclass &lt;- cut(dat$x, breaks = classes)</code></pre>

If I now use the class centers as input, the result is a very unrealistic, spiky density.

<pre class="r"><code>classCenters &lt;- (classes[-1] - classes[-length(classes)]) /   2 + classes[-length(classes)] classCenters[length(classCenters)] &lt;- 2 * classCenters[length(classCenters) - 1] levels(dat$xclass) &lt;- classCenters dat$xclassCenter &lt;- as.numeric(as.character(dat$xclass)) ggplot(dat, aes(xclassCenter, y = ..density..)) +   geom_histogram(breaks = classes, color = "black", fill = "white") +   geom_density(fill = "blue", alpha = 0.7, linetype = 0) </code></pre>

I’ve got the option to adjust the density by manipulating the adjust parameter. But getting a nice and smooth density shape requires manipulation to such an extent that we end up with serious oversmoothing.

<pre class="r"><code>ggplot(dat, aes(xclassCenter, y = ..density..)) +   geom_histogram(breaks = classes, color = "black", fill = "white") +   geom_density(adjust = 2.5, fill = "blue", alpha = 0.7, linetype = 0) +   ylim(c(0, 0.00055)) +   xlim(c(0, 10000))</code></pre>

Kernel density estimation for interval data with the Kernelheaping package

Now, we install and load the Kernelheaping package. Its dclass function implements an iterative SEM (Stochastic Expectation Maximization) algorithm.

<pre class ="r"><code>install.packages("Kernelheaping") library("Kernelheaping")</code></pre>

It can be called with any bandwidth selector available to the density function in the stats package and also provides optional boundary correction for positive only data.

<pre class ="r"><code>densityEst &lt;- dclass(xclass = dat$xclass, classes = classes, burnin = 50,                      samples = 100, evalpoints = 1000)</code></pre>
<pre class ="r"><code>classes &lt;- data.frame(xclass = densityEst$xclass) estimates &lt;- data.frame(Mestimates = densityEst$Mestimates, gridx = densityEst$gridx) ggplot() +   geom_histogram(data = classes, aes(xclass, y = ..density..),                   breaks = densityEst$classes, color = "black", fill = "white") +   geom_area(data = estimates, aes(gridx, Mestimates),              fill = "blue", alpha = 0.7, size = 1) +   ylim(c(0, 0.00055)) +   xlim(c(0, 10000)) </code></pre>

The resulting density estimate is smooth and very close to the estimate on the original unclassed data. The package is also able to handle the case of survey participants rounding values by their own, each by a different level of coarseness. Please check out the dheaping function with given examples.

Soon there will be a second part to this article, where I’ll show how to compute bivariate densities given area-level data with the Kernelheaping package.

To leave a comment for the author, please follow the link and comment on their blog: INWT-Blog-RBloggers.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.