Find the intersection of overlapping histograms in R

[This article was first published on Posts | Joshua Cook, and kindly contributed to R-bloggers.]

Here, I demonstrate how to find the point where two overlapping histograms intersect. The result is an approximation: its resolution is set by the spacing of the grid on which the densities are evaluated.

Prepare simulated data

I created two data sets, gamma_dist and norm_dist, which are made up of a different number of values sampled randomly from a gamma distribution and a normal distribution, respectively. I specifically made the data sets different sizes to show that the method still applies when the sample sizes differ.

library(tibble)
library(ggplot2)

gamma_dist <- rgamma(1e5, shape = 2, scale = 2)
norm_dist <- rnorm(5e5, mean = 20, sd = 5)
df <- tibble(
  x = c(gamma_dist, norm_dist),
  original_dataset = c(rep("gamma_dist", 1e5), rep("norm_dist", 5e5))
)
df
#> # A tibble: 600,000 x 2
#> x original_dataset
#> <dbl> <chr>
#> 1 6.89 gamma_dist
#> 2 2.25 gamma_dist
#> 3 1.30 gamma_dist
#> 4 4.10 gamma_dist
#> 5 7.77 gamma_dist
#> 6 5.08 gamma_dist
#> 7 4.58 gamma_dist
#> 8 2.30 gamma_dist
#> 9 1.36 gamma_dist
#> 10 1.67 gamma_dist
#> # … with 599,990 more rows

I used ‘ggplot2’ to plot the densities of the two data sets. The gamma distribution is in red and the normal distribution is in blue. I broke the creation of the plot into two steps: the essential step to create the density curves, and the styling step to make the plot look nice. Of course, these could be combined into a single long ggplot statement.

# Essential step: draw one density curve per data set.
p <- ggplot(df) +
  geom_density(aes(x = x, color = original_dataset))
# Styling step.
p <- p +
  # expansion() replaces the deprecated expand_scale() in ggplot2 >= 3.3.0.
  scale_y_continuous(expand = expansion(mult = c(0, 0.05))) +
  scale_color_manual(values = c("tomato", "dodgerblue")) +
  theme_minimal() +
  theme(
    legend.title = element_blank(),
    plot.title = element_text(hjust = 0.5)
  ) +
  labs(x = "values",
       title = "Two density curves")
p

Finding the point of intersection

To find the point of intersection, I first estimated the density of each data set using density. It is essential to use the same from and to values for each data set. By default, the density function evaluates the estimate at 512 equally spaced points between from and to, so providing the same starting and ending parameters makes density use the same x values for each data set.

from <- 0
to <- 40
gamma_density <- density(gamma_dist, from = from, to = to)
norm_density <- density(norm_dist, from = from, to = to)
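The shared grid can be checked directly. The following is a minimal sketch, not part of the original analysis: it regenerates the simulated data with an arbitrary seed (an assumption, since the post does not state one) and introduces the illustrative names same_grid, n_points, and grid_step.

```r
# Regenerate the simulated data; the seed is an assumption, not from the post.
set.seed(1)
gamma_dist <- rgamma(1e5, shape = 2, scale = 2)
norm_dist <- rnorm(5e5, mean = 20, sd = 5)

from <- 0
to <- 40
gamma_density <- density(gamma_dist, from = from, to = to)
norm_density <- density(norm_dist, from = from, to = to)

# Both estimates are evaluated at the same x values ...
same_grid <- isTRUE(all.equal(gamma_density$x, norm_density$x))
# ... of which there are 512 by default (the `n` argument of density()).
n_points <- length(gamma_density$x)
# Neighbouring points are (to - from) / (n - 1) apart. This spacing bounds
# how finely a grid-based search can locate the intersection.
grid_step <- diff(gamma_density$x)[1]

same_grid
n_points
grid_step
```

Passing a larger n to both density calls would shrink grid_step, at the cost of a little more computation.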

The final step was to find the first point at which the density of the gamma distribution fell below that of the normal distribution. I applied this logic to create the boolean vector idx. I also included two other filters to restrict the result to between 5 and 20 because, from the plot above, I can see that the intersection falls within this range. The smallest x value that passes all three filters is the point of intersection, poi.

idx <- (gamma_density$y < norm_density$y) &
  (gamma_density$x > 5) &
  (gamma_density$x < 20)
poi <- min(gamma_density$x[idx])
poi
#> [1] 10.64579
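The grid-based search above always returns one of the 512 grid points. One possible refinement, which is a swapped-in technique and not the method used in this post, is to interpolate the difference between the two estimated curves with approxfun and locate its zero with uniroot. The sketch below assumes an arbitrary seed, and the names density_diff and poi_refined are illustrative.

```r
# Regenerate the simulated data; the seed is an assumption, not from the post.
set.seed(1)
gamma_dist <- rgamma(1e5, shape = 2, scale = 2)
norm_dist <- rnorm(5e5, mean = 20, sd = 5)
gamma_density <- density(gamma_dist, from = 0, to = 40)
norm_density <- density(norm_dist, from = 0, to = 40)

# Grid-based estimate, as in the post.
idx <- (gamma_density$y < norm_density$y) &
  (gamma_density$x > 5) &
  (gamma_density$x < 20)
poi <- min(gamma_density$x[idx])

# Piecewise-linear interpolation of (gamma density - normal density).
density_diff <- approxfun(gamma_density$x, gamma_density$y - norm_density$y)

# uniroot() needs the function to change sign over the interval; here the
# gamma curve is above the normal curve at 5 and below it at 20.
poi_refined <- uniroot(density_diff, interval = c(5, 20), tol = 1e-8)$root

c(grid = poi, interpolated = poi_refined)
```

The interpolated value falls between poi and the grid point just before it, so the two estimates differ by less than one grid step.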

That’s it: the intersection of the two density curves has been located to within one step of the density grid (40 / 511, or about 0.08). I added a vertical line to the plot at poi.

p <- p +
  # linewidth replaces size for lines in ggplot2 >= 3.4.0.
  geom_vline(xintercept = poi, linetype = 2, linewidth = 0.3, color = "black") +
  annotate(geom = "text", label = round(poi, 3),
           x = poi - 1, y = 0.1, size = 4, angle = 90)
p
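Because the data here are simulated from known distributions, the estimate can be sanity-checked against the crossing of the true density functions. This is a rough sketch and is only possible for simulated data; true_diff and true_poi are illustrative names. The value need not match poi exactly, since kernel smoothing and sampling noise both shift the estimated curves slightly.

```r
# Difference between the true densities used to simulate the data:
# Gamma(shape = 2, scale = 2) and Normal(mean = 20, sd = 5).
true_diff <- function(x) {
  dgamma(x, shape = 2, scale = 2) - dnorm(x, mean = 20, sd = 5)
}

# Search the same 5 to 20 window used for the grid-based estimate.
true_poi <- uniroot(true_diff, interval = c(5, 20), tol = 1e-8)$root
true_poi
```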
