Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

# Crossing the Line

This post continues our series on developing statistical models to explore the arcane relationship between UFO sightings and population. The previous post is available here: Bayes vs. the Invaders! Part One: The 37th Parallel.

The simple linear model developed in the previous post is far from satisfying. It makes many unsupportable assumptions about the data and the form of the residual errors from the model. Most obviously, it relies on an underlying Gaussian (or normal) distribution for its understanding of the data. For our count data, some basic features of the Guassian are inappropriate.

Most notably:

• a Gaussian distribution is continuous whilst counts are discrete — you can’t have 2.3 UFO sightings in a given day;
• the Gaussian can produce negative values, which are impossible when dealing with counts — you can’t have a negative number of UFO sightings;
• the Gaussian is symmetrical around its mean value whereas count data is typically skewed.

Moving from the safety and comfort of basic linear regression, then, we will delve into the madness and chaos of generalized linear models that allow us to choose from a range of distributions to describe the relationship between state population and counts of UFO sightings.

# Basic Models

We will be working in a Bayesian framework, in which we assign a prior distribution to each parameter that allows, and requires, us to express some prior knowledge about the parameters of interest. These priors are the initial starting points for parameters Afrom which the model moves towards the underlying values as it learns from the data. Choice of priors can have significant effects not only on the outputs of the model, but also its ability to function effectively; as such, it is both an important, but also arcane and subtle, aspect of the Bayesian approach4. For the Poisson model above, the traceplot can be created using the bayesplot library:

Show traceplot code.
library( tidyverse )
library( magrittr )
library( lubridate )

library( ggplot2 )
library( showtext )
library( cowplot )

library( rstan )
library( bayesplot )
library( tidybayes )

ufo_population_sightings <-

# UFO reporting font
showtext_auto()

# Bayesplot needs to be told which theme to use as a default.
theme_set( theme_weird() )

# First, as always, a traceplot
tp <-
traceplot(
fit_ufo_pop_poisson,
pars = c("a", "b"),
ncol=1 ) +
scale_colour_viridis_d( name="Chain", direction=-1 ) +
theme_weird()

title <-
ggdraw() +
draw_label("Traceplot of Key Model Parameters", fontfamily="main_font", colour = "#cccccc", size=20, hjust=0, vjust=1, x=0.02, y=0.88) +
draw_label("http://www.weirddatascience.net | @WeirdDataSci", fontfamily="main_font", colour = "#cccccc", size=12, hjust=0, vjust=1, x=0.02, y=0.40)

titled_tp <-
plot_grid(title, tp, ncol=1, rel_heights=c(0.1, 1)) +
theme(
panel.background = element_rect(fill = "#222222", colour = "#222222"),
plot.background = element_rect(fill = "#222222", colour = "#222222"),
)

save_plot("output/poisson_traceplot.pdf",
titled_tp,
base_width = 16,
base_height = 9,
base_aspect_ratio = 1.78 )


These traceplots exhibit the characteristic insane scribbling of well-mixed chains, often referred to — in, of course, hushed whispers — as a manifestation of a hairy caterpillar; the separate lines representing each chain are clearly overlapping and exploring the same areas. If, by contrast, the lines were largely separated or did not show the same space, there would be reason to believe that our model had become lost and unable to find a coherent voice amongst the myriad babbling murmurs of the data.

A second check on the sanity of the modelling process is to examine the output of the model itself to show the value of the fitted parameters of interest, and some diagnostic information:

fit_ufo_pop_poisson %>%
summary(pars=c("a", "b" )) %>%
extract2( "summary" )

mean      se_mean           sd       2.5%        25%        50%        75%      97.5%    n_eff     Rhat
a 5.5199115 7.162684e-03 2.701118e-01 4.99950988 5.33593127 5.51805161 5.70399617 6.05461185 1422.118 1.001581
b 0.0107647 1.728273e-06 6.574476e-05 0.01063562 0.01072192 0.01076418 0.01080788 0.01089496 1447.097 1.001551


For assessment of successful model fit, the Rhat value represents the extent to which the various Markov chains exploring the space, of which there are four by default in Stan, are consistent with each other. As a rule of thumb, a value of Rhat > 1.1 indicates that the model has not converged appropriately and may require a longer set of random sampling iterations, or an improved model. Here, the valuesof Rhat are close to the ideal value of 1.

As a final step, we should examine how well our model can reproduce the shape of the original data. Models aim to be eerily lifelike parodies of the truth; in a Bayesian framework, and in the Stan language, we can build into the model the ability to draw random samples from the posterior predictive distribution — the set of parameters that the model has learnt from the data — to create new possible values of the outcomes based on the observed inputs. This process can be repeated many times to produce a multiplicity of possible outcomes drawn from model, which we can then visualize to see graphically how well our model fits the observed data.

In the Stan code above, this is created in the generated_quantities block. When using more convenient libraries such as brms or rstanarm, draws from the posterior predictive distribution can be obtained more simply after the model has been fit through a range of helper functions. Here, we undertake the process manually.

We can see, then, how well the Poisson distribution, informed by our selection of priors, has shaped itself to the underlying data.

Show posterior predictive plot code.
library( tidyverse )
library( magrittr )
library( lubridate )

library( ggplot2 )
library( showtext )
library( cowplot )

library( rstan )
library( bayesplot )
library( tidybayes )

ufo_population_sightings <-

# UFO reporting font
showtext_auto()

# Plots, posterior predictive checking, LOO.

# Bayesplot needs to be told which theme to use as a default.
theme_set( theme_weird() )

## Model checking visualisations

# Extract posterior estimates from the fit (from the generated quantities of the stan model)
counts_pred_poisson <- as.matrix( fit_ufo_pop_poisson, pars = "counts_pred" )

# Posterior predictive density. (Visual representation of goodness of fit.)
# Sample 50 rows for overlay
counts_pred_sample <-
counts_pred_poisson[ sample( nrow( counts_pred_poisson ), 50 ), ]
gp_ppc <-
ppc_dens_overlay(
y = extract2( ufo_population_sightings, "count" ),
yrep = counts_pred_sample,
alpha=0.4) +
theme_weird()

title <-
ggdraw() +
draw_label("Posterior Predictive Density Plot", fontfamily="main_font", colour = "#cccccc", size=20, hjust=0, vjust=1, x=0.02, y=0.88) +
draw_label("http://www.weirddatascience.net | @WeirdDataSci", fontfamily="main_font", colour = "#cccccc", size=12, hjust=0, vjust=1, x=0.02, y=0.40)

titled_gp_ppc <-
plot_grid(title, gp_ppc, ncol=1, rel_heights=c(0.1, 1)) +
theme(
panel.background = element_rect(fill = "#222222", colour = "#222222"),
plot.background = element_rect(fill = "#222222", colour = "#222222"),
)

save_plot("output/poisson_posterior_predictive.pdf",
titled_gp_ppc,
base_width = 16,
base_height = 9,
base_aspect_ratio = 1.78 )


In the diagram above, the yellow line shows the densities of count values; the blue lines show a sample of twisted mockeries spawned by our piscine approximations. As can be seen, the model has roughly captured the shape of the true distribution, but has notable dissimilarities with the original data.

To appreciate the full horror of what we have wrought, of course, we can plot the predictions of the model against the real data.

Show posterior predictive plot code.

library( tidyverse )
library( magrittr )
library( lubridate )

library( ggplot2 )
library( showtext )
library( cowplot )

library( rstan )
library( bayesplot )
library( tidybayes )
library( modelr )

# Load UFO data and model
ufo_population_sightings <-

fit_ufo_pop_poisson <-

# UFO reporting font
showtext_auto()

# Plots, posterior predictive checking, LOO
theme_set( theme_weird() )

# Use teal colour scheme
color_scheme_set( "teal")

## Model checking visualisations

# Extract posterior estimates from the fit (from the generated quantities of the stan model)
counts_pred_poisson <- as.matrix( fit_ufo_pop_poisson, pars = "counts_pred" )

# US state data
us_state_factors <-
levels( factor( ufo_population_sightings$state ) ) ## Create per-state predictive fit plots # Convert fitted model (stanfit) object to a tibble fit_tbl <- summary(fit_ufo_pop_poisson)$summary %>%
as.data.frame() %>%
mutate(variable = rownames(.)) %>%
select(variable, everything()) %>%
as_tibble()

counts_predicted <-
fit_tbl %>%
filter( str_detect(variable,'counts_pred') )

ufo_population_sightings_pred <-
ufo_population_sightings %>%
ungroup() %>%
mutate( count_mean = counts_predicted$mean, lower = counts_predicted$2.5%,
upper = counts_predicted\$97.5%)

# (Using mean and SD of fit summary)
predictive_plot <-
ggplot( ufo_population_sightings_pred ) +
geom_point( aes( x=population, y=count ), colour="#0b6788", size=0.6, alpha=0.8 ) +
geom_line(aes( x=population, y=count_mean ), colour="#3cd070" ) +
geom_ribbon(aes(x=population, ymin = lower, ymax = upper ), alpha = 0.2, fill="#3cd070") +
labs( x="Population (Thousands)", y="Annual Sightings" ) +
scale_fill_viridis_d( name="State" ) +
scale_colour_viridis_d( name="State" ) +
theme(
axis.title.y = element_text( angle=90 ),
legend.position = "none" )

# Construct full plot, with title and backdrop.
title <-
ggdraw() +
draw_label("UFO Sightings against State Population (1990-2014)", fontfamily="main_font", colour = "#cccccc", size=20, hjust=0, vjust=1, x=0.02, y=0.88) +
draw_label("Poisson GLM. 50% credible intervals.", fontfamily="main_font", colour = "#cccccc", size=12, hjust=0, vjust=1, x=0.02, y=0.48) +
draw_label("http://www.weirddatascience.net | @WeirdDataSci", fontfamily="main_font", colour = "#cccccc", size=12, hjust=0, vjust=1, x=0.02, y=0.16)

data_label <- ggdraw() +
draw_label("Data: http://www.nuforc.org | Tool: http://www.mc-stan.org", fontfamily="main_font", colour = "#cccccc", size=12, hjust=1, x=0.98 )

predictive_plot_titled <-
plot_grid(title, predictive_plot, data_label, ncol=1, rel_heights=c(0.1, 1, 0.1)) +
theme(
panel.background = element_rect(fill = "#222222", colour = "#222222"),
plot.background = element_rect(fill = "#222222", colour = "#222222"),
)

save_plot("output/poisson_predictive_plot.pdf",
predictive_plot_titled,
base_width = 16,
base_height = 9,
base_aspect_ratio = 1.78 )



This shows an extremely similar line of best fit to that produced from the basic Gaussian model in the previous post. Indeed, a side by side comparison shows that the 95% credible interval around the line are wider in this Poisson-based model5. This most likely reflects the inflexibility of the Poisson distribution given the nature of our data, something that we will discuss and rectify in the next post.

# Unsettling Distributions

In this post we have opened our eyes to the weirdly non-linear possibilities of generalised linear models; sealed and bound this concept within the wild philosophy of Bayesian inference; and unleashed the horrifying capacities of Markov Chain Monte Carlo methods and their manifestation in the Stan language.

Applying the Poisson distribution to our records of extraterrestrial sightings, we have seen that we can, to some extent, create a mindless Golem that imperfectly mimics the original data. In the next post, we will delve more deeply into the esoteric possibilities of other distributions for count data, explore ways in which to account for arcane relationships across and between per-state observations, and show how we can compare the effectiveness of different models to select the final glimpse of dread truth that we inadvisably seek.