Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

### TLDR

Maximum Likelihood Estimation (MLE) is one method of inferring model parameters. This post aims to give an intuitive explanation of MLE, discussing why it is so useful (simplicity and availability in software) as well as where it is limited (point estimates are not as informative as Bayesian estimates, which are also shown for comparison).

### Introduction

Distribution parameters describe the shape of a distribution function. A normal (Gaussian) distribution is characterised based on it’s mean, $$\mu$$ and standard deviation, $$\sigma$$. Increasing the mean shifts the distribution to be centered at a larger value and increasing the standard deviation stretches the function to give larger values further away from the mean. When we approximate some uncertain data with a distribution function, we are interested in estimating the distribution parameters that are most consistent with the data.

The likelihood, $$L$$, of some data, $$z$$, is shown below. Where $$f(\theta)$$ is the function that has been proposed to explain the data, and $$\theta$$ are the parameter(s) that characterise that function.

$L = \displaystyle\prod_{i=1}^{N} f(z_{i} \mid \theta)$

Likelihood values (and therefore also the product of many likelihood values) can be very small, so small that they cause problems for software. Therefore it’s usually more convenient to work with log-likelihoods instead. Taking the logarithm is applying a monotonically increasing function. This means if one function has a higher sample likelihood than another, then it will also have a higher log-likelihood. Also, the location of maximum log-likelihood will be also be the location of the maximum likelihood.

$\log{(L)} = \displaystyle\sum_{i=1}^{N} f(z_{i} \mid \theta)$

The distribution parameters that maximise the log-likelihood function, $$\theta^{*}$$, are those that correspond to the maximum sample likelihood.

$\theta^{*} = arg \max_{\theta} \bigg[ \log{(L)} \bigg]$

Below, two different normal distributions are proposed to describe a pair of observations.

obs <- c(0, 3)

The red distribution has a mean value of 1 and a standard deviation of 2. The green distribution has a mean value of 2 and a standard deviation of 1 and so is centered further to the right, and is less dispersed (less stretched out). The red arrows point to the likelihood values of the data associated with the red distribution, and the green arrows indicate the likelihood of the same data with respect to the green function. The first data point, 0 is more likely to have been generated by the red function, and the second data point, 3 is more likely to have been generated by the green function.

We can evaluate the log-likelihood and compare the two functions:

sum(log(dnorm(x = obs, mean = 1, sd = 2))) # Red function
## [1] -3.849171

sum(log(dnorm(x = obs, mean = 2, sd = 1))) # Green function
## [1] -4.337877

As shown above, the red distribution has a higher log-likelihood (and therefore also a higher likelihood) than the green function, with respect to the 2 data points. The above graph suggests that this is driven by the first data point , 0 being significantly more consistent with the red function. The below example looks at how a distribution parameter that maximises a sample likelihood could be identified.

### MLE for an Exponential Distribution

The exponential distribution is characterised by a single parameter, it’s rate $$\lambda$$:

$f(z, \lambda) = \lambda \cdot \exp^{- \lambda \cdot z}$

It is a widely used distribution, as it is a Maximum Entropy (MaxEnt) solution. If some unknown parameters is known to be positive, with a fixed mean, then the function that best conveys this (and only this) information is the exponential distribution. I plan to write a future post about the MaxEnt principle, as it is deeply linked to Bayesian statistics. The expectation (mean), $$E[y]$$ and variance, $$Var[y]$$ of an exponentially distributed parameter, $$y \sim exp(\lambda)$$ are shown below:

$E[y] = \lambda^{-1}, \; Var[y] = \lambda^{-2}$

Simulating some example data…

n_samples <- 25; true_rate <- 1; set.seed(1)

exp_samples <- rexp(n = n_samples,
rate = true_rate)

In the above code, 25 independent random samples have been taken from an exponential distribution with a mean of 1, using rexp.

Below, for various proposed $$\lambda$$ values, the log-likelihood (log(dexp())) of the sample is evaluated.

exp_lik_df <- data.frame(rate = double(),
lik = double())

for (i in seq(from = 0.2, to = 2, by = 0.2)){

exp_lik_df <- rbind(exp_lik_df,
data.frame(rate = i,
log_lik = sum(log(
dexp(x = exp_samples,
rate = i)))))

}

max_log_lik <- exp_lik_df[which.max(x = exp_lik_df$log_lik),] Finally, max_log_lik finds which of the proposed $$\lambda$$ values is associated with the highest log-likelihood. We can print out the data frame that has just been created and check that the maximum has been correctly identified. print(exp_lik_df) ## rate log_lik ## 1 0.2 -45.38774 ## 2 0.4 -33.21086 ## 3 0.6 -28.22602 ## 4 0.8 -26.18577 ## 5 1.0 -25.75897 ## 6 1.2 -26.35273 ## 7 1.4 -27.65076 ## 8 1.6 -29.46427 ## 9 1.8 -31.67149 ## 10 2.0 -34.18927 print(max_log_lik) ## rate log_lik ## 5 1 -25.75897 The below plot shows how the sample log-likelihood varies for different values of $$\lambda$$. It also shows the shape of the exponential distribution associated with the lowest (top-left), optimal (top-centre) and highest (top-right) values of $$\lambda$$ considered in these iterations: #### MLE in Practice: Software Libraries In practice there are many software packages that quickly and conveniently automate MLE. Here are some useful examples… Firstly, using the fitdistrplus library in R: library(fitdistrplus) sample_data <- exp_samples rate_fit_R <- fitdist(data = sample_data, distr = 'exp', method = 'mle') Although I have specified mle (maximum likelihood estimation) as the method that I would like R to use here, it is already the default argument and so we didn’t need to include it. R provides us with an list of plenty of useful information, including: – the original data – the size of the dataset – the co-variance matrix (especially useful if we are estimating multiple parameters) – some measures of well the parameters were estimated You can explore these using $ to check the additional information available.

We can take advantage of this to extract the estimated parameter value and the corresponding log-likelihood:

rate_fit_R$estimate ## rate ## 0.9705356 rate_fit_R$loglik
## [1] -25.74768

Alternatively, with SciPy in Python (using the same data):

import scipy.stats as stats

sample_data = r.exp_samples

rate_fit_py = stats.expon.fit(data = sample_data, floc = 0)
rate = (rate_fit_py[1])**-1

print(rate)
## 0.9705355729681481

Though we did not specify MLE as a method, the online documentation indicates this is what the function uses.

We can also calculate the log-likelihood associated with this estimate using NumPy:

import numpy as np

np.sum(np.log(stats.expon.pdf(x = sample_data,
scale = rate_fit_py[1])))
## -25.747680569393435

We’ve shown that values obtained from Python match those from R, so (as usual) both approaches will work out.

The method argument in R’s fitdistrplus::fitdist() function also accepts mme (moment matching estimation) and qme (quantile matching estimation), but remember that MLE is the default. One useful feature of MLE, is that (with sufficient data), parameter estimates can be approximated as normally distributed, with the covariance matrix (for all of the parameters being estimated) equal to the inverse of the Hessian matrix of the likelihood function.

However, MLE is primarily used as a point estimate solution and the information contained in a single value will always be limited. Likelihoods will not necessarily be symmetrically dispersed around the point of maximum likelihood. We may be interested in the full distribution of credible parameter values, so that we can perform sensitivity analyses and understand the possible outcomes or optimal decisions associated with particular credible intervals. See below for a proposed approach for overcoming these limitations.

### Limitations (or ‘How to do better with Bayesian methods’)

An intuitive method for quantifying this epistemic (statistical) uncertainty in parameter estimation is Bayesian inference. This removes requirements for a sufficient sample size, while providing more information (a full posterior distribution) of credible values for each parameter. If multiple parameters are being simultaneously estimated, then the posterior distribution will be a joint probabilistic model of all parameters, accounting for any inter-dependencies too. Finally, it also provides the opportunity to build in prior knowledge, which we may have available, before evaluating the data.

Take it from Stan himself:

Returning to the challenge of estimating the rate parameter for an exponential model, based on the same 25 observations:

summary(exp_samples)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.1061  0.4361  0.7552  1.0304  1.2296  4.4239

We will now consider a Bayesian approach, by writing a Stan file that describes this exponential model:

data {

int <lower = 0> N; // Defining the number of observations
vector <lower = 0> [N] samples; // A vector containing the observations

}

parameters {

// The (unobserved) model parameter that we want to estimate
real <lower = 0> rate;

}

model {

// An exponential model, which we are proposing to describe the data
samples ~ exponential(rate);

}

As with previous examples on this blog, data can be pre-processed, and results can be extracted using the rstan package:

library(rstan)

exp_posterior_samples <- sampling(object = exp_model,
data = list(N = n_samples,
samples = exp_samples),
seed = 1008)
##
## SAMPLING FOR MODEL '5328d75314acb3313bc7fa418eeb08c2' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 0 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 1:
## Chain 1:
## Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 1: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 1: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 1: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 1: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 1: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 1:
## Chain 1:  Elapsed Time: 0.037 seconds (Warm-up)
## Chain 1:                0.034 seconds (Sampling)
## Chain 1:                0.071 seconds (Total)
## Chain 1:
##
## SAMPLING FOR MODEL '5328d75314acb3313bc7fa418eeb08c2' NOW (CHAIN 2).
## Chain 2:
## Chain 2: Gradient evaluation took 0 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 2:
## Chain 2:
## Chain 2: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 2: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 2: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 2: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 2: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 2: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 2: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 2: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 2: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 2: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 2: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 2: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 2:
## Chain 2:  Elapsed Time: 0.048 seconds (Warm-up)
## Chain 2:                0.029 seconds (Sampling)
## Chain 2:                0.077 seconds (Total)
## Chain 2:
##
## SAMPLING FOR MODEL '5328d75314acb3313bc7fa418eeb08c2' NOW (CHAIN 3).
## Chain 3:
## Chain 3: Gradient evaluation took 0 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 3:
## Chain 3:
## Chain 3: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 3: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 3: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 3: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 3: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 3: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 3: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 3: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 3: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 3: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 3: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 3: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 3:
## Chain 3:  Elapsed Time: 0.04 seconds (Warm-up)
## Chain 3:                0.032 seconds (Sampling)
## Chain 3:                0.072 seconds (Total)
## Chain 3:
##
## SAMPLING FOR MODEL '5328d75314acb3313bc7fa418eeb08c2' NOW (CHAIN 4).
## Chain 4:
## Chain 4: Gradient evaluation took 0 seconds
## Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 4:
## Chain 4:
## Chain 4: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 4: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 4: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 4: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 4: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 4: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 4: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 4: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 4: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 4: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 4: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 4: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 4:
## Chain 4:  Elapsed Time: 0.026 seconds (Warm-up)
## Chain 4:                0.024 seconds (Sampling)
## Chain 4:                0.05 seconds (Total)
## Chain 4:

Note: We have not specified a prior model for the rate parameter. Stan responds to this by setting what is known as an improper prior (a uniform distribution bounded only by any upper and lower limits that were listed when the parameter was declared). For real-world problems, there are many reasons to avoid uniform priors. Partly because they are no longer ‘non-informative’ when there are transformations, such as in generalised linear models, and partly because there will always be some prior information to help direct you towards more credible outcomes. However, this data has been introduced without any context and by using uniform priors, we should be able to recover the same maximum likelihood estimate as the non-Bayesian approaches above.

Extracting the results from our model:

library(ggmcmc)

extracted_samples <- ggs(S = exp_posterior_samples)

head(x = extracted_samples, n = 5)
## # A tibble: 5 x 4
##   Iteration Chain Parameter value
##       <dbl> <int> <fct>     <dbl>
## 1         1     1 rate      0.614
## 2         2     1 rate      0.710
## 3         3     1 rate      0.783
## 4         4     1 rate      1.00
## 5         5     1 rate      1.16

We can use this data to visualise the uncertainty in our estimate of the rate parameter:

ggplot(data = extracted_samples %>%
dplyr::filter(Parameter == 'rate'))+
geom_density(mapping = aes(x = value,
y = ..density..),
fill = 'purple4', alpha = 0.2)+
geom_vline(aes(lty = 'MLE solution',
xintercept = rate_fit_R\$estimate))+
scale_linetype_manual(values = c(2))+
scale_x_continuous(name = 'Rate Parameter')+
scale_y_continuous(name = 'Posterior Likelihood')+
theme_ddf_light()

We can use the full posterior distribution to identify the maximum posterior likelihood (which matches the MLE value for this simple example, since we have used an improper prior). However, we can also calculate credible intervals, or the probability of the parameter exceeding any value that may be of interest to us.

This distribution includes the statistical uncertainty due to the limited sample size. As more data is collected, we generally see a reduction in uncertainty. Based on a similar principle, if we had also have included some information in the form of a prior model (even if it was only weakly informative), this would also serve to reduce this uncertainty.

Finally, we can also sample from the posterior distribution to plot predictions on a more meaningful outcome scale (where each green line represents an exponential model associated with a single sample from the posterior distribution of the rate parameter):