Maximum Likelihood Distilled


We all hear about Maximum Likelihood Estimation (MLE) and we often see hints of it in our model output. As usual, doing things manually can give us a better grasp of how our models actually work. Here’s a very short example implementing MLE based on the explanation from Gelman and Hill (2007), pages 404-405.

The likelihood measures how compatible our outcome variable Y is with the predictions we make from our predictor X. We compute this measure of compatibility with the probability density function of the normal distribution; in R, dnorm returns it. The intuition behind dnorm is that it returns the height of the distribution at a given point, and that height is the likelihood of that point. We of course want the highest likelihood, as it indicates greater compatibility.
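To make that concrete, here is a small illustration (not tied to any particular model): the density of a standard normal is highest at its mean and drops as we move away from it, so values close to what the model predicts receive a larger likelihood.

# Height of a standard normal density at its mean...
dnorm(0, mean = 0, sd = 1)
## [1] 0.3989423

# ...and two standard deviations away, where the likelihood is much smaller
dnorm(2, mean = 0, sd = 1)
## [1] 0.05399097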

For example, assuming parameters is a vector containing the intercept a, the coefficient b and the error term sigma, we can compute the (log) likelihood of the data for any random values of these parameters:

loglikelihood <- function(parameters, predictor, outcome) {
  # intercept
  a <- parameters[1]
  # beta coef
  b <- parameters[2]
  # error term
  sigma <- parameters[3]

  # Calculate the log likelihood of `y` given `a + b * x`
  ll.vec <- dnorm(outcome, a + b * predictor, sigma, log = TRUE)

  # Sum those log likelihoods over all the observations in the data
  sum(ll.vec)
}

# Generate three random values for intercept, beta and error term
inits <- runif(3)

# Calculate the log likelihood given these three values
loglikelihood(
  inits,
  predictor = mtcars$disp,
  outcome = mtcars$mpg
)
## [1] -11687.41
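That’s the log likelihood of the data given the three random values for the intercept, the coefficient and sigma. It is very low because those random values are nowhere near the best fit. For comparison, plugging in values close to the estimates we recover further down (the sigma of 3.2 used here is only a rough guess) returns a much higher, that is, far less negative, log likelihood:

# Values close to the eventual estimates; sigma = 3.2 is only a plausible guess
loglikelihood(
  c(29.6, -0.041, 3.2),
  predictor = mtcars$disp,
  outcome = mtcars$mpg
)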

How does a typical linear model find the combination of intercept, coefficient and sigma that maximizes this likelihood? It performs an optimization search, trying out many candidate values for these unknowns until it finds the combination that returns the maximum:

mle <-
  optim(
    inits, # The three random values for intercept, beta and sigma
    loglikelihood, # The loglik function
    lower = c(-Inf, -Inf, 1.e-5), # Lower bounds: intercept and slope are unbounded, but sigma must stay positive (at least 1.e-5)
    method = "L-BFGS-B",
    control = list(fnscale = -1), # This signals to search for the maximum rather than the minimum
    predictor = mtcars$disp,
    outcome = mtcars$mpg
  )

mle$par[1:2]
## [1] 29.59985346 -0.04121511

Let’s compare that to the result of lm:

coef(lm(mpg ~ disp, data = mtcars))
## (Intercept)        disp 
## 29.59985476 -0.04121512
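The intercept and the slope match almost exactly. We can also take a quick look at the remaining pieces of the fit. The third parameter returned by optim is the maximum likelihood estimate of the error term; it should be close to, but slightly smaller than, lm’s residual standard error, since the MLE of sigma divides the residual sum of squares by n while lm divides by n - 2. Likewise, the maximized log likelihood found by optim (mle$value) should essentially match what logLik reports for the lm fit:

# MLE of the error term vs lm's residual standard error
mle$par[3]
sigma(lm(mpg ~ disp, data = mtcars))

# Maximized log likelihood from optim vs the one reported by lm
mle$value
logLik(lm(mpg ~ disp, data = mtcars))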

In layman’s terms, MLE simply checks how compatible the observed data are with the predictions implied by a given set of coefficients. It repeats that step many times until it finds the combination of coefficients that maximizes the likelihood of the data.
