
MLE Adjustment for High-Dimensional Logistic Regression

This article was first published on HOXO-M Blog and kindly contributed to R-bloggers.

The maximum likelihood estimator (MLE) of the logistic regression model is not unbiased, so estimates calculated with glm() contain bias. Because the MLE is consistent and asymptotically normal, this bias can be disregarded when the sample size is large. In the analysis of high-dimensional data, however, the sample size is sometimes small relative to the number of input variables.

For example, consider a scenario with p = 300 input variables and a sample size of n = 2000. The true parameter vector beta consists of three blocks of p/3 = 100 entries each, equal to 10, -10, and 0 respectively (see the definition of beta in the code below).

In such a case, the MLE returned by glm() contains a non-negligible bias.
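To make "relatively small" concrete, this setting can be summarized by two quantities. The symbols \kappa (dimension-to-sample ratio) and \gamma^2 (variance of the linear predictor) are notation introduced here for illustration and do not appear in the original post; the values follow from the simulation code below, where each entry of X is drawn independently from N(0, 1/n) and there is no intercept.

```latex
\kappa = \frac{p}{n} = \frac{300}{2000} = 0.15,
\qquad
\gamma^2 = \operatorname{Var}\!\left(x_i^\top \beta\right)
         = \frac{\lVert \beta \rVert^2}{n}
         = \frac{100 \cdot 10^2 + 100 \cdot (-10)^2 + 100 \cdot 0^2}{2000}
         = 10 .
```

In other words, there are fewer than seven observations per parameter, and the linear predictor has a standard deviation of about 3.2.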

p <- 300
n <- 2000

set.seed(314)
x <- rnorm(n * p, mean = 0, sd = sqrt(1/n))
X <- matrix(x, nrow = n, ncol = p)
beta <- matrix(rep(c(10, -10, 0), each = p/3), nrow = p, ncol = 1)
prob <- plogis(X %*% beta)
y <- rbinom(n, 1, prob)

fit <- glm(y ~ X, family = binomial, x = TRUE)

library(ggplot2)
theme_set(theme_bw())
df <- data.frame(index = seq_len(p), mle = coef(fit)[-1])
ggplot(df, aes(index, mle)) +
  geom_point(color = "blue") +
  annotate("segment", x = c(0, 100, 200), xend = c(100, 200, 300), 
           y = c(10, -10, 0), yend = c(10, -10, 0), linewidth = 1.5) +
  scale_x_continuous(breaks = c(0, 100, 200, 300)) +
  ylim(-30, 30) + xlab("Index of parameters") + ylab("MLE") +
  ggtitle("True (black line) and MLE (blue point)")

You can see that the blue points (MLE) spread well beyond the black lines (true values).
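The visual impression can also be checked numerically. The following is a minimal sketch, not part of the package: it re-runs the same simulation so that it is self-contained, then computes two summaries chosen here for illustration (the names `inflation` and `block_means` are hypothetical): the no-intercept least-squares slope of the MLE on the true values, and the average estimate within each block of 100 parameters.

```r
# Self-contained re-run of the simulation from this post (same seed and sizes).
p <- 300
n <- 2000

set.seed(314)
x <- rnorm(n * p, mean = 0, sd = sqrt(1/n))
X <- matrix(x, nrow = n, ncol = p)
beta <- rep(c(10, -10, 0), each = p/3)
prob <- plogis(X %*% beta)
y <- rbinom(n, 1, prob)

fit <- glm(y ~ X, family = binomial)
mle <- unname(coef(fit)[-1])  # drop the intercept

# Illustrative summary 1: least-squares slope of the MLE on the true values
# (no intercept). A slope of 1 would mean no systematic inflation.
inflation <- sum(mle * beta) / sum(beta^2)

# Illustrative summary 2: average estimate within each block of 100 parameters.
block <- rep(c("true = 10", "true = -10", "true = 0"), each = p/3)
block_means <- tapply(mle, block, mean)

print(inflation)
print(block_means)
```

A slope noticeably above 1 indicates that the nonzero coefficients are overestimated in magnitude, which is the kind of bias the adjustment is meant to reduce.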

The purpose of the adjustMLE package is to alleviate this bias by adjusting the MLE; to achieve this, we implemented two methods.

The adjustMLE() function in the package takes the fitted glm object and returns an adjusted fit whose coefficients can be extracted with coef().

library(adjustMLE)

fit_adj <- adjustMLE(fit)

df <- data.frame(index = seq_len(p), mle = coef(fit_adj)[-1])
ggplot(df, aes(index, mle)) +
  geom_point(color = "blue") +
  annotate("segment", x = c(0, 100, 200), xend = c(100, 200, 300), 
           y = c(10, -10, 0), yend = c(10, -10, 0), linewidth = 1.5) +
  scale_x_continuous(breaks = c(0, 100, 200, 300)) +
  ylim(-30, 30) + xlab("Index of parameters") + ylab("Adjusted MLE") +
  ggtitle("True (black line) and adjusted MLE (blue point)")

For more details, refer to https://github.com/hoxo-m/adjustMLE.


Installation

You can install the package from GitHub.

install.packages("remotes") # if you have not installed "remotes" package
remotes::install_github("hoxo-m/adjustMLE")
