24 Days of R: Day 20

[This article was first published on PirateGrunt » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Some time ago, I was doing some analysis and trying to determine whether or not there was a predictive variable for a binomial response. I ran logistic regressions for about half a dozen variables in different combinations and nothing showed a fit of any significance. Well, almost nothing. I had measured the response against date. Date is a continuous variable, so a regression will generate only one coefficient. However, I also tried month as a categorical variable- twelve coefficients. There are reasons for assuming that a particular month may have explanatory power. Businesses have fiscal years and certain dates attract different behavior. Reinsurance contracts often have a renewal date of the first of January, July and April. In these cases, the month, per se, explains nothing. It's merely a proxy for some other effect. Does increased volume alter the way business is handled? Are larger insurers more likely to stick to a July 1 renewal date?

In this particular case, one month showed up as being (barely) significant. There was no “real” explanatory variable that I could think of. This doesn't mean that one doesn't exist. (Absence of evidence is not evidence of absence.) But there was another explanation that I wanted to explore. Using the standard 5% threshold, what is the likelihood that a single month may appear to be significant, even when it is not?

I'm certain there's a much better way to demonstrate this analytically, but I'm going to cheat and use simulation. I'm going to assume a fairly high level for the probability of success (which was the case in the data I was looking at) and then randomly generate a few thousand sample sets and run a few thousand regressions. How often do I get a spurious result?

numObservations = 400
p = 0.9
trials = 1000
month = 1:12

SampleSet = function(numObservations, p) {
    Result = ifelse(runif(numObservations) <= p, 1, 0)
    Month = sample(1:12, size = numObservations, replace = TRUE)
    df = data.frame(Result = Result, Month = Month)
    df
}

set.seed(1234)
df = SampleSet(numObservations, p)

plot(jitter(df$Month), jitter(df$Result), pch = 19, xlab = "Month", ylab = "Result")

plot of chunk FirstSample

fit1 = glm(Result ~ as.factor(Month), data = df, family = binomial(link = "logit"))

And, with the first random trial, month 3 is almost significant. Let's repeat that 1000 times.

significant = integer()
for (i in 1:trials) {
    df = SampleSet(numObservations, p)
    fit = glm(Result ~ as.factor(Month), data = df, family = binomial(link = "logit"))
    # This will pull the rightmost column of the coefficient matrix
    coefs = summary(fit)$coefficients[, 4]
    # Get rid of the intercept
    coefs = coefs[-1]
    # Count if any cases are less than 0.05
    significant[i] = sum(coefs < 0.05) > 0
}

And how many times out of 1,000 do we find a month that's significant? In 77 cases, one of the months is deemed significant. That's more than the 5% chance for any individual factor, but not as large as I would expect. (Again, I know there's a way to work this analytically, but I'm much slower with pen, paper and math. I'll get to it at some point.)

There's probably also something Bayesian to be said, but right now I need to pack a suitcase. Tomorrrow I'm driving home to the beautiful state of Kentucky to visit my wonderful grandmothers and assorted relations. With luck, I'll get to post something before bedtime.

sessionInfo

## function (package = NULL) 
## {
##     z <- list()
##     z$R.version <- R.Version()
##     z$platform <- z$R.version$platform
##     if (nzchar(.Platform$r_arch)) 
##         z$platform <- paste(z$platform, .Platform$r_arch, sep = "/")
##     z$platform <- paste0(z$platform, " (", 8 * .Machine$sizeof.pointer, 
##         "-bit)")
##     z$locale <- Sys.getlocale()
##     if (is.null(package)) {
##         package <- grep("^package:", search(), value = TRUE)
##         keep <- sapply(package, function(x) x == "package:base" || 
##             !is.null(attr(as.environment(x), "path")))
##         package <- sub("^package:", "", package[keep])
##     }
##     pkgDesc <- lapply(package, packageDescription, encoding = NA)
##     if (length(package) == 0) 
##         stop("no valid packages were specified")
##     basePkgs <- sapply(pkgDesc, function(x) !is.null(x$Priority) && 
##         x$Priority == "base")
##     z$basePkgs <- package[basePkgs]
##     if (any(!basePkgs)) {
##         z$otherPkgs <- pkgDesc[!basePkgs]
##         names(z$otherPkgs) <- package[!basePkgs]
##     }
##     loadedOnly <- loadedNamespaces()
##     loadedOnly <- loadedOnly[!(loadedOnly %in% package)]
##     if (length(loadedOnly)) {
##         names(loadedOnly) <- loadedOnly
##         pkgDesc <- c(pkgDesc, lapply(loadedOnly, packageDescription))
##         z$loadedOnly <- pkgDesc[loadedOnly]
##     }
##     class(z) <- "sessionInfo"
##     z
## }
## <bytecode: 0x0000000015750720>
## <environment: namespace:utils>

To leave a comment for the author, please follow the link and comment on their blog: PirateGrunt » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)