A Machine Learning Result

Posted on February 5, 2015 by Joseph Rickert in R bloggers | 0 Comments

[This article was first published on Revolutions, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

by Joseph Rickert

Learning to effectively use any of the dozens of popular machine learning algorithms requires mastering many details and dealing with all kinds of practical issues. With all of this to consider, it might not be apparent to a person coming to machine learning from a background other than computer science or applied math that there are some easy to get at and very useful “theoretical” results. In this post, we will look at an upper bound result carefully described in section 2.2.1 of Schapire and Freund’s book “Boosting, Foundations and Algorithms” (This book, published in 2012, is destined to be a classic. During the course developing a thorough treatment of boosting algorithms, Schapire and Freund provide a compact introduction to the foundations of machine learning and relate the boosting algorithm to some exciting ideas from game theory, optimization and information geometry.)

The following is an example of a probabilistic upper bound result for the generalization error of an arbitrary classifier.

Assume that:

The data looks like a collection of labeled pairs (x,y) where x represents the vector of features and y takes values 0 or 1.
The training examples and test examples come from the same unknown distribution D of (x,y) pairs
We are dealing with a classification model (or hypothesis in machine learning jargon) h

Define E_t, the training error, to be the percentage of misclassified samples and E_g, the generalization error, to be the probability of misclassifying a single example (x,y) chosen at random from D.

Then, for any d greater than zero, with probability at least 1 – d, the following upper bound holds on the generalization error of h:

E_g <= E_t + sqrt(log(1/d)/(2m)) where m is the number of random samples (R)

Schapire and Freund approach this result as a coin flipping problem, noting that when a training example (x,y) is selected at random, the probability that h(x) does not equal y can be identified with a flipped coin coming up heads. The probability of getting a head, p, is fixed for all flips. The problem becomes that of determining whether the training error, the fraction of mismatches in a sequence of m flips, is significantly different from p.

The big trick in the proof of the above result is to realize that Hoeffding’s Inequality can be used to bound the binomial expression for the probability of getting at most (p – e)m heads in m trials where e is a small positive number. For our purposes, Hoeffding’s inequality can be stated as follows:

Let X₁ . . . X_m be independent random variables taking values in [0,1]. Let A_m denote their average. Then P(A_m <= E[A_m] – e) <= exp(-2me²).

If the X_i are binomial random variables where X_i = 1 if h(x) is not equal to y, then the training error, E_t as defined above, is equal to A_m , the number of successes in m flips. E[A_m] = p is the generalization error, E_g. Hence, the expression in defining the bound in Hoeffding’s Inequality can be written:

E_g >= E_t + e.

Now, letting d = exp(-2me²) where d > 0 we get the result ( R ) above. What this says is that with probability at least 1 – d,

E_t + sqrt(log(1/d)/(2m)) is an upper bound for the generalization error.

A couple of things to notice about the result are:

The bound is the sum of two terms, the second of which doesn't depend on the distribution of the X_i
Hoeffding’s Inequality applies to any random variable in [0,1], so presumably this would make the bound robust with respect to the binomial assumption.

The really big assumption is the one that slipped in at the very beginning, that the training samples and test samples are random draws from the same distribution. This is something that would be difficult to verify in practice, but serves the purpose of encouraging one to think about the underlying distributions that might govern the data in a classification problem.

The plot below provides a simple visualization of the result. It was generated by simulating draws from a binomial with a little bit of noise added, where p = .4 and d = .1 This represents a classifier that does a little better than guessing. The red vertical line marks the value of the generalization error among the simulated upper bounds. The green lines focus on the 10% quantile.

As the result predicts, a little more than 90% of the upper bounds are larger than p.

And here is the code.

m <- 10000                        # number of samples
p <- .4                           # Probability of incorrect classification
N <- 1000                         # Number of simulated sampling experiments
 
delta <- .1                       # 1 - delta is the upper probability bound
 
gamma <- sqrt(log(1/delta)/(2*m)) # Calculate constant term to upper bound
 
Am <- vector("numeric",N)          # Allocate vector
 
for(i in 1:N){
  Am[i] <- sum(rbinom(m,1,p) + rnorm(m,0,.1))/m   # Simulate training error
}
u_bound <- Am + gamma             # Calculate upper bounds
 
plot(ecdf(u_bound),
     xlab="Upper Bound",
     col = "blue",
     lwd = 3,
     main = "Empir Dist (Binomial with noise)")
abline(v=.4, col = "red")
abline(h=.1, col = "green")
abline(v= quantile(u_bound,.1),col="green")

So, what does all this mean in practice? The result clearly pertains to an idealized situation, but to my mind it provides a rough guide as to how low you ought to be able to reduce your testing error. In some cases, it may even signal that you might want to look for better data.

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

A Machine Learning Result

Related

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)