by Joseph Rickert
I found my way into data science and machine learning relatively late in my career. When I began reading papers on supervised learning I was delighted to find that good old logistic regression was considered a “go to” classifier. This was like learning that an old friend was admired for an achievement I didn’t know anything about. After a couple of comfortable experiences like this, I thought I would fit in quite nicely with this new (to me) tribe of data analysts studying pattern recognition and natural language processing. It took some time, however, before I realized that they were working with a conceptual framework that was a little different from my statistics worldview. You might say it’s all probabilistic and statistical reasoning, but different problems and different tools lead to mindsets that shape and bias a person’s thinking.
For example, consider the following list of classifiers: Decision Trees, Generalized Boosted Models, Logistic Regression, Naive Bayes, Neural Networks, Random Forests and Support Vector Machines.
Some of these are base classifiers, and others are ensemble models, but one of them is conceptually different from the others. The odd duck here is Naive Bayes. It’s the only generative model in the list. The others are examples of discriminative models. This is not a distinction that is easy to stumble across in the statistics literature, but it is fundamental to the machine-learning mindset, and a helpful modeling idea.
The basic conceptual difference between generative and discriminative models hinges on the underlying probability inference structure. Discriminative models learn P(Y | X), the conditional relationship between the target variable, Y, and the features, X, directly from the data. This is the way ordinary least squares regression works, and it is the kind of inference pattern that gets fixed in the minds of statistics students very early in their training. It is a direct approach to sorting out the relationships among variables. One variable (occasionally more) is the dependent, or target, variable, and the others are the independent variables, or features. The latter are treated as given or fixed, at least for the purposes of the analysis.
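To make the discriminative idea concrete, here is a minimal sketch in base R. The data are simulated, and the variable names and coefficients are made up purely for illustration. The point is only that the fitted model describes P(Y | X) and says nothing about how X itself is distributed.

```r
# Sketch only: simulated data; names and coefficients are illustrative assumptions.
set.seed(42)
n <- 500
x <- rnorm(n)                                          # a single feature
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 2 * x))  # binary target

# Logistic regression estimates P(Y = 1 | X = x) directly from the data.
fit <- glm(y ~ x, family = binomial)

# Estimated conditional probabilities at three new feature values
new_x <- data.frame(x = c(-1, 0, 1))
p_hat <- predict(fit, newdata = new_x, type = "response")
print(round(p_hat, 3))
```

Notice that nothing in the fitted object could be used to simulate new values of x; the features are simply taken as given.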
Generative models, on the other hand, aim for a complete probabilistic description of the data. With these models, the goal is to construct the joint probability distribution P(X, Y), either directly or by first computing P(X | Y) and P(Y), and then to infer, via Bayes’ rule, the conditional probabilities P(Y | X) required to classify new data. This approach generally requires more sophisticated probabilistic thinking than a regression mentality demands, but it provides a complete model of the probabilistic structure of the data. Knowing the joint distribution enables you to generate the data; hence, Naive Bayes is a generative model.
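A hand-rolled sketch in base R may help show how the pieces fit together. It assumes a tiny, made-up data set with two binary features, and the helper names joint() and posterior() are my own, not functions from any package. It estimates P(Y) and P(X | Y), multiplies them under the "naive" conditional independence assumption to get the joint, and then applies Bayes’ rule.

```r
# Sketch only: a tiny made-up data set with two binary features.
d <- data.frame(
  y  = c("a", "a", "a", "a", "b", "b", "b", "b", "b", "b"),
  x1 = c(1, 1, 1, 0, 0, 0, 0, 1, 0, 0),
  x2 = c(1, 0, 1, 1, 0, 1, 0, 0, 0, 0),
  stringsAsFactors = FALSE
)
classes <- sort(unique(d$y))

# The two ingredients of the generative model
prior <- sapply(classes, function(k) mean(d$y == k))        # P(Y)
p_x1  <- sapply(classes, function(k) mean(d$x1[d$y == k]))  # P(X1 = 1 | Y)
p_x2  <- sapply(classes, function(k) mean(d$x2[d$y == k]))  # P(X2 = 1 | Y)

# Joint P(X1 = x1, X2 = x2, Y = k) for every class k, assuming the
# features are independent given the class (the "naive" assumption)
joint <- function(x1, x2) {
  prior * dbinom(x1, 1, p_x1) * dbinom(x2, 1, p_x2)
}

# Bayes' rule: P(Y | X) = P(X, Y) / sum over classes of P(X, Y)
posterior <- function(x1, x2) {
  j <- joint(x1, x2)
  j / sum(j)
}

print(posterior(1, 1))

# The joint is a complete distribution: it sums to 1 over all (x1, x2, y)
total <- sum(joint(0, 0), joint(0, 1), joint(1, 0), joint(1, 1))
print(total)
```

Because the model assigns a probability to every combination of features and class, it describes the whole data set, not just the classification boundary. No smoothing is applied here, so a real implementation would need to handle feature values never observed within a class.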
Once you know what you are looking for, it is not difficult to find excellent online tutorials demonstrating the differences between generative and discriminative models. For example, Stanford professors Christopher Manning and Andrew Ng have both produced short videos that nicely characterize these models. And for a simple explanation of the Naive Bayes algorithm and how it unfolds as a generative model, I very much enjoyed mathematicalmonk’s colored marker video.
In his Eight to Late blog, Kailash Awati thoroughly develops a classification example using Naive Bayes that is worth a look not only because of the details on data preparation and model building he provides, but also because of the care he takes to explain the underlying theory. Kailash uses the Naive Bayes classifier in the mysteriously named e1071 package and the HouseVotes84 data set from the mlbench package. (The klaR package from the University of Dortmund also provides a Naive Bayes classifier.) I won’t reproduce Kailash’s example here, but I will use his imputation function later in this post.
First, however, let’s follow up on the idea of using a Naive Bayes model to produce synthetic data.