[This article was first published on r – Appsilon | End to End Data Science Solutions, and kindly contributed to R-bloggers.]

## Logistic Regression with R

Logistic regression is one of the most fundamental algorithms from statistics, commonly used in machine learning. It’s not used to produce SOTA models but can serve as an excellent baseline for binary classification problems.

Interested in machine learning for beginners? Check our detailed guide on Linear Regression with R.

Today you’ll learn how to implement the logistic regression model in R and also improve your data cleaning, preparation, and feature engineering skills.

## Introduction to Logistic Regression

Logistic regression is an algorithm used both in statistics and machine learning. Machine learning engineers frequently use it as a baseline model – a model which other algorithms have to outperform. It’s also commonly used first because it’s easily interpretable.

In a way, logistic regression is similar to linear regression – but it is not used to predict continuous values (such as age or height). Instead, it’s used to predict binary classes – has the client churned or not, has the person survived or not, or is the tumor malignant or benign. To simplify, logistic regression is used to predict the Yes/No type of response.

That’s a simplification, though. Logistic regression gives the probability that the response is Yes, and we then use a predefined threshold to assign classes. For example, if the probability is greater than 0.5, the assigned class is Yes, and otherwise No. Evaluating performance at different thresholds lets you trade false positives against false negatives, depending on which error matters more to you.
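The probability-then-threshold idea can be sketched in a few lines of base R. This is only an illustration: the values in `z` are made-up linear-predictor scores, and `logistic()` is a hypothetical helper name (base R ships the same function as `plogis()`).

```r
# The logistic (sigmoid) function maps any real number into (0, 1).
logistic <- function(z) 1 / (1 + exp(-z))

# Hypothetical linear-predictor values for three observations.
z <- c(-2, 0, 2)
p <- logistic(z)  # same values as base R's plogis(z)

# Assign classes with the 0.5 threshold from the text.
classes <- ifelse(p > 0.5, "Yes", "No")
print(data.frame(z, p = round(p, 3), classes))
```

Note that a probability of exactly 0.5 is not greater than the threshold, so it is assigned to No here.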

Logistic regression can work with both continuous and categorical data. This means your dataset can contain any sort of data, as long as it is adequately prepared.

You can use logistic regression models to examine feature importances. You’ll see how to do it later through hands-on examples. Knowing which features are important enables you to build simpler, lower-dimensional models. As a result, the predictions and the model are more interpretable.

And that’s all you need for a basic intuition behind logistic regression. Let’s get our hands dirty next.

One of the best-known binary classification datasets is the Titanic dataset. The goal is to predict whether the passenger has survived the accident based on many input features, such as age, passenger class, and others.

You don’t have to download the dataset, as there’s a dedicated package for it in R. You’ll use only the training dataset throughout the article, so you don’t have to do the preparation and feature engineering twice.

The following snippet loads in every required package, stores the training dataset to a variable called `df`, and prints its structure:
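A minimal sketch of the loading step, assuming the dataset comes from the CRAN `titanic` package, which exposes the training data as `titanic_train`. Only that one package is used here; the other packages the article relies on are left out, and the `requireNamespace()` guard is an addition so the code does not fail when the package is missing.

```r
# Assumption: the data comes from the CRAN package `titanic`.
if (requireNamespace("titanic", quietly = TRUE)) {
  df <- titanic::titanic_train  # training subset only, as in the text
  str(df)                       # print the structure of the data frame
} else {
  message("Install the data first: install.packages('titanic')")
}
```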

Here’s the corresponding structure:

Image 1 – Titanic dataset structure

There’s a lot of work required. For example, missing values in some columns are marked with empty strings instead of `NA`. This issue is easy to fix, and once you fix it, you can plot a missingness map. It will show you where the missing values are located:
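The empty-string fix can be sketched in base R as follows. The `toy` data frame is a made-up stand-in for a few Titanic columns, not the real data, and the per-column counts printed at the end are a plain-text alternative to the map.

```r
# Toy stand-in for a few Titanic columns (values are illustrative only).
toy <- data.frame(
  Age      = c(22, NA, 26, 35),
  Cabin    = c("", "C85", "", "C123"),
  Embarked = c("S", "C", "", "S"),
  stringsAsFactors = FALSE
)

# Replace empty strings with NA in every character column.
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], function(x) {
  x[which(x == "")] <- NA
  x
})

# Count missing values per column.
missing_counts <- colSums(is.na(toy))
print(missing_counts)
```

One way to draw the map itself is `missmap()` from the `Amelia` package, called on the cleaned data frame.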

The missingness map is shown below:

Image 2 – Missingness map

The first three columns of the map (`Cabin`, `Age`, and `Embarked`) contain missing data. You’ll see how to fix that in the next section.

## Feature Engineering and Handling Missing Data

You need feature engineering because the default features either aren’t formatted correctly or don’t display information in the best way. Just take a look at the `Name` column in Image 1 – an algorithm can’t process it in the default format.

But this feature is quite useful. You can extract the passenger title from it (e.g., Miss, Sir, and so on). As a final step, you can check if a passenger has a rare title (e.g., Dona, Lady, Major, and so on).

The following snippet does just that:
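A minimal sketch of the title extraction, assuming names follow the pattern "Surname, Title. Given names". The regular expression is a common idiom for this dataset rather than necessarily the one used originally, and `rare_titles` contains only the examples named above.

```r
# A few sample names in the "Surname, Title. Given names" format.
passenger_names <- c(
  "Braund, Mr. Owen Harris",
  "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
  "Heikkinen, Miss. Laina",
  "Peuchen, Major. Arthur Godfrey"
)

# Drop everything up to ", " and everything from the first "." onward.
title <- gsub("(.*, )|(\\..*)", "", passenger_names)

# Collapse uncommon titles into a single "Rare" level (illustrative subset).
rare_titles <- c("Dona", "Lady", "Major")
title[title %in% rare_titles] <- "Rare"
print(title)
```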

You can see all of the unique titles we have now in the following image:

Image 3 – Unique passenger titles

You can apply similar logic to the `Cabin` column. It’s useless by default but can be used to extract the deck number. Here’s how:
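One simple way to do this, assuming the deck is encoded as the first character of the cabin string, is `substr()`. The cabin values below are illustrative.

```r
# Illustrative cabin values; NA marks a passenger with no recorded cabin.
cabin <- c("C85", NA, "E46", "B57 B59 B63 B66")

# The deck is taken to be the first character; NA stays NA.
deck <- substr(cabin, 1, 1)
print(unique(deck))
```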

The unique deck numbers are shown in the following image:

Image 4 – Unique deck numbers

You’ve now done some feature engineering, which means the original columns can be deleted. The snippet below deletes these two, but also `PassengerId` and `Ticket`, because these provide no meaningful information:
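Dropping the columns can be sketched with plain base R indexing. The two-row `toy` data frame is made up, and `Title` and `Deck` are assumed names for the engineered features.

```r
# Made-up two-row data frame with the original and the engineered columns.
toy <- data.frame(
  PassengerId = 1:2,
  Survived    = c(0, 1),
  Name        = c("A", "B"),
  Ticket      = c("T1", "T2"),
  Cabin       = c(NA, "C85"),
  Title       = c("Mr", "Mrs"),
  Deck        = c(NA, "C")
)

# Remove the columns that are no longer needed.
drop_cols <- c("PassengerId", "Name", "Ticket", "Cabin")
toy <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]
print(names(toy))
```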

Finally, you can shift the focus to the missing values. Two approaches will be used – mode and MICE imputation.

You’ll use mode (most frequent value) imputation on the `Embarked` column because it contains only a couple of missing values. MICE imputation will require a bit more work. Converting categorical variables to factors is a must, and the imputation is done by leaving the target variable out.

Here’s the entire code snippet for imputing missing values:
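A partial sketch covering only the mode-imputation half, written in base R. `mode_value()` is a hypothetical helper, and the `embarked` vector is made up for illustration.

```r
# Hypothetical helper: most frequent non-missing value of a vector.
mode_value <- function(x) {
  counts <- table(x)  # table() ignores NA by default
  names(counts)[which.max(counts)]
}

# Made-up port-of-embarkation codes with two missing entries.
embarked <- c("S", "C", NA, "S", "Q", NA, "S")

# Fill the gaps with the most frequent value ("S" here).
embarked[is.na(embarked)] <- mode_value(embarked)
print(embarked)
```

The MICE half is not sketched here. With the `mice` package it would typically go through `mice()` on the feature columns (target left out, categoricals converted to factors) followed by `complete()` to pull out the imputed data.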

As a sanity check, you can plot the density plots of continuous variables before and after imputation. Doing so will show you if the imputation skewed the distribution or not. `Age` is the only continuous variable, so let’s make a before and after density plot:

The visualization is shown below:

Image 5 – Density plot of Age before and after imputation

Some changes are visible, sure, but the overall distribution stayed roughly the same. There were a lot of missing values in this variable, so some changes in distribution are inevitable.

Finally, you can assign the imputation results to the original dataset and convert `Deck` to factor:

You now have everything needed to start with predictive modeling – so let’s do that next.

## Modeling

Before proceeding with modeling, you’ll need to split your dataset into training and testing subsets. These are available from the start with the Titanic dataset, but you’ll have to do the split manually as we’ve only used the training dataset.

The following snippet splits the data randomly in a 70:30 ratio. Don’t forget to set the seed value to 42 if you want the same split:
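A minimal sketch of a 70:30 random split using base R’s `sample()`. To stay self-contained it expands R’s built-in `Titanic` contingency table into one row per person. That table is a different, coarser dataset than the one prepared above, so the row counts and the exact split will not match the article’s.

```r
# Expand the built-in Titanic table into one row per person.
# Assumption: this stands in for the prepared `df`; it is NOT the same data.
counts <- as.data.frame(Titanic)
passengers <- counts[rep(seq_len(nrow(counts)), counts$Freq),
                     c("Class", "Sex", "Age", "Survived")]

set.seed(42)  # same seed value as in the text
n <- nrow(passengers)
train_idx <- sample(seq_len(n), size = floor(0.7 * n))

train_set <- passengers[train_idx, ]
test_set  <- passengers[-train_idx, ]
print(c(train = nrow(train_set), test = nrow(test_set)))
```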

You can now train the model on the training set. In R, logistic regression is fitted with the `glm()` function. The formula syntax is the same as for linear regression: put the target variable on the left and the features on the right, separated by the `~` sign. If you want to use all features, put a dot (.) instead of feature names.

Also, don’t forget to specify `family = "binomial"`, as this is required for logistic regression:
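A minimal sketch of the call, fitted for brevity on all rows of R’s built-in `Titanic` table (expanded to one row per person) rather than on the training split of the prepared data. The column names therefore differ from the article’s.

```r
# Stand-in data: the built-in Titanic table, one row per person.
counts <- as.data.frame(Titanic)
passengers <- counts[rep(seq_len(nrow(counts)), counts$Freq),
                     c("Class", "Sex", "Age", "Survived")]

# Survived is a factor with levels "No" and "Yes" (in that order),
# so glm() models the probability of "Yes".
model <- glm(Survived ~ ., data = passengers, family = "binomial")
print(summary(model))
```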

Here’s the summary of the model:

Image 6 – Summary of a logistic regression model

The most interesting thing here is the P-values, displayed in the `Pr(>|z|)` column. A P-value is the probability of seeing an estimate at least this far from zero if the variable had no real effect on the outcome. It’s common to use a 5% significance threshold, so if a P-value is 0.05 or below, the variable is considered statistically significant.
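The P-values can also be pulled out of the summary programmatically. For a binomial `glm()`, the coefficient table names that column `Pr(>|z|)`. The sketch below again uses the built-in `Titanic` table as stand-in data, so the variable names are not the article’s.

```r
# Stand-in data and model (built-in Titanic table, one row per person).
counts <- as.data.frame(Titanic)
passengers <- counts[rep(seq_len(nrow(counts)), counts$Freq),
                     c("Class", "Sex", "Age", "Survived")]
model <- glm(Survived ~ ., data = passengers, family = "binomial")

# Coefficient table: Estimate, Std. Error, z value, Pr(>|z|).
coefs <- coef(summary(model))
p_values <- coefs[, "Pr(>|z|)"]

# Terms that pass the 5% significance threshold.
significant <- names(p_values)[p_values <= 0.05]
print(significant)
```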

You can also explore feature importances explicitly, with the `varImp()` function. Here’s how to obtain the ten most important features, sorted:
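`varImp()` comes from the `caret` package, whose documentation describes importance for linear models as the absolute value of each coefficient’s test statistic. The base R sketch below assumes that same definition instead of calling `caret`, and it uses the built-in `Titanic` table as stand-in data, so it yields fewer than ten terms.

```r
# Stand-in data and model (built-in Titanic table, one row per person).
counts <- as.data.frame(Titanic)
passengers <- counts[rep(seq_len(nrow(counts)), counts$Freq),
                     c("Class", "Sex", "Age", "Survived")]
model <- glm(Survived ~ ., data = passengers, family = "binomial")

# Importance taken as |z value| per term, intercept excluded, largest first.
z_values <- coef(summary(model))[-1, "z value"]
importance <- sort(abs(z_values), decreasing = TRUE)
print(head(importance, 10))
```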

The features are shown below:

Image 7 – Feature importances of a logistic regression model

You’ve built and explored the model so far, but there’s no use in it yet. The next section shows you how to generate predictions on previously unseen data and evaluate the model.

## Generating Predictions

As mentioned in the introduction section, logistic regression is based on probabilities. If the probability is greater than some threshold (commonly 0.5), you can treat this instance as positive.

The most common way of evaluating machine learning models is by examining the confusion matrix. It’s a square matrix showing you how many predictions were correct (true positives and true negatives), how many were negative but classified as positive (false positives), and how many were positive but classified as negative (false negatives). In our case, positive refers to a passenger who survived the accident.

The snippet below shows how to obtain probabilities and classes, and how to print the confusion matrix:
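A minimal end-to-end sketch using only base R: `predict()` with `type = "response"` returns probabilities, and `table()` builds the confusion matrix. It runs on the built-in `Titanic` table as stand-in data, so its numbers will differ from the results reported in this section.

```r
# Stand-in data: the built-in Titanic table, one row per person.
counts <- as.data.frame(Titanic)
passengers <- counts[rep(seq_len(nrow(counts)), counts$Freq),
                     c("Class", "Sex", "Age", "Survived")]

# 70:30 split and model fit on the training part.
set.seed(42)
n <- nrow(passengers)
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train_set <- passengers[train_idx, ]
test_set  <- passengers[-train_idx, ]
model <- glm(Survived ~ ., data = train_set, family = "binomial")

# Probabilities of "Yes", then classes via the 0.5 threshold.
probs <- predict(model, newdata = test_set, type = "response")
pred <- factor(ifelse(probs > 0.5, "Yes", "No"), levels = c("No", "Yes"))

# Confusion matrix: rows are actual classes, columns are predictions.
cm <- table(Actual = test_set$Survived, Predicted = pred)
print(cm)
accuracy <- sum(diag(cm)) / sum(cm)
print(accuracy)
```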

And here are the corresponding results:

Image 8 – Confusion matrix of a logistic regression model

221 of 268 records were classified correctly, resulting in an accuracy of 82.5%. There are 26 false positives and 21 false negatives. You can play around with the classification threshold (currently 0.5) and see how these misclassifications change.
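The accuracy figure follows directly from the counts above, and the effect of the threshold can be illustrated on a tiny example. The `probs` and `actual` vectors below are made up for illustration, and `errors_at()` is a hypothetical helper.

```r
# Accuracy from the figures in the text: 268 records, 26 FP, 21 FN.
accuracy <- (268 - 26 - 21) / 268
print(round(100 * accuracy, 1))

# Made-up probabilities and true labels (1 = survived).
probs  <- c(0.9, 0.7, 0.55, 0.4, 0.3, 0.1)
actual <- c(1, 1, 0, 1, 0, 0)

# Hypothetical helper: count both error types at a given threshold.
errors_at <- function(threshold) {
  pred <- as.integer(probs > threshold)
  c(threshold = threshold,
    false_pos = sum(pred == 1 & actual == 0),
    false_neg = sum(pred == 0 & actual == 1))
}

# One row per threshold.
by_threshold <- t(sapply(c(0.35, 0.5, 0.6), errors_at))
print(by_threshold)
```

In this toy example, lowering the threshold removes the false negative, while raising it removes the false positive.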

And that’s more than enough to get you started with logistic regression and classification in general. Let’s wrap things up in the next section.

## Conclusion

Logistic regression is often used as a baseline binary classification model. More sophisticated algorithms (tree-based or neural networks) have to outperform it to be useful.

Today you’ve learned how to approach data cleaning, preparation, and feature engineering in a way that is hopefully easy to follow. You’ve also learned how to apply binary classification modeling with logistic regression, and how to evaluate classification models.

If you want to implement machine learning in your organization, you can always reach out to Appsilon for help. 