Regression analysis is the process of building a linear or non-linear fit for one or more continuous target variables. That's right: there can be more than one target variable. Multi-output machine learning problems are more common in classification than in regression; in classification, categorical target variables are often encoded into a multi-output representation. In my professional experience, about 90% of data science regression problems have a single target variable, while the rest require fitting multiple target variables. Some applications of multi-output regression are in forecasting and predictive maintenance.
In the next few sections, let me walk you through how to solve multi-output regression problems using sklearn.
1. Import packages
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor
from sklearn.ensemble import RandomForestRegressor
There are a few packages that we will be loading here:
- make_regression: to create a regression dataset
- train_test_split: to split the data into train and test
- MultiOutputRegressor: to create a multioutput regressor
- RandomForestRegressor: to build a random forest regressor model
2. Create a multi-output regressor
x, y = make_regression(n_targets=3)
Here we are creating a random dataset for a regression problem. We create three target variables and keep the rest of the parameters at their defaults.
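To see what we are working with, we can inspect the shapes of the generated features and targets. With sklearn's defaults, `make_regression` produces 100 samples and 100 features:

```python
from sklearn.datasets import make_regression

# create a random regression dataset with three target variables
x, y = make_regression(n_targets=3)

# by default make_regression generates 100 samples with 100 features
print(x.shape)  # (100, 100)
print(y.shape)  # (100, 3)
```

The second dimension of `y` confirms we have three targets per observation.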
3. Split data into train and test
The following block of code will split our features and target variables into train and test sets. Our train set will contain 70% of the observations and the test set the remaining 30%.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=42)
4. Model building
Next, we can train our multi-output regression model using the code below.
According to the sklearn documentation, “This strategy consists of fitting one regressor per target. This is a simple strategy for extending regressors that do not natively support multi-target regression.”
clf = MultiOutputRegressor(RandomForestRegressor(max_depth=2, random_state=0))
clf.fit(x_train, y_train)
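As the quoted documentation suggests, the wrapper is mainly for estimators that cannot handle multiple targets on their own. RandomForestRegressor actually supports multi-output targets natively, so as a quick sketch, you could also fit it directly without the wrapper:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

x, y = make_regression(n_targets=3, random_state=0)

# a random forest can fit all three targets directly, without MultiOutputRegressor
rf = RandomForestRegressor(max_depth=2, random_state=0)
rf.fit(x, y)

print(rf.predict(x[:1]).shape)  # (1, 3): one prediction per target
```

The MultiOutputRegressor wrapper is still useful when your base estimator (for example, many gradient boosting implementations) only supports a single target.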
5. Prediction and scoring
The following block of code will make a prediction for the first test observation and calculate the coefficient of determination (R²) of the predictions on the test set. Since the dataset is randomly generated, we cannot expect a good R² value.
clf.predict(x_test[[0]])
clf.score(x_test, y_test, sample_weight=None)
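Note that `score` returns a single averaged R² across all targets. If you want to see how the model does on each target individually, a minimal sketch using `r2_score` with `multioutput="raw_values"` looks like this:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

x, y = make_regression(n_targets=3, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=42)

clf = MultiOutputRegressor(RandomForestRegressor(max_depth=2, random_state=0))
clf.fit(x_train, y_train)

# one R2 value per target instead of a single averaged score
scores = r2_score(y_test, clf.predict(x_test), multioutput="raw_values")
print(scores)  # array of three values, one per target
```

Per-target scores can reveal when the model fits some targets much better than others, which a single averaged score would hide.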
Putting everything together
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor
from sklearn.ensemble import RandomForestRegressor

# create regression data
x, y = make_regression(n_targets=3)

# split into train and test data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=42)

# train the model
clf = MultiOutputRegressor(RandomForestRegressor(max_depth=2, random_state=0))
clf.fit(x_train, y_train)

# predictions
clf.predict(x_test)
As you can see, with just a few lines of code one can easily build a multi-output regression model using sklearn. In my next tutorial, I will show you how to do multi-output regression using deep learning and the Keras package.
Hope you enjoyed this tutorial. Feel free to drop your comments below.