Advent of 2020, Day 17 – End-to-End Machine learning project in Azure Databricks


Series of Azure Databricks posts:

In the past couple of days we looked into configuration and infrastructure; today it is time to do an analysis again. Let's call it an end-to-end analysis using R, Python or SQL.

1. Notebook, Cluster and Data

Create a new notebook (I am calling mine Day17_Analysis) and select Python as the kernel language. Attach the notebook to your cluster and start the cluster (if it is not yet running). Import the data using SparkR:

%r
library(SparkR)

# read the CSV file stored in DBFS into a Spark DataFrame
data_r <- read.df("/FileStore/Day16_wine_quality.csv", source = "csv", header="true")

display(data_r)

# collect the Spark DataFrame into a local R data.frame
data_r <- as.data.frame(data_r)

And we can also do the same for Python:

import pandas as pd

# note the /dbfs prefix - pandas reads the file through the local DBFS mount
data_py = pd.read_csv("/dbfs/FileStore/Day16_wine_quality.csv", sep=';')

We can also use Python to import the data and get insight into the dataset.

import matplotlib.pyplot as plt
import seaborn as sns
# re-read the file (same separator as above) to refresh the pandas DataFrame
data_py = pd.read_csv("/dbfs/FileStore/Day16_wine_quality.csv", sep=';')
data_py.info()

Let's also import all the other packages that will be relevant in the following steps:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

2. Data wrangling

So let's continue using Python. You can get a sense of the dataset by using the pandas describe function:

data_py.describe()

We can also deal with duplicate values (remove them) and missing values (remove them or replace them with the mean value):

#remove duplicates
sum(data_py.duplicated())
data_py.drop_duplicates(inplace=True)

#remove rows with empty values
data_py.isnull().sum(axis=0)
data_py.dropna(axis=0, how='any', inplace=True)

#fill the missing values (alternative approaches - pick one)
#  replace all missing values with 0
data_py.fillna(0, inplace=True)
#  or replace missing quality values with the column mean
data_py['quality'].fillna(data_py['quality'].mean(), inplace=True)
#  or replace missing values in every numeric column with the column mean
data_py = data_py.fillna(data_py.mean())

You can also find and filter out outliers by using the IQR (interquartile range):

Q1 = data_py.quantile(0.25)
Q3 = data_py.quantile(0.75)
IQR = Q3 - Q1
data_py2 = data_py[~((data_py < (Q1 - 1.5 * IQR)) |(data_py > (Q3 + 1.5 * IQR))).any(axis=1)]
#print(data_py2.shape)
print(((data_py2 < (Q1 - 1.5 * IQR)) | (data_py2 > (Q3 + 1.5 * IQR))).any(axis=1))

3. Exploring the dataset

We can check the distribution of some of the variables, and the best way to show it is with graphs:

fig, axs = plt.subplots(1,5,figsize=(20,4),constrained_layout=True)

data_py['fixed acidity'].plot(kind='hist', ax=axs[0])
data_py['pH'].plot(kind='hist', ax=axs[1])
data_py['quality'].plot(kind='line', ax=axs[2])
data_py['alcohol'].plot(kind='hist', ax=axs[3])
data_py['total sulfur dioxide'].plot(kind='hist', ax=axs[4])

Let's also add a plot of counts per quality:

counts = data_py.groupby(['quality']).count()['pH']  # pH or anything else - just for count 
counts.plot(kind='bar', title='Quantity by Quality')
plt.xlabel('Quality', fontsize=18) 
plt.ylabel('Count', fontsize=18)

Adding some boxplots will also give a good understanding of the data and the statistics of a particular variable. So, let's take quality and pH:

sns.boxplot(x='quality',y='pH',data=data_py,palette='GnBu_d')
plt.title("Boxplot - Quality and pH")
plt.show()

or quality with fixed acidity:

sns.boxplot(x="quality",y="fixed acidity",data=data_py,palette="coolwarm")
plt.title("Boxplot of Quality and Fixed Acidity")
plt.show()

And let's also add a correlation heatmap of all the variables in the dataset:

plt.figure(figsize=(10,10))
sns.heatmap(data_py.corr(),annot=True,linewidth=0.5,center=0,cmap='coolwarm')
plt.show()

4. Modeling

We will split the dataset into the Y set (our target variable, quality) and the X set (all the other variables). After that, we will split both into training and test subsets.

# X: the first 11 columns (the physicochemical features), Y: the last column (quality)
X = data_py.iloc[:,:11].values
Y = data_py.iloc[:,-1].values

#Splitting the dataset into training and test set
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.25,random_state=0)

We will also do feature scaling:

#Feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
# transform the test set with the scaler fitted on the training set
X_test = sc.transform(X_test)

And get a general understanding of the explained variance:

# Applying PCA
from sklearn.decomposition import PCA
pca = PCA(n_components = 3)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
explained_variance = pca.explained_variance_ratio_

You will see that the first three principal components together explain more than 50% of the total variance.
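
To check this yourself, a quick sketch (not in the original post) is to print the individual and cumulative explained variance ratios:

import numpy as np

# share of the variance explained by each of the three principal components
print(explained_variance)

# cumulative share - with n_components = 3 this should exceed 0.5
print(np.cumsum(explained_variance))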

Based on the training and test sets, let us now fit different types of models to the dataset, starting with logistic regression:

#Fitting Logistic Regression into dataset
lr_c=LogisticRegression(random_state=0)
lr_c.fit(X_train,Y_train)
lr_pred=lr_c.predict(X_test)
lr_cm=confusion_matrix(Y_test,lr_pred)
print("The accuracy of  LogisticRegression is:",accuracy_score(Y_test, lr_pred))

and create a confusion matrix to see the correctly predicted values per category.

#Making confusion matrix
print(lr_cm)
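
As a small extra (a sketch not included in the original notebook), the same confusion matrix can also be visualized as a seaborn heatmap:

#visualize the logistic regression confusion matrix
plt.figure(figsize=(6,5))
sns.heatmap(lr_cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted quality')
plt.ylabel('Actual quality')
plt.title('Logistic Regression - Confusion Matrix')
plt.show()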

I will repeat this for the following algorithms: SVM, random forest, KNN and naive Bayes, and I will make a comparison at the end.

SVM

#Fitting SVM into dataset
cl = SVC(kernel="rbf")
cl.fit(X_train,Y_train)
svm_pred=cl.predict(X_test)
svm_cm = confusion_matrix(Y_test,cl.predict(X_test))
print("The accuracy of  SVM is:",accuracy_score(Y_test, svm_pred))

RandomForest

#Fitting Randomforest into dataset
rdf_c=RandomForestClassifier(n_estimators=10,criterion='entropy',random_state=0)
rdf_c.fit(X_train,Y_train)
rdf_pred=rdf_c.predict(X_test)
rdf_cm=confusion_matrix(Y_test,rdf_pred)
print("The accuracy of RandomForestClassifier is:",accuracy_score(rdf_pred,Y_test))

KNN

#Fitting KNN into dataset
knn=KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train,Y_train)
knn_pred=knn.predict(X_test)
knn_cm=confusion_matrix(Y_test,knn_pred)
print("The accuracy of KNeighborsClassifier is:",accuracy_score(knn_pred,Y_test))

and Naive Bayes

#Fitting Naive bayes into dataset
gaussian=GaussianNB()
gaussian.fit(X_train,Y_train)
bayes_pred=gaussian.predict(X_test)
bayes_cm=confusion_matrix(Y_test,bayes_pred)
print("The accuracy of naives bayes is:",accuracy_score(bayes_pred,Y_test))

And the accuracies of all the fitted models are the following:

  • LogisticRegression is: 0.4722502522704339
  • SVM is: 0.48335015136226034
  • KNeighborsClassifier is: 0.39455095862764883
  • naives bayes is: 0.46316851664984865

From these accuracies it is clear which model performs best and where there is still room for improvement.
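
To make that comparison explicit, one possible sketch (not part of the original notebook) is to collect the accuracies into a small pandas DataFrame and sort them:

#collect the accuracies of all fitted models into one table
results = pd.DataFrame({
    'model': ['LogisticRegression', 'SVM', 'RandomForest', 'KNN', 'NaiveBayes'],
    'accuracy': [accuracy_score(Y_test, lr_pred),
                 accuracy_score(Y_test, svm_pred),
                 accuracy_score(Y_test, rdf_pred),
                 accuracy_score(Y_test, knn_pred),
                 accuracy_score(Y_test, bayes_pred)]
})
print(results.sort_values('accuracy', ascending=False))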

Tomorrow we will look into using Azure Data Factory with Azure Databricks.

The complete set of code and notebooks will be available in the GitHub repository.

Happy Coding and Stay Healthy!
