Series of Azure Databricks posts:
- Dec 01: What is Azure Databricks
- Dec 02: How to get started with Azure Databricks
- Dec 03: Getting to know the workspace and Azure Databricks platform
- Dec 04: Creating your first Azure Databricks cluster
- Dec 05: Understanding Azure Databricks cluster architecture, workers, drivers and jobs
- Dec 06: Importing and storing data to Azure Databricks
- Dec 07: Starting with Databricks notebooks and loading data to DBFS
- Dec 08: Using Databricks CLI and DBFS CLI for file upload
- Dec 09: Connect to Azure Blob storage using Notebooks in Azure Databricks
- Dec 10: Using Azure Databricks Notebooks with SQL for Data engineering tasks
- Dec 11: Using Azure Databricks Notebooks with R Language for data analytics
- Dec 12: Using Azure Databricks Notebooks with Python Language for data analytics
- Dec 13: Using Python Databricks Koalas with Azure Databricks
- Dec 14: From configuration to execution of Databricks jobs
- Dec 15: Databricks Spark UI, Event Logs, Driver logs and Metrics
- Dec 16: Databricks experiments, models and MLFlow
In the past couple of days we looked into configurations and infrastructure and today it is again time to do an analysis, let’s call it end-to-end analysis using R or Python or SQL.
1.Notebook, Cluster and Data
Create new notebook, I am calling my Day17_Analysis and selecting Python as kernel language. Attach cluster to your notebook and start the cluster (if it is not yet running). Import data using SparkR:
%r library(SparkR) data_r <- read.df("/FileStore/Day16_wine_quality.csv", source = "csv", header="true") display(data_r) data_r <- as.data.frame(data_r)
And we can also do the same for Python:
import pandas as pd data_py = pd.read_csv("/dbfs/FileStore/Day16_wine_quality.csv", sep=';')
We can use also Python to insert the data and get the dataset insight.
import matplotlib.pyplot as plt import seaborn as sns data_py = pd.read_csv("/dbfs/FileStore/Day16_wine_quality.csv", sep=',') data_py.info()
Importing also all other packages that will be relevant in following steps:
from sklearn.model_selection import train_test_split from sklearn.preprocessing import StandardScaler from sklearn.linear_model import LogisticRegression from sklearn.metrics import confusion_matrix from sklearn.metrics import accuracy_score from sklearn.ensemble import RandomForestClassifier from sklearn.neighbors import KNeighborsClassifier from sklearn.naive_bayes import GaussianNB from sklearn.svm import SVC
So let’s continue using Python. You can get the sense of the dataset by using Python describe function:
And also work with duplicate values (remove them) and missing values (remove them or replace them with mean value):
#remove duplicates sum(data_py.duplicated()) data_py.drop_duplicates(inplace=True) #remove rows with empty values data_py.isnull().sum(axis=0) data_py.dropna(axis=0, how='any', inplace=True) #fill the missing values with mean data_py.fillna(0, inplace=True) data_py['quality'].fillna(data_py['quality'].mean(), inplace=True) data_py.apply(lambda x: x.fillna(x.mean(), inplace=True), axis=0)
You can also find and filter out the outlier by using IQR – Interquartile rang:
Q1 = data_py.quantile(0.25) Q3 = data_py.quantile(0.75) IQR = Q3 - Q1 data_py2 = data_py[~((data_py < (Q1 - 1.5 * IQR)) |(data_py > (Q3 + 1.5 * IQR))).any(axis=1)] #print(data_py2.shape) print(data_py2 < (Q1 - 1.5 * IQR)) |(data_py2 > (Q3 + 1.5 * IQR))
We can check the distribution of some variables and best way is to show it with graphs:
fig, axs = plt.subplots(1,5,figsize=(20,4),constrained_layout=True) data_py['fixed acidity'].plot(kind='hist', ax=axs) data_py['pH'].plot(kind='hist', ax=axs) data_py['quality'].plot(kind='line', ax=axs) data_py['alcohol'].plot(kind='hist', ax=axs) data_py['total sulfur dioxide'].plot(kind='hist', ax=axs)
Adding also a plot of counts per quality:
counts = data_py.groupby(['quality']).count()['pH'] # pH or anything else - just for count counts.plot(kind='bar', title='Quantity by Quality') plt.xlabel('Quality', fontsize=18) plt.ylabel('Count', fontsize=18)
Adding some boxplots will also give a great understanding of the data and statistics of particular variable. So, let’s take pH and Quality
sns.boxplot(x='quality',y='pH',data=data_py,palette='GnBu_d') plt.title("Boxplot - Quality and pH") plt.show()
or quality with fixed acidity:
sns.boxplot(x="quality",y="fixed acidity",data=data_py,palette="coolwarm") plt.title("Boxplot of Quality and Fixed Acidity") plt.show()
And also add some correlation among all the variables in dataset:
plt.figure(figsize=(10,10)) sns.heatmap(data_py.corr(),annot=True,linewidth=0.5,center=0,cmap='coolwarm') plt.show()
We will split the dataset into Y-set – our predict variable and X-set – all the other variables. After that, we will do splitting of the y-set and x-set into train and test subset.
X = data_py.iloc[:,:11].values Y = data_py.iloc[:,-1].values #Splitting the dataset into training and test set X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.25,random_state=0)
We will also to the feature scaling
#Feature scaling from sklearn.preprocessing import StandardScaler sc = StandardScaler() X_train = sc.fit_transform(X_train) X_test = sc.fit_transform(X_test)
And get the general understanding of explained variance:
# Applying PCA from sklearn.decomposition import PCA pca = PCA(n_components = 3) X_train = pca.fit_transform(X_train) X_test = pca.transform(X_test) explained_variance = pca.explained_variance_ratio_
You will see, that three variables together contribute more than 50% of all variance of the model.
Based on the train and test test, let us now fit the different type of model into the dataset. Using Logistic regression:
#Fitting Logistic Regression into dataset lr_c=LogisticRegression(random_state=0) lr_c.fit(X_train,Y_train) lr_pred=lr_c.predict(X_test) lr_cm=confusion_matrix(Y_test,lr_pred) print("The accuracy of LogisticRegression is:",accuracy_score(Y_test, lr_pred))
and create a confusion matrix to see the correctly predicted values per category.
#Making confusion matrix print(lr_cm)
I will repeat this for the following algorithms: SVM, RandomForest, KNN, Naive Bayes and I will make a comparison at the end.
#Fitting SVM into dataset cl = SVC(kernel="rbf") cl.fit(X_train,Y_train) svm_pred=cl.predict(X_test) svm_cm = confusion_matrix(Y_test,cl.predict(X_test)) print("The accuracy of SVM is:",accuracy_score(Y_test, svm_pred))
#Fitting Randomforest into dataset rdf_c=RandomForestClassifier(n_estimators=10,criterion='entropy',random_state=0) rdf_c.fit(X_train,Y_train) rdf_pred=rdf_c.predict(X_test) rdf_cm=confusion_matrix(Y_test,rdf_pred) print("The accuracy of RandomForestClassifier is:",accuracy_score(rdf_pred,Y_test))
#Fitting KNN into dataset knn=KNeighborsClassifier(n_neighbors=3) knn.fit(X_train,Y_train) knn_pred=knn.predict(X_test) knn_cm=confusion_matrix(Y_test,knn_pred) print("The accuracy of KNeighborsClassifier is:",accuracy_score(knn_pred,Y_test))
and Naive Bayes
#Fitting Naive bayes into dataset gaussian=GaussianNB() gaussian.fit(X_train,Y_train) bayes_pred=gaussian.predict(X_test) bayes_cm=confusion_matrix(Y_test,bayes_pred) print("The accuracy of naives bayes is:",accuracy_score(bayes_pred,Y_test))
And the accuracy for all the model fitting is the following:
- LogisticRegression is: 0.4722502522704339
- SVM is: 0.48335015136226034
- KNeighborsClassifier is: 0.39455095862764883
- naives bayes is: 0.46316851664984865
It is clear which model would give improvements,
Tomorrow we will look into using Azure Data Factory with Azure Databricks.
Complete set of code and Notebooks will be available at the Github repository.
Happy Coding and Stay Healthy!