Commercial data analytics: An economic view on the data science methods

October 3, 2018

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

The most in-demand skill in data science is statistical regression analysis, followed by clustering methods, with visualization the third most required skill. So if a data scientist learns these analytical skills and focuses on how to apply them, it will be of great value to the data scientist, to companies and organizations, and to their customers.

Regression analysis

Let’s start with the method that is most in demand in data science: regression analysis. Instead of picking regression models at random, let’s use some of the models that were most common in analytics in 2018 according to this article from Towards Data Science. Here we will look at three kinds of regression models: multiple regression, logistic regression, and polynomial regression.

Multiple regression

A multiple regression model consists of an explained variable (y) and a number of explanatory variables (the x’s). Multiple regression uses ordinary least squares (OLS) to estimate the coefficients of the model by minimizing the sum of squared distances from the observations to the regression line.
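As a minimal sketch of the idea (using R’s built-in `mtcars` data rather than the dataset analysed below), a multiple regression is fitted with `lm()` and the OLS coefficients are read from the summary:

```r
# Multiple regression on R's built-in mtcars data:
# explain fuel efficiency (mpg) by weight (wt) and horsepower (hp)
fit <- lm(mpg ~ wt + hp, data = mtcars)
coef(summary(fit))  # OLS estimates, std. errors, t- and p-values
```

Each coefficient is the estimated change in `mpg` for a one-unit change in that explanatory variable, holding the others constant.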

Logistic regression

Logistic regression is used when the variable to explain (y) is binary. The model passes a linear combination of the explanatory variables through the logistic (sigmoid) function, so the fitted values can be interpreted as probabilities; the coefficients are estimated by maximum likelihood.
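A minimal sketch, again on the built-in `mtcars` data: `glm()` with `family = binomial` fits a logistic regression, and the fitted values are probabilities between 0 and 1:

```r
# Logistic regression: probability of a manual transmission (am = 1)
# as a function of car weight (wt)
logit_fit <- glm(am ~ wt, data = mtcars, family = binomial)
summary(logit_fit)
range(fitted(logit_fit))  # fitted values are probabilities in (0, 1)
```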

Polynomial regression

A polynomial regression is a multiple regression in which some of the explanatory variables (x’s) are power functions of an original variable (e.g. AGE2 = AGE^2).

The code below loads a dataset and runs the different types of regression analysis:

```
# Loading of dataset for regression analysis in R
# (rdata is assumed to hold the UCI Adult census data, loaded beforehand)
library(skimr)
library(stargazer)
skim(rdata)

# Linear regression
lMod <- lm(CAPITALGAIN ~ RELATIONSHIP + AGE + ABOVE50K + OCCUPATION + EDUCATIONNUM,
           data = rdata)

# Logistic regression
logitMod <- glm(ABOVE50K ~ RELATIONSHIP + AGE + CAPITALGAIN + OCCUPATION + EDUCATIONNUM,
                data = rdata, family = binomial(link = "logit"))

# Polynomial regression
rdata$AGE2 <- (rdata$AGE)^2
lpMod <- lm(CAPITALGAIN ~ RELATIONSHIP + AGE + AGE2 + ABOVE50K + OCCUPATION + EDUCATIONNUM,
            data = rdata)

# Display regression models in a table
stargazer(lMod, logitMod, lpMod, omit = "OCCUPATION", type = "text", out = "/models.doc")
=============================================================================================
                                                   Dependent variable:
                               --------------------------------------------------------------
                                   CAPITALGAIN          ABOVE50K           CAPITALGAIN
                                       OLS              logistic        OLS - Polynomial
                                       (1)                 (2)                 (3)
---------------------------------------------------------------------------------------------
RELATIONSHIP Not-in-family          258.510**           -2.219***           228.366**
                                   (110.921)            (0.049)            (111.339)

RELATIONSHIP Other-relative          355.678            -2.541***            283.023
                                   (246.046)            (0.191)            (247.147)

RELATIONSHIP Own-child              383.287***          -3.542***            241.854
                                   (144.519)            (0.137)            (151.649)

RELATIONSHIP Unmarried               184.779            -2.529***            207.143
                                   (147.388)            (0.083)            (147.549)

RELATIONSHIP Wife                   -226.536            0.286***            -214.927
                                   (196.262)            (0.066)            (196.273)

AGE                                 18.750***           0.023***            -34.725*
                                    (3.370)             (0.001)             (17.720)

AGE2                                                                         0.607***
                                                                            (0.197)

ABOVE50K                          3,526.927***                            3,559.029***
                                   (112.903)                               (113.370)

CAPITALGAIN                                             0.0003***
                                                       (0.00001)

EDUCATIONNUM                        118.749***          0.288***            124.206***
                                    (19.195)            (0.009)             (19.274)

Constant                          -1,814.407***         -5.489***           -898.784**
                                   (306.870)            (0.161)             (427.640)

---------------------------------------------------------------------------------------------
Observations                          32,561             32,561               32,561
R2                                     0.055                                   0.055
Log Likelihood                                         -10,910.820
Akaike Inf. Crit.                                      21,867.640
Residual Std. Error           7,181.347 (df = 32538)                  7,180.415 (df = 32537)
F Statistic                 86.255*** (df = 22; 32538)              82.937*** (df = 23; 32537)
=============================================================================================
Note:                                                             *p<0.1; **p<0.05; ***p<0.01
```
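One detail worth extracting from the polynomial model (column 3): the AGE and AGE2 coefficients together imply a U-shaped relationship between age and capital gain, and a short calculation gives the age at which the effect bottoms out:

```r
# The age effect in the polynomial model is -34.725*AGE + 0.607*AGE^2,
# so its marginal effect, -34.725 + 2*0.607*AGE, is zero at:
b_age  <- -34.725
b_age2 <- 0.607
turning_point <- -b_age / (2 * b_age2)
turning_point  # about 28.6 years
```

That is, capital gain falls with age up to roughly age 28.6 and rises thereafter, holding the other variables constant.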

One of the main advantages of multiple regression is that the estimated coefficients are straightforward to interpret. For example, the effect of age on capital gain is positive and highly significant (p<0.01, marked by ***): each additional year of age is associated with an increase of 18.750 in capital gain, holding the other variables constant. The polynomial regression model is also easy to interpret. So these models have a lot of explanatory power and are easy to explain to the customer. It is a win-win. The logistic model is harder to explain; the easiest way to explain it is to look at effects and marginal effects. The code below shows this:

``` # Load packages
library(effects)
library(mfx)
# effects model
eff <- allEffects(logitMod)
plot(eff, rescale.axis = FALSE)
```

The plot shows the effect of each explanatory variable on the explained variable at any given level of that variable, which makes it a great way to visualize the estimated effects.

The marginal effect code gives us this result (not all results are included in this table):

```
# Marginal effects model
mfx <- logitmfx(ABOVE50K ~ RELATIONSHIP + AGE + CAPITALGAIN + OCCUPATION + EDUCATIONNUM,
                data = rdata, atmean = TRUE, robust = FALSE)
mfx

Call:
logitmfx(formula = ABOVE50K ~ RELATIONSHIP + AGE + CAPITALGAIN +
    OCCUPATION + EDUCATIONNUM, data = rdata, atmean = TRUE, robust = FALSE)

Marginal Effects:
                                   dF/dx   Std. Err.        z     P>|z|
RELATIONSHIP Not-in-family   -1.7788e-01  4.5764e-03 -38.8692 < 2.2e-16 ***
RELATIONSHIP Other-relative  -1.2431e-01  4.0250e-03 -30.8841 < 2.2e-16 ***
RELATIONSHIP Own-child       -1.9550e-01  4.0274e-03 -48.5418 < 2.2e-16 ***
RELATIONSHIP Unmarried       -1.4571e-01  3.9049e-03 -37.3139 < 2.2e-16 ***
RELATIONSHIP Wife             3.5126e-02  8.8522e-03   3.9680 7.247e-05 ***
AGE                           2.5532e-03  1.6595e-04  15.3856 < 2.2e-16 ***
CAPITALGAIN                   3.3967e-05  1.4328e-06  23.7066 < 2.2e-16 ***
EDUCATIONNUM                  3.2153e-02  1.1105e-03  28.9526 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```

The marginal effects above tell us how the probability of the explained variable (y) changes when a specific explanatory variable (x) changes, with the other variables held constant.
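As a quick sanity check on the numbers (using the EDUCATIONNUM coefficient from the stargazer table and its marginal effect from the mfx table): for a logit model, the marginal effect at the mean equals the coefficient scaled by p(1-p), where p is the predicted probability at the sample means, so the ratio of the two recovers that scaling factor:

```r
# For a logit model, dF/dx = beta * p*(1-p) at the means, so the ratio
# of the marginal effect to the coefficient recovers p*(1-p):
beta_educ <- 0.288       # EDUCATIONNUM coefficient (stargazer table)
dfdx_educ <- 3.2153e-02  # EDUCATIONNUM marginal effect (mfx table)
dfdx_educ / beta_educ    # ~0.112, the implied p*(1-p) at the means
```

In words: one extra year of education raises the predicted probability of earning above 50K by about 3.2 percentage points, holding the other variables at their means.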

Clustering analysis

Let us look at some of the clustering methods that are most in demand according to this 2018 article from KDnuggets. I have selected some of them for the code below. The first model is K-Nearest Neighbours (KNN), strictly a classification method rather than a clustering one, but a good way to introduce yourself to machine learning and classification. The method classifies a new observation by finding the most similar data points in the training data and making the best guess based on their classes. It has wide application in many domains, such as semantic search and recommendation systems.

The code below shows the application of a KNN model:

```
# KNN, K-Nearest Neighbour classification in R
library(caret)
dataurl <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data"
# Read the UCI wine data (no header; V1 is the class label)
wine_df <- read.csv(dataurl, header = FALSE)
str(wine_df)
set.seed(3033)
intrain <- createDataPartition(y = wine_df$V1, p = 0.6, list = FALSE)
training <- wine_df[intrain,]
testing <- wine_df[-intrain,]
dim(training); dim(testing)
anyNA(wine_df)
summary(wine_df)
training[["V1"]] <- factor(training[["V1"]])
trctrl <- trainControl(method = "repeatedcv", number = 8, repeats = 3)
set.seed(3322)
knn_fit_m <- train(V1 ~ ., data = training, method = "knn",
                   trControl = trctrl,
                   preProcess = c("center", "scale"),
                   tuneLength = 10)
knn_fit_m
plot(knn_fit_m)
```

The next model is Density-Based Spatial Clustering of Applications with Noise (DBSCAN). Given a set of observations, the model groups together points that are closely packed, and at the same time it marks as outliers the points that lie in low-density regions. DBSCAN is one of the most common clustering algorithms in data science. The code below demonstrates a DBSCAN model, with a k-means fit first for comparison:

```
# Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
library(factoextra)
library(plotly)
data("multishapes", package = "factoextra")
df <- multishapes[, 1:2]

# K-means for comparison
set.seed(120)
km.res <- kmeans(df, 6, nstart = 30)
fviz_cluster(km.res, df, geom = "point",
             ellipse = FALSE, show.clust.cent = FALSE,
             palette = "jco", ggtheme = theme_classic())

# Compute DBSCAN using the fpc package
library(fpc)
set.seed(123)
dbs <- fpc::dbscan(df, eps = 0.15, MinPts = 6)

# Plot DBSCAN results
fviz_cluster(dbs, data = df, stand = FALSE,
             ellipse = FALSE, show.clust.cent = FALSE,
             geom = "point", palette = "jco", ggtheme = theme_classic())
print(dbs)
ggplotly(p = ggplot2::last_plot())
```

Visualization

The last skill set we will work with is visualization. There are some very elegant and efficient visualization packages in R, such as `ggplot2`, `dygraphs` and `plotly`. Here we will work with `plotly` due to its ability to create high-quality interactive visualizations. The code below creates a basic line plot:

```
# Basic line plot with plotly
library(plotly)
t_0 <- rnorm(90, mean = 3)
t_1 <- rnorm(90, mean = 0)
t_2 <- rnorm(90, mean = -3)
x <- c(1:90)
datap <- data.frame(x, t_0, t_1, t_2)
r <- plot_ly(datap, x = ~x) %>%
  add_trace(y = ~t_0, name = 'trace 0', mode = 'lines') %>%
  add_trace(y = ~t_1, name = 'trace 1', mode = 'lines+markers') %>%
  add_trace(y = ~t_2, name = 'trace 2', mode = 'markers')
r
```
