*This article was first published on **R Programming – DataScience+**, and kindly contributed to R-bloggers.*


The most in-demand skill in data science is statistical regression analysis, followed by clustering methods, with visualization in third place. A data scientist who learns these analytical skills, and how to apply them, delivers value to themselves, to the companies and organizations they work for, and to their customers.

## Regression analysis

Let’s start with the method that is most in demand in data science: regression analysis. Instead of choosing regression models at random, let’s use some of the models that were most common for analytics in 2018 according to this article from Towards Data Science. Here we will look at three kinds of regression models: multiple regression, logistic regression, and polynomial regression.

### Multiple regression

A multiple regression model consists of an explained variable (y) and a number of explanatory variables (the x’s). Multiple regression uses ordinary least squares (OLS) to estimate the coefficients of the model, by minimizing the sum of squared distances from the observations to the regression line.
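As a minimal sketch of what OLS does under the hood, the coefficients can be computed directly from the normal equations and compared with `lm()`. The data below is simulated, and the variable names (`x1`, `x2`, `y`) are illustrative, not the article's dataset:

```r
# Sketch: OLS coefficients via the normal equations, on simulated data
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

# Design matrix with an intercept column
X <- cbind(1, x1, x2)

# beta_hat = (X'X)^(-1) X'y minimizes the sum of squared residuals
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# lm() minimizes the same criterion, so the estimates agree
fit <- lm(y ~ x1 + x2)
all.equal(as.numeric(beta_hat), as.numeric(coef(fit)))  # TRUE (up to numerical tolerance)
```
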

### Logistic regression

Logistic regression is used when the variable to explain (y) is binary. The model maps the linear predictor to a probability through the logistic function (the cumulative distribution function of the logistic distribution), and the coefficients are estimated by maximum likelihood.
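The role of the logistic CDF can be made concrete on simulated data (the names below are illustrative): the fitted probabilities from `glm()` are exactly the logistic function `plogis()` applied to the linear predictor.

```r
# Sketch: the logistic (sigmoid) CDF maps the linear predictor to a probability
set.seed(1)
x <- rnorm(500)
p <- plogis(-0.5 + 1.5 * x)           # true probabilities via the logistic CDF
y <- rbinom(500, size = 1, prob = p)  # binary outcome

fit <- glm(y ~ x, family = binomial(link = "logit"))

# Fitted probabilities are the logistic CDF of the linear predictor
manual <- plogis(predict(fit, type = "link"))
all.equal(unname(manual), unname(fitted(fit)))  # TRUE
```
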

### Polynomial regression

A polynomial regression function is any multiple regression function where some of the explanatory variables (x’s) are power functions of an original variable (e.g. AGE2 = AGE^2).

The code below loads a dataset and then runs the different types of regression analysis:

```r
# Loading of dataset for regression analysis in R
library(skimr)
library(stargazer)
rdata <- read.csv("http://rstatistics.net/wp-content/uploads/2015/09/adult.csv")
skim(rdata)

# Linear regression
lMod <- lm(CAPITALGAIN ~ RELATIONSHIP + AGE + ABOVE50K + OCCUPATION + EDUCATIONNUM,
           data = rdata)

# Logistic regression
logitMod <- glm(ABOVE50K ~ RELATIONSHIP + AGE + CAPITALGAIN + OCCUPATION + EDUCATIONNUM,
                data = rdata, family = binomial(link = "logit"))

# Polynomial regression
rdata$AGE2 <- (rdata$AGE)^2
lpMod <- lm(CAPITALGAIN ~ RELATIONSHIP + AGE + AGE2 + ABOVE50K + OCCUPATION + EDUCATIONNUM,
            data = rdata)

# Display regression models in a table
stargazer(lMod, logitMod, lpMod, omit = "OCCUPATION", type = "text", out = "models.doc")
```

```
=============================================================================================
                                               Dependent variable:
                             ----------------------------------------------------------------
                                 CAPITALGAIN          ABOVE50K            CAPITALGAIN
                                     OLS              logistic         OLS - Polynomial
                                     (1)                 (2)                  (3)
---------------------------------------------------------------------------------------------
RELATIONSHIP Not-in-family        258.510**           -2.219***            228.366**
                                 (110.921)             (0.049)            (111.339)
RELATIONSHIP Other-relative       355.678             -2.541***            283.023
                                 (246.046)             (0.191)            (247.147)
RELATIONSHIP Own-child            383.287***          -3.542***            241.854
                                 (144.519)             (0.137)            (151.649)
RELATIONSHIP Unmarried            184.779             -2.529***            207.143
                                 (147.388)             (0.083)            (147.549)
RELATIONSHIP Wife                -226.536              0.286***           -214.927
                                 (196.262)             (0.066)            (196.273)
AGE                                18.750***           0.023***            -34.725*
                                   (3.370)             (0.001)             (17.720)
AGE2                                                                         0.607***
                                                                            (0.197)
ABOVE50K                        3,526.927***                             3,559.029***
                                 (112.903)                                (113.370)
CAPITALGAIN                                            0.0003***
                                                      (0.00001)
EDUCATIONNUM                      118.749***           0.288***            124.206***
                                  (19.195)             (0.009)             (19.274)
Constant                       -1,814.407***          -5.489***           -898.784**
                                 (306.870)             (0.161)            (427.640)
---------------------------------------------------------------------------------------------
Observations                       32,561              32,561               32,561
R2                                  0.055                                    0.055
Adjusted R2                         0.054                                    0.055
Log Likelihood                                      -10,910.820
Akaike Inf. Crit.                                    21,867.640
Residual Std. Error       7,181.347 (df = 32538)                  7,180.415 (df = 32537)
F Statistic            86.255*** (df = 22; 32538)              82.937*** (df = 23; 32537)
=============================================================================================
Note:                                                       *p<0.1; **p<0.05; ***p<0.01
```

One of the main advantages of multiple regression is that the estimated coefficients are very straightforward to interpret. For example, the effect of age on capital gain is positive and highly significant (p&lt;0.01, shown by ***), with an estimate of 18.750. The polynomial regression model is also easy to interpret. These models therefore have a lot of explanatory power and are easy to explain to the customer, which is a win-win. The logistic model is a bit harder to explain; the easiest way to interpret it is to look at effects and marginal effects. The code below shows the effects:

```r
# Load packages
library(effects)
library(mfx)

# Effects model
eff <- allEffects(logitMod)
plot(eff, rescale.axis = FALSE)
```

The effect plot looks like this:

The plot shows the effect of an explanatory variable on the explained variable at every level of that explanatory variable, so it is a great way to visualize these effects.
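Another common way to interpret logit coefficients, complementary to the effect plots, is to exponentiate them into odds ratios. A minimal sketch on simulated data (the names are illustrative, not the article's model):

```r
# Sketch: interpreting logit coefficients as odds ratios
set.seed(3)
x <- rnorm(800)
y <- rbinom(800, 1, plogis(-1 + 0.7 * x))
fit <- glm(y ~ x, family = binomial(link = "logit"))

# exp(beta) is the multiplicative change in the odds of y = 1
# for a one-unit increase in x
odds_ratios <- exp(coef(fit))
odds_ratios
```
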

The marginal effects code gives us this result (not all of the results are included in the table):

```r
# Marginal effects model
mfx <- logitmfx(ABOVE50K ~ RELATIONSHIP + AGE + CAPITALGAIN + OCCUPATION + EDUCATIONNUM,
                data = rdata, atmean = TRUE, robust = FALSE)
mfx
```

```
Call:
logitmfx(formula = ABOVE50K ~ RELATIONSHIP + AGE + CAPITALGAIN +
    OCCUPATION + EDUCATIONNUM, data = rdata, atmean = TRUE, robust = FALSE)

Marginal Effects:
                                    dF/dx   Std. Err.        z     P>|z|
RELATIONSHIP Not-in-family    -1.7788e-01  4.5764e-03 -38.8692 < 2.2e-16 ***
RELATIONSHIP Other-relative   -1.2431e-01  4.0250e-03 -30.8841 < 2.2e-16 ***
RELATIONSHIP Own-child        -1.9550e-01  4.0274e-03 -48.5418 < 2.2e-16 ***
RELATIONSHIP Unmarried        -1.4571e-01  3.9049e-03 -37.3139 < 2.2e-16 ***
RELATIONSHIP Wife              3.5126e-02  8.8522e-03   3.9680 7.247e-05 ***
AGE                            2.5532e-03  1.6595e-04  15.3856 < 2.2e-16 ***
CAPITALGAIN                    3.3967e-05  1.4328e-06  23.7066 < 2.2e-16 ***
EDUCATIONNUM                   3.2153e-02  1.1105e-03  28.9526 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

The above marginal effects tell us how the explained variable (y) changes when a specific explanatory variable (x) changes, with the other variables held constant.
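For a continuous regressor, the marginal effect "at the mean" that `logitmfx` reports can be reproduced by hand: it is the logistic density evaluated at the linear predictor at the covariate means, times the coefficient. A sketch on simulated data (not the article's model):

```r
# Sketch: marginal effect at the mean for a logit model, computed by hand
# dF/dx = f(x_bar' beta) * beta_j, where f is the logistic density dlogis()
set.seed(7)
x <- rnorm(1000)
y <- rbinom(1000, 1, plogis(0.3 + 0.8 * x))
fit <- glm(y ~ x, family = binomial)

b  <- coef(fit)
xb <- b[1] + b[2] * mean(x)   # linear predictor at the mean of x
me <- dlogis(xb) * b[2]       # marginal effect at the mean
me
```
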

## Clustering analysis

Let us look at some of the cluster analyses that are in most demand according to this 2018 article from KDnuggets. I have selected some of them in the code below. The first model is K-Nearest Neighbours (KNN). KNN is a good way to introduce yourself to machine learning and classification: it classifies a new observation by finding the most similar data points in the training data and taking the best guess based on the classification of those observations. The method has wide application in many domains, such as semantic search and recommendation systems.
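Stripped to its core, the idea can be sketched in a few lines with `class::knn()` on the built-in iris data (an illustration only; the article's own example below uses caret on the wine data):

```r
# Sketch: KNN classification with class::knn() on the iris data
library(class)
set.seed(11)
idx   <- sample(nrow(iris), 120)
train <- scale(iris[idx, 1:4])
test  <- scale(iris[-idx, 1:4],
               center = attr(train, "scaled:center"),
               scale  = attr(train, "scaled:scale"))

# Each test point gets the majority label among its k nearest neighbours
pred <- knn(train, test, cl = iris$Species[idx], k = 5)
mean(pred == iris$Species[-idx])  # accuracy
```
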

The code below shows the application of a KNN model:

```r
# KNN: K-Nearest Neighbour classification with caret
library(caret)

dataurl <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data"
download.file(url = dataurl, destfile = "wine.data")
wine_df <- read.csv("wine.data", header = FALSE)
str(wine_df)

# Split into training and test sets
set.seed(3033)
intrain  <- createDataPartition(y = wine_df$V1, p = 0.6, list = FALSE)
training <- wine_df[intrain, ]
testing  <- wine_df[-intrain, ]
dim(training); dim(testing)

anyNA(wine_df)
summary(wine_df)

# The class label must be a factor
training[["V1"]] <- factor(training[["V1"]])

# Repeated cross-validation to tune k
trctrl <- trainControl(method = "repeatedcv", number = 8, repeats = 3)
set.seed(3322)
knn_fit_m <- train(V1 ~ ., data = training, method = "knn",
                   trControl = trctrl,
                   preProcess = c("center", "scale"),
                   tuneLength = 10)
knn_fit_m
plot(knn_fit_m)
```

The KNN model above gives us the following plot:

The next cluster model is Density-Based Spatial Clustering of Applications with Noise (DBSCAN). Given a set of observations, the model groups together points that are closely packed, and at the same time it marks as outliers the points that lie in low-density regions. DBSCAN is one of the most common clustering algorithms in data science. The code below demonstrates a DBSCAN model (with a k-means solution first, for comparison):

```r
# Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
library(factoextra)
library(plotly)

# Load the data
data("multishapes", package = "factoextra")
df <- multishapes[, 1:2]

# K-means for comparison
set.seed(120)
km.res <- kmeans(df, 6, nstart = 30)
fviz_cluster(km.res, df, geom = "point", ellipse = FALSE,
             show.clust.cent = FALSE, palette = "jco",
             ggtheme = theme_classic())

# Compute DBSCAN using the fpc package
library(fpc)
set.seed(123)
dbs <- fpc::dbscan(df, eps = 0.15, MinPts = 6)

# Plot DBSCAN results
fviz_cluster(dbs, data = df, stand = FALSE, ellipse = FALSE,
             show.clust.cent = FALSE, geom = "point",
             palette = "jco", ggtheme = theme_classic())
print(dbs)
ggplotly(p = ggplot2::last_plot())
```

The above DBSCAN model gives us the following plot:
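A practical question with DBSCAN is how to choose `eps`. A common heuristic is to plot the sorted k-nearest-neighbour distances and pick `eps` near the elbow of that curve. The sketch below uses base R on simulated 2-D points (not the multishapes data), with a high quantile of the k-distances as a quick proxy for the elbow:

```r
# Sketch: picking DBSCAN's eps from sorted k-nearest-neighbour distances
set.seed(5)
pts <- rbind(matrix(rnorm(200, 0, 0.3), ncol = 2),
             matrix(rnorm(200, 3, 0.3), ncol = 2))

k <- 5
d <- as.matrix(dist(pts))
# Distance from each point to its k-th nearest neighbour (excluding itself)
kdist <- apply(d, 1, function(row) sort(row)[k + 1])

# plot(sort(kdist)) and reading off the elbow gives a candidate eps;
# a high quantile of the k-distances is a quick proxy
eps_candidate <- quantile(kdist, 0.9)
eps_candidate
```
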

## Visualization

The last skillset that we will work with is visualization. There are some very elegant and efficient visualization packages in R, like `ggplot2`, `dygraphs` and `Plotly`. In this case we will work with `Plotly` due to its ability to create efficient visualizations with high-quality interactive modules. The code below creates a basic line plot:

```r
# Basic line plot
library(plotly)

t_0 <- rnorm(90, mean = 3)
t_1 <- rnorm(90, mean = 0)
t_2 <- rnorm(90, mean = -3)
x <- c(1:90)
datap <- data.frame(x, t_0, t_1, t_2)

r <- plot_ly(datap, x = ~x) %>%
  add_trace(y = ~t_0, name = 'trace 0', mode = 'lines') %>%
  add_trace(y = ~t_1, name = 'trace 1', mode = 'lines+markers') %>%
  add_trace(y = ~t_2, name = 'trace 2', mode = 'markers')
r
```

The above coding gives us the following visualization:

### References

1. Using Skimr in R – CRAN.R-project.org

2. Using Stargazer in R – CRAN.R-project.org

3. Using Caret in R – CRAN.R-project.org

4. Using Factoextra in R – CRAN.R-project.org

5. Using Plotly in R – CRAN.R-project.org

6. Using Fpc in R – CRAN.R-project.org

