Commercial data analytics: An economic view on the data science methods

October 3, 2018

    Categories

    1. Regression Models

    Tags

    1. Data Visualisation
    2. Linear Regression
    3. Logistic Regression
    4. R Programming

    The most in-demand skill in data science is statistical regression analysis, followed by clustering methods, with visualization third. So if a data scientist learns these analytical skills and focuses on how to apply them, it will be of great value to the data scientist, to companies and organizations, and to their customers.

    Regression analysis

    Let’s start with the method that is most in demand in data science: regression analysis. Instead of choosing regression models at random, let’s use some of the models that are most common in analytics in 2018, according to this article from Towards Data Science. Here we will look at three kinds of regression models: multiple regression, logistic regression, and polynomial regression.

    Multiple regression

    A multiple regression model consists of an explained variable (y) and a number of explanatory variables (the x’s). Multiple regression uses ordinary least squares (OLS) to estimate the coefficients of the model by minimizing the sum of squared distances from the observations to the regression line.
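
    To make the OLS idea concrete, here is a minimal sketch on simulated data (all variable names here are illustrative, not from the dataset used below) showing that lm() and the closed-form solution of the normal equations, b = (X'X)^-1 X'y, give the same coefficients:

    # OLS sketch on simulated data: lm() agrees with the normal equations
    set.seed(1)
    n  <- 100
    x1 <- rnorm(n); x2 <- rnorm(n)
    y  <- 2 + 1.5 * x1 - 0.5 * x2 + rnorm(n)
    coef(lm(y ~ x1 + x2))              # OLS via lm()
    X <- cbind(1, x1, x2)              # design matrix with intercept column
    solve(t(X) %*% X) %*% t(X) %*% y   # same coefficients via the normal equations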

    Logistic regression

    Logistic regression is used when the variable to explain (y) is binary. The model passes a linear combination of the explanatory variables through the logistic cumulative distribution function, so the fitted values can be interpreted as probabilities; the coefficients are estimated by maximum likelihood.
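
    A minimal sketch on simulated binary data (again with illustrative variable names) showing that the fitted probabilities from glm() are exactly the logistic CDF applied to the linear predictor:

    # Logistic regression sketch: fitted probability = logistic CDF of x'beta
    set.seed(2)
    x <- rnorm(200)
    y <- rbinom(200, 1, plogis(-1 + 2 * x))   # simulate a binary outcome
    fit <- glm(y ~ x, family = binomial(link = "logit"))
    eta <- coef(fit)[1] + coef(fit)[2] * x    # linear predictor
    all.equal(unname(fitted(fit)), unname(plogis(eta)))   # TRUE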

    Polynomial regression

    A polynomial regression function is a multiple regression function in which some of the explanatory variables (x’s) are power functions of an original variable (e.g. AGE2 = AGE^2).
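
    As a small aside, the squared term can also be written inline with I() instead of creating a new column; a sketch on simulated data (variable names illustrative):

    # Inline polynomial term via I(), no extra column needed
    age <- 20:70
    y <- 5 + 0.8 * age - 0.01 * age^2 + rnorm(length(age), sd = 2)
    coef(lm(y ~ age + I(age^2)))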

    The code below loads a dataset and then fits the three types of regression models:

    # Load the dataset for regression analysis in R
    library(skimr)      # quick summary statistics
    library(stargazer)  # regression tables
    rdata <- read.csv("http://rstatistics.net/wp-content/uploads/2015/09/adult.csv")
    skim(rdata)

    # Linear (multiple) regression
    lMod <- lm(CAPITALGAIN ~ RELATIONSHIP + AGE + ABOVE50K + OCCUPATION + EDUCATIONNUM,
               data = rdata)

    # Logistic regression
    logitMod <- glm(ABOVE50K ~ RELATIONSHIP + AGE + CAPITALGAIN + OCCUPATION + EDUCATIONNUM,
                    data = rdata, family = binomial(link = "logit"))

    # Polynomial regression: add a squared age term
    rdata$AGE2 <- rdata$AGE^2
    lpMod <- lm(CAPITALGAIN ~ RELATIONSHIP + AGE + AGE2 + ABOVE50K + OCCUPATION + EDUCATIONNUM,
                data = rdata)

    # Display the regression models in a table
    stargazer(lMod, logitMod, lpMod, omit = "OCCUPATION", type = "text", out = "models.txt")
    =============================================================================================
                                                       Dependent variable:                       
                                -----------------------------------------------------------------
                                       CAPITALGAIN          ABOVE50K          CAPITALGAIN        
                                           OLS              logistic       OLS - Polynomial            
                                           (1)                 (2)                (3)            
    ---------------------------------------------------------------------------------------------
    RELATIONSHIP Not-in-family          258.510**           -2.219***          228.366**         
                                        (110.921)            (0.049)           (111.339)         
                                                                                                 
    RELATIONSHIP Other-relative          355.678            -2.541***           283.023          
                                        (246.046)            (0.191)           (247.147)         
                                                                                                 
    RELATIONSHIP Own-child              383.287***          -3.542***           241.854          
                                        (144.519)            (0.137)           (151.649)         
                                                                                                 
    RELATIONSHIP Unmarried               184.779            -2.529***           207.143          
                                        (147.388)            (0.083)           (147.549)         
                                                                                                 
    RELATIONSHIP Wife                    -226.536           0.286***            -214.927         
                                        (196.262)            (0.066)           (196.273)         
                                                                                                 
    AGE                                 18.750***           0.023***            -34.725*         
                                         (3.370)             (0.001)            (17.720)         
                                                                                                 
    AGE2                                                                        0.607***         
                                                                                (0.197)          
                                                                                                 
    ABOVE50K                           3,526.927***                           3,559.029***       
                                        (112.903)                              (113.370)         
                                                                                                 
    CAPITALGAIN                                             0.0003***                            
                                                            (0.00001)                            
                                                                                                 
    EDUCATIONNUM                        118.749***          0.288***           124.206***        
                                         (19.195)            (0.009)            (19.274)         
                                                                                                 
    Constant                          -1,814.407***         -5.489***          -898.784**        
                                        (306.870)            (0.161)           (427.640)         
                                                                                                 
    ---------------------------------------------------------------------------------------------
    Observations                          32,561             32,561              32,561          
    R2                                    0.055                                  0.055           
    Adjusted R2                           0.054                                  0.055           
    Log Likelihood                                         -10,910.820                           
    Akaike Inf. Crit.                                      21,867.640                            
    Residual Std. Error           7,181.347 (df = 32538)                 7,180.415 (df = 32537)  
    F Statistic                 86.255*** (df = 22; 32538)             82.937*** (df = 23; 32537)
    =============================================================================================
    Note:                                                             *p<0.1; **p<0.05; ***p<0.01
    

    One of the main advantages of multiple regression is that the estimated coefficients are very straightforward to interpret. For example, the effect of age on capital gain is positive and highly significant (p<0.01, shown by ***), with an estimated effect of 18.750. The polynomial regression model is also easy to interpret. These models therefore have a lot of explanatory power and are easy to explain to the customer: a win-win. The logistic model is a bit harder to explain; the easiest way is to look at effects and marginal effects, as the code below shows:

    # Load packages
    library(effects)  # effect plots
    library(mfx)      # marginal effects

    # Effects of each explanatory variable in the logistic model
    eff <- allEffects(logitMod)
    plot(eff, rescale.axis = FALSE)
    

    The effect plot looks like this:

    The plot shows the effect of each explanatory variable on the explained variable across the range of that variable, so it is a great way to visualize the effects in the model.
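
    If you need the numbers behind the plot rather than the picture, the effects package can also return them as a data frame. A minimal sketch, assuming the eff object created above, using the AGE component as an example:

    # Extract the effect of AGE as a data frame of fitted effects with
    # confidence bounds (assumes eff <- allEffects(logitMod) above)
    age_eff <- as.data.frame(eff[["AGE"]])
    head(age_eff)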

    The marginal effects code gives us the following result (not all rows are shown in this table):

    # Marginal effects of the logistic model, evaluated at the mean of the data
    mfx <- logitmfx(ABOVE50K ~ RELATIONSHIP + AGE + CAPITALGAIN + OCCUPATION + EDUCATIONNUM,
                    data = rdata, atmean = TRUE, robust = FALSE)
    mfx
    Call:
    logitmfx(formula = ABOVE50K ~ RELATIONSHIP + AGE + CAPITALGAIN + 
        OCCUPATION + EDUCATIONNUM, data = rdata, atmean = TRUE, robust = FALSE)
    
    Marginal Effects:
                                       dF/dx   Std. Err.        z     P>|z|    
    RELATIONSHIP Not-in-family   -1.7788e-01  4.5764e-03 -38.8692 < 2.2e-16 ***
    RELATIONSHIP Other-relative  -1.2431e-01  4.0250e-03 -30.8841 < 2.2e-16 ***
    RELATIONSHIP Own-child       -1.9550e-01  4.0274e-03 -48.5418 < 2.2e-16 ***
    RELATIONSHIP Unmarried       -1.4571e-01  3.9049e-03 -37.3139 < 2.2e-16 ***
    RELATIONSHIP Wife             3.5126e-02  8.8522e-03   3.9680 7.247e-05 ***
    AGE                           2.5532e-03  1.6595e-04  15.3856 < 2.2e-16 ***
    CAPITALGAIN                   3.3967e-05  1.4328e-06  23.7066 < 2.2e-16 ***
    EDUCATIONNUM                  3.2153e-02  1.1105e-03  28.9526 < 2.2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    

    The above marginal effects tell us how the probability of the explained variable (y) changes when a specific explanatory variable (x) changes, with the other variables held constant at their means (because we set atmean = TRUE).
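
    To see where these numbers come from: for a continuous variable, the marginal effect at the mean is the coefficient scaled by the logistic density evaluated at the mean linear predictor (factor variables such as RELATIONSHIP are instead computed as discrete changes). A hand-computed sketch, assuming logitMod from above, which should reproduce the AGE row up to rounding:

    # Marginal effect of AGE at the mean: beta_AGE * f(x'beta), where f is the
    # logistic density and x is the mean of the design matrix
    X_mean <- colMeans(model.matrix(logitMod))
    eta_mean <- sum(X_mean * coef(logitMod))
    coef(logitMod)["AGE"] * dlogis(eta_mean)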

    Clustering analysis

    Let us look at some of the cluster-analysis methods that are most in demand according to this 2018 article from KDnuggets. I have selected some of them in the code below. The first model is K-Nearest Neighbours (KNN). KNN is a good way to introduce yourself to machine learning and classification: it classifies a new observation by finding the most similar data points in the training data and making the best guess based on their classifications. The method has wide application in many domains, such as semantic search and recommendation systems.

    The code below shows the application of a KNN model:

    # KNN: K-Nearest Neighbours classification with caret
    library(caret)
    dataurl <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data"
    download.file(url = dataurl, destfile = "wine.data")
    wine_df <- read.csv("wine.data", header = FALSE)
    str(wine_df)
    # Split the data into training (60%) and test (40%) sets
    set.seed(3033)
    intrain <- createDataPartition(y = wine_df$V1, p = 0.6, list = FALSE)
    training <- wine_df[intrain, ]
    testing <- wine_df[-intrain, ]
    dim(training); dim(testing)
    anyNA(wine_df)
    summary(wine_df)
    # The class label (V1) must be a factor for classification
    training[["V1"]] <- factor(training[["V1"]])
    # Tune k with repeated 8-fold cross-validation, centering and scaling the features
    trctrl <- trainControl(method = "repeatedcv", number = 8, repeats = 3)
    set.seed(3322)
    knn_fit_m <- train(V1 ~ ., data = training, method = "knn",
                       trControl = trctrl,
                       preProcess = c("center", "scale"),
                       tuneLength = 10)
    knn_fit_m
    plot(knn_fit_m)

    The KNN model above gives us the following plot:
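
    A natural next step is to check the tuned model against the held-out test set. A minimal sketch, assuming the objects created in the KNN block above:

    # Evaluate the tuned KNN model on the held-out test set
    test_pred <- predict(knn_fit_m, newdata = testing)
    confusionMatrix(test_pred, factor(testing[["V1"]]))  # accuracy and per-class statistics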

    The next cluster model is Density-Based Spatial Clustering of Applications with Noise (DBSCAN). Given a set of observations, the model groups together observations that are closely packed and, at the same time, marks points that lie alone in low-density regions as outliers (noise). DBSCAN is one of the most common clustering algorithms in data science. The code below demonstrates a DBSCAN model, with a k-means fit shown first for comparison:

    # Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
    library(factoextra)
    library(plotly)
    # Load the data
    data("multishapes", package = "factoextra")
    df <- multishapes[, 1:2]
    # K-means for comparison: it cuts straight through the non-convex shapes
    set.seed(120)
    km.res <- kmeans(df, 6, nstart = 30)
    fviz_cluster(km.res, df, geom = "point",
                 ellipse = FALSE, show.clust.cent = FALSE,
                 palette = "jco", ggtheme = theme_classic())
    # Compute DBSCAN using the fpc package
    library(fpc)
    set.seed(123)
    dbs <- fpc::dbscan(df, eps = 0.15, MinPts = 6)
    # Plot the DBSCAN results
    fviz_cluster(dbs, data = df, stand = FALSE,
                 ellipse = FALSE, show.clust.cent = FALSE,
                 geom = "point", palette = "jco", ggtheme = theme_classic())
    print(dbs)
    ggplotly(p = ggplot2::last_plot())
    

    The above DBSCAN model gives us the following plot:
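
    Choosing eps is the hardest part of DBSCAN. One common heuristic is to plot each point's distance to its k-th nearest neighbour and pick the "elbow" of the curve. A sketch of this, assuming df from above; note that it uses the separate dbscan package, since fpc does not provide this plot:

    # Heuristic for picking eps: elbow of the k-nearest-neighbour distance plot
    library(dbscan)             # note: this masks fpc::dbscan, hence fpc:: above
    kNNdistplot(df, k = 6)      # k chosen to match MinPts above
    abline(h = 0.15, lty = 2)   # the eps value used above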

    Visualization

    The last skill set that we will work with is visualization. R has some very elegant and efficient visualization packages, such as ggplot2, dygraphs, and plotly. Here we will work with plotly because of its ability to create efficient, high-quality interactive visualizations. The code below creates a basic interactive line plot:

    # Basic line plot with plotly
    library(plotly)
    # Three simulated traces with different means
    t_0 <- rnorm(90, mean = 3)
    t_1 <- rnorm(90, mean = 0)
    t_2 <- rnorm(90, mean = -3)
    x <- c(1:90)
    datap <- data.frame(x, t_0, t_1, t_2)
    r <- plot_ly(datap, x = ~x) %>%
      add_trace(y = ~t_0, name = 'trace 0', type = 'scatter', mode = 'lines') %>%
      add_trace(y = ~t_1, name = 'trace 1', type = 'scatter', mode = 'lines+markers') %>%
      add_trace(y = ~t_2, name = 'trace 2', type = 'scatter', mode = 'markers')
    r
    

    The above coding gives us the following visualization:
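
    Because plotly figures are HTML widgets, the interactive plot can also be saved as a standalone file and shared with a customer. A one-line sketch (the file name is illustrative):

    # Save the interactive figure as a self-contained HTML file
    htmlwidgets::saveWidget(r, "lineplot.html")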

    References

    1. Using skimr in R – CRAN.R-project.org
    2. Using stargazer in R – CRAN.R-project.org
    3. Using caret in R – CRAN.R-project.org
    4. Using factoextra in R – CRAN.R-project.org
    5. Using plotly in R – CRAN.R-project.org
    6. Using fpc in R – CRAN.R-project.org
