Multilevel Modelling in R: Analysing Vendor Data

[This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers.]

    Standard regression analysis runs into difficulty when the data fall into several groups or categories and the outcome differs between them. A multilevel model addresses this by modelling variation at more than one level, allowing estimates to differ between groups rather than assuming a single value for all of them.

    The dataset used here, from data.ok.gov, contains information on purchases made by state and higher educational institutions in the State of Oklahoma from various vendors.

    Multilevel Model: Vendor Data

    Consider the following business problem. Suppose that new vendors wish to enter the market and sell to these institutions. How can we estimate potential sales to these institutions by these new vendors? Let us see how using a multilevel model can help us accomplish this.

    Firstly, the relevant libraries and dataset are imported.

    # Import Libraries
    library(lme4)
    library(ggplot2)
    library(reshape2)
    library(dplyr)
    library(data.table)
    
    # Load data (the Vendor column is converted in a later step)
    setwd("yourdirectory")
    mydata<-read.csv("file.csv")
    attach(mydata)
    

    From this dataset, we use General Purchases across different agencies (identified by their Agency Number), along with the Amount column. All positive values of Amount are assumed to represent purchases from these vendors.

    The Vendor variable is converted into a numeric code and added back to the data frame as a new column, Vendor.1:

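    # Encode each vendor as an integer code and store it as Vendor.1.
    # as.factor() is needed because read.csv() returns character columns in R >= 4.0
    mydata$Vendor.1 <- as.numeric(as.factor(mydata$Vendor))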
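
    As a minimal sketch of what this integer coding does, the toy example below uses made-up vendor names (the data frame `toy` and its values are assumptions, not the Oklahoma data). It also shows why the conversion should go through a factor: in R 4.0 and later, read.csv() returns character columns by default, and calling as.numeric() directly on vendor names would give NA.

```r
# Toy data: hypothetical vendor names, not taken from the real dataset
toy <- data.frame(Vendor = c("Acme", "Beta", "Acme", "Zeta"),
                  Amount = c(120, 75, 300, 42),
                  stringsAsFactors = FALSE)

# Going through a factor assigns one integer code per distinct vendor
# (codes follow the sorted order of the factor levels)
toy$Vendor.1 <- as.numeric(as.factor(toy$Vendor))

# Converting the character names directly yields NA (with a warning)
direct <- suppressWarnings(as.numeric(toy$Vendor))

print(toy)
print(direct)
```

    Repeated vendor names receive the same code, which is what lets the model treat all purchases from one vendor as a single group.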
    

    The multilevel model is formulated, and the conditional modes of the random effects are extracted using ranef.

    mlevel <- lmer(Amount ~ 1 + (1|Vendor.1),mydata)
    ranef(mlevel)
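
    Written out, the formula Amount ~ 1 + (1|Vendor.1) corresponds to a random-intercept model. The symbols below are notation introduced here for illustration, not taken from the original post:

```latex
\begin{aligned}
\text{Amount}_{ij} &= \beta_0 + u_j + \varepsilon_{ij}, \\
u_j &\sim \mathcal{N}(0, \sigma_u^2), \qquad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma_\varepsilon^2)
\end{aligned}
```

    Here i indexes purchases and j indexes vendors. The fixed intercept \beta_0 is what fixef returns, and the u_j are the vendor-level deviations whose conditional modes ranef returns.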
    

    Here are the regression results:

    mlevel
    Linear mixed model fit by REML ['lmerMod']
    Formula: Amount ~ 1 + (1 | Vendor.1)
       Data: mydata
    REML criterion at convergence: 4967261
    Random effects:
     Groups   Name        Std.Dev.
     Vendor.1 (Intercept) 4616    
     Residual             5910    
    Number of obs: 244051, groups:  Vendor.1, 39789
    Fixed Effects:
    (Intercept)  
          574.2  
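
    One way to read this output is through the intraclass correlation, the share of total variance attributable to differences between vendors. The sketch below simply plugs in the two standard deviations printed above; with the fitted object available, the same quantities could be read from VarCorr(mlevel).

```r
# Standard deviations copied from the printed model output above
sd_vendor   <- 4616   # between-vendor (random intercept) SD
sd_residual <- 5910   # within-vendor (residual) SD

# Intraclass correlation: between-vendor variance over total variance
icc <- sd_vendor^2 / (sd_vendor^2 + sd_residual^2)
print(round(icc, 3))
```

    Under these figures, roughly 38% of the variance in purchase amounts lies between vendors, which supports modelling vendors as a separate level.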
    

    For the purchase data, the fixed and random effects are added together to give an estimated average purchase amount per vendor, and a sample of these estimates (rows 39750 to 39770, near the end of the table) is plotted.

    # Average sales (amount) by vendor
    purchases <- fixef(mlevel) + ranef(mlevel)$Vendor.1
    purchases$Vendor.1<-rownames(purchases)
    names(purchases)[1]<-"Intercept"
    purchases <- purchases[,c(2,1)]
    # plot
    ggplot(purchases[39750:39770,],aes(x=Vendor.1,y=Intercept))+geom_point()
    

    [Figure: observations. Vendor.1 on the x-axis, Intercept on the y-axis.]
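
    Adding fixef() to ranef() gives each group's estimated intercept, which is also what coef() returns in lme4. The sketch below checks this on simulated data; the group count, means, and standard deviations are arbitrary values chosen for illustration, not estimates from the vendor data.

```r
library(lme4)

set.seed(1)
# Simulated data: 30 hypothetical groups with 20 observations each
n_groups <- 30
n_per    <- 20
group    <- factor(rep(seq_len(n_groups), each = n_per))
u        <- rnorm(n_groups, mean = 0, sd = 5)   # group-level deviations
y        <- 10 + u[as.integer(group)] + rnorm(n_groups * n_per, sd = 3)
sim      <- data.frame(y = y, group = group)

# Random-intercept model, same structure as Amount ~ 1 + (1 | Vendor.1)
fit <- lmer(y ~ 1 + (1 | group), data = sim)

# Group-level intercepts computed two ways
by_hand <- fixef(fit)[["(Intercept)"]] + ranef(fit)$group[["(Intercept)"]]
by_coef <- coef(fit)$group[["(Intercept)"]]

print(all.equal(by_hand, by_coef))
```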

    Now that the per-vendor estimates have been computed, values are simulated for 20 hypothetical new vendors, i.e. what sales could a new vendor in this market expect?

    The fixed intercept is added to a normally distributed random draw with mean 0 and standard deviation 200 (an assumed value chosen for the simulation):

    # Simulation - 20 new vendors
    set.seed(123)  # arbitrary seed so the simulated values are reproducible
    new_purchases <- data.frame(Vendor.1 = as.character(39800:39819),
                              Intercept= fixef(mlevel)+rnorm(20,0,200),Status="Simulated")
    purchases$Status <- "Observed"
    purchases2 <- rbind(purchases,new_purchases)
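
    As an alternative sketch (not the approach used in this post), new-vendor intercepts could instead be drawn using the between-vendor standard deviation that the model itself estimated. The values 574.2 and 4616 are copied from the model output; the seed and variable names are assumptions for illustration.

```r
set.seed(42)  # arbitrary seed so the draws are reproducible

fixed_intercept <- 574.2   # fixed effect from the model output
sd_vendor       <- 4616    # estimated between-vendor SD from the model output

# Draw intercepts for 20 hypothetical new vendors
new_vendors <- data.frame(
  Vendor.1  = as.character(39800:39819),
  Intercept = fixed_intercept + rnorm(20, mean = 0, sd = sd_vendor),
  Status    = "Simulated"
)

print(summary(new_vendors$Intercept))
```

    Draws with this larger spread scatter far more widely around the fixed intercept than draws with a standard deviation of 200, so the choice of spread largely determines how optimistic or pessimistic the simulated vendors look.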
    

    Now, the simulated amounts can be plotted against observed amounts to determine potential vendor sales:

    # Plot simulated vs observed
    ggplot(purchases2[39709:39809,],aes(x=Vendor.1,y=Intercept,color=Status))+
      geom_point()+
      # Constant values go outside aes(); inside it, linewidth would be treated as a mapped variable
      geom_hline(yintercept = fixef(mlevel)[1], linewidth = 1.5)
    

    [Figure: observed vs simulated. Vendor.1 on the x-axis, Intercept on the y-axis, points coloured by Status.]

    We can see that the simulated sales are more or less in line with the values estimated from the actual data. As mentioned, the advantage of a multilevel model is that differences between groups are taken into account when fitting the model. This avoids the “one size fits all” result that a standard linear regression would produce when groups differ substantially.

    Conclusion

    In this example, we have seen:

      • How to implement a multilevel model in R
      • The advantages of these models in modelling data with multiple categories
      • How to run simulations with the model

    You can also find another example of how to run a multilevel model here.

    Thank you for your time! Feel free to view more data science and machine learning content at michaeljgrogan.com.
