Machine Learning Benchmarking with SFA in R

October 14, 2018

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)


    Category: Advanced Modeling
    Tags: Linear Regression, Machine Learning, R Programming

    Regression analysis is among the most widely used machine learning methods in 2018. One branch of regression analysis for measuring economic efficiency is stochastic frontier analysis (SFA). This method is well suited for benchmarking and for finding room for optimization in companies and organizations. It can, therefore, be used to design companies so that they generate more value for employees and customers.

    In this article, you will learn how to program an SFA model in R and also use this model in a benchmarking evaluation setting – before and after changes in organizations or companies.

    Introduction to Benchmarking with SFA Machine Learning

    One of the main advantages of SFA models is that they can predict how you can create more value in companies – either for employees or for customers. This is due to the models’ ability to measure efficiency improvements that leave room for optimization. Basically, an SFA model consists of a parameter you want to explain and optimize on (y) and a number of variables that you know affect this outcome in either a positive or a negative way (the x’s). In this respect, it resembles a multiple OLS regression model. The difference is that the explanatory variables (the x’s) are treated as inputs that generate the outcome variable (the y). These inputs can be either production inputs or cost inputs, which means the SFA model can be specified as either a production function or a cost function. So you add value by finding room for optimization in the cost or production inputs of the company. This can be illustrated with the following graph:

    (Figure: Benchmarking with SFA Machine Learning in R)
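    The idea behind the graph can be sketched in a few lines of base R. This is a minimal illustration (with made-up values, not the article’s data) of the stochastic frontier logic: observed output is the frontier output f(x) pulled down by a non-negative inefficiency term u.

    ```r
    # Illustrative sketch of a production frontier f(x) = sqrt(x)
    # with a non-negative inefficiency term u.
    x <- c(0.2, 0.5, 0.9)               # production inputs
    frontier_y <- x^(1/2)               # frontier (maximal) output f(x)
    u <- c(0.10, 0.30, 0.05)            # inefficiency, u >= 0
    observed_y <- frontier_y * exp(-u)  # observed output lies below the frontier
    all(observed_y <= frontier_y)       # TRUE for every observation
    ```

    The gap between observed_y and frontier_y is exactly the room for optimization the SFA model is designed to measure.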

    This makes the SFA model very suitable for commercial machine learning analysis. Some efficient and well-designed SFA packages have already been published in R – notably sfa and frontier. The sfa package implements stochastic frontier analysis in R, and you can build both production and cost models with it. The frontier package focuses on more specialised SFA models. In this article, we will mainly work with sfa. The code below creates an SFA model in R.

    First, we load the required packages into the R session:

    # Stochastic frontier analysis in R
    # SFA packages
    library(sfa); library(frontier)
    # Other packages used below
    library(tmvtnorm); library(stargazer); library(readxl)

    The next step is to generate a random dataset in R and plot it:

    # Generation of random data for the data frame
    set.seed(123)                  # for reproducibility
    ns <- 15
    x <- runif(ns, 0, 1)           # inputs
    ybar <- x^(1/2)                # frontier output f(x) = sqrt(x)
    # Truncated-normal inefficiency term, bounded below at 0
    u <- rtmvnorm(n = ns, mean = c(0.25), sigma = matrix(0.2), lower = c(0))
    y <- ybar/(1 + u)              # observed output below the frontier
    plot(y ~ x, type = "p", col = "red", ylim = c(0, 1))

    The above coding gives the following plot:

    The third step is the creation of the SFA model:

    # Representation of the efficiency frontier
    x.seq <- seq(0, 1, by = 0.01)
    ybar.seq <- x.seq^(1/2)        # frontier values along the input grid
    lines(ybar.seq ~ x.seq, col = "blue")
    # The frontier efficiency measures
    lambda <- y/sqrt(x)
    theta <- y^2/x
    delta <- 1/theta
    # Table of frontier efficiency measures
    data.frame(lambda, theta, delta)
          lambda     theta    delta
    1  0.6475214 0.4192839 2.385019
    2  0.7918257 0.6269879 1.594927
    3  0.9576361 0.9170669 1.090433
    4  0.7704356 0.5935711 1.684718
    5  0.7528086 0.5667209 1.764537
    6  0.6541730 0.4279423 2.336763
    7  0.8588736 0.7376639 1.355631
    8  0.7548694 0.5698277 1.754916
    9  0.8929326 0.7973287 1.254188
    10 0.5870898 0.3446745 2.901288
    11 0.6966463 0.4853160 2.060513
    12 0.5375163 0.2889238 3.461120
    13 0.6908752 0.4773085 2.095081
    14 0.5192932 0.2696654 3.708299
    15 0.6621521 0.4384453 2.280786

    The above coding also generates the following SFA graph:
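    The three efficiency measures are tied together algebraically: lambda = y/sqrt(x), so theta = y^2/x = lambda^2 and delta = 1/theta. A quick base-R check against the first row of the table above:

    ```r
    # First row of the frontier efficiency table above
    lambda1 <- 0.6475214
    theta1  <- 0.4192839
    delta1  <- 2.385019
    all.equal(lambda1^2, theta1, tolerance = 1e-6)  # theta = lambda^2
    all.equal(1/theta1, delta1, tolerance = 1e-6)   # delta = 1/theta
    ```

    So only one of the three measures carries independent information; the other two are monotone transformations of it.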

    Introduction to SFA Machine Learning Analysis for Benchmarking Evaluation

    In this section, I will show how you can use SFA analysis for benchmarking evaluation of changes in companies or organizations, before and after these changes. In the first part of the analysis, the SFA estimates are calculated using the method described in the section above. In order to find the average treatment effect before and after the intervention, the estimates are divided into a before category and an after category. Thereafter, the estimates are divided into a treatment category and a control category. This creates four different categories of SFA estimates, depicted in Table I:

    Table I – Benchmarking evaluation design
    			Pre-intervention	Post-intervention
    Treatment group		A			B
    Control group		C			D

    As the table shows, the first difference is calculated as the pre-intervention value minus the post-intervention value for the treatment group. Thereafter, the second difference is calculated as the pre-intervention value minus the post-intervention value for the control group. Subtracting one difference from the other forms the DiD design and gives the DiD formula: (D-C) – (B-A).
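    Numerically, the DiD estimate is just a difference of differences. A short sketch with illustrative group means (hypothetical values, not the article’s data):

    ```r
    # Hypothetical mean SFA efficiency scores for the four groups in Table I
    A <- 0.95   # pre,  treatment
    B <- 0.90   # post, treatment
    C <- 0.96   # pre,  control
    D <- 0.94   # post, control
    did <- (D - C) - (B - A)   # the DiD formula from the text
    did                        # 0.03
    ```

    Here the treatment group lost more efficiency than the control group over the same period, so the DiD estimate is positive under this sign convention.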

    SFA Machine Learning for Benchmarking Evaluation in R

    Now it is time to code the SFA benchmarking evaluation model in R. The code below uses a DRG healthcare dataset, which is not open source, but it can be applied to any dataset with a time variable (Post) and a treatment variable (Treatment) coded in the above table format. The code below performs the data management and creates an SFA difference-in-differences model for benchmarking evaluation:

    # Load dataset into R
    drg_data <- read_excel("healthcaredata.xlsx")
    # Create data for each Benchmarking Evaluation group
    # (group names follow Table I: A/B = treatment, C/D = control)
    group_A <- subset(drg_data, Post < 1 & Treatment > 0)  # pre,  treatment
    group_B <- subset(drg_data, Post > 0 & Treatment > 0)  # post, treatment
    group_C <- subset(drg_data, Post < 1 & Treatment < 1)  # pre,  control
    group_D <- subset(drg_data, Post > 0 & Treatment < 1)  # post, control

    Now it is time to estimate the SFA difference-in-differences models for the benchmarking evaluation:

    # SFA - group A (pre, treatment)
    SFA_A <- sfa(log(drg) ~ log(los) + log(readmis) + log(agecat) + log(sex), data = group_A)
    # SFA - group B (post, treatment)
    SFA_B <- sfa(log(drg) ~ log(los) + log(readmis) + log(agecat) + log(sex), data = group_B)
    # SFA - group C (pre, control)
    SFA_C <- sfa(log(drg) ~ log(los) + log(readmis) + log(agecat) + log(sex), data = group_C)
    # SFA - group D (post, control)
    SFA_D <- sfa(log(drg) ~ log(los) + log(readmis) + log(agecat) + log(sex), data = group_D)
    # Create a table for the model
    stargazer(SFA_A,SFA_B,SFA_C,SFA_D, type="text",out="/SFADiDModel.doc")
    Table II – Estimated SFA results for the four groups
    		(1)		(2)		(3)		(4)
    VARIABLES	Group A		Group B		Group C		Group D
    LOS		-0.174***	-0.123***	-0.310***	-0.342***
    		(0.0297)	(0.0369)	(0.0305)	(0.0437)
    Sex		-0.0571***	-0.0537*	0.236***	0.00239
    		(0.0221)	(0.0282)	(0.0248)	(0.0380)
    Age category	0.221***	0.170***	0.542***	0.335***
    		(0.0417)	(0.0527)	(0.0338)	(0.0529)
    Readmissions	0.504***	0.477***	0.556***	0.714***
    		(0.0134)	(0.0180)	(0.0144)	(0.0206)
    Constant	2.689***	2.778***	2.197***	2.632**
    		(0.674)		(0.841) 	(0.779)		(1.255)
    Observations	8,461		5,327		8,594		4,523
    Standard errors in parentheses
    *** p<0.01, ** p<0.05, * p<0.1

    Now it is time to predict the technical efficiency measures for the pre- and post-period estimates, and display them in a table:

    # Technical efficiency - group A
    eff_A <- efficiencies(SFA_A, margEff = TRUE)
    # Technical efficiency - group B
    eff_B <- efficiencies(SFA_B, margEff = TRUE)
    # Technical efficiency - group C
    eff_C <- efficiencies(SFA_C, margEff = TRUE)
    # Technical efficiency - group D
    eff_D <- efficiencies(SFA_D, margEff = TRUE)
    # Create a table for the TE model
    stargazer(eff_A,eff_B,eff_C,eff_D, type="text",out="/SFADiDTEModel.doc")
    Table III – Technical Efficiency estimates: DRG as outcome
    Group A		0.967
    Group B		0.982
    Group C		0.966
    Group D		0.970
    Aggregate DiD	0.00969

    As described above, Group A is (pre, treatment), Group B is (post, treatment), Group C is (pre, control) and Group D is (post, control). The aggregate DiD is the DiD estimate (D-C) – (B-A).
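    Given the per-observation efficiency vectors eff_A through eff_D predicted above, the aggregate DiD can be computed from the group means. A sketch with toy vectors (illustrative values, not the article’s results):

    ```r
    # Toy efficiency vectors standing in for the predicted efficiencies
    eff_A <- c(0.96, 0.97, 0.98)   # pre,  treatment
    eff_B <- c(0.98, 0.99, 0.97)   # post, treatment
    eff_C <- c(0.96, 0.97, 0.96)   # pre,  control
    eff_D <- c(0.97, 0.96, 0.98)   # post, control
    # Aggregate DiD = (D-C) - (B-A) on the group means
    agg_did <- (mean(eff_D) - mean(eff_C)) - (mean(eff_B) - mean(eff_A))
    round(agg_did, 4)
    ```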


    References

    1. Using Stargazer in R
    2. Using sfa in R
    3. Using frontier in R
    4. Using tmvtnorm in R
    5. Using htmlTable in R
    6. Using readxl in R
    7. Using foreign in R

