Statistics Sunday: Fit Statistics in Structural Equation Modeling


Fit Measures

In my video on interpreting confirmatory factor analysis output, I promised a post on the various fit statistics. And here we are! As I said in the video, when you conduct structural equation modeling, the program compares the observed data – specifically, the observed covariance matrix – to the model-specified covariance matrix. Fit refers to how well those two things match up. Any metric examining fit uses those two pieces of information.

There are two types of fit statistics in structural equation modeling: absolute fit and relative fit. When assessing model fit, you should use a combination of both. Nearly all of these indices are derived in some way from chi-square, which is itself neither a measure of absolute nor relative fit. So let's start there.

Chi-Square

Chi-square is an exception to the absolute versus relative fit dichotomy. It’s a measure of exact fit: does your model fit the data? Any deviations between the observed covariance matrix and the model-specified covariance matrix are tallied up, giving an overall metric of the difference between observed and model-specified. If the chi-square is not significant, the model fits your data. If it is significant, the model does not fit your data.

The problem is that chi-square is biased toward significance with large sample sizes and/or large correlations between variables. So for many models, your chi-square will indicate the model does not fit the data, even if it's actually a good model. One way to correct for this is with the normed chi-square I mentioned in the video: divide chi-square by your degrees of freedom. There is no agreed-upon cutoff value for normed chi-square. Personally, I use the critical value for a chi-square with 1 degree of freedom, 3.841. I've been told that's both too liberal and too conservative. Like I said: no agreed-upon cutoff value.
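As a quick demonstration, here's the normed chi-square for the Satisfaction with Life model fit later in this post, which produces a chi-square of 26.76 on 5 degrees of freedom:

chi.sq = 26.76
df = 5
chi.sq/df

## [1] 5.352

That value is above my personal cutoff of 3.841, foreshadowing the poor RMSEA we'll see below.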

But chi-square is still very useful for two reasons. First, we use it to compute other fit indices. I’ll talk about that next. Second, we can use it to compare nested models. You can find out more about that a little farther down in this post.

You may ask, then – if chi-square is biased to be significant, why do we use it for all of our other fit indices? The calculations conducted to create these different fit indices are meant to correct for these biases in different ways, factoring in things like sample size or model complexity. That underlying bias is there, though, and there are many different ways to try to correct for it, each way with its own flaws. This is why you should look at a range of fit indices.

Because your fit indices are based on chi-square, which is reported by whatever statistical program you use to conduct your SEM, you can compute any of these fit indices yourself, even if your program doesn't give them to you.

Measures of Absolute Fit

These measures are based on the assumption that the perfect model has a fit of 0 – or rather, no deviation between observed and model-specified covariance matrices. As a result, these measures tell you how much worse your model is than the theoretically perfect model, and are sometimes called badness of fit measures. For these measures, smaller is better.

Root Mean Square Error of Approximation (RMSEA)

Chi-square is a little like ANOVA in how it deals with variance. This is why it's chi-square: we measure deviations from central tendency by squaring them, to keep them from summing to 0. The same thing is done in ANOVA: squared deviations are added up, producing the sum of squares. That value is divided by degrees of freedom to produce the mean square, which is then used in the calculation of the F statistic. RMSEA is calculated in a very similar way to this process of creating a sum of squares and then a mean square:

RMSEA = √(χ² − df) / √[df(N − 1)]

where df is the model degrees of freedom and N is the total sample size.

Chi-square is biased to be significant, so the higher the degrees of freedom, the higher the chi-square will likely be. In fact, the expected value of chi-square is equal to its degrees of freedom. The expected value of RMSEA for a perfectly fitting model, then, is 0, since in the equation above, degrees of freedom is subtracted from chi-square. There is no single agreed-upon cutoff for RMSEA, though 0.05 and 0.07 are commonly used.

Let’s look once again at the fit measures from the Satisfaction with Life confirmatory factor analysis. In fact, here’s a trick I didn’t introduce previously – while including fit.measures=TRUE in the summary function will give you only a small number of fit measures, you can access more information with fitMeasures:

Facebook<-read.delim(file="small_facebook_set.txt", header=TRUE)
SWL_Model<-'SWL =~ LS1 + LS2 + LS3 + LS4 + LS5'
library(lavaan)

## This is lavaan 0.5-23.1097

## lavaan is BETA software! Please report any bugs.

SWL_Fit<-cfa(SWL_Model, data=Facebook)
fitMeasures(SWL_Fit)

##                npar                fmin               chisq 
##              10.000               0.052              26.760 
##                  df              pvalue      baseline.chisq 
##               5.000               0.000             635.988 
##         baseline.df     baseline.pvalue                 cfi 
##              10.000               0.000               0.965 
##                 tli                nnfi                 rfi 
##               0.930               0.930               0.916 
##                 nfi                pnfi                 ifi 
##               0.958               0.479               0.966 
##                 rni                logl   unrestricted.logl 
##               0.965           -2111.647           -2098.267 
##                 aic                 bic              ntotal 
##            4243.294            4278.785             257.000 
##                bic2               rmsea      rmsea.ci.lower 
##            4247.082               0.130               0.084 
##      rmsea.ci.upper        rmsea.pvalue                 rmr 
##               0.181               0.003               0.106 
##          rmr_nomean                srmr        srmr_bentler 
##               0.106               0.040               0.040 
## srmr_bentler_nomean         srmr_bollen  srmr_bollen_nomean 
##               0.040               0.040               0.040 
##          srmr_mplus   srmr_mplus_nomean               cn_05 
##               0.040               0.040             107.321 
##               cn_01                 gfi                agfi 
##             145.888               0.959               0.876 
##                pgfi                 mfi                ecvi 
##               0.320               0.959               0.182

The RMSEA is 0.13. We can recreate this using the model chi-square (called chisq above), degrees of freedom (df), and sample size (ntotal):
chi.sq=26.76
df = 5
N = 257
sqrt(chi.sq-df)/sqrt(df*(N-1))

## [1] 0.130384
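If you compute this often, you can wrap the formula in a small helper function. This is just a sketch: the max() call truncates at 0, since RMSEA is conventionally set to 0 whenever chi-square falls below its degrees of freedom.

rmsea <- function(chisq, df, n) {
  # truncate at 0 so a chi-square below its df doesn't produce NaN
  sqrt(max(chisq - df, 0)/(df*(n - 1)))
}
rmsea(26.76, 5, 257)

## [1] 0.130384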

Standardized Root Mean Square Residual

The standardized root mean square residual (SRMR) is the square root of the average squared residual between the observed covariance matrix and the model-specified covariance matrix, standardized to range between 0 and 1. SRMR is positively biased – more so for models with few degrees of freedom or small sample sizes – but it includes no penalty for model complexity, so it has the unusual characteristic of being smaller (i.e., showing better fit) for more complex models. If you remember from the CFA post and video, both models showed poor fit on many of the fit indices but showed good fit based on SRMR. In essence, SRMR rewards something that is penalized by other fit indices. Also unlike the other fit indices discussed here, SRMR is not based on chi-square; you can read more about its calculation here.
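Because SRMR is already part of the fitMeasures() output above, you can pull it (or any other index) by name:

fitMeasures(SWL_Fit, "srmr")

##  srmr 
## 0.040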

Measures of Relative Fit

In addition to measures of absolute fit, which deal with deviations of the observed covariance matrix from the model-specified covariance matrix, we have measures of relative fit, which compare our model to another theoretical model: the null model, sometimes called the independence model. This model assumes that all included variables are independent, or uncorrelated with each other. This is basically the worst possible model, and fit measures using it can be thought of as goodness of fit measures - how much better does your model fit than the worst possible model you could have? In the fitMeasures output above, this value appears as baseline.chisq (with its degrees of freedom as baseline.df). So let's create new variables for our calculations, null and null.df, using those baseline values.

null=635.988
null.df=10

In general, closer to 1 is better. Anything lower than 0.9 would be considered poor fit. If any of these formulas produce a value higher than 1, the fit measure is set at 1.

Normed Fit Index (NFI)

According to David Kenny, this was the first fit measure proposed in the literature. It's computed as the difference between the null and observed model chi-squares, divided by the null chi-square.

(null-chi.sq)/null

## [1] 0.9579237

This measure doesn't provide any kind of correction for more complex models, so it isn't recommended for use. (Although, when I was in grad school, which wasn't that long ago, it was one of the recommended measures in my SEM course. How quickly things change...)

Tucker-Lewis Index (TLI)

This measure is also sometimes called the Non-Normed Fit Index (NNFI). It is similar to the NFI but corrects for more complex models by taking a ratio of each chi-square and its corresponding degrees of freedom.

((null/null.df)-(chi.sq/df))/((null/null.df)-1)

## [1] 0.9304779

Comparative Fit Index (CFI)

CFI provides an estimate very similar to, and slightly higher than, the NNFI/TLI. The penalty for complexity is smaller than for the TLI. Instead of taking a ratio of chi-square to degrees of freedom, CFI uses the difference between chi-square and the corresponding degrees of freedom.

((null-null.df)-(chi.sq-df))/(null-null.df)

## [1] 0.965239
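These hand calculations match the nfi, tli, and cfi values lavaan reported in the fitMeasures() output above:

fitMeasures(SWL_Fit, c("nfi", "tli", "cfi"))

##   nfi   tli   cfi 
## 0.958 0.930 0.965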

There are many other fit indices you'll see listed in the fit measures output. GFI and AGFI (which are actually absolute fit measures) were developed by the creators of the LISREL software and are automatically computed by that program. However, pretty much everything else I've read says not to use these fit indices. (Again, different from what I heard in grad school.) I prefer to use CFI and TLI. CFI is always going to be higher than TLI, because it penalizes you less for model complexity than the TLI does. So using both gives you a sort of range of goodness of fit, with the lower end of the continuum (TLI) being more conservative than the upper end (CFI). They're similar, so they'll often tell you the same thing, but you can run into the situation of having a TLI just below your cutoff and a CFI just above it.

Comparing Nested Models

I mentioned in the video the idea of nested versus non-nested models. First, let's talk about nested models. A nested model is another model you specify that has the same structure but adds or drops paths. For instance, I fit two three-factor models of the Rumination Response Scale: one in which the 3 factors were allowed to correlate with each other and another where they were considered orthogonal (uncorrelated). If I drew out these two models, they would look the same, except that one would have curved arrows between the 3 factors to reflect the correlations and the other would not. Because I'm comparing two models with the same structure, I can test the impact of that change with my chi-square values.

RRS_Model<- '
  Depression =~ Rum1 + Rum2 + Rum3 + Rum4 + Rum6 + Rum8 + 
    Rum9 + Rum14 + Rum17 + Rum18 + Rum19 + Rum22
  Reflecting =~ Rum7 + Rum11 + Rum12 + Rum20 + Rum21
  Brooding =~ Rum5 + Rum10 + Rum13 + Rum15 + Rum16
'
RRS_Fit<-cfa(RRS_Model, data=Facebook)
RRS_Fit2<-cfa(RRS_Model, data=Facebook, orthogonal=TRUE)
summary(RRS_Fit)

## lavaan (0.5-23.1097) converged normally after  40 iterations
## 
##   Number of observations                           257
## 
##   Estimator                                         ML
##   Minimum Function Test Statistic              600.311
##   Degrees of freedom                               206
##   P-value (Chi-square)                           0.000
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Standard Errors                             Standard
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   Depression =~                                       
##     Rum1              1.000                           
##     Rum2              0.867    0.124    6.965    0.000
##     Rum3              0.840    0.124    6.797    0.000
##     Rum4              0.976    0.126    7.732    0.000
##     Rum6              1.167    0.140    8.357    0.000
##     Rum8              1.147    0.141    8.132    0.000
##     Rum9              1.095    0.136    8.061    0.000
##     Rum14             1.191    0.135    8.845    0.000
##     Rum17             1.261    0.141    8.965    0.000
##     Rum18             1.265    0.142    8.887    0.000
##     Rum19             1.216    0.135    8.992    0.000
##     Rum22             1.257    0.142    8.870    0.000
##   Reflecting =~                                       
##     Rum7              1.000                           
##     Rum11             0.906    0.089   10.138    0.000
##     Rum12             0.549    0.083    6.603    0.000
##     Rum20             1.073    0.090   11.862    0.000
##     Rum21             0.871    0.088    9.929    0.000
##   Brooding =~                                         
##     Rum5              1.000                           
##     Rum10             1.092    0.133    8.216    0.000
##     Rum13             0.708    0.104    6.823    0.000
##     Rum15             1.230    0.143    8.617    0.000
##     Rum16             1.338    0.145    9.213    0.000
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   Depression ~~                                       
##     Reflecting        0.400    0.061    6.577    0.000
##     Brooding          0.373    0.060    6.187    0.000
##   Reflecting ~~                                       
##     Brooding          0.419    0.068    6.203    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .Rum1              0.687    0.063   10.828    0.000
##    .Rum2              0.796    0.072   11.007    0.000
##    .Rum3              0.809    0.073   11.033    0.000
##    .Rum4              0.694    0.064   10.857    0.000
##    .Rum6              0.712    0.067   10.668    0.000
##    .Rum8              0.778    0.072   10.746    0.000
##    .Rum9              0.736    0.068   10.768    0.000
##    .Rum14             0.556    0.053   10.442    0.000
##    .Rum17             0.576    0.056   10.370    0.000
##    .Rum18             0.611    0.059   10.418    0.000
##    .Rum19             0.526    0.051   10.352    0.000
##    .Rum22             0.609    0.058   10.428    0.000
##    .Rum7              0.616    0.067    9.200    0.000
##    .Rum11             0.674    0.069    9.746    0.000
##    .Rum12             0.876    0.080   10.894    0.000
##    .Rum20             0.438    0.056    7.861    0.000
##    .Rum21             0.673    0.068    9.867    0.000
##    .Rum5              0.955    0.090   10.657    0.000
##    .Rum10             0.663    0.065   10.154    0.000
##    .Rum13             0.626    0.058   10.819    0.000
##    .Rum15             0.627    0.064    9.731    0.000
##    .Rum16             0.417    0.050    8.368    0.000
##     Depression        0.360    0.072    4.987    0.000
##     Reflecting        0.708    0.111    6.408    0.000
##     Brooding          0.455    0.096    4.715    0.000

summary(RRS_Fit2)

## lavaan (0.5-23.1097) converged normally after  31 iterations
## 
##   Number of observations                           257
## 
##   Estimator                                         ML
##   Minimum Function Test Statistic             1007.349
##   Degrees of freedom                               209
##   P-value (Chi-square)                           0.000
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Standard Errors                             Standard
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   Depression =~                                       
##     Rum1              1.000                           
##     Rum2              0.903    0.129    6.985    0.000
##     Rum3              0.915    0.129    7.065    0.000
##     Rum4              1.071    0.134    8.023    0.000
##     Rum6              1.245    0.147    8.462    0.000
##     Rum8              1.142    0.145    7.849    0.000
##     Rum9              1.124    0.141    7.961    0.000
##     Rum14             1.219    0.140    8.686    0.000
##     Rum17             1.198    0.143    8.374    0.000
##     Rum18             1.189    0.144    8.235    0.000
##     Rum19             1.240    0.141    8.806    0.000
##     Rum22             1.215    0.145    8.380    0.000
##   Reflecting =~                                       
##     Rum7              1.000                           
##     Rum11             0.999    0.100    9.952    0.000
##     Rum12             0.614    0.090    6.842    0.000
##     Rum20             1.002    0.100    9.979    0.000
##     Rum21             0.971    0.098    9.875    0.000
##   Brooding =~                                         
##     Rum5              1.000                           
##     Rum10             1.132    0.150    7.536    0.000
##     Rum13             0.662    0.112    5.901    0.000
##     Rum15             1.295    0.164    7.914    0.000
##     Rum16             1.461    0.176    8.292    0.000
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   Depression ~~                                       
##     Reflecting        0.000                           
##     Brooding          0.000                           
##   Reflecting ~~                                       
##     Brooding          0.000                           
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .Rum1              0.692    0.065   10.637    0.000
##    .Rum2              0.777    0.072   10.829    0.000
##    .Rum3              0.766    0.071   10.808    0.000
##    .Rum4              0.630    0.060   10.454    0.000
##    .Rum6              0.653    0.064   10.184    0.000
##    .Rum8              0.790    0.075   10.537    0.000
##    .Rum9              0.719    0.069   10.485    0.000
##    .Rum14             0.540    0.054    9.999    0.000
##    .Rum17             0.640    0.062   10.247    0.000
##    .Rum18             0.686    0.066   10.337    0.000
##    .Rum19             0.513    0.052    9.881    0.000
##    .Rum22             0.655    0.064   10.243    0.000
##    .Rum7              0.656    0.075    8.790    0.000
##    .Rum11             0.588    0.069    8.491    0.000
##    .Rum12             0.838    0.079   10.604    0.000
##    .Rum20             0.582    0.069    8.446    0.000
##    .Rum21             0.580    0.067    8.613    0.000
##    .Rum5              0.993    0.096   10.386    0.000
##    .Rum10             0.671    0.071    9.454    0.000
##    .Rum13             0.671    0.063   10.729    0.000
##    .Rum15             0.616    0.072    8.530    0.000
##    .Rum16             0.342    0.064    5.368    0.000
##     Depression        0.354    0.073    4.867    0.000
##     Reflecting        0.668    0.112    5.972    0.000
##     Brooding          0.417    0.096    4.332    0.000

The first model, where the 3 factors are allowed to correlate, produces a chi-square of 600.311, with 206 degrees of freedom. The second model, where the 3 factors are forced to be orthogonal, produces a chi-square of 1007.349, with 209 degrees of freedom. I can compare these two models by looking at the difference in chi-square between them. That produces a chi-square with degrees of freedom equal to the difference between df for model 1 and df for model 2.

1007.349-600.311

## [1] 407.038

This gives me a change in chi-square (Δχ2) of 407.038, with 3 degrees of freedom. I don't even need to check a chi-square table to tell you that value is significant. (I looked it up and was informed my p-value is less than 0.00001.) So forcing the 3 factors to be orthogonal significantly worsens model fit. This provides further evidence that the 3 subscales are highly correlated with each other.
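You can get the exact p-value with pchisq(), and lavaan will even run the full nested-model comparison for you – calling anova() on two lavaan fits performs the chi-square difference (likelihood ratio) test:

# p-value for a chi-square of 407.038 on 209 - 206 = 3 degrees of freedom
pchisq(407.038, df=3, lower.tail=FALSE)

# lavaan's built-in chi-square difference test for nested models
anova(RRS_Fit, RRS_Fit2)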

Information Criterion Measures

There are a few other fit indices that don't really fall within absolute or relative fit. These are the information criterion measures: the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), and the Sample-Size Adjusted BIC. These fit indices are only meaningful when comparing two different models fit to the same data, and they're especially useful when those models are non-nested, since the chi-square difference test no longer applies. For instance, let's say that in addition to examining a single-factor model of the Satisfaction with Life Scale, I also tested a two-factor model. These two models have a different structure, so they would be non-nested models. I can't look at the difference in chi-square to figure out which model is better. Instead, I can compare my information criterion measures, as in the sketch below. I prefer to use AIC. In this case, the model with the lowest AIC is the superior model.
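Here's a minimal sketch of that comparison. The two-factor split of the five SWL items below (and the names SWL_Model2, SWL_Fit2, Factor1, Factor2) is purely illustrative, not a theoretically motivated model:

SWL_Model2<-'
  Factor1 =~ LS1 + LS2 + LS3
  Factor2 =~ LS4 + LS5
'
SWL_Fit2<-cfa(SWL_Model2, data=Facebook)
fitMeasures(SWL_Fit, c("aic", "bic"))
fitMeasures(SWL_Fit2, c("aic", "bic"))

Whichever model returns the lower AIC is the better-fitting model.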

Fit measures are a hotly debated topic in structural equation modeling, with disagreement on which ones to use, which cutoffs to apply, and even whether we should be using them at all. (What can I say? We statisticians don't get out much.) Regardless of where you fall on the debate, if you're testing a structural equation model, chances are someone is going to ask to see fit measures, so it's best to provide them even if you hate them with a fiery passion. And though people will likely disagree with which ones you selected and which cutoffs you use, the best things you can do are 1) pick your fit measures before conducting your analysis and stick to them - do not cherry-pick fit measures that make your model look good, and 2) provide sources to back up which ones you used and which cutoffs you selected. My recommendations for sources are:

1. Hooper, D., Coughlan, J., & Mullen, M. R. (2008). Structural equation modelling: Guidelines for determining model fit. Electronic Journal of Business Research Methods, 6(1), 53-60.
2. Hu, L., & Bentler, P. M. (1998). Fit indices in covariance structure modeling: Sensitivity to underparameterized model misspecification. Psychological Methods, 3(4), 424-453.
