Item Response Modeling of Customer Satisfaction: The Graded Response Model
Next, we must run the analysis and interpret the resulting estimates. Again, R users are fortunate that Dimitris Rizopoulos has provided the ltm package. We will spend some time discussing the results because it takes a couple of examples before it becomes clear that rating scales are ordinal, that each item can measure the same latent trait differently, and that different items differentiate differently at different locations along the individual difference dimension. That is correct, I did say “different items differentiate differently at different locations along the individual difference dimension.”
Suppose that we are given a data set with over 4,000 respondents who completed a customer satisfaction rating scale after taking a flight on a major airline. The scale contained 12 ratings on a five-point scale from 1=very dissatisfied to 5=very satisfied. The 12 ratings can be separated into three different components covering the ticket purchase (e.g., online booking and seat selection), the flight itself (e.g., seat comfort, food/drink, and on-time arrival/departure), and the service provided by employees (e.g., flight attendants and staff at the ticket window or gate).
Proportion Rating Each Category on 5-Point Scale, with Descriptive Statistics

               1      2      3      4      5   mean    sd   skew
Purchase_1   0.005  0.007  0.079  0.309  0.599  4.49  0.72  -1.51
Purchase_2   0.004  0.015  0.201  0.377  0.404  4.16  0.82  -0.63
Purchase_3   0.007  0.016  0.137  0.355  0.485  4.30  0.82  -1.09
Flight_1     0.025  0.050  0.205  0.389  0.330  3.95  0.98  -0.86
Flight_2     0.022  0.055  0.270  0.403  0.251  3.81  0.95  -0.60
Flight_3     0.024  0.053  0.305  0.393  0.224  3.74  0.94  -0.52
Flight_4     0.006  0.022  0.191  0.439  0.342  4.09  0.81  -0.66
Flight_5     0.048  0.074  0.279  0.370  0.229  3.66  1.06  -0.63
Flight_6     0.082  0.151  0.339  0.259  0.169  3.28  1.16  -0.23
Service_1    0.002  0.008  0.101  0.413  0.475  4.35  0.71  -0.91
Service_2    0.004  0.013  0.091  0.389  0.503  4.37  0.74  -1.17
Service_3    0.009  0.018  0.147  0.422  0.405  4.19  0.82  -0.97
We can see the three components in the correlation matrix below. The Service ratings form the most coherent cluster, followed by Purchase and possibly Flight. If one were looking for factors, it seems that three could be extracted. That is, the three Service items seem to “hang” together in the lower right-hand corner. Perhaps one could argue for a similar clustering among the three Purchase ratings in the upper left-hand corner. Yet, the six Flight variables might cause us to pause because they are not that highly interrelated. But they do have lower correlations with the Purchase ratings, so maybe Flight will load on a separate factor given the appropriate rotation. On the other hand, if one were seeking a single underlying dimension, one could point to the uniformly positive correlations among all the ratings, which fall not far from an average value of 0.39. Previously, we have referred to this pattern of correlations as a positive manifold.
            P_1   P_2   P_3   F_1   F_2   F_3   F_4   F_5   F_6   S_1   S_2   S_3
Purchase_1  1.00  0.43  0.46  0.34  0.30  0.33  0.39  0.28  0.25  0.39  0.40  0.37
Purchase_2  0.43  1.00  0.46  0.37  0.43  0.33  0.36  0.31  0.34  0.43  0.40  0.42
Purchase_3  0.46  0.46  1.00  0.29  0.36  0.34  0.41  0.29  0.30  0.38  0.44  0.45
Flight_1    0.34  0.37  0.29  1.00  0.37  0.43  0.37  0.45  0.35  0.44  0.44  0.40
Flight_2    0.30  0.43  0.36  0.37  1.00  0.42  0.38  0.35  0.45  0.40  0.39  0.36
Flight_3    0.33  0.33  0.34  0.43  0.42  1.00  0.52  0.38  0.40  0.33  0.38  0.35
Flight_4    0.39  0.36  0.41  0.37  0.38  0.52  1.00  0.35  0.35  0.35  0.37  0.37
Flight_5    0.28  0.31  0.29  0.45  0.35  0.38  0.35  1.00  0.38  0.37  0.39  0.40
Flight_6    0.25  0.34  0.30  0.35  0.45  0.40  0.35  0.38  1.00  0.40  0.38  0.35
Service_1   0.39  0.43  0.38  0.44  0.40  0.33  0.35  0.37  0.40  1.00  0.66  0.51
Service_2   0.40  0.40  0.44  0.44  0.39  0.38  0.37  0.39  0.38  0.66  1.00  0.70
Service_3   0.37  0.42  0.45  0.40  0.36  0.35  0.37  0.40  0.35  0.51  0.70  1.00
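The positive manifold claim is easy to check directly by averaging the off-diagonal correlations. A minimal sketch; the name `ratings` in the final comment is a stand-in for whatever data frame holds the 12 rating columns:

```r
# mean of the off-diagonal entries of a correlation matrix
mean_offdiag <- function(R) mean(R[lower.tri(R)])

# toy check on a 2 x 2 correlation matrix
mean_offdiag(matrix(c(1, 0.4, 0.4, 1), nrow = 2))  # 0.4

# with the actual data, mean_offdiag(cor(ratings)) comes out near 0.39
```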
This section will attempt a minimalist account of fitting the graded response model to these 12 satisfaction ratings. As with all item response models, the observed item response is a function of the latent trait. In this case, however, we have a rating scale rather than a binary yes/no or correct/incorrect item; we have a graded response between very dissatisfied and very satisfied. The graded response model assumes only that the observed item response is an ordered categorical variable. The rating values from one to five indicate order only and nothing about the distance between the values. The rating scale is treated as ordered but not equal-interval.
Each item has its own response category characteristic curves, like the ones shown above for Service_2, and each curve represents the relationship between the latent trait and the observed ratings. But what should we be looking for in these curves? What would a “good” item look like, or more specifically, is Service_2 a good item? Immediately, we note that the top-box is reached rather “early” along the latent trait. Everyone above the mean has a greater probability of selecting “5” than any other category. It should be noted that this is consistent with our original frequency table at the beginning of this post. Half of the respondents gave Service_2 a rating of five, so Service_2 is unable to differentiate between respondents at the mean, one standard deviation above the mean, or two standard deviations above the mean.
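To make these curves concrete, here is a minimal sketch of how the graded response model turns a latent trait score into the five category probabilities. The discrimination and cutpoint values are taken from the Service_2 row of the coefficient table later in this post:

```r
# graded response model: each cutpoint has its own 2PL boundary curve,
# P(rating >= k | z) = plogis(a * (z - b[k-1])); category probabilities
# are differences between adjacent boundary curves
grm_probs <- function(z, a, b) {
  p_ge <- c(1, plogis(a * (z - b)), 0)  # P(rating >= 1), ..., P(rating >= 6)
  -diff(p_ge)                           # P(rating = 1), ..., P(rating = 5)
}

# Service_2 estimates: discrimination and four ordered cutpoints
a <- 3.13
b <- c(-2.91, -2.33, -1.39, -0.07)
round(grm_probs(0, a, b), 3)  # at the mean of the trait, "5" is already the modal rating
```

Evaluating this function over a grid of z values and plotting each column reproduces the category characteristic curves drawn by ltm.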
Although one could take the time to examine each item's characteristic curves carefully, there is an easier method for comparing items. We can present the parameter estimates from which these curves were constructed, as shown below.
Coefficients:

              Extrmt1   Extrmt2   Extrmt3   Extrmt4   Dscrmn
Purchase_1     -3.72     -3.20     -1.86     -0.39     1.75
Purchase_2     -3.98     -3.00     -1.11      0.29     1.72
Purchase_3     -3.57     -2.85     -1.41      0.01     1.73
Flight_1       -2.87     -2.07     -0.88      0.58     1.64
Flight_2       -3.08     -2.14     -0.64      0.95     1.56
Flight_3       -3.05     -2.15     -0.50      1.11     1.53
Flight_4       -3.94     -2.87     -1.17      0.55     1.58
Flight_5       -2.53     -1.79     -0.46      1.08     1.52
Flight_6       -2.26     -1.21      0.21      1.51     1.35
Service_1      -3.46     -2.76     -1.46      0.02     2.50
Service_2      -2.91     -2.33     -1.39     -0.07     3.13
Service_3      -2.86     -2.32     -1.17      0.24     2.45

             1 vs. 2-5  1-2 vs. 3-5  1-3 vs. 4-5  1-4 vs. 5

The columns labeled with the prefix “Extrmt” are the extremity parameters, the cutpoints that separate the categories as shown in the bottom row (e.g., 1 vs. 2-5). This might seem confusing at first, so we will walk through it slowly. The first column, Extrmt1, separates the bottom-box from the top four categories (1 vs. 2-5). So, for Flight_6, anyone with a latent score of -2.26 has a 50-50 chance of assigning a rating of 1. And what is the latent trait score that yields a 50-50 chance of selecting 1 or 2 versus 3, 4, or 5? Correct, the value is -1.21. Finally, the latent score at which a respondent has a 50-50 chance of giving Flight_6 a top-box rating is 1.51.
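Put another way, each cutpoint is simply the trait value at which the corresponding boundary curve crosses 0.5. A quick check using the Flight_6 estimates:

```r
a <- 1.35                          # Flight_6 discrimination
b <- c(-2.26, -1.21, 0.21, 1.51)   # Flight_6 cutpoints
# at z equal to each cutpoint, the boundary probability is exactly 0.5
plogis(a * (b - b))                # 0.5 0.5 0.5 0.5
# e.g., at z = 1.51 a respondent is as likely to rate 1-4 as to rate 5
plogis(a * (1.51 - b[4]))          # 0.5
```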
What about the last column with the estimates of the discrimination parameters? We have already noted that there is a benefit when the characteristic curves for each of the category levels have high peaks. The higher the peaks, the less overlap between the category values and the greater the discrimination between the rating scores. Thus, although the ratings for Flight_6 span the range of the latent trait, their curves are relatively flat and their discrimination is low. Service_2, on the other hand, has a higher discrimination because its curves are more peaked, even if those curves are concentrated toward the lower end of the latent trait.
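One way to see why discrimination governs peakedness: the slope of a logistic boundary curve at its own cutpoint is a/4, so an item with a larger discrimination has steeper boundary curves and therefore more sharply peaked category curves. A small comparison using the two estimates just discussed:

```r
# slope of plogis(a * (z - b)) evaluated at z = b is a/4
boundary_slope <- function(a) a / 4
boundary_slope(1.35)  # Flight_6: about 0.34, relatively flat
boundary_slope(3.13)  # Service_2: about 0.78, more than twice as steep
```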
Item Information
We remember that the observed ratings are indicators of the latent variable, and each item provides some information about the underlying latent trait. The term information is used in IRT to indicate the precision with which the latent trait is measured: information is the reciprocal of the error variance, so a high information value is associated with a small standard error of measurement. Unlike classical test theory with its single value of reliability, IRT does not assume that measurement precision is constant for all levels of the latent trait. The figure below displays how well each item performs as a function of the latent trait.
The green curve yielding the most information at low levels of the latent trait is our familiar Service_2. Along with it are Service_1 (red) and Service_3 (blue). The three Purchase ratings are labeled 1, 2, and 3. The six ratings of the Flight are the only curves providing any information in the upper ranges of the latent variable (numbered 4-9). All of this is consistent with the item distributions. The means for all the Purchase and Service ratings are above 4.0. The means for the Flight items are not much better, but most of these means are below 4.0.
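The information-to-precision conversion is direct: the standard error of measurement at a given trait level is one over the square root of the information available there. A small illustration with hypothetical information values:

```r
# information is the reciprocal of the error variance of the trait
# estimate, so SE(theta) = 1 / sqrt(information(theta))
info <- c(0.5, 2, 8)        # hypothetical information at three trait levels
round(1 / sqrt(info), 2)    # 1.41 0.71 0.35 -- more information, smaller SE
```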
So that is the graded response model for a series of ratings measuring a single underlying dimension. We wanted to be able to differentiate customers who are delighted from those who are disgusted and everyone in between. Although we often speak about customer satisfaction as if it were a characteristic of the brand (e.g., #1 in customer satisfaction), it is not a brand attribute. Customer satisfaction is an individual difference dimension that spans a very wide range. We need multiple items because different portions of this continuum have different definitions. It is failure to deliver the basics that generates dissatisfaction, so we must include ratings tapping the basic features and services. But it is meeting and exceeding expectations that produces the highest satisfaction levels. As was clear from the item information curves, we failed to include such harder-to-deliver items in our battery of ratings.
library(psych)   # describe(), scree(), omega()
library(ltm)     # descript(), grm(), factor.scores()

describe(data)                  # means and SDs for data file with 12 ratings
cor(data)                       # correlation matrix
scree(data, factors=FALSE)      # scree plot
omega(data)                     # runs bifactor model
descript(data)                  # runs frequency tables for every item
fit <- grm(data)                # fits the graded response model
fit                             # prints cutpoints and discrimination
plot(fit, type="IIC")           # plots item information curves
pattern <- factor.scores(fit, resp.pattern=data)  # estimates latent trait scores
trait <- pattern$score.dat$z1   # extracts the trait score for each respondent