
Quality of Life Research (2005) 14: 599–609

© Springer 2005

An application of structural equation modeling to detect response shifts and true change in quality of life data from cancer patients undergoing invasive surgery

Frans J. Oort, Mechteld R.M. Visser & Mirjam A.G. Sprangers
Department of Medical Psychology, Academic Medical Centre, University of Amsterdam, The Netherlands (E-mail: [email protected])

Accepted in revised form 14 June 2004

Abstract
The objective is to show how structural equation modeling can be used to detect reconceptualization, reprioritization, and recalibration response shifts in quality of life data from cancer patients undergoing invasive surgery. A consecutive series of 170 newly diagnosed cancer patients, heterogeneous with respect to cancer site, was included. Patients were administered the SF-36 and a short version of the multidimensional fatigue inventory prior to surgery and 3 months following surgery. Indications of response shift effects were found for five SF-36 scales: reconceptualization of ‘general health’, reprioritization of ‘social functioning’, and recalibration of ‘role-physical’, ‘bodily pain’, and ‘vitality’. Accounting for these response shifts, we found deteriorated physical health, deteriorated general fitness, and improved mental health. The response shift effects on observed change were small. Yet, accounting for the recalibration response shifts did change the estimate of true change in physical health from medium to large. The structural equation modeling approach was found to be useful in detecting response shift effects. The extent to which the procedure is guided by subjective decisions is discussed.

Key words: Cancer, Health related quality of life, Response shift, Structural equation modeling

Introduction
When assessing self-reported change we must account for recalibration, reprioritization, and reconceptualization response shifts. Recalibration refers to a change in the respondent’s internal standards of measurement, reprioritization to a change in the respondent’s values, and reconceptualization to a change in the respondent’s understanding of the target construct [1, 2]. Oort [17] proposes a procedure for the detection of response shifts and the measurement of true change through structural equation modeling. The procedure applies if one or more target constructs (e.g. health-related quality of life) are measured with multiple items, or scales of items (e.g. the items or scales of a quality of life questionnaire).

In this procedure, operationalizations of response shifts are based on the idea that reconceptualization refers to a change in the meaning of the item content, reprioritization to a change in the relative importance of the item as an indicator of the target construct, and recalibration to a change in the meaning of (the labeling of) the response options of the item (i.e., the anchors of the response scale). Oort distinguishes between uniform and nonuniform recalibration. If the meaning of all response scale anchors changes in the same way, that is, if the recalibration affects all anchors in the same direction and to the same extent, the recalibration is uniform. Otherwise, the recalibration is nonuniform.

The detection procedure models group means and covariances. It will therefore only detect response shifts and true change if these phenomena are experienced by a substantial part of the respondents. In other words, the procedure aims at detecting response shifts and true change at the group level rather than the individual level. For a detailed account of the procedure, we refer the reader to Oort [17].

The present objective is to illustrate the response shift detection procedure by applying it to data from cancer patients who underwent invasive surgery. The surgery induced severe and sustained physical limitations, thus requiring the patients to accommodate to their deteriorating condition. We therefore expect to find response shifts in these data, because, according to Sprangers and Schwartz [1], response shifts are most likely to occur if the change in patients’ health status is recent, intense and pervasive, thus requiring adaptation.

Method
Cancer patients’ health-related quality of life was first assessed prior to surgery, shortly after diagnosis. The second assessment was 3 months following surgery.

Patients
A consecutive series of 170 newly diagnosed cancer patients were enrolled, including 29 lung cancer patients (17%) waiting for either lobectomy or pneumectomy, 43 pancreatic cancer patients (25%) waiting for Whipple or bypass surgery, 46 esophageal cancer patients (27%) waiting for either transhiatal or transthoracal surgery, and 52 cervical cancer patients (31%) waiting for radical hysterectomy. Exclusion criteria were being younger than 18 years, having a life expectancy less than 9 months, or not being able to complete a (Dutch) questionnaire. The sample consisted of 87 men and 83 women, with ages ranging from 27 to 83 (mean 57.5, standard deviation 14.1).

Measures
Generic health-related quality of life was assessed with the Dutch language version [3] of the SF-36 health survey [4], encompassing eight scales: physical functioning (PF), role limitations due to physical health (role-physical, RP), bodily pain (BP), general health perceptions (GH), vitality (VT), social functioning (SF), role limitations due to emotional problems (role-emotional, RE), and mental health (MH). As effects on patients’ fatigue were of prime interest, we also measured fatigue (FT) with a six-item short form of the multidimensional fatigue inventory (MFI) [5]. Both SF-36 and MFI scale scores were transformed to a scale ranging from 0 to 5, with higher scores indicating better health.

Procedure
Response shift detection and true change assessment were done in four steps [17]: (1) establishment of an appropriate measurement model, (2) fitting a model of no response shifts, (3) detection of response shifts, and (4) assessment of true change. Each of these steps is associated with a particular model.

Step 1. An appropriate measurement model, Model 1, was established on the basis of published results of principal components analyses of the SF-36 [4], results of exploratory factor analyses of the present data, and substantive considerations. Model 1 has no across occasion constraints.

Step 2. In Model 2 all response shift parameters are constrained to be equal across occasions. The comparison of the fit of Models 1 and 2 through the χ² difference test [6] can be used as an overall test of the presence of response shifts. If the difference in fit of these two models is not significant, we may conclude that there are no response shifts, and skip Step 3.

Step 3. In Model 3 all apparent response shifts are accounted for. The specification search of Model 3 started with Model 2, and was guided by modification indices [7] and standardized residuals. Each modification was tested with the χ² difference test [6]. Response shifts are operationalized as across-occasion differences between patterns of common factor loadings (reconceptualization), differences between values of common factor loadings (reprioritization), differences between intercepts (uniform recalibration), and differences between residual variances (nonuniform recalibration).

Step 4. After establishing (partial) invariance of intercepts, factor loadings, and residual variances, we tested other types of invariance, including the equality of common factor means, variances, and correlations. Differences between common factor means indicate ‘unbiased’ or ‘true’ changes in the target constructs [17] (here: health-related quality of life).

Evaluation
The parameter estimates of the final model, Model 4, were used to calculate effect size indices for true change, as well as for the contributions of response shifts and true change to observed change [17]. The response shift effect on the estimation of true change was investigated by comparing estimates of true change from a model in which response shifts were accounted for with estimates from a model in which response shifts were not accounted for.
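The operationalizations in Steps 3 and 4 can be summarized in conventional factor-analytic notation. The block below is only a sketch: the symbols $\tau_t$ (intercepts), $\Lambda_t$ (common factor loadings), $\Theta_t$ (residual covariance matrix), $\alpha_t$ and $\Phi_t$ (common factor means and covariances), and the occasion index $t = 1, 2$ are assumptions made here for illustration and need not match the notation of Oort [17].

```latex
\begin{align}
  x_t &= \tau_t + \Lambda_t \xi_t + \varepsilon_t, \qquad t = 1, 2, \\
  \mathrm{E}(x_t) &= \tau_t + \Lambda_t \alpha_t, \\
  \mathrm{Cov}(x_t) &= \Lambda_t \Phi_t \Lambda_t^{\prime} + \Theta_t .
\end{align}
```

Under these assumptions, a different pattern of zero and nonzero elements in $\Lambda_1$ and $\Lambda_2$ corresponds to reconceptualization, different values of the nonzero elements to reprioritization, $\tau_1 \neq \tau_2$ to uniform recalibration, and $\mathrm{diag}(\Theta_1) \neq \mathrm{diag}(\Theta_2)$ to nonuniform recalibration. The difference $\alpha_2 - \alpha_1$ then represents true change.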

Structural equation modeling
In the four-step procedure, structural equation models were fitted to the means, variances, and covariances of the SF-36 and MFI scale scores, using standard statistical computer programs [7–9] (LISREL and Mx syntax is available upon request).

Identification
To achieve identification of all model parameters, scales and origins of the common factors were established by fixing the means at zero and the variances at one. In Steps 2–4 of the detection procedure, only the first occasion factor means and variances are fixed; second occasion means and variances are then identified by constraining intercepts and factor loadings to be equal across occasions [17].

Estimation
The maximum likelihood estimation method yields a χ² test of overall goodness-of-fit, and standard errors for all parameter estimates [6].

Goodness-of-fit
A significant χ² indicates a significant difference between data and model. However, in the practice of structural equation modeling, exact fit is rare, and with large sample sizes the χ² test generally turns out to be significant. An alternative measure of overall goodness-of-fit is the root mean square error of approximation (RMSEA). According to a generally accepted rule of thumb, an RMSEA value below 0.08 indicates ‘reasonable’ fit and one below 0.05 ‘close’ fit [10]. Yet another fit index is the expected cross-validation index (ECVI), which expresses how well a model fits a ‘validation’ sample when its parameter estimates are obtained with an independent ‘calibration’ sample. The ECVI can be estimated without actually splitting the available data; ECVI values represent discrepancies and can only be interpreted in comparisons of different models for the same data [11].

Results

Table 1 gives pre-surgery and post-surgery means, standard deviations, and Cronbach’s α’s for all SF-36 and MFI scales. All scale scores appear sufficiently reliable, with the exception of the GH pre-surgery scores. The last column of Table 1 presents the standardized difference between pre-surgery and post-surgery means (i.e., Cohen’s d-index [12]), thereby giving a first impression of change in the observed scale means (without accounting for possible response shifts). Conventional t-tests indicated deterioration in PF, RP, BP, VT, and FT, improvement in MH and RE, and no change in SF and GH.

Before continuing with the structural equation modeling, we first checked whether our sample of cancer patients was not too heterogeneous to be used in single-group analysis. We calculated separate mean vectors and covariance matrices for the four groups of patients (lung cancer, pancreatic cancer, esophageal cancer, and cervical cancer patients), and we fitted a multi-group model in which the means, variances, and covariances were assumed to be invariant across the four groups. The hypothesis of exact fit (RMSEA = 0) was rejected (p = 0.001), but the hypothesis of close fit (RMSEA < 0.05) was not rejected (p = 0.585) [10]. So we concluded that we could continue with single-group analyses, although we should keep in mind that our conclusions might not carry the same weight for all patient groups.

Table 1. Means, standard deviations, and reliability estimates for SF-36 and MFI scales

         Before surgery          Three months after surgery   Post-Pre
Scale    Mean    SD      α       Mean    SD      α            d-index#
PF       3.96    1.22    0.93    3.18    1.32    0.92         −0.59**
RP       2.73    2.09    0.86    2.13    2.02    0.84         −0.27**
BP       3.94    1.19    0.87    3.68    1.21    0.92         −0.20*
SF       3.81    1.32    0.78    3.62    1.47    0.79         −0.11
MH       3.25    1.08    0.84    3.69    1.05    0.82          0.40**
RE       3.00    2.12    0.83    3.55    1.93    0.81          0.21*
VT       3.14    1.26    0.86    2.77    1.23    0.89         −0.30**
GH       2.96    0.95    0.50    2.96    1.06    0.78          0.00
FT       3.30    1.10    0.86    2.92    1.18    0.91         −0.35**

Notes: n = 170; SF-36 scales have been scored in the way as described in the manual [4], but for computational convenience, have been divided by 20 afterwards. As a result, all SF-36 and MFI scale scores range from 0 to 5; #Standardized mean difference: 0.2, 0.5, and 0.8 indicate ‘small’, ‘medium’, and ‘large’ differences [12]; *p < 0.01, **p < 0.001 in paired t-test.
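The rescaling and the reliability estimates behind Table 1 can be illustrated with a short Python sketch. It assumes the usual formula for Cronbach's α; the function names and the item scores are hypothetical and made up for illustration, not taken from the study.

```python
from statistics import variance


def rescale_sf36(score_0_100: float) -> float:
    """Map a 0-100 SF-36 scale score onto the 0-5 range by dividing by 20."""
    return score_0_100 / 20.0


def cronbach_alpha(items: list[list[float]]) -> float:
    """Cronbach's alpha; items[i][p] is the score of person p on item i."""
    k = len(items)
    # Total scale score per person: sum over items for each respondent.
    totals = [sum(person_scores) for person_scores in zip(*items)]
    item_variance_sum = sum(variance(item) for item in items)
    return k / (k - 1) * (1.0 - item_variance_sum / variance(totals))


# Hypothetical scores of five respondents on a three-item scale
# (made-up numbers for illustration only, not data from the study).
example_items = [
    [1, 2, 3, 4, 5],
    [2, 2, 3, 5, 5],
    [1, 3, 3, 4, 4],
]

print(f"alpha = {cronbach_alpha(example_items):.2f}")
print(f"SF-36 score of 80 on the 0-5 scale: {rescale_sf36(80.0)}")
```

Sample variances (n − 1 in the denominator) are used throughout; using population variances instead would give the same α, because the ratio of variances is unchanged.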

Below we first describe the measurement model that was used in the response shift detection procedure, then we present the results of response shift detection and true change assessment, and we conclude with an evaluation of the size of response shifts and true change.

Measurement model
Results from exploratory factor analyses and substantive considerations gave rise to the measurement model displayed in Figure 1. The circles represent unobserved, latent variables and the squares represent the observed variables. Three latent variables are the common factors general physical health (GenPhys), general mental health (GenMent), and general fitness (GenFitn). GenPhys is measured by PF, RP, BP, and SF, GenMent is measured by MH, RE, and again SF, and GenFitn is measured by VT, GH, and FT. Other latent variables are the residual factors ResPF, ResRP, ResBP, etc. The residual factors represent all that is specific to PF, RP, BP, etc., plus random error variation [13, 17]. Numbers in Figure 1 are maximum likelihood estimates of common factor loadings, common factor correlations, residual variances, and one residual correlation (numbers separated by a slash represent separate first and second occasion estimates). The measurement model portrayed in Figure 1 resembles the principal components model of the SF-36 scales described by Ware et al. [4]. The general physical and mental components feature in both models, with largely the same indicators.

However, the addition of the MFI scale that specifically measures fatigue, FT, brought about the GenFitn factor. The SF-36 scales VT and GH, consisting of items with wordings that do not distinguish between physical and mental aspects, also loaded on the GenFitn factor. The wording of the SF items combines physical and mental aspects, causing SF to load on both GenPhys and GenMent. Another difference with the principal components model is that in our model the common factors were substantially correlated. The correlation between the GenPhys and GenFitn factors was especially high (0.87), but a two-factor model did not yield satisfactory fit (the preceding exploratory factor analyses showed significant χ² differences between fit measures of two- and three-factor models). Inspection of the standardized residuals showed that the covariance between RP and RE was not sufficiently explained by the correlation between the common factors. We therefore allowed the residual factors ResRP and ResRE to co-vary (0.22 correlation), to account for the very similar wordings of the RP and RE items.

[Figure 1: path diagram of the common factors Gen. Phys., Gen. Ment., and Gen. Fitness and the residual factors Res. PF, Res. RP, Res. BP, Res. SF, Res. MH, Res. RE, Res. GH, Res. VT, and Res. FT]

Figure 1. The measurement model used in response shift detection. Notes: Circles represent latent variables (common and residual factors) and squares represent observed variables (the SF-36 and MFI scales). Abbreviations: PF – physical functioning; RP – role-physical; BP – bodily pain; GH – general health; VT – vitality; SF – social functioning; RE – role-emotional; MH – mental health; FT – Fatigue. Numbers are maximum likelihood estimates of Model 4 parameters: common factor loadings, common factor correlations, residual variances, and a residual correlation. Parameter estimates separated by a slash represent separate first and second occasion estimates; all other parameters were constrained to be equal across occasions. Table 3 gives all Model 4 parameter estimates, for both occasions.

Detection of response shifts and true change
Fit results for the four models that resulted from carrying out the four-step procedure are given in Table 2.

Step 1. The measurement model of Figure 1 was the basis of Model 1, a structural equation model for measurements at two occasions, but without any across occasion constraints. The χ² test of exact fit was significant (χ²(106) = 146.9), but the RMSEA measure indicated close fit (RMSEA = 0.048, Table 2).

Table 2. Goodness of overall fit of models in the four-step response shift detection procedure

Model     Description                                          RMSEA                  ECVI
Model 1   Measurement model (no across occasion constraints)   0.048 (0.027; 0.066)   1.98 (1.82; 2.20)
Model 2   No response shift model                              0.060 (0.044; 0.074)   2.03 (1.82; 2.30)
Model 3   Response shift model                                 0.040 (0.017; 0.058)   1.81 (1.70; 2.03)
Model 4   Final model (all tenable constraints imposed)        0.039 (0.015; 0.056)   1.73 (1.64; 1.96)

Notes: n = 170; Numbers between parentheses represent 90% confidence intervals.
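The RMSEA point estimates in Table 2 can be checked against the reported χ² statistics. The sketch below assumes the common formula RMSEA = sqrt(max(χ² − df, 0) / (df · (n − 1))); the function name is hypothetical, and some programs divide by n rather than n − 1, which makes little difference at n = 170.

```python
import math


def rmsea(chisq: float, df: int, n: int) -> float:
    """RMSEA point estimate from a chi-square statistic, its df, and sample size n.

    Assumed formula: sqrt(max(chisq - df, 0) / (df * (n - 1))).
    """
    return math.sqrt(max(chisq - df, 0.0) / (df * (n - 1)))


N = 170  # sample size given in the notes to Table 2

# Chi-square statistics and degrees of freedom as reported in the text
# for Model 1 (Step 1) and Model 4 (Step 4).
reported = {
    "Model 1": (146.9, 106),
    "Model 4": (168.3, 134),
}

for model, (chisq, df) in reported.items():
    print(f"{model}: RMSEA = {rmsea(chisq, df, N):.3f}")
```

Rounded to three decimals, both values agree with the corresponding RMSEA entries in Table 2.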

Step 2. In Model 2, all response shift parameters were held invariant across occasions. This means that all across occasion invariance constraints on factor loadings, intercepts, and residual variances were imposed. The fit of Model 2, although still satisfactory (RMSEA = 0.060, Table 2), was significantly worse than the fit of Model 1, indicating the presence of response shifts (χ² difference test: χ²(23) = 60.0, p < 0.001).

Step 3. Inspection of modification indices and standardized residuals indicated which of the equality constraints were not tenable. Step by step modification of Model 2 yielded Model 3, which showed five cases of response shift, as will be explained below. The fit of Model 3 was good (RMSEA = 0.040), and significantly better than the fit of Model 2 (χ²(5) = 48.7).

Step 4. To investigate change in the means, variances, and correlations of the common factors, we fitted additional models with Model 3 as the starting point. The across occasion invariance of parameters was tested step-by-step, maintaining all equality constraints that proved tenable. This procedure finally yielded Model 4, which fitted the data closely (χ²(134) = 168.3, RMSEA = 0.039, Table 2). Estimates of all Model 4 parameters are given in Table 3.

Evaluation of response shifts and true change
It appears that most Model 4 parameters are invariant across occasions, but here we focus on the parameters that did change (printed in bold, Table 3). Results of significance tests of these changes are presented in Table 4.

Reconceptualization and reprioritization
Comparison of the common factor loadings of the first occasion with those of the second occasion (Table 3, top rows) shows that at the second occasion the GH scale became an indicator of GenMent, indicating reconceptualization of GH. In addition, the common factor loading of SF on GenPhys became larger at the second occasion, indicating reprioritization of SF.

Recalibration
Intercepts and residual variances contain information about uniform and nonuniform recalibration. For RP and BP, we found differences between first and second occasion intercepts, indicating uniform recalibration of both RP and BP. We also found a change in the variance of the residual factor ResVT, indicating nonuniform recalibration of VT.

True change
Common factor variances and common factor correlations did not change across occasions, but the common factor means did. Common factor means were fixed at zero for the first occasion (because of identification requirements), so that the second occasion estimates serve as direct representations of change. The across occasion differences were significant (p < 0.001) for each of the common factors: GenPhys (−0.77) and GenFitn (−0.35) deteriorated, and GenMent (+0.53) improved. Respective effect-sizes were ‘large’ (d = −0.80), ‘medium’ (d = −0.38), and ‘medium’ (d = 0.49).

Contributions of response shifts and true change to change in the observed variables
In addition to significance test results, Table 4 provides effect-sizes for observed change, and the response shift and true change contributions to observed change, as implied by the parameter estimates of Model 4 (in Table 3). From Table 4 it appears that the response shift effects on observed change are only ‘small’: 0.27 and 0.30 for the uniform recalibration of RP and BP, 0.14 for the reconceptualization of GH, −0.11 for the reprioritization of SF, and zero (obviously) for the nonuniform recalibration of VT. The effects of true change are generally larger, but vary from none (SF) to ‘medium’ (PF, RP, BP). For RP, BP, and GH the effects of response shifts and true change are in opposite directions.

The impact of response shifts on the measurement of true change
The impact of response shifts on the estimation of true change was investigated by checking what the estimated true change would have been if we had not accounted for the response shifts that we found.
With Model 2 we found changes in the common factor means of −0.59, +0.48, and −0.35 for GenPhys, GenMent, and GenFitn, whereas with Model 4 these changes were −0.77, +0.53, and −0.35 (Table 3). However, the fit of Model 2 was perhaps too poor to allow interpretation of its parameter estimates. Checking each of the five cases of response shift one by one, we found that only uniform recalibration affected the measurement of true change. If we ignored uniform recalibration, the GenPhys mean change was estimated at −0.54. Thus, in the present data set, uniform recalibration had some impact on the estimation of true change in GenPhys. Accounting for uniform recalibration, the effect-size increases by 0.23, from a ‘medium’ sized change to a ‘large’ sized

Table 3. Parameter estimates in the final model (Model 4)

Factor loadings (G)
         Pre-test                        Post-test
Scale    GenPhys1  GenMent1  GenFitn1    GenPhys2  GenMent2  GenFitn2
PF       0.99                            0.99
RP       1.56                            1.56
BP       0.84                            0.84
SF       0.45      0.57                  0.69      0.57
MH                 0.82                            0.82
RE                 1.32                            1.32
VT                           1.12                            1.12
GH                           0.49                  0.30      0.49
FT                           1.06                            1.06

         Intercepts (s)         Residual variances (Diag(W))
Scale    Pre-test  Post-test    Pre-test  Post-test
PF       3.94      3.94         0.64      0.64
RP       2.70      3.33         1.73      1.73
BP       3.93      4.32         0.75      0.75
SF       3.82      3.82         0.92      0.92
MH       3.25      3.25         0.44      0.44
RE       2.92      2.92         2.33      2.33
VT       3.15      3.15         0.36      0.21
GH       2.96      2.96         0.62      0.62
FT       3.28      3.28         0.18      0.18

Residual correlations (Diag(W21*)), Pre × Post: 0.28 0.12

Common factor variances (Diag (F))
            GenPhys   GenMent   GenFitn
Pre-test    1.00      1.00      1.00
Post-test   1.00      1.00      1.00

Common factor correlations (F*)
            GenPhys1  GenMent1  GenFitn1  GenPhys2  GenMent2  GenFitn2
GenPhys1    1
GenMent1    0.53      1
GenFitn1    0.87      0.65      1
GenPhys2    0.54      0.33      0.49      1
GenMent2    0.33      0.41      0.42      0.53      1
GenFitn2    0.49      0.42      0.59      0.87      0.65      1

Common factor means (a)
            GenPhys   GenMent   GenFitn
Pre-test    0.00      0.00      0.00
Post-test   −0.77     0.53      −0.35

Notes: n = 170, goodness of overall fit measures: χ²(134) = 168.3, RMSEA = 0.039, RMSEA 90% confidence interval = 0.015–0.056. Results indicating across-occasion variance are printed in bold. Greek symbols refer to the structural equation model described by Oort [17]. Factor loadings are unstandardized, but covariances are decomposed into variances and correlations.
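The effect sizes quoted for true change can be approximately reproduced from the Table 3 estimates. The sketch below rests on an assumption that is not stated in the text: that d is the mean change divided by the standard deviation of the post-minus-pre difference. It also assumes, for the comparison at the end, that the pre-post correlation of GenPhys (0.54) is the same whether or not uniform recalibration is accounted for. The function name is hypothetical.

```python
import math


def change_effect_size(mean_change, var_pre, var_post, corr_pre_post):
    """Standardized mean change: mean difference over the SD of the difference.

    Assumption of this sketch: SD of the difference is
    sqrt(var_pre + var_post - 2 * corr * sqrt(var_pre * var_post)).
    """
    sd_diff = math.sqrt(
        var_pre + var_post - 2.0 * corr_pre_post * math.sqrt(var_pre * var_post)
    )
    return mean_change / sd_diff


# (mean change, pre-post correlation) per common factor, from Table 3.
# All common factor variances equal 1.00 at both occasions.
FACTOR_CHANGE = {
    "GenPhys": (-0.77, 0.54),
    "GenMent": (0.53, 0.41),
    "GenFitn": (-0.35, 0.59),
}

for name, (change, corr) in FACTOR_CHANGE.items():
    d = change_effect_size(change, 1.0, 1.0, corr)
    print(f"{name}: d = {d:+.2f}")

# GenPhys: change estimated with (-0.77) and without (-0.54) accounting
# for uniform recalibration; reusing the 0.54 correlation is an assumption.
d_accounted = change_effect_size(-0.77, 1.0, 1.0, 0.54)
d_ignored = change_effect_size(-0.54, 1.0, 1.0, 0.54)
print(f"GenPhys |d| increase: {abs(d_accounted) - abs(d_ignored):.2f}")
```

Under these assumptions the results fall within roughly 0.01 of the reported values (−0.80, 0.49, −0.38, and an increase of 0.23); the remaining differences are plausibly due to the two-decimal rounding of the inputs.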

Table 4. Significance tests of response shifts, and effect-sizes of observed change, response shift, and true change in the final model (Model 4)

Scale   Response shift          Significance test, χ² (df = 1)
RP      Uniform recalibration   11.1
BP      Uniform recalibration   12.7
SF      Reprioritization        4.4
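The way response shifts and true change combine into change in an observed scale mean can be sketched from the Model 4 estimates in Table 3. The decomposition used here is an assumption of this sketch rather than the exact index of Oort [17]: the implied change is split into an intercept shift, a loading shift weighted by the second-occasion factor means, and a true change part carried by the first-occasion loadings (first-occasion factor means being zero). Results are in scale-score units, not in the effect-size units of Table 4, and all names are hypothetical.

```python
# Model 4 estimates copied from Table 3 (two-decimal rounding).
LOADINGS_PRE = {
    "PF": {"GenPhys": 0.99},
    "RP": {"GenPhys": 1.56},
    "BP": {"GenPhys": 0.84},
    "SF": {"GenPhys": 0.45, "GenMent": 0.57},
    "MH": {"GenMent": 0.82},
    "RE": {"GenMent": 1.32},
    "VT": {"GenFitn": 1.12},
    "GH": {"GenFitn": 0.49},
    "FT": {"GenFitn": 1.06},
}
# Post-test loadings differ only for SF (reprioritization)
# and GH (reconceptualization).
LOADINGS_POST = {scale: dict(row) for scale, row in LOADINGS_PRE.items()}
LOADINGS_POST["SF"]["GenPhys"] = 0.69
LOADINGS_POST["GH"]["GenMent"] = 0.30

INTERCEPT_PRE = {"PF": 3.94, "RP": 2.70, "BP": 3.93, "SF": 3.82, "MH": 3.25,
                 "RE": 2.92, "VT": 3.15, "GH": 2.96, "FT": 3.28}
# Post-test intercepts differ only for RP and BP (uniform recalibration).
INTERCEPT_POST = {**INTERCEPT_PRE, "RP": 3.33, "BP": 4.32}

# Second-occasion common factor means; first-occasion means are fixed at zero.
FACTOR_MEAN_POST = {"GenPhys": -0.77, "GenMent": 0.53, "GenFitn": -0.35}


def decompose_change(scale):
    """Return (intercept shift, loading shift, true change) in scale-score units."""
    pre, post = LOADINGS_PRE[scale], LOADINGS_POST[scale]
    intercept_shift = INTERCEPT_POST[scale] - INTERCEPT_PRE[scale]
    loading_shift = sum(
        (post.get(factor, 0.0) - pre.get(factor, 0.0)) * mean
        for factor, mean in FACTOR_MEAN_POST.items()
    )
    true_change = sum(
        pre.get(factor, 0.0) * mean for factor, mean in FACTOR_MEAN_POST.items()
    )
    return intercept_shift, loading_shift, true_change


for scale in INTERCEPT_PRE:
    shift_i, shift_l, true = decompose_change(scale)
    print(f"{scale}: intercept shift {shift_i:+.2f}, loading shift {shift_l:+.2f}, "
          f"true change {true:+.2f}, implied change {shift_i + shift_l + true:+.2f}")
```

For RP, for example, this gives an upward intercept shift of 0.63 against a true change of about −1.20, which matches the statement in the text that response shift and true change work in opposite directions for that scale.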
