Quality of Life Research (2005) 14: 599–609

Springer 2005

An application of structural equation modeling to detect response shifts and true change in quality of life data from cancer patients undergoing invasive surgery

Frans J. Oort, Mechteld R.M. Visser & Mirjam A.G. Sprangers
Department of Medical Psychology, Academic Medical Centre, University of Amsterdam, The Netherlands (E-mail: [email protected])

Accepted in revised form 14 June 2004

Abstract
The objective is to show how structural equation modeling can be used to detect reconceptualization, reprioritization, and recalibration response shifts in quality of life data from cancer patients undergoing invasive surgery. A consecutive series of 170 newly diagnosed cancer patients, heterogeneous with respect to cancer site, was included. Patients were administered the SF-36 and a short version of the multidimensional fatigue inventory prior to surgery and 3 months following surgery. Indications of response shift effects were found for five SF-36 scales: reconceptualization of ‘general health’, reprioritization of ‘social functioning’, and recalibration of ‘role-physical’, ‘bodily pain’, and ‘vitality’. Accounting for these response shifts, we found deteriorated physical health, deteriorated general fitness, and improved mental health. The response shift effects on observed change were only small. Yet accounting for the recalibration response shifts did change the estimate of true change in physical health from medium to large. The structural equation modeling approach was found to be useful in detecting response shift effects. The extent to which the procedure is guided by subjective decisions is discussed.

Key words: Cancer, Health related quality of life, Response shift, Structural equation modeling

Introduction
When assessing self-reported change we must account for recalibration, reprioritization, and reconceptualization response shifts. Recalibration refers to a change in the respondent’s internal standards of measurement, reprioritization to a change in the respondent’s values, and reconceptualization to a change in the respondent’s understanding of the target construct [1, 2]. Oort [17] proposes a procedure for the detection of response shifts and the measurement of true change through structural equation modeling. The procedure applies if one or more target constructs (e.g. health-related quality of life) are measured with multiple items, or scales of items (e.g. the items or scales of a quality of life questionnaire).

In this procedure, operationalizations of response shifts are based on the idea that reconceptualization refers to a change in the meaning of the item content, reprioritization to a change in the relative importance of the item as an indicator of the target construct, and recalibration to a change in the meaning of (the labeling of) the response options of the item (i.e., the anchors of the response scale). Oort distinguishes between uniform and nonuniform recalibration. If the meaning of all response scale anchors changes in the same way, that is, if the recalibration affects all anchors in the same direction and to the same extent, the recalibration is uniform. Otherwise, the recalibration is nonuniform. The detection procedure models group means and covariances. It will therefore only detect response shifts and true change if these phenomena are experienced by a substantial part of the respondents. In other words, the procedure aims at detecting response shifts and true change at the group level rather than the individual level. For a detailed account of the procedure, we refer the reader to Oort [17].

The present objective is to illustrate the response shift detection procedure by applying it to data from cancer patients who underwent invasive surgery. The surgery induced severe and lasting physical limitations, requiring the patients to adapt to their deteriorating condition. We therefore expect to find response shifts in these data, because, according to Sprangers and Schwartz [1], response shifts are most likely to occur if the change in patients’ health status is recent, intense and pervasive, thus requiring adaptation.

Method
Cancer patients’ health-related quality of life was first assessed prior to surgery, shortly after diagnosis. The second assessment was 3 months following surgery.

Patients
A consecutive series of 170 newly diagnosed cancer patients were enrolled, including 29 lung cancer patients (17%) waiting for either lobectomy or pneumonectomy, 43 pancreatic cancer patients (25%) waiting for Whipple or bypass surgery, 46 esophageal cancer patients (27%) waiting for either transhiatal or transthoracic surgery, and 52 cervical cancer patients (31%) waiting for radical hysterectomy. Exclusion criteria were being younger than 18 years, having a life expectancy of less than 9 months, or not being able to complete a (Dutch) questionnaire. The sample consisted of 87 men and 83 women, with ages ranging from 27 to 83 (mean 57.5, standard deviation 14.1).

Measures
Generic health-related quality of life was assessed with the Dutch language version [3] of the SF-36 health survey [4], encompassing eight scales: physical functioning (PF), role limitations due to physical health (role-physical, RP), bodily pain (BP), general health perceptions (GH), vitality (VT), social functioning (SF), role limitations due to emotional problems (role-emotional, RE), and mental health (MH). As effects on patients’ fatigue were of prime interest, we also measured fatigue (FT) with a six-item short form of the multidimensional fatigue inventory (MFI) [5]. Both SF-36 and MFI scale scores were transformed to a scale ranging from 0 to 5, with higher scores indicating better health.

Procedure
Response shift detection and true change assessment were done in four steps [17]: (1) establishment of an appropriate measurement model, (2) fitting a model of no response shifts, (3) detection of response shifts, and (4) assessment of true change. Each of these steps is associated with a particular model.

Step 1. An appropriate measurement model, Model 1, was established on the basis of published results of principal components analyses of the SF-36 [4], results of exploratory factor analyses of the present data, and substantive considerations. Model 1 has no across occasion constraints.

Step 2. In Model 2 all response shift parameters are constrained to be equal across occasions. The comparison of the fit of Models 1 and 2 through the χ² difference test [6] can be used as an overall test of the presence of response shifts. If the difference in fit of these two models is not significant, we may conclude that there are no response shifts, and skip Step 3.

Step 3. In Model 3 all apparent response shifts are accounted for. The specification search of Model 3 started with Model 2, and was guided by modification indices [7] and standardized residuals. Each modification was tested with the χ² difference test [6].
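The χ² difference test used in Steps 2 and 3 can be sketched in a few lines. The following is a minimal illustration rather than the authors' LISREL or Mx syntax: the function names `chi2_sf` and `chi2_difference_test` are hypothetical, the tail probability is computed with the standard closed-form recurrence for integer degrees of freedom, and the input values are the Model 1, 2 and 3 fit statistics reported in Table 2.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-y) * sum_{i=0}^{df/2 - 1} y^i / i!
        term = math.exp(-y)
        total = term
        for i in range(1, df // 2):
            term *= y / i
            total += term
        return total
    # Odd df: erfc(sqrt(y)) plus terms y^(i - 1/2) * exp(-y) / Gamma(i + 1/2)
    total = math.erfc(math.sqrt(y))
    term = math.sqrt(y) * math.exp(-y) / math.gamma(1.5)
    for i in range(1, (df - 1) // 2 + 1):
        total += term
        term *= y / (i + 0.5)
    return total

def chi2_difference_test(chisq_restricted, df_restricted, chisq_free, df_free):
    """Compare a restricted model with the less restricted model it is nested in."""
    d_chisq = chisq_restricted - chisq_free
    d_df = df_restricted - df_free
    return d_chisq, d_df, chi2_sf(d_chisq, d_df)

# Model 2 (no response shifts) against Model 1 (no across occasion constraints)
d12, df12, p12 = chi2_difference_test(206.9, 129, 146.9, 106)
# Model 2 against Model 3 (response shifts accounted for)
d23, df23, p23 = chi2_difference_test(206.9, 129, 158.2, 124)

print(f"Model 2 vs 1: CHISQ({df12}) = {d12:.1f}, p = {p12:.2g}")
print(f"Model 2 vs 3: CHISQ({df23}) = {d23:.1f}, p = {p23:.2g}")
```

A significant difference means that the added equality constraints worsen the fit, so that at least one of them is not tenable.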
Response shifts are operationalized as across-occasion differences between patterns of common factor loadings (reconceptualization), differences between values of common factor loadings (reprioritization), differences between intercepts (uniform recalibration), and differences between residual variances (nonuniform recalibration).

Step 4. After establishing (partial) invariance of intercepts, factor loadings, and residual variances, we tested other types of invariance, including the

equality of common factor means, variances, and correlations. Differences between common factor means indicate ‘unbiased’ or ‘true’ changes in the target constructs [17] (here: health-related quality of life).

Evaluation
The parameter estimates of the final model, Model 4, were used to calculate effect size indices for true change, as well as for the contributions of response shifts and true change to observed change [17]. The response shift effect on the estimation of true change was investigated by comparing estimates of true change from a model in which response shifts were accounted for with estimates from a model in which response shifts were not accounted for.

Structural equation modeling
In the four-step procedure, structural equation models were fitted to the means, variances, and covariances of the SF-36 and MFI scale scores, using standard statistical computer programs [7–9] (LISREL and Mx syntax is available upon request).

Identification
To achieve identification of all model parameters, scales and origins of the common factors were established by fixing the means at zero and the variances at one. In Steps 2–4 of the detection procedure, only the first occasion factor means and variances are fixed; second occasion means and variances are then identified by constraining intercepts and factor loadings to be equal across occasions [17].

Estimation
The maximum likelihood estimation method yields a χ² test of overall goodness-of-fit, and standard errors for all parameter estimates [6].

Goodness-of-fit
A significant χ² indicates a significant difference between data and model. However, in the practice of structural equation modeling, exact fit is rare, and with large sample sizes the χ² test generally turns out to be significant. An alternative measure of overall goodness-of-fit is the root mean square error of approximation (RMSEA). According to a generally accepted rule of thumb, an RMSEA value below 0.08 indicates ‘reasonable’ fit and one below 0.05 ‘close’ fit [10]. Yet another fit index is the expected cross-validation index (ECVI), which expresses how well a model fits a ‘validation’ sample when its parameter estimates are obtained with an independent ‘calibration’ sample. The ECVI can be estimated without actually splitting the available data; ECVI values represent discrepancies and can only be interpreted in comparisons of different models for the same data [11].

Results
Table 1 gives pre-surgery and post-surgery means, standard deviations, and Cronbach’s α values for all SF-36 and MFI scales. All scale scores appear sufficiently reliable, with the exception of the GH pre-surgery scores. The last column of Table 1 presents the standardized difference between pre-surgery and post-surgery means (i.e., Cohen’s d-index [12]), thereby giving a first impression of change in the observed scale means (without accounting for possible response shifts). Conventional t-tests indicated deterioration in PF, RP, BP, VT, and FT, improvement in MH and RE, and no change in SF and GH.

Before continuing with the structural equation modeling, we first checked whether our sample of cancer patients was not too heterogeneous to be used in single-group analysis. We calculated separate mean vectors and covariance matrices for the four groups of patients (lung cancer, pancreatic cancer, esophageal cancer, and cervical cancer patients), and we fitted a multi-group model in which the means, variances, and covariances were assumed to be invariant across the four groups. The hypothesis of exact fit (RMSEA = 0) was rejected (p = 0.001), but the hypothesis of close fit (RMSEA < 0.05) was not rejected (p = 0.585) [10]. So we concluded that we could continue with single-group analyses, although we should keep in mind that our conclusions might not carry the same weight for all patient groups.

Table 1. Means, standard deviations, and reliability estimates for SF-36 and MFI scales

        Before surgery                  Three months after surgery      Post-Pre
Scale   Mean   St.dev.   Reliability    Mean   St.dev.   Reliability    d-index#
PF      3.96   1.22      0.93           3.18   1.32      0.92           −0.59**
RP      2.73   2.09      0.86           2.13   2.02      0.84           −0.27**
BP      3.94   1.19      0.87           3.68   1.21      0.92           −0.20*
SF      3.81   1.32      0.78           3.62   1.47      0.79           −0.11
MH      3.25   1.08      0.84           3.69   1.05      0.82           0.40**
RE      3.00   2.12      0.83           3.55   1.93      0.81           0.21*
VT      3.14   1.26      0.86           2.77   1.23      0.89           −0.30**
GH      2.96   0.95      0.50           2.96   1.06      0.78           0.00
FT      3.30   1.10      0.86           2.92   1.18      0.91           −0.35**

Notes: n = 170; SF-36 scales have been scored as described in the manual [4], but for computational convenience have been divided by 20 afterwards. As a result, all SF-36 and MFI scale scores range from 0 to 5; #Standardized mean difference: 0.2, 0.5, and 0.8 indicate ‘small’, ‘medium’, and ‘large’ differences [12]; *p < 0.01, **p < 0.001 in paired t-test.

Below we first describe the measurement model that was used in the response shift detection procedure, then we present the results of response shift detection and true change assessment, and we conclude with an evaluation of the size of response shifts and true change.

Measurement model
Results from exploratory factor analyses and substantive considerations gave rise to the measurement model displayed in Figure 1. The circles represent unobserved, latent variables and the squares represent the observed variables. Three latent variables are the common factors general physical health (GenPhys), general mental health (GenMent), and general fitness (GenFitn). GenPhys is measured by PF, RP, BP, and SF, GenMent is measured by MH, RE, and again SF, and GenFitn is measured by VT, GH, and FT. Other latent variables are the residual factors ResPF, ResRP, ResBP, etc. The residual factors represent all that is specific to PF, RP, BP, etc., plus random error variation [13, 17]. Numbers in Figure 1 are maximum likelihood estimates of common factor loadings, common factor correlations, residual variances, and one residual correlation (numbers separated by a slash represent separate first and second occasion estimates).

The measurement model portrayed in Figure 1 resembles the principal components model of the SF-36 scales described by Ware et al. [4]. The general physical and mental components feature in both models, with largely the same indicators. However, the addition of the MFI scale that specifically measures fatigue, FT, brought about the GenFitn factor. The SF-36 scales VT and GH, consisting of items with wordings that do not distinguish between physical and mental aspects, also loaded on the GenFitn factor. The wording of the SF items combines physical and mental aspects, causing SF to load on both GenPhys and GenMent. Another difference from the principal components model is that in our model the common factors were substantially correlated. The correlation between the GenPhys and GenFitn factors was especially high (0.87), but a two-factor model did not yield satisfactory fit (the preceding exploratory factor analyses showed significant χ² differences between fit measures of two- and three-factor models). Inspection of the standardized residuals showed that the covariance between RP and RE was not sufficiently explained by the correlation between the common factors. We therefore allowed the residual factors ResRP and ResRE to co-vary (0.22 correlation), to account for the very similar wordings of the RP and RE items.

Detection of response shifts and true change
Fit results for the four models that resulted from carrying out the four-step procedure are given in Table 2.

Step 1. The measurement model of Figure 1 was the basis of Model 1, a structural equation model for measurements at two occasions, but without any across occasion constraints. The χ² test of

Figure 1. The measurement model used in response shift detection. Notes: Circles represent latent variables (common and residual factors) and squares represent observed variables (the SF-36 and MFI scales). Abbreviations: PF – physical functioning; RP – role-physical; BP – bodily pain; GH – general health; VT – vitality; SF – social functioning; RE – role-emotional; MH – mental health; FT – fatigue. Numbers are maximum likelihood estimates of Model 4 parameters: common factor loadings, common factor correlations, residual variances, and a residual correlation. Parameter estimates separated by a slash represent separate first and second occasion estimates; all other parameters were constrained to be equal across occasions. Table 3 gives all Model 4 parameter estimates, for both occasions.

Table 2. Goodness of overall fit of models in the four-step response shift detection procedure

Model     Description                                           Df    CHISQ   RMSEA                  ECVI
Model 1   Measurement model (no across occasion constraints)    106   146.9   0.048 (0.027; 0.066)   1.98 (1.82; 2.20)
Model 2   No response shift model                               129   206.9   0.060 (0.044; 0.074)   2.03 (1.82; 2.30)
Model 3   Response shift model                                  124   158.2   0.040 (0.017; 0.058)   1.81 (1.70; 2.03)
Model 4   Final model (all tenable constraints imposed)         134   168.3   0.039 (0.015; 0.056)   1.73 (1.64; 1.96)

Notes: n = 170; Numbers between parentheses represent 90% confidence intervals.

exact fit was significant (CHISQ(106) = 146.9), but the RMSEA measure indicated close fit (RMSEA = 0.048, Table 2).
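The RMSEA point estimates in Table 2 can be approximated directly from the reported CHISQ values, degrees of freedom, and sample size. The sketch below assumes one common definition of the RMSEA, with n − 1 in the denominator; the source does not state which variant the fitting programs used, and the function name `rmsea` is hypothetical.

```python
import math

def rmsea(chisq, df, n):
    """Point estimate of the RMSEA under the assumed definition
    sqrt(max(chisq - df, 0) / (df * (n - 1)))."""
    return math.sqrt(max(chisq - df, 0.0) / (df * (n - 1)))

# (CHISQ, Df) per model, transcribed from Table 2; n = 170 patients
TABLE2 = {
    "Model 1": (146.9, 106),
    "Model 2": (206.9, 129),
    "Model 3": (158.2, 124),
    "Model 4": (168.3, 134),
}

for name, (chisq, df) in TABLE2.items():
    print(f"{name}: RMSEA = {rmsea(chisq, df, 170):.3f}")
```

Under this definition the four values round to the point estimates reported in Table 2. The confidence intervals are not reproduced here, since they require the noncentral χ² distribution.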

Step 2. In Model 2, all response shift parameters were held invariant across occasions. This means that all across occasion invariance constraints on factor loadings, intercepts, and residual variances were imposed. The fit of Model 2, although still satisfactory (RMSEA = 0.060, Table 2), was significantly worse than the fit of Model 1, indicating the presence of response shifts (χ² difference test: CHISQ(23) = 60.0, p < 0.001).

Step 3. Inspection of modification indices and standardized residuals indicated which of the equality constraints were not tenable. Step by step modification of Model 2 yielded Model 3, which showed five cases of response shift, as will be explained below. The fit of Model 3 was good (RMSEA = 0.040), and significantly better than the fit of Model 2 (CHISQ(5) = 48.7).

Step 4. To investigate change in the means, variances, and correlations of the common factors, we fitted additional models with Model 3 as the starting point. The across occasion invariance of parameters was tested step-by-step, maintaining all equality constraints that proved tenable. This procedure finally yielded Model 4, which fitted the data closely (CHISQ(134) = 168.3, RMSEA = 0.039, Table 2). Estimates of all Model 4 parameters are given in Table 3.

Evaluation of response shifts and true change
It appears that most Model 4 parameters are invariant across occasions, but here we focus on the parameters that did change (printed in bold, Table 3). Results of significance tests of these changes are presented in Table 4.

Reconceptualization and reprioritization
Comparison of the common factor loadings of the first occasion with those of the second occasion (Table 3, top rows) shows that at the second occasion the GH scale became an indicator of GenMent, indicating reconceptualization of GH. In addition, the common factor loading of SF on GenPhys became larger at the second occasion, indicating reprioritization of SF.

Recalibration
Intercepts and residual variances contain information about uniform and nonuniform recalibration. For RP and BP, we found differences between first and second occasion intercepts, indicating uniform recalibration of both RP and BP. We also found a change in the variance of the residual factor ResVT, indicating nonuniform recalibration of VT.

True change
Common factor variances and common factor correlations did not change across occasions, but the common factor means did. Common factor means were fixed at zero for the first occasion (because of identification requirements), so that the second occasion estimates serve as direct representations of change. The across occasion differences were significant (p < 0.001) for each of the common factors: GenPhys (−0.77) and GenFitn (−0.35) deteriorated, and GenMent (+0.53) improved. Respective effect-sizes were ‘large’ (d = −0.80), ‘medium’ (d = −0.38), and ‘medium’ (d = 0.49).

Contributions of response shifts and true change to change in the observed variables
In addition to significance test results, Table 4 provides effect-sizes for observed change, and the response shift and true change contributions to observed change, as implied by the parameter estimates of Model 4 (in Table 3). From Table 4 it appears that the response shift effects on observed change are only ‘small’: 0.27 and 0.30 for the uniform recalibration of RP and BP, 0.14 for the reconceptualization of GH, −0.11 for the reprioritization of SF, and zero (obviously) for the nonuniform recalibration of VT. The effects of true change are generally larger, but vary from none (SF) to ‘medium’ (PF, RP, BP). For RP, BP, and GH the effects of response shifts and true change are in opposite directions.

The impact of response shifts on the measurement of true change
The impact of response shifts on the estimation of true change was investigated by checking what the estimated true change would have been had we not accounted for the response shifts that we found. With Model 2 we found changes in the common factor means of −0.59, +0.48, and −0.35 for GenPhys, GenMent, and GenFitn, whereas with Model 4 these changes were −0.77, +0.53, and −0.35 (Table 3). However, the fit of Model 2 was perhaps too poor to allow interpretation of its

Table 3. Parameter estimates in the final model (Model 4)

Factor loadings (Γ)
        Pre-test                           Post-test
Scale   GenPhys1   GenMent1   GenFitn1     GenPhys2   GenMent2   GenFitn2
PF      0.99                               0.99
RP      1.56                               1.56
BP      0.84                               0.84
SF      0.45       0.57                    0.69       0.57
MH                 0.82                               0.82
RE                 1.32                               1.32
VT                            1.12                               1.12
GH                            0.49                    0.30       0.49
FT                            1.06                               1.06

Intercepts (τ)
             PF      RP      BP      SF      MH      RE      VT      GH      FT
Pre-test     3.94    2.70    3.93    3.82    3.25    2.92    3.15    2.96    3.28
Post-test    3.94    3.33    4.32    3.82    3.25    2.92    3.15    2.96    3.28

Residual variances (Diag(Θ))
             ResPF   ResRP   ResBP   ResSF   ResMH   ResRE   ResVT   ResGH   ResFT
Pre-test     0.64    1.73    0.75    0.92    0.44    2.33    0.36    0.62    0.18
Post-test    0.64    1.73    0.75    0.92    0.44    2.33    0.21    0.62    0.18

Residual correlations (Diag(Θ21*))
             ResPF   ResRP   ResBP   ResSF   ResMH   ResRE   ResVT   ResGH   ResFT
Pre × Post   0.28    0.12    0.32    0.05    0.43    −0.03   0.22    0.30    0.23

Common factor variances (Diag(Φ))
             GenPhys   GenMent   GenFitn
Pre-test     1.00      1.00      1.00
Post-test    1.00      1.00      1.00

Common factor correlations (Φ*)
             GenPhys1   GenMent1   GenFitn1   GenPhys2   GenMent2   GenFitn2
GenPhys1     1
GenMent1     0.53       1
GenFitn1     0.87       0.65       1
GenPhys2     0.54       0.33       0.49       1
GenMent2     0.33       0.41       0.42       0.53       1
GenFitn2     0.49       0.42       0.59       0.87       0.65       1

Common factor means (α)
             GenPhys   GenMent   GenFitn
Pre-test     0.00      0.00      0.00
Post-test    −0.77     0.53      −0.35

Notes: n = 170, goodness of overall fit measures: CHISQ(134) = 168.3, RMSEA = 0.039, RMSEA 90% confidence interval = 0.015–0.056. Results indicating across-occasion variance are printed in bold. Greek symbols refer to the structural equation model described by Oort [17]. Factor loadings are unstandardized, but covariances are decomposed into variances and correlations.
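The operationalization of the four types of response shift can be illustrated by comparing the pre-test and post-test estimates of Table 3 scale by scale. The sketch below is only a descriptive comparison of point estimates: in the procedure itself, each difference was established through modification indices and χ² difference tests. The data structures and the function name `classify_shifts` are hypothetical, and the values are transcribed from Table 3.

```python
# Model 4 factor loadings from Table 3: scale -> {common factor: loading}
LOADINGS_PRE = {
    "PF": {"GenPhys": 0.99}, "RP": {"GenPhys": 1.56}, "BP": {"GenPhys": 0.84},
    "SF": {"GenPhys": 0.45, "GenMent": 0.57},
    "MH": {"GenMent": 0.82}, "RE": {"GenMent": 1.32},
    "VT": {"GenFitn": 1.12}, "GH": {"GenFitn": 0.49}, "FT": {"GenFitn": 1.06},
}
LOADINGS_POST = {
    **LOADINGS_PRE,
    "SF": {"GenPhys": 0.69, "GenMent": 0.57},
    "GH": {"GenMent": 0.30, "GenFitn": 0.49},
}
INTERCEPTS_PRE = {"PF": 3.94, "RP": 2.70, "BP": 3.93, "SF": 3.82, "MH": 3.25,
                  "RE": 2.92, "VT": 3.15, "GH": 2.96, "FT": 3.28}
INTERCEPTS_POST = {**INTERCEPTS_PRE, "RP": 3.33, "BP": 4.32}
RESVAR_PRE = {"PF": 0.64, "RP": 1.73, "BP": 0.75, "SF": 0.92, "MH": 0.44,
              "RE": 2.33, "VT": 0.36, "GH": 0.62, "FT": 0.18}
RESVAR_POST = {**RESVAR_PRE, "VT": 0.21}

def classify_shifts(scale):
    """List the response shift types suggested by across-occasion differences."""
    pre, post = LOADINGS_PRE[scale], LOADINGS_POST[scale]
    shifts = []
    if set(pre) != set(post):
        # A different pattern of loadings: the scale indicates other factors
        shifts.append("reconceptualization")
    if any(pre[f] != post[f] for f in set(pre) & set(post)):
        # Same factor, different loading value
        shifts.append("reprioritization")
    if INTERCEPTS_PRE[scale] != INTERCEPTS_POST[scale]:
        shifts.append("uniform recalibration")
    if RESVAR_PRE[scale] != RESVAR_POST[scale]:
        shifts.append("nonuniform recalibration")
    return shifts

detected = {s: classify_shifts(s) for s in LOADINGS_PRE if classify_shifts(s)}
for scale, shifts in detected.items():
    print(scale, "->", ", ".join(shifts))
```

Applied to the Table 3 estimates, the comparison flags the same five scales as the text: RP, BP, SF, VT, and GH.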

parameter estimates. Checking each of the five cases of response shift one by one, we found that only uniform recalibration affected the measurement of true change. If we ignored uniform recalibration, the GenPhys mean change was estimated at −0.54. Thus, in the present data set, uniform recalibration had some impact on the estimation of true change in GenPhys. Accounting for uniform recalibration, the effect-size increases by 0.23, from a ‘medium’ sized change to a ‘large’ sized change.

Table 4. Significance tests of response shifts, and effect-sizes of observed change, response shift, and true change in the final model (Model 4)

                                 Significance test
Scale   Response shift           χ² (df = 1)   Prob.
RP      Uniform recalibration    11.1
BP      Uniform recalibration    12.7
SF      Reprioritization         4.4
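The contributions of response shifts and true change to the change in an observed scale mean follow from the Model 4 estimates in Table 3. The sketch below assumes a simple algebraic split of the model-implied mean change into an intercept part (uniform recalibration), a loading part (reconceptualization or reprioritization), and a common factor mean part (true change). It returns raw score differences, not the standardized effect-sizes of Table 4, and the names `MODEL4` and `decompose` are hypothetical.

```python
# Common factor means from Table 3 (pre-test means are fixed at zero)
ALPHA_PRE = {"GenPhys": 0.0, "GenMent": 0.0, "GenFitn": 0.0}
ALPHA_POST = {"GenPhys": -0.77, "GenMent": 0.53, "GenFitn": -0.35}

# scale -> (intercept pre, intercept post, loadings pre, loadings post)
MODEL4 = {
    "PF": (3.94, 3.94, {"GenPhys": 0.99}, {"GenPhys": 0.99}),
    "RP": (2.70, 3.33, {"GenPhys": 1.56}, {"GenPhys": 1.56}),
    "BP": (3.93, 4.32, {"GenPhys": 0.84}, {"GenPhys": 0.84}),
    "SF": (3.82, 3.82, {"GenPhys": 0.45, "GenMent": 0.57},
                       {"GenPhys": 0.69, "GenMent": 0.57}),
    "MH": (3.25, 3.25, {"GenMent": 0.82}, {"GenMent": 0.82}),
    "RE": (2.92, 2.92, {"GenMent": 1.32}, {"GenMent": 1.32}),
    "VT": (3.15, 3.15, {"GenFitn": 1.12}, {"GenFitn": 1.12}),
    "GH": (2.96, 2.96, {"GenFitn": 0.49}, {"GenMent": 0.30, "GenFitn": 0.49}),
    "FT": (3.28, 3.28, {"GenFitn": 1.06}, {"GenFitn": 1.06}),
}

def decompose(scale):
    """Split the model-implied change in a scale mean into three parts."""
    tau1, tau2, lam1, lam2 = MODEL4[scale]
    factors = set(lam1) | set(lam2)
    # Change in intercept: uniform recalibration
    recalibration = tau2 - tau1
    # Change in loadings, weighted by the post-test common factor means
    loading_shift = sum((lam2.get(f, 0.0) - lam1.get(f, 0.0)) * ALPHA_POST[f]
                        for f in factors)
    # Change in common factor means, weighted by the pre-test loadings
    true_change = sum(lam1.get(f, 0.0) * (ALPHA_POST[f] - ALPHA_PRE[f])
                      for f in factors)
    return {"recalibration": recalibration,
            "loading_shift": loading_shift,
            "true_change": true_change,
            "total": recalibration + loading_shift + true_change}

for scale in ("RP", "BP", "SF", "GH"):
    parts = decompose(scale)
    print(scale, {k: round(v, 3) for k, v in parts.items()})
```

For RP, BP, and GH the response shift part and the true change part have opposite signs, in line with the text; for SF the true change part is close to zero.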

Springer 2005

An application of structural equation modeling to detect response shifts and true change in quality of life data from cancer patients undergoing invasive surgery Frans J. Oort, Mechteld R.M. Visser & Mirjam A.G. Sprangers Department of Medical Psychology, Academic Medical Centre, University of Amsterdam, The Netherlands (E-mail: [email protected]) Accepted in revised form 14 June 2004

Abstract The objective is to show how structural equation modeling can be used to detect reconceptualization, reprioritization, and recalibration response shifts in quality of life data from cancer patients undergoing invasive surgery. A consecutive series of 170 newly diagnosed cancer patients, heterogeneous to cancer site, were included. Patients were administered the SF-36 and a short version of the multidimensional fatigue inventory prior to surgery, and 3 months following surgery. Indications of response shift eﬀects were found for ﬁve SF-36 scales: reconceptualization of Ôgeneral healthÕ, reprioritization of Ôsocial functioningÕ, and recalibration of Ôrole-physicalÕ, Ôbodily painÕ, and ÔvitalityÕ. Accounting for these response shifts, we found deteriorated physical health, deteriorated general ﬁtness, and improved mental health. The sizes of the response shift eﬀects on observed change were only small. Yet, accounting for the recalibration response shifts did change the estimate of true change in physical health from medium to large. The structural equation modeling approach was found to be useful in detecting response shift eﬀects. The extent to which the procedure is guided by subjective decisions is discussed. Key words: Cancer, Health related quality of life, Response shift, Structural equation modeling

Introduction When assessing self-reported change we must account for recalibration, reprioritization, and reconceptualization response shifts. Recalibration refers to a change in the respondent’s internal standards of measurement, reprioritization to a change in the respondent’s values, and reconceptualization to a change in the respondent’s understanding of the target construct [1, 2]. Oort [17] proposes a procedure for the detection of response shifts and the measurement of true change through structural equation modeling. The procedure applies if one or more target constructs (e.g. health-related quality of life) are measured with multiple items, or scales of items (e.g. the items or scales of a quality of life questionnaire).

In this procedure, operationalizations of response shifts are based on the idea that reconceptualization refers to a change in the meaning of the item content, reprioritization to a change in the relative importance of the item as an indicator of the target construct, and recalibration to a change in the meaning of (the labeling of) the response options of the item (i.e., the anchors of the response scale). Oort distinguishes between uniform and nonuniform recalibration. If the meaning of all response scale anchors changes in the same way, that is, if the recalibration aﬀects all anchors in the same direction and to the same extent, the recalibration is uniform. Otherwise, the recalibration is nonuniform. The detection procedure models group means and covariances. It will therefore only detect

600 response shifts and true change if these phenomena are experienced by a substantial part of the respondents. In other words, the procedure aims at detecting response shifts and true change at the group level rather than the individual level. For a detailed account of the procedure, we refer the reader to Oort [17]. The present objective is to illustrate the response shift detection procedure by applying it to data from cancer patients who underwent invasive surgery. The surgery induced severe and sustaining physical limitations, thus necessitating the patients to accommodate to their deteriorating condition. We therefore expect to ﬁnd response shifts in these data, because, according to Sprangers and Schwartz [1], response shifts are most likely to occur if the change in patients’ health status is recent, intense and pervasive, thus requiring adaptation.

Method Cancer patients’ health-related quality of life was ﬁrst assessed prior to surgery, shortly after diagnosis. The second assessment was 3 months following surgery. Patients A consecutive series of 170 newly diagnosed cancer patients were enrolled, including 29 lung cancer patients (17%) waiting for either lobectomy or pneumectomy, 43 pancreatic cancer patients (25%) waiting for Whipple or bypass surgery, 46 esophageal cancer patients (27%) waiting for either transhiatal or transthoracal surgery, and 52 cervical cancer patients (31%) waiting for radical hysterectomy. Exclusion criteria were being younger than 18 years, having a life expectancy less than 9 months, or not being able to complete a (Dutch) questionnaire. The sample consisted of 87 men and 83 women, with ages ranging from 27 to 83 (mean 57.5, standard deviation 14.1). Measures Generic health-related quality of life was assessed with the Dutch language version [3] of the SF-36 health survey [4], encompassing eight scales: physical functioning (PF), role limitations due to

physical health (role-physical, RP), bodily pain (BP), general health perceptions (GH), vitality (VT), social functioning (SF), role limitations due to emotional problems (role-emotional, RE), and mental health (MH). As eﬀects on patients’ fatigue were of prime interest, we also measured fatigue (FT) with a sixitem short form of the multidimensional fatigue inventory (MFI) [5]. Both SF-36 and MFI scale scores were transformed to a scale ranging from 0 to 5, with higher scores indicating better health. Procedure Response shift detection and true change assessement was done in four steps [17]: (1) establishment of an appropriate measurement model, (2) ﬁtting a model of no response shifts, (3) detection of response shifts, and (4) assessment of true change. Each of these steps is associated with a particular model. Step 1. An appropriate measurement model, Model 1, was established on the basis of published results of principal components analyses of the SF36 [4], results of exploratory factor analyses of the present data, and substantive considerations. Model 1 has no across occasion constraints. Step 2. In Model 2 all response shift parameters are constrained to be equal across occasions. The comparison of the ﬁt of Models 1 and 2 through the v2 diﬀerence test [6] can be used as an overall test of the presence of response shifts. If the difference in ﬁt of these two models is not signiﬁcant, we may conclude that there are no response shifts, and skip Step 3. Step 3. In Model 3 all apparent response shifts are accounted for. The speciﬁcation search of Model 3 started with Model 2, and was guided by modiﬁcation indices [7] and standardized residuals. Each modiﬁcation was tested with the v2 diﬀerence test [6]. 
Response shifts are operationalized as acrossoccasion diﬀerences between patterns of common factor loadings (reconceptualization), diﬀerences between values of common factor loadings (reprioritization), diﬀerences between intercepts (uniform recalibration), and diﬀerences between residual variances (nonuniform recalibration). Step 4. After establishing (partial) invariance of intercepts, factor loadings, and residual variances, we tested other types of invariance, including the

601 equality of common factor means, variances, and correlations. Diﬀerences between common factor means indicate ÔunbiasedÕ or ÔtrueÕ changes in the target constructs [17] (here: health-related quality of life). Evaluation The parameter estimates of the ﬁnal model, Model 4, were used to calculate eﬀect size indices for true change, as well as for the contributions of response shifts and true change to observed change [17]. The response shift eﬀect on the estimation of true change was investigated by comparing estimates of true change from a model in which response shifts were accounted for with estimates from a model in which response shifts were not accounted for.

Structural equation modeling

In the four-step procedure, structural equation models were fitted to the means, variances, and covariances of the SF-36 and MFI scale scores, using standard statistical computer programs [7–9] (LISREL and Mx syntax is available upon request).

Identification

To achieve identification of all model parameters, scales and origins of the common factors were established by fixing the means at zero and the variances at one. In Steps 2–4 of the detection procedure, only the first occasion factor means and variances are fixed; second occasion means and variances are then identified by constraining intercepts and factor loadings to be equal across occasions [17].

Estimation

The maximum likelihood estimation method yields a χ² test of overall goodness-of-fit, and standard errors for all parameter estimates [6].

Goodness-of-fit

A significant χ² indicates a significant difference between data and model. However, in the practice of structural equation modeling, exact fit is rare, and with large sample sizes the χ² test generally turns out to be significant. An alternative measure of overall goodness-of-fit is the root mean square error of approximation (RMSEA). According to a generally accepted rule of thumb, an RMSEA value below 0.08 indicates ‘reasonable’ fit and one below 0.05 ‘close’ fit [10]. Yet another fit index is the expected cross-validation index (ECVI), which expresses how well a model fits a ‘validation’ sample when its parameter estimates are obtained with an independent ‘calibration’ sample. The ECVI can be estimated without actually splitting the available data; ECVI values represent discrepancies and can only be interpreted in comparisons of different models for the same data [11].

Results

Table 1 gives pre-surgery and post-surgery means, standard deviations, and Cronbach’s α’s for all SF-36 and MFI scales. All scale scores appear sufficiently reliable, with the exception of the GH pre-surgery scores. The last column of Table 1 presents the standardized difference between pre-surgery and post-surgery means (i.e., Cohen’s d-index [12]), thereby giving a first impression of change in the observed scale means (without accounting for possible response shifts). Conventional t-tests indicated deterioration in PF, RP, BP, VT, and FT, improvement in MH and RE, and no change in SF and GH.

Before continuing with the structural equation modeling, we first checked whether our sample of cancer patients was not too heterogeneous to be used in single-group analysis. We calculated separate mean vectors and covariance matrices for the four groups of patients (lung cancer, pancreatic cancer, esophageal cancer, and cervical cancer patients), and we fitted a multi-group model in which the means, variances, and covariances were assumed to be invariant across the four groups. The hypothesis of exact fit (RMSEA = 0) was rejected (p = 0.001), but the hypothesis of close fit (RMSEA < 0.05) was not rejected (p = 0.585) [10]. So we concluded that we could continue with single-group analyses, although we should keep in mind that our conclusions might not carry the same weight for all patient groups.

Table 1. Means, standard deviations, and reliability estimates for SF-36 and MFI scales

        Before surgery                 Three months after surgery     Post-Pre
Scale   Mean   St.dev.   Reliability   Mean   St.dev.   Reliability   d-index#
PF      3.96   1.22      0.93          3.18   1.32      0.92          −0.59**
RP      2.73   2.09      0.86          2.13   2.02      0.84          −0.27**
BP      3.94   1.19      0.87          3.68   1.21      0.92          −0.20*
SF      3.81   1.32      0.78          3.62   1.47      0.79          −0.11
MH      3.25   1.08      0.84          3.69   1.05      0.82          0.40**
RE      3.00   2.12      0.83          3.55   1.93      0.81          0.21*
VT      3.14   1.26      0.86          2.77   1.23      0.89          −0.30**
GH      2.96   0.95      0.50          2.96   1.06      0.78          0.00
FT      3.30   1.10      0.86          2.92   1.18      0.91          −0.35**

Notes: n = 170; SF-36 scales have been scored as described in the manual [4] but, for computational convenience, have been divided by 20 afterwards. As a result, all SF-36 and MFI scale scores range from 0 to 5; #Standardized mean difference: 0.2, 0.5, and 0.8 indicate ‘small’, ‘medium’, and ‘large’ differences [12]; *p < 0.01, **p < 0.001 in paired t-test.
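The d-index column and the paired t-tests in Table 1 are closely related. The sketch below assumes that the d-index is the mean difference divided by the standard deviation of the individual difference scores (an assumption: the table notes do not name the standardizer). Under that assumption the paired t statistic equals d times the square root of n. The function name is hypothetical; the d values are those of Table 1.

```python
import math

N_PATIENTS = 170

# Post-Pre d-index values from Table 1.
D_INDEX = {
    "PF": -0.59, "RP": -0.27, "BP": -0.20, "SF": -0.11, "MH": 0.40,
    "RE": 0.21, "VT": -0.30, "GH": 0.00, "FT": -0.35,
}


def paired_t_from_d(d, n):
    """Paired t statistic implied by d = mean(diff) / sd(diff); assumes that standardizer."""
    return d * math.sqrt(n)


t_values = {scale: paired_t_from_d(d, N_PATIENTS) for scale, d in D_INDEX.items()}
for scale, t in t_values.items():
    print(f"{scale}: d = {D_INDEX[scale]:+.2f}, implied t({N_PATIENTS - 1}) = {t:+.2f}")
```

Because the tabulated d values are rounded to two decimals, the implied t values are approximate.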

Below we first describe the measurement model that was used in the response shift detection procedure, then we present the results of response shift detection and true change assessment, and we conclude with an evaluation of the size of response shifts and true change.

Measurement model

Results from exploratory factor analyses and substantive considerations gave rise to the measurement model displayed in Figure 1. The circles represent unobserved, latent variables and the squares represent the observed variables. Three latent variables are the common factors general physical health (GenPhys), general mental health (GenMent), and general fitness (GenFitn). GenPhys is measured by PF, RP, BP, and SF, GenMent is measured by MH, RE, and again SF, and GenFitn is measured by VT, GH, and FT. Other latent variables are the residual factors ResPF, ResRP, ResBP, etc. The residual factors represent all that is specific to PF, RP, BP, etc., plus random error variation [13, 17]. Numbers in Figure 1 are maximum likelihood estimates of common factor loadings, common factor correlations, residual variances, and one residual correlation (numbers separated by a slash represent separate first and second occasion estimates).

The measurement model portrayed in Figure 1 resembles the principal components model of the SF-36 scales described by Ware et al. [4]. The general physical and mental components feature in both models, with largely the same indicators.

However, the addition of the MFI scale that specifically measures fatigue, FT, brought about the GenFitn factor. The SF-36 scales VT and GH, consisting of items with wordings that do not distinguish between physical and mental aspects, also loaded on the GenFitn factor. The wording of the SF items combines physical and mental aspects, causing SF to load on both GenPhys and GenMent. Another difference with the principal components model is that in our model the common factors were substantially correlated. The correlation between the GenPhys and GenFitn factors was especially high (0.87), but a two-factor model did not yield satisfactory fit (the preceding exploratory factor analyses showed significant χ² differences between fit measures of two- and three-factor models). Inspection of the standardized residuals showed that the covariance between RP and RE was not sufficiently explained by the correlation between the common factors. We therefore allowed the residual factors ResRP and ResRE to co-vary (0.22 correlation), to account for the very similar wordings of the RP and RE items.

Detection of response shifts and true change

Fit results for the four models that resulted from carrying out the four-step procedure are given in Table 2.

Step 1. The measurement model of Figure 1 was the basis of Model 1, a structural equation model for measurements at two occasions, but without any across occasion constraints. The χ² test of

[Figure 1: path diagram with the common factors Gen. Phys., Gen. Ment., and Gen. Fitness, the observed scales PF, RP, BP, SF, MH, RE, GH, VT, and FT, and the residual factors Res. PF to Res. FT]

Figure 1. The measurement model used in response shift detection. Notes: Circles represent latent variables (common and residual factors) and squares represent observed variables (the SF-36 and MFI scales). Abbreviations: PF – physical functioning; RP – role-physical; BP – bodily pain; GH – general health; VT – vitality; SF – social functioning; RE – role-emotional; MH – mental health; FT – fatigue. Numbers are maximum likelihood estimates of Model 4 parameters: common factor loadings, common factor correlations, residual variances, and a residual correlation. Parameter estimates separated by a slash represent separate first and second occasion estimates; all other parameters were constrained to be equal across occasions. Table 3 gives all Model 4 parameter estimates, for both occasions.

Table 2. Goodness of overall fit of models in the four-step response shift detection procedure

Model     Description                                          Df    CHISQ   RMSEA                  ECVI
Model 1   Measurement model (no across occasion constraints)   106   146.9   0.048 (0.027; 0.066)   1.98 (1.82; 2.20)
Model 2   No response shift model                              129   206.9   0.060 (0.044; 0.074)   2.03 (1.82; 2.30)
Model 3   Response shift model                                 124   158.2   0.040 (0.017; 0.058)   1.81 (1.70; 2.03)
Model 4   Final model (all tenable constraints imposed)        134   168.3   0.039 (0.015; 0.056)   1.73 (1.64; 1.96)

Notes: n = 170; Numbers between parentheses represent 90% confidence intervals.

exact fit was significant (CHISQ(106) = 146.9), but the RMSEA measure indicated close fit (RMSEA = 0.048, Table 2).
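The RMSEA point estimates in Table 2 can be retraced from the reported χ² values and degrees of freedom. The sketch below uses the common point-estimate formula, the square root of max(χ² − df, 0) / (df (n − 1)). Whether the software used in the study divides by n − 1 or by n is an assumption here; with n = 170 both variants agree to three decimals for these models. The function name is hypothetical.

```python
import math


def rmsea(chisq, df, n):
    """RMSEA point estimate; the (n - 1) divisor is an assumed convention."""
    return math.sqrt(max(chisq - df, 0.0) / (df * (n - 1)))


# (CHISQ, Df) for each model, as reported in Table 2.
TABLE_2 = {
    "Model 1": (146.9, 106),
    "Model 2": (206.9, 129),
    "Model 3": (158.2, 124),
    "Model 4": (168.3, 134),
}

rmsea_values = {name: rmsea(chisq, df, 170) for name, (chisq, df) in TABLE_2.items()}
for name, value in rmsea_values.items():
    print(f"{name}: RMSEA = {value:.3f}")
```

The rounded results (0.048, 0.060, 0.040, and 0.039) match the RMSEA column of Table 2; the confidence intervals are not reproduced by this sketch.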

Step 2. In Model 2, all response shift parameters were held invariant across occasions. This means that all across occasion invariance constraints on factor loadings, intercepts, and residual variances were imposed. The fit of Model 2, although still satisfactory (RMSEA = 0.060, Table 2), was significantly worse than the fit of Model 1, indicating the presence of response shifts (χ² difference test: CHISQ(23) = 60.0, p < 0.001).

Step 3. Inspection of modification indices and standardized residuals indicated which of the equality constraints were not tenable. Step by step modification of Model 2 yielded Model 3, which showed five cases of response shift, as will be explained below. The fit of Model 3 was good (RMSEA = 0.040), and significantly better than the fit of Model 2 (CHISQ(5) = 48.7).

Step 4. To investigate change in the means, variances, and correlations of the common factors, we fitted additional models with Model 3 as the starting point. The across occasion invariance of parameters was tested step-by-step, maintaining all equality constraints that proved tenable. This procedure finally yielded Model 4, which fitted the data closely (CHISQ(134) = 168.3, RMSEA = 0.039, Table 2). Estimates of all Model 4 parameters are given in Table 3.

Evaluation of response shifts and true change

It appears that most Model 4 parameters are invariant across occasions, but here we focus on the parameters that did change (printed in bold, Table 3). Results of significance tests of these changes are presented in Table 4.

Reconceptualization and reprioritization

Comparison of the common factor loadings of the first occasion with those of the second occasion (Table 3, top rows) shows that at the second occasion the GH scale became an indicator of GenMent, indicating reconceptualization of GH. In addition, the common factor loading of SF on GenPhys became larger at the second occasion, indicating reprioritization of SF.

Recalibration

Intercepts and residual variances contain information about uniform and nonuniform recalibration. For RP and BP, we found differences between first and second occasion intercepts, indicating uniform recalibration of both RP and BP. We also found a change in the variance of the residual factor ResVT, indicating nonuniform recalibration of VT.

True change

Common factor variances and common factor correlations did not change across occasions, but the common factor means did. Common factor means were fixed at zero for the first occasion (because of identification requirements), so that the second occasion estimates serve as direct representations of change. The across occasion differences were significant (p < 0.001) for each of the common factors: GenPhys (−0.77) and GenFitn (−0.35) deteriorated, and GenMent (+0.53) improved. Respective effect-sizes were ‘large’ (d = −0.80), ‘medium’ (d = −0.38), and ‘medium’ (d = 0.49).

Contributions of response shifts and true change to change in the observed variables

In addition to significance test results, Table 4 provides effect-sizes for observed change, and the response shift and true change contributions to observed change, as implied by the parameter estimates of Model 4 (in Table 3). From Table 4 it appears that the response shift effects on observed change are only ‘small’: 0.27 and 0.30 for the uniform recalibration of RP and BP, 0.14 for the reconceptualization of GH, −0.11 for the reprioritization of SF, and zero (obviously) for the nonuniform recalibration of VT. The effects of true change are generally larger, but vary from none (SF) to ‘medium’ (PF, RP, BP). For RP, BP, and GH the effects of response shifts and true change are in opposite directions.

The impact of response shifts on the measurement of true change

The impact of response shifts on the estimation of true change was investigated by checking what the estimated true change would have been if we had not accounted for the response shifts that we found.
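The response shift effect sizes quoted above (0.27, 0.30, 0.14, and −0.11) can be retraced from the Model 4 estimates in Table 3. The sketch below assumes that the model-implied mean change of a scale splits into an intercept difference (uniform recalibration), a loading difference times the second occasion factor means (reconceptualization or reprioritization), and the first occasion loadings times the change in factor means (true change), following our reading of the decomposition in [17]. It further assumes that each part is standardized by the model-implied standard deviation of the difference score. Both assumptions are ours, and all names are hypothetical.

```python
import math

# Model 4 estimates from Table 3; factor order: GenPhys, GenMent, GenFitn.
PHI_WITHIN = [[1.00, 0.53, 0.87], [0.53, 1.00, 0.65], [0.87, 0.65, 1.00]]
PHI_CROSS = [[0.54, 0.33, 0.49], [0.33, 0.41, 0.42], [0.49, 0.42, 0.59]]  # rows post, columns pre
ALPHA_POST = [-0.77, 0.53, -0.35]  # pre-test factor means are fixed at zero

# scale: (loadings pre, loadings post, intercept pre, intercept post,
#         residual variance pre, residual variance post, residual correlation)
SCALES = {
    "RP": ([1.56, 0, 0], [1.56, 0, 0], 2.70, 3.33, 1.73, 1.73, 0.12),
    "BP": ([0.84, 0, 0], [0.84, 0, 0], 3.93, 4.32, 0.75, 0.75, 0.32),
    "SF": ([0.45, 0.57, 0], [0.69, 0.57, 0], 3.82, 3.82, 0.92, 0.92, 0.05),
    "GH": ([0, 0, 0.49], [0, 0.30, 0.49], 2.96, 2.96, 0.62, 0.62, 0.30),
}


def bilinear(u, m, v):
    """Compute u' M v for plain Python lists."""
    return sum(u[i] * m[i][j] * v[j] for i in range(len(u)) for j in range(len(v)))


def decompose(lam1, lam2, tau1, tau2, theta1, theta2, res_corr):
    """Split the implied mean change into standardized contributions."""
    recalibration = tau2 - tau1
    loading_change = sum((b - a) * m for a, b, m in zip(lam1, lam2, ALPHA_POST))
    true_change = sum(a * m for a, m in zip(lam1, ALPHA_POST))
    var1 = bilinear(lam1, PHI_WITHIN, lam1) + theta1
    var2 = bilinear(lam2, PHI_WITHIN, lam2) + theta2
    cov = bilinear(lam2, PHI_CROSS, lam1) + res_corr * math.sqrt(theta1 * theta2)
    sd_diff = math.sqrt(var1 + var2 - 2.0 * cov)  # SD of the difference score
    return {
        "recalibration": recalibration / sd_diff,
        "loading_change": loading_change / sd_diff,
        "true_change": true_change / sd_diff,
    }


effects = {scale: decompose(*params) for scale, params in SCALES.items()}
for scale, parts in effects.items():
    print(scale, {name: round(value, 2) for name, value in parts.items()})
```

Under these assumptions the sketch returns 0.27 and 0.30 for the recalibration of RP and BP, −0.11 for the loading change of SF, and 0.14 for the loading change of GH, in line with the values in the text; the true change contribution for SF is close to zero.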
With Model 2 we found changes in the common factor means of −0.59, +0.48, and −0.35 for GenPhys, GenMent, and GenFitn, whereas with Model 4 these changes were −0.77, +0.53, and −0.35 (Table 3). However, the fit of Model 2 was perhaps too poor to allow interpretation of its

Table 3. Parameter estimates in the final model (Model 4)

Factor loadings (G)

Pre-test   GenPhys1   GenMent1   GenFitn1      Post-test   GenPhys2   GenMent2   GenFitn2
PF1        0.99       –          –             PF2         0.99       –          –
RP1        1.56       –          –             RP2         1.56       –          –
BP1        0.84       –          –             BP2         0.84       –          –
SF1        0.45       0.57       –             SF2         0.69       0.57       –
MH1        –          0.82       –             MH2         –          0.82       –
RE1        –          1.32       –             RE2         –          1.32       –
VT1        –          –          1.12          VT2         –          –          1.12
GH1        –          –          0.49          GH2         –          0.30       0.49
FT1        –          –          1.06          FT2         –          –          1.06

Intercepts (s), residual variances (Diag(W)), and residual correlations (Diag(W21*))

        Intercepts (s)          Residual variances (Diag(W))    Residual correlations (Diag(W21*))
Scale   Pre-test   Post-test    Pre-test   Post-test            Pre × Post
PF      3.94       3.94         0.64       0.64                 0.28
RP      2.70       3.33         1.73       1.73                 0.12
BP      3.93       4.32         0.75       0.75                 0.32
SF      3.82       3.82         0.92       0.92                 0.05
MH      3.25       3.25         0.44       0.44                 0.43
RE      2.92       2.92         2.33       2.33                 −0.03
VT      3.15       3.15         0.36       0.21                 0.22
GH      2.96       2.96         0.62       0.62                 0.30
FT      3.28       3.28         0.18       0.18                 0.23

Common factor variances (Diag (F)), correlations (F*), and means (a)

                Pre-test                            Post-test
                GenPhys1   GenMent1   GenFitn1      GenPhys2   GenMent2   GenFitn2
Variances       1.00       1.00       1.00          1.00       1.00       1.00
Correlations
  GenPhys1      1
  GenMent1      0.53       1
  GenFitn1      0.87       0.65       1
  GenPhys2      0.54       0.33       0.49          1
  GenMent2      0.33       0.41       0.42          0.53       1
  GenFitn2      0.49       0.42       0.59          0.87       0.65       1
Means           0.00       0.00       0.00          −0.77      0.53       −0.35

Notes: n = 170, goodness of overall fit measures: CHISQ(134) = 168.3, RMSEA = 0.039, RMSEA 90% confidence interval = 0.015–0.056. Results indicating across-occasion variance are printed in bold. Greek symbols refer to the structural equation model described by Oort [17]. Factor loadings are unstandardized, but covariances are decomposed into variances and correlations.

parameter estimates. Checking each of the five cases of response shift one by one, we found that only uniform recalibration affected the measurement of true change. If we ignored uniform recalibration, the GenPhys mean change was estimated at −0.54. Thus, in the present data set, uniform recalibration had some impact on the estimation of true change in GenPhys. Accounting for uniform recalibration, the effect-size increases by 0.23, from a ‘medium’ sized change to a ‘large’ sized

Table 4. Significance tests of response shifts, and effect-sizes of observed change, response shift, and true change in the final model (Model 4) Scale

PF RP BP SF MH RE VT GH FT

Response shift

Signiﬁcance test

Eﬀect-sizes

χ² (df = 1)

Prob.

Uniform recalibration Uniform recalibration Reprioritization

11.1 12.7 4.4