MEASURING DISABILITY IN PATIENTS WITH CHRONIC LOW BACK PAIN The usefulness of different instruments

RIJKSUNIVERSITEIT GRONINGEN

Thesis

to obtain the degree of doctor in Medical Sciences at the University of Groningen, on the authority of the Rector Magnificus, Dr. F. Zwarts, to be defended in public on Wednesday 19 April 2006 at 16.15

by Wietske Kuijer, born on 20 May 1980 in Zevenaar

Supervisors:

Prof. dr. J.H.B. Geertzen
Prof. dr. J.W. Groothoff

Co-supervisor:

Dr. P.U. Dijkstra

Assessment committee:

Prof. dr. M.H.W. Frings-Dresen
Prof. dr. G.J. Lankhorst
Prof. dr. J.J.A. Mooij

Paranymphs:

Sandra Brouwer
Grieke W. Olijve

This research was financially supported by: ZonMw (grant number 96-06-006); University Medical Centre Groningen; the foundation ‘Beatrixoord Noord Nederland’; the foundation ‘De Drie Lichten’.

The publication of this thesis was generously supported by: University Medical Centre Groningen; Northern Center for Healthcare Research; the foundation ‘Beatrixoord Noord Nederland’.

Printed by:

Stichting Drukkerij C. Regenboog, Groningen

Cover:

Nicole van der Veen-Kerssies

Kuijer, Wietske. Measuring disability in patients with chronic low back pain: the usefulness of different instruments. Thesis University of Groningen, the Netherlands – With ref. – With summary in Dutch.

ISBN 9077113444

Copyright: © 2006 W. Kuijer. All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopy, recording or any information storage or retrieval system, without the prior written permission of the copyright owner.

Contents

Chapter 1 Introduction

Chapter 2 Reliability and stability of the Roland Morris Disability Questionnaire: Intra Class Correlation and limits of agreement
Disability and Rehabilitation, 2004; 26 (3): 162-165

Chapter 3 Responsiveness of the Roland Morris Disability Questionnaire: consequences of using different external criteria
Clinical Rehabilitation, 2005; 19 (5): 488-495

Chapter 4 Measuring physical performance via self-report in healthy young adults
Journal of Occupational Rehabilitation, 2004; 14 (1): 77-87

Chapter 5 Safe lifting in patients with chronic low back pain: comparing FCE lifting task and NIOSH lifting guideline
Accepted for publication. Journal of Occupational Rehabilitation

Chapter 6 Prediction of sickness absence in patients with low back pain: a systematic review
Accepted for publication. Journal of Occupational Rehabilitation

Chapter 7 Work status and chronic low back pain: exploring the International Classification of Functioning, Disability and Health
Disability and Rehabilitation, 2006; 28 (6): 379-388

Chapter 8 Matching FCE activities and work demands: an explorative study
Accepted for publication. Journal of Occupational Rehabilitation

Chapter 9 General discussion

Summary

Samenvatting

Northern Center for Healthcare Research (NCH)

Dankwoord

Curriculum Vitae

Introduction

Chapter 1


Introduction

Low Back Pain

Non-specific low back pain (LBP) is a very common complaint in Western industrialised countries. Seventy to 85% of all people have LBP at some time in their life.1 LBP can be defined as ‘mechanical’ pain of musculoskeletal origin in which symptoms vary with physical activities and time, and which often spreads to one or both buttocks or thighs.2 There is no damage to the nerves or any more serious spinal pathology.2,3 LBP often develops spontaneously, and mostly resolves within 4-6 weeks after onset.1,2 However, in some cases (less than 10%), the back pain persists and becomes chronic.4 About 93% of subjects with chronic low back pain (CLBP) have a new episode of LBP in the following 12 months, which shows the intermittent character of CLBP.5

Although a relatively small percentage of patients develops CLBP, it is known that CLBP has a major impact on patients’ functioning.6 Over 30% of patients with CLBP seek healthcare for their back complaints, and about 66% of subjects with recurrent CLBP who sought care for complaints at baseline sought care again during follow-up.7 Seeking healthcare and work incapacity associated with CLBP contribute to the high costs of CLBP.4 Understanding the factors that determine healthcare utilisation and work incapacity associated with LBP is important to lower the costs associated with CLBP. In addition, these factors can help to detect groups who deserve special attention. For healthcare seeking, it may also be important to detect inaccessibility of healthcare services for patients with certain characteristics. For clinicians this understanding is important because it informs them about the patients who consult them, and it provides researchers with knowledge about characteristics of back pain populations in different settings.7

Chronic Low Back Pain and Disability

In general, disability seems to be one of the most important determinants for seeking healthcare in patients with CLBP.7-9 In 1980, the World Health Organisation (WHO) defined disability as ‘any restriction or lack (resulting from an impairment) of ability to perform an activity in the manner or within the range considered normal for a human being’.10 This definition assumes that the norm is to have no disability or restriction of any kind, and that disability is ‘due to an impairment’.2 In 2001, however, the WHO presented the International Classification of Functioning, Disability and Health (ICF),11 a biopsychosocial model currently used in rehabilitation and disability perspectives12 (Figure 1.1). In this classification, disability is defined as an umbrella term for impairments, activity limitations and participation restrictions. It denotes the negative aspects of the interaction between an individual (with a health condition) and that individual’s contextual factors (environmental and personal factors).11

Patients with CLBP may be impaired in body functions and structures, limited in performing activities and restricted in participation. Experience has shown that measuring impairments in body functions and structures alone cannot explain the complete concept of disability in CLBP. This also means that the focus on disability in rehabilitation treatment has shifted: pain and complaints no longer determine the level of disability; rather, the interaction between the concepts does, with the focus on activity and participation. The guiding principle in rehabilitation treatment has shifted from a complaint-contingent approach to a more time-contingent approach.13,14

[Figure 1.1: diagram relating the components Health condition (disorder or disease), Body functions and body structures, Activity, Participation, Environment, and Personal.]

Figure 1.1 Model of the International Classification of Functioning, Disability and Health11

Measuring Disability

Measuring disability is an important topic in rehabilitation research in patients with CLBP. Due to the major impact of CLBP on functioning in both daily living and work, measuring disability in patients with CLBP is best described in terms of limitations in activities and restrictions in participation in daily living and work. Measuring disability in rehabilitation medicine can serve three purposes: assessment, evaluation and prediction of disability and functioning. Assessment of functioning refers to determining the need for rehabilitation and specifying the content of the treatment. Evaluation of functioning refers to evaluating whether the patient has changed after treatment. Prediction of functioning refers to the prediction of (the impact on) functioning after treatment.

Disability can be measured using different kinds of instruments, such as self-reports, performance testing, clinical observation, or a combination of these instruments.15 Each of these instruments can result in a different level of disability because each takes a different perspective, and professionals should be aware of these differences when using them in daily practice.16,17 Despite differences in outcome, each measurement instrument used should be reliable, valid and responsive for an adequate assessment, evaluation and prediction of functioning. Reliability refers to the consistency and the amount of error inherent in any instrument, and good reliability is a prerequisite for each instrument to serve all three purposes.18 The validity of an instrument refers to the extent to which the instrument measures what it intends to measure. Validity does not refer to the instrument itself, but to the inferences one can make from the instrument.18 Responsiveness involves the ability to detect a clinically important change in the construct being measured, and is relevant to evaluating whether the patient has changed.18,19

Measurement Instruments

A frequently used instrument in rehabilitation medicine to measure disability in patients with CLBP is the Roland Morris Disability Questionnaire (RMDQ).20-22 The RMDQ is a self-report instrument,23 derived from the Sickness Impact Profile,24 that assesses dichotomously (yes/no) perceived limitations due to low back pain in 24 activities of daily living. The total score is calculated by summing the ‘yes’ answers. The scale ranges from zero (no disability) to 24 (severe disability). A validated Dutch language version is available (RMDQ-Dv),5,20,25 but test-retest reliability and limits of agreement have not been investigated previously in the RMDQ-Dv. In addition, these studies found different responsiveness statistics, depending on the external criteria used to measure change.

A frequently used measurement instrument in rehabilitation medicine to measure performance of work-related activities is the Functional Capacity Evaluation (FCE).26 Different FCEs are available, and one of the better known is the Isernhagen Work Systems Functional Capacity Evaluation (IWS FCE).27 The IWS FCE consists of 28 work-related tasks based on the Dictionary of Occupational Titles (DOT),26,28,29 including lifting, carrying, pushing, pulling, forward bending, squatting, crouching, etc. Several subtests of the IWS FCE have shown good reliability in patients with CLBP.30
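The RMDQ scoring rule described above (24 yes/no items, total score equal to the number of ‘yes’ answers) can be illustrated with a minimal sketch. The function name, the constant and the example answers are hypothetical and serve illustration only; they are not part of the questionnaire or of any official scoring software.

```python
from typing import Sequence

N_ITEMS = 24  # the RMDQ has 24 yes/no items (as stated in the text)


def rmdq_sum_score(answers: Sequence[bool]) -> int:
    """Return the RMDQ sum score: the number of 'yes' answers (0 to 24)."""
    if len(answers) != N_ITEMS:
        raise ValueError(f"expected {N_ITEMS} answers, got {len(answers)}")
    return sum(1 for answer in answers if answer)


# Hypothetical patient: 'yes' on 13 items, 'no' on the remaining 11.
example_answers = [True] * 13 + [False] * 11
print(rmdq_sum_score(example_answers))  # 13
```

A score of 0 then corresponds to no disability and a score of 24 to severe disability, as in the scale description above.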


Aim of this thesis

In this thesis, the RMDQ and the IWS FCE will be examined on their usefulness to assess, evaluate and predict disability and functioning in patients with CLBP. The thesis will mainly focus on the assessment, evaluation and prediction of disability in ADL and work-related activities. Furthermore, physical and psychosocial factors will be studied on their ability to assess and predict disability in work participation in patients with CLBP. The main research questions answered in this thesis are:

Disability in ADL
- What is the reliability and stability of the Dutch language version of the Roland Morris Disability Questionnaire?
- What are the consequences of using different external criteria on the responsiveness of the Roland Morris Disability Questionnaire?

Disability and functioning in work-related activities
- To what extent can self-report replace the lifting task of the Functional Capacity Evaluation?
- Do the lifting task of the Isernhagen Work Systems Functional Capacity Evaluation and the recommended weight limit of the National Institute for Occupational Safety and Health produce similar safe lifting weights?

Disability and functioning in work
- What are predictors for sickness absence in patients with chronic low back pain, distinguishing predictors aimed at the decision to report sick (absence threshold) and the decision to return to work (return to work threshold)?
- Which variables are related to work status in patients with chronic low back pain, classified according to the International Classification of Functioning, Disability and Health?

Disability and functioning in work-related activities in relation to work participation
- To what extent can the standardised Functional Capacity Evaluation be matched to specific work demands of patients with chronic low back pain, and can this match predict sickness absence in the year after rehabilitation treatment?

Outline of this thesis

The design of this thesis is presented in Figure 1.2. Chapters 2 and 3 focus on the assessment and evaluation of self-reported disability and functioning in ADL, by studying the reliability of the RMDQ (chapter 2) and the responsiveness of the RMDQ (chapter 3) in patients with CLBP. Chapters 4 and 5 focus on the validity of measuring disability, related to the assessment, evaluation and prediction of disability and functioning in work-related activities. In chapter 4, the concurrent validity of the FCE (a performance test) in relation to self-report is investigated in healthy persons, and in chapter 5, the concurrent validity of the IWS FCE in relation to the NIOSH lifting guideline (calculation based on performance) is investigated in patients with CLBP. Chapters 6 and 7 focus on the prediction of disability and functioning in work. Chapter 6 is a systematic review investigating predictors aimed at the decision to report sick and to return to work. Chapter 7 is a cross-sectional study investigating which variables, including the RMDQ and FCE, are related to work status, classified according to the ICF. Chapter 8 focuses on the match between work-related activities and work participation, and the prediction of functioning in work. Finally, chapter 9 discusses the main findings of these and previous studies in relation to the usefulness of the RMDQ and FCE to assess, evaluate and predict disability and functioning in patients with CLBP. Clinical implications of these findings and recommendations for further research are presented.

[Figure 1.2: scheme of the thesis design under the title "Measuring Disability", with the rows Goal (Assessment, Evaluation, Prediction), Perspective (Self report, Performance), ICF domain (ADL, Work activities, Work participation) and Studies (Ch 2, Ch 3; Ch 4, Ch 5; Ch 6, Ch 7, Ch 8).]

Figure 1.2. Design of the thesis
ADL: Activities of Daily Living; ICF: International Classification of Functioning, Disability and Health; Ch: Chapter


References

1. Andersson GBJ. Epidemiological features of chronic low-back pain. Lancet 1999;354:581-5.
2. Waddell G. The Back Pain Revolution. London: Churchill Livingstone, 1998.
3. Deyo RA, Rainville J, and Kent DL. What can the history and physical examination tell us about low back pain? JAMA 1992;268:760-5.
4. Nachemson AL. Newest knowledge of low back pain: A critical look. Clin Orthop Relat Res 1992;279:8-20.
5. De Vet HC, Heymans MW, Dunn KM, Pope DP, van der Beek AJ, Macfarlane GJ, Bouter LM, and Croft PR. Episodes of low back pain: a proposal for uniform definitions to be used in research. Spine 2002;27:2409-16.
6. Picavet HS, and Schouten JS. Musculoskeletal pain in the Netherlands: prevalences, consequences and risk groups, the DMC(3)-study. Pain 2003;102:167-78.
7. IJzelenberg W, and Burdorf A. Patterns of Care for Low Back Pain in a Working Population. Spine 2004;29:1362-8.
8. Van den Hoogen HJ, Koes BW, van Eijk JT, Bouter LM, and Deville W. On the course of low back pain in general practice: a one year follow-up study. Ann Rheum Dis 1998;57:13-9.
9. Molano SM, Burdorf A, and Elders LAM. Factors associated with medical care-seeking due to low-back pain in scaffolders. Am J Ind Med 2001;40:275-81.
10. WHO. International classification of impairments, disabilities and handicaps. Geneva: World Health Organization, 1980.
11. WHO. ICF: International Classification of Functioning, Disability and Health. Geneva: World Health Organization, 2001.
12. Stucki G. International Classification of Functioning, Disability, and Health (ICF). A promising framework and classification for rehabilitation medicine. Am J Phys Med Rehabil 2005;84:733-40.
13. Kwaliteitsinstituut voor de gezondheidszorg CBO. Richtlijn. Aspecifieke lage rugklachten. [Guideline. Non-specific low back pain.] Alphen aan den Rijn: Van Zuiden Communications B.V. 2003.
14. Koes BW, Sanders RJ, and Tuut MK. CBO-richtlijn voor diagnostiek en behandeling van acute en chronische aspecifieke lage rugklachten. [CBO-guideline for diagnosis and treatment of non-specific acute and chronic low back pain.] Ned Tijdschr Geneeskd 2004;148:310-4.
15. Wunderlich GS. Measuring Functional Capacity and Work Requirements, summary of a workshop. Washington D.C.: National Academy Press, 1999.
16. Brouwer S, Dijkstra PU, Stewart RE, Göeken LNH, Groothoff JW, and Geertzen JHB. Measuring work limitations in chronic low back pain: comparing self-report, clinical examination and functional testing. Disabil Rehabil 2005;27:999-1005.
17. Lee CE, Simmonds MJ, Novy DM, and Jones S. Self-reports and clinician-measured physical function among patients with low back pain: a comparison. Arch Phys Med Rehabil 2001;82:227-31.
18. Streiner DL and Norman GR. Health Measurement Scales. A practical guide to their development and use. New York: Oxford University Press Inc. 1995.
19. Guyatt G, Walter S, and Norman G. Measuring change over time: assessing the usefulness of evaluative instruments. J Chron Dis 1989;40:171-8.
20. Beurskens AJ, de Vet HC, Koke AJ, van der Heijden GJ, and Knipschild PG. Measuring the functional status of patients with low back pain. Spine 1995;20:1017-28.
21. Grotle M, Brox JI, and Vollestad NK. Functional status and disability questionnaires: what do they assess? A systematic review of back specific outcome questionnaires. Spine 2004;30:130-40.
22. Müller U, Duetz MS, Roeder C, and Greenough CG. Condition-specific outcome measures for low back pain. Part 1: Validation. Eur Spine J 2004;13:301-13.
23. Roland M, and Morris R. A study of the natural history of back pain. Part 1. Development of a reliable and sensitive measure of disability in low back pain. Spine 1983;8:141-4.
24. Bergner M, Bobbitt RA, Carter WB, and Gilson BS. Sickness Impact Profile: Development and final revision of health status measure. Med Care 1981;19:787-805.
25. Gommans IHB, Koes BW, and van Tulder MW. Validiteit en responsiviteit Nederlandstalige Roland Disability Questionnaire. Vragenlijst naar functionele status bij patiënten met chronische lage rugpijn. [Validity and responsiveness of the Dutch Roland Disability Questionnaire. A functional status questionnaire in patients with chronic low back pain.] Ned Tijdschr Fys 1997;107:28-33.
26. King PM, Tuckwell N, and Barrett TE. A critical review of Functional Capacity Evaluations. Phys Ther 1998;78:852-66.
27. Isernhagen Work Systems. Functional Capacity procedure manual. Duluth, MN: 1997.
28. Abdel-Moty E, Fishbain DA, Khalil TM, Sadek S, Cutler R, Rosomoff RS, and Rosomoff HL. Functional capacity and residual functional capacity and their utility in measuring work capacity. Clin J Pain 1993;9:2003-13.
29. U.S. Department of Labor, Employment and Training Administration. Dictionary of Occupational Titles. Washington D.C.: 1986.
30. Brouwer S, Reneman MF, Dijkstra PU, Groothoff JW, Schellekens JMH, and Göeken LNH. Test-retest reliability of the Isernhagen Work Systems Functional Capacity Evaluation in patients with chronic low back pain. J Occup Rehabil 2003;13:207-18.

Chapter 2

Reliability and stability of the Roland Morris Disability Questionnaire: Intra Class Correlation and limits of agreement

Sandra Brouwer, Wietske Kuijer, Pieter U. Dijkstra, Ludwig N.H. Göeken, Johan W. Groothoff, Jan H.B. Geertzen

Published in: Disability and Rehabilitation, 2004; 26 (3): 162-165. Reprinted with the kind permission of Taylor & Francis Group, www.tandf.co.uk

Abstract

Purpose: To analyse test-retest reliability and stability of the Dutch language version of the Roland Morris Disability Questionnaire (RMDQ) in a sample of patients (n=30) suffering from Chronic Low Back Pain (CLBP).

Method: Patients filled out the Dutch language version of the RMDQ twice, with a two-week interval, before starting the rehabilitation programme. The Intra Class Correlation coefficient (ICC, one way random) was used as a measure of reliability, and the limits of agreement were calculated to quantify the stability of the RMDQ. An ICC of 0.75 or more was considered acceptable reliability. No criteria for limits of agreement were available. However, smaller limits of agreement indicate more stability because they indicate that the natural variation is small.

Results: The Dutch RMDQ showed good reliability, with an ICC of 0.91. When limits of agreement were calculated to quantify stability, a large amount of natural variation (± 5.4) was found relative to the total scoring range of 0 to 24.

Conclusion: The Dutch RMDQ proves to be a reliable instrument to measure functional status in CLBP patients. However, the natural variation should be taken into account when using it clinically.


Introduction

Functional status is an important evaluative outcome measure in low back pain rehabilitation.1,2 To assess changes in functional status after treatment in patients with low back pain, the Roland Morris Disability Questionnaire (RMDQ) is frequently used.2-4 The RMDQ is derived from the Sickness Impact Profile, a general health questionnaire.5 For an outcome measurement, it is important that the reliability is good and that repeated measures in individuals remain stable over time,6 in the absence of treatment.

In reliability studies of the RMDQ, the Pearson correlation coefficient is often used as a measure of reliability.2,7,8 The Pearson correlation reflects the extent to which two repeated measures can be fitted by a straight line. The disadvantage of this statistical measure is that repeated measures may differ systematically, yet correlate highly or perfectly. By contrast, the intra-class correlation coefficient (ICC) assesses not only the strength of correlation, but also whether all measures on each subject are identical and do not differ systematically. Therefore, the ICC is preferable to the Pearson correlation as a measure of reliability. However, the Pearson coefficient is usually higher than the ICC and may be used more often for that reason.

Stability over time, in the absence of treatment, may be influenced by within-patient variance and random errors. These sources of variance may lead to instability or fluctuations on the RMDQ-scale: 'natural variations'.6 If a person fills out the same questionnaire on two occasions, it is relevant to know what variation in test scores can be expected in the absence of treatment. To investigate this natural variation on the RMDQ-scale, limits of agreement can be calculated according to the method of Bland and Altman.9 In an individual patient, the change due to treatment should exceed these limits of agreement before one can state that the treatment has been effective. Therefore, limits of agreement should be taken into account when using the RMDQ clinically.

The English version of the RMDQ shows good reliability.1,7,10 However, limits of agreement have not been investigated. A validated Dutch language version of the RMDQ is available,11 but test-retest reliability and limits of agreement have not been investigated previously. The aim of this study is to investigate the test-retest reliability of the Dutch RMDQ for patients with chronic low back pain (CLBP), using the ICC as a measure of reliability, as well as to quantify the stability of the RMDQ by calculating limits of agreement.
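The difference between the Pearson correlation and the ICC described above can be made concrete with a small sketch. The scores below are invented, and the ICC is computed with the single-measure, one-way random formula, which is an assumption: the text specifies "one way random" but not whether the single-measure or average-measure form was used. The function names are hypothetical.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation between two repeated measurements."""
    return float(np.corrcoef(np.asarray(x, float), np.asarray(y, float))[0, 1])


def icc_oneway(x, y):
    """Single-measure, one-way random ICC for two sessions per subject.

    ICC = (MS_between - MS_within) / (MS_between + (k - 1) * MS_within)
    """
    data = np.column_stack([x, y]).astype(float)  # rows: subjects, columns: sessions
    n, k = data.shape
    subject_means = data.mean(axis=1)
    grand_mean = data.mean()
    ms_between = k * np.sum((subject_means - grand_mean) ** 2) / (n - 1)
    ms_within = np.sum((data - subject_means[:, None]) ** 2) / (n * (k - 1))
    return float((ms_between - ms_within) / (ms_between + (k - 1) * ms_within))


# Invented scores: every patient scores exactly 3 points higher in session 2,
# i.e. a purely systematic difference between the sessions.
session1 = [4, 8, 10, 13, 16, 20]
session2 = [score + 3 for score in session1]

print(round(pearson_r(session1, session2), 3))   # perfect linear relation
print(round(icc_oneway(session1, session2), 3))  # lower, penalised by the shift
```

In this constructed case the Pearson coefficient is perfect even though no patient obtained the same score twice, whereas the ICC falls below 1 because the systematic shift is counted as within-subject disagreement.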


Methods

General procedure

Patients with CLBP were recruited from the population admitted for rehabilitation treatment at the Centre for Rehabilitation of the University Hospital Groningen. Patients were included in the study if they were between 18 and 65 years of age and were still at work or had been out of work due to CLBP for less than 1 year. Exclusion criteria were specific low back pain, being entirely off work for a year or more, cardiovascular or pulmonary diseases, pregnancy, addiction, and psychopathology. Patients filled out the Dutch language version of the RMDQ twice, with a two-week interval, before starting the rehabilitation program. Time, day and place of assessment were held constant for the two test-sessions. The present study was approved by the Medical Ethical Committee of the University Hospital Groningen.

Population

Thirty patients (24 male and 6 female) with CLBP participated in this study. All patients were referred for treatment in a rehabilitation centre between May 2000 and April 2001 and agreed to participate. Demographics and medical history were obtained for all patients. The mean age of the patients was 40 years (SD 8.1 yr). The duration of low back pain ranged between 5 and 10 years. Patients were off work for a mean of 17 weeks (SD 19.2). Fifteen patients (50%) were receiving financial compensation.

Dutch language version of the RMDQ

The Dutch language version of the RMDQ is a translation of the original RMDQ.8 It assesses perceived limitations in 24 activities of daily living dichotomously. The sum score is calculated by summing the ‘yes’ answers. The scale ranges from 0 (no disability) to 24 (severe disability).

Data analyses

Descriptive statistics were calculated for the total scores of the two test-sessions. Test-retest reliability of the sum scores was determined by means of a paired t-test and the intra class correlation coefficient (ICC, one way random). Limits of agreement were used to determine the natural variation for quantifying stability over time.9,12 To calculate limits of agreement, the difference between the two sessions was plotted for each patient against that patient's mean of the two sessions. Then the average difference between the two sessions and the standard deviation of the difference between the two scores (SDchange) were calculated. Finally, the limits of agreement were calculated as twice this standard deviation (± 2 × SDchange). An ICC above 0.75 was considered good reliability.13-15 No criteria for interpretation of the limits of agreement were available. However, smaller limits of agreement indicate more stability because they indicate that the natural variation is small. Data analyses were performed using the Statistical Package for Social Sciences (SPSS 10.0).
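The limits-of-agreement calculation described in the data analyses can be sketched as follows. This is a minimal illustration that follows the steps as stated in the text (difference per patient, mean difference, SD of the differences, limits equal to twice that SD); the function name and the example scores are hypothetical, and the original analysis was run in SPSS, not with this code.

```python
import statistics


def limits_of_agreement(session1, session2):
    """Return (mean difference, SD of differences, limit = 2 * SD).

    Follows the steps in the text: the limits are reported as
    plus or minus twice the standard deviation of the differences.
    """
    if len(session1) != len(session2):
        raise ValueError("both sessions must contain one score per patient")
    diffs = [a - b for a, b in zip(session1, session2)]
    mean_diff = statistics.mean(diffs)
    sd_change = statistics.stdev(diffs)  # sample SD (n - 1 in the denominator)
    return mean_diff, sd_change, 2 * sd_change


# Invented RMDQ sum scores for six patients (not data from the study).
rmdq1 = [13, 9, 17, 12, 15, 8]
rmdq2 = [12, 10, 14, 13, 12, 9]

mean_diff, sd_change, limit = limits_of_agreement(rmdq1, rmdq2)
print(f"mean difference {mean_diff:.2f}, SDchange {sd_change:.2f}, limits ± {limit:.2f}")
```

Plotting each difference against the corresponding mean of the two sessions, as described above, gives the usual Bland and Altman plot; the sketch only covers the numerical part.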

Results

The mean sum score was 13.0 (SD 4.8) in the first session and 12.1 (SD 5.0) in the second session. The mean difference was 0.83 (SD 2.7) (95% CI of the difference: -0.2 to 1.8). The ICC was 0.91 (95% CI: 0.82 to 0.96). Limits of agreement were ± 5.4 (figure 2.1).
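As a plausibility check, the reported limits of agreement and confidence interval can be reproduced from the reported mean difference, SD and sample size. The sketch below assumes that the interval is a standard paired t-interval with 29 degrees of freedom (critical value about 2.045); the text does not state how the interval was computed, and the symbols \bar{d} and SD_change are notation introduced here.

```latex
\begin{align*}
\text{limits of agreement} &= \pm\, 2 \times SD_{\text{change}}
  = \pm\, 2 \times 2.7 = \pm\, 5.4 \\[4pt]
95\%\ \text{CI of } \bar{d} &= \bar{d} \pm t_{0.975,\,29}\,\frac{SD_{\text{change}}}{\sqrt{n}}
  = 0.83 \pm 2.045 \times \frac{2.7}{\sqrt{30}}
  \approx 0.83 \pm 1.01
\end{align*}
```

Under these assumptions the interval runs from about -0.18 to 1.84, which agrees with the reported -0.2 to 1.8 after rounding and includes zero, consistent with the absence of a systematic difference between the sessions.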

[Figure 2.1: scatter plot. Horizontal axis: Mean: (RMDQ1+RMDQ2)/2, ticks from 0 to 24. Vertical axis: Difference: (RMDQ1-RMDQ2), ticks at 0, ± 2.7, ± 5.4 and ± 8.1.]

Figure 2.1. Difference between RMDQ1 and RMDQ2 plotted against average of RMDQ1


Discussion

No systematic differences were found between the sum scores of the first and the second session. The reliability of the Dutch RMDQ was good (ICC (one way random) above the criterion of 0.75). Similar ICCs of 0.75 or higher were found in many other RMDQ studies.10,16-21 However, considerably lower ICCs, ranging from 0.42 to 0.66, were also found.21,22 Most studies with ICC values of ≥0.75 used an interval of 1-14 days between the two sessions, whereas for the studies with ICCs below 0.75, the interval was 6 weeks or more. Almost all studies with a time interval of more than 2 weeks have lower ICCs than the studies with an interval of 2 weeks or less. An explanation of this phenomenon might be that a shorter interval between the two sessions may result in patients remembering the score of the previous session. A larger interval between the two sessions may result in patients no longer remembering the score of the previous session and in a change of clinical status in that period. Reliability of functional status questionnaires may be best measured using an interval of 1-2 weeks, a period in which the clinical status is reasonably stable in chronic pain patients.4 In our study we used an interval of two weeks.

Comparing studies using Pearson correlation2,7,8 with studies using ICC10,16-21 as a measure of reliability, it appears that the magnitudes of Pearson and ICC are similar, i.e. the reliability is good. This suggests that the predominant source of error is random variation rather than a systematic difference. Under these circumstances, the Pearson and ICC are very similar.14

To quantify stability, we investigated the natural variation by calculating limits of agreement according to the method of Bland and Altman.9 Despite the good reliability (ICC), the limits of agreement (± 5.4) were large relative to the total scoring range of 0 to 24. This means that within-person variance or random errors have led to instability in measurement results; approximately 95% of all differences within persons will lie within ± 5.4. This large amount of natural variation should be taken into account when using the RMDQ clinically. Effects of therapy should exceed the limits of agreement before one can state that the treatment has been effective. Post-hoc analysis showed that for all items ≥ 70% of the scores were the same for the two sessions. Thus, the large amount of natural variation could not be attributed to some specific items.

De Vet et al.6 used the Smallest Real Difference for individuals (SRDindividual) as a measure for quantifying the stability of the RMDQ. Despite the use of different terms, the calculations of the limits of agreement and the SRDindividual are the same. We found limits of agreement of 5.4 on a scale of 0-24 on the RMDQ; de Vet et al.6 found an SRDindividual value of 5.9. Limits of agreement, in our study, were calculated on the basis of scores collected before patients started the intervention, to minimize the possibility that a clinically important change of the construct would occur in the period of data collection. The study of de Vet et al.,6 however, is an intervention study, and the SRDindividual was calculated on the basis of the scores of a group of patients who rated themselves as not having changed to a clinically important degree despite the intervention. Whether a patient had not changed to a clinically important degree was estimated from the global perceived effect, assessed by the patient on a 7-point transition scale (1 = completely recovered, 7 = vastly worsened).23 The validity of a transition scale, however, is problematic; it cannot be regarded as a 'gold standard'. Bias may affect subjective assessment of change: patients do not remember correctly how they felt at the beginning of treatment. Moreover, they usually underestimate their initial state, which exaggerates the effect of the program.14 Classifying patients as ‘changed’ or ‘not-changed’ on the basis of these results may be biased.

Limits of agreement can also be used as a cut-off score for change in an intervention study or in daily practice. The cut-off change score determines the minimum change that is considered to be clinically relevant.24,25 Based on our results, patients have to change at least 6 points on a scale of 0-24 of the RMDQ to exceed the natural variation and to be judged as having really changed. Several intervention studies have determined the cut-off score for change of the RMDQ,4,21,26,27 ranging from 2 to 5 points. These studies underestimate the height of the cut-off score; changes on the RMDQ scale ranging from 2 to 5 points cannot be detected as a clinically relevant change, given the natural variation we found.
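The step from limits of agreement of ± 5.4 to a cut-off of 6 points follows from the RMDQ being scored in whole points: the smallest whole-point change that exceeds 5.4 is 6. A minimal sketch of this reasoning is given below; the function names and the example scores are hypothetical.

```python
import math


def smallest_change_exceeding(limit: float) -> int:
    """Smallest whole-point change that is strictly larger than the limit."""
    return math.floor(limit) + 1


def exceeds_natural_variation(score_before: int, score_after: int,
                              limit: float = 5.4) -> bool:
    """True if the individual change lies outside the limits of agreement."""
    return abs(score_before - score_after) > limit


print(smallest_change_exceeding(5.4))    # 6
print(exceeds_natural_variation(13, 7))  # True: a change of 6 points
print(exceeds_natural_variation(13, 8))  # False: a change of 5 points
```

With the limits found in this study, a change of 2 to 5 points in an individual patient therefore stays within the range that natural variation alone can produce.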

Conclusion

The Dutch RMDQ proves to be a reliable instrument to assess functional status in CLBP patients. However, a large amount of natural variation (± 5.4) was found relative to the total scoring range of 0 to 24.

Acknowledgements The authors like to thank R. Schiphorst Preuper (MD), C Muskee (MD), W. Jorritsma (MD) and J. Stubbe (MSc) for their valuable assistance in selecting patients and collecting data. This study was supported by ‘ZonMw’ grant number 96-06-006.


Chapter 2

References

1. Deyo R. Comparative validity of the Sickness Impact Profile and Shorter Scales for Functional Assessment in Low-Back Pain. Spine 1986;11:951-4.
2. Bombardier C. Outcome assessments in the evaluation of treatment of spinal disorders: summary and general recommendations. Spine 2000;25:3100-03.
3. Roland M, and Fairbank J. The Roland-Morris Disability Questionnaire and the Oswestry Disability Questionnaire. Spine 2000;25:3115-24.
4. Beurskens AJ, de Vet HC, Köke AJ, van der Heijden GJ, and Knipschild PG. Measuring the functional status of patients with low back pain. Spine 1995;20:1017-28.
5. Bergner M, Bobbitt RA, Carter WB, and Gilson BS. Sickness Impact Profile: Development and final revision of health status measure. Med Care 1981;19:787-805.
6. De Vet HCW, Bouter LM, and Bezemer PD. Reproducibility and responsiveness of evaluative outcome measures. Int J Technol Assess Health Care 2001;17:479-87.
7. Jensen MP, Strom SE, Turner J, and Romano JM. Validity of the sickness impact profile Roland scale as a measure of dysfunction in chronic pain patients. Pain 1992;50:157-62.
8. Roland M, and Morris R. A study of the natural history of back pain, part I: development of a reliable and sensitive measure of disability in low back pain. Spine 1983;8:141-4.
9. Bland JM, and Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986; 307-10.
10. Stratford PW, Binkley JM, and Riddle DL. Development and initial validation of the back pain functional scale. Spine 2000;25:2095-102.
11. Gommans IHB, Koes BW, and van Tulder MW. Validiteit en responsiviteit Nederlandstalige Roland Disability Questionnaire. Vragenlijst naar functionele status bij patiënten met lage rugpijn. Ned Tijdschr Fys 1997;107:28-33.
12. Altman DG, and Bland JM. Measurement in Medicine: The analysis of method comparison studies. Statistician 1983;32:307-17.
13. Lee J, Koh D, and Ong CN. Statistical evaluation of agreement between two methods for measuring quantitative variables. Comput Biol Med 1989;19:61-70.
14. Streiner DL, and Norman GR. Health Measurement Scales: A practical Guide to Their Development and Use. Oxford: Oxford University Press, 1995: 104-27.
15. Tammemagi MC, Frank JW, LeBlanc M, Artsob H, and Streiner DL. Methodological issues in assessing reproducibility - A comparative study of various indices of reproducibility applied to repeat elisa serologic tests for lyme disease. J Clin Epidemiol 1995;48:1123-32.
16. Johansson E, and Lindberg P. Subacute and chronic low back pain: reliability and validity of a Swedish version of the Roland and Morris disability Questionnaire. Scand J Rehabil Med 1998;30:139-43.
17. Kopec JA, Esdaile JM, Abrahamowicz M, Abenhaim L, Wood-Dauphinee S, Lamping DL, and Williams JI. The Quebec Back pain disability scale: measurement properties. Spine 1995;20:341-52.
18. Nusbaum L, Natour J, Ferraz MB, and Goldenberg J. Translation, adaptation and validation of the Roland-Morris questionnaire: Brazil Roland-Morris. Brazil J Med Biol Res 2001;34:203-10.
19. Underwood MR, Barnett AG, and Vickers MR. Evaluation of two time-specific back pain outcome measures. Spine 1999;24:1104-12.
20. Jacob T, Baras M, Zeev A, and Epstein L. Low back pain: reliability of a set of pain measurement tools. Arch Phys Med Rehabil 2001;82:735-42.
21. Patrick DL, Deyo RA, Atlas SJ, Singer DE, Chapin A, and Keller RB. Assessing health-related quality of life in patients with sciatica. Spine 1995;20:1899-908.
22. Davidson M, and Keating JL. A comparison of five low back disability questionnaires: reliability and responsiveness. Phys Ther 2002;82:8-24.
23. Beurskens AJ, de Vet HC, and Köke AJ. Responsiveness of functional status in low back pain. A comparison of different instruments. Pain 1996;65:71-6.
24. Goldsmith CH, Boers M, Bombardier C, and Tugwell P. Criteria for clinically important changes in outcomes: Development, scoring and evaluation of rheumatoid arthritis patient and trial profiles. J Rheumatol 1993;20:561-5.
25. Wells GA, Tugwell P, Kraag GR, Baker PR, Groh J, and Redelmeier DA. Minimum important difference between patients with rheumatoid arthritis: The patient's perspective. J Rheumatol 1993;20:557-60.
26. Deyo RA, and Centor RM. Assessing the responsiveness of functional scales to clinical change: an analogy to diagnostic test performance. J Chron Dis 1986;39:897-906.
27. Stratford PW, Binkley JM, Solomon P, Gill C, and Finch E. Assessing change over time in patients with low back pain. Phys Ther 1994;74:528-33.


Responsiveness of the Roland Morris Disability Questionnaire: consequences of using different external criteria

Chapter 3

Wietske Kuijer, Sandra Brouwer, Pieter U. Dijkstra, Ludwig N.H. Göeken, Johan W. Groothoff, Jan H.B. Geertzen

Published in: Clinical Rehabilitation, 2005; 19 (5): 488-495 Reprinted with the kind permission of Edward Arnold (Publishers) Ltd. www.hodderarnoldjournals.com


Abstract

Objective. To determine the consequences of using different external criteria on the responsiveness of the Roland Morris Disability Questionnaire (RMDQ) in patients with chronic low back pain.
Design. Questionnaire measures before and after rehabilitation treatment.
Setting. Rehabilitation centre.
Subjects. Patients with non-specific chronic low back pain, referred for treatment.
Main measures. The RMDQ was used to assess self-reported functional status. The external criteria used were: (1) global perceived effect of change in complaints; (2) global perceived effect of change in ability to take care of oneself; (3) change in rating of pain intensity; (4) smallest real difference. Standardised response means, pooled effect sizes and receiver operating curves were calculated to determine responsiveness and to enable comparison of effect sizes with the thresholds of Cohen.
Results. Standardised response means ranged from 1.33 to 3.45, pooled effect sizes ranged from 1.50 to 2.81, and areas under curves ranged from 0.76 to 1.00, depending on the external criterion used.
Conclusions. All pooled effect sizes were well above 0.80, and all other statistics were high, indicating good responsiveness of the RMDQ. However, considerable differences in responsiveness were found when different external criteria were used in the same study population. Therefore, it can be concluded that the magnitude of the responsiveness statistic depends on the external criterion used.


Responsiveness of the RMDQ

Introduction

The Roland Morris Disability Questionnaire (RMDQ) is often used as an evaluative outcome measure in patients with chronic low back pain,1-3 to assess change in self-reported functional status after treatment. Evaluative outcome measures should be reliable and responsive to be able to assess change.4-6 Both the English and the Dutch version of the RMDQ-24 show good reliability when a time interval of less than three weeks is used (Pearson's r = 0.83; intraclass correlation coefficients range from 0.79 to 0.91).7-12 Despite the good reliability, smallest real differences of 5.4 and 5.9 show that substantial variation must be taken into account when the RMDQ is used in a clinical setting.7,13 The responsiveness of the English and Dutch versions of the RMDQ has also been investigated in several studies. However, different outcomes were found for both versions. Areas under curves ranged from 0.68 to 0.84 for the English version,14-17 and from 0.68 to 0.93 for the Dutch version.1,13,18 Effect sizes ranged from 0.50 to 1.60 for the English version14,19-21 and from 0.58 to 2.02 for the Dutch version.18,22 The literature offers no clarity about responsiveness.23,24 Because there is no gold standard for measuring clinically important change, different external criteria are used to determine whether a patient has achieved a clinically important change in functional status.

Furthermore, the terminology and calculation of responsiveness statistics such as effect sizes and standardised response means may vary considerably,12 depending on the type of change that is intended to be measured:10,24 change in general, clinically important change, or the ability to detect changes in the construct being measured.24 Previous studies evaluating the responsiveness of the RMDQ used different external criteria, such as a global rating scale of the magnitude or importance of change (15-point),17,21 a global rating scale of change in low back pain (7-point),1,13 and a global rating scale of change in complaints (5-point).18 It is not known to what extent different external criteria influence responsiveness statistics. Additionally, it is not clear whether the previously found effect sizes are trivial (< .20), small (≥ .20, < .50), moderate (≥ .50, < .80) or large (≥ .80), because not all studies used the pooled effect size, which is needed for a direct comparison with these thresholds of Cohen.8 Because of the lack of clarity about the definition of responsiveness, different external criteria1,8,21,22,25 and unambiguous responsiveness statistics should be used in the same study to determine the responsiveness of an evaluative outcome instrument. The aim of this study is to determine the consequences of using different external criteria on the responsiveness of the RMDQ in patients with chronic low back pain.
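The thresholds of Cohen quoted above can be written as a small lookup. This is only an illustrative sketch: the function name is hypothetical, and the assumption that a value of exactly 0.20 counts as 'small' is made here to keep the categories from overlapping.

```python
def cohen_label(effect_size):
    """Label an effect size using the thresholds quoted in the text (absolute value)."""
    es = abs(effect_size)
    if es < 0.20:
        return "trivial"
    if es < 0.50:
        return "small"
    if es < 0.80:
        return "moderate"
    return "large"

# The last two values are the lower and upper effect sizes reported for the Dutch version.
for es in (0.15, 0.35, 0.58, 2.02):
    print(es, cohen_label(es))
```

Such a label is only meaningful when the effect size was computed with the pooled standard deviation, which is the reason the text insists on pooled effect sizes.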


Methods

Subjects

An existing clinical database was used. Data were gathered before and after rehabilitation treatment of patients with non-specific chronic low back pain at the Centre for Rehabilitation, location Beatrixoord, in Haren, the Netherlands. In total, 83 patients (44 men and 39 women) with a mean age of 38.5 years (SD=9.7) participated and filled out the RMDQ before and after treatment. The mean RMDQ score before treatment was 10.9 (SD=4.7). The mean duration of treatment was 28 weeks (SD=14.5), with one to two treatment sessions per week.

Outcome measure

The Dutch version of the RMDQ-24 was used to assess self-reported functional status before and after treatment. The RMDQ is derived from the Sickness Impact Profile, a general health questionnaire,26 and assesses, dichotomously, perceived limitations due to low back pain in 24 activities of daily living. The time frame used in this study was ‘the past few days’. The sum score is calculated by summing the ‘yes’ answers. The scale ranges from 0 (no disability) to 24 (severe disability).

External criteria

Different external criteria were used to analyse the responsiveness of the RMDQ. First, the global perceived effect of change in complaints due to chronic low back pain (7-point scale, ranging from 'completely recovered' to 'worse than ever'). Patients were classified as improved if they scored ‘completely recovered' or 'much recovered’. Second, the global perceived effect of change in ability to take care of oneself (4-point scale, ranging from 'much improved' to 'not improved'). Patients were classified as improved if they scored ‘much improved’. Third, change in rating of pain intensity. Three 10-point pain intensity scales were used before and after treatment, representing ‘pain when at worst’, ‘pain when at least’ and ‘pain right now’. Each scale was used as a separate external criterion. Additionally, the mean of the three scales served as the fourth pain intensity criterion. Patients were classified as improved if their change score was 2 units or more on the pain scale concerned.4,9,27 Finally, the smallest real difference of the RMDQ was used as an external criterion.13 Patients were classified as improved if their change score exceeded the smallest real difference of the RMDQ (i.e. if they changed 6 points or more on the RMDQ).7,13 The external criteria on pain and complaints are impairment based, and the criterion on ability to take care of oneself is participation based.

Data analyses

To calculate the association between the different external criteria, Spearman rank correlation coefficients were calculated. To calculate the association between the different external criteria and RMDQ change scores, Pearson correlation


coefficients were calculated. Missing data were excluded pairwise from the analyses. The smallest real difference criterion was not compared with the other external criteria or with the RMDQ change score, because this criterion is only a cut-off score. No correlation coefficients were calculated between the mean pain intensity score and the other pain scores, because the mean score is computed from the other pain scores and is therefore not independent of them. Means and standard deviations of the RMDQ before and after treatment, and the mean difference and the standard deviation of the difference, were calculated for the groups of patients classified as improved by the different external criteria. Standardised response means were calculated as the ratio of the mean difference of the improved group to the standard deviation of this difference. The higher the standardised response mean, the better the responsiveness. Pooled effect sizes were calculated as the ratio of the mean difference of the improved group to the pooled standard deviation of the improved group, in which $SD_{\text{pooled improved}} = \sqrt{(SD_{\text{before treatment}}^{2} + SD_{\text{after treatment}}^{2})/2}$. Effect sizes are large when exceeding 0.80.8 In addition, receiver operating curves were calculated. The receiver operating curve is a graph of ‘true positive’ (sensitivity) versus ‘false positive’ (1-specificity) rates for each of several cut-off points in score change.23,28 The area under the receiver operating curve can be interpreted as the probability of correctly discriminating between improved and non-improved patients. This area theoretically ranges from 0.5 (no accuracy in discriminating improved from non-improved patients) to 1.0 (perfect accuracy).28 The four different external criteria were used to discriminate between improved and non-improved patients.
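The interpretation of the area under the curve as a probability can be made concrete with a short Python sketch. It is not the software used in the study: it computes the area directly as the share of (improved, non-improved) patient pairs in which the improved patient has the larger change score, with ties counted as half, which is equivalent to the area under the empirical curve. The change scores and function names are hypothetical.

```python
def area_under_curve(change_improved, change_not_improved):
    """Share of (improved, non-improved) pairs ranked correctly; ties count half."""
    if not change_improved or not change_not_improved:
        raise ValueError("both groups must contain at least one patient")
    wins = 0.0
    for x in change_improved:
        for y in change_not_improved:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(change_improved) * len(change_not_improved))

def sensitivity_specificity(change_improved, change_not_improved, cutoff):
    """One point of the curve: 'change >= cutoff' is read as improved."""
    sens = sum(c >= cutoff for c in change_improved) / len(change_improved)
    spec = sum(c < cutoff for c in change_not_improved) / len(change_not_improved)
    return sens, spec

# Hypothetical RMDQ change scores (before minus after), grouped by an external criterion.
improved = [9, 7, 6, 11, 4, 8]
not_improved = [1, 3, 0, 4, 2, -1]

auc = area_under_curve(improved, not_improved)
sens, spec = sensitivity_specificity(improved, not_improved, cutoff=6)
print(f"AUC {auc:.3f}; at a cut-off of 6 points: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

Repeating `sensitivity_specificity` over every possible cut-off gives the points of the curve itself; the pairwise count avoids having to integrate them.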
Given the established smallest real difference of the RMDQ, an RMDQ score defined as 'no limitations' can vary between 0 and 5 points. Therefore, the above-mentioned analyses were performed not only in the total group of patients classified as improved according to the different external criteria, but also in improved patients with an initial RMDQ score ≥ 6. Only these patients are 'certain' to have a limitation in self-reported functional status due to chronic low back pain (initial RMDQ score > smallest real difference), and only they can show improvement according to the smallest real difference of the RMDQ. Furthermore, the number of improved patients classified by the different external criteria was calculated. Finally, the relationship between pain and self-reported functional status was investigated in a post hoc analysis: Pearson correlation coefficients were calculated between the different initial pain intensity scores and initial RMDQ scores.
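The reasoning behind the subgroup with an initial score ≥ 6 can be sketched as follows. The function and field names are hypothetical; the 6-point cut-off is the one stated in the text, and the four patients are invented for illustration.

```python
SRD_CUTOFF = 6  # points on the 0-24 RMDQ scale, as used in the text

def classify_patient(initial_score, post_score, cutoff=SRD_CUTOFF):
    """Apply the smallest-real-difference criterion to one patient's RMDQ scores."""
    for score in (initial_score, post_score):
        if not 0 <= score <= 24:
            raise ValueError("RMDQ sum scores range from 0 to 24")
    change = initial_score - post_score  # positive change = fewer limitations
    return {
        "change": change,
        "improved": change >= cutoff,
        "initial_at_least_cutoff": initial_score >= cutoff,
    }

# Hypothetical (initial, post) RMDQ sum scores for four patients.
patients = [(14, 5), (4, 1), (11, 8), (6, 0)]
for initial, post in patients:
    print(initial, post, classify_patient(initial, post))
```

A patient who starts at 4 points can improve by at most 4 points and so can never meet the 6-point criterion, whatever the treatment achieves; this is why the analyses were repeated in the subgroup.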


Results

All results are shown both for all improved patients and for improved patients with an initial RMDQ score ≥ 6, as classified by the different external criteria. All improved patients classified by the smallest real difference criterion have an initial RMDQ score ≥ 6. Therefore, only one group of patients is shown for this criterion. Spearman rank correlation coefficients between the different external criteria, and Pearson correlation coefficients between the different external criteria and the RMDQ change score, range in absolute value from 0.27 to 0.85 (table 3.1). Valid percentages of the improved patients are presented in table 3.2, as well as means and standard deviations of initial, post and change RMDQ scores, effect sizes (standardised response means and pooled effect sizes), and areas under curves with confidence intervals. Standardised response means range from 1.33 to 3.45. Pooled effect sizes range from 1.50 to 2.81. Areas under curves range from 0.76 to 1.00. Sixteen patients were classified as improved if all seven external criteria were applied. Twenty patients were improved on six criteria, 26 patients were improved on five criteria, 32 patients were improved on four criteria, 35 patients were improved on three criteria, 41 patients were improved on two criteria, 50 patients were improved on one criterion and 14 patients were not improved on a single criterion.

Table 3.1. Relationships between change in self-reported functional status and different external criteria*

                               ∆ RMDQ (a)   Complaints (b)   Taking care of   ∆ Pain      ∆ Pain
                                                             oneself (c)      least (d)   worst (e)
Complaints (b)                   -0.59
Taking care of oneself (c)       -0.51         0.78
∆ Pain mean (f)                   0.85        -0.56            -0.43
∆ Pain least (d)                  0.66        -0.41            -0.27
∆ Pain worst (e)                  0.80        -0.50            -0.38            0.76
∆ Pain now (g)                    0.80        -0.60            -0.47            0.72        0.77

* All coefficients were significant at p ≤ 0.001
(a) Change in self-reported functional status, measured by the RMDQ
(b) Global rating of change in complaints
(c) Global rating of change in taking care of oneself
(d) Pain intensity difference ‘pain when at least’
(e) Pain intensity difference ‘pain when at worst’
(f) Mean pain intensity difference
(g) Pain intensity difference ‘pain right now’

Table 3.2. Characteristics and responsiveness of the RMDQ, applying different external criteria

All improved patients

Criterion                                         Improved      Initial RMDQ   Post RMDQ    Mean ∆ RMDQ   SRM (a)   ESp (b)   AUC [95% CI] (c)
                                                  n [valid %]   mean [SD]      mean [SD]    mean [SD]
Global rating of change in complaints             39 [48.8]     11.3 [5.0]     4.0 [3.6]    7.4 [4.7]     1.56      1.68      .82 [.73 to .91]
Global rating of taking care of oneself           37 [46.8]     11.4 [5.1]     4.5 [4.5]    6.9 [5.2]     1.33      1.50      .76 [.65 to .87]
Pain intensity difference 'pain when at least'    38 [46.9]     12.7 [4.1]     4.6 [3.7]    8.1 [3.8]     2.16      2.09      .92 [.86 to .98]
Pain intensity difference 'pain when at worst'    30 [37.0]     12.2 [4.3]     4.2 [3.4]    8.0 [4.3]     1.87      2.07      .84 [.75 to .92]
Pain intensity difference 'pain right now'        36 [43.9]     13.2 [3.6]     4.6 [3.9]    8.6 [3.6]     2.42      2.29      .94 [.89 to .98]
Mean pain intensity difference                    31 [38.3]     13.3 [3.2]     4.3 [3.4]    9.0 [3.4]     2.64      2.69      .93 [.88 to .98]

Improved patients with an initial score ≥ 6

Criterion                                         Improved      Initial RMDQ   Post RMDQ    Mean ∆ RMDQ   SRM (a)   ESp (b)   AUC [95% CI] (c)
                                                  n [valid %]   mean [SD]      mean [SD]    mean [SD]
Global rating of change in complaints             35 [51.5]     12.8 [3.2]     4.6 [4.0]    8.2 [4.1]     1.98      2.26      .85 [.75 to .94]
Global rating of taking care of oneself           33 [49.3]     12.8 [3.1]     5.1 [4.5]    7.8 [4.8]     1.62      1.98      .78 [.66 to .89]
Pain intensity difference 'pain when at least'    35 [51.5]     13.5 [3.3]     4.8 [3.7]    8.7 [3.3]     2.63      2.45      .93 [.87 to 1.00]
Pain intensity difference 'pain when at worst'    27 [39.7]     13.2 [3.3]     4.4 [3.5]    8.8 [3.8]     2.34      2.60      .85 [.76 to .94]
Pain intensity difference 'pain right now'        35 [50.7]     13.5 [3.3]     4.7 [3.9]    8.7 [3.6]     2.45      2.39      .92 [.86 to .98]
Mean pain intensity difference                    31 [45.6]     13.3 [3.2]     4.3 [3.4]    9.0 [3.4]     2.64      2.69      .92 [.85 to .98]
Smallest real difference                          35 [50]       13.6 [3.2]     4.2 [3.5]    9.4 [2.7]     3.45      2.81      1.00 [1.00]

(a) Standardised response mean = mean difference improved group / SD difference improved group
(b) Pooled effect size = mean difference improved group / SDpooled improved group, in which SDpooled improved group = root mean square of the standard deviations before and after treatment of the improved group
(c) Area Under Curve [95% confidence interval]
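The two effect size definitions in the footnotes can be checked against the table with a few lines of Python. The function names are hypothetical. The inputs are the rounded values printed in table 3.2 for the global rating of change in complaints (all improved patients), so the results only approximate the published 1.56 and 1.68, which were presumably computed from unrounded data.

```python
import math

def standardised_response_mean(mean_change, sd_change):
    """Mean change of the improved group divided by the SD of that change."""
    return mean_change / sd_change

def pooled_effect_size(mean_change, sd_before, sd_after):
    """Mean change divided by the root mean square of the before and after SDs."""
    sd_pooled = math.sqrt((sd_before ** 2 + sd_after ** 2) / 2)
    return mean_change / sd_pooled

# Rounded table values: mean change 7.4 [SD 4.7], initial SD 5.0, post SD 3.6.
srm = standardised_response_mean(7.4, 4.7)
esp = pooled_effect_size(7.4, sd_before=5.0, sd_after=3.6)
print(f"SRM approx. {srm:.2f}, pooled effect size approx. {esp:.2f}")
```

The two statistics differ only in their denominator: the standardised response mean uses the spread of the change scores, while the pooled effect size uses the spread of the scores themselves, which is what makes it comparable with the thresholds of Cohen.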

Correlation coefficients between the initial pain intensity scores and the initial RMDQ score ranged from 0.43 to 0.56 (all were significant at p ≤ 0.001, table 3.3). Pearson correlation coefficients between changes on the different pain intensity scales and the RMDQ change score ranged from 0.66 to 0.85 (table 3.1).

Table 3.3. Relationships between self-reported functional status and initial pain scores (a)

                                                 Initial RMDQ score
Initial mean pain intensity                      0.56
Initial pain intensity ‘pain when at least’      0.43
Initial pain intensity ‘pain when at worst’      0.44
Initial pain intensity ‘pain right now’          0.55

(a) All correlation coefficients were significant at p ≤ 0.001

Discussion

The choice of the external criterion influences the size of the responsiveness statistic. In our study, in which different external criteria were applied in the same study population, considerable differences in effect sizes were found: they differed by up to 2.12 points. The areas under curves differed by up to 0.24. Apart from these large differences, comparison of pooled effect sizes with the thresholds of Cohen showed that all effect sizes were well above 0.80. Furthermore, all areas under curves were above 0.75. These results indicate good responsiveness of the RMDQ, independent of the external criterion used. In the present study, using the smallest real difference as an external criterion yielded the highest effect sizes and the largest area under the curve. This was expected: with this statistic as the criterion for change, almost all patients will be classified correctly, because the smallest real difference as a cut-off score for change is not a real external criterion but is based on the measurement properties of the instrument itself. With the smallest real difference as the cut-off for change, the specificity to change is by definition 95% for a normal distribution.13 In our study, however, the specificity to change was 100%, due to a skewed distribution of RMDQ change scores. This means that, in this group of patients, whether a patient underwent a clinically important change can be estimated with 100% accuracy. The next highest responsiveness statistic is found when using mean pain intensity difference as an external criterion, followed by the pain intensity scales ‘pain right now’, ‘pain when at least’ and ‘pain when at worst’. We chose to use all four pain intensity scales as separate external criteria because we wanted to compare different external criteria in this responsiveness study.
The mean pain intensity criterion was previously used in a study of the responsiveness of the 100-mm visual analogue scale, specifically aimed at patients with chronic low back pain.6 It should be noted, however, that the validity of the calculation of this mean pain intensity has not been investigated. The high responsiveness statistics found for the RMDQ when using the criteria for change in pain intensity mean that a change in self-reported pain intensity is accompanied by a considerable change in self-reported functional status, as Beurskens already suggested.22 However, initial pain scores (impairment based) and self-reported functional status measures (limitation based) are not strongly related (correlation coefficients ranged from 0.27 to 0.69).1,5,6,11,18,22,29,30 A discrepancy exists between the clinical assessment of patients with chronic low back pain and the scientific purpose of determining the responsiveness of the RMDQ in this group of patients. Judged by the smallest real difference of the RMDQ, 10-11% of the patients in our study are not 'certain' to have a limitation in self-reported functional status due to chronic low back pain (initial RMDQ score < smallest real difference), and they cannot show any improvement according to the smallest real difference of the RMDQ. However, these patients were treated for their low back complaints, and they did improve according to the external criteria global rating of complaints, global rating of taking care of oneself and self-reported change in pain intensity. This discrepancy can be explained as follows. First, patients' limitations may concern domains other than functional status in daily life, which is what the RMDQ measures. Second, an improvement on one item of the RMDQ may be so important to the patient that he or she feels really improved despite the absence of change on other items, the so-called patient priority. For these reasons, patients did not improve in self-reported functional status as measured by the RMDQ, but they may have reached their personal treatment goals. Third, the global perceived effect measures are only used after treatment. Because patients usually underestimate their pre-treatment state, their assessment of being improved after treatment may be biased,31 resulting in an overestimation of improvement on the global ratings. The pain intensity scales are used before and after treatment, but the scales ‘pain when at worst’ and ‘pain when at least’ also refer to a previous state. Because of this discrepancy between clinical assessment and research purposes, we decided to present our results for two groups: all improved patients and the improved patients with an initial RMDQ score ≥ 6. Lower responsiveness statistics are found for all improved patients than for the improved group with an initial RMDQ score ≥ 6 when using global rating of change in complaints, global rating of taking care of oneself, and change in the pain intensity scores ‘pain when at least’, ‘pain when at worst’ and ‘pain right now’ as external criteria. These differences between the two groups can be explained by baseline score variability. If baseline score variability decreases, as it does when patients with an initial score < 6 are excluded, the responsiveness of a measurement increases.32 When using the criteria of mean pain intensity or smallest real difference, exactly the same patients were classified


as ‘improved’ among all patients as in the group with an initial score ≥ 6; thus no change in baseline score variability occurs. Additionally, the ability of the RMDQ to detect changes diminishes when only small limitations exist in self-reported functional status as measured by the RMDQ.17,33 The study has some limitations. First, the sample size is somewhat small in relation to the number of statistical tests performed. However, all results indicate large differences in responsiveness. Therefore, our conclusion that the use of different external criteria leads to differences in responsiveness statistics is rather robust. Second, it might be argued that the RMDQ measures limitations in activities of daily living on an interval scale, ranging from zero limitations on the questioned activities to 24 activity limitations. However, no evidence exists that each item should be weighted to calculate a sum score. Therefore, in this study, parametric statistical techniques are used for analyses of the RMDQ, as is common practice when analysing the RMDQ. Good responsiveness is found for the RMDQ. However, considerable differences in responsiveness statistics were found when different external criteria were used in the same study population. Therefore, it can be concluded that the magnitude of responsiveness statistics depends on the external criteria used.

Acknowledgements

We thank the clinicians and the co-workers of the Centre for Rehabilitation, location Beatrixoord, in Haren, the Netherlands, for their co-operation in this study. The study was supported by grants from ‘Zorgonderzoek Nederland’ (ZonMw) and the foundation 'Beatrixoord Noord-Nederland'.


References 1. 2. 3. 4. 5.

6. 7.

8. 9. 10. 11.

12. 13. 14.

Beurskens AJ, Vet de HC, Köke AJ, Heijden van der GJ, and Knipschild PG. Measuring the functional status of patients with low back pain. Spine 1995; 20: 1017-28. Bombardier C. Outcome assessments in the evaluation of treatment of spinal disorders: summary and general recommendations. Spine 2000; 25: 3100-3. Roland M, and Fairbank J. The Roland-Morris Disability Questionnaire and the Oswestry Disability Questionnaire. Spine 2000; 25: 3115-24. Farrar JT, Young JPJr, Moreaux La L, Werth JL, and Poole MR. Clinical importance of changes in chronic pain intensity measured on an 11-point numerical pain rating scale. Pain 2001; 94: 149-58. Gronblad M, Jarvinen EAO, Ruuskanen M, Hamalainen H, and Kouri J. Relationship of subjective disability with pain intensity, pain duration, pain location, and work-related factors in nonoperated patients with chronic low back pain. Clin J Pain 1996; 12: 194-200. Hazard RG, Haugh LD, Green PA, and Jones PL. Chronic low back pain. The relationship between patient satisfaction and pain, impairment, and disability outcomes. Spine 1994; 19: 881-7. Brouwer S, Kuijer W, Dijkstra PU, Göeken LNH, Groothoff JW, and Geertzen JHB. Reliability and stability of the Roland Morris Disability Questionnaire: Intra Class Correlation and limits of agreement. Disabil Rehabil 2004; 26: 162-5. Cohen J. Statistical power analysis for the behavioral sciences, 1-27. New York: Academic Press, 1988. Farrar JT, Portenoy RK, Berlin JA, Kinman JL, and Strom BL. Defining the clinically important difference in pain outcome measures. Pain 2000; 88: 287-94. Liang MH. Evaluating measurement responsiveness. J Rheumatol 1995; 22: 1191-2. Takeyachi Y, Konno S, Otani K, Yamauchi K, Takahashi I, Suzukamo Y, and Kikuchi S. Correlation of low back pain with functional status, general health perception, social participation, subjective happiness, and patient satisfaction. Spine 2003; 28: 1461-7. Taylor SJ, Taylor AE, Foy MA, and Fogg AJB. 
Responsiveness of common outcome measures for patients with low back pain. Spine 1999; 24: 1805-12. Vet de HC, Bouter LM, and Bezemer PM. reproducibility and responsiveness of evaluative outcome measures. Int J Technol Assess Health Care 2001; 17: 479-87. Davidson M, and Keating JL. A comparison of five low back disability questionnaires: reliability and responsiveness. Phys Ther 2002; 82: 8-24.

31

Chapter 3

15. 16. 17. 18.

19. 20. 21. 22. 23. 24.

25.

26. 27.

32

Riddle DL, Stratford PW, and Binkley JM. Sensitivity to change of the Roland-Morris Back Pain Questionnaire: part 2. Phys Ther 1998; 78: 1197207. Stratford PW, Binkley J, Solomon P, Gill C, and Finch E. Assessing change over time in patients with low back pain. Phys Ther 1994; 74: 528-33. Stratford PW, Binkley JM, Riddle DL, and Guyatt GH. Sensitivity to change of the Roland-Morris Back Pain Questionnaire: part 1. Phys Ther 1998; 78: 1186-96. Gommans IHB, Koes BW, and van Tulder MW. Validiteit en responsiviteit Nederlandstalige Roland Disability Questionnaire. Vragenlijst naar functionele status bij patiënten met lage rugpijn. (Validity and Responsiveness of the Dutch version of the Roland Disability Questionnaire. A functional status questionnaire for patients with low back pain.) Ned Tijdschr Fys 1997; 107: 28-33. Kopec JA, Esdaile JM, Abrahamowicz M, Abenhaim L, Wood-Dauphinee S, Lamping DL, and Williams JI. The Quebec Back Pain Disability Scale. Measurement properties. Spine 1995; 20: 341-52. Patrick DL, Deyo R, Atlas SJ, Singer DE, Chapin A, and Keller RB. Assessing health-related quality of life in patients with sciatica. Spine 1995; 20: 1899-909. Stratford PW, Binkley J, and Riddle DL. Health Status Measures: Strategies and analytic methods for assessing change scores. Phys Ther 1996; 76: 1109-23. Beurskens AJ, de Vet HC, and Köke AJ. Responsiveness of functional status in low back pain. A comparison of different instruments. Pain 1996; 65: 71-6. Deyo R, and Centor R. Assessing the responsiveness of functional scales to clinical change: An analogy to diagnostic test performance. J Chron Dis 1986; 39: 897-906. Terwee CB, Dekker FW, Wiersinga WM, Prummel MF, and Bossuyt PMM. On assessing responsiveness of health-related quality of life instruments: Guidelines for instrument evaluation. Qual Life Res 2003; 12: 349-62. Beaton DE, Hogg-Johnson S, and Bombardier C. 
Evaluating changes in health status: reliability and responsiveness if five generic health status measures in workers with musculoskeletal disorders. J Clin Epidemiol 1997; 50: 79-93. Bergner M, Bobbitt RA, Carter WB, and Gilson BS. Sickness Impact Profile: Development and final revision of health status measure. Med Care 1981; 19: 787-805. Hagg O, Fritzell P, and Nordwall A. The clinical importance of changes in outcome scores after treatment for chronic low back pain. Eur Spine J 2003; 12: 12-20.


28. Sackett DL, Haynes RB, Guyatt GH, and Tugwell P. Clinical Epidemiology, a basic science for clinical medicine. Boston/Toronto/London: Little, Brown and Company, 1991.
29. Gronblad M, Hupli M, Wennerstrand P, Jarvinen E, Lukinmaa A, Kouri JP, and Karaharju EO. Intercorrelation and test-retest reliability of the Pain Disability Index (PDI) and the Oswestry Disability Questionnaire (ODQ) and their correlation with pain intensity in low back pain patients. Clin J Pain 1993; 9: 189-95.
30. Koho P, Aho S, Watson P, and Hurri H. Assessment of chronic pain behaviour: reliability of the method and its relationship with perceived disability, physical impairment and function. J Rehabil Med 2001; 33: 128
31. Linton SJ, and Melin L. The accuracy of remembering chronic pain. Pain 1982; 13: 281-5.
32. Guyatt G, Walter S, and Norman G. Measuring change over time: Assessing the usefulness of evaluative instruments. J Chron Dis 1987; 40: 171-8.
33. Stratford PW, Binkley J, Solomon P, Finch E, Gill C, and Moreland J. Defining the minimum level of detectable change for the Roland-Morris questionnaire. Phys Ther 1996; 76: 359-65.


Measuring physical performance via self-report in healthy young adults

Chapter 4

Wietske Kuijer, Erwin H.J. Gerrits, Michiel F. Reneman

Published in: Journal of Occupational Rehabilitation, 2004; 14 (1): 77-87 Reprinted with the kind permission of SpringerLink www.springerlink.com


Abstract
Discrepancies exist in the literature as to what extent self-reporting can replace performance-based testing. To answer this question, self-reports and performance tests should measure identical constructs, which previous studies did not do. The objective of our study was to investigate to what extent self-reporting can replace performance-based testing. Seventy-two healthy subjects were tested. The self-reports and the performance tests covered the same components, to enable a comparison of their results. Three different self-reports and a performance test were used to measure physical performance. Additionally, rating of perceived exertion was measured after the subjects lifted a reference weight, to predict maximal lifting performance. The control variables were age, gender, educational level, subject's participation in fitness, availability of reference data, motivation, attitude, general self-efficacy and mood. Results showed that all lifting tasks could be predicted, though not solely via self-reporting. A prediction of the performance test results with a margin of error of ± 5 kg could be made for at least 79% of the subjects, via gender, self-reporting and subject's participation in fitness. Self-reporting may not replace performance testing, although performance test results can be predicted with a margin of error of ± 5 kg for at least 79% of the healthy subjects.


Measuring physical performance via self-report

Introduction
Physical performance can be estimated via different kinds of instruments, such as self-reports, proxy-reports, performance testing, clinical observation, or a combination of these.1 Estimation of physical performance is often used in clinical practice to determine someone's ability to work. Physicians appear to rely strongly on a patient's self-report. Self-report is one's verbal or written estimation of one's capacity to perform activities. Performance-based tests, usually referred to as Functional Capacity Evaluations (FCEs), measure performance of work-related activities.2 FCEs are time-consuming and expensive, while self-reports are less expensive and more practical to use. The question is to what extent physical performance can be estimated via self-reporting. Some suggest that self-reporting can be used to replace performance-based testing,3 at least for screening healthy subjects.1 Moderate to strong correlations between measured physical performance and self-report of function have been reported.4,5 Others support the idea that physical performance cannot be measured via self-reporting6-9 and that direct observation, as opposed to self-reporting, is the better indicator of patient behavior.10 This latter premise may be questioned, however. The studies mentioned did not compare self-reports and performance-based tests alone; the instruments they compared also differed in construct, context and item scaling.9,11 To answer the question to what extent physical performance can be estimated via self-reporting, self-reports and performance tests should measure identical constructs. The validity of the questionnaires that were used previously is expected to be weak, because such questionnaires do not measure the construct of physical performance.
To estimate physical performance, the behaviour to be predicted should be described as accurately as possible.12 Subjects must understand what kind of behaviour is required and what conditions the questionnaires apply to.13 Accessibility of information in memory and contextual cues appear to be of key importance.14 Motivational and cognitive factors can confound the assessment of 'true capacity'.15 Additionally, motivation and emotion are known to influence the perception during physical performance as well as the performance itself.16 Furthermore, a relationship exists between self-efficacy and the degree of effort a subject expends on a test.13,17-19 Attitude and self-efficacy are important predictors of general behaviour in healthy subjects, and therefore also of performance during an FCE and of self-reported performance.19,20 Lackner et al.19 also showed that functional self-efficacy was significantly related to behavioural measures of physical function. Efficacy expectations alone will not, however, produce the desired performance if the component abilities are missing.13 Performance testing preceding self-reporting makes the subjects more aware of


their true physical capabilities, but further research is needed into ways to combine self-reporting and performance measures.21 In this study, different self-reports were compared with performance test results, that is, with the results of different lifting tasks. Differences in outcome between performance test results and self-reports were described and explained. The goal of this study was to determine to what extent self-reporting can be used to replace performance-based testing. Information from performance-based tests is often used to determine one's ability to work.22 Answering this question therefore contributes to the development of a cost-effective method to determine ability to work; it does not, however, directly assess the validity of such a method.

Methods
Subjects
A convenience sample of 72 healthy subjects (36 male and 36 female) participated in this study. All but one were students. Their mean age was 22 years (ranging from 19 to 28 years). Before participating, all subjects declared in an informed consent that they were healthy and agreed to participate voluntarily and at their own risk.
Procedures
Firstly, personal data (age, gender and educational level) were obtained via a self-constructed questionnaire, as were data on lifting experience and sporting activities. One's attitude towards self-reporting and performance testing was measured with a self-constructed 10 cm Visual Analogue Scale (VAS). Secondly, self-efficacy was measured via the ALCOS, short form (SF) (ALgemene COmpetentie Schaal, a Dutch version of the General self-efficacy scale).23 Thirdly, subjects were asked to fill out three different self-reports to measure self-estimated physical performance, that is, the maximal amount of weight they can lift. Fourthly, the subjects performed four different lifting tasks, with six-minute rests between tasks. After the self-report measures and after each lifting task, the subject's motivation was measured with a self-constructed 10 cm VAS, and mood was measured using a Profile Of Mood States questionnaire (POMS-SF, translated version into Dutch).24 The total test duration was approximately 2 h 15 min.
Measures
Self-reports
Three questionnaires were constructed in order to estimate physical performance. Each type of self-report added progressively more information. Self-report 1 consisted of open-ended questions about the maximum amount of weight the subjects could lift; self-report 2 consisted of closed questions using everyday examples as a reference; self-report 3 consisted of asking subjects to lift a reference weight, and


then asking what percentage of his or her maximum performance the weight represented. These latter execution-related questions resembled the performance test most closely. Each self-report covered the same components as the lifting tasks, in order to equalise the constructs of the questionnaires and the performances; in each self-report, the lifting height (from waist to shoulder and back, or from waist to overhead and back) and the number of repetitions were stated, depending on the relevant lifting task. Additionally, illustrations of the relevant lifting task accompanied each question. Perceived exertion was measured after lifting the reference weight, using a Rating of Perceived Exertion scale (RPE-scale).16 An example of an open-ended question was: 'What is the maximum weight you can lift one time from your waist to overhead and back?' An example of a closed question was: 'Can you lift one kilo (pack of sugar) from your waist to overhead and back?' The questions in this self-report ranged from 1 to 40 kilos for women and from 1 to 60 kilos for men, with 5-kg increments at each question. For the execution-related questions, the reference weight the subjects had to lift was set at an estimated 70% of the average maximum lifting ability expected in their age group (based on results of a pilot study). Seventy percent of maximal effort was used because the highest test-retest correlation coefficients on the RPE-scale were obtained at 70% of maximal effort.16
Performance tests
Subjects lifted a starting weight from waist to shoulder or waist to overhead and back for a set of one or five repetitions. Heart rate was measured after each set, and weight was added until a maximum was reached, following the procedures used in different functional capacity evaluation protocols. Maximum lifting performance was considered reached when a strength maximum or the maximal acceptable heart rate ((220 - age) * 85%) was reached, or when safety could no longer be guaranteed.
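The termination criteria described above can be sketched in a few lines. This is a minimal illustration only: the function names and the choice to treat the heart-rate ceiling as reached at "greater than or equal" are assumptions, not part of the published protocol.

```python
def max_acceptable_heart_rate(age_years: float, fraction: float = 0.85) -> float:
    """Heart-rate ceiling quoted in the text: (220 - age) * 85%."""
    return (220.0 - age_years) * fraction


def lifting_test_should_stop(heart_rate_bpm: float, age_years: float,
                             strength_maximum_reached: bool = False,
                             safety_compromised: bool = False) -> bool:
    """True as soon as any of the three termination criteria applies.

    Hypothetical helper: the three flags mirror the criteria named in the
    text (strength maximum, heart-rate ceiling, safety).
    """
    return (strength_maximum_reached
            or safety_compromised
            or heart_rate_bpm >= max_acceptable_heart_rate(age_years))


# For the sample's mean age of 22 years the ceiling is roughly 168 beats per minute.
ceiling_at_mean_age = max_acceptable_heart_rate(22)
```

With these assumptions, a 22-year-old subject measured at 170 beats per minute after a set would not be given a heavier weight.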
Subjects were also allowed to discontinue the test themselves. In this case, no safe maximal performance could be determined. The lifting tasks were derived from two FCEs. Task one, lifting from waist to shoulder and back, 1 repetition, is the Upper Lifting Strength test from the Ergo-Kit FCE (a). Task two, lifting from waist to shoulder and back, 5 repetitions, was derived from the previous lifting task. Task three, lifting from waist to overhead and back, 5 repetitions, was modified from the Isernhagen Work System FCE (IWS FCE) (b).

(a) Ergo-Kit Functional Capacity Evaluation, Ergo Control B.V., Buurserstraat 214, 7544 RG Enschede, The Netherlands
(b) Isernhagen Work System Functional Capacity Evaluation, 1015 E. Superior Street, Duluth, MN 55802, USA


Task four, lifting from waist to overhead and back, 1 repetition, was derived from the previous lifting task. A lab situation was established with the materials needed for the lifting tasks: a measurement frame with two adjustable shelves (2.5 cm increments), a stopwatch, a heart rate monitor, a plastic box and weights. For lifting from waist to shoulder and back, 1 repetition, unpublished data showed good test-retest reliability for a workers' compensation population.25 Because maximum performance was required in this study, the Ergo-Kit termination criteria were equalised to the IWS termination criteria. Lifting from waist to overhead and back, 5 repetitions, has demonstrated good test-retest reliability for healthy subjects (unpublished data)26 and for lower back pain patients.27,28 Good intra- and inter-rater reliability was established in healthy subjects.2 No reliability data were available for the other lifting tasks.
Additional variables
Self-efficacy was measured with the ALCOS-SF,23 measuring the subjects' expectations of their capacities in general. This questionnaire consists of 17 questions with response possibilities on a 5-point Likert scale. The sum score ranges from 100 to 500. The reliability and construct validity of the scale are satisfactory.29 Motivation and attitude were measured with a self-constructed 10 cm VAS. The scales range from 'not motivated' to 'very motivated' and from 'very negative' to 'very positive', respectively. Mood was measured using the Profile of Mood States questionnaire (POMS-SF, translated version into Dutch).24 It measures depression, anger, fatigue, vigour and tension on a 5-point Likert scale. The reliability of the POMS-SF is satisfactory,30 and the POMS-SF is an excellent alternative to the original POMS.31 Little validity research is available for the POMS-SF.
Data analysis
Strong over- and underestimations, defined as estimations of the subject's test results that lay more than three standard deviations (SDs) below or above the mean performance test result, were removed from the analyses list-wise. This resulted in removal of a maximum of 7 estimations per test, depending on the lifting task and/or self-report. Specifically, lifting from waist to shoulder, 1 repetition, n=7; lifting from waist to shoulder, 5 repetitions, n=3; lifting from waist to overhead, 5 repetitions, n=5; lifting from waist to overhead, 1 repetition, n=6. Descriptive statistics were calculated for performance test results and self-reports and for absolute differences between performance test results and the matched self-report. The percentages of maximal lifting ability for the execution-related questions were calculated. Spearman's rank correlation coefficients were calculated between performance test results and self-reports. Multiple linear regression analyses were used to predict performance test results via self-reporting, controlling for age, gender, educational level, attitude, general self-efficacy, availability of reference data, subject's participation in fitness, motivation, mood and rating of perceived


exertion. Predictors found significant in these analyses were entered in a subsequent multiple linear regression analysis to establish a model for the prediction of performance test results. In this second regression analysis, not all of the above-mentioned subjects were removed, because removal was list-wise and therefore depended on the self-reports entered in the model; additionally, one or two subjects were excluded because of missing data in the variables controlled for. Percentages of predictions of performance test results that fell within a margin of error of ± 5 kg were described relative to all 72 subjects, using the unstandardized residuals of the (second) regression analysis. All analyses were performed using the Statistical Package for Social Sciences (SPSS 10.1 for Windows).
Interpretation of data
An alpha of .05 was used to determine statistical significance. Correlation coefficients must be higher than 0.75 to be relevant in a clinical situation (criterion for concurrent validity).32 Criteria for the linear regression were probability of F to enter ≤ .05, and probability of F to remove ≥ .10. Within the regression analyses, predictors were deemed significant if p ≤ .05. To decide whether self-reporting can replace performance-based testing, the prediction of an individual test result should fall within a margin of error of ± 5 kg of the actual individual test result; that is, the unstandardized residual should not exceed ± 5 kg. This is based on a criterion of the Dutch government, which uses 5 kg increments in its assessment methods to indicate clinically important differences.
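The two decision rules in this section, the three-SD filter for strong over- and underestimations and the ± 5 kg criterion on the unstandardized residual, can be sketched as follows. The function names, the use of the sample standard deviation, and the treatment of a residual of exactly 5 kg as acceptable are assumptions made for illustration; the original analyses were run in SPSS.

```python
from statistics import mean, stdev


def is_strong_misestimate(estimate_kg, test_results_kg, n_sd=3.0):
    """Flag a self-report lying more than n_sd SDs from the mean test result.

    Assumption: the sample SD (n - 1 denominator) of the test results is used.
    """
    centre = mean(test_results_kg)
    spread = stdev(test_results_kg)
    return abs(estimate_kg - centre) > n_sd * spread


def share_within_margin(actual_kg, predicted_kg, n_total, margin_kg=5.0):
    """Count residuals (actual - predicted) within +/- margin_kg.

    Returns (count, percentage of n_total); n_total is passed explicitly so
    the percentage can be expressed relative to all subjects.
    """
    residuals = [a - p for a, p in zip(actual_kg, predicted_kg)]
    within = sum(1 for r in residuals if abs(r) <= margin_kg)
    return within, 100.0 * within / n_total


# Hypothetical values, for illustration only (not study data).
tests = [20.0, 25.0, 30.0, 35.0, 40.0]
flagged = is_strong_misestimate(80.0, tests)
count, pct = share_within_margin([30.0, 25.0, 40.0], [28.0, 31.0, 36.0], n_total=4)
```

In the hypothetical example, one of the three residuals (-6 kg) exceeds the margin, so two predictions count as acceptable.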

Results
Means, SDs and minimal and maximal values of performance test results and self-reports are presented in table 4.1, for each of the lifting tasks. Means, standard deviations and ranges of differences between performance test results and self-reports are presented in table 4.2. These are absolute differences, and therefore do not have to be equal to the differences between the means presented in table 4.1. The first cell of table 4.2 represents the mean absolute difference between the open-ended questions and lifting from waist to shoulder and back, 1 repetition, which is 10 kg with a standard deviation of 6.6 kg and a range of 26 kg. The reference weight, intended to be 70% of maximal lifting ability (which the subjects had to lift for the execution-related questions), turned out to represent between 40 and 135% of maximal lifting ability. The average percentage of maximal lifting ability for the different lifting tasks ranged between 60 and 80%.
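A small worked example may help show why a mean absolute difference need not equal the difference between two means: over- and underestimations cancel in the latter but not in the former. The numbers below are hypothetical and chosen only to make the effect visible.

```python
from statistics import mean


def difference_of_means(test_results_kg, self_reports_kg):
    """Mean test result minus mean self-report (errors may cancel out)."""
    return mean(test_results_kg) - mean(self_reports_kg)


def mean_absolute_difference(test_results_kg, self_reports_kg):
    """Mean of per-subject |test result - self-report| (errors cannot cancel)."""
    return mean([abs(t - s) for t, s in zip(test_results_kg, self_reports_kg)])


# Hypothetical subjects: one underestimates by 10 kg, one overestimates by
# 10 kg, and one is exactly right.
test_results = [30.0, 30.0, 30.0]
self_reports = [20.0, 40.0, 30.0]

group_level = difference_of_means(test_results, self_reports)           # 0 kg
individual_level = mean_absolute_difference(test_results, self_reports)  # about 6.7 kg
```

At group level the self-reports look perfect, while the individual estimates are off by almost 7 kg on average.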


Table 4.1. Descriptive statistics of test results and self-reports in kilograms

                                                  Open-ended questions    Closed questions        Execution-related questions    Performance test result
Lifting task                                      Mean   SD     Min-Max   Mean   SD     Min-Max   Mean   SD     Min-Max          Mean   SD    Min-Max
Lifting from waist to shoulder, one repetition    23.6   12.6   5-60      31.2   10.5   10-55     37.0   12.4   20-83            30.3   7.7   20-50
Lifting from waist to shoulder, five repetitions  18.5   12.5   3-60      17.5   10.9   1-50      27.2   9.4    15-57            24.5   6.1   12-42
Lifting from waist to overhead, five repetitions  15.0   11.1   2-50      12.8   8.4    1-40      22.4   7.1    13-44            19.2   5.3   8-30
Lifting from waist to overhead, one repetition    19.2   11.4   3-50      26.1   10.5   5-55      30.0   8.7    16-47            24.8   6.7   14-38

Table 4.2. Absolute differences between performance test results and self-reports in kilograms

                                                  Open-ended questions   Closed questions      Execution-related questions
Lifting task                                      Mean   SD    Range     Mean   SD    Range    Mean   SD    Range
Lifting from waist to shoulder, one repetition    10.0   6.6   26.0      6.2    4.5   20.0     7.1    8.3   32.3
Lifting from waist to shoulder, five repetitions  10.0   5.4   22.0      9.7    6.3   22.5     6.9    6.9   27.5
Lifting from waist to overhead, five repetitions  8.7    5.6   25.0      7.9    5.1   21.0     6.9    8.7   60.9
Lifting from waist to overhead, one repetition    9.2    5.4   22.0      6.2    5.4   30.0     5.9    7.9   49.0


Spearman's rank correlation coefficients were calculated between performance test results and self-reports. All correlation coefficients were significant at the .001 level. Correlation coefficients are presented in table 4.3.

Table 4.3. Correlation coefficients (r) between test results and self-reports (a)

Lifting task                                      Open-ended questions   Closed questions   Execution-related questions
Lifting from waist to shoulder, one repetition    .55                    .69                .43
Lifting from waist to shoulder, five repetitions  .50                    .55                .48
Lifting from waist to overhead, five repetitions  .55                    .57                .51
Lifting from waist to overhead, one repetition    .56                    .72                .52
(a) All correlations were significant (p ≤ .001)

Multiple regression analyses showed that age, educational level, availability of reference data, motivation, attitude, general self-efficacy, mood, open-ended questions and execution-related questions did not significantly contribute to the prediction of performance test results (all p-values were ≥ 0.05, data not shown). Subsequent regression analyses showed that lifting from waist to shoulder and back, one repetition, was predicted by the closed questions and gender [F(2,65)=104.2, p ≤ .001]; lifting from waist to shoulder and back, 5 repetitions, was predicted by gender [F(1,70)=67.5, p ≤ .001]; lifting from waist to overhead and back, 5 repetitions, was predicted by gender and subject's participation in fitness [F(2,68)=46.7, p ≤ .001]; and lifting from waist to overhead and back, 1 repetition, was predicted by the closed questions, gender, and subject's participation in fitness [F(3,65)=74.8, p ≤ .001]. These variables accounted for 76, 48, 57 and 77% of the adjusted variance, respectively. Results of this (second) regression analysis (variables entered which significantly contributed to the prediction of performance test results) are presented in table 4.4.
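For readers who want to reproduce the kind of coefficient reported in table 4.3, a minimal sketch of Spearman's rank correlation is given here: average ranks are assigned (ties share the mean of their positions) and Pearson's correlation is computed on the ranks. The helper names are hypothetical, and the original coefficients were computed in SPSS, not with this code. The last lines check the tabulated values against the 0.75 criterion for clinical relevance.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Coefficients copied from table 4.3, compared with the 0.75 criterion.
TABLE_4_3 = [.55, .69, .43, .50, .55, .48, .55, .57, .51, .56, .72, .52]
any_clinically_relevant = any(r > 0.75 for r in TABLE_4_3)
```

None of the twelve tabulated coefficients exceeds 0.75; the highest is .72, for the closed questions and lifting from waist to overhead, one repetition.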


Table 4.4. Multiple regression analyses predicting physical performance

Performance test and significant predictors        β         df   F value for model   Adjusted R2 for equation
Lifting from waist to shoulder, 1 repetition                 65   104.2**             .76
  Constant                                         18.46**
  Gender                                           8.98**
  Closed questions                                 0.23**
Lifting from waist to shoulder, 5 repetitions                70   67.5**              .48
  Constant                                         20.21**
  Gender                                           8.54**
Lifting from waist to overhead, 5 repetitions                68   46.7**              .57
  Constant                                         14.78**
  Gender                                           7.67**
  Subject's participation in fitness               2.45*
Lifting from waist to overhead, 1 repetition                 65   74.8**              .77
  Constant                                         14.92**
  Gender                                           8.34**
  Closed questions                                 0.20**
  Subject's participation in fitness               2.07*
Note: variables entered that significantly contributed to the prediction of performance test results; *p ≤ 0.05; **p ≤ 0.001
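To show how the coefficients in table 4.4 translate into an individual prediction, the sketch below evaluates the four equations. Several points are assumptions rather than statements from the text: that the tabulated coefficients are unstandardized (in kg), that gender and fitness participation are coded 0/1 (with 1 taken here to mean male and participating, respectively), and that the closed-question score is entered in kilograms. All names are hypothetical.

```python
# Coefficients copied from table 4.4; a predictor absent from a model is 0.
TABLE_4_4 = {
    "waist to shoulder, 1 repetition":
        {"constant": 18.46, "gender": 8.98, "closed": 0.23, "fitness": 0.0},
    "waist to shoulder, 5 repetitions":
        {"constant": 20.21, "gender": 8.54, "closed": 0.0, "fitness": 0.0},
    "waist to overhead, 5 repetitions":
        {"constant": 14.78, "gender": 7.67, "closed": 0.0, "fitness": 2.45},
    "waist to overhead, 1 repetition":
        {"constant": 14.92, "gender": 8.34, "closed": 0.20, "fitness": 2.07},
}


def predict_lift_kg(task, gender, closed_answer_kg=0.0, fitness=0):
    """Predicted maximum lift in kg (assumed 0/1 coding for gender, fitness)."""
    c = TABLE_4_4[task]
    return (c["constant"]
            + c["gender"] * gender
            + c["closed"] * closed_answer_kg
            + c["fitness"] * fitness)


def within_margin(actual_kg, predicted_kg, margin_kg=5.0):
    """True if the residual does not exceed the +/- 5 kg criterion."""
    return abs(actual_kg - predicted_kg) <= margin_kg


# Hypothetical subject: gender coded 1, closed-question answer of 30 kg.
predicted = predict_lift_kg("waist to shoulder, 1 repetition",
                            gender=1, closed_answer_kg=30.0)
```

Under these assumptions the hypothetical subject's predicted lift is a little over 34 kg, so a measured result of 30 kg would fall within the ± 5 kg margin.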

For 81% of the subjects the individual test result was predictable for lifting from waist to shoulder and back, one repetition. For 79% of the subjects the individual test result was predictable for lifting from waist to shoulder and back, 5 repetitions. For 80% of the subjects the individual test result was predictable for lifting from waist to overhead and back, 5 repetitions. For 84% of the subjects the individual test result was predictable for lifting from waist to overhead and back, 1 repetition. All predictions included a margin of error of ± 5 kg (table 4.5).


Table 4.5. Prediction of performance test results

                                                          With a margin of error of ± 5 kg   Exceeding ± 5 kg
Lifting task                                              n [%]                              n [%]
Lifting from waist to shoulder, one repetition (n=68)     59 [81.9]                          9 [12.5]
Lifting from waist to shoulder, five repetitions (n=72)   57 [79.2]                          15 [20.8]
Lifting from waist to overhead, five repetitions (n=71)   58 [80.6]                          13 [18.1]
Lifting from waist to overhead, one repetition (n=69)     61 [84.7]                          8 [11.1]
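The percentages in table 4.5 appear to be expressed relative to all 72 subjects rather than to the n of each row, which is why a row with n below 72 does not add up to 100%. The following check, with hypothetical names, recomputes the percentages from the tabulated counts under that reading.

```python
TOTAL_SUBJECTS = 72

# (within +/- 5 kg, exceeding +/- 5 kg, n in the model), copied from table 4.5.
TABLE_4_5 = {
    "waist to shoulder, one repetition": (59, 9, 68),
    "waist to shoulder, five repetitions": (57, 15, 72),
    "waist to overhead, five repetitions": (58, 13, 71),
    "waist to overhead, one repetition": (61, 8, 69),
}


def percent_of_total(count, total=TOTAL_SUBJECTS):
    """Percentage of all subjects, rounded to one decimal as in the table."""
    return round(100.0 * count / total, 1)


recomputed = {
    task: (percent_of_total(within), percent_of_total(exceeding))
    for task, (within, exceeding, _n) in TABLE_4_5.items()
}

# Within and exceeding counts add up to each row's n.
counts_consistent = all(w + e == n for w, e, n in TABLE_4_5.values())
```

Dividing by the row's own n instead would give, for example, 59/68 = 86.8% for the first task, which does not match the tabulated 81.9%.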

Discussion
The goal of this study was to determine to what extent self-reporting can be used to replace performance-based testing. Results showed that correlation coefficients between self-reports and performance tests were too low to be relevant in a clinical situation. Additionally, results from the linear regression showed that all lifting tasks could be predicted, though not solely via self-reporting. A prediction of the performance test result with a margin of error of ± 5 kg could be made for at least 79% of the subjects. In conclusion: self-reporting may not replace performance testing, although performance test results can be predicted with a margin of error of ± 5 kg for at least 79% of the healthy subjects, via gender, self-reporting and/or subject's participation in fitness. Although some previous literature showed that physical performance can be assessed by Rating of Perceived Exertion (RPE)16 and significant correlations were found between RPE and the amount of weight patients were able to lift (correlation coefficients unknown),33 other literature confirmed our findings. Perceptual differences were found when different treadmill protocols were used.34 Performance of a postural tolerance test was only weakly associated with perceived exertion for an elevated work test (r= .23) and for a forward bending test (r= .23).7 In conclusion, RPE scales should not be used to predict maximal performance. An explanation of the significant contribution of subject's participation in fitness to the prediction of lifting from waist to overhead and back, 1 and 5 repetitions, and not to the prediction of the other tests, could be that this specific lifting task of the Isernhagen Work Systems FCE is more comparable with performing fitness activities than are the lifting tasks derived from the Ergo-Kit FCE. Subjects performing the lifting task from waist to overhead and back benefit from


participating in fitness. Therefore, fitness participation can significantly contribute to the prediction of lifting from waist to overhead and back. Previous studies showed that functional self-efficacy was strongly related to performance testing, suggesting that self-efficacy would contribute significantly to the prediction of performance test results. In our study, self-efficacy did not significantly contribute to the prediction of performance test results. It should be mentioned that in our study general self-efficacy was measured, instead of functional self-efficacy. Additionally, in our study healthy subjects were tested instead of patients, and little inter-individual variability in self-efficacy scores was found. This may explain the absence of a significant contribution. In this study, reliable conclusions for groups could be drawn because 72 healthy subjects were studied. The self-reports were constructed in steps, each adding more information. It was expected that the execution-related questions would provide the best information for the subject and would consequently enable the subject to make the most accurate estimation of physical performance. This was not, however, the self-report that provided the best information on lifting ability. In the lifting tasks, asking closed questions was the best self-report to estimate physical performance. Healthy subjects were tested; they varied little in age, motivation, attitude, mood and self-efficacy. This was not a representative group of subjects, because performance testing is usually used for measuring the performance of work-related activities in patients. When estimating physical performance of patients instead of healthy subjects, and to determine whether the results found in this study can be generalised to a group of patients, other control variables should be included.
These variables may be depression,15,17,35 kinesiophobia,15,17 pain,36 pain behavior,3 pain self-efficacy,1,15 functional self-efficacy,19 functional status,36 outcome expectations,1,15 harm-belief,3 and disability compensation.15,37 These variables may influence performance during an FCE, as well as self-reported performance. After performing this study, it is possible to answer the question to what extent physical performance can be estimated via self-reporting in healthy subjects, because self-reports and performance tests measured identical constructs. Self-reporting may not replace the performance tests lifting from waist to shoulder and waist to overhead and back, for a set of one or five repetitions. However, results of performance tests can be predicted with a margin of error of ± 5 kg for at least 79% of the healthy subjects, via gender, self-reporting and/or subject's participation in fitness. It should be mentioned that the margin of error of ± 5 kg in predicting performance test results could result in an over- or underestimation of 26% in this group of healthy subjects. Individuals can exceed this percentage of 26%. It is necessary to determine whether a margin of error of ± 5 kg is acceptable when using self-reporting in a clinical environment. Besides, it is unknown to what extent individual subjects vary in test performances and in estimations. If the


natural variation of performance test results exceeds 5 kg, the 5 kg criterion will become inappropriate, because predictions that are off by more than 5 kg would then not by definition be wrong estimations of physical performance. Furthermore, the strong over- and underestimations removed from the analyses were defined as estimations of the subject's test results more than three standard deviations below or above the mean performance test result. This definition was based on the idea that subjects with these estimations can be filtered out immediately, and should be tested anyhow. If this criterion were tightened to 2 standard deviations from the mean performance test result, the prediction would probably be more accurate because of the smaller deviations. To be able to answer the question to what extent self-reporting can be used when estimating physical performance in individual patients (i.e. in disability determination), further research is needed to determine natural variation in performance testing and estimating, and to combine self-reporting and performance-based testing. Although information gathered from performance testing is often used to determine ability to work, this study has not investigated to what extent this way of self-reporting can assess the ability to work. More research is needed to answer the question whether or not estimating physical performance can predict ability to work in patients, as it is used for this goal.

Acknowledgements
The authors would like to thank J. Plat (MSc) for placing testing materials at our disposal, and for providing the courses to become a registered test leader (WK). Additionally, we wish to thank S. IJmker (MSc), who helped collect data.


References
1. Wunderlich GS. Measuring Functional Capacity and Work Requirements, summary of a workshop. Washington D.C.: National Academy Press, 1999.
2. Reneman MF, Jaegers SMHJ, Westmaas M, and Göeken LNH. The reliability of determining effort level of lifting and carrying in a functional capacity evaluation. Work 2002; 18 (1): 23-8.
3. Jensen MP, Romano JM, Turner JA, Good AB, and Wald LH. Patient beliefs predict patient functioning: further support for a cognitive-behavioral model of chronic pain. Pain 1999; 81: 95-104.
4. Simmonds MJ, Olson SL, Jones S, Hussein T, Lee CE, Novy D, and Radwan H. Psychometric Characteristics and Clinical Usefulness of Physical Performance Tests in Patients With Low Back Pain. Spine 1998; 23 (22): 2412-21.
5. Simonsick EM, Kasper JD, Guralnik JM, Bandeen-Roche K, Ferrucci L, Hirsch R, Leveille S, Rantanen T, and Fried LP. Severity of Upper and Lower Extremity Functional Limitation: Scale Development and Validation With Self-Report and Performance-Based Measures of Physical Function. J Gerontol B Psychol Sci Soc Sci 2001; 56B (1): S10-9.
6. Daltroy LH, Phillips CB, Eaton HM, Larson MG, Partridge AJ, Logigian M, and Liang MH. Objectively Measuring Physical Ability in elderly Persons: The Physical Capacity Evaluation. Am J Public Health 1995; 85 (4): 558-60.
7. Reneman MF, Bults MMWE, Engberts LH, Mulders KKG, and Göeken LNH. Measuring Maximum Holding Times and Perception of Static Elevated Work and Forward Bending in Healthy Young Adults. J Occup Rehabil 2001; 11 (2): 87-97.
8. Piela CR, Hallenberg KK, Geoghegan AE, Monsein MR, and Lindgren BR. Prediction of Functional Capacities. Work 1996; 6: 161-4.
9. Wijlhuizen GJ, and Ooijendijk W. Measuring disability, the agreement between self-evaluation and observation of performance. Disabil Rehabil 1999; 21 (2): 61-7.
10. Catalano MD, Campagna TDG, Chiappetta CN, Peters S, and Dale L. The inter-rater reliability and validity of self-report of risk factors for cumulative trauma disorders within the hand therapist population - a pilot study. Work 1999; 12 (2): 189-94.
11. Reneman MF, Jorritsma W, Schellekens JMH, and Göeken LNH. Concurrent validity of questionnaire and performance based disability measurements in patients with chronic low back pain. J Occup Rehabil 2002; 12 (3): 119-30.
12. Ajzen I. From intentions to actions: a theory of planned behavior. In: J. Kuhl & J. Beckmann (Eds.) Action control: From cognition to behavior. Berlin: Springer-Verlag, 1985, pp. 11-39.


13. Bandura A. Self-efficacy: toward a unifying theory of behavioral change. Psychol Rev 1977; 84 (2): 191-215.
14. Stone AA, Turkkan JS, Bachrach AC, Jobe JB, Kurtzman HS, and Cain VS. The science of self-report. Implications for Research and Practice. Mahwah: Lawrence Erlbaum Associates, Inc, 2000.
15. Watson PJ. Non-physiological determinants of physical performance in musculoskeletal pain. Syllabus IASP refresher courses on pain management held in conjunction with the 9th world congress on pain. Vienna, Austria. August 22-27, 1999.
16. Borg G. Borg's perceived exertion and pain scales. U.S.A: Human Kinetics, 1998.
17. Kaplan GM, Wurtele SK, and Gillis D. Maximal effort during Functional Capacity Evaluation: an examination of psychological factors. Arch Phys Med Rehabil 1996; 77: 161-4.
18. Matheson LN, Matheson ML, and Grant J. Development of a Measure of Perceived Functional Ability. J Occup Rehabil 1993; 3 (1): 15-29.
19. Lackner JM, Carosella AM, and Feuerstein M. Pain Expectancies, Pain, and Functional Self-Efficacy Expectancies as Determinants of Disability in Patients With Chronic Low Back Disorders. J Consult Clin Psychol 1996; 64 (1): 212-20.
20. Brug J, Schaalma H, Kok G, Meerten RM, and van der Molen HT. Gezondheidsvoorlichting en gedragsverandering, een planmatige aanpak. Assen: Van Gorcum & Comp. B.V, 2000.
21. Daltroy LH, Larson MG, Eaton HM, Phillips CB, and Liang MH. Discrepancies between self-reported and observed physical function in the elderly: the influence of response shift and other factors. Soc Sci Med 1999; 48: 1549-61.
22. Matheson LN, Isernhagen SJ, and Hart DL. Relationships Among Lifting Ability, Grip Force, and Return to Work. Phys Ther 2002; 82 (3): 249-56.
23. Scherer M, Maddux JE, Mercandante B, Prentice-Dunn S, Jacobs B, and Rogers RW. The self-efficacy scale: Construction and validation. Psychol Rep 1982; 51: 663-71.
24. Cluydts RJG. Gemoedstoestanden en slaap. Proefschrift. Brussel: Vrije Universiteit Brussel, 1979.
25. Schellekens J. Reconditionering als warming-up voor reïntegratie: tussenrapport. [Reconditioning as a warm-up for reintegration.] Groningen: Arbeidspsychologie/ Amsterdam: Lisv, 2001.
26. Dijkink A, Kuis M, Reneman MF, Brouwer S, and Göeken LNH. Onderzoek naar de betrouwbaarheid van de Functionele Capaciteit Evaluatie van Isernhagen. [Study of the reliability of the Isernhagen Functional Capacity Evaluation]. Leeronderzoek Instituut voor Bewegingswetenschappen: Groningen, 2000.

49

Chapter 4

27. 28. 29.

30. 31. 32. 33. 34.

35. 36.

37.

50

Gross DP, and Battié NC. Reliability of Safe Maximal Lifting Determination of a Functional Capacity Evaluation. Phys Ther 2002; 82 (4): 364-71. Reneman MF, Dijkstra PU, Westmaas M, and Göeken LNH. Test-Retest Reliability of Lifting and Carrying in a 2-day Functional Capacity Evaluation. J Occup Rehabil 2002; 12 (4): 269-75. Bosscher RJ, Smit JH, and Kempen GIJM. Algemene competentie verwachtingen bij ouderen: een onderzoek naar de psychometrische kenmerken van de Algemene competentieschaal (ALCOS)[Global expectations of self-efficacy in the elderly: an investigation of psychometric characteristics of the General Self-efficacy Scale] Ned Tijdschr Psychol 1997; 52: 239-48. Wald FDM, and Mellenbergh GJ. De verkorte versie van de Nederlandse vertaling van de Profile of Mood States (POMS). Ned Tijdschr Psychol 1990; 45: 86-90. Curran SL, Andrykowski MA, and Studts JL. Short Form of the Profile of Mood States (POMS-SF): Psychometric Information. Psychol Assess 1995; 7 (1): 80-3. Innes E, and Straker L. Validity of work-related assessments. Work 1999:13 (2): 125-52. King ML, Dracup KA, and Woo MA. Predictors of isotonic exercise in patients with heart failure. Med Sci Sports Exerc 2001; 33 (7): 1090-5. Abstract. Whaley MH, Brubaker PH, Kaminsky LA, and Miller CR. Validity of Rating of Perceived Exertion During Graded Exercise Testing in Apparently Healthy Adults and Cardiac Patients. J Cardiopulm Rehabil 1997; 17: 261-7. Fylkesnes K, and Forde OH. The Tromso Study: Predictors of selfevaluated health- has society adopted the expanded health concept? Soc Sci Med 1991; 32 (2): 141-6. Fishbain DA, Abdel-Moty A, Cutler R, Khalil TM, Sadek S, Rosomoff RS, and Rosomoff HL. Measuring Residual Functional Capacity in Chronic Low Back Pain Patients Based on the Dictionary of Occupational Titles. Spine 1994 ; 19 (8): 872-80. Fylkesnes K, and Forde OH. Determinants and dimensions involved in self-evaluation of health. Soc Sci Med 1992; 35 (3): 271-9.

Safe lifting in patients with chronic low back pain
Comparing FCE lifting task and NIOSH lifting guideline

Chapter 5

Wietske Kuijer, Pieter U. Dijkstra, Sandra Brouwer, Michiel F. Reneman, Johan W. Groothoff, Jan H.B. Geertzen

Accepted for publication: Journal of Occupational Rehabilitation
Reprinted with the kind permission of SpringerLink, www.springerlink.com


Abstract

Both the floor-to-waist lifting task of the Isernhagen Work Systems Functional Capacity Evaluation (IWS FCE) and the recommended weight limit (RWL) of the NIOSH lifting guideline produce safe lifting weights, and both are used world-wide. It is unknown whether they produce similar safe lifting weights. The aim of this study was to compare performance on the FCE floor-to-waist lifting task with the RWL of the NIOSH lifting guideline for this task in patients with chronic low back pain (CLBP). Ninety-two patients performed the FCE lifting task. The RWL was calculated for this task and compared with performance. A lifting index was calculated by dividing performance by the RWL. Differences between groups with a lifting index ≤ 1, 1-3 and > 3 were calculated for pain intensity, scores on the Roland Morris Disability Questionnaire (RMDQ) and work status. Men lifted on average 32.5 kg (SD 15.4) and women 18.8 kg (SD 7.8). The RWL for this task was 12.8 kg. The mean difference between performance and RWL was 15.0 kg (SD 14.7; range –8.8 to 59.2). The RMDQ score of patients with a lifting index ≤ 1 was significantly lower than that of patients with a lifting index of 1-3 or > 3. No differences in pain intensity or work status were found between the groups. It was concluded that performance on the FCE floor-to-waist lifting task and the RWL of the NIOSH for this task produce different safe lifting weights in individual patients with CLBP, which may result in contradictory recommendations about the need for rehabilitation and return to work.
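The lifting index described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the analysis used in the study: the function names are hypothetical, and the reading of the group boundaries (an index of exactly 1 falls in the lowest group, exactly 3 in the middle group) is an assumption. The inputs are the group means and the RWL reported in the abstract, used only as example values.

```python
def lifting_index(performance_kg: float, rwl_kg: float) -> float:
    """Lifting index: FCE lifting performance divided by the RWL."""
    if rwl_kg <= 0:
        raise ValueError("RWL must be positive")
    return performance_kg / rwl_kg


def lifting_index_group(li: float) -> str:
    """Assign a lifting index to one of the three groups that are compared.

    Assumption: the boundaries are read as LI <= 1, 1 < LI <= 3 and LI > 3.
    """
    if li <= 1:
        return "LI <= 1"
    if li <= 3:
        return "1 < LI <= 3"
    return "LI > 3"


RWL_KG = 12.8  # RWL reported for the floor-to-waist task

# Mean performance per sex, taken from the abstract as example inputs.
for label, mean_kg in [("men", 32.5), ("women", 18.8)]:
    li = lifting_index(mean_kg, RWL_KG)
    print(f"{label}: LI = {li:.2f} ({lifting_index_group(li)})")
```

Applied to individual patients rather than group means, the same two functions would yield the three groups whose RMDQ scores, pain intensity and work status are compared.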


Introduction

Since lifting is a major risk factor for the onset of low back pain (LBP) and for sickness absence due to LBP,1-3 several instruments have been developed to determine safe lifting weight limits and a worker’s ability to perform a specific lifting task safely.4 The floor-to-waist lifting task of the Isernhagen Work Systems Functional Capacity Evaluation (IWS FCE)5 is a performance task frequently used in rehabilitation medicine; it determines the safe lifting performance of individual patients in a laboratory situation. The American National Institute for Occupational Safety and Health (NIOSH) developed an equation to calculate a recommended weight limit (RWL) and a lifting index (LI) for a specific occupational setting.6 A higher LI indicates higher physical strain for a specific lifting task, and thus a task that is potentially more harmful to the lower back. The two instruments were designed for different purposes: the FCE to determine the ability of an individual to perform a certain job with known physical requirements, and the RWL of the NIOSH to advise whether a job creates a potential hazard to the worker. Nevertheless, both are now used world-wide to determine a worker’s ability to perform a specific (job) lifting task safely.7-10

Instruments for decision-making should be reliable and valid. Reliability refers to the amount of error inherent in any measurement.11 Validity refers to the extent to which an instrument measures what it is intended to measure.11 One of the most practical and objective ways to determine validity is to assess criterion-related validity.12 Criterion validity is usually divided into concurrent and predictive validity: concurrent validity refers to the relation with another instrument administered at the same time, whereas predictive validity refers to the prediction of a certain outcome in the future.12 The floor-to-waist lifting task of the IWS FCE has shown good reliability in patients with chronic low back pain (CLBP), with intraclass correlation coefficients ranging from 0.78 to 0.87.13-15 Concurrent validity between the FCE lifting task and lifting-related questions on the Quebec and Oswestry questionnaires showed poor to moderate relationships (Oswestry 3, ρ=-0.20, non-significant; Quebec 20, r=–0.51, p [...] 1 and 33% of all male patients (n=20) and 3% of all female patients (n=1) had a LI > 3.


Table 5.2. Calculation of NIOSH Multipliers for the FCE lifting task

Multiplier (a)        Formula               FCE Parameter         Value
Load Constant (LC)    23 kg                 23 kg                 23
Horizontal (HM)       25/H                  H ≤ 25                1
Vertical (VM)         1-(0.003*|V-75|)      Vbeginning = 97 cm    0.934
                                            Vend = 23 cm          0.844
Distance (DM)         0.82 + (4.5/D)        D = 74 cm             0.88
Asymmetry (AM)        1-(0.0032A)           A = 0°                1
Frequency (FM)        From tables (a)       6.7 lifts/min         0.735
Coupling (CM)         From tables (a)       Good                  1

(a) Operational definitions and tables of NIOSH Multipliers are presented in the appendix
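The multiplier formulas in Table 5.2 can be checked with a short Python sketch. It is an illustration under stated assumptions rather than the calculation used in the study: the function names are hypothetical, the frequency and coupling multipliers are copied from the table because they come from lookup tables, and the weight limit is taken as the product of the load constant and the six multipliers, evaluated with the smaller of the two vertical multipliers. Because the sketch works from the values as printed in the table, its result need not coincide exactly with the RWL reported for the task.

```python
LOAD_CONSTANT_KG = 23.0


def horizontal_multiplier(h_cm: float) -> float:
    """HM = 25/H; Table 5.2 gives HM = 1 for H <= 25."""
    return 1.0 if h_cm <= 25 else 25.0 / h_cm


def vertical_multiplier(v_cm: float) -> float:
    """VM = 1 - 0.003 * |V - 75|."""
    return 1.0 - 0.003 * abs(v_cm - 75.0)


def distance_multiplier(d_cm: float) -> float:
    """DM = 0.82 + 4.5/D."""
    return 0.82 + 4.5 / d_cm


def asymmetry_multiplier(a_deg: float) -> float:
    """AM = 1 - 0.0032 * A."""
    return 1.0 - 0.0032 * a_deg


# Parameters of the FCE floor-to-waist task as listed in Table 5.2.
hm = horizontal_multiplier(25.0)   # table states H <= 25; 25 is an example value
vm_begin = vertical_multiplier(97.0)
vm_end = vertical_multiplier(23.0)
dm = distance_multiplier(74.0)
am = asymmetry_multiplier(0.0)
fm = 0.735  # copied from Table 5.2 (6.7 lifts/min)
cm = 1.0    # copied from Table 5.2 (good coupling)

# Assumption: the more restrictive vertical multiplier determines the limit.
rwl_kg = LOAD_CONSTANT_KG * hm * min(vm_begin, vm_end) * dm * am * fm * cm

print(f"VM begin = {vm_begin:.3f}, VM end = {vm_end:.3f}, DM = {dm:.2f}")
print(f"Weight limit from tabulated values = {rwl_kg:.1f} kg")
```

Keeping each multiplier in its own function makes it easy to compare a single computed value against the corresponding row of the table.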

No significant differences in pain intensity were found between the LI groups. RMDQ scores differed significantly between the groups. Patients with 0 [...] 3 were restricted in work (table 5.3).

Table 5.3. Differences in work status, RMDQ-score and pain intensity for different Lifting Indices (LI)
LI>3 p-value 0 [...] 45 degrees flexion in knees and < 60 degrees trunk flexion. Maintaining a sitting position. Trunk flexion