Journal of Speech and Hearing Disorders, Volume 49, 226-240, August 1984

PSYCHOMETRIC PRINCIPLES IN THE SELECTION, INTERPRETATION, AND EVALUATION OF COMMUNICATION SELF-ASSESSMENT INVENTORIES

MARILYN E. DEMOREST, University of Maryland Baltimore County and Walter Reed Army Medical Center

BRIAN E. WALDEN, Walter Reed Army Medical Center

A variety of self-assessment inventories have been introduced in recent years for use with hearing-impaired patients. These instruments differ considerably, both conceptually and operationally. Audiologists, therefore, are faced with the task of selecting a test instrument that is appropriate to their patient population and testing purpose. This paper outlines the psychometric principles that guide the selection, interpretation, and evaluation of self-assessment inventories. The application of these principles to a specific clinical population is illustrated by three studies of the Hearing Performance Inventory (Giolas, Owens, Lamb, & Schubert, 1979) conducted at Walter Reed Army Medical Center (WRAMC).

For a patient with sensorineural hearing loss, the ability to communicate effectively depends on both sensory and nonsensory factors. General communication skills, acceptance or denial of the hearing loss, overall emotional adjustment, and the behavior and attitudes of friends, family, and co-workers all can have an impact on communication. It is not surprising, then, that the patient's performance in typical communication situations is not highly predictable from audiometric measures. Audiometric tests do not assess the nonsensory variables that contribute to actual communication performance so, at best, they provide only partial and indirect information about communication handicap.

An increasingly popular supplement to traditional audiometric measures in the assessment of communication problems involves the use of questionnaires or self-report inventories. Among the best known and most widely used are the Hearing Handicap Scale (High, Fairbanks, & Glorig, 1964), the Hearing Measurement Scale (Noble & Atherley, 1970), the Denver Scale of Communication Function (Alpiner, Chevrette, Glascoe, Metz, & Olsen, 1974), and the Hearing Performance Inventory (Giolas, Owens, Lamb, & Schubert, 1979). These instruments and others currently available differ in many ways. Their focus may be exclusively on communication per se, or environmental and affective problems may also be explored. Some inventories are designed for a general population of hearing-impaired adults whereas others, such as the Hearing Handicap Inventory for the Elderly (Ventry & Weinstein, 1982), are designed for special populations. Scoring can yield a single global measure or can produce a profile of subscale scores or a profile of responses to individual items. There may be alternate forms of the questionnaire or it may only be available in a single form. As the number and variety of instruments grow, audiologists are acquiring an impressive choice of techniques for meeting their diagnostic and rehabilitative needs. As a consequence, it has also become increasingly necessary for audiologists to be knowledgeable about the selection, interpretation, and evaluation of self-assessment inventories. For what populations and what types of assessment is a given questionnaire appropriate? Is a single, global measure needed, or is a profile of scores necessary? Is a screening instrument sufficient, or is a more comprehensive assessment required? How do responses or scores vary, on the average, as a function of age, education, degree of hearing loss, or other factors? What kinds of generalizations, inferences, or predictions can be made from test results? What evidence is there for the reliability and validity of the measures?

The issues raised by these questions are fundamental psychometric issues. Some of the relevant information will be provided by those who develop an inventory, but the remainder must be supplied by those who use it. The clinician administering the questionnaire and interpreting the results must be knowledgeable about the essentials of psychological testing and the application of that knowledge in actual practice. Even the best designed and most carefully researched inventory will not be a clinically useful tool if the audiologist lacks this expertise.

This tutorial discusses, from a psychometric perspective, the principles involved in the selection, interpretation, and evaluation of self-assessment inventories. Special emphasis is placed on the statistical properties of scores because these properties place limits on the clinical utility of any test procedure. Although currently available inventories are used whenever possible to illustrate specific points, this is neither a comprehensive nor critical review of the literature on communication self-assessment. Instead, detailed consideration is given to the kinds of issues that should be raised and the types of data that are relevant when one's goal is to select, interpret, or evaluate an inventory.


The first section of the paper discusses test selection and shows how it is guided by the measurement objective, the target population, and the purpose of testing. The second section presents principles of score interpretation. Four different types of scores are reviewed: content-referenced scores, norm-referenced scores, predictive measures, and measures of constructs. It is then shown that score interpretation typically involves generalization and comparison. Justification for generalizing to other items, occasions, or conditions of test administration is discussed in terms of reliability and measurement error. Techniques for comparing a score to a norm, for assessing change in a score over time, and for comparing scores on different scales are illustrated. The third section constitutes a case study in test evaluation. Quantitative methods of describing and evaluating self-assessment inventories are illustrated using data obtained in three projects conducted at Walter Reed Army Medical Center on the Hearing Performance Inventory of Giolas et al. (1979).

SELECTION

The selection of a self-report measure involves careful consideration of three broad questions: "What is to be assessed?", "Who is to be assessed?", and "Why is the information being obtained?". The answers to these questions have implications for the content, structure, and essential psychometric properties of the instrument adopted.

Defining the Measurement Objective

The selection of a self-report measure must be based on clearly defined measurement objectives. What is it, conceptually, that a given inventory is designed to assess and how is that concept operationally defined? Questionnaires with similar titles may actually assess different things, and questionnaires that appear to have different measurement objectives may actually be quite similar in content. The following comparison of three existing inventories illustrates that instruments may differ conceptually, operationally, or both. The Hearing Performance Inventory (Giolas et al., 1979) "was developed to assess hearing performance in those problem areas experienced in everyday listening" (p. 170). Most of the 158 items deal with understanding speech, detecting speech and nonspeech signals, and behavioral responses to auditory failure. The Hearing Handicap Scale of High et al. (1964) was designed to quantify hearing handicap, defined as "any disadvantage in the activities of everyday living which derives from hearing impairment" (p. 215). The 40 items comprising Forms A and B describe the listener's ability to hear signals and understand speech in a variety of situations. Operationally, the Hearing Handicap Scale resembles the Hearing Performance Inventory in its emphasis on performance. The Hearing Handicap Inventory for the Elderly (Ventry & Weinstein, 1982) is also designed to measure hearing handicap. These authors, however, conceptualize handicap as "a multifaceted phenomenon comprised of several important dimensions, including situational difficulties experienced, emotional response to hearing impairment and sensitivity problems" (p. 17). Accordingly, their questionnaire includes not only items describing communication problems, but also items describing the emotional and behavioral consequences of these problems. Although the Hearing Handicap Scale and Hearing Handicap Inventory for the Elderly both claim to measure hearing handicap, the two questionnaires reflect different operational definitions of that concept. Clearly, selection of a self-assessment inventory must be based not only on what the inventory claims to measure but also on how that measurement objective has been translated into a set of items.

Defining the Target Population

All measurement techniques are explicitly or implicitly designed for a certain population of persons. If empirical studies are conducted during the development of an instrument or if normative data are published, the subjects included should constitute a representative sample from that target population. They serve as a reference group against which the responses or scores of individual patients are calibrated. The choice of a target population influences both the content and format of a questionnaire. For example, the Hearing Handicap Inventory for the Elderly contains no items focusing directly on occupational settings, whereas the Hearing Performance Inventory, which is intended for a general population of hearing-impaired adults, has 27 occupational items. Item response formats may also have to be tailored to the target population. Kaplan, Feeley, and Brown (1978) found that the 7-point response scale of the Denver Scale of Communication Function was too difficult for elderly patients and that most of them tended to respond dichotomously, using only the endpoints of the scale. Kaplan et al. also found it necessary to use an interview method of administering the Denver scale rather than the usual paper-and-pencil format. Selection of a self-assessment inventory will be guided by the type of clinical population the audiologist serves. The closer the match between these patients and those for whom the inventory was designed, the more likely it is that the inventory will yield useful information. When significant discrepancies exist, it may be necessary for the clinician to consider modifying an existing instrument or even constructing a new one.

Defining the Purpose of Measurement

In addition to knowing what and who is to be assessed, it is equally important to consider why the information is being sought. How will responses or scores be used?

There are numerous possibilities and each has implications for the required properties of the inventory.

Questionnaires versus scales. Perhaps the simplest and most direct way to use a questionnaire is to review and interpret the patient's responses to individual items. In this case, the questionnaire functions as a standardized clinical interview efficiently providing the audiologist with specific bits of information. It becomes a clinical assessment tool rather than a psychometric one. Item responses are interpreted directly and are not combined to form quantitative measures, so questionnaire selection or development can be based almost entirely on item content. If the responses are numerical ratings, they can be displayed for quick reference in a table or graph such as that used with the Denver Scale of Communication Function. A questionnaire becomes a scale when each response is scored, and the scores are summed or averaged across items. Responses to individual items might not even be inspected. Instead, it is assumed that the items have something in common and interest focuses on the score computed across items. The properties of the scale are a function of the psychometric properties of the items and of their interrelationships. The clinician must take these properties into account when selecting an inventory for a particular purpose.

Profiles versus global scores. If a self-assessment inventory is to be used as a diagnostic aid in planning an individualized aural rehabilitation program, it is likely that a profile of scores will be required. The patient's relative strengths and weaknesses in specific areas can be used to set appropriate rehabilitative goals. In contrast, a single, global measure is more useful if only a general classification of the patient is desired, for example, to evaluate candidacy for rehabilitation or amplification. One problem with profile assessment is that each scale within a profile must provide a reliable and valid measure, and a relatively large number of items is usually required to achieve this. Thus, inventories that provide profiles are generally longer and more time-consuming to administer than instruments that yield only a single score. If testing time is limited, some compromises in the specificity of the assessment may be necessary.

Single versus repeated measures. If the questionnaire is used to assess progress over time or to measure benefits derived from amplification or rehabilitation, it must have good retest reliability. That is, in order to measure change, one should be confident that score differences are not just due to measurement error. When retesting is planned, the interval between testings must also be taken into account. If the interval is short and patients are likely to recall their earlier responses to specific items, the retesting should be done with an alternate, equivalent form of the questionnaire. If recall is not considered likely, or if the patient would probably not be able to distinguish between the old and new forms, then a single test form will suffice.

Predictive measures. When scales are used predictively, their validity with respect to specific criterion variables is important. Suppose, for example, the audiologist wished to predict adjustment to amplification, defined in terms of hours of aid use per day, after 1 year. To establish whether a specific inventory is useful for that purpose requires information about the correlation between scores on the inventory and a measure of that criterion variable. If such validity data exist, the audiologist's assessment of the inventory's appropriateness is greatly simplified. However, since the number of possible criteria is essentially limitless, it is more likely that the validity data will not exist and the audiologist will have to exercise clinical judgment in selecting an appropriate measure. Its validity will then have to be determined empirically.

As these examples illustrate, the selection of a self-assessment inventory is based on many factors. Even though several well-designed instruments already exist, it is not difficult to imagine situations in which none of them would adequately meet all of the selection criteria outlined above. It is likely then that new methods of assessment will continue to appear. For clinicians to make informed choices, the psychometric characteristics of new and existing instruments must be adequately described.

INTERPRETATION

The interpretation of responses, scores, and profiles involves generalizing from information at hand to information that has not been directly observed. The question "What does this score mean?" really asks two things: "What type of information does this score provide?" and "What can I do with this information?" These are interrelated but separate issues. The next section briefly discusses four types of scores: those whose interpretation rests on the content of the test items, those that depend on norms, those that are not interpreted directly but are used to predict other variables, and those that assess traits or theoretical constructs. This is followed by a review of the generalizations, inferences, and predictions that test users typically make and the kinds of empirical data needed to justify them.

Types of Scores

Content-referenced measures. The simplest and most direct type of information provided by a score is information about the patient's responses to the questions asked. Item content determines score interpretation. For example, Item 1 of the Hearing Performance Inventory states: "You are with a male friend or family member in a fairly quiet room. Can you understand him when his voice is loud enough for you and you can see his face?" The 5-point response scale is defined in percentages ranging in equal steps from 1 (practically always or 100%) to 5 (almost never or 0%). A patient who responds 2 is reporting that he or she can understand about 75% of the time in that situation. A patient whose mean response across the 71 Speech items is 2.50 is saying that for the situations described speech is understood between 75% and 50% of the time. If item content is broad enough to encompass all the situations of interest, score interpretation is straightforward. It is more likely though that items will not exhaust the content domain. Instead, they will be considered a sample from the hypothetical set of all possible relevant items. For example, despite its comprehensiveness, the Hearing Performance Inventory does not ask about all possible communication situations. It systematically samples different types of situations, varying features such as the speaker, the presence and nature of competing signals, and the presence of visual cues. Although interpretation of scores may be content-referenced, the test user will typically want to generalize beyond the information given.

Norm-referenced measures. An advantage of content-referenced scores is that score interpretation is direct and does not depend on how other patients score. Norm-referenced scores, in contrast, acquire meaning by comparing a patient's score to a distribution of scores obtained by an appropriate reference group. The mean of this distribution constitutes a norm and represents average or typical performance for the reference population. In statistical terms, it is the expected score of a person selected at random from the population. Variation around the norm is described in terms of the standard deviation of the scores or in terms of the percentile rankings of different scores. Norm-referenced scores provide statistical information.1 That is, the norms reveal the extent to which an observed score is typical or atypical for the target population and that is all.

When a target population for an inventory has been identified, general norms should be based on a representative sample from that population. Yet few test developers have the resources to sample randomly from the target population. Instead, the norm group may consist of several samples gathered unsystematically from available sources. Such samples may be biased by over- or underrepresenting certain types of hearing loss, certain occupational groups, age groups, and so on. If these factors are systematically related to test scores, the norms will be biased and will give a distorted picture of typical performance. For these reasons, the characteristics of the norm group must be described so that test users can evaluate the applicability of the norms to their own clinical population.

If an inventory is frequently used at a particular clinic, it is highly desirable to develop local norms.

1Information theory defines the information provided by an event as $-\log_2 p$, where p is the a priori probability that the event will occur. For example, when one of two equally likely events occurs (p = .5), we obtain $-\log_2 .5 = 1$, or 1 bit of information. As p approaches 1 and an event becomes highly likely, its actual occurrence provides little information. On the other hand, as p approaches zero, and the event becomes highly unlikely, its occurrence provides a great deal of information. By this definition, norm-referenced scores that lie near the mean give relatively little information since they are quite likely to be observed. Scores that deviate from the norm provide more information because their occurrence is, a priori, less likely.
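As a concrete illustration of the footnote's arithmetic, a minimal sketch (the function name is ours, not the paper's):

```python
import math

def information_bits(p: float) -> float:
    """Information, in bits, conveyed by an event with prior probability p."""
    return -math.log2(p)

print(information_bits(0.50))   # 1.0 bit: an unsurprising, near-norm score
print(information_bits(0.05))   # ~4.3 bits: a rare, deviant score says more
```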


Such norms describe the typical pattern of performance for the audiologist's own clientele. Local normative data can help the rehabilitative audiologist plan an appropriate treatment program, suited to the special needs of the patient population being served. The norms can also be useful in interpreting the scores of individual patients: A score that is unusual for the general population of hearing-impaired adults might be quite typical for military officers. Likewise, a score that might be typical of the general population might be unusual in the military group. To the extent that a local group, or any other subgroup, is systematically different from the general population, it is the subgroup norm that provides the more appropriate frame of reference for interpreting an individual's scores. Factors that may account for the shape of the subgroup profile can also be considered in score interpretation.

To illustrate, Table 1 presents normative data obtained with the Hearing Performance Inventory (HPI) from 250 patients who attended the WRAMC Aural Rehabilitation Program from June 1979 to July 1980. Demographically, the typical patient is a 38-year-old married man on active duty in the military, who has had some college education and who is a new user of amplification. The audiometric data reveal a pattern of bilateral, high-frequency hearing impairment, with elevated SRT and reduced speech recognition performance.
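A minimal sketch of such a comparison, expressing a score as a distance from a norm in standard deviation units; the z-score framing is our illustration rather than a procedure prescribed by the paper, and the values are the Speech-scale norms from Table 2 below:

```python
def z_score(x: float, mean: float, sd: float) -> float:
    """Distance of a score from a norm, in standard deviation units."""
    return (x - mean) / sd

# A Speech-scale score of 3.50 against the Walter Reed norms in Table 2:
print(z_score(3.50, mean=2.55, sd=0.72))   # ~1.32 SDs above the local norm
```

Against a different reference group's mean and SD, the same raw score could be typical rather than deviant, which is the point of keeping local norms.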

TABLE 1. Characteristics of patients in the WRAMC Aural Rehabilitation Program (descriptive data obtained with the HPI sample).

Demographic data (N = 250)
  Age:            M = 38.0, SD = 9.4, range 18-69
  Gender:         Male 96.4%; Female 3.6%
  Marital status: Married 85.3%; Single 10.6%; Separated 0.9%; Divorced 3.2%
  Employment:     Military 96.0%; Civilian 2.0%; Retired 2.0%
  Education:      High school 37.0%; College 42.0%; Postgrad. 20.7%
  Prior aid use:  Yes 26.7%; No 73.3%
  Prior rehab.:   Yes 5.9%; No 94.1%

Audiometric data (n = 148)

                     Pure-tone hearing level (dB)               Speech
Ear          250    500   1000   1500   2000   4000     SRT     Recog.
Right   M    15.2   17.5  23.8   30.9   40.5   61.7     19.9    86.7%
        SD   16.6   17.3  19.1   19.7   21.4   24.2     16.4    18.0%
Left    M    17.8   21.1  29.0   40.1   50.3   67.6     23.4    80.1%
        SD    8.3   20.1  21.8   22.3   21.9   20.9     18.5    23.4%
Better  M    10.6   12.9  18.8   26.1   34.5   55.4     15.5    90.9%
        SD    9.6   10.4  14.5   17.0   22.3   25.7      8.7    11.4%
Worse   M    19.7   22.6  34.0   44.9   56.4   73.9     27.4    77.4%
        SD   21.3   22.8  22.9   21.5   19.0   18.9     20.8    23.2%


TABLE 2. Walter Reed norms for the scales of the Hearing Performance Inventory.

        Speech      Intensity   RAFa        Social      Personal    Occup.
N       250         250         250         250         243         242
M       2.55        2.82        3.11        2.87        2.42        2.52
SD      0.72        0.83        0.59        0.55        1.05        0.53
Range   1.00-4.42   1.06-4.84   1.58-4.52   1.51-4.28   1.00-4.88   1.05-3.92

aResponse to Auditory Failure.

Responses to HPI items are recorded on a 5-point frequency scale, ranging from 1 for practically always to 5 for almost never. For items that describe negative behaviors or feelings, the scoring of the response scale is reversed. For each of the six sections of the HPI, scores were obtained by averaging across items, excluding those that were omitted or judged not applicable. This yielded scales for which a low score was the most desirable and a high score the least desirable. Descriptive statistics are given in Table 2. The group profile is relatively flat, with mean scores generally below the midpoint of 3.00. The score distributions are slightly skewed, and the direction of skewing indicates that the scales are more sensitive to individual differences above the mean (frequent communication problems) and less sensitive to individual differences below the mean (infrequent communication problems). This is a desirable characteristic of a measure that is intended primarily for clinical use.2 The ranges and standard deviations are as large as could be expected given the potential range of 4.00 points. Thus, the patients are not highly homogeneous in their assessment of communication problems; there are substantial individual differences. More detailed information about the typical response pattern of this patient population can be obtained from mean scores for subcategories of items. Figure 1, for example, shows the group profile for different types of response to auditory failure. For this group, asking others for assistance, asking for repetition of a portion of what was said, and informing others of the hearing loss are problem areas that raise the overall score on the Response to Auditory Failure scale.
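The scoring rule just described (reverse-score negative items, skip omitted or not-applicable items, average the rest) might be sketched as follows; the reverse-keyed item numbers are hypothetical placeholders, not the HPI's actual keying:

```python
import numpy as np

REVERSE_KEYED = {101, 103, 106}   # hypothetical item numbers, for illustration

def section_score(responses: dict) -> float:
    """Mean response on the 1-5 scale: negative items are reverse-scored,
    and items omitted or judged not applicable (None) are excluded."""
    kept = []
    for item, r in responses.items():
        if r is None:                        # omitted / not applicable
            continue
        kept.append(6 - r if item in REVERSE_KEYED else r)
    return float(np.mean(kept))

print(section_score({101: 2, 102: 3, 104: None, 105: 1}))   # (4 + 3 + 1) / 3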

Predictive measures. When measures are obtained solely because they have value as predictors of other variables, direct interpretation of observed scores may be of little interest. Their meaning and usefulness are derived from their relationship with a criterion variable. In this case, it wouldn't matter if the patient's responses were biased or inaccurate, provided they were correlated with the criterion. For predictive measures, score interpretation is likely to be indirect and to be based on an expectancy table or a regression equation that relates the predictor score to an expected score or interval of scores on the criterion (a small sketch appears at the end of this subsection).

Measures of constructs. When test scores are interpreted as measures of abstract constructs or traits, it is implicitly assumed that the observed score is a quantitative expression of the individual's position on some theoretically important continuum. For example, a patient's score on the HPI speech items might be interpreted as a measure of "communication handicap." The validity of this interpretation, however, depends on the extent to which the test constitutes a useful operational definition of that construct. Most constructs are conceptually complex, and it is unlikely that any one operational definition will be sufficiently broad to encompass all aspects of the construct. Rather, the construct is viewed as an underlying determiner of many different, but theoretically related, types of observable behavior. For example, communication handicap might be revealed both by errors in understanding speech and by avoidance of communication situations or by the emotional stress associated with trying to communicate effectively. Score interpretation must be undertaken cautiously and with the acknowledgment that a given score is probably only a single indicator of a multifaceted underlying construct.
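Returning to the predictive-measures paragraph above, here is a sketch of the regression-equation approach. The calibration pairs (inventory score, hours of hearing aid use per day after 1 year) are fabricated for illustration only; neither the data nor the least-squares choice comes from the paper:

```python
import numpy as np

# Hypothetical calibration pairs: inventory score vs. hours of hearing aid
# use per day after one year (the criterion variable named in the text).
scores = np.array([2.1, 2.6, 3.0, 3.4, 3.9, 4.3])
hours = np.array([9.0, 8.0, 6.5, 5.0, 3.5, 2.5])

slope, intercept = np.polyfit(scores, hours, 1)   # least-squares line
print(slope * 3.2 + intercept)   # expected daily use for a new patient, ~5.8 h
```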


2When a test-score distribution is skewed, individual differences at one end of the distribution are emphasized, while those at the other are de-emphasized. If the goal of a self-assessment inventory is to detect individual differences in communication problems among those most in need of rehabilitation, the score distribution should be skewed in that direction. Because individual differences among those functioning at a better-than-average level are of less interest, little clinically useful information is lost.

FIGURE 1. Mean score on subscales of Response to Auditory Failure.

The four different types of measures and score interpretation that have been discussed are not mutually exclusive. For example, a patient's score on the Speech scale of the HPI might be interpreted with reference to (a) the specific situations sampled by the test items, (b) the scores obtained by an appropriate norm group, (c) the construct of communication handicap, or (d) predicted success with amplification or some other criterion variable. It is therefore important that the audiologist understand the bases for these different interpretations and have assurance, through empirical evidence, that they are warranted.

Generalizing from Observed Scores

All observed scores represent information obtained with a specific set of items, on a specific occasion, under a specific set of test conditions. To what extent can one generalize from the observed score to scores that might be obtained with different items, on different occasions, or under different test conditions? Answers to these questions are provided, in part, by reliability coefficients and by statistics such as the standard error of measurement. Because observed scores have a precision that is often more apparent than real, it is important that clinicians develop the ability and the inclination to use psychometric information when interpreting the scores of individual patients. The type of information required varies with the type of generalization one wishes to make.

Generalizing over items. Some self-assessment inventories, such as the Hearing Handicap Scale of High et al. (1964), are available in two forms. An obvious question concerns the equivalence of the two forms: Do they provide comparable measures? Are their means and standard deviations the same, and do the scores on one form correlate highly with the scores on the other form? According to data presented by High et al. for a sample of 50 subjects, the answers to these questions are affirmative. Forms A and B had means of 55.8 and 55.2, respectively, with standard deviations of 14.7 and 15.0 and a correlation of .96. Obviously, given this evidence of alternate-form equivalence, it is unimportant which form a patient uses because performance does not depend on the particular set of items comprising the form. If two test forms are highly correlated, but differ somewhat in their means and standard deviations, separate norms are required for the two forms, but otherwise they provide equivalent information. However, as the correlation between forms drops from 1.00, the observed score becomes increasingly a function of the particular questions the patient answers. When an inventory such as the HPI consists of a single form, it is still important to know whether the observed score is likely to depend on the particular items presented. Internal consistency reliability, estimated by coefficient alpha, provides this information. Coefficient alpha can be based on interitem correlations or interitem covariances.

When the items are a representative sample of items from a well-defined content domain, interitem correlations or covariances can be used to estimate the expected correlation between any two random samples of items from that domain (Nunnally, 1978). If items on a k-item test are standardized so that each has zero mean and unit variance, a standardized coefficient alpha is given by

$$\alpha = \frac{k\,\bar{r}}{1 + (k - 1)\,\bar{r}} \qquad (1)$$

where $\bar{r}$ is the average interitem correlation. This formula shows that the expected correlation with a different set of k items increases as the average interitem correlation increases and as the number of items increases. Equation 1 is a special case of the Spearman-Brown formula, which shows how reliability changes as test length is changed by a factor of k. In this context, $\bar{r}$ is an estimate of reliability for a single item, and since the test contains k items, the reliability of the k-item test is obtained directly from Equation 1. If the item scores are not standardized, but are merely summed or averaged to obtain a total score, then alpha depends on the interitem covariances rather than the interitem correlations. It can be shown that the formula then becomes

$$\alpha = \frac{k}{k - 1}\cdot\frac{s_x^2 - \sum s_i^2}{s_x^2} \qquad (2)$$

where $s_x^2$ is the estimated variance of the total score x, and $s_i^2$ is the estimated variance of item i. Available evidence suggests that both the Hearing Handicap Inventory for the Elderly (HHIE) and the newly revised 90-item HPI have excellent internal consistency. Ventry and Weinstein (1982) reported that coefficient alpha was .95 for the total score on the HHIE, .93 for the Emotional subscale, and .88 for the Social subscale (N = 100). Similarly, Lamb, Owens, and Schubert (1983) reported values of alpha ranging from .86 to .96 for the subscales of the HPI (N = 354). Estimates of alternate-form reliability and internal consistency reliability are important because they can provide justification for generalizing from an observed score on one set of items to the score that would probably be observed on another, equivalent set of items. Without this information, a conservative approach to score interpretation is to refrain from generalizing beyond the items actually administered.
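A sketch of Equations 1 and 2, plus the Spearman-Brown step-up the text describes, assuming a complete patients-by-items matrix of scored responses (the function names are ours):

```python
import numpy as np

def standardized_alpha(x: np.ndarray) -> float:
    """Equation 1: alpha from the average interitem correlation of a
    patients-by-items matrix of item scores."""
    k = x.shape[1]
    r = np.corrcoef(x, rowvar=False)
    r_bar = (r.sum() - k) / (k * (k - 1))     # mean off-diagonal correlation
    return k * r_bar / (1 + (k - 1) * r_bar)

def alpha(x: np.ndarray) -> float:
    """Equation 2: alpha from item variances and total-score variance."""
    k = x.shape[1]
    total_var = x.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (total_var - x.var(axis=0, ddof=1).sum()) / total_var

def spearman_brown(r: float, k: float) -> float:
    """Expected reliability when test length changes by a factor of k."""
    return k * r / (1 + (k - 1) * r)

# The step-up used later in footnote 5: an 18-item retest reliability of .73,
# projected to the full 71-item Speech scale.
print(round(spearman_brown(0.73, 71 / 18), 2))   # 0.91
```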

Generalizing over occasions. Because an observed score is obtained on a specific occasion, it can be influenced by irrelevant temporal factors such as the patient's mood, attitude, general state of health, or any situational factor specific to that occasion. Because these variables fluctuate over time, observed scores for a given patient can also be expected to fluctuate to some extent, even if the same items are administered and if the testing conditions are constant. It is therefore important to know whether scores obtained on one occasion can be generalized to scores that might have been obtained at another time. Theoretically, this information can be obtained from a retest reliability coefficient based on two independent administrations of the test. Unfortunately, it is impossible to maintain independence of the two test administrations. If the patient remembers the responses previously given to certain items, the retest correlation may be spuriously high. Or, if the first test administration sensitizes the patient to certain issues, responses to some items may change from one occasion to the next and the retest correlation will be lowered. An even more difficult problem arises if the interval between tests is long or if some form of treatment takes place between the two test sessions. In either case a true change in scores may occur, and if there is differential change across patients, the retest correlation will underestimate retest reliability. Methodological problems such as these probably account for the conspicuous lack of large-sample retest-reliability studies in the literature on communication inventories. Nevertheless, assessment of retest reliability is essential if a communication inventory is to be used to measure change for individual patients. To determine whether a rehabilitation program has affected a patient's communication performance, it is necessary to estimate whether observed changes in an individual's scores are greater than would be expected without treatment intervention, that is, greater than changes produced by irrelevant factors associated with the test occasion. If program evaluation is the goal, a study can be designed in which a group of patients who receive the program are compared with a control group of patients who do not. But to assess change for individuals, information about retest variability in scores is indispensable.

Generalizing over conditions of administration. If the administration of a self-assessment inventory is standardized, irrelevant factors should have a minimal effect on the observed scores. Yet any feature of the administration that is not standardized is a potential source of influence on the scores. The HPI, for example, contains explicit instructions to the patient including a detailed explanation of the 5-point response scale. But administration conditions also encompass many other characteristics of the setting that may affect scores. For example, the HPI could be administered in a clinic or hospital, or it might be given to the patient to take home. It might be administered in person by an audiologist who reads the instructions verbatim, answers questions, and explains why the information is being sought, or it might be sent to the patient through the mail with a cover letter. The patient might take the inventory alone, with family present, or in a group with other patients. The format of test administration is also an important consideration. The Hearing Measurement Scale (Noble & Atherley, 1970) was originally designed as a standardized interview and later revised for administration in a paper-and-pencil format. Noble (1979) tested 23 subjects under both formats with a 6-month interval between testings and obtained a correlation of approximately .8 between formats. However, interpretation of the results was complicated by the presence of order effects. There is no single methodology for estimating the effects of administration conditions on observed scores. In principle, one could design an elaborate experiment in which each potential source of score variability was systematically varied and its effect on scores assessed. In fact, it is precisely this type of model that is the basis for the theory of generalizability developed by Cronbach, Gleser, Nanda, and Rajaratnam (1972). It is unlikely that such ambitious projects will be undertaken for most communication self-assessment inventories, so the safest strategy is to recognize that a score observed in a particular setting under a particular set of conditions cannot safely be generalized to scores that would be observed under other conditions. If such generalization is necessary, the effects of varying conditions should be assessed empirically.

Estimating Measurement Error

Justification for generalizing across items, occasions, and the conditions of administration depends on empirical evidence that is often reported in the form of a reliability coefficient, for example, a retest correlation, an alternate-form correlation, or an internal consistency measure such as coefficient alpha. One important property of reliability coefficients is that they reflect the range of true individual differences among patients in the population.

Effects of variability on reliability coefficients. Theoretically,3 the variance of observed scores on measure x can be partitioned into two components:

$$\sigma_x^2 = \sigma_t^2 + \sigma_e^2 \qquad (3)$$

where $\sigma_t^2$ is the variance of true scores and $\sigma_e^2$ is the variance attributable to measurement error. Reliability coefficients provide estimates of the ratio of true variance to observed variance:

$$\rho_{xx'} = \frac{\sigma_t^2}{\sigma_t^2 + \sigma_e^2} \qquad (4)$$

This relationship shows that reliability increases as $\sigma_t^2$ increases. Reliability is a function of both measurement error variance and the variance of true scores in the population. If measurement error is small relative to true individual differences, individuals tend to remain in the same rank order from one testing to another and the reliability coefficient is large. But if measurement error is large relative to true individual differences, a person's ranking changes considerably between testings and the reliability coefficient is small. Thus, heterogeneous samples produce higher estimates of reliability than homogeneous samples. For this reason, reliability coefficients are of little use in interpreting scores for individuals.

3These relationships are based on classical true-score theory in which $x = t + e$, $\sigma_x^2 = \sigma_t^2 + \sigma_e^2$, and, for parallel measures x and x', $\rho_{xx'} = \rho_{xt}^2 = \sigma_t^2/\sigma_x^2$. A true score, t, is the average score obtained over an infinite number of independent testings. For a concise presentation of classical true-score theory see Allen and Yen (1979).

What is needed is a measure of the precision of individual scores that is independent of the true variability among individuals.

Standard error of measurement. If it is reasonable to assume that $\sigma_e^2$ is constant across individuals, an estimate of $\sigma_e$, called the standard error of measurement ($s_e$), can be obtained from an estimated reliability coefficient and the standard deviation of the scores:

$$s_e = s_x \sqrt{1 - r_{xx'}} \qquad (5)$$

Theoretically, $s_e$ is unaffected by sample heterogeneity because the formula takes score variability into account: As $s_x$ increases, so does $r_{xx'}$, but $s_e$ remains the same. The standard error of measurement may be interpreted as the standard deviation of independently obtained scores around an individual's true score. Any source of variability reflected in the reliability coefficient is also reflected in the standard error of measurement. If $r_{xx'}$ is an alternate-form or internal consistency reliability coefficient, then variability in scores attributable to item content is taken into account. If $r_{xx'}$ is a retest reliability coefficient, then variability due to the test occasion is taken into account. Different types of reliability coefficients produce different estimates of the standard error of measurement. Consequently, the justification for different types of generalization rests on different types of data.

Standard errors of measurement can be extremely useful in the interpretation of a patient's observed score on a self-assessment inventory. For example, Ventry and Weinstein (1982) reported an internal consistency reliability of .95 for the HHIE and a standard deviation of 27.3%. Using these values in Equation 5, the estimated value of $s_e$ for that scale is $27.3\% \times \sqrt{1 - .95}$, or 6.10%. If a patient obtained a score of 60% on the HHIE, it would be rather unlikely that this score would deviate by more than 2 $s_e$ from his or her true score. The range of scores $x \pm 2s_e$ is an approximate 95% confidence interval for the patient's true score4 on items like those in the HHIE. It is therefore unlikely that the true score is higher than 72.2% or lower than 47.8%. The width of the confidence interval reflects the uncertainty involved in generalizing from the observed score of 60% to the patient's true score on all items like those in the HHIE.
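The Equation 5 computation and the resulting interval can be sketched as follows, using the HHIE values just cited and the text's plus-or-minus 2 $s_e$ convention (function names are ours):

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Equation 5: the standard error of measurement."""
    return sd * math.sqrt(1.0 - reliability)

def true_score_interval(observed, sd, reliability, mult=2.0):
    """Approximate 95% confidence interval for the true score (x +/- 2 se)."""
    se = sem(sd, reliability)
    return observed - mult * se, observed + mult * se

print(sem(27.3, 0.95))                         # ~6.10
print(true_score_interval(60.0, 27.3, 0.95))   # ~(47.8, 72.2)
```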

4The confidence interval is only approximate for two reasons. First, the correct multiplier of $s_e$ is the critical value of the t statistic for a two-tailed test at p < .05, based on the degrees of freedom for the sample used to estimate $s_e$. As sample size increases, the t distribution approximates the normal distribution and the critical value approaches z = 1.96. Second, the confidence interval should not be constructed around the observed score because observed scores are biased estimators of true scores. An observed score above the mean tends to overestimate the true score, and an observed score below the mean tends to underestimate it. The correct procedure is to set the confidence interval around an unbiased estimate of the true score rather than around the observed score (Nunnally, 1978). However, unless the reliability coefficient is very low, the bias involved is probably of no practical importance.


Assessing deviation from a norm. One application of the standard error of measurement in score interpretation is to use the confidence interval for a patient's true score to determine whether the score deviates significantly from some norm. For example, Ventry and Weinstein (1982) reported that the average score on the HHIE in their standardization sample was 29.9%. For the hypothetical patient with an observed score of 60%, the 95% confidence interval calculated above (47.8% to 72.2%) does not include the norm of 29.9%. Therefore, it can be concluded that the patient's true score is higher than the average score obtained by patients in the normative sample.

The same strategy can be used with local norms and with more than one type of reliability coefficient. For example, the data in Table 2 show that the average patient in the WRAMC Aural Rehabilitation Program has a score of 2.55 on the 71 Speech items of the HPI. Based on studies described in detail below, the estimated $s_e$ is 0.08 when calculated from an internal consistency coefficient and 0.21 when calculated from a retest coefficient.5 For a patient with an observed score of 3.50, the two 95% confidence intervals would be 3.50 ± 0.16 and 3.50 ± 0.42, respectively. Neither interval contains the norm of 2.55. Therefore, regardless of whether generalization is over items or over occasions, it could be concluded that the patient's true score is higher than the average score obtained by patients in the rehabilitation program.6

Assessing the difference between scores on two occasions. If the patient's scores on two occasions are used to estimate whether a true change has occurred, the measurement error in both scores must be taken into account. If the observed difference is $(x_2 - x_1)$, and if $s_e$ is the standard error of measurement for each score, the standard error of the difference between the two scores is $\sqrt{2}\,s_e$. Suppose, for example, that a patient's score on the Speech items changes from 3.50 to 2.50 after wearing a hearing aid for 1 year. If $s_e = 0.21$, $\sqrt{2}\,s_e = 0.30$ and an approximate 95% confidence interval for the true change is $(x_2 - x_1) \pm 2\sqrt{2}\,s_e$. This gives an interval of (2.50 − 3.50) ± 2(0.30) or −1.00 ± 0.60. The interval ranges from −1.60 to −0.40, and because it does not include zero, it can be concluded with 95% confidence that a true change has occurred. Any change less than 0.60 scale units would not be significant.

Because the 71-item Speech scale is extremely reliable, it is sensitive to change in an individual's score. Scales with few items are not likely to be this sensitive. For example, in a local study of retest reliability using only 18 items from the Speech scale, $s_e$ was estimated to be 0.39 (see Table 8).

5The retest reliability obtained with 18 Speech items was .73 (see Table 8). Using Equation 1 with k = (71/18), the estimated retest reliability for the 71-item scale would be .91. Substituting this value for $r_{xx'}$ in Equation 5, with $s_x = 0.72$, yields 0.21 for $s_e$.

6Technically, these tests should also take into account the sampling variability in the mean that represents the norm. The standard error of the mean is $s_x/\sqrt{N}$, which for our data is $0.72/\sqrt{250} = 0.05$. Thus, if the norm is based on a reasonably large sample, it can be safely treated as a population mean rather than a sample mean.

The standard error of the difference becomes $\sqrt{2}\,(0.39) = 0.55$ and the 95% confidence interval for true change is $(x_2 - x_1) \pm 2(0.55)$. A patient's score would have to change more than 1.10 scale units before it could be concluded with 95% confidence that a true change had occurred. By this standard a change from 3.50 to 2.50 would not be significant. When the standard error of measurement is large, it is difficult to detect change with a high level of confidence. An alternative is to use a less conservative approach and to assess change by constructing a 68% confidence interval:

$$(x_2 - x_1) \pm \sqrt{2}\,s_e$$

For the patient described above, this interval is −1.00 ± 0.55, and since it does not include zero, it could be concluded with 68% confidence that a true change had occurred. Because the confidence interval is narrower and more likely to exclude zero, it is easier to detect a true change. However, the risk of mistakenly inferring that a true change has occurred is increased from 5% to 32%.

Assessing the difference between scores on two scales. When an inventory provides scores on more than one scale, it may be of interest to know whether two scores are significantly different. Once again the measurement error in both scores must be taken into account. If $x_1$ is the patient's score on one scale, with standard error $s_{e1}$, and $x_2$ is the score on a second scale, with standard error $s_{e2}$, the standard error of the difference between the two scores is $\sqrt{s_{e1}^2 + s_{e2}^2}$.
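A sketch of both difference tests: the equal-$s_e$ shortcut from the two-occasions discussion, and the general two-scale form just given (function names are ours):

```python
import math

def change_se(se: float) -> float:
    """Standard error of a difference between two scores with equal se."""
    return math.sqrt(2.0) * se

def scale_difference_se(se1: float, se2: float) -> float:
    """Standard error of the difference between scores on two scales."""
    return math.sqrt(se1 ** 2 + se2 ** 2)

def change_is_significant(x1, x2, se, mult=2.0):
    return abs(x2 - x1) > mult * change_se(se)

print(change_is_significant(3.50, 2.50, se=0.21))   # True: 71-item scale
print(change_is_significant(3.50, 2.50, se=0.39))   # False: 18-item version
```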

TABLE 4. Factor structure of the HPI scales.

Scale/Factor      Items with        Item with          Highest   Interpretation
                  loadings > .40    highest loading    loading

Speech
  Factor I             38                33              .80      Listen in noise
  Factor II            28               112              .81      Listen w/o visual cues
  Factor III           26                 3              .79      Listen in quiet
Intensity
  Factor I              9                65              .77      Detect common sounds
  Factor II            11                74              .69      Detection/loudness speech
RAF
  Factor I              6               101              .80      Tell about hearing loss
  Factor II             5                97              .87      Repeat portion heard
  Factor III            7                93              .73      Ask for repetition
  Factor IV             3               106              .80      Move seat to hear better
  Factor V              3               103              .67      Ask favor of speaker
  Factor VI             3                88              .81      Ask assistance of others
  Factor VII            3                94              .69      Pretend to understand
Social
  Factor I             20                38              .82      Listen in noise
  Factor II             7               113              .81      Listen w/o visual cues
  Factor III            5               101              .78      Tell about hearing loss
  Factor IV             5               102              .68      Move seat to hear better
  Factor V              3                97              .88      Repeat portion heard
  Factor VI             4                12              .66      Listen in quiet
  Factor VII            4                53              .65      Listen in automobile
  Factor VIII           2                88              .74      Ask assistance of others
  Factor IX             2                94              .70      Pretend to understand
Personal
  Factor I              4               127              .88      Avoid performances
  Factor II             4               129              .84      Emotional reactions
Occupational
  Factor I             10               138              .88      Listen in noise
  Factor II             6               133              .86      Listen in quiet
  Factor III            3               145              .89      Pretend to understand
  Factor IV             3               147              .85      Tell about hearing loss
  Factor V              3               157              .79      Hearing interferes w/job
  Factor VI             3               153              .88      Repeat portion heard

Because the Social scale draws its items from the Speech and RAF sections, the factor structure of the Social scale is simply a composite of the results from the parent scales. Ten items have significant loadings on more than one factor and one item, No. 78, has no loading greater than .40. The Personal items cluster into two groups, those describing avoidance of certain types of activity and those describing emotional reactions to hearing problems. Finally, the Occupational items reveal a factor structure consistent with the composition of the scale: Two factors involve understanding speech, one consists of personal items, and three describe different responses to auditory failure. Only three items have significant loadings on more than one factor.

The results of the internal consistency and factor analyses are quite compatible. The Speech, Intensity, and Personal scales, which have a simple factor structure, also have high internal consistency reliability. The RAF scale describes several different behavioral responses that are not strongly correlated with one another. This scale is more heterogeneous and therefore has lower internal consistency. Likewise, the Social and Occupational scales, each containing a variety of item types, have a relatively complex factor structure and lower internal consistency. The factor structure of the HPI could be used to guide the refinement and/or reorganization of subscales.

TABLE 5. Correlations among HPI scales.

             Intensity    RAF     Social    Personal    Occup.
Speech          .69       .08      .90        .46        .67
Intensity                 .01      .65        .44        .52
RAF                                .30        .08        .30
Social                                        .39        .70
Personal                                                 .39

For example, a new Speech scale might be composed of three subscales corresponding to Factors I, II, and III. Given a decision about the number of items to be retained, items with the highest factor loadings could be selected for tryout. In fact, we followed such a procedure in developing an adaptation of the HPI that was used in studies described below.
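As a sketch of how such a factor analysis might be run with current tools (scikit-learn's FactorAnalysis with varimax rotation is our substitution, since the paper does not name its software, and the data file name is hypothetical):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

responses = np.load("hpi_speech_items.npy")   # hypothetical patients x items data

fa = FactorAnalysis(n_components=3, rotation="varimax").fit(responses)
loadings = fa.components_.T                   # items x factors

# Items with loadings > .40 define each factor, the cutoff used in Table 4.
for f in range(loadings.shape[1]):
    items = np.flatnonzero(np.abs(loadings[:, f]) > 0.40)
    print(f"Factor {f + 1}: items {items.tolist()}")
```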

Correlations Among HPI Scales

The analyses presented so far have described relationships among items within the HPI scales, that is, internal structure. It is also of interest to know how the scales relate to one another. Table 5 gives the correlations among the six scales, based on the same sample of 250 patients. Among the scales with no items in common, the highest correlation is between Speech and Intensity. The correlation is moderately high and implies that patients who report difficulty understanding speech also tend to report difficulty detecting common sounds and speech signals. The correlation is far from perfect, however, and there will be some patients whose scores on the two scales differ significantly. Correlations between the Personal scale and the Speech and Intensity scales are also significant. This implies that the patient's emotional adjustment is related, to some extent, to his or her communication problems. Perhaps the most interesting correlations in the table are those involving the RAF scale. This scale is not significantly correlated with the Speech, Intensity, or Personal items, and its correlation with the Social and Occupational items can be explained in terms of the overlap in item content. Thus, the behavioral items are independent of items describing communication performance per se and emotional reactions to communication problems. Different rehabilitation strategies might be required depending on the patient's score on the RAF scale.

Correlations with Audiometric Measures

One issue receiving considerable attention in the literature on self-assessment inventories is the relationship between self-reported communication problems and audiometric measures of hearing impairment. Although communication performance is not solely a function of the degree or configuration of one's hearing loss, it is reasonable to expect some relationship between the two. As a consequence, correlations between communication scales and pure-tone thresholds, speech reception thresholds, or speech recognition scores have at times been interpreted as validity coefficients, particularly when the correlations with speech measures were higher than those with sensitivity measures. Such correlations are considered evidence of construct validity because they confirm theoretically predicted relationships between the construct(s) measured by the inventory and other variables.

For administrative reasons it was possible to obtain the medical records of only 148 of the 250 patients who took the HPI. Although this subsample cannot be considered a random one, there is no reason to believe it is systematically biased. Pure-tone thresholds between 0.5 and 8 kHz for both ears were used to calculate a variety of average thresholds. Occasional missing data for frequencies such as 1.5 or 6 kHz were estimated by interpolation. Speech reception thresholds and word recognition scores (NU-6) were also recorded.

Correlations between the various pure-tone averages and the HPI were quite similar. Typical results are shown in Table 6. The Speech, Intensity, RAF, and Occupational scales correlate significantly with the average pure-tone threshold in the better ear at 1, 2, and 4 kHz, but the relationships are not strong. The highest correlation is with the Intensity scale, which assesses the detectability of speech and nonspeech signals. Correlations with SRT are somewhat weaker but follow a similar pattern. The same four scales correlate significantly with speech recognition. Of particular interest is that the Speech scale is more highly correlated with speech recognition than with pure-tone threshold, an important indicator of construct validity. It is also noteworthy that the compensatory strategies involved in the RAF scale are somewhat more strongly related to speech recognition than to sensitivity. The Personal items show only weak, nonsignificant correlations with the audiometric measures, suggesting that they assess factors that are relatively independent of the degree of hearing impairment.
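A sketch of the pure-tone averaging step; interpolating on a log-frequency scale is our assumption, as the text says only that missing values were estimated by interpolation:

```python
import numpy as np

def pure_tone_average(thresholds: dict, freqs=(1000, 2000, 4000)) -> float:
    """Mean threshold (dB) over freqs; untested frequencies are
    interpolated on a log-frequency scale from the tested ones."""
    tested = sorted(thresholds)
    levels = [thresholds[f] for f in tested]
    wanted = np.interp(np.log2(np.asarray(freqs, dtype=float)),
                       np.log2(np.asarray(tested, dtype=float)), levels)
    return float(wanted.mean())

# 1.5 kHz was not tested; it is interpolated rather than dropped.
print(pure_tone_average({500: 20, 1000: 25, 2000: 45, 4000: 60},
                        freqs=(1000, 1500, 2000)))
```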

W R A M C Adaptation of the HPI The data presented in Tables 3, 4, and 5 justify shortening the HPI and reducing the number of scales. To TABLE 6. Correlations of HPI scales with selected audiometric measures for the patient's better ear (n = !48).

Correlations w i t h Audiometric Measures One issue receiving considerable attention in the literature on self-assessment inventories is the relationship between self-reported communication problems and audiometric measures of hearing impairment. Although communication performance is not solely a function of the degree or configuration of one's hearing loss, it is reasonable to expect some relationship between the two. As a consequence, correlations between communication

237

Section

Average pure-tone threshold 1, 2, 4 kHz

SRT

Speech recognition

.18" .39** .20* .13 .11 .19"

.12 .31"* .14 .16" .13 .20*

-.32** -.35"* -.32** -.10 -.06 -.29"*

Speech Intensity RAF Social Personal Occupational *p < .05.**p < ,001,
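The pure-tone averages behind Table 6 can be computed mechanically. The Python sketch below, with hypothetical thresholds, shows one way to interpolate occasional missing interoctave frequencies (e.g., 1.5 or 6 kHz) and average the 1-, 2-, and 4-kHz thresholds for the better ear; the linear-in-frequency interpolation rule is an assumption, as the paper does not specify the method used.

import numpy as np

# Hypothetical thresholds (dB HL) at 0.5-8 kHz; np.nan marks a missing value.
freqs = np.array([0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0, 8.0])  # kHz
right = np.array([20, 25, np.nan, 35, 45, 50, np.nan, 60.0])
left  = np.array([15, 20, 30, 40, 50, 55, 60, 65.0])

def fill_missing(thresholds):
    # Linearly interpolate missing thresholds from neighboring frequencies.
    missing = np.isnan(thresholds)
    out = thresholds.copy()
    out[missing] = np.interp(freqs[missing], freqs[~missing], thresholds[~missing])
    return out

right, left = fill_missing(right), fill_missing(left)

# Average threshold at 1, 2, and 4 kHz for each ear; "better ear" = lower PTA.
idx = np.isin(freqs, [1.0, 2.0, 4.0])
pta_better = min(right[idx].mean(), left[idx].mean())
print(f"Better-ear PTA (1, 2, 4 kHz): {pta_better:.1f} dB HL")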


WRAMC Adaptation of the HPI

The data presented in Tables 3, 4, and 5 justify shortening the HPI and reducing the number of scales. To facilitate further research on the inventory at Walter Reed, an abbreviated version containing 4 scales and 15 subscales was designed. The Social scale was eliminated because its items were included in the Speech and Response to Auditory Failure scales, and the Occupational items were incorporated into the Speech, Response to Auditory Failure, and Personal scales. The goal was to reduce testing time while adequately reflecting the structure of the 158-item test. The methods and results described here give a brief illustration of the process of test revision and provide psychometric information relevant to studies reported below.

The sample of 250 patients was randomly divided into two subsamples. Item selection was based on item analyses of the data for the first subsample. The principal criterion for item selection was maximization of coefficient alpha for each subscale. However, item content, means, and standard deviations were also considered. An attempt was made to maintain content validity (by including a representative sample of the original items) and to match the scale means and standard deviations insofar as possible. A set of 80 items was selected.

Because item selection based on statistical criteria can capitalize on chance within a given sample, the statistical properties of the resulting scales must be replicated (i.e., cross-validated) in an independent sample. The second subsample was used for this purpose. Table 7 presents descriptive statistics, coefficient alpha, and se for each scale based on data from the cross-validation sample. Sample size fluctuates because subjects with missing data on one or more items were dropped from the analysis. The Speech scale contains 27 items, 9 representing each of the three factors shown in Table 4. Similarly, the Intensity scale consists of two 9-item subscales. The RAF scale includes 3 or 4 items representing each of seven factors, and the Personal scale includes all 8 Personal items plus the 3 personal items from the Occupational scale. The values of alpha in Table 7 are quite similar to the values obtained in the first subsample (viz., .97, .92, .90, and .89), with no indication of the shrinkage that is expected in an independent cross-validation sample. This is further evidence that the behavior of HPI items is replicable in different samples.

TABLE 7. Psychometric characteristics of the WRAMC adaptation of the HPI.

Scale        n(a)   # Items     M      SD     alpha     se
Speech        91      27       2.56   0.84     .97     0.14
Intensity     95      18       2.96   0.91     .94     0.22
RAF           89      24       3.21   0.61     .88     0.21
Personal      70      11       2.46   1.01     .91     0.31

(a) Sample size varies because of missing data for some items.

Although the 80-item HPI was intended for local use only, the 90-item Revised HPI (Lamb et al., 1983) was developed using similar techniques. Given that the two abbreviated forms have 54 items in common, the results reported here are probably consistent with what would have been obtained in our population with the Revised HPI.
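Coefficient alpha for a k-item scale is alpha = [k/(k - 1)][1 - (sum of item variances)/(variance of total scores)]. The following Python sketch computes alpha and applies a greedy variant of the selection criterion described above (repeatedly dropping the item whose removal raises alpha most); it uses simulated data and is only an illustration, since the actual revision also weighed item content, means, and standard deviations.

import numpy as np

def cronbach_alpha(items):
    # items: (n_subjects, k_items) array of responses.
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

def prune_to(items, target_k):
    # Greedily drop items to maximize alpha until target_k items remain.
    keep = list(range(items.shape[1]))
    while len(keep) > target_k:
        worst = max(keep, key=lambda j: cronbach_alpha(
            items[:, [i for i in keep if i != j]]))
        keep.remove(worst)
    return keep

rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 1))                       # common factor
items = latent + rng.normal(scale=1.0, size=(100, 10))   # 10 noisy items
print(f"alpha (10 items): {cronbach_alpha(items):.2f}")
kept = prune_to(items, 6)
print(f"alpha (6 kept):   {cronbach_alpha(items[:, kept]):.2f}")

Note that shortening a scale generally lowers alpha (the Spearman-Brown trade-off), which is why selection must balance length against internal consistency; and because the selection capitalizes on chance, alpha should be recomputed on a held-out subsample, exactly as done with the second subsample here.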

Retest Reliability

Although it is difficult to meet the assumptions underlying estimation of reliability from test-retest correlation coefficients, a study of reliability was performed under conditions comparable to those that might be encountered in practice. When patients are referred to Walter Reed by the audiologist at their duty station, a referral audiogram and supporting documentation are forwarded. It would be convenient to have the patient also fill out the inventory before arrival, but then it would be important to know whether scores obtained in the field can be generalized to scores that would be obtained on Day 2 of the rehabilitation program. Because test administration in the field cannot be carefully controlled, both the occasion and the testing conditions are free to vary between test and retest, and the resulting correlations provide conservative estimates of retest reliability.

Between August 1980 and August 1981, patients scheduled for the aural rehabilitation program were mailed a copy of the adapted HPI and asked to complete and return it before their arrival. There were 91 patients who complied with this request, subsequently joined the program, and retook the test under the usual conditions. The retest interval varied from 1 to 99 days. For a variety of reasons, including self-selection, the sample cannot be considered representative of all program patients.

Descriptive statistics, retest correlations, and standard errors of measurement for the four scales are given in Table 8. Reliability is lower and the standard error of measurement is higher than the corresponding values in Table 3. This illustrates the difficulty entailed in generalizing over both occasions and conditions of administration. The retest correlations in Table 8 are typical of reliability coefficients for self-report measures.

TABLE 8. Descriptive statistics, retest reliability, and standard error of measurement for WRAMC adaptation of HPI scales (N = 91).

                          M                SD
Scale        # Items   Test   Retest   Test   Retest   rxx'     se
Speech          18     2.89    2.84    0.76    0.75    .73     0.39
Intensity       12     3.04    3.02    0.81    0.85    .78     0.38
RAF             24     3.14    3.14    0.67    0.70    .71     0.37
Personal        11     2.73    2.58    1.01    1.01    .74     0.51
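The standard errors of measurement in Table 8 follow from the classical formula se = SD x sqrt(1 - rxx'), applied to the test SDs. As a worked check (our computation, not the paper's): for the Speech scale, 0.76 x sqrt(1 - .73) = 0.76 x 0.520 = 0.39, and for the Personal scale, 1.01 x sqrt(1 - .74) = 0.51, both matching the tabled values.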

However, because many more sources of variability were free to operate here than in the usual test-retest study, these values are extremely conservative estimates of HPI retest reliability in general. If the inventory were administered under standardized conditions on two occasions, the correlations would undoubtedly be higher, but then they would not reflect the actual sources of measurement error that must be dealt with in our setting. This underscores the usefulness of having local studies tailored to a clinic's own population and operating procedures.

Short-Term Changes in HPI Scores

One potential use of the HPI scales is to assess changes in a patient's performance over time. To evaluate the ability of the HPI to detect change, a comparison was made between scores at the beginning and end of the aural rehabilitation program. Owens and Fujikawa (1980) have reported significant differences on the HPI between patients who wear hearing aids and those who do not, so it was predicted that the patients' 1 week of experience with their hearing aids would produce a significant improvement in scores, especially on the Speech and Intensity scales. Beneficial effects of the rehabilitation program might also be observed on the RAF and Personal scales, but because those scales deal with behavioral and attitudinal variables, significant change after only 1 week was considered less likely.

A total of 141 patients attending the program between July 1980 and August 1981 received the abbreviated HPI twice. The first administration took place on the morning of Day 2. One week later, on the last day of the program, the test was readministered and patients were instructed to respond in whatever way they now felt was appropriate. That is, they were told that it was not necessary to respond as they had on the first test. Table 9 gives the pretest and posttest means and the correlations between pretest and posttest scores. Significant improvement was observed for all four scales, but the magnitude of the change was much greater for the Speech and Intensity scales than for the RAF and Personal scales.

TABLE 9. Comparison of pretest and posttest scores on four HPI scales (N = 141).

               Pretest          Posttest
Scale          M      SD        M      SD        t          r
Speech        2.75   0.74      2.07   0.74     10.27**    .42**
Intensity     2.88   0.81      1.96   0.74     14.58**    .54**
RAF           3.22   0.68      2.88   0.84      5.19**    .50*
Personal      2.37   0.93      2.14   0.89      3.24*     .56**

*p < .01. **p < .001.
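The t values in Table 9 are consistent with paired (correlated-samples) tests on the pretest-posttest differences. A minimal Python sketch of such an analysis, using simulated scores with roughly the Speech scale's characteristics (the data and parameter values are illustrative assumptions, not the authors' analysis):

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 141
pre = rng.normal(2.75, 0.74, n)        # simulated pretest scores
change = rng.normal(0.68, 0.80, n)     # improvement varies across patients
post = pre - change                    # scores decrease with improvement, as in Table 9

t, p = stats.ttest_rel(pre, post)      # paired t test on the differences
r = np.corrcoef(pre, post)[0, 1]       # pretest-posttest correlation
print(f"t({n - 1}) = {t:.2f}, p = {p:.2g}, r = {r:.2f}")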


Methodologically, it is difficult to assess change when the pretest/posttest period is short and a treatment program intervenes. The potential for artifactual changes in scores is great. If it is assumed that amplification has the greatest effect on the Speech and Intensity scales, it is reasonable to conclude that a change in scores during the first week of hearing aid use reflects a true change in communication performance. Even though it is possible that the behavioral and adjustment counseling patients received also had an immediate impact, the changes are small by comparison and clearly susceptible to response bias. The importance of these results lies in demonstrating that the scales can detect change; interpreting the nature of the change or its underlying cause requires a more carefully controlled experiment.

The difficulty involved in estimating retest reliability from a pretest/posttest study is illustrated by the correlations in Table 9. If scores for all patients had improved by the same amount, the correlations would be at least as high as the retest correlations already reported. The fact that they are lower implies differential change in patients' scores that is not just attributable to measurement error.
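This argument can be illustrated by simulation (an illustrative sketch under assumed parameter values, not an analysis from the paper): a uniform improvement leaves the pretest-posttest correlation at the retest-reliability ceiling, whereas improvement that varies across patients pulls the correlation below that ceiling.

import numpy as np

rng = np.random.default_rng(2)
n = 141
true = rng.normal(2.9, 0.7, n)                  # hypothetical true scores
err = lambda: rng.normal(0, 0.4, n)             # fresh measurement error per occasion

pre = true + err()
post_uniform = (true - 0.7) + err()                       # everyone improves by 0.7
post_varied = (true - rng.normal(0.7, 0.6, n)) + err()    # improvement differs by patient

print(f"uniform change: r = {np.corrcoef(pre, post_uniform)[0, 1]:.2f}")
print(f"varied change:  r = {np.corrcoef(pre, post_varied)[0, 1]:.2f}")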

Long-Term Changes in Scores

A more realistic evaluation of improvement in scores can be obtained by extending the follow-up period. We therefore contacted patients who had taken the original version of the HPI and asked them to respond to the inventory again. The 158-item HPI was mailed to 178 patients who had attended the program between June 1979 and June 1980. Sixteen of the questionnaires were returned as undeliverable, 83 were completed and returned, and 79 were not returned. The response rate was therefore 51.2% (83 of the 162 delivered questionnaires). The follow-up interval ranged from 90 days to 15 months. Table 10 presents the pretest and follow-up data for the six HPI scales. Significant improvement, of similar magnitude, was found for all six scales. This result is consistent with the assumption that a longer follow-up period is required for a meaningful change to appear in the behavioral and personal items.

Both studies of change reported here share certain methodological problems that are common in applied research. For example, if the improvement in scores were interpreted as evidence of the benefits derived from amplification or the effectiveness of the aural rehabilitation program, it would obviously be necessary to include an appropriate control group. Also, the problem of self-selection in the follow-up study would have to be addressed. The value of these data lies in the demonstration that the HPI (and by implication other self-report measures) can detect change in circumstances where it is reasonable to expect that change will occur. Moreover, because the pattern of change conforms to what might be expected on a logical basis, the data provide additional evidence for the construct validity of the scales.


TABLE 10. Comparison of pretest and follow-up scores on six HPI scales (N = 83).

                 Pretest          Follow-up
Scale            M      SD        M      SD       t*
Speech          2.60   0.73      2.18   0.76     5.33
Intensity       2.87   0.82      2.30   0.92     5.87
RAF             3.21   0.57      2.80   0.73     5.93
Social          2.94   0.57      2.53   0.69     5.42
Personal        2.51   1.10      1.98   0.93     4.41
Occupational    2.56   0.56      2.26   0.57     4.58

*p < .001 for all scales.

In summary, the local studies reported in this section were undertaken to evaluate the Hearing Performance Inventory as a communication self-assessment procedure for use with a specific population of hearing-impaired adults. The results confirm that HPI scales have high internal consistency and a straightforward, interpretable factor structure. Retest reliability, under the conditions appropriate to the WRAMC Aural Rehabilitation Program, is acceptable but not as high as could probably be obtained under more standardized conditions. Relationships between HPI scales and audiometric measures support the construct validity of the scales, as does the sensitivity of the scales to both short- and long-term changes in patients' self-reported communication performance. Although the results of these three studies should not be hastily generalized to other patient populations or conditions of test administration, they are consistent with what other researchers have found. Thus, there is converging evidence that the Hearing Performance Inventory and other carefully developed self-assessment inventories provide psychometrically sound information, an obvious prerequisite to their effective clinical use in the assessment and rehabilitation of communication problems.

ACKNOWLEDGMENTS This research was supported by the Department of Clinical Investigation, Walter Reed Army Medical Center, under Work Unit 2526. Computer time was provided by the Computer Science Center of the University of Maryland. Portions of the material were presented by the first author at the Annual Convention of the American Speech-Language-Hearing Association (Wang, 1981). We are grateful to Sue A. Erdman for her collaboration throughout the project and for her contributions to the design and implementation of these studies. We also appreciate the cooperation and assistance of Joanne Crowley, Laura Holum-Hardegen, and Charlene K. Scherr in data collection and processing. We are indebted to several colleagues for their critical comments on an earlier version of the manuscript: L. E. Bernstein, A. C. Catania, A. Cerf-Beare, R. Deluty, S. A. Erdman, R. A. Prosek, R. R. Provine, B. C. Schumann, and R. K. Sedge. The opinions or assertions contained herein are the private views of the authors and are not to be construed as official or as reflecting the views of the Department of the Army or the Department of Defense.

REFERENCES

Allen, M. J., & Yen, W. M. (1979). Introduction to measurement theory. Monterey, CA: Brooks/Cole.
Alpiner, J. G., Chevrette, W., Glascoe, G., Metz, M., & Olsen, B. (1974). The Denver Scale of Communication Function. Unpublished manuscript, University of Denver.
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: Wiley.
Giolas, T. G., Owens, E., Lamb, S. H., & Schubert, E. D. (1979). Hearing Performance Inventory. Journal of Speech and Hearing Disorders, 44, 169-195.
High, W. S., Fairbanks, G., & Glorig, A. (1964). Scale for Self-Assessment of Hearing Handicap. Journal of Speech and Hearing Disorders, 29, 215-230.
Kaplan, H., Feeley, J., & Brown, T. J. (1978). A modified Denver Scale: Test-retest reliability. Journal of the Academy of Rehabilitative Audiology, 11(2), 15-32.
Lamb, S. H., Owens, E., & Schubert, E. D. (1983). The revised form of the Hearing Performance Inventory. Ear and Hearing, 4, 152-157.
Nie, N. H., Hull, C. H., Jenkins, J. G., Steinbrenner, K., & Bent, D. H. (1975). Statistical package for the social sciences (2nd ed.). New York: McGraw-Hill.
Noble, W. G. (1979). The Hearing Measurement Scale as a paper-and-pencil form: Preliminary results. Journal of the American Auditory Society, 5, 95-106.
Noble, W. G., & Atherley, G. R. C. (1970). The Hearing Measurement Scale: A questionnaire for the assessment of auditory disability. Journal of Auditory Research, 10, 229-250.
Nunnally, J. C. (1978). Psychometric theory (2nd ed.). New York: McGraw-Hill.
Owens, E., & Fujikawa, S. (1980). The Hearing Performance Inventory and hearing aid use in profound hearing loss. Journal of Speech and Hearing Research, 23, 470-479.
Ventry, I. M., & Weinstein, B. E. (1982). The Hearing Handicap Inventory for the Elderly: A new tool. Ear and Hearing, 3, 128-134.
Wang, M. D. (1981, November). Self-assessment inventories in audiologic evaluation. Presented at the Annual Convention of the American Speech-Language-Hearing Association, Los Angeles.

Received November 4, 1983
Accepted May 15, 1984

Requests for reprints should be sent to Marilyn E. Demorest, Ph.D., Army Audiology and Speech Center, Walter Reed Army Medical Center, Washington, DC 20307.
