© 2009 Wiley-Liss, Inc.

Birth Defects Research (Part A) 85:864–871 (2009)

New and Improved: The Role of Text Augmentation and the Application of Response Interpretation Standards (Coding Schemes) in a Final Iteration of Birth Defects Warnings Development

Christopher B. Mayhorn1* and Richard C. Goldsworthy2

1 Department of Psychology, North Carolina State University, Raleigh, North Carolina, USA
2 The Academic Edge, Inc, Bloomington, Indiana, USA

Received 24 February 2009; Revised 9 April 2009; Accepted 17 April 2009

BACKGROUND: Several birth defects warning symbols identified as most successful in an earlier study (Mayhorn and Goldsworthy, 2007) were further modified and then evaluated within a nationally distributed field trial (n = 2773). The purpose of the current research was to determine whether symbol warning components could be improved further, whether the addition of text enhanced comprehension uniformly across symbols, and whether results varied by the application of different interpretation standards (coding schemes).

METHOD: A total of 11 warning labels were examined: four new symbols plus the existing baseline symbol, each in versions with and without text, plus a text-only condition. Participant interpretation accuracy and preferences were assessed during face-to-face interview sessions.

RESULTS: For symbol-only conditions, several candidate symbols outperformed the existing symbol, one substantially so. The effect of adding text to symbols varied significantly by symbol. Symbol plus text and text-only conditions performed equivalently, generally exceeded symbol-only conditions, and often surpassed the American National Standards Institute benchmark of 85% accurate interpretation.

CONCLUSIONS: The research effort has identified a teratogen symbol and warning that outperforms the one currently in use. The effort has also identified important pragmatic and conceptual issues that should inform future work to improve medication labeling and other hazard communication.

Birth Defects Research (Part A) 85:864–871, 2009. © 2009 Wiley-Liss, Inc.

Key words: warnings; symbols; teratogen; birth defects; medications

INTRODUCTION

To avoid birth defects and other health-related consequences, it is essential that patients be warned of the potential hazards of exposure to substances with teratogenic properties (Meadows, 2001; Perlman et al., 2001). To attempt to meet this need for increased patient education, awareness campaigns and intervention strategies such as the System to Manage Accutane Teratogenicity (Roche Laboratories, 2001) have been implemented. These efforts have proven useful, but gatekeeping mechanisms designed to guard against a particular hazard sometimes fail, and it ultimately falls to medication labels and packaging to inform those at risk. Previous research has documented that the teratogen warning in use at the time of the study may be confusing to those who encounter it (Daniel et al., 2001). This warning (illustrated in warnings 3 and 4 of Table 1) consists of a symbol showing a circle and a slash mark superimposed over a graphic representation of a pregnant woman with the accompanying text “Do Not Get Pregnant.”† Results reported by Daniel et al. (2001) indicate that only 21% of the women shown the current warning were able to correctly interpret it. As a benchmark for considering these results, the American National Standards Institute (2002) requires that at least 85% of the answers from a sample of 50 or more people should correctly identify the message communicated by a safety symbol. If not, then the symbol should be revised or text added. Thus, the existing symbol, with its text statement, exhibits patient comprehension substantially below what is considered acceptable by the American National Standards Institute (ANSI).

† Since the original Daniel et al. study, manufacturers have replaced this text, “Do not get pregnant,” which we have criticized in multiple venues, with a different text warning.

Supported by grants 1R43DD00001-01 and 1R44DD00001-01 to the second author from the National Center on Birth Defects and Developmental Disabilities (NCBDDD), a part of the Centers for Disease Control and Prevention (CDC). The article’s contents are solely the responsibility of the authors and do not necessarily represent the official views of NCBDDD/CDC.
*Correspondence to: Christopher B. Mayhorn, Department of Psychology, North Carolina State University, 640 Poe Hall, Campus Box 7650, Raleigh, NC 27695-7650. E-mail: [email protected]
Published online 23 June 2009 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/bdra.20601

Table 1. Initial Interpretations by Coding Condition and Warning Label, Coded in Response to: What Do You Think This Label Means?

Columns: (A) group mean; (B) warning 1 (text); (C) warning 2 (no text); (D) warning 3 (text); (E) warning 4 (no text); (F) warning 5 (text); (G) warning 6 (no text); (H) warning 7 (text); (I) warning 8 (no text); (J) warning 9 (text); (K) warning 10 (no text); (L) warning 11 (text only).

Interpretations considered correct, liberal criteria:
(A) 78.9% BCDEFGHIJK; (B) 84.1% EG; (C) 80.4% EG; (D) 86.0% EGIK; (E) 66.4% ABCDFHJL; (F) 83.2% EG; (G) 61.8% ABCDFHIJKL; (H) 89.2% EGIK; (J) 85.2% EG; (L) 84.4% EG; (I) and (K): 74.0% DGH and 74.4% DGH.

Interpretations considered correct, conservative criteria:
(A) 57.5% BCDEFGHIJK; (G) 39.4% ABCDEFHIJKL; (L) 54.4% G; columns B, C, D, E, F, H, I, J, and K (column order not recoverable): 66.8% G, 62.4% G, 54.0% G, 60.8% G, 64.4% G, 61.4% G, 53.6% G, 54.8% G, 60.4% G.

Interpretations considered to be critical confusions:
(A) 1.3%; nonzero warning cells (column order not recoverable): 2.0%, 1.6%, 1.6%, 3.0%, 2.0%, 1.6%, 1.2%, .08%.

Empty cells represent 0%. The upper-case letters appearing with the percentages indicate percentages that are significantly different (p < 0.05) from the percentages for the labels in the referenced columns. For example, symbol 4 (column E) elicited significantly more correct interpretations using conservative coding criteria than symbol 6; that is, it is statistically significantly different (p < 0.05) from G. Warnings 3 (with text) and 4 (without text) represent the teratogen warning that is currently in use. The text statement included in all warnings except those in the symbol-only condition was: “Warning: Causes birth defects. Do not use if you are pregnant or may become pregnant. Consult your doctor or pharmacist.”

Improving the Teratogen Warning via Iterative Design

Given the shortcomings of the existing warning, efforts to improve patient comprehension through iterative design were implemented. Using such a technique, prototype warnings should be developed and tested for comprehension with a sample of the at-risk population. Warnings that do not meet acceptable levels of comprehension should be redesigned based on feedback from earlier test participants and retested for comprehension in an iterative process (e.g., design, test, redesign, test) until a satisfactory level of comprehension is reached. To this end, Goldsworthy and Kaplan (2006a) described a process in which rapid prototyping, expert review, and user-centered design techniques were used to develop alternate teratogen warnings. Later, a field trial that used an ANSI Z535.3 testing procedure solicited open-ended interpretation of candidate symbols from 300 participants (Goldsworthy and Kaplan, 2006b). These initial findings were promising because they revealed that participants’ abilities to correctly interpret the meanings of several of the alternate warnings exceeded that of the existing warning, with several candidates emerging as viable alternatives to the existing warning. The candidates were further refined based on the results, and a second, larger-scale field study (n = 700) was conducted to further validate these alternative warnings (Mayhorn and Goldsworthy, 2007). Results indicated that two of the alternate symbols met the ANSI criterion of 85%, and none exceeded the ANSI limit of 5% critical confusion. In addition, the same two alternate symbols that surpassed the ANSI criteria consistently elicited accurate responding in terms of message interpretation, target audience, intended action, and perceived consequences of ignoring the warning.

Addressing the Questions That Remain

Although these efforts have been informative in terms of highlighting a number of design and methodological issues, further questions remain. For instance, previous efforts by Goldsworthy and Kaplan (2006a) and Mayhorn and Goldsworthy (2007) focused on the enhancement of the symbol component of the teratogen warning: that is, how effective can the symbol be made by modifying the symbol itself. This is a reasonable approach given the well-documented benefits of including a symbol component in warnings, including the ability to attract attention (Kalsher et al., 1996; Sojourner and Wogalter, 1998) and to cue people to perform necessary safety behavior to avoid a hazard (Leonard et al., 1999). However, Mayhorn and Goldsworthy (2007) noted that such refinement, for this particular message and within the design limitations, may be closing in on a “performance plateau,” a level of observed outcomes whereby large gains are no longer possible and incremental gains become increasingly difficult as outcomes approach higher levels of correctness. Thus, one empiric question that remains is whether further symbol refinement can be accomplished within the stipulated design limitations of the effort: that is, can symbol effectiveness be further improved?

A second issue that needs to be addressed is whether the symbol component of a teratogen warning can effectively communicate hazards without the addition of a text component. Evidence suggests that effective text warnings should include a signal word such as “Warning” as well as verbiage to convey the nature of the hazard, instructions to avoid the hazard, and the consequences of not avoiding the hazard (Wogalter et al., 1987; Rousseau and Wogalter, 2006). Thus, it was recognized early in the design process that the text of the existing teratogen warning, “Do not get pregnant,” was problematic and should be changed. To meet the stringent hazard communication requirements described by Wogalter et al. (1987), “Do not get pregnant” was replaced with a more comprehensive and explicit text statement: “Warning: Causes birth defects. Do not use if you are pregnant or may become pregnant. Consult your doctor or pharmacist.” Because the focus of earlier iterations was to refine the symbol component of the warning, the addition of this enhancement to the text component was not explicitly investigated. Previous studies reported by Goldsworthy and Kaplan (2006b) and Mayhorn and Goldsworthy (2007) did not explicitly test the effects on comprehension that resulted from the addition of the enhanced text to the symbol component. Instead, participants in these previous iterations were asked to report whether they believed that the inclusion of text would change their interpretations; however, labels with and without text were not systematically compared. A second question is, therefore, what is the effect of including text with symbols and vice versa?
For both of these questions, determining effectiveness is a complex issue. As Goldsworthy and Kaplan (2006b) indicate, determining whether a respondent’s answer is correct entails, ideally, not only what is said but what is left unsaid and what is highly likely to be in the minds of the respondents based upon what is actually uttered. Previous efforts illustrated the difficulty of developing qualitative coding schemes to interpret the quality of participant responses. Although established methods have been documented to maintain the rigor and credibility of qualitative coding (Cohen and Crabtree, 2008; Krueger and Casey, 2000), issues such as the ambiguities of code definition and overinterpretation of evidence act to maintain the perception that qualitative data analysis is as much an art as it is a science. In the previous Goldsworthy and Kaplan studies (2006a, 2006b), a strict, “conservative” coding scheme was used whereby a response was considered correct if and only if it could be coded to a limited set of correct responses. If a response could be considered as entailing or related to a correct response, but not actually matching a correct response, it was not considered a correct interpretation. Thus, another goal of the current project was to investigate differences in coding whereby conservative and liberal estimates of patient comprehension could be compared to determine whether the same trends in the data were observed. A third question was, therefore, what effect does a liberal coding scheme have on outcomes?

Regarding these empiric issues, the current investigation, a culminating activity in this series of efforts, should further enlighten teratogen warning design efforts by answering these questions. Four alternative symbols were further refined based on the previous results, and participant comprehension of these four alternate symbols was compared to that of the existing teratogen symbol. To directly determine the effects of the addition of the enhanced text component to a teratogen warning, separate groups of participants encountered each of the five symbols with and without the text component and were also compared to a text-only condition; thus, 11 warning labels were evaluated. Finally, qualitative responses were coded using both a conservative and a more liberal definition of warning interpretation correctness to determine whether results were comparable. In addition to answering the three questions mentioned previously, the research provides a recommendation for selecting a warning label from among the candidates.

MATERIALS AND METHODS

Participants

Eleven warning labels were evaluated by 2773 participants (a minimum of 250 per label) in a nationally distributed field trial that used one-on-one interviews conducted in malls and other public places. Interview sessions for each participant lasted approximately 25 minutes. The ANSI Z535.3 standard suggests a minimum of 50 participants per warning label; therefore, the current study’s use of at least 250 participants per label represents an effort to improve the generalizability of the results. Participants were recruited from 10 geographically and demographically diverse locations (Atlanta, GA; Cleveland, OH; Dallas, TX; Greenville, SC; Miami, FL; Los Angeles, CA; Philadelphia, PA; Phoenix, AZ; South Bend, IN; and Tacoma, WA). Efforts were made to recruit diverse participants at each location by using a stratification quota for adolescents, males, and Hispanics of 20%. Inclusion targets for other racial and ethnic groups mirrored the overall 2000 U.S. Census levels. Participants were randomly assigned one of the 11 warning labels at each location.

Participants ranged in age from 12 to 45 years, with 21.4% of the sample categorized as adolescents aged 12 to 17 years and the remainder evenly distributed across the age range. This age range was selected to make the current data comparable to those collected in previous studies (e.g., Mayhorn and Goldsworthy, 2007) that investigated warning comprehension by women of childbearing age. Forty-two percent of the sample population was male. Self-reports of racial identity indicated that the sample was diverse: 49.6% Caucasian, 21.6% African American, 21.4% Hispanic, and 5.4% Asian. Ninety-three participants (3.4% of the sample) indicated that they were currently taking or had previously taken isotretinoin (Accutane), whereas 42 (1.5%) indicated that they had used or were currently using finasteride (Propecia). Thus, it could be argued that this sample was largely unfamiliar with prescription medications that have teratogenic properties.

To assess health literacy, a trained interviewer administered the Rapid Estimate of Adult Literacy in Medicine (REALM), in which participants were asked to read a list of healthcare-related terms aloud (Davis et al., 1993). Pronunciation for each term was evaluated and rated as either correct or incorrect, and participants were classified as demonstrating either low or high health literacy based on their overall score. Results classified 66.1% of the sample as low health literate. This figure approximates national estimates among American adults (Davis et al., 1998).
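The random assignment of participants to the 11 warning labels at each location could be sketched as follows. The text states only that assignment was random; the blocked scheme used here (shuffled blocks of 11, which keeps group sizes within one of each other), the function name `assign_labels`, the seed, and the site size of 277 are all assumptions made for illustration.

```python
import random
from collections import Counter

LABELS = list(range(1, 12))  # warning labels 1..11, numbered as in the study


def assign_labels(n_participants, rng):
    """Assign one label per participant using shuffled blocks of 11.

    Blocked randomization is an assumption; the article does not say
    which randomization scheme was used.
    """
    assignments = []
    while len(assignments) < n_participants:
        block = LABELS[:]      # one copy of every label
        rng.shuffle(block)     # random order within the block
        assignments.extend(block)
    return assignments[:n_participants]


rng = random.Random(2009)                    # fixed seed, only for reproducibility
site_assignments = assign_labels(277, rng)   # 277 is a hypothetical site size
counts = Counter(site_assignments)
```

With this scheme, no label at a site can be shown to more than one extra participant relative to any other label, which is one simple way to approach the per-label minimum the study reports.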

Materials

Based on the warning development methodology described by Goldsworthy and Kaplan (2006a, 2006b), several candidate symbols were evaluated during a national field trial (Mayhorn and Goldsworthy, 2007). The trial results were paired with evaluations from experts in the areas of warnings and teratology to inform the further generation and revision of alternate teratogen warnings. Four newly refined symbols emerged as candidates for further evaluation. To these four, the original benchmark symbol was added, yielding five symbols for evaluation. To directly address the question of the effect of text on warning interpretation, a text (symbol + text) and a no-text (symbol-only) condition were created for each symbol, as well as a text/no-symbol (text-only) condition, for a total of 11 warning label conditions. The text component for all warnings except those in the symbol-only condition was “Warning: Causes birth defects. Do not use if you are pregnant or may become pregnant. Consult your doctor or pharmacist.” Odd-numbered warnings are symbol + text; even-numbered warnings are symbol-only. Paired warnings 1 and 2, 3 and 4, through 9 and 10, use the same symbol with and without text. See Table 1 for all warnings.

The symbol for warnings 1 and 2 is a multipanel design that captures both the prohibition (the act of taking something during pregnancy) and consequence (injury or death to the fetus). Warnings 3 and 4 use the baseline control symbol that is currently in use. The symbol in warnings 5 and 6 highlights the consequences of taking the medication during pregnancy. Warnings 7 and 8 use a variation on the original teratogen warning symbol (used in warning 3) that clearly shows the prohibited behavior of taking something. Warnings 9 and 10 attempt to integrate the prohibition and consequence concepts within a single-panel symbol. Last, warning 11 uses the elaborated text statement only, for control purposes.
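The numbering convention described above (odd warnings carry text, even warnings do not, paired warnings share a symbol, and warning 11 is text only) can be written down as a small lookup. This is an illustrative sketch: the names `Condition` and `condition_for` are hypothetical, and the symbol index 1 to 5 follows the pairing order of the warnings rather than any identifier defined by the authors.

```python
from typing import NamedTuple, Optional


class Condition(NamedTuple):
    warning: int            # warning number, 1..11
    symbol: Optional[int]   # symbol index 1..5, or None for the text-only label
    has_text: bool          # True when the text statement accompanies the label


def condition_for(warning: int) -> Condition:
    """Map a warning number to its symbol and text status."""
    if not 1 <= warning <= 11:
        raise ValueError("warning must be between 1 and 11")
    if warning == 11:
        return Condition(warning, None, True)   # text-only control
    symbol = (warning + 1) // 2                 # warnings 1-2 -> 1, ..., 9-10 -> 5
    has_text = warning % 2 == 1                 # odd = symbol + text, even = symbol only
    return Condition(warning, symbol, has_text)


conditions = [condition_for(w) for w in range(1, 12)]
```

For example, `condition_for(4)` describes the existing symbol (index 2) shown without text.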

Procedure

Symbol interpretation. Trained researchers at each location approached participants individually and invited them to participate. Approximately 17% of the sample reported speaking a language other than English at home. Because the large majority of these individuals reported speaking Spanish as a primary language, they completed the interview process in Spanish using a script that underwent a rigorous translation procedure. However, it should be noted that all warning text appeared in English and this information was translated by researchers upon request from participants. Participants were told that the project involved evaluation of the effectiveness of warning labels. To avoid altering participants’ interpretations, they were not told about the nature of the warning label or the hazard. Each participant was shown a picture of how the warning would appear in three contexts: in label format (at the actual size seen on a prescription bottle), on a pill bottle, and on a blister pack (Fig. 1). Each participant was then asked, based on the ANSI open-ended comprehension procedure, “What do you think this label means?” and was prompted for additional statements. Responses were written, repeated, and probed to ensure the accuracy of the recording procedure. The researchers then read participants a brief statement about teratogens and their risk to women who are or may become pregnant. Participants were told that the purpose of the study was to evaluate a set of warning labels intended to convey teratogenic risk, specifically:

By now, you may have figured out that this research is about creating warnings and labels that help prevent birth defects. Substances that can cause harm to unborn babies can be found in various medications, including some that treat acne, high blood pressure, male pattern baldness, and different forms of cancer. We want to make sure that medication warning labels grab attention and are easy to understand. Someone who looks at this label should understand that the medication the label is on is harmful to an unborn baby and should not be taken during or in some cases even before a pregnancy.

Figure 1. Multipanel warning.

Warning symbol preference. Participants were then shown an illustration of all five symbol + text conditions arrayed in a circle. They were asked to choose the one warning that “grabbed your attention most” and to indicate “which one is easiest to understand.” Justifications were recorded. Last, participants completed a brief survey that gathered demographic information, including marital status, pregnancy status, childbearing intentions, level of attained education, income, employment status, and primary language spoken in the home. Following completion of the survey, participants were asked whether they had any questions or concerns and whether they had any suggestions or comments for changing any of the symbols or labels to make them more effective. Participants were paid $20 for their time.

Analysis

Because responses to the initial open-ended interpretation questions were qualitative in nature, a coding scheme based on the results of Mayhorn and Goldsworthy (2007) was developed to assess the correctness of participant responses. Three trained research assistants assigned a primary code to each response. The code was used to group responses consistent with an observed theme. For example, the response “Don’t take the medication during the pregnancy” was assigned a primary code that placed it with other responses consistent with the theme “Do not take if pregnant.” Next, secondary codes were assigned to capture whether each response was “correct” or “incorrect.” Generally, an interpretation was considered correct if participants provided either a correct prohibitive action or a correct consequence. Using a more liberal coding scheme initially, responses were counted as correct if they included a variety of themes such as “Do not take if you plan on becoming pregnant,” “Causes harm to the fetus,” or “Harmful to pregnant woman.” Later, a more conservative estimate of correctness was obtained by counting responses as correct only if they specifically included the themes “Do not take if you are pregnant” or “Causes birth defects” and did not include any critical confusion responses. Responses were coded as incorrect if they were off topic, misleading, or otherwise inaccurate (e.g., “take for upset stomach,” “it’s a urine test”). Correct participants were counted only one time regardless of the number of correct responses they may have provided (e.g., a participant who responded that the warning meant “don’t take while pregnant, well, or if you might become pregnant, because it could cause birth defects” would be counted as correct one time).

Last, a tertiary code was assigned for the purpose of determining whether a response could be categorized as a critical confusion, that is, an interpretation that may lead directly to unsafe behavior or behavior that is contrary to the intended meaning of the warning (e.g., “this medicine keeps you from having a baby,” “take for pregnancy,” or “help baby”). To determine inter-rater reliability (Stewart and Shamdasani, 1990), the percentage agreement between the judges was calculated for five percent of the responses. Agreement ranged from 83.5 to 95.2%. This finding indicated that the coding schemes were sufficiently well defined to reliably code participants’ responses. Demographic data, the resulting frequency data from interpretation categories, and preference ratings were entered into SPSS (SPSS, Inc., Chicago, IL). Percentages were calculated for all data. Inferential statistical procedures (e.g., analyses of variance, t tests, chi-square nonparametric tests) were conducted to compare interpretation data and preference ratings by warning label with α = 0.05. For the demographic data, correlational analyses were conducted to determine how variables such as sex and ethnicity influenced warning interpretation accuracy.

Table 2. Aggregate Correctness of Interpretations Coded in Response to “What Do You Think This Label Means?”

Comparison group        Mean (A)     Symbol + text condition (B)   Symbol-only condition (C)   Text-only condition (D)
Liberal criteria        78.9% BCD    85.5% AC                      71.3% ABD                   84.4% AC
Conservative criteria   57.5%        58.5%                         57.0%                       54.4%
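The scoring rules and the reliability check described in the Analysis section can be sketched in a few lines. This is a minimal illustration, not the study's codebook: the theme identifiers paraphrase the examples quoted in the text, the function names are hypothetical, and the sketch assumes that the liberal scheme does not exclude critical confusions because the article states that exclusion only for the conservative scheme.

```python
# Theme identifiers paraphrase examples from the text; the real codebook is not given.
CONSERVATIVE_THEMES = {"do_not_take_if_pregnant", "causes_birth_defects"}
LIBERAL_THEMES = CONSERVATIVE_THEMES | {
    "do_not_take_if_planning_pregnancy",
    "causes_harm_to_fetus",
    "harmful_to_pregnant_woman",
}
CRITICAL_CONFUSIONS = {"prevents_pregnancy", "take_for_pregnancy", "helps_baby"}


def score_participant(themes):
    """Score one participant from the set of primary codes given to their responses.

    Each participant is counted at most once per scheme, however many
    correct themes they mention.
    """
    themes = set(themes)
    confused = bool(themes & CRITICAL_CONFUSIONS)
    return {
        "liberal": bool(themes & LIBERAL_THEMES),
        # Conservative: a core theme is present and no critical confusion is present.
        "conservative": bool(themes & CONSERVATIVE_THEMES) and not confused,
        "critical_confusion": confused,
    }


def percent_agreement(codes_a, codes_b):
    """Percentage of responses to which two coders assigned the same code."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("code lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return 100.0 * matches / len(codes_a)
```

A response coded only as `causes_harm_to_fetus` would count as correct under the liberal scheme but not under the conservative one, which is the kind of case that separates the two rows of Table 2.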

RESULTS

Symbol Interpretation

Table 1 shows the distribution of correct respondents by warning label condition and by coding scheme.

Liberal Interpretation of Correctness

Using the more liberal definition for correctness, the number of participants who were able to produce correct interpretations was calculated for each symbol. No warning within the symbol-only condition significantly exceeded the ANSI target of 85%. However, warning 2 (a multipanel design) performed significantly better than several others, including the symbol currently in use (warning 4). Several symbol-only conditions exceeded a less stringent benchmark level set forth by the International Organization for Standardization (1984) that requires a 67% rate of comprehension for safety symbols to be judged as acceptable. Not surprisingly, adding text significantly raised the levels of correct interpretation for all symbols above the levels of interpretation observed for the symbols alone. There appears to be an interaction between the effect of text and the symbol to which the text is added. The smallest increase was observed for symbol 1 (multipanel), and the largest was observed for symbol 3 (consequences-only symbol). Of the warnings in the symbol + text condition, three (warnings 3, 7, and 9) exceeded the ANSI benchmark. The text-only condition approached the ANSI benchmark of 85%.

To assess the effect of text generally, the percentages of respondents who correctly interpreted the meaning of warnings 1, 3, 5, 7, and 9 were collapsed to produce a symbol + text condition, whereas warnings 2, 4, 6, 8, and 10 were collapsed to produce a symbol-only condition. As Table 2 illustrates, these two conditions were compared to a third condition composed of the percentage of respondents who correctly interpreted warning 11, which was the text-only control. The symbol + text condition produced the highest percentage of correct interpretations, followed closely by the text-only condition; however, both conditions significantly varied from the symbol-only condition. The aggregate symbol + text condition was the only condition that exceeded the ANSI 85% criterion for correct interpretation.
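Whether an observed percentage significantly exceeds the 85% benchmark, as opposed to merely exceeding it numerically, can be checked with a one-sided test. The sketch below uses an exact binomial upper-tail probability; this is one reasonable choice and an assumption, since the article does not say which test was applied to the benchmark. The function names and the example counts (215 of 250) are hypothetical.

```python
from math import comb


def binomial_upper_tail(successes, n, p0):
    """Return P(X >= successes) for X ~ Binomial(n, p0)."""
    return sum(
        comb(n, k) * p0**k * (1.0 - p0) ** (n - k)
        for k in range(successes, n + 1)
    )


def exceeds_benchmark(successes, n, p0=0.85, alpha=0.05):
    """True when the observed count is significantly above the benchmark p0."""
    return binomial_upper_tail(successes, n, p0) < alpha


# Hypothetical example: 215 correct interpretations out of 250 respondents (86%).
p_value = binomial_upper_tail(215, 250, 0.85)
significant = exceeds_benchmark(215, 250)
```

In this hypothetical case the observed rate is above 85%, yet the test does not reject the benchmark value, which illustrates the distinction between numerically and significantly exceeding a criterion.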

Conservative Interpretation of Correctness

Although the previous calculations of the percentage of correct responses included a variety of potential themes, a much more conservative estimate of correct warning interpretation would be an examination of the responses that strictly included either the prohibited action (“Do not take if you are pregnant”) or the consequences (“Causes birth defects”). The second half of Table 1 provides the conservative percentage of correct interpretation for each warning. The adoption of this more conservative measure of correctness resulted in none of the warning conditions exceeding the ANSI 85% criterion, a result that closely mirrors the outcomes of the previous Mayhorn and Goldsworthy (2007) study. Interestingly, especially when compared to the liberal coding, text-augmented and text-only conditions performed statistically similarly to symbol-only conditions; there was only one statistically significant difference across symbols. Examination of raw percentages indicates that text conditions underperformed their symbol-only counterparts. As shown in Table 2, further analysis collapsing the conservative measure of correct interpretation into symbol + text, symbol-only, and text-only conditions confirmed this by revealing that, when conservatively coding for correctness, symbol + text produced the highest level of correct interpretation, but this did not significantly vary from the other conditions.

Figure 2. Most preferred symbol (without text). Response to: Which one do you think is the most effective in getting noticed and being understood correctly?

Symbol Preference

When shown a set of all five symbols (labeled as 1-5 in Fig. 2) with identical text corresponding to warnings 1, 3, 5, 7, and 9, participants most frequently chose symbol 1 (50.1%), which is the multipanel design used in warnings 1 and 2 that includes both a prohibition symbol and a consequences symbol, as most likely to grab their attention (Fig. 2). When asked which label was the easiest to understand, participants again overwhelmingly chose symbol 1 (57.7%).

Demographic Analyses

Correlation coefficients between warning interpretation accuracy and a variety of demographic characteristics including gender, age, ethnicity, primary language, and health literacy were computed. One set of analyses examined the relationship between demographic variables and application of the liberal coding scheme, whereas a second set of correlations was calculated for the conservative coding scheme. Across coding conditions, health literacy was positively correlated with interpretation accuracy such that higher scores on the REALM were associated with an increased likelihood of correctly interpreting the warning content; r(2718) = 0.11 for conservative and r(2718) = 0.19 for liberal. Weaker, yet statistically significant, correlations between interpretation accuracy and ethnicity indicated that the likelihood of correct message interpretation varied by this grouping variable; r(2763) = 0.05 for conservative and r(2763) = 0.12 for liberal. Examination of the means for the conservative coding criteria indicated that Caucasian participants (M = 63%) were more successful at interpreting the warnings than African American (M = 50%) and Hispanic (M = 55%) participants, yet examination of the liberal coding criteria indicated that African Americans outperformed Hispanics. This relationship might be illuminated by the correlational relationship between liberal warning interpretation and primary language, which indicates that those whose primary language is English rather than Spanish were more likely to correctly interpret the warning content; r(2723) = 0.12. Interestingly, the correlation between primary language and conservative message interpretation was not significant. Also, consistent with previous results described in Mayhorn and Goldsworthy (2007), gender differences did not emerge in any age group when overall interpretation accuracy was examined. Likewise, age did not emerge as a consistent predictor of symbol interpretation.
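The reported correlations are small but are described as statistically significant, which follows from the large sample. A correlation r with df degrees of freedom converts to a t statistic as t = r * sqrt(df / (1 - r^2)). The sketch below applies that standard conversion; the two-sided p-value uses a normal approximation, which is an assumption that is reasonable only because df is in the thousands here, and the function names are hypothetical.

```python
from math import erfc, sqrt


def t_from_r(r, df):
    """Convert a Pearson correlation and its degrees of freedom to a t statistic."""
    return r * sqrt(df / (1.0 - r * r))


def two_sided_p_normal(t):
    """Two-sided p-value from a normal approximation (adequate for large df only)."""
    return erfc(abs(t) / sqrt(2.0))


# Values as reported in the text for the conservative coding scheme.
t_literacy = t_from_r(0.11, 2718)    # health literacy, r(2718) = 0.11
t_ethnicity = t_from_r(0.05, 2763)   # ethnicity, r(2763) = 0.05
p_literacy = two_sided_p_normal(t_literacy)
p_ethnicity = two_sided_p_normal(t_ethnicity)
```

Under these assumptions even r = 0.05 falls below the 0.05 significance threshold, which is consistent with the article's description of the ethnicity correlation as weaker yet significant.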

DISCUSSION It would be convenient to return to our original questions and address them individually; however, they are tightly interwoven. Can symbol effectiveness be further improved, what is the impact of text, what is the impact of the liberalness of the coding scheme for correctness, and, in the end, which symbol performed best? The presence of text in a warning strongly interacts with the coding scheme that is used to determine whether the participants’ interpretations are correct. When a strict, conservative coding scheme is used, all symbols, with the exception of that used in warnings 5 and 6, performed statistically equivalently to one another regardless of the presence or absence of a text message. In fact, the text message by itself performed as well as the symbols alone or the symbol 1 text combinations. This finding implies that with the refined symbols and text message, each warning condition is essentially effectively conveying the same message to a specific proportion of the sample (approximately 60%). Because none of the warnings exceeded the 5% ANSI criteria for critical confusion, they are all equally effective in that regard as well. Therefore, it could be argued that any of the warnings, including text-only and excluding warning 6, are equally effective and could be used interchangeably. For correctness of interpretation, it does not matter whether participants saw a well-designed symbol, a well-designed text message, or both; when evaluated on the most conservative metric, they consistently comprehend the message at the same rate. But there is more to the story. When a more liberal coding scheme is used, one that allows for the richness of possible correct responses, correct interpretation not only goes up for all 11 warning conditions, which is itself hardly surprising because the liberal criteria makes it easier to be correct, but the increase varies both by symbol and by presence of text. 
That is, for the more liberal interpretation of correctness, some warnings achieve higher levels of correctness than others and, when comparing the text to the no-text versions of the same symbol, the amount of increase varies by the symbol: not all
symbols are equally affected by the addition of text. For example, comparing the symbol used in warnings 1 and 2 (multipanel) to the symbol used in warnings 3 and 4 (the existing symbol), we find that the symbols with text perform equivalently (warning 1, multipanel, 84.1%; warning 3, existing symbol, 86.0%); however, the parallel warnings without text performed substantially differently (warning 2, multipanel, 80.4%; warning 4, existing symbol, 66.4%). Similar results were observed across most symbols. Therefore, although warnings 1 and 3 perform similarly, it is clear that the ability of the symbol in warnings 1 and 2 to perform well both with and without text makes it a better candidate for adoption than the existing symbol. Interestingly, the text-only warning also performed well in terms of correct interpretation, indicating that, as far as people getting the right idea from the warnings, text suffices; however, text may not be optimal for attention getting.

Turning to the issue of benchmarks, no warning condition statistically significantly exceeded the 85% benchmark for message comprehension required by ANSI regardless of the coding scheme used; however, when the liberal scheme was used, several numerically approached or exceeded the benchmark. This was true even for those warnings that included or were limited to the text statement, which is interesting because the ANSI criterion applies to a symbol used without text; that is, if a symbol does not meet the criterion of 85% correctness, then text should be added. However, even with a liberal coding scheme and text, the results failed to surpass 85% correct on a statistically significant basis. Does this mean the warnings are all ineffective and should not be used or that 85% is a poor benchmark?
We do not think so; however, it does indicate the difficulty of specifying and rigorously assessing correctness, especially when the concepts to be communicated are rich, complex, and capable of eliciting an array of related meanings. The research suggests how arbitrary this benchmark might be when applied without discussion of conceptual and methodological issues. Researchers may benefit from analytical guidance when attempting to meet benchmark requirements. Moreover, whereas the ANSI Z535 testing guidelines stress the experimental testing of symbols alone, this nonspecific procedural advice does not address pragmatic aspects of testing symbol and text components of warnings in a controlled fashion, as demonstrated in the current work. Thus, researchers and regulatory agencies such as the U.S. Food and Drug Administration might benefit from discourse that results in benchmark requirements accompanied by standardized experimental procedures and analytic techniques.

There is also some correlational evidence suggesting that certain demographic variables such as health literacy, ethnicity, and primary language differentially affect teratogen message comprehension. This confirms previous studies indicating that participant characteristics may play an important role in message interpretation. Interestingly, the present study also indicates that such characteristics may interact with the correctness coding scheme: our research suggests that how we assess correctness not only affects what the outcomes are but does so differently for different target audiences.

Figure 3. Most effective symbol, as shown on typical warning label with hazard text.

Finally, the most pragmatic question: has the research led to improved symbols and which symbol is best? Several candidate symbols outperform the existing one on several benchmarks. Considering the evidence available, the multipanel symbol (Fig. 3, used in warnings 1 and 2) currently appears to be the best one. It performs equivalently to others when using conservative correctness criteria, it performs equivalently to or outperforms other symbols when presented in the presence of text, and it performs better than the existing symbol in the absence of text. With text (84%) and without text (80%), it approaches the ANSI benchmark (85%) and statistically significantly exceeds the less stringent ISO benchmark. Moreover, participants clearly prefer it as more understandable and noticeable.
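To make the benchmark comparison concrete, the following is a minimal sketch of one way to ask whether an observed accuracy statistically significantly exceeds the 85% ANSI benchmark, using a one-sided z-test with a normal approximation. This is an illustration only and not necessarily the test used in the study. The function name `exceeds_benchmark` and the per-condition sample size of 250 are assumptions introduced here; only the four accuracy percentages are taken from the Discussion.

```python
import math


def exceeds_benchmark(p_hat, n, benchmark=0.85, alpha=0.05):
    """One-sided z-test (normal approximation) of H0: p <= benchmark.

    Returns (z, one-sided p-value, True if H0 is rejected at alpha).
    """
    # Standard error under the null hypothesis p = benchmark.
    se = math.sqrt(benchmark * (1.0 - benchmark) / n)
    z = (p_hat - benchmark) / se
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, p_value, p_value < alpha


# Liberal-scheme accuracies reported in the Discussion.
reported = {
    "warning 1 (multipanel + text)": 0.841,
    "warning 2 (multipanel, no text)": 0.804,
    "warning 3 (existing + text)": 0.860,
    "warning 4 (existing, no text)": 0.664,
}

# Hypothetical per-condition sample size; the actual value is not given here.
N_PER_CONDITION = 250

for label, p_hat in reported.items():
    z, p, significant = exceeds_benchmark(p_hat, N_PER_CONDITION)
    print(f"{label}: z = {z:+.2f}, one-sided p = {p:.3f}, exceeds 85%: {significant}")
```

Under the assumed sample size, an observed 86.0% sits numerically above the benchmark but well within sampling error of it, which illustrates the distinction drawn above between numerically exceeding a benchmark and statistically significantly exceeding it.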

Conclusions

From its initial procedural description in Goldsworthy and Kaplan (2006a) through the intermediary field trials that tested candidate symbols (Goldsworthy and Kaplan, 2006b; Mayhorn and Goldsworthy, 2007) to the current work that documents the culmination of the iterative design and evaluation methodology, this line of research not only suggests that the goal of designing a more effective warning symbol and warning label has been met, but also should inform others interested in birth defects prevention and, more generally, warning and hazard message design for public health. Specifically, coding schemes for interpretation correctness interact with modifications to the warnings (i.e., the inclusion of text) and with participant characteristics. Text affects symbols differentially. Selection of a symbol should consider symbol performance both with and without text together, not simply one or the other. Several symbols, in both symbol + text and symbol-only formats, performed better than the existing symbol. When examining the totality of evidence, a multipanel symbol appears most effective and should be considered as a replacement for the symbol currently in use. For pragmatic and cost-efficiency reasons, the effort stayed within a fairly tightly defined design space (i.e., a black and white format approximating existing warning labels [Goldsworthy and Kaplan, 2006a]). Others could reasonably expand beyond these limitations and examine how such expansion affects the observed performance plateau. In addition, the series of studies has raised important issues regarding interpretation accuracy and its relationship to participant characteristics. These issues deserve considerably more attention. Finally, the availability of a symbol and symbol + text warning that outperforms the ones currently in use on many teratogenic pharmaceuticals should be of interest to the U.S.
Food and Drug Administration, as the governing body that regulates how warning information for prescription pharmaceuticals is communicated to American consumers (Ostrove, 2006), and to manufacturers of potentially teratogenic products, for whom increases in communication effectiveness may decrease exposures and concomitant liability. These organizations, and others interested in creating or revising more effective warnings, should find the design and validation issues identified in our research informative to their own communications efforts.

REFERENCES

American National Standards Institute. 2002. Criteria for safety symbols, Z535.3-Revised. Washington, D.C.: National Electrical Manufacturers Association.
Cohen DJ, Crabtree BF. 2008. Evaluative criteria for qualitative research in health care: controversies and recommendations. Ann Fam Med 6:331–339.
Daniel K, Goldman K, Lachenmayr S, et al. 2001. Interpretations of a teratogen warning symbol. Teratology 64:148–153.
Davis TC, Long SW, Jackson RH, et al. 1993. Rapid estimate of adult literacy in medicine: a shortened screening tool. Fam Med 25:391–395.
Davis TC, Michielutte R, Askov EN. 1998. Practical assessment of adult literacy in health care. Health Educ Behav 25:613–624.
Goldsworthy RC, Kaplan B. 2006a. Warning symbol development: a case study on teratogen symbol design and evaluation. In: Wogalter MS, editor. Handbook of warnings. Mahwah, NJ: Lawrence Erlbaum Associates. pp. 739–754.
Goldsworthy RC, Kaplan B. 2006b. Exploratory evaluation of several teratogen warning symbols. Birth Defects Res A Clin Mol Teratol 76:453–460.
International Standards Organization. 1984. International standard for safety colours and safety signs: ISO 3864. Geneva, Switzerland: International Standards Organization.
Kalsher MJ, Wogalter MS, Racicot BM. 1996. Pharmaceutical container labels and warnings: preference and perceived readability of alternative designs and pictorials. Int J Ind Ergonomics 18:83–90.
Krueger RA, Casey MA. 2000. Focus groups: a practical guide for applied research. 3rd ed. London: Sage.
Leonard SD, Otani H, Wogalter MS. 1999. Comprehension and memory. In: Wogalter MS, DeJoy DM, Laughery KR, editors. Warnings and risk communication. London: Taylor and Francis. pp. 149–187.
Mayhorn CB, Goldsworthy RC. 2007. Refining teratogen warning symbols for diverse populations. Birth Defects Res A Clin Mol Teratol 79:494–506.
Meadows M. 2001. The power of Accutane. The benefits and risks of a breakthrough acne drug. FDA Consum 35:18–23.
Ostrove NM. 2006. Warning information in the labeling and advertising of pharmaceuticals. In: Wogalter MS, editor. Handbook of warnings. Mahwah, NJ: Lawrence Erlbaum Associates. pp. 515–528.
Perlman SE, Leach EE, Dominguez L, et al. 2001. “Be smart, be safe, be sure”. The revised Pregnancy Prevention Program for women on isotretinoin. J Reprod Med 46(2 Suppl):179–185.
Roche Laboratories. 2001. System to manage Accutane related teratogenicity (S.M.A.R.T.): Guide to best practices. http://www.fda.gov/cder/drug/infopage/accutane/
Rousseau GK, Wogalter MS. 2006. Research on warning signs. In: Wogalter MS, editor. Handbook of warnings. Mahwah, NJ: Erlbaum. pp. 147–158.
Sojourner RJ, Wogalter MS. 1998. The influence of pictorials on the comprehension of and recall of pharmaceutical safety and warning information. Int J Cognitive Ergonomics 2:93–106.
Stewart DW, Shamdasani PN. 1990. Focus groups: theory and practice. London: Sage.
Wogalter MS, Godfrey SS, Fontenelle GA, et al. 1987. Effectiveness of warnings. Hum Factors 29:599–612.
