British Journal of Obstetrics and Gynaecology, December 1999, Vol. 106, pp. 1307-1310

Inconsistencies in classification by experts of cardiotocograms and subsequent clinical decision

*D. Ayres-de-Campos Specialist Registrar, *J. Bernardes Professor and Consultant (Obstetrics and Gynaecology), †A. Costa-Pereira Professor, *L. Pereira-Leite Professor and Head of Department

*Department of Obstetrics and Gynaecology, Porto Faculty of Medicine, S. João Hospital; †Department of Biostatistics and Medical Information, Porto Faculty of Medicine, Porto, Portugal

Inter-observer agreement in the interpretation of 33 cardiotocographic tracings by experts according to the FIGO guidelines, and in the subsequent clinical decision, was evaluated using the kappa statistic (κ) and the proportions of agreement (Pa). Overall agreement in the classification of tracings was fair (κ = 0.48) and was better for normal tracings (Pa = 0.62) than for suspicious (Pa = 0.42) or pathologic tracings (Pa = 0.25). Overall agreement on clinical decision was slightly higher (κ = 0.59), but was mostly centred on the decision to take ‘no action’ (Pa = 0.79). Experts especially disagreed over the decisions to ‘monitor closely’ (Pa = 0.14) or to ‘intervene immediately’ (Pa = 0.38). These limitations should be taken into account in clinical audits and in medical jurisprudence.

Correspondence: Dr D. Ayres-de-Campos, Departamento de Ginecologia e Obstetrícia, Faculdade de Medicina do Porto, Alameda Prof. Hernâni Monteiro, 4200 Porto, Portugal. © RCOG 1999 British Journal of Obstetrics and Gynaecology

Introduction

Auditing of cases that result in poor fetal outcome, with re-evaluation of cardiotocographic (CTG) tracings and clinical decisions, is common practice in most Western European obstetrical centres. Joint analysis of such cases frequently results in the consensus that an abnormal pattern was present and that an earlier intervention was warranted. Expert witnesses often reach similar conclusions in medico-legal cases, but with far more serious implications. Consensus is reportedly much less common in studies evaluating the reproducibility of fetal heart rate analysis. Conditions for interpretation of fetal heart rate tracings are different in those studies, as independent analysis is performed and fetal outcome is unknown, and these two factors may strongly influence agreement.

Most studies evaluating the reproducibility (reliability, repeatability, intra- and/or inter-observer agreement) of CTG classification were published in the late 1970s and 1980s [1,2]. They involved observers with different experience of CTG analysis, as well as different systems for interpreting tracings. It has been suggested that both the classification system chosen (namely, the number of categories it admits) and the experience of the clinicians involved may play important roles in agreement. Many classification and scoring systems are still used today throughout the world, but the International Federation of Gynaecology and Obstetrics (FIGO) guidelines for fetal monitoring [3] probably represent the widest consensus yet reached in this field. Reproducibility of CTG analysis using these guidelines has, to our knowledge, been reported only once before (intra-observer), in a study on neonatal encephalopathy [4].

In the present study we evaluated inter-observer agreement in expert interpretation of CTG tracings following the FIGO guidelines, and in the subsequent clinical decision. The aim was to evaluate whether three observers would arrive at similar classifications in conditions currently obtainable in most Western European centres for clinical audits and medico-legal cases. However, in contrast to the latter situations, independent analysis was performed and no prior knowledge of fetal outcome was made available.

Methods

Thirty-three fetal heart rate tracings (16 taken antepartum and 17 taken intrapartum) were randomly selected from 22 high risk third trimester pregnancies, excluding those with poor signal quality. Antepartum fetal heart rate was obtained by external monitoring with ultrasound and autocorrelation, while a scalp electrode was used intrapartum. Uterine activity was assessed by tocodynamometry and fetal movements were recorded by the mother. The Toitu MT 810B fetal monitor was used for registration at a paper speed of 1 cm/min. All tracings were more than 40 minutes in duration.

Tracings were sent by mail to three acknowledged experts in fetal heart rate monitoring, to be returned within three months. All have published articles in this field, and they hold active positions in three major academic centres, where the FIGO guidelines are used routinely. Clinical information on the patients was supplied, including personal history, gestational age, gestational pathology and stage of labour. Observers were asked to classify tracings as normal, suspicious or pathologic according to the FIGO guidelines, which were enclosed. They were further requested to decide on one of three clinical management options, based on CTG tracings and clinical information: ‘no action’, ‘close monitoring’ or ‘immediate intervention’. Four antepartum CTGs were immediate precursors of other tracings. Experts were consequently asked to classify all CTGs but, in these cases, to propose a clinical decision only after the last one.

Agreement was evaluated by the proportions of agreement (Pa) [5], the kappa statistic (κ), which evaluates agreement beyond that expected by chance, and the weighted kappa (κw), which evaluates the same agreement after assigning different weights to disagreements between adjacent or extreme classes [6]. The Cicchetti weights were used for this purpose. Ninety-five percent confidence intervals (95% CI) were calculated for all estimates. Kappa values > 0.75 were considered as excellent agreement; those between 0.40 and 0.75 as fair to good agreement; and those < 0.40 as poor agreement. If the upper limit of the 95% CI for the Pa was under 0.50, agreement was considered to be poor.
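As an illustration of the agreement measures described above, the following minimal sketch computes the simple kappa and the proportion of specific agreement for one pair of observers. It assumes Cohen's two-rater kappa and the two-rater form Pa = 2a / (2a + b + c), where a is the number of tracings both observers place in a category and b and c are the tracings only one of them places there. The estimators actually used for the three observers are not detailed in the text, and the ratings and function names below are hypothetical.

```python
from collections import Counter

CATEGORIES = ("N", "S", "P")  # normal, suspicious, pathologic


def cohen_kappa(rater_a, rater_b, categories=CATEGORIES):
    """Simple (unweighted) kappa for two raters: agreement beyond chance."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    # Chance agreement from the product of the two raters' marginal counts.
    expected = sum(count_a[c] * count_b[c] for c in categories) / n ** 2
    return (observed - expected) / (1 - expected)


def specific_agreement(rater_a, rater_b, category):
    """Proportion of specific agreement (Pa) for one category, two raters."""
    both = sum(a == category and b == category for a, b in zip(rater_a, rater_b))
    total = sum(a == category for a in rater_a) + sum(b == category for b in rater_b)
    return 2 * both / total


# Hypothetical ratings of ten tracings (not data from this study).
rater_a = list("NNNNSSSPPP")
rater_b = list("NNNSSSPPPN")

kappa = cohen_kappa(rater_a, rater_b)
pa = {c: specific_agreement(rater_a, rater_b, c) for c in CATEGORIES}
print(f"kappa = {kappa:.2f}")
for category, value in pa.items():
    print(f"Pa({category}) = {value:.2f}")
```

In this made-up example the observers agree on 7 of 10 tracings, yet kappa is lower than 0.70 because part of that agreement is expected by chance.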

Results

The experts’ classification of tracings and subsequent clinical decision are summarised in Table 1. The results of the agreement trials between pairs of observers are illustrated in Fig. 1.

Overall agreement in classification of fetal heart rate tracings was at the lower limit of the ‘fair to good’ category (κ = 0.48; 95% CI 0.34-0.62). Weighted kappa was 0.58 (95% CI 0.44-0.72). A reasonable agreement was found for normal tracings (Pa = 0.62; 95% CI 0.51-0.73), and a poor agreement for suspicious (Pa = 0.42; 95% CI 0.34-0.50) and pathologic tracings (Pa = 0.25; 95% CI 0.14-0.36). When analysing antepartum cases separately, kappa was 0.57 (95% CI 0.41-0.73), while for intrapartum cases it was 0.31 (95% CI 0.11-0.51).

Overall, kappa on clinical decision was 0.59 (95% CI 0.43-0.76) while κw was 0.68 (95% CI 0.49-0.86). Agreement was significantly better for the decision to take ‘no action’ (Pa = 0.79; 95% CI 0.68-0.89) than for ‘close monitoring’ (Pa = 0.14; 95% CI 0.02-0.43) or for ‘immediate intervention’ (Pa = 0.38; 95% CI 0.21-0.56).

Three cases had a poor fetal outcome, defined as a 1-minute Apgar score < 4, 5-minute Apgar score < 7, or umbilical artery pH < 7.10 (Table 1). In one of these cases, the tracing was classified as suspicious by one expert and pathologic by the remaining two. Tracings from the other two cases were unanimously considered pathological. Experts agreed on ‘immediate intervention’ in all three cases.

Fig. 1. Venn diagram to illustrate the results of agreement trials between pairs of observers over the classification of fetal heart rate tracings as normal, suspicious, or pathologic, and over the clinical decision to take no action, monitor closely, or intervene immediately. [Figure not reproduced. Its two panels are labelled ‘CTG classification’ (normal, suspicious, pathological) and ‘Clinical decision’ (no action, monitoring, intervention).]
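The interpretation thresholds stated in the Methods can be applied mechanically to the estimates reported above. The sketch below is only an illustration of those rules: the function names are hypothetical, and treating a kappa of exactly 0.40 or 0.75 as ‘fair to good’ is an assumption, since the text does not say how boundary values were handled.

```python
def kappa_category(kappa):
    """Label a kappa value using the cut-offs given in the Methods."""
    if kappa > 0.75:
        return "excellent"
    if kappa >= 0.40:  # assumption: boundary values count as 'fair to good'
        return "fair to good"
    return "poor"


def pa_is_poor(ci_upper):
    """Pa is considered poor when the upper 95% CI limit is under 0.50."""
    return ci_upper < 0.50


# Kappa values reported in the Results.
reported_kappa = {
    "classification, all tracings": 0.48,
    "classification, antepartum": 0.57,
    "classification, intrapartum": 0.31,
    "clinical decision": 0.59,
}
for label, value in reported_kappa.items():
    print(f"{label}: kappa = {value:.2f} ({kappa_category(value)})")

# Upper 95% CI limits of Pa reported for two classification categories.
print("normal tracings poor?", pa_is_poor(0.73))
print("pathologic tracings poor?", pa_is_poor(0.36))
```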

Table 1. Classification of fetal heart rate tracings as normal (N), suspicious (S) or pathologic (P), and clinical decision as ‘no action’ (NA), ‘close monitoring’ (M) or ‘immediate intervention’ (I), by the 3 experts for the 16 antepartum (tracings 1-16) and 17 intrapartum (tracings 17-33) tracings. CTG = cardiotocographic.

[Table body not reproduced: the per-expert entries for each tracing could not be recovered in row order. Tracings 3, 16 and 21 are marked with an asterisk.]

*Cases with 1-minute Apgar scores < 4, 5-minute Apgar scores < 7, or umbilical artery pH < 7.10. ? = Cases where experts failed to provide a clinical decision.

Discussion

Although the best a priori conditions for agreement were aimed at in this study, further improvements could still have been possible. One observer reported that he was not accustomed to reading tracings at a paper speed of 1 cm/min. All accepted, however, that they had enough time to analyse the tracings and to adapt to this situation. Furthermore, it is possible that certain experts used their personal interpretation of some of the subjective classification criteria present in the FIGO guidelines (namely, regarding the frequency, intensity and classification of decelerations), as they have articles published on this subject. This subjectivity in definitions still present in the FIGO guidelines may have a negative influence on agreement. Spencer et al. [4] have reported a lower intra-observer agreement in analysis of intrapartum CTGs using the FIGO guidelines, when compared with the Krebs scoring system.

Overall agreement in classification of CTG tracings was only ‘fair’, despite experienced observers and consensual guidelines. This result should be taken into account in the conduct of clinical audits, and in the sphere of medical jurisprudence. Comparisons with other studies evaluating the reproducibility of CTG analysis are of questionable value, as different numbers of observers and different scoring systems were used. Nevertheless, most studies report kappa values in the range of 0.14-0.39 [2], which is slightly lower than our result.

No tracing was classified as normal by one observer and as pathologic by another; all disagreement was found between the adjacent classes normal-suspicious and suspicious-pathologic (Fig. 1). Thus, weighted kappa was substantially higher than simple kappa. It seems intuitive that disagreement between extreme classes (normal-pathologic) should carry a greater weight when assessing a method’s reproducibility than disagreement between adjacent classes (normal-suspicious or suspicious-pathologic). However, this may not be the most relevant issue from an outcome and medico-legal perspective. With the latter issues in mind, classification of a CTG as normal by one observer and as suspicious by another can, by itself, be important. For the former, a weekly re-evaluation might be planned, while for the latter closer surveillance would be arranged, and these two approaches can lead to drastically different outcomes. Similar problems are involved when disagreement occurs between suspicious and pathologic classifications. Consequently, exact agreement may be the more relevant issue in this context and, perhaps because of this, the simple kappa remains the most widely used measure of agreement for this purpose [1,2].

Agreement was reasonable for normal tracings, but poor for suspicious and pathologic tracings. The greater number of cardiotocographic ‘events’ present in the latter probably makes them more prone to disagreement. Likewise, agreement may be worse for tracings acquired during labour, although differences did not reach statistical significance. This suggests that agreement can also depend on the population selected for the study. A low risk antepartum population is expected to include a larger number of normal tracings, and thus to yield a higher agreement. Conversely, the high risk antepartum and intrapartum population present in this study most likely had a negative influence on overall results. In most prenatal care centres, however, a large majority of normal tracings is the rule, and consequently little disagreement will probably be found over them. It is in the small number of non-normal tracings that most disagreement may occur.

Agreement on clinical decision was slightly better than on CTG classification, but was mostly centred on the decision to take ‘no action’ (Fig. 1). Experts disagreed more over the decisions to ‘monitor closely’ or to ‘intervene immediately’. It again seems that when reassuring tracings are present in the setting of an uncomplicated clinical context, little disagreement is found. Disagreement occurs mainly over what clinical approach to choose when situations diverge from the normal.

Other authors have evaluated agreement in clinical decision after CTG analysis, but different methodologies were employed, so comparisons are of limited value. Keith et al. [7] evaluated agreement between 17 experts in management of 50 intrapartum cases based on clinical information, CTG tracings, and interpolated fetal scalp pH values, and reported kappa values of 0.12-0.46. However, five categories for clinical decision were considered. Another study, evaluating agreement between five differently experienced observers in management of 50 cases submitted to caesarean section, found unanimity among assessors in only 28% [8].

While considerable inconsistencies were observed among experts in classification of tracings and subsequent clinical decision, there was little disagreement in the three cases with poor fetal outcome. This is, of course, a number too small to allow any conclusions. However, it is an aspect that probably merits further research. It would be important to evaluate independent agreement over cases with poor fetal outcome without letting this outcome be known to clinicians, as this comes closer to the real issue of reproducibility of analysis in clinical audits and medico-legal cases.
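The relation between simple and weighted kappa discussed above can be made concrete with a small worked example. The sketch below assumes that the ‘Cicchetti weights’ correspond to the linear (Cicchetti-Allison) scheme, in which adjacent classes receive half credit and extreme classes none. The 3 x 3 table is hypothetical and is not the study's data; it was chosen so that, as in the study, no tracing is rated normal by one observer and pathologic by the other.

```python
def identity_weights(k):
    """Weights for simple kappa: credit only for exact agreement."""
    return [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]


def linear_weights(k):
    """Linear weights (assumed Cicchetti-Allison form): 1 - |i - j| / (k - 1)."""
    return [[1 - abs(i - j) / (k - 1) for j in range(k)] for i in range(k)]


def weighted_kappa(table, weights):
    """Weighted kappa for a k x k table of counts (rows: rater A, columns: rater B)."""
    k = len(table)
    total = sum(sum(row) for row in table)
    row_marg = [sum(row) / total for row in table]
    col_marg = [sum(table[i][j] for i in range(k)) / total for j in range(k)]
    observed = sum(weights[i][j] * table[i][j] / total
                   for i in range(k) for j in range(k))
    expected = sum(weights[i][j] * row_marg[i] * col_marg[j]
                   for i in range(k) for j in range(k))
    return (observed - expected) / (1 - expected)


# Hypothetical counts, category order: normal, suspicious, pathologic.
table = [
    [10, 3, 0],
    [2, 8, 3],
    [0, 2, 5],
]

simple = weighted_kappa(table, identity_weights(3))
weighted = weighted_kappa(table, linear_weights(3))
print(f"simple kappa   = {simple:.2f}")
print(f"weighted kappa = {weighted:.2f}")
```

Because every disagreement in this table falls between adjacent classes, the weighted value exceeds the simple one, which mirrors the pattern reported for the classification of tracings.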

Conclusion

Considerable inconsistencies were observed in experts’ interpretation of CTG tracings and subsequent clinical decision, even when following the FIGO guidelines for fetal monitoring. Disagreement was more pronounced over suspicious and pathologic tracings. These limitations should be taken into account in the conduct of clinical audits, and in the sphere of medical jurisprudence.

Acknowledgments

The authors would like to thank Professors L. Graça, P. Moura, and S. Jorge for their contribution in tracing analysis and clinical decision. We would also like to thank Professor H. van Geijn for his suggestions regarding study design and Dr C. Santos for her help in statistical analysis. Research was supported by Grants 28757 of the Instituto Nacional de Investigação Científica (INIC), Portugal and PECS/P/SAU/207/97 of the Junta Nacional de Investigação Científica (JNICT), Portugal.

References

1. Hage ML. Interpretation of nonstress tests. Am J Obstet Gynecol 1985;153:153-155.
2. Paneth N, Bommarito M, Stricker J. Electronic fetal monitoring and later outcome. Clin Invest Med 1993;16:159-165.
3. Rooth G, Huch A, Huch R. Guidelines for the use of fetal monitoring. Int J Gynecol Obstet 1987;25:159-167.
4. Spencer JAD, Badawi N, Burton P, Keogh J, Pemberton P, Stanley F. The intrapartum CTG prior to neonatal encephalopathy at term: a case-control study. Br J Obstet Gynaecol 1997;104:25-28.
5. Grant JM. The fetal heart rate is normal, isn’t it? Observer agreement of categorical assessments. Lancet 1991;337:215-218.
6. Shoukri MM, Edge VL. Statistical Methods for Health Sciences. New York: CRC Press; 1996.
7. Keith RDF, Beckley S, Garibaldi JM, Westgate JA, Ifeachor EC, Greene KR. A multicentre comparative study of 17 experts and an intelligent computer system for managing labour using the cardiotocogram. Br J Obstet Gynaecol 1995;102:688-700.
8. Barrett JFR, Jarvis GJ, MacDonald HN, Buchan PC, Tyrrell SN, Lilford RJ. Inconsistencies in clinical decisions in obstetrics. Lancet 1990;336:549-551.

Accepted 28 June 1999
