MEDSYNDIKATE - Design Considerations for an Ontology-Based Medical Text Understanding System

Udo Hahn a, Martin Romacker a,b, Stefan Schulz a,b

a Freiburg University, Text Knowledge Engineering Lab (http://www.coling.uni-freiburg.de)
b Freiburg University Hospital, Department of Medical Informatics (http://www.imbi.uni-freiburg.de/medinf)

MEDSYNDIKATE is a natural language processor for automatically acquiring knowledge from medical finding reports. The content of these documents is transferred to formal representation structures which constitute a corresponding text knowledge base. The general system architecture we present integrates requirements from the analysis of single sentences, as well as those of referentially linked sentences forming cohesive texts. The strong demands MEDSYNDIKATE poses to the availability of expressive knowledge sources are accounted for by two alternative approaches to (semi)automatic ontology engineering.

INTRODUCTION

With the excessive proliferation and ubiquitous accessibility of textual data in electronic form, computational support for content tracking and content management is becoming a necessity. The field of information retrieval has a long-standing tradition of investigating content-sensitive document filters, which determine relevant documents relative to user queries (for a survey from a medical perspective, cf. [9]). The user may then check these documents in order to extract the desired information intellectually. More recently, research efforts have been targeted at the automatic extraction of relevant information directly from document sources. While substantial progress has already been made (cf. various prototypes in the medical domain such as LSP [13], MEDLEE [2], MENELAS [17], RECIT [11], CLARIT [1]), current information extraction systems are limited in several ways. First, their range of understanding is bounded by limited domain knowledge. The templates these systems are supplied with allow only factual information about particular entities (patients, diagnoses, etc.) to be assembled from the analyzed documents. Also, these knowledge sources usually cannot be augmented, e.g., by some sort of concept learning device. Accordingly, when the focus of interest of a user shifts to (facets of) a topic not considered so far, new templates must be supplied or existing ones must be updated manually. In any case, for a modified set of templates the analysis has to be rerun for the entire document collection. Templates also provide either no or severely limited inferencing capabilities for reasoning about the template fillers (hence, their understanding depth is low). Finally, the potential of information extraction systems for dealing with textual phenomena is rather constrained, if it is available at all.

With the SYNDIKATE system family, we address these shortcomings and aim at a more sophisticated level of knowledge acquisition from real-world texts. The source documents we deal with are currently taken from two domains, viz. test reports from the information technology domain (ITSYNDIKATE) and medical finding reports, the framework of the MEDSYNDIKATE system. MEDSYNDIKATE is designed to acquire from each input text a maximum number of simple facts ("The findings correspond to an adenocarcinoma."), complex propositions ("All mucosa layers show an inflammatory infiltration that mainly consists of lymphocytes."), and evaluative assertions ("The findings correspond to a severe chronic gastritis."). Hence, our primary goal is to extract conceptually deeper and inferentially richer forms of relational information than those found by state-of-the-art information extraction systems.

To achieve this goal, several requirements with respect to language processing proper have to be fulfilled. Like most information extraction systems, we require our parser to be robust to underspecification and ill-formed input (cf. the protocols in [5]). Unlike almost all of them, our parsing system is particularly sensitive to the treatment of textual reference relations as established by various forms of anaphora [16]. Furthermore, since SYNDIKATE systems rely on a knowledge-rich infrastructure, particular care has to be taken to provide expressive knowledge repositories on a larger scale. We are currently exploring two approaches. First, we automatically enhance the set of already given knowledge templates through incremental concept learning routines [8]. Our second approach makes use of the large body of knowledge that has already been assembled in medical taxonomies and terminologies (e.g., the UMLS). That knowledge is automatically transformed into a description logics format and, after interactive debugging, integrated into a formal medical knowledge base covering large areas of anatomy and pathology [14].


Figure 1: System Architecture of SYNDIKATE

SYSTEM ARCHITECTURE

In the following, the major design issues for MEDSYNDIKATE are discussed, with a focus on the distinction between sentence-level and text-level analysis. We will then turn to two alternative ontology engineering methodologies satisfying the need for the (semi)automatic supply of large amounts of background knowledge. The overall architecture of SYNDIKATE is summarized in Figure 1. The general task of any SYNDIKATE system consists of mapping each incoming text, Ti, into a corresponding text knowledge base, TKBi, which contains a formal representation of Ti's content. This knowledge will be exploited by various information services, such as inferentially supported fact retrieval or text summarization.
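For concreteness, the following Python fragment sketches what such a text knowledge base minimally amounts to, namely a container of typed instances and role assertions. The class layout, the method names, and the CORRESPONDS-TO role are illustrative assumptions on our part; the actual TKBs are terminological (KL-ONE-like) knowledge bases, not Python objects.

from dataclasses import dataclass, field

@dataclass
class TextKnowledgeBase:
    # instance identifier -> concept type, e.g. "ADENOCARCINOMA.6-04" -> "ADENOCARCINOMA"
    instances: dict = field(default_factory=dict)
    # role assertions as (instance, role, filler) triples
    relations: set = field(default_factory=set)

    def add_instance(self, instance_id, concept_type):
        self.instances[instance_id] = concept_type

    def add_relation(self, instance_id, role, filler):
        self.relations.add((instance_id, role, filler))

# Content extracted from Sentence (1) below, "The findings correspond to an adenocarcinoma.":
tkb = TextKnowledgeBase()
tkb.add_instance("FINDINGS.2-01", "FINDINGS")
tkb.add_instance("ADENOCARCINOMA.6-04", "ADENOCARCINOMA")
tkb.add_relation("FINDINGS.2-01", "CORRESPONDS-TO", "ADENOCARCINOMA.6-04")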

Sentence-Level Understanding

Grammatical knowledge for syntactic analysis resides in a fully lexicalized dependency grammar (cf. [7] for details), which we refer to as the Lexicon in Figure 1. Basic word forms (lexemes) constitute the leaf nodes of the lexicon tree, while grammatical generalizations over lexemes appear as lexeme class specifications at different levels of abstraction. The Generic Lexicon in Figure 1 contains entries which are domain-independent (such as sell, with, or month), while domain-specific extensions are kept in specialized lexicons serving the needs of particular subdomains, e.g., IT (notebook, hard disk, etc.) or medicine (adenocarcinoma, gastric mucosa, etc.).

Conceptual knowledge is expressed in a KL-ONE-like representation language (cf. [7] for details).

These languages support the definition of complex concept descriptions by means of conceptual roles and corresponding role filler constraints which introduce typing restrictions on possible fillers. Taxonomic reasoning can be defined as primitive (following explicit links) or can be computed on the basis of subsumption relations between complex conceptual descriptions. Also, a distinction is made between concept classes (types) and instances (representing concrete real-world entities). Each lexeme is directly associated with one (or, in the case of polysemy, several) concept type(s). Accordingly, when a new lexical item is read from the input text, a dedicated process (word actor) is created for lexical parsing (step A in Figure 1), together with an instance of the lexeme's concept type (step B). Each word actor then negotiates dependency relations by taking into account syntactic constraints from the already generated dependency tree (step C), as well as conceptual constraints supplied by the associated instance in the domain knowledge (step D). Analogous to the Lexicon, the ontologies we provide are split up into one that serves all applications, the Upper Ontology, and specialized ontologies that account for the conceptual structure of particular domains, e.g., information technology (NOTEBOOK, HARD-DISK, etc.) or medicine (ADENOCARCINOMA, GASTRIC-MUCOSA, etc.).

Semantic knowledge is concerned with linkages between instances of concept classes according to the dependency relations that are established between their associated lexical items. A linkage may be constrained by dependency relations (e.g., the subject relation may only be interpreted conceptually in terms of AGENT or PATIENT roles), by intervening lexical material (e.g., some prepositions impose special role constraints, such as "with" does in terms of HAS-PART or INSTRUMENT roles), or only by conceptual compatibility between the concepts involved (e.g., for genitives) [12]. The specification of semantic knowledge shares many commonalities with domain knowledge; hence the overlap in Figure 1.
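To illustrate the flavor of such linkage constraints, the following Python sketch checks candidate conceptual roles against their filler constraints for a given syntactic linkage. The role inventory, the toy is-a links, and the instance names are illustrative assumptions, not the actual MEDSYNDIKATE semantic or domain knowledge.

# Toy is-a hierarchy used for the filler constraint check (illustrative only).
IS_A = {
    "LYMPHOCYTE": "CELL",
    "CELL": "ANATOMICAL-OBJECT",
    "SCALPEL": "INSTRUMENT",
}

def subsumes(general, specific):
    # Follow explicit is-a links upwards from `specific`.
    while specific is not None:
        if specific == general:
            return True
        specific = IS_A.get(specific)
    return False

# Admissible conceptual interpretations per syntactic linkage: each entry maps a
# dependency relation (or preposition) to (conceptual role, required filler type) pairs.
LINKAGE = {
    "subject": [("AGENT", "ANATOMICAL-OBJECT"), ("PATIENT", "ANATOMICAL-OBJECT")],
    "with":    [("HAS-PART", "ANATOMICAL-OBJECT"), ("INSTRUMENT", "INSTRUMENT")],
}

def interpret(linkage, head_instance, dependent_instance, dependent_concept):
    # Keep only those roles whose filler constraint the dependent's concept satisfies.
    return [(head_instance, role, dependent_instance)
            for role, filler_type in LINKAGE.get(linkage, [])
            if subsumes(filler_type, dependent_concept)]

print(interpret("with", "INFILTRATION.3-01", "LYMPHOCYTE.3-07", "LYMPHOCYTE"))
# -> [('INFILTRATION.3-01', 'HAS-PART', 'LYMPHOCYTE.3-07')]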

Text-Level Understanding

The proper analysis of textual phenomena prevents inadequate text knowledge representation structures from emerging in the course of sentence-centered analysis [6]. Consider the following text fragment:

(1) Der Befund entspricht einem hochdifferenzierten Adenokarzinom. (The findings correspond to a highly differentiated adenocarcinoma.)

(2) Der Tumor hat einen Durchmesser von 2 cm. (The tumor has a diameter of 2 cm.)


S1: [FINDINGS.2-01: Befund, ADENOCARCINOMA.6-04: Adenokarzinom]
S2: [ADENOCARCINOMA.6-04: Tumor, DIAMETER.5-06: Durchmesser, CM: cm]

Table 1: Center Lists for Sentences (1) and (2)

Figure 2: Unresolved Nominal Anaphora

Figure 3: Resolved Nominal Anaphora

In the course of a purely sentence-oriented analysis, an invalid knowledge base emerges when each entity that has a different denotation at the text surface is treated as a formally distinct item at the symbol level of knowledge representation, although the different denotations literally refer to the same conceptual entity. This is the case for nominal anaphora, an example of which is given by the reference relation between the noun phrase "Der Tumor" (the tumor) in Sentence (2) and "Adenokarzinom" (adenocarcinoma) in Sentence (1). A false referential description appears in Figure 2, where TUMOR.2-05 is introduced as a new representational entity, whereas Figure 3 depicts the adequate conceptual representation capturing the intended meaning at the representation level, viz. maintaining ADENOCARCINOMA.6-04 as the proper referent.

The methodological framework for tracking reference relations at the text level is provided by center lists [16] (cf. step E in Figure 1). The ordering of their elements indicates that the most highly ranked element is the most likely antecedent of an anaphoric expression in the subsequent utterance, while the remaining elements are (partially) ordered according to decreasing preference for establishing referential links. In the tuple notation of Table 1, the conceptual correlate of each noun in the text knowledge base appears in first place, while the lexical surface form appears in second place. Using the center list of Sentence (1) for the interpretation of Sentence (2) results in a series of queries asking whether FINDINGS is conceptually more special than TUMOR (answer: no) or whether ADENOCARCINOMA is more special than TUMOR (answer: yes). As the second center list item for S1 fulfils all required constraints, TUMOR.2-05, the literal instance (cf. Figure 2), is replaced in the conceptual representation structure of Sentence (2) by ADENOCARCINOMA.6-04, the referentially valid identifier (cf. Figure 3). As a consequence, instead of two unlinked sentence graphs for Sentences (1) and (2), the reference resolution for nominal anaphora joins them into a single coherent and valid text knowledge graph.

Ontology Engineering

MEDSYNDIKATE requires a knowledge-rich infrastructure both in terms of grammar and domain knowledge, which can hardly be maintained by human effort alone. Rather, a significant amount of knowledge should be provided automatically. For SYNDIKATE systems, we have chosen a dual strategy, one learning new concepts incrementally while understanding the texts, the other based on the reuse of a priori available, comprehensive (though weak) knowledge sources.

Concept Learning from Text. Extending a given core ontology with new concepts as a by-product of the text understanding process builds on two different sources of evidence: the already given domain knowledge, and the grammatical constructions in which unknown lexical items occur in the source document. The parser yields information from the grammatical constructions in which an unknown word occurs in terms of the labellings in the dependency graph (cf. Figure 4). The kinds of syntactic constructions in which unknown words appear are recorded and later assessed relative to the credit they lend to a particular hypothesis. Typical linguistic indicators that can be exploited for taxonomic integration are, e.g., appositions ('the symptom @A@') or exemplification phrases ('symptoms like @A@'), with '@A@' denoting the unknown word. These constructions almost unequivocally determine '@A@', when considered as a medical concept, to denote an instance of a SYMPTOM. The conceptual interpretation of parse trees involving unknown words in the domain ontology leads to the derivation of concept hypotheses, which are further enriched by conceptual annotations. These reflect structural patterns of consistency, mutual justification, analogy, etc. relative to already available concept descriptions in the ontology or other concept hypotheses. Grammatical and conceptual evidence of this kind, in particular their predictive "goodness" for the learning task, is represented by corresponding sets of linguistic and conceptual quality labels. Multiple concept hypotheses for each unknown lexical item are organized in terms of hypothesis spaces, each of which holds alternative or further specialized conceptual readings. An inference engine embedded in the terminological system, the so-called quality machine, estimates the overall credibility of single concept hypotheses by taking the available set of quality labels for each hypothesis into account (cf. [8] for details).

Figure 4: SYNDIKATE's Learning Architecture

Reengineering Medical Terminologies. The second approach makes use of the large body of knowledge that has already been assembled in comprehensive medical terminologies. The knowledge they contain, however, cannot be applied directly to a system such as MEDSYNDIKATE, because it is characterized by inconsistencies (e.g., circular definitions), insufficient depth, gaps, etc. Our methodology for refining this weak medical knowledge consists of four steps (cf. [14] for more details). First, we automatically create description logics expressions by feeding the generator with data directly from the UMLS. More specifically, the mrcon, mrrel, and mrsty tables from the UMLS are used, which contain the concept names (concept unique identifiers, CUIs), the semantic links between two CUIs, and the semantic types assigned to each CUI, respectively. In a second step, the imported concepts, already in a logical format, are submitted to the classifier of the knowledge representation system (in our case, LOOM) in order to check whether the terminological definitions are consistent and coherent. Third, for those elements which are inconsistent or incoherent, validity is restored manually by a medical domain expert. In the final step, the knowledge base which has emerged so far is manually rectified and refined (e.g., by checking the adequacy of taxonomic and partonomic hierarchies).
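The first, fully automatic step can be pictured roughly as follows. The Python sketch below uses toy stand-ins for MRCON/MRREL/MRSTY records (the CUIs and the relation name are fictitious) and emits simplified LOOM-style defconcept expressions, so it approximates the idea of the generator rather than its actual implementation.

# Toy stand-ins for MRCON (CUI -> preferred name), MRREL (links between CUIs),
# and MRSTY (CUI -> semantic type). All identifiers are fictitious.
mrcon = {
    "CUI-0001": "Gastric Mucosa",
    "CUI-0002": "Stomach",
    "CUI-0003": "Adenocarcinoma",
}
mrrel = [("CUI-0001", "part-of", "CUI-0002")]
mrsty = {
    "CUI-0001": "Tissue",
    "CUI-0002": "Body Part",
    "CUI-0003": "Neoplastic Process",
}

def to_symbol(name):
    # Turn a preferred name into a concept symbol, e.g. "Gastric Mucosa" -> "GASTRIC-MUCOSA".
    return name.upper().replace(",", "").replace(" ", "-")

def generate_defconcepts():
    # One definition per CUI: the semantic type becomes the superconcept,
    # inter-concept links become existential role restrictions.
    definitions = []
    for cui, name in mrcon.items():
        superconcept = to_symbol(mrsty[cui])
        roles = ["(:some %s %s)" % (rel, to_symbol(mrcon[target]))
                 for source, rel, target in mrrel if source == cui]
        body = "(:and %s %s)" % (superconcept, " ".join(roles)) if roles else superconcept
        definitions.append("(defconcept %s :is-primitive %s)" % (to_symbol(name), body))
    return definitions

for definition in generate_defconcepts():
    print(definition)
# e.g. (defconcept GASTRIC-MUCOSA :is-primitive (:and TISSUE (:some part-of STOMACH)))

The generated definitions would then be handed to the classifier for the consistency and coherence checks of step two.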

SYNDIKATE APPLICATIONS

In quantitative terms, the SYNDIKATE baseline system is neither a toy system nor a monster. The generic lexicon currently includes 5,000 entries, to which the IT lexicon adds 2,500 and the MED lexicon 3,000 entries. Similarly, at the ontology level, the Upper Ontology contains 1,500 concepts and roles, to which the IT ontology adds 1,500 items and the MED ontology 2,500 concepts and roles. Recent experiments with reengineering the UMLS, however, have resulted in a very large medical knowledge base with 164,000 concepts and 76,000 relations that is currently under validation.

Two different application streams are under active development at our lab.

The first of these relates to medical classification services, which require mapping finding reports to particular disease categories (e.g., the automatic assignment of ICD-9 codes to discharge summaries [10]) or making the degree of a disease explicit (staging and grading indices) [4]. We have made progress here by incorporating the interpretation of comparatives and evaluative assertions [15]. While these services still aim at the automation of standard routines in clinical documentation centers, the potential of inferentially based fact retrieval considerably exceeds the functionality of today's non-deductive hospital information systems. Given such an application, the validity of the text knowledge bases becomes a crucial issue. As we have already discussed, disregarding textual phenomena will cause dysfunctional system behavior in terms of incorrect answers. This can be illustrated by the following query:

Q:  (retrieve ?x (Tumor ?x))
A-: (Tumor.2-05, Adenocarcinoma.6-04)
A+: (Adenocarcinoma.6-04)

The query triggers a search for all instances in the text knowledge base that are of type TUMOR. Given an invalid knowledge base (cf. Figure 2), the incorrect answer (A-) contains two entities, viz. TUMOR.2-05 and ADENOCARCINOMA.6-04, since both are in the extension of the concept TUMOR. If, however, a valid text knowledge base such as the one in Figure 3 is given, only the correct answer, ADENOCARCINOMA.6-04, is inferred (A+).
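The contrast between A- and A+ can be reproduced with a small Python sketch over two toy text knowledge bases. The is-a links and the retrieve helper are illustrative assumptions; the real query is answered by the LOOM classifier over the full text knowledge base.

# Toy is-a hierarchy (illustrative only).
IS_A = {
    "ADENOCARCINOMA": "TUMOR",
    "TUMOR": "PATHOLOGICAL-STRUCTURE",
    "FINDINGS": "CLINICAL-INFORMATION",
    "DIAMETER": "QUANTITY",
}

def instance_of(concept, query_type):
    while concept is not None:
        if concept == query_type:
            return True
        concept = IS_A.get(concept)
    return False

def retrieve(query_type, tkb):
    # All instance identifiers whose concept lies in the extension of query_type.
    return [inst for inst, concept in tkb.items() if instance_of(concept, query_type)]

# Invalid text knowledge base (Figure 2): the anaphor got its own instance.
invalid_tkb = {"FINDINGS.2-01": "FINDINGS", "ADENOCARCINOMA.6-04": "ADENOCARCINOMA",
               "TUMOR.2-05": "TUMOR", "DIAMETER.5-06": "DIAMETER"}
# Valid text knowledge base (Figure 3): the anaphor was resolved.
valid_tkb = {"FINDINGS.2-01": "FINDINGS", "ADENOCARCINOMA.6-04": "ADENOCARCINOMA",
             "DIAMETER.5-06": "DIAMETER"}

print(retrieve("TUMOR", invalid_tkb))  # A-: ['ADENOCARCINOMA.6-04', 'TUMOR.2-05']
print(retrieve("TUMOR", valid_tkb))    # A+: ['ADENOCARCINOMA.6-04']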

SYSTEM EVALUATION

SYNDIKATE has not yet undergone a thorough empirical evaluation at these application levels. We have, however, carefully evaluated its subcomponents:

1. Sentence Parsing. We compared a standard active chart parser with full backtracking capabilities to the parser of SYNDIKATE, which is characterized by restricted backtracking capabilities, using the same grammar specifications. On average, SYNDIKATE's parser exhibits linear time complexity, with a factor that depends on the ambiguity rate of the input sentences. The active chart parser runs into exponential time complexity whenever it encounters extragrammatical or ungrammatical input, since it then searches the entire parse space. The loss of structural descriptions due to the parser's incompleteness amounts to 10% compared with the complete, though intractable, parser [5].

2. Text Parsing. While no significant differences in resolution capacity (effectiveness) could be determined, the functional centering model we propose outperforms the best-known centering algorithms by a rate of 50% with respect to a measure of computation costs which considers "cheap" and "expensive" transitional moves between utterances to assess a text's coherence. Hence, the procedure we propose is more efficient [16].

3. Semantic Interpretation. Our group has been pioneering work on the empirical evaluation of meaning representations. In particular, we determined the quality and coverage of semantic interpretation for randomly sampled medical texts. While recall averaged 64-66%, precision peaked at 95% [12].

4. Concept Learning. The concept learning component has been compared to standard learning mechanisms based only on the terminological classifiers available in any sort of description logics system. Our data indicate an increase in performance of 8% (with the baseline of standard classifiers being on the order of 79%, our system achieved 87% accuracy) [8].

Evaluating a text knowledge acquisition system, however, poses tremendous methodological problems (for a discussion, cf. [3]). The main reason is that a gold standard for comparison - what constitutes a commonly agreed upon interpretation of the content of a text? - is hard to establish, even for non-narrative texts. Even when such a consensus is assumed as given, a follow-up problem is the lack of a significant amount of already annotated text knowledge bases on which alternative analyses might be run and assessed.

CONCLUSIONS

We have introduced MEDSYNDIKATE, a system for automatically acquiring knowledge from medical finding reports. Emphasis was put on the role of the various knowledge sources required for 'deep' text understanding. When turning from sentence-level to text-level analysis, we considered the representational inadequacies that arise when text phenomena are not properly accounted for and, hence, proposed a solution based on centering mechanisms.

The enormous knowledge requirements posed by our approach can only reasonably be met when knowledge acquisition does not rely on human effort alone. Hence, a second major issue we have focused on concerns alternative ways to support knowledge acquisition. We made two proposals. The first one deals with an automatic concept learning methodology that is fully embedded in the text understanding process; the other one exploits the vast amounts of medical knowledge assembled in various knowledge repositories such as the UMLS.

References

[1] D. Evans, N. Brownlow, W. Hersh, and E. Campbell. Automatic concept identification in the electronic medical record: an experiment in extracting dosage information. In Proceedings of the AMIA'96, pages 388-392, 1996.
[2] C. Friedman, P. Alderson, J. Austin, J. Cimino, and S. Johnson. A general natural-language text processor for clinical radiology. Journal of the American Medical Informatics Association, 1(2):161-174, 1994.
[3] C. Friedman and G. Hripcsak. Evaluating natural language processors in the clinical domain. Methods of Information in Medicine, 37(4/5):334-344, 1998.
[4] C. Friedman, C. Knirsch, L. Shagina, and G. Hripcsak. Automating a severity score guideline for community-acquired pneumonia employing medical language processing of discharge summaries. In Proceedings of the AMIA'99, pages 256-260, 1999.
[5] U. Hahn, N. Bröker, and P. Neuhaus. Let's ParseTalk: message-passing protocols for object-oriented parsing. In H. Bunt and A. Nijholt, editors, Recent Advances in Parsing Technology. Dordrecht: Kluwer, 2000.
[6] U. Hahn, M. Romacker, and S. Schulz. Discourse structures in medical reports - watch out! The generation of referentially coherent and valid text knowledge bases in the MEDSYNDIKATE system. International Journal of Medical Informatics, 53(1):1-28, 1999.
[7] U. Hahn, M. Romacker, and S. Schulz. How knowledge drives understanding: matching medical ontologies with the needs of medical language processing. Artificial Intelligence in Medicine, 15(1):25-51, 1999.
[8] U. Hahn and K. Schnattinger. Towards text knowledge engineering. In Proceedings of the AAAI'98, pages 524-531, 1998.
[9] W. Hersh. Information Retrieval: A Health Care Perspective. New York: Springer, 1996.
[10] L. Larkey and B. Croft. Combining classifiers in text categorization. In Proceedings of the SIGIR'96, pages 289-297, 1996.
[11] A. Rassinoux, J. Wagner, C. Lovis, R. Baud, A. Rector, and J. Scherrer. Analysis of medical texts based on a sound medical model. In Proceedings of the SCAMC'95, pages 27-31, 1995.
[12] M. Romacker, S. Schulz, and U. Hahn. Streamlining semantic interpretation for medical narratives. In Proceedings of the AMIA'99, pages 925-929, 1999.
[13] N. Sager, M. Lyman, N. Nhan, and L. Tick. Medical language processing: applications to patient data representation and automatic encoding. Methods of Information in Medicine, 34(1):140-146, 1995.
[14] S. Schulz and U. Hahn. Knowledge engineering by large-scale knowledge reuse: experience from the medical domain. In Proceedings of the KR'2000, pages 601-610, 2000.
[15] S. Staab and U. Hahn. "Tall", "good", "high" - compared to what? In Proceedings of the IJCAI'97, pages 996-1001, 1997.
[16] M. Strube and U. Hahn. Functional centering: grounding referential coherence in information structure. Computational Linguistics, 25(3):309-344, 1999.
[17] P. Zweigenbaum, B. Bachimont, J. Bouaud, J. Charlet, and J. Boisvieux. A multi-lingual architecture for building a normalised conceptual representation from medical language. In Proceedings of the SCAMC'95, pages 357-361, 1995.

Acknowledgements. Martin Romacker and Stefan Schulz were supported by a grant from DFG (Ha 2097/5-1).
