Knowledge Extraction for High-Performance Indexing and Information Retrieval

Jiping Sun

Speech and Information Processing Research Group University of Waterloo, Waterloo, Ontario, Canada N2L 3G1 [email protected]

Abstract

As the information era approaches its prime, making efficient use of the large amounts of information currently available, and being generated day by day in text form, remains a challenging problem. Imagine this scenario: a user wants to find a specific piece of information which resides in a piece of text among millions of others. How can a retrieval system find that piece of text in an ocean of texts? This is what we call "high-performance indexing and retrieval", an inspiring prospect for future libraries and other information service centers. This paper describes the methodology and current progress of our project, Extracting Knowledge from Text for High-performance Indexing and Information Retrieval (EKTHIIR). We first give a definition of high-performance indexing and retrieval. Then we introduce the methodology adopted by our knowledge extraction system, which aims at providing a leading-edge technology for indexing and information retrieval in a library or any information service center. Finally we describe the current progress of this project and discuss possible future work.

1 Introduction

In a library or any large-scale information service center, a bottleneck exists between a user and the information stored in it. On one side, a user may need to obtain some specific piece of information, for example, about how many satellites Jupiter has. On the other side, there are large amounts of information stored on the media, some of which could be relevant to the user's request. However, a search engine may have trouble linking the user's request to the appropriate locations where the relevant information is stored. A high-performance indexing and retrieval system is one that can overcome this kind of bottleneck. The solution we propose is this: all the information stored in an information institution will have been processed to some depth by a natural language processing system, so that a mapping between "topical concepts" (a definition is given later on) and media storage locations is established. When a user's request is processed by a front-end module, such as a natural language interface, a number of topical concepts implied in the user's request will be obtained. These user-generated topical concepts will then be matched against a large topical concept database. Once a match is found, the corresponding storage locations become available to the user, and high-performance retrieval is thus achieved. In the following sections, we first define "high-performance indexing and information retrieval", including the central idea of "topical concepts". Then we describe the methodology adopted in our EKTHIIR project carried out at the University of Waterloo, which uses knowledge extraction techniques to discover topical concepts and to establish the mappings between them and the media storage locations. Once such a mapping is established, a high-performance information retrieval system will be available to users.

2 High-Performance Indexing and Information Retrieval, a Systematic View

A knowledge-based indexing system (Bachrach, C. A. and Charen, T., 1978) has the advantages of high accuracy and a more sophisticated request representation capability. In today's information retrieval services, it is highly desirable for such a system to be able to cope with the specific needs of users, and to do so with high accuracy. The

EKTHIIR project carried out in our group aims at a new approach to the information bottleneck problem. We view this bottleneck as attributable to an insufficient or nonexistent knowledge base, which should be part of the retrieval system and serve as a bridge between the user's needs and the storage locations. First of all, let us define the following two concepts, which are central to our system development.

Definition 1. "High-Performance Indexing and Retrieval": A high-performance indexing and information retrieval system is one that has a knowledge base about the contents of the stored information. Such knowledge serves as a bridge between the user's specific information retrieval needs and the locations where the appropriate information is stored on the media.

Definition 2. "Topical Concept": The knowledge about the contents of some stored information is expressed in terms of "topical concepts", which answer such questions as "what is talked about in the text?". A topical concept consists of the following fields:

1. Domain: Is this chunk of information (such as a few sentences, a paragraph or a couple of pages in a book) about animal zoology? Or about the stock market? Or about an adventurous trip?

2. Main Objects: Every domain of experience has a specific set of players in it. A topical concept indicates which players are active. For example, in a stock market domain, what is being talked about at any moment could be the "prices", the "companies", the "trends", the "gains" and "losses", or any combination of them. In an astronomy domain, "stars", "planets" and "satellites" could be the players.

3. Relations: The objects in a domain are linked by certain relations. This is an important part of a topical concept, and a user of a retrieval system can be interested in such relations.

4. Processes and States: The objects in a domain can also be in some process or state. This is also an important part

in a topical concept, and a user of a retrieval system can be interested in this as well.

Based on the above definitions, we say that a "high-performance indexing and information retrieval system" is a knowledge-based intelligent agent which can learn about the contents of, say, all the books of a whole library and gather the learned information into a conceptual system. This conceptual system has the "topical concept" as its basic organizational unit. Such topical concepts can be matched against the user's request and function as the indexing medium. A pictorial view of this high-performance retrieval system is shown in Figure 1.
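The four fields of Definition 2 can be sketched as a simple record. The following Python sketch is ours, not the paper's actual schema; the field names and the example values are illustrative only:

```python
from dataclasses import dataclass, field

# A sketch of the four-field "topical concept" record defined above.
@dataclass
class TopicalConcept:
    domain: str                                           # e.g. "astronomy"
    objects: list = field(default_factory=list)           # active "players" in the domain
    relations: list = field(default_factory=list)         # (relation, obj1, obj2) triples
    processes: list = field(default_factory=list)         # processes or states

# The Jupiter example from the introduction might index as:
tc = TopicalConcept(
    domain="astronomy",
    objects=["planet", "satellite"],
    relations=[("has", "Jupiter", "satellites")],
)
```

A request such as "how many satellites does Jupiter have?" would then be matched against stored records of this shape.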


Figure 1: High-performance retrieval

The indexing task is performed by a natural language processing system which extracts topical concepts from large quantities of text. Such a system consists of a conceptual dictionary, a phrasal rule database which maps sentence structures into conceptual structures, and a parser which does the actual processing of the text. In our system development, we use a feature-based cascading finite-state transducer as the parser, which interprets the phrasal rules. The text sentences first undergo a dictionary lookup process, in which the words are given semantic features according to the dictionary. Then the sentences are matched against the phrasal rules by the transducer and transformed into conceptual structures. Finally, the output of the transducer is sent to an indexing post-processor to build the topical concept knowledge base. This high-performance indexing process is depicted in Figure 2.
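The pipeline just described (dictionary lookup, cascaded rule application, post-processing) can be sketched as follows. The dictionary entries and function names here are our own illustrations, not the actual EKTHIIR components:

```python
# A minimal sketch of the indexing pipeline: dictionary lookup, then
# cascaded rule application, whose output a post-processor would store.
# The dictionary entries below are illustrative only.

DICTIONARY = {
    "saguaro": {"syn": "noun", "sem": ["object:cactus"]},
    "rises":   {"syn": "verb", "sem": ["process:growth"]},
}

def lookup(tokens):
    """Initial stage: attach feature structures from the conceptual dictionary."""
    return [dict(word=w, **DICTIONARY.get(w, {"syn": "?", "sem": []}))
            for w in tokens]

def run_cascade(tokens, levels):
    """Apply each level of rules to the output of the previous level."""
    features = lookup(tokens)
    for rules in levels:
        for rule in rules:      # a rule maps a feature sequence to a new one
            features = rule(features)
    return features             # the indexing post-processor would store these

# With no rule levels, the cascade just returns the looked-up features:
feats = run_cascade(["saguaro", "rises"], levels=[])
```

Each element of `levels` is one group of phrasal rules, applied in the cascading order described in Section 3.1.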


Figure 2: High-performance indexing

Our design and development is based on the above concept of high-performance indexing and retrieval. The project aims at providing leading-edge technology for libraries and other information service centers. In the next section we introduce the techniques we have been developing towards a software package for high-performance indexing and retrieval.

3 A Conceptual Approach towards Topical Concept Extraction

3.1 The natural language processor: a feature-based cascading finite state transducer (FST)

The natural language processor used in our project is a feature-based cascading finite state transducer. This transducer interprets several groups of phrasal rules, one level after another, in a cascading manner: the output of applying an upper level of rules is sent as input to the next level. The initial stage is dictionary lookup, in which the words of the input text are looked up in the conceptual dictionary and given a feature structure. This feature structure specifies a word's syntactic and semantic features. The latter is a list of domain-specific descriptions of a word's possible object, process, relation and state functions, arranged according to the likelihood of each function actually occurring in real-world texts. The rules specify what kind of patterns in the input sentence should be matched by a rule and, when the input matches, what changes occur to the input to form an output. The rule-writing formalism consists of a regular expression backbone with featural node descriptions on the input side. On the output side, non-determinacy is guaranteed by allowing more than one output expression. The output expressions recognize indexed feature variables and perform unification before instantiating the variables in the output. To put it briefly, a phrasal rule takes the form:

LC MP RC → OE

where LC (left context), MP (mapped portion), RC (right context) and OE (output expression) are the four components of a rule. These components can be further specified as follows:

LC := Element*.
RC := Element*.
Element := not(Term) (not this)
         | or(Term+) (one of these)
         | Term* (zero or more of this)
         | Term+ (one or more of this)
         | Term? (zero or one of this).
Term := (F = V)* (feature and value pairs).
F := any atomic symbol.
V := atomic symbol | Term | Indexed Variable.
OE := Term+.

One interesting characteristic of this transducer is that it can change the length of the input string in the output. This is realized by allowing the MP and OE to have different lengths, and is useful when we want the output to have a different length than the input.
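A toy interpretation of the LC MP RC → OE rule form can be sketched as below. This is our simplification: terms are plain feature dictionaries, matching is simple subsumption, and the indexed variables and unification of the full formalism are omitted:

```python
# Toy interpreter for rules of the form  LC MP RC -> OE  over feature terms.
# A term is a dict of feature=value pairs; a pattern term matches an input
# term if all of its pairs agree (a simplification of the real formalism).

def matches(term, pattern):
    return all(term.get(f) == v for f, v in pattern.items())

def apply_rule(seq, lc, mp, rc, oe):
    """Replace the first occurrence of LC MP RC with LC OE RC.  MP and OE may
    differ in length, so the output sequence can be shorter or longer."""
    n_lc, n_mp, n_rc = len(lc), len(mp), len(rc)
    pats = lc + mp + rc
    for i in range(len(seq) - n_lc - n_mp - n_rc + 1):
        window = seq[i:i + n_lc + n_mp + n_rc]
        if all(matches(t, p) for t, p in zip(window, pats)):
            return seq[:i + n_lc] + oe + seq[i + n_lc + n_mp:]
    return seq

# e.g. rewrite a number + unit pair into a single "length" term,
# shortening the sequence from two terms to one:
seq = [{"cat": "num", "val": "50"}, {"cat": "unit", "val": "feet"}]
out = apply_rule(seq, lc=[], mp=[{"cat": "num"}, {"cat": "unit"}], rc=[],
                 oe=[{"cat": "length", "val": "50 feet"}])
```

The length-changing behaviour mentioned above falls out directly: `oe` simply replaces the mapped portion regardless of their relative lengths.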

3.2 The conceptual system and the linguistic components of the FST

The classification of words in the dictionary and the writing of the phrasal rules form a conceptual system. Or, rather, the linguistic work is based on some predefined principles of conceptual knowledge organization. Basically the rules are classified into the following groups (one for each level):

- Level 1: basic expressions, which describe concepts in regular forms, such as "age expressions", e.g. 20-year-old, "time expressions", e.g. 3 p.m., "lengths", e.g. three meters and fifteen centimeters, etc. These expressions often serve as descriptions of some objects.

- Level 2: object expressions, which describe objects in isolation from other objects or processes, but often accompanied by their own descriptions in the form of level-one phrases. For example, [Object: a [Length: 50 feet tall] saguaro]. In this example, an object phrase consists of an object key word, saguaro, accompanied by a length description phrase, 50 feet tall.

- Level 3: process or state expressions, which describe objects as they are involved in a process or a state. This is usually a noun phrase followed by an adjective phrase or an intransitive verb phrase. For example, [Process: [Object: a saguaro] does not flower and produce seeds until ...]. In this example, a process phrase consists of an object phrase and two action phrases. The following clause is left out because it would make the phrasal rule too complicated. Simple phrasal rules, while not powerful enough to extract complete knowledge, are sufficient for extracting topical concepts.

- Level 4: relation expressions, which describe relations between objects in a domain. It is to be noticed that not all nouns frequently used in a domain are objects at the domain level (that is, the basic level). Parts of objects, for example, are usually treated as subordinate to the basic-level objects and are used to form object expressions.
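The levels above can be illustrated with plain regular expressions. This is an assumption for illustration only: the real system uses the feature-based FST rules of Section 3.1, not string regexes:

```python
import re

# Illustrative level-1 and level-2 rules, written as string rewrites.

# Level 1: bracket basic "length expressions" such as "50 feet tall".
def level1(text):
    return re.sub(r"\b(\d+ (?:feet|meters) tall)\b", r"[Length: \1]", text)

# Level 2: bracket an object key word together with its level-1 description.
def level2(text):
    return re.sub(r"\ba (\[Length: [^\]]+\]) (\w+)\b", r"[Object: a \1 \2]", text)

phrase = "a 50 feet tall saguaro"
result = level2(level1(phrase))   # [Object: a [Length: 50 feet tall] saguaro]
```

The cascading order matters: level 2 only fires on the bracketed output of level 1, mirroring how each rule level consumes the previous level's output.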

Such a conceptual system, embedded in the dictionary and the phrasal rule organization, forms the basis of the natural language processor, the FST. Our experience has shown that such an approach is powerful enough for the purpose defined in our objective: high-performance indexing and information retrieval. At the same time, the phrasal rules are not too complex for the FST to work with. Considering that full sentence parsing is still not a technically solved problem, our approach applies NLP to practical applications without demanding too much of it.

3.3 Examples of indexing results

This subsection shows some examples of indexing a short piece of text about the American cactus, the saguaro. With these examples we show which topical concepts we consider important, how these concepts are matched by phrasal rules, and how they are expressed.

1. Example 1: the saguaro rises as high as 50 feet. Two rules, of the categories "Length" and "Process", match this input. The output topical concept is expressed as [domain: plant, object: cacti, species: saguaro, process: growth(height(50 feet))]. From this topical concept, the retrieval system will know what is being talked about: the cacti, the specific cactus (saguaro), its growth and, in particular, its height. These are useful for responding to user requests. In fact, since the data forms a complete piece of knowledge, this concept structure is also useful for building a true knowledge database.

2. Example 2: cacti are native to the Western Hemisphere. "Western Hemisphere" is stored as a single dictionary entry and has a basic "object" (type = location) semantic feature. The words are native to are matched by a rule of the category "Relation" to obtain a "location" sub-category topical concept: [domain: plant, object: cacti, relation: location, location(native): Western Hemisphere].

Through these two examples we have shown the basic functions of the indexing engine. The conceptual structures obtained by the FST are then organized into the topical concept database, ready to be used by the retrieval engine. The retrieval engine uses a simple match procedure to link the user's request with some of the topical concepts in the knowledge base. The user's request can be expressed directly in the same format as the topical concepts, or it may come as natural language sentences. In the latter case, an NLP front-end is used to transform the natural language query into conceptual structures. Again, the feature-based cascading FST is used for this purpose.
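The simple match procedure between a request and the stored topical concepts can be sketched as follows, using the two example concepts above. The storage identifiers (`doc_017`, `doc_018`) are hypothetical, and the matching criterion (every field the request specifies must agree) is our reading of "simple match":

```python
# Stored topical concepts from Examples 1 and 2, each tagged with a
# (hypothetical) media storage location.
stored = [
    {"domain": "plant", "object": "cacti", "species": "saguaro",
     "process": "growth(height(50 feet))", "storage": "doc_017"},
    {"domain": "plant", "object": "cacti", "relation": "location",
     "location(native)": "Western Hemisphere", "storage": "doc_018"},
]

def retrieve(request, database):
    """Return storage locations of topical concepts that contain every
    field=value pair the request specifies."""
    return [tc["storage"] for tc in database
            if all(tc.get(f) == v for f, v in request.items())]

# "Where are cacti native to?" expressed directly in concept format:
hits = retrieve({"object": "cacti", "relation": "location"}, stored)
```

A natural-language request would first pass through the FST front-end to produce the same concept-format dictionary before matching.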

4 Current Status of Our Project and Future Work

Currently our project is in the conceptual development stage. A conceptual dictionary is under construction, based on a 100,000-word English dictionary. We have defined object, relation, process and state concepts for more than 200 domains and sub-domains. Altogether, 3,000 words have been put into the conceptual dictionary. Some initial tests with more than 200 medium-length texts from 20 domains have been carried out. The results show a promising future for further development of the system. The immediate aim of our project is to provide high-performance indexing and retrieval systems for libraries and information service institutions. We see the project expanding in terms of both application scope and depth. First of all, we are investigating the possibility of using the software on the Internet. This leads to a system which can automatically index websites containing large quantities of information. We have looked at more than 50 websites which claim to be knowledge bases, but in fact only contain large quantities of text-based information without a powerful search engine. Our immediate plan is to let our system automatically index these text-based knowledge bases to give them a powerful search capability. Secondly, we are considering strengthening the conceptual system and the natural language processor, the FST, so that the knowledge obtained will be detailed enough to build true knowledge bases. This will lead to a system which transforms texts into knowledge bases usable by expert systems. This is what we have planned for the next stage of our research based on knowledge extraction from texts.

References

Appelt, Douglas E., Jerry R. Hobbs, et al., 1993: "The SRI MUC-5 JV-FASTUS information extraction system". Proceedings, Fifth Message Understanding Conference (MUC-5), Baltimore, 1993.

Bachrach, C. A. and Charen, T., 1978: "Selection of MEDLINE contents, the development of its thesaurus, and the indexing process". Med Inform, 1978; 3: 237-54.

Barr, C. E., Komorowski, H. J., et al., 1988: "Conceptual modeling for the unified medical language system". In Greenes, R. A. (ed.), Proceedings, The Twelfth Annual Symposium on Computer Applications in Medical Care, New York: IEEE Computer Society Press, 1988.

Cimino, J. J., Mallon, L. J. and Barnett, G. O., 1988: "Automated extraction of medical knowledge from MEDLINE citations". In Greenes, R. A. (ed.), Proceedings, The Twelfth Annual Symposium on Computer Applications in Medical Care, New York: IEEE Computer Society Press, 1988.

Cimino, J. J., Elkin, P. L. and Barnett, G. O., 1992: "As we can think: the Concept space and medical hypertext". Comput Biomed Res, 1992; 25: 238-63.

Harris, Z., 1988: Language and Information. Columbia University Press, 1988.

Hobbs, Jerry R., Douglas E. Appelt, et al., 1992: "Robust processing of real-world natural language texts". In P. Jacobs (ed.), Text-Based Intelligent Systems: Current Research and Practice in Information Extraction and Retrieval, Lawrence Erlbaum Associates, 1992.

Johnson, S. B. and Gottfried, M., 1989: "Sublanguage analysis as a basis for a controlled medical vocabulary". In Kingsland, L. C. (ed.), Proceedings, The Thirteenth Annual Symposium on Computer Applications in Medical Care, New York: IEEE Computer Society Press, 1989: 519-23.

Lehnert, Wendy, Claire Cardie, et al., 1991: "Description of the CIRCUS system as used for MUC-3". Proceedings, Third Message Understanding Conference (MUC-3), San Diego, 1991.

Yeap, Y. K. and J. Sun, 1996: "Extracting information from judges' oral reports of family law cases". Proceedings, 5th National/1st European Conference on Law, Computers and Artificial Intelligence, Exeter, England, 1996.