BioinQA: metadata-based multi-document QA system for addressing the issues in biomedical domain

Sparsh Mittal*
Department of Electrical and Computer Engineering, Iowa State University, 2215, Coover Hall, Ames, Iowa, 50011, USA
E-mail: [email protected]
*Corresponding author

Saket Gupta
Department of Electrical and Computer Engineering, University of Minnesota, 4-174, 200 Union St. SE, Twin Cities, Minneapolis, MN, 55455, USA
E-mail: [email protected]

Ankush Mittal
Department of Computer Science and Engineering, College of Engineering Roorkee, 7th KM, Roorkee, Hardwar Road, Vardhmanpuram, Uttarakhand, 247667, India
E-mail: [email protected]

Abstract: Despite the availability of a large amount of biomedical literature, extracting relevant information catering to the exact need of the user has been difficult in the absence of efficient domain-specific information retrieval tools. Biomedical question answering (QA) systems require special techniques to address domain-specific issues, since a wide variety of user-groups having different information needs, terminology and levels of understanding may access the information. While specialised information retrieval tools are not suitable for beginners, general purpose search engines are not intelligent enough to respond to domain-specific questions. This paper presents an intelligent QA system that answers natural language questions while adapting itself to the level of the user. The system constructs answers from multiple documents for complex comparison seeking questions. The system utilises metadata knowledge for addressing specific biomedical domain concerns like heterogeneity, acronyms, etc. Experiments performed show the superiority of the system over popular commercial search engines such as Google.

Keywords: question answering system; QAS; biomedical; metadata; multi-document QAS; evaluation metric MCRR; data mining; comparison; heterogeneity.


Reference to this paper should be made as follows: Mittal, S., Gupta, S. and Mittal, A. (2013) ‘BioinQA: metadata-based multi-document QA system for addressing the issues in biomedical domain’, Int. J. Data Mining, Modelling and Management, Vol. 5, No. 1, pp.37–56.

Biographical notes: Sparsh Mittal received his BTech in Electronics and Communications Engineering from the Indian Institute of Technology, Roorkee, India. He was the graduating topper of his batch. Currently, he is pursuing his PhD in Electrical and Computer Engineering at Iowa State University, USA. He has been awarded scholarships and fellowships from IIT Roorkee and ISU. His research interests include natural language processing, data mining and FPGA implementation using System Generator and VHDL.

Saket Gupta received his BTech in Electronics and Communications Engineering from the Indian Institute of Technology, Roorkee, India. Currently, he is pursuing his PhD in Electrical and Computer Engineering at the University of Minnesota, USA, in the field of VLSI CAD. He has worked on content-based retrieval, QA systems and other NLP applications for e-learning. He has also worked on MIMO communication systems, image processing, and FPGA synthesis and design using VHDL. He has been awarded many scholarships from IIT Roorkee and other institutions.

Ankush Mittal received his BTech and MS (research) from the Indian Institute of Technology Delhi, and PhD from the National University of Singapore (NUS). He served as a faculty member at NUS and the Indian Institute of Technology Roorkee from 2001 to 2009. Currently, he is Dean (Research) and Professor in the Department of Computer Science and Engineering, College of Engineering Roorkee, India. He has more than 60 papers in reputed international journals. He has been conferred several awards, including the IBM Faculty Award 2008, Outstanding Teacher Award 2008 (IIT Roorkee), Young Scientist Award 2006 (The National Academy of Sciences), and Young Scientist Award 2008 (Indian Science Congress Association). His research interests include video processing, e-learning, multicore computing and AI.

1 Introduction

Recent technological advancements in biomedicine and genetics have resulted in a vast amount of data growing at an ever increasing rate. For example, the PUBMED database of the National Library of Medicine (NLM) offers more than 14 million articles, and hundreds of thousands more are being added every year (http://www.ncbi.nlm.nih.gov). However, such huge repositories of data are useful only if they can be easily accessed and the contents retrieved as per user requirements, providing information in a suitable form (Afantenos et al., 2005). Modern search engines such as Google and Bing have a huge storehouse of information, but they require users to manually search the vast number of documents returned for their queries. Moreover, the structure and intent of the query, and not just the keywords, are very significant. For example, ‘how is snake poison employed in counteracting neurotoxic venom?’, ‘when is snake poison employed in counteracting neurotoxic venom?’ and ‘why is snake poison employed in counteracting neurotoxic venom?’ all have different meanings. Search engines that retrieve documents based on keywords alone cannot differentiate between these questions. Zadeh (2006) discusses the possibility of upgrading existing search engines to


QASs using state-of-the-art tools based on bivalent logic and probability theory. He mentions the obstacles in such a task and also outlines a collection of non-standard concepts, ideas and tools which lead to the addition of deduction capability to search engines.

Today, data mining finds use in diverse areas of science, technology and commerce. Malerba (2008) discusses spatial data mining techniques, which find application in robotics, remote sensing, etc., and Kumar and Ravi (2008) apply data mining techniques to address the issue of customer credit card churn prediction in banks.

There are a few research groups working on medical, domain-specific QA (Zweigenbaum, 2003). Biomedical text has complex characteristics, such as intricate technical terms which are highly domain specific (Schultz et al., 2002). Term order variations and abbreviations are also common (Song et al., 2004), with only a few definitional questions (Yu et al., 2007), as mostly analytical explanations are needed in this domain. Specialised information retrieval systems (such as PUBMED) are generally used by researchers and experts. However, beginners prefer to use general information retrieval systems (such as Google), which suffer from the inherent drawbacks of open domain information retrieval systems. Song et al. (2004) and Jacquemart and Zweigenbaum (2003) extensively discuss the feasibility of medical domain question answering systems (QASs) and the limitations of open domain question answering systems when applied to the biomedical domain. Furthermore, the difference in jargon used by a novice and an advanced user causes heterogeneity. Thus, a need arises for the system to intelligently present an answer appropriate to the level of understanding of the user (see Section 2). Our QAS assigns the task of resolving differences between the terms and formats used by the user and those in the corpus entirely to the system. Thus, a user is not required to remember the complex jargon of the subject.

A lot of progress has been made in the field of data mining in recent years. Jing et al. (2009) present a subspace k-means clustering algorithm for high-dimensional data. They add a penalty term to the objective function of the fuzzy k-means clustering process to enable several clusters to compete for objects, which leads to the identification of the ‘true’ number of clusters and consistent clustering results. Plantevit et al. (2009) address the problem of biomedical named entity recognition. They explore the use of mining techniques such as sequential pattern mining and sequential rule mining and discuss their limitations. For tackling the task of named entity recognition, they also introduce the LSR pattern, which enables one to effectively exercise the trade-off between the high precision of sequential rules and the high recall of sequential patterns. In our work, we use a clustering technique for the calculation of ECRScore (see Section 4).

To increase the robustness of tasks such as search, clustering, etc., researchers prefer to use multiple techniques and data sources (corpora) to bring out the best of them. The collection and integration of information from multiple data sources has recently attracted a lot of attention. Subramanya et al. (2008) discuss a view completion approach for multiple sources of web media based on canonical correlation analysis that heuristically predicts the missing views and also ranks all within-view features. They have shown the suitability of such an approach through experiments on web page classification and photo tag recommendation. Similarly, Chrysostomou et al. (2008) propose a data mining method called wrapper-based decision trees (WDT), which combines different classifiers to overcome the biases of individual classifiers and uses decision trees to classify selected features. It is found that by choosing a suitable subset of classifiers, a high degree of accuracy can be obtained. Along similar lines, we have used


multiple data sources (documents) to answer comparison seeking questions, since the complete answer to such questions may not be present in a single document.

To summarise, the contributions of our paper are as follows:

1 In the context of comparison seeking questions, our system is the first system (to the best of our knowledge) to answer such questions by integrating suitable answer segments from multiple documents. Our technique is generic and can be extended to the case of similarity seeking questions and many more. Our QAS paves the way for the design of next-generation multi-document QASs.

2 Information integration of bioinformatic data has become a vital problem. This QAS has inbuilt tools for resolving and mitigating the semantic (essential) heterogeneity problem through the use of bioinformatics metadata, using techniques such as the concept relation graph (CRG).

3 Various weighting and ranking schemes are employed to provide the user with the information that is most important to him (see Sections 3 and 4). We introduce a new ranking scheme called mean correlational reciprocal rank (MCRR), which finds its use in evaluating a broad realm of ‘correlational’ questions.

Other salient features include the integration of diverse resources like scanned books, doc files, PowerPoint slides, etc. (which have different information and presentation methods), and the capability to answer a wide variety of questions in addition to simple definitional questions (e.g., ‘How does Tat enhance the ability of RNA polymerase to elongate?’).

The rest of the paper is organised as follows: Section 2 reviews existing work in the biomedical field, related areas and the inherent bottlenecks. Section 3 describes the operational facets of the QAS. Section 4 presents the multi-document extraction problem and the framework of our solution. Section 5 describes our implementation to resolve the heterogeneity problem with the use of metadata. Section 6 discusses experiments and results. Section 7 gives conclusions and briefly discusses future research.

2 Literature review

2.1 Question answering systems

In recent years, a lot of research work has been done in the field of QAS design. Zweigenbaum (2003) presents an overview of the role and importance of QASs in biomedicine. The Text Retrieval Conference (TREC), one of the major forums for discussions related to QASs, has included a track on genomics. EQueR, the French evaluation campaign of QASs, was the first to provide a medical track (Ayache et al., 2006). Compared to other domains, the biomedical domain has some unique characteristics which make the development and deployment of a robust QAS both imperative and challenging (Niu et al., 2003). In the literature, several techniques for answering biomedical questions have been proposed. Some of these include answering by role identification (Niu et al., 2003; Sackett et al., 2000) and answering based on document structure (Sang et al., 2005). Cohen and Hersh (2005) present a survey of these works. Pomerantz (2005) reviews literature from different fields to identify the question taxonomies existing in those


literatures. These taxonomies allow the classification of questions as one of the first steps in the question analysis phase. A QAS restricted to the genomics domain was developed by Rinaldi et al. (2004). They adapt an open domain QAS to answer genomic questions, with emphasis on identifying term relations based on a linguistically rich full parser.

In a study conducted with a test set of 100 medical questions collected from medical students in a specialised domain, a thorough search in Google was unable to obtain relevant documents within the top five hits for 40% of the questions (Jacquemart and Zweigenbaum, 2003). Moreover, due to busy practice schedules and the need to answer questions swiftly, physicians spend less than two minutes on average seeking an answer to a question. Thus, most clinical questions remain unanswered (Ely et al., 1999).

As an alternative to the existing QASs for the biomedical domain, Makar et al. (2008) proposed a service oriented architecture. This is a method for designing and developing system solutions with a view to enabling reuse of existing resources, incremental integration and ease of maintenance. Their system can be accessed by both web and mobile application clients, allowing convenient system access. The question is classified using Bayesian classifiers. The system internally searches Wikipedia and Google to obtain a hit list; these answers are subsequently summarised using the Classifier4J summariser (http://classifier4j.sourceforge.net/) and presented to the user with the respective URLs. However, their performance measurements are shown for very simple definitional questions (such as ‘what is cerebral palsy’), which relate primarily to simple information rather than to real-life scenarios. Moreover, the focus of their work is primarily on developing the architecture and a flexible interface, and less on implementing a thorough QAS addressing the specific question types occurring in medical QASs.

Ely et al. (1999) proposed a taxonomy of generic clinical questions and studied a set of 1,396 questions collected from more than 150 physicians. The question collection is freely accessible at http://clinques.nlm.nih.gov. MedQA (Yu et al., 2007) is a biomedical QAS which caters to the needs of practicing physicians. However, it is still limited by its ability to answer only definitional questions. Niu et al. (2003) present their work as part of the evidence at point of care (EPoCare) project. The system works by identifying keywords and retrieving documents based on keyword matching. Their data sources include the reviews of experimental results for clinical problems published in clinical evidence (CE) and evidence-based on call (EBOC). The system accepts queries in the PICO format (Sackett et al., 2000). In this format, a clinical question is represented by four basic elements, namely P (patient description), I (intervention), C (comparison or control intervention) and O (clinical outcome). These ‘roles’ are first located in both the question and the corresponding candidate answers. Such identification enables the system to overcome the limitations of the information extraction techniques (named entities) used in general QASs, as the important information in medical text is captured by the PICO fields. These roles are also helpful in identifying the relationships between sentences.

A biomedical QAS named internet doctor (INDOC) is presented by Sondhi et al. (2007). The system first indexes the entire document set using the indexing module. The question of the user is processed to recognise the difference in significance of different parts of the query, and the ranking module ranks the documents by assigning weights on the basis of their relevance. To deal with the problem of complex biomedical terminology, UMLS concepts have been used instead of keywords. The text parsing and mapping to UMLS concepts are performed by


MMTx, which is a programming implementation of MetaMap. Their indexing algorithm indexes the entire document (not just important keywords) in the form of sections. To reduce the user’s effort, the final document set is clustered using k-means clustering. Their system is evaluated on the standard OHSUMED collection. However, the system cannot solve the problems of heterogeneity, anaphora resolution (an anaphor is an expression referring to another) and acronym expansion.

2.2 The heterogeneity problem

Existing systems are not flexible enough to adapt themselves to the knowledge level and requirements of a user. It is impractical to assume that different user-groups (such as novices, researchers, etc.) would agree to common standards in the usage of technical terms, since different users do not understand concepts or data relationships in an identical manner. This problem is known as heterogeneity. Heterogeneity of metadata is generally either of an ‘accidental’ or ‘essential’ nature. Accidental heterogeneity arises from the use of different formats and representation systems (e.g., XML, flat file, or other formats) and can be solved through translation systems, which perform format conversion. Essential heterogeneity, also called semantic heterogeneity, arises from using varied vocabulary to describe similar concepts or data relationships, or from using the same metadata to describe different concepts or data relationships. Li et al. (2005) semantically integrate metadata in bioinformatics data sources. The mediator/wrapper-based strategy (Chen et al., 2004; Stoimenov et al., 2000) has not been widely successful because it solves the problem reactively, after it occurs (which is more difficult).

3 System architecture

Figure 1 shows our system block diagram; in what follows, we present a description of our system. Our QAS is based on searching the entities of the corpus in context, for effective extraction of answers. The system recognises entities of the course material using a link parser. This is especially useful in the biomedical domain, where extended terms of the lexicon (e.g., nucleocapsid, immunoglobulin, ultrasonography, etc.) are classified as entities. The question is parsed using the link parser during question parsing. Query formulation translates the question into a set of queries that is given as keyword input to the retrieval engine. We used Seft for context-based retrieval and answer re-ranking methods. In this QAS, the link grammar parser determines the syntactic structure of the question, to extract part-of-speech information. The question classifier then uses pattern matching based on wh-words (such as when, which refers to an event; why, the reasoning type; etc.) and simple part-of-speech information to determine question types (Kumar et al., 2005). Questions seeking comparison may need the answer to be extracted from more than one passage or document. Such questions are dealt with separately using our multi-document extraction formulation (Section 4).
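To make the classification step concrete, the following minimal Python sketch applies wh-word pattern matching of the kind described above. The category names, patterns and their priority order are illustrative assumptions, not the system's actual rule set.

```python
import re

# Illustrative question-type rules keyed on wh-words and cue phrases.
# Comparison cues are checked first so that a comparison seeking
# question is routed to the multi-document formulation of Section 4.
QUESTION_TYPES = [
    ("comparison", re.compile(r"\b(difference|differentiate|contrast|compare)\b", re.I)),
    ("event",      re.compile(r"^\s*when\b", re.I)),
    ("reasoning",  re.compile(r"^\s*why\b", re.I)),
    ("procedure",  re.compile(r"^\s*how\b", re.I)),
    ("definition", re.compile(r"^\s*what\s+is\b", re.I)),
]

def classify_question(question: str) -> str:
    """Return the first matching question type, or 'other'."""
    for qtype, pattern in QUESTION_TYPES:
        if pattern.search(question):
            return qtype
    return "other"

print(classify_question("Why is snake poison employed in counteracting neurotoxic venom?"))
# -> reasoning
print(classify_question("Differentiate between flu and anthrax symptoms."))
# -> comparison
```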

Figure 1 System flow diagram

In the next step, the question focus is identified by finding the object of the verb. Importance is given to the question focus by assigning it more weightage during retrieval of answers. Quite logically, the answers are most appropriate when there is a local similarity between the text and the query. For example, for the question ‘Is non-myeloablative allogeneic transplantation feasible in patients having HIV infection?’, the query terms ‘non-myeloablative’, ‘allogeneic’, ‘transplantation’, etc., have a local similarity which is identified in the text by a locality-based similarity algorithm. The contribution of each occurrence of each query term is summed to arrive at a similarity score for any particular location in any document in the collection. The software tool Seft (Kretser and Moffat, 2000) matches the accuracy of conventional information retrieval systems, and is also fast enough to be useful for up to hundreds of megabytes of text.

The query formulation module finds query words from the question for providing input to the retrieval engine. The system constructs a hash table of the entities identified from the question based on the entity file, which is built from either the table of contents or the index or glossary of the biomedical corpus. These keywords (entities) are considered most important and are given the maximum weight. BioinQA also addresses the key issues of solving the heterogeneity problem, acronym expansion and understanding the user’s implicit assumptions in the answer extraction module (detailed in Section 5). BioinQA then performs phrase matching-based re-ranking by searching for occurrences of the noun phrases (identified by the question parser above). After phrase matching, the system processes the passages according to the classification done in question classification.
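The entity-weighting idea can be sketched as follows; the entity file contents, stopword list and weight values below are illustrative assumptions rather than the system's actual configuration.

```python
import re

# A minimal sketch of query formulation: entities found via the entity
# hash table (built from the corpus index/glossary) get the maximum
# weight; other content words get a lower one.
ENTITY_FILE = {"non-myeloablative", "allogeneic", "transplantation", "hiv"}
STOPWORDS = {"is", "in", "the", "of", "a", "feasible", "patients", "having"}

def formulate_query(question: str) -> dict:
    """Map each query word to a retrieval weight."""
    weights = {}
    for token in re.findall(r"[a-z0-9-]+", question.lower()):
        if token in STOPWORDS:
            continue
        # entities from the corpus entity file dominate retrieval
        weights[token] = 3.0 if token in ENTITY_FILE else 1.0
    return weights

print(formulate_query(
    "Is non-myeloablative allogeneic transplantation feasible "
    "in patients having HIV infection?"))
# entities carry weight 3.0; 'infection' falls back to 1.0
```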

4 Multi-document retrieval

To answer questions of much greater complexity than simple definitional questions, simple techniques of searching for the answer in a single document will not be enough. Generally, the full answer may not be found in one place, and the system may need to construct the answer by combining answer snippets found at multiple locations, which may be in the same or even different documents. An example of such questions is difference seeking questions, where the explanations of two different entities generally lie at different places. For example:

• What is the difference between introns and exons?

• Contrast between lymphadenopathy and leuko reduction.

• Differentiate between flu and anthrax symptoms.

• What is the difference between nystatin and diflucan as a therapy in autism?

We have developed a novel ‘segregate algorithm’ that maps the two separate ingredients (components) of the question (for example, ‘Nystatin’ and ‘Diflucan’) to their respective information documents. The question is parsed to segregate the different components of the question. In most cases, comparison seeking questions follow a general trend: they are identified by general words such as ‘contrast’, ‘differentiate’, ‘difference’, ‘compare’, etc. Thus, identification of components is mostly straightforward, based on pattern matching. However, for complex comparison seeking questions, words like ‘contrast’, etc., may be absent. For such questions, deeper question parsing is performed: first, to check whether the question is a comparison seeking one, and if so, second, to identify the two different components. Our approach is generic and will also work on query sentences involving more than two components. Documents are then processed to extract these components (if present) and the top n documents thus obtained are re-ranked based on passage sieving (described next).
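A hedged sketch of this component identification follows. It handles only the straightforward cue-word case on a 'between ... and ...' pattern; the deeper parse for cue-less questions is not reproduced, and the regular expressions are assumptions for illustration.

```python
import re

# Detect a comparison cue, then split the question into its two
# components around 'between ... and ...'.
CUE = re.compile(r"\b(difference|differentiate|contrast|compare)\b", re.I)
BETWEEN = re.compile(
    r"between\s+(.+?)\s+and\s+(.+?)(?:\s+(?:as|in)\b.*)?[?.]?$", re.I)

def segregate(question: str):
    """Return the two components of a comparison seeking question, or None."""
    if not CUE.search(question):
        return None  # deeper question parsing would be needed here
    m = BETWEEN.search(question)
    return (m.group(1), m.group(2)) if m else None

print(segregate("What is the difference between nystatin and diflucan as a therapy in autism?"))
# -> ('nystatin', 'diflucan')
```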

4.1 Entity cluster matching-based passage sieving

The obtained passages will depict a contrast most accurately when their parameters or entity clusters (linked lists of the entities of a passage along with their frequency of occurrence in that passage) are very similar (e.g., the possible parameters for comparing medicines would be duration of action, dosage, ingredient levels, side-effects, etc.). Thus, re-ranking is performed by generating such entity clusters for each document and


matching them. The link parser in the system recognises the entities of the passages and matches those of one component against those of the second by employing the following procedure. Let $G_{i,n}$ be the entity cluster set of the $n$th answer in the $i$th component, where $1 \le i \le 2$, $1 \le n \le 10$. In other words, $G_{i,n}$ is the collection (set) of the entities present in the $n$th answer, and such a set is found for both ($1 \le i \le 2$) components. The score obtained from the entity cluster matching-based re-ranking (ECR) algorithm, ECRScore, is given by

$$\mathrm{ECRScore}_{i,n} = \sum_{k=1}^{10} C_{i,n,k}, \quad 1 \le i \le 2,\; 1 \le n \le 10.$$

Here $C_{i,n,k}$ is the similarity function and is defined as

$$C_{i,n,k} = G_{i,n} \cap G_{j,k} = G_{j,k} \cap G_{i,n}, \quad i, j \in \{1, 2\},\; i \ne j$$

The operator $\cap$ is used to match the entities present in both of its operands for measuring the similarity between them. More precisely, the result of the $\cap$ operation equals the number of entities common to both of its operands. $C_{i,n,k}$ is a non-negative number, since the two entity cluster sets on which the $\cap$ operation is performed may have few or no elements in common. Now, the $\mathrm{FinalScore}_{i,n}$ of all the passages is calculated as

$$\mathrm{FinalScore}_{i,n} = w_1 \times \mathrm{CurrentScore}_{i,n} + w_2 \times \mathrm{ECRScore}_{i,n}, \quad \text{where } w_1 + w_2 = 1$$

Here $\mathrm{CurrentScore}_{i,n}$ is the score of the passage obtained from the answer selection phase; in other words, it is the score of the passage before re-ranking. Thus, at the end of the answer selection phase, ranking of the passages is performed based on $\mathrm{CurrentScore}_{i,n}$, and at the end of entity cluster matching-based passage sieving, ranking of the passages is performed based on FinalScore. $w_1$ and $w_2$ are weights given to the scores to incorporate the contribution of both modules and are chosen (empirically) in our system to be 0.7 and 0.3 respectively. The reasoning behind the selection of these values is that a larger value of $w_1$ compared to $w_2$ gives more weightage to the scores assigned in the answer selection phase. We are conducting more tests to experimentally verify the choice of these values. Finally, for $1 \le i \le 2$, $1 \le n \le 10$, the answer passages are ranked according to their FinalScore, and up to the top five passages are presented to the user.

The utility of our entity cluster matching-based re-ranking can be explained by the following sample question. In a comparison seeking question, it is expected that different answers explaining the same aspect of the entities being compared should be presented as a single answer, ensuring maximum clarity and information content for the user and thus achieving the goal of providing the most relevant answer. For example, for a question such as ‘what is the difference in composition of cow milk and buffalo milk?’, the components being compared are ‘cow milk’ and ‘buffalo milk’. A meaningful comparison would describe and contrast the entities common to both components, such as ‘fat contents’, ‘protein levels’, ‘number of essential amino acids’, ‘immunity developing properties’, ‘casein amounts’, etc. Thus, it makes perfect sense to expect an answer such as “cow milk has conjugated linoleic acid (CLA), which is anti-carcinogenic. Buffalo milk has 23% less CLA than cow milk”; an answer such as “cow milk has conjugated linoleic acid (CLA), which is anti-carcinogenic. Buffalo’s milk is now more prominent due to

less availability of cow milk”, would not make much sense. The ECRScore gives a higher ranking to those sub-answer pairs where such meaningful entities are identified.

Figure 2 Sample output for the question ‘what is the difference between glycoprotein and lipoprotein?’ (see online version for colours)

Our approach is quite general and can be applied to questions other than comparison seeking ones, such as ‘What are the similarities between Alzheimer’s disease and Parkinson’s disease?’ A characteristic of such ‘correlational’ questions is that the sub-queries are somehow related (‘parallel’ or ‘anti-parallel’). Both the ECR method and the MCRR evaluation metric (described in Section 6) can be used in this case as well. However, the interaction between the sub-queries must be clearly identified, and effective integration of multiple pieces of data is required to present the answer in the proper context, answering the exact need of the user.
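The scoring pipeline above can be rendered directly from its definitions. In this minimal Python sketch, G1 and G2 are lists of entity sets for the two components; the entity sets and CurrentScore values in the toy example are invented for illustration only.

```python
# Entity cluster matching-based re-ranking (ECR) per the definitions
# above: C counts shared entities between clusters of the two
# components, ECRScore sums C over the other component's candidates,
# and FinalScore blends the answer-selection score with ECRScore
# using the empirically chosen weights w1 = 0.7, w2 = 0.3.
W1, W2 = 0.7, 0.3

def ecr_scores(G1, G2):
    """G1, G2: lists of entity sets for components 1 and 2."""
    scores1 = [sum(len(g & h) for h in G2) for g in G1]
    scores2 = [sum(len(g & h) for h in G1) for g in G2]
    return scores1, scores2

def final_scores(current, ecr):
    return [W1 * c + W2 * e for c, e in zip(current, ecr)]

# Toy example with two candidate passages per component:
G_cow = [{"fat", "protein", "cla"}, {"availability", "price"}]
G_buffalo = [{"fat", "cla", "casein"}, {"fat", "protein"}]
ecr1, _ = ecr_scores(G_cow, G_buffalo)
print(final_scores([0.9, 0.8], ecr1))
# the passage sharing comparison entities ranks higher
```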

5 Semantic heterogeneity resolution through metadata

Bioinformatics is a multidisciplinary field. Today, it is studied by general users, students, researchers and medical practitioners. Thus, the QAS should be adaptive to their different needs. In order to bridge the gap between the levels of understanding of an experienced researcher and a novice, and to address other bottlenecks of the system, we propose the following steps.

5.1 Utilisation of scientific and general terminology

It is a common observation that every user accesses the knowledge source depending on his/her needs and current level of comprehension. Biology and medicine are scientific fields where a wide variety of terminologies co-exist, each of which is likely to be used by users. For example, a non-biology student is not likely to access information by the keyword ‘homosapiens’, but by ‘humans’. The system should thus respond accordingly. We allow this choice to be exercised by the user himself, by using the system for either novice search or advanced user search (see Figure 3).

Figure 3 Different outputs for advanced and novice users respectively, and a display of acronym expansion (see online version for colours)


We thus develop a novel advanced-and-learner-knowledge-adaptive (ALKA) algorithm, which performs selective re-ranking (of the initial n = 10 passages) based on these principles: researchers use scientific terms and terminologies more frequently. These may also include equations, numeric data (numbers, percentage signs) and words of large length, such as exoribonucleases. Thus, documents relevant for researchers will include such terms with a higher frequency, because this actually fulfils the need of the user, whereas those meant for the novice would include simple (short length) words with fewer numbers or equations. The entity file of the corpus constructed in the initial phase is configured to classify the terms as either ‘biological’ (e.g., Efavirenz), ‘scientific’ (e.g., homosapiens) or ‘general’ (e.g., human), using metadata information. If a passage contains many scientific terms occurring frequently, it is given a lower rank for the novice, and a higher one for the advanced user. This technique can be extended to accommodate a wider variety of user-groups.
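A minimal sketch of the ALKA principle follows; the term list, the word-length and digit heuristics, and the scoring function used here are illustrative assumptions, not the algorithm's actual parameters.

```python
import re

# Score a passage by the frequency of scientific/biological terms,
# numeric data and long words, then rank passages: most technical
# first for advanced users, least technical first for novices.
SCIENTIFIC_TERMS = {"homosapiens", "exoribonucleases", "efavirenz"}  # illustrative

def technicality(passage: str) -> float:
    tokens = re.findall(r"[A-Za-z0-9%]+", passage.lower())
    hits = sum(1 for t in tokens
               if t in SCIENTIFIC_TERMS
               or len(t) > 12
               or any(c.isdigit() for c in t))
    return hits / max(len(tokens), 1)

def rank_for_user(passages, user_level):
    return sorted(passages, key=technicality,
                  reverse=(user_level == "advanced"))

docs = ["Exoribonucleases degrade RNA in homosapiens (35% of cases).",
        "Humans have enzymes that break down RNA."]
print(rank_for_user(docs, "novice")[0])
# -> the simpler passage is ranked first for the novice
```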

5.2 Use of acronyms

Acronyms are of great importance in a field like biomedicine, where precise scientific terms are used and any error introduced by the requirement of typing long names can be critical. A solution to the problem of acronyms will not only save users’ search time but also relieve them of the burden of remembering long scientific names to the accuracy of a single character. Thus, in order to make the system adaptive, we employ manually built acronym lists to resolve the differences in meaning caused by the use of an acronym in one place and its expanded form in another. Many acronym lists have been compiled, published and made available on the web (e.g., acronym finder and canonical abbreviation/acronym lists). As the purpose of this study was to demonstrate the use of information about expansions of acronyms in enhancing the answers obtained from a QAS, the use of a manually built acronym list is justified.
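A sketch of this acronym handling under the stated assumptions: the dictionary entries below are illustrative, not the manually built list actually used by the system.

```python
# Expand query terms with the long forms from a manually built acronym
# list, so passages using either the acronym or its expansion match.
ACRONYMS = {
    "cla": "conjugated linoleic acid",
    "hiv": "human immunodeficiency virus",
    "crg": "concept relation graph",
}

def expand_acronyms(query_terms):
    """Return the query terms plus the words of any known expansions."""
    expanded = set(query_terms)
    for term in query_terms:
        if term.lower() in ACRONYMS:
            expanded.update(ACRONYMS[term.lower()].split())
    return expanded

print(expand_acronyms(["HIV", "transmission"]))
# -> {'HIV', 'transmission', 'human', 'immunodeficiency', 'virus'}
```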

5.3 Comprehending the implicit assumptions of the user

In a personal conversation, it is commonly observed that a user’s question rarely contains the full information required to answer the question (e.g., the speaker assumes that the listener knows the current time, date, etc., and hence does not include these in the question; however, this information may be required to properly answer the question). Rather, it essentially contains many unstated assumptions and also requires extending or narrowing the meaning of the question to either broaden or shorten the search. This is actually the case in real life for humans, as their conversations hardly include full detail, but leave many things for the listener to assume. For example, a user may ask ‘What ratio of non-vegetarian food in a nutritional diet causes colorectal cancer?’ It is up to the system to decide between the different types of colorectal cancer (adenocarcinoma, leiomyosarcoma, lymphoma and melanoma) to fully answer the question.

To handle such questions, the system uses a concept relation graph (CRG). The CRG is built using metadata information and is a ‘one-to-many’ relation graph representation of concepts and data of the biomedical domain (the nodes of the graph represent the entities of the domain and the edges of the graph represent a relationship of ‘is a variant of’ or ‘similarity’ between them). The CRG is a bipartite graph with one set (say L) containing original entities (with probably incomplete information to answer the question) and another set (say R) containing entities (with sufficient information). For example, in the above question, the concept of colorectal cancer will be related to the four possible variants, namely adenocarcinoma, leiomyosarcoma, lymphoma and melanoma. The CRG is meant to determine the missing information required to answer the question or to remove the ambiguity from the question. A question is considered ambiguous if it contains a term which appears in set L. To answer such a question, the user can either be prompted to supply more information or to choose a variant from the CRG, or the system can still answer the question without any change. In the latter case, the user is presented with the answer along with the knowledge of the CRG. The user can then choose to take the help provided by the CRG. If more precise information about the background of the user is available, the system can be configured to provide a unique and unambiguous answer by selecting just one specific entity from the set R of the CRG (e.g., if country information about the user is available, a specific answer can be given to a question such as ‘What is the state-of-the-art facility for cancer treatment in our country?’). The use of this approach paves the way for the development of a ‘friendly’ QAS, which saves the user from having to enter elaborate information in the question. Figure 3 shows the difference in the levels of answers obtained for a novice and an advanced user, with acronym expansion. The user is given the choice to specify his desired level of understanding, and an answer according to his needs is presented to him.
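A minimal sketch of a CRG lookup, assuming the bipartite 'is a variant of' mapping described above; the single entry shown is the paper's own colorectal cancer example, and the function name is hypothetical.

```python
# Set L terms (under-specified question concepts) map to their set R
# variants, which carry enough information to answer the question.
CRG = {
    "colorectal cancer": ["adenocarcinoma", "leiomyosarcoma",
                          "lymphoma", "melanoma"],
}

def disambiguate(question_terms):
    """Return the variants a user could be prompted with for each ambiguous term."""
    return {t: CRG[t] for t in question_terms if t in CRG}

print(disambiguate(["colorectal cancer", "diet"]))
# -> {'colorectal cancer': ['adenocarcinoma', 'leiomyosarcoma', 'lymphoma', 'melanoma']}
```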

6 Experimentation

As a sample resource, abstracts were taken from PUBMED to experiment on the system. Difference seeking questions are not generally available to be used as test questions, unlike in open domain evaluations, where test questions can be mined from question logs (Encarta, Excite, AskJeeves). For collecting the questions, we asked a group of students studying in this area to construct questions which could be answered from the corpus. The group comprised beginners as well as sophomores. The set of questions includes 40 normal questions and 20 difference seeking questions. For each normal question, the system presents up to five top answers to the user, and for difference seeking questions, three answers are presented to the user.

6.1 Comparison of BioinQA with the Google search engine

To assess the utility of our system, we compared BioinQA with the most popular search engine, Google. BioinQA has the ability to answer both normal and comparison seeking questions. The primary reason and justification behind the choice of Google is that we expect BioinQA to be used by a wide variety of users, just like Google. Hence, it is worthwhile to see how well it compares with a general purpose and popular search engine. Moreover, QASs with the capability of providing multi-document answers in response to comparison seeking questions are generally not very popular. For experimentation, each of the questions was posed to both Google and BioinQA. For Google, five documents were checked for the presence of the answer in them, and for BioinQA, the first five answers were checked.


6.1.1 Evaluation metrics

For general questions we used the popular metric mean reciprocal answer rank (MRAR) suggested in TREC (Aloisio et al., 2005) for the assessment of QASs, which is defined as follows:

$$RR = \frac{1}{rank[i]}, \qquad MRAR = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{rank[i]}$$

where $n$ is the number of questions and $RR$ is the reciprocal rank. For the evaluation of comparison-based questions, no metric has been suggested in the literature. To evaluate BioinQA’s performance on such questions, a novel metric called mean correlational reciprocal rank (MCRR) was adopted, defined as follows: let $rank1$ and $rank2$ be the ranks of the correct answers given by the system for the two components respectively. Then

$$MCRR = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{rank1[i] \times rank2[i]}$$

where $n$ is the number of questions. If the answer to a question is not found in the passages presented to the user, then the rank of that question is taken to be $Z$, whose value is large compared to the number of passages. For the calculation of MRAR, $Z$ is taken as $\infty$. To calculate MCRR, $Z$ is taken as a much smaller value, since this avoids punishing the case where the system provided the answer to only one of the components. In our experiments we took $Z$ as 10.

The use of MCRR can be justified on the following grounds:

1 It is defined very similarly to MRAR.

2 It is symmetric w.r.t. the objects being compared, so it treats ‘difference between A and B’ and ‘difference between B and A’ as the same.

3 The answer to a comparison seeking question is complete only when both components (e.g., lipoprotein and glycoprotein) are described, not just one. So, it punishes answers where only one component has been answered. This approach is also better than other possible alternatives, such as taking the average of the individual MRARs for the two components.

4 In the extreme case where one of the two entities is always found perfectly, $Z = 1$ and MCRR approaches MRAR. This is the case where the question is a normal question and only a single component is searched. As the other extreme, if $Z$ is taken as some large value (say $L$) and the answer to one component is always missing, then the MCRR value equals the single-component MRAR divided by $L$; thus, due to the absence of one component, the measured performance of the system is greatly reduced.

5 The flexibility of choosing the $Z$ value gives the designer freedom to experiment with the system. Moreover, the $Z$ value can also be chosen differently for the first and second components, if desired.
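Both metrics can be computed directly from these definitions, as shown in the sketch below. It follows the paper's conventions (1-based ranks, Z = ∞ for MRAR so a missing answer contributes zero, Z = 10 for MCRR); the rank data in the usage lines are invented for illustration.

```python
# MRAR and MCRR per the definitions above. A rank of None marks a
# question (or component) whose answer was not found.
Z_MCRR = 10  # Z value used in the paper's experiments

def mrar(ranks):
    # Z = infinity: a missing answer contributes 0 to the sum
    return sum(0.0 if r is None else 1.0 / r for r in ranks) / len(ranks)

def mcrr(rank_pairs):
    total = 0.0
    for r1, r2 in rank_pairs:
        r1 = Z_MCRR if r1 is None else r1
        r2 = Z_MCRR if r2 is None else r2
        total += 1.0 / (r1 * r2)
    return total / len(rank_pairs)

print(mrar([1, 2, None]))          # (1 + 0.5 + 0) / 3 = 0.5
print(mcrr([(1, 1), (2, None)]))   # missing component is punished, not zeroed
```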


6.1.2 Results

We calculated MRAR and MCRR for our system and the Google search engine. For both metrics, the higher the value, the better the system. Table 1 and Figure 4 summarise the results of our experiments. The higher values of both metrics for BioinQA show the superiority of our system over the Google search engine.

Table 1 Experimental results of BioinQA and Google on the dataset

            MRAR      MCRR
BioinQA     0.7333    0.3096
Google      0.6328    0.2195

Figure 4 Plot of (a) MRAR vs. % of questions asked and (b) MCRR vs. % of questions asked


Figure 5 No answer to the Google search for the question ‘what is the difference between lipoprotein and glycoprotein?’ (see online version for colours)

Figure 6 No answer by the AnswerBus QA system for the question ‘what is the difference between lipoprotein and glycoprotein?’ (see online version for colours)


Figure 7 No answer by the Answers.com QA system for the question ‘what is the difference between lipoprotein and glycoprotein?’ (see online version for colours)

Figure 8 No answer by the START QA system for the question ‘what is the difference between lipoprotein and glycoprotein?’ (see online version for colours)


6.2 Evaluation of the results

We argue that the evaluation was inherently biased in favour of Google. As opposed to BioinQA, where answers were accepted as correct if and only if the answer was contained in the ‘passage’ itself (which is what is presented to the user), the answers from Google were taken as correct if the answer was found anywhere in the ‘document’. As expected, for Google the authors had to manually search the whole document to find whether it contained an answer. This makes the user effort exorbitantly large for Google. Moreover, the strategy used by Google completely fails for comparison seeking questions. Figures 5, 6, 7 and 8 show the responses of a few QASs and search engines to the sample question. We also present the responses of a few other commercial QASs to show the utility of our segregate algorithm.

7 Conclusions and future work

BioinQA effectively addresses several important issues arising in the biomedical domain and shows the way to answer complex questions arising in real-life scenarios. Our technique is generic and can be extended to the case of similarity seeking questions and many more. This paper also proposes a novel candidate passage re-ranking scheme called ECR and an evaluation metric called MCRR. The use of metadata (e.g., the CRG) to understand the implicit assumptions of the user, accommodate acronyms and answer the question based on the expertise of the user (rather than giving fixed answers for every user, irrespective of their background) makes the system adaptive to the needs of the user. Experiments performed against commercial search engines such as Google show the superiority of the system. For future work, systems can be built which extend the multi-document concept to include answers from multiple media, namely audio and video, along with text answers.

References

Afantenos, S., Karkaletsis, V. and Stamatopoulos, P. (2005) ‘Summarization from medical documents: a survey’, Artificial Intelligence in Medicine, Vol. 33, No. 2, pp.157–177.

Aloisio, G., Cafaro, M., Fiore, S. and Mirto, M. (2005) ‘ProGenGrid: a workflow service infrastructure for composing and executing bioinformatics grid services’, Proceedings of the 18th IEEE Symposium on Computer-Based Medical Systems, pp.555–560.

Ayache, C., Grau, B. and Vilnat, A. (2006) ‘EQueR: the French evaluation campaign of question answering systems EQueR/EVALDA’, Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), May, pp.1157–1160, Genoa, Italy.

Chen, L., Jamil, H.M. and Wang, N. (2004) ‘Automatic composite wrapper generation for semi-structured biological data based on table structure identification’, SIGMOD Record, Vol. 33, No. 2, pp.58–64.

Chrysostomou, K., Chen, S.Y. and Liu, X. (2008) ‘Combining multiple classifiers for wrapper feature selection’, Int. J. Data Mining, Modelling and Management, Vol. 1, No. 1, pp.91–102.

Cohen, A.M. and Hersh, W.R. (2005) ‘A survey of current work in biomedical text mining’, Briefings in Bioinformatics, Vol. 6, No. 1, pp.57–71.

Ely, J.W., Osheroff, J.A., Ebell, M.H., Bergus, G.R., Levy, B.T., Chambliss, M.L. et al. (1999) ‘Analysis of questions asked by family doctors regarding patient care’, British Medical Journal, Vol. 319, No. 7206, pp.358–361.


Jacquemart, P. and Zweigenbaum, P. (2003) ‘Towards a medical question-answering system: a feasibility study’, Proceedings of Studies in Health Technology and Informatics, Medical Informatics Europe, pp.463–468.

Jing, L., Li, J., Ng, M.K., Cheung, Y. and Huang, J. (2009) ‘SMART: a subspace clustering algorithm that automatically identifies the appropriate number of clusters’, International Journal of Data Mining, Modelling and Management, Vol. 1, No. 2, pp.149–177.

Kretser, O. and Moffat, A. (2000) ‘Needles and haystacks: a search engine for personal information collections’, Proceedings of the Australasian Computer Science Conference, p.58.

Kumar, D.A. and Ravi, V. (2008) ‘Predicting credit card customer churn in banks using data mining’, Int. J. Data Analysis Techniques and Strategies, Vol. 1, No. 1, pp.4–28.

Kumar, P., Kashyap, S., Mittal, A. and Gupta, S. (2005) ‘A fully automatic question answering system for intelligent search in e-learning documents’, International Journal on E-Learning (IJEL), Special Issue: Support for E-learning: Technologies for Electronic Documents, Vol. 4, No. 1, pp.149–166.

Li, L., Singh, R.G., Zheng, G., Vandenberg, A., Vaishnavi, V. and Navathe, S. (2005) ‘A methodology for semantic integration of metadata in bioinformatic data sources’, Proceedings of the 43rd Annual Southeast Regional Conference, pp.131–136.

Makar, R., Kouta, M. and Badr, A. (2008) ‘A service oriented architecture for biomedical question answering system’, Proceedings of the IEEE Congress on Services Part II, pp.73–80.

Malerba, D. (2008) ‘A relational perspective on spatial data mining’, Int. J. Data Mining, Modelling and Management, Vol. 1, No. 1, pp.103–118.

Niu, Y., Hirst, G., McArthur, G. and Rodriguez-Gianolli, P. (2003) ‘Answering clinical questions with role identification’, Proceedings of the ACL Workshop on Natural Language Processing in Biomedicine, pp.73–80.

Plantevit, M., Charnois, T., Klema, J., Rigotti, C. and Cremilleux, B. (2009) ‘Combining sequence and item set mining to discover named entities in biomedical texts: a new type of pattern’, International Journal of Data Mining, Modelling and Management, Vol. 1, No. 2, pp.119–148.

Pomerantz, J. (2005) ‘A linguistic analysis of question taxonomies’, Journal of the American Society for Information Science and Technology, Vol. 56, No. 7, pp.715–728.

Rinaldi, F., Dowdall, J., Shneider, G. and Persidis, A. (2004) ‘Answering questions in the genomics domain’, Proceedings of the ACL Workshop on Question Answering in Restricted Domains.

Sackett, D., Straus, S., Richardson, W.S., Rosenberg, W. and Haynes, R.B. (2000) Evidence-Based Medicine: How to Practice and Teach EBM, 2nd ed., Churchill Livingstone, Edinburgh.

Sang, E.T.K., Bouma, G. and Rijke, M.D. (2005) ‘Developing offline strategies for answering medical questions’, Proceedings of the AAAI-05 Workshop on Question Answering in Restricted Domains, pp.41–45.

Schultz, S., Honeck, M. and Hahn, H. (2002) ‘Biomedical text retrieval in languages with complex morphology’, Proceedings of the Workshop on Natural Language Processing in the Biomedical Domain, pp.61–68.

Sondhi, P., Raj, P., Kumar, V.V. and Mittal, A. (2007) ‘Question processing and clustering in INDOC: a biomedical question answering system’, EURASIP Journal on Bioinformatics and Systems Biology, January, Vol. 2007, pp.1–7, Hindawi Publishing Corp., New York, NY, USA.

Song, Y., Kim, S. and Rim, H. (2004) ‘Terminology indexing and reweighting methods for biomedical text retrieval’, Proceedings of the SIGIR Workshop on Search and Discovery in Bioinformatics.

Stoimenov, L., Djordjevic, K. and Stojanovic, D. (2000) ‘Integration of GIS data sources over the internet using mediator and wrapper technology’, Proceedings of the 10th Mediterranean Electrotechnical Conference, pp.334–336.


Subramanya, S., Wang, Z., Li, B. and Liu, H. (2008) ‘Completing missing views for multiple sources of web media’, Int. J. Data Mining, Modelling and Management, Vol. 1, No. 1, pp.23–44.

Yu, H., Lee, M., Kaufman, D., Ely, J., Osheroff, J.A., Hripcsak, G. and Cimino, J. (2007) ‘Development, implementation, and a cognitive evaluation of a definitional question answering system for physicians’, Journal of Biomedical Informatics, Vol. 40, No. 3, pp.236–251.

Zadeh, L. (2006) ‘From search engines to question-answering systems: the problems of world knowledge, relevance, deduction, and precisiation’, in Reusch, B. (Ed.): Computational Intelligence, Theory and Applications, International Conference, 9th Fuzzy Days in Dortmund, Germany, 18–20 September, pp.1–3.

Zweigenbaum, P. (2003) ‘Question answering in biomedicine’, Proceedings of the Workshop on Natural Language Processing for Question Answering, pp.1–4.