A first step towards an ontology-based tool integrating

Biomedical Information Integration: two case-studies

C. Golbreich (a), C. Jacquelinet (b), A. Burgun (a)
(a) Laboratoire d’Informatique Médicale, Faculté de Médecine, 35033 Rennes, France
(b) Département Médical et Scientifique, Etablissement Français des Greffes, Paris, France

[email protected], {Christian.Jacquelinet, Anita.Burgun}@univ-rennes1.fr

Abstract. Semantic integration issues have become a key bottleneck in the deployment of a wide variety of biomedical information applications. Traditional integration approaches, such as standards, ad-hoc systems, and data warehouses, have proved insufficient. For scaling up to multiple Web sources or organizational intranets, newly suggested techniques such as mediator or peer-based integration architectures, together with formal languages for Web ontologies, are promising. This paper presents two case studies, the limits of the classical solutions adopted for integration, the lessons learnt from this practical experience, and future perspectives.

1 Introduction

Semantic integration is now crucial in many biomedical domains, including functional genomics (§3) and organ failure and transplantation (§2), as well as many other medical domains where better patient care, better understanding of diseases, and sound decision making in public health require the integration of large amounts of data from heterogeneous resources. This paper gives a flavour of the solutions that have been developed so far for these two complex applications. Section 2 presents the approach developed with the Etablissement français des Greffes (EfG). Section 3 describes BioMeKe (Biological and Medical Knowledge Extraction) [14], an ontology-based tool built to relate knowledge from various Web or local resources for liver transcriptome analysis. General lessons learnt from these practical experiences are then discussed, and future perspectives sketched.

2 Experience with EfG in organ transplantation and renal failure

2.1 Medical context and requirements

The Etablissement français des Greffes (EfG) was created in France by the law 93-43 of January 1994 as a state agency dealing with public health issues related to organ, tissue and cell transplantation. EfG is responsible for the registration of patients on the national waiting list, for the management of this list, for the allocation of all organs, whether retrieved in France or abroad, and for the evaluation of retrieval and transplantation activities by organ and by transplantation centre [11]. To fulfill its missions, EfG maintains a national information system (EfG-IS) in which organ procurement organisations register data about donors and transplantation teams periodically record patient data: at registration on the waiting list, at the time of transplantation if any, and during follow-up before or after transplantation. Furthermore, EfG also supports the Renal Epidemiology and Information Network (REIN), which is devoted to the follow-up of all patients treated by dialysis [19] and connects the cohorts of dialysed and transplanted patients to offer a complete evaluation of end-stage renal disease treatments. EfG is in charge of the REIN information system (REIN-IS).

EfG-IS is a decisional information system at two levels: (i) it is used to aggregate information, perform evaluation studies and propose strategic decisions in the field of organ failure public health policies; (ii) it is also used at the individual level for organ allocation. Thus, EfG-IS is required to periodically integrate data coming from multiple sources: patient data coming from transplant teams or from dialysis centres, information coming from organ procurement organizations when a potential donor is identified, and geographical information concerning the organization of health care supply. In many respects, both EfG-IS and REIN-IS are decisional information systems: they must support evaluation studies and decision making for public health policies in the field of organ failure.

To make relevant public health decisions, it is important to correlate temporal and geographical data related to health needs and their determinants (patient data) with temporal and geographical data related to health supplies and their determinants (health care supply, organ retrieval). Thus, EfG data integration requirements also match those of geographical information systems. Providing statistics and dashboards for decision makers requires the integration of data coming from consolidated sources with data warehousing techniques.

2.2 Current systems: EfG-IS, REIN-IS, and the terminological server

2.2.1 The EfG-IS and REIN-IS

The existing EfG-IS relies on an RTC-based classical client-server architecture. Patient and donor data are registered within a relational database. The thesaurus is physically split into a set of unconnected tables of values related to initial disease, complications and causes of death. It consists of a catalog without hierarchy, explicit differentiation or compositional principles. The terms themselves do not refer to medical standards. The lists of available terms are incomplete, and some terms are unused because they are inappropriate. EfG-IS is currently migrating to a secured web-based n-tier modular architecture comprising a terminological server [10].

REIN-IS has been prototyped as a multiple-source information system and is currently in use for evaluation in six French regions. REIN-IS relies on a secured web-based n-tier architecture [2]. The information system tier may access three types of database: the identification database, the production database and the data warehouse. However, in both systems, clinicians are obliged to record clinical data twice, once in their hospital information system and once in the national or regional databases, since the data already registered in their hospital information system are not integrated.

2.2.2 The terminological server and its ontology

In order to allow the integration of data that are encoded with heterogeneous terminologies, the EfG has developed a terminology server that covers end-stage diseases and organ transplantation. Disease description is based on frames. The slots refer to concepts from specialized hierarchies representing anatomy, organisms, etc. Protégé was used for modeling. The model was then implemented with a MySQL database. Functionalities are developed using procedural programming.
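To make the frame-based description more concrete, the following is a minimal sketch in Python. It is not the EfG implementation (which relies on Protégé and MySQL): the hierarchies, the slot names `location` and `causal_agent`, and the helper functions are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical specialized hierarchies: child concept -> parent concept.
ANATOMY = {"organ": None, "kidney": "organ", "liver": "organ",
           "glomerulus": "kidney"}
ORGANISMS = {"organism": None, "virus": "organism",
             "hepatitis C virus": "virus"}


@dataclass
class DiseaseFrame:
    """A disease described as a frame whose slots point to hierarchy concepts."""
    name: str
    slots: dict = field(default_factory=dict)  # slot name -> concept name


def is_a(concept, ancestor, hierarchy):
    """Return True if `concept` equals `ancestor` or lies below it."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = hierarchy.get(concept)
    return False


def diseases_located_in(frames, site):
    """Select frames whose (assumed) 'location' slot falls under `site`."""
    return [f.name for f in frames
            if is_a(f.slots.get("location"), site, ANATOMY)]


frames = [
    DiseaseFrame("glomerulonephritis", {"location": "glomerulus"}),
    DiseaseFrame("viral hepatitis C",
                 {"location": "liver", "causal_agent": "hepatitis C virus"}),
]

kidney_diseases = diseases_located_in(frames, "kidney")
```

Because the slot values are concepts in a hierarchy rather than free text, a query on "kidney" also retrieves a disease located in the glomerulus.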

2.2.3 Limitations

The EfG terminology system serves as a basis for dealing with the heterogeneity of data. However, its limitations are twofold. First, it represents terminologies. Second, the ontology itself is not formal.

Limits of terminology. Since the data of local sources are encoded using terminologies, a basic requirement is that the formal ontology be connected to these terminologies. A set of terms coming from several terminologies used in diverse sources was analyzed. Describing terminology items as concepts led to the creation of intermediary nodes that cluster diseases but do not denote any real concept. In addition, it increases the need to represent conjunction, disjunction and negation. Moreover, the type and structure of the underlying coding scheme influence the determination of term meaning [4]. A special problem with statistical classifications is the representation of residual classes (not otherwise specified), since their meaning depends on the context [9].

Limits of non-formal ontology. A non-formal ontology has several limitations: (i) consistency is not guaranteed; (ii) inheritance mechanisms are not managed: for example, when a concept that has more than one parent is created, no mechanism is provided for multiple inheritance; (iii) new concepts cannot be inserted automatically; (iv) the procedural programming used to support reasoning (for example, for grouping items in the context of decision making) does not offer powerful reasoning.
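As an illustration of limitations (i) and (ii), the sketch below shows, in Python, the kind of bookkeeping that has to be hand-coded when the ontology is not formal: collecting every ancestor of a concept that has several parents, and checking that the hierarchy contains no cycle. The concept names and the `PARENTS` table are hypothetical; a description logic classifier would provide much richer services than this.

```python
# Hypothetical concept hierarchy: concept -> list of direct parents.
PARENTS = {
    "disease": [],
    "kidney disease": ["disease"],
    "diabetes complication": ["disease"],
    # A concept with more than one parent (multiple inheritance).
    "diabetic nephropathy": ["kidney disease", "diabetes complication"],
}


def ancestors(concept, parents):
    """Return every concept reachable by following parent links upwards."""
    seen = set()
    stack = list(parents.get(concept, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(parents.get(parent, []))
    return seen


def is_acyclic(parents):
    """Basic consistency check: no concept may be its own ancestor."""
    return all(c not in ancestors(c, parents) for c in parents)


def group_under(concept, parents):
    """Group all concepts subsumed by `concept`, e.g. for decision making."""
    return sorted(c for c in parents if concept in ancestors(c, parents))


inherited = ancestors("diabetic nephropathy", PARENTS)
```

Here `inherited` contains both parents and their common ancestor, and `group_under("disease", PARENTS)` lists every disease concept regardless of which branch it was declared in.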

3 Experience in functional genomics

3.1 The biomedical context

The World-Wide Web has made available a tremendous amount of biomedical information, but it remains tedious and time-consuming for biologists and physicians to access the information relevant to their queries. Multiple public resources are available in genomics, including public databanks such as SWISS-PROT, OMIM, LocusLink, GenBank, and many others. This domain is characterized by standard terminologies, multiple sources, and mappings and cross-references.

3.1.1 Standard terminologies

GeneOntology™ (GO)¹ is a standard for molecular biology and genomics. GO is organized into three top categories: Molecular Function, Biological Process, and Cellular Component. It provides a controlled vocabulary for annotating sequences and gene products. GO concepts are broadly used as attributes in many public databases, e.g. SWISS-PROT, as well as for annotating sequences in specific applications in the context of microarray experiments. GO has been merged with the Unified Medical Language System® (UMLS®), which is a general medical ontology intended to help health professionals and researchers to use biomedical information from different sources. The UMLS® has two major components: the Metathesaurus®, a large repository of more than 900,000 concepts, and the Semantic Network, a limited network of 135 semantic types. The Metathesaurus is built by merging more than 100 families of vocabularies, including MeSH, and by grouping sets of terms considered as synonyms under the same concept.

3.1.2 Multiple sources

Public databases available on the Web are numerous, including:
• Gene databanks such as LocusLink², a gene database that aims to unify knowledge about genes.
• Resources that provide synonymous names for genes, such as HUGO³ (Human Gene Nomenclature Database).
• Databases of gene products such as GO Annotation @EBI⁴ (GOA), whose objective is to assign GO terms to gene products.

¹ http://www.geneontology.org
² http://www.ncbi.nlm.nih.gov/LocusLink/

3.1.3 Mappings and cross-references

Many mappings and relations between standard ontologies and databanks are stored in online resources. For example, in many biological databases, mappings to GO⁵ concepts are explicitly defined, e.g. mappings of SWISS-PROT (SW) keywords to GO terms (Table 1). Most databanks provide cross-references to other databases via accession numbers. HUGO and GOA provide links useful for gene annotation systems.

!date: 2003/07/14 21:07:05
! Evelyn Camon, SWISS-PROT.
!Mapping of SWISS-PROT KEYWORDS to GO terms
SP_KW:Metal-thiolate cluster > GO:metal ion binding ; GO:0046872
SP_KW:Metalloenzyme inhibitor > GO:enzyme inhibitor activity ; GO:0004857
…

Table 1 Mappings of SW keywords to GO terms
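The mapping lines of Table 1 follow a regular pattern that is easy to exploit programmatically. The following Python sketch parses lines of that shape into a dictionary from keyword to GO terms. It assumes only the layout visible in Table 1 (comment lines starting with `!`, a ` > ` separator, and a ` ; ` before the GO identifier); the function name `parse_keyword_mappings` is hypothetical and the real file may contain variants this sketch does not handle.

```python
def parse_keyword_mappings(lines):
    """Parse 'SP_KW:<keyword> > GO:<label> ; GO:<id>' lines.

    Returns a dict: keyword -> list of (GO identifier, GO label) pairs.
    Comment lines (starting with '!') and unrecognized lines are skipped.
    """
    mappings = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("!"):
            continue
        left, sep, right = line.partition(" > ")
        if not sep or not left.startswith("SP_KW:"):
            continue
        keyword = left[len("SP_KW:"):].strip()
        label, sep, go_id = right.rpartition(" ; ")
        if not sep:
            continue
        if label.startswith("GO:"):
            label = label[len("GO:"):]
        mappings.setdefault(keyword, []).append((go_id.strip(), label.strip()))
    return mappings


sample = [
    "!date: 2003/07/14 21:07:05",
    "!Mapping of SWISS-PROT KEYWORDS to GO terms",
    "SP_KW:Metal-thiolate cluster > GO:metal ion binding ; GO:0046872",
    "SP_KW:Metalloenzyme inhibitor > GO:enzyme inhibitor activity ; GO:0004857",
]

keyword_to_go = parse_keyword_mappings(sample)
```

Storing each keyword with a list of terms leaves room for keywords that map to several GO terms.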

3.2 BioMeKe

Biologists and physicians of INSERM U522 and LIM at Rennes study the molecular mechanisms involved in human liver diseases by means of transcriptome analysis. The objective is to find out which genes are expressed in the liver and to correlate them with patient data, in order to better understand pathological processes in the liver. The volume of information is a difficulty: for example, more than 3,000 SWISS-PROT entries are isolated from the tissue "Liver". BioMeKe (Biological and Medical Knowledge Extraction) has been developed to help them extract and associate medical and biological information accessible from multiple public sources (GenBank, SWISS-PROT, LocusLink, Medline, etc.) and to correlate it with the biologists' data stored in their local repository (Gedaw [7]).

3.2.1 Components and functionalities

BioMeKe is an ontology-based tool composed of two parts, a core ontology and a query processor:
− BioMeKe Core Ontology (BCO) includes the main standards of the biomedical domain: the UMLS®, GO plus GOA. HUGO is integrated to address synonymy issues. All terminologies are stored separately in a MySQL relational database. Links between items are created dynamically during the search for a given term or an annotation request.
− BioMeKe Query Processor uses BCO knowledge to search for information in the external sources. It has three components. The heterogeneity manager (HM) uses HUGO and the UMLS for semantic unification of the different names and cross-references. The biological search module (BS) is in charge of searching for biological information in GO and of providing access to the information of several public databanks. The medical search module (MS) is in charge of searching for medical information in the UMLS. If the term is found, its UMLS context is displayed, including co-occurrences in MedLine.

[Figure 1 BioMeKe. Diagram labels: User's term or file; Query and answer; (1) BCO: HUGO, UMLS + MeSh ST, GO + GOA; (2) Biological module; (3) Medical module; wrapper.]

³ http://www.gene.ucl.ac.uk/nomenclature/
⁴ http://www.geneontology.org/#annotations

The implementation of the BioMeKe system relies on a MySQL relational database and a set of Java functions (wrappers) that access the content of several public databases. The content of the BCO databases is accessed through SQL queries.
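The role of the heterogeneity manager can be sketched as a synonym lookup that maps any known name onto a single approved symbol before the search modules are called. The Python fragment below is only an illustration: BioMeKe itself is implemented with Java wrappers and SQL queries, the `ALIASES` table is a small hand-written stand-in rather than an extract of HUGO, and the function names are hypothetical.

```python
# Hand-written stand-in for a synonym resource such as HUGO:
# approved symbol -> other names under which the gene may appear.
ALIASES = {
    "HFE": {"HLA-H", "HH"},
    "TF": {"transferrin"},
}


def build_index(aliases):
    """Build a case-insensitive index: any known name -> approved symbol."""
    index = {}
    for approved, names in aliases.items():
        for name in names | {approved}:
            index[name.casefold()] = approved
    return index


def unify(term, index):
    """Return the approved symbol for `term`, or None if it is unknown."""
    return index.get(term.strip().casefold())


index = build_index(ALIASES)
symbol = unify("hla-h", index)
```

A term that the index does not know returns `None`, which is the point at which, in a system of this kind, the user would have to reformulate the query by hand.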

3.2.2 Limitations

BioMeKe's main innovation is to be an ontology-based tool. However, it has three kinds of limits: the ontologies are not formal, the tool is procedural, and the semantic integration it provides is still limited.

Limits of non-formal ontologies. GO and UMLS are not structured according to formal principles, and exhibit many inconsistencies.

Limits of the query engine. BioMeKe is mainly grounded on various "mappings" and relations between the standard ontologies and databanks, or between databanks (by cross-references). However, since this knowledge remains implicit, many tasks still depend on the user's skill and responsibility: reformulation, selection of the databanks to browse, etc. BioMeKe is a procedural system, based on a fixed process. As the number of online databanks keeps increasing, more automation and more flexibility are required, providing extensibility and dynamic source selection.

Limits in semantic integration. BioMeKe's management of heterogeneity is limited. For example, it is mainly based on the synonyms found in HUGO or in the UMLS, but it does not exploit other information available in external databanks. Other limitations are related to terminology and ontology versioning, since GO, UMLS, and HUGO are very frequently updated.

4 Lessons and perspectives

There is a clear need for new technologies: formal Web ontology languages and more flexible data integration.

4.1 Needs of formal ontologies

Most people now agree about the limits of non-formal ontologies and the benefits of a formal ontology language, for the Web in general [18] and in the biomedical domain [17] [20] [6]. First, "multiple viewpoints" is an old problem in biomedicine. For example, in GO, the function and process hierarchies are organized either from a biochemical viewpoint or according to the chemical substances they act on. Multiple viewpoints are a source of inconsistencies when the structuring of the ontology is not automated [5] [15]. Moreover, biologists and physicians are interested in clustering diseases and genes according to different dimensions, e.g. genes according to their functions or related pathologies, and also in identifying all the gene products that share the same feature. Description Logics (DL) provide powerful services for that, and the forthcoming W3C standard Web Ontology Language (OWL)⁶ comes with useful tools, e.g. the FaCT automatic classifier⁷ and the OilEd editor [1].

4.2 Needs of a more flexible information integration

What is required is a declarative approach, allowing an explicit and formal representation of the knowledge (ontology, mappings, queries), together with an inference (query) engine offering powerful services, in particular for automatic ontology classification, consistency checking, and dynamic chaining of mappings. Extensibility and real-time data are often crucial requirements for biomedical information integration. In particular, genomics is a very fast-moving field. Web sources are multiple, with huge and constantly evolving content (versioning of GO and UMLS). New online ontologies and specialized databanks frequently appear. Data warehouses, which can be quite powerful and provide high access performance, are well suited neither to evolving genomics data nor to the integration of heterogeneous local clinical data into a national decisional information system. Thus, more flexible integration, based either on centralized mediators or on peer-based distributed integration [3] [8], might be more appropriate [15].
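The idea of dynamically chaining mappings can be illustrated with a small Python sketch. Each mapping is declared explicitly as a table between two identifier spaces, a breadth-first search finds a chain of mappings from a source space to a target space, and the chain is then applied to translate values. This is only a toy under stated assumptions: the space names (`SP_KW`, `GO`, `GENE_PRODUCT`), the gene product identifiers and the function names are hypothetical, and plain lookup tables stand in for the formal mapping language that the text calls for.

```python
from collections import deque

# Explicit mappings between identifier spaces: (source, target) -> table.
# The gene product identifiers are placeholders, not real accession numbers.
MAPPINGS = {
    ("SP_KW", "GO"): {"Metalloenzyme inhibitor": {"GO:0004857"}},
    ("GO", "GENE_PRODUCT"): {"GO:0004857": {"P_EXAMPLE_1", "P_EXAMPLE_2"}},
}


def find_chain(source, target, mappings):
    """Breadth-first search for a shortest chain of mappings, or None."""
    queue = deque([(source, [])])
    visited = {source}
    while queue:
        space, path = queue.popleft()
        if space == target:
            return path
        for (src, dst) in mappings:
            if src == space and dst not in visited:
                visited.add(dst)
                queue.append((dst, path + [(src, dst)]))
    return None


def translate(values, chain, mappings):
    """Apply each mapping of the chain in turn to a set of values."""
    current = set(values)
    for step in chain:
        table = mappings[step]
        current = set().union(*(table.get(v, set()) for v in current))
    return current


chain = find_chain("SP_KW", "GENE_PRODUCT", MAPPINGS)
products = translate({"Metalloenzyme inhibitor"}, chain, MAPPINGS)
```

Adding a new source then amounts to declaring one more entry in `MAPPINGS`; no procedural code has to change, which is the extensibility argument made above.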

4.3 Towards more adapted solutions

Our current projects are focused on the development of more flexible information integration systems. A first project, under definition together with EfG and other European partners, aims at developing an integration system at a European level for nephrology, dialysis and renal transplantation. The architecture currently being specified might be a hybrid architecture that allows information already stored in data warehouses to be integrated together with mediators or peers, depending on the level of integration (local hospital or dialysis center, regional, national, European). Mediators are a significant progress, but for scaling up to the Web, centralized integration may not be flexible enough, and distributed systems may be even more appropriate. In particular, as described for functional genomics in § 3.1.3, databanks are not only data "sources" but also include precious links and mappings, through their cross-references to general ontologies and to other databanks. Such local relations between sources should be explicitly represented and directly exploited to infer new information. Therefore, peer-based integration, where “every participant should be able to contribute new data and relate it to existing concepts and schemas, define new schemas that others can use as frames or reference for their queries or define new relationships between existing schema or data providers” [8], might be a promising architecture. However, whether integration is mediator-based or peer-based, rich formal languages are required for representing ontologies, queries, and mappings. The logical formalism for representing mappings with OWL ontologies is still an open issue. Indeed, as has been well studied [12], the formalism has direct implications for the query reformulation problem: as the formalism for expressing mappings becomes more expressive, reformulation becomes harder.

⁶ http://www.w3.org/TR/owl-ref/
⁷ http://www.cs.man.ac.uk/~horrocks/FaCT/

5 Conclusion

As illustrated in functional genomics and in organ failure and transplantation, a formal Web ontology language like OWL and mediator or peer-based distributed integration seem to be promising techniques. The main challenges are (1) combining them, (2) providing a language compatible with OWL that makes it possible to specify local semantic mappings and to answer queries by chaining such mappings, and (3) representing huge ontologies like GO in OWL and partially automating the definition of the related source mappings.

6 References

1. Bechhofer S., et al. OilEd: a Reason-able Ontology Editor for the Semantic Web. Proceedings of KI2001, Joint German/Austrian conference on Artificial Intelligence, September 19-21, Vienna. Springer-Verlag LNAI Vol. 2174, pp 396-408. 2001.
2. Ben Saïd M., Simonet A., Guillon D., Jacquelinet C., Gaspoz F., Dufour E., Mugnier C., Jais J.P., Simonet M., Landais P. A Dynamic Web Application within an n-tier Architecture: a Multi-Source Information System for End-Stage Renal Disease, in The New Navigators: from Professionals to Patients. Eds R. Baud, M. Fieschi, P. Le Beux and P. Ruch, Publ. IOS Press, Amsterdam 2003. MIE 2003, pp 95-100.
3. Bernstein P et al. Data management for peer-to-peer computing: A vision, Workshop WebDB 2002.
4. Delamarre D, Burgun A, Seka LP, Le Beux P. Automated coding of patient discharge summaries using conceptual graphs. Methods Inf Med, 1995, Sep; 34(4):345-51.
5. Golbreich, C., B., Burgun A. Challenges for Biomedical Information. Position Statement Paper. Semantic Integration Workshop, ISWC 2003.
6. Golbreich, C., Dameron, O., Gibaud, B., Burgun A. Web ontology language requirements w.r.t expressiveness of taxonomy and axioms in medicine, ISWC 2003, Springer.
7. Guérin E, Moussouni F, Courselaud B, Loréal O. UML modeling of Gedaw: A gene expression data warehouse specialised in the liver. The 3rd french bioinformatics conference proceeding: JOBIM; 2002 June 10-12; France, Saint-Malo; 2002. p. 319-334.
8. Halevy, A. Y., Zachary G. Ives, Dan Suciu, and Igor Tatarinov. Schema mediation in peer data management systems. In ICDE, 2003.
9. Ingenerf J, Reiner J, Seik B. Standardized terminological services enabling semantic interoperability between distributed and heterogeneous systems. Int J Med Inf. 2001 Dec;64(2-3):223-40.
10. Jacquelinet C, Burgun A, Delamarre D, Strang N, Djabbour S, Boutin B, Le Beux P. Developing the ontological foundations of a terminological system for end-stage diseases, organ failure, dialysis and transplantation. Int J Med Inf. 2003 Jul;70(2-3):317-28.
11. Jacquelinet C, Houssin D. Principles and Practice of Cadaver Organ Allocation in France. In JL Touraine et al Eds. Kluwer Academic Pub. 1998: 23-28.
12. Levy A. Y, Rousset MC. The Limits on Combining Recursive Horn Rules with Description Logics, AAAI/IAAI, Vol. 1 (1996).
13. Lindberg DAB, Humphreys BL, McCray AT. The Unified Medical Language System. Meth Inform Med, 1993, 32(4): 281-91.
14. Marquet G, Burgun A, Moussouni F, Guérin E, Le Duff F, Loréal O. BioMeKe: an ontology-based biomedical knowledge extraction system devoted to transcriptome analysis, MIE 2003.
15. Marquet G, Golbreich C., Burgun A. From an ontology-based search engine towards a mediator for medical and biological information integration, Semantic Integration Workshop, ISWC 2003, Sanibel, Florida, 2003.
16. Povey S, Lovering R, Bruford E, Wright M, Lush M, Wain H. The HUGO Gene Nomenclature Committee (HGNC). Hum Genet. 2001;109(6):678-80.
17. Rector A. et al. The GRAIL concept modelling language for medical terminology. Artificial Intelligence in Medicine, 9:139-171, 1997.
18. Staab S Edt. Ontologies’ KISSES in standardization. IEEE Intelligent Systems, 70-79.
19. Stengel B, Landais P. Data Collection about the case management of end-stage renal insufficiency. Nephrologie, 20 (1999) 29-40.
20. Stevens R, Baker P, Bechhofer S, Ng G, Jacoby A, Paton NW, Goble CA, Brass A. TAMBIS: transparent access to multiple bioinformatics information sources. Bioinformatics. 2000 Feb;16(2):184-5.
21. The Gene Ontology Consortium. Creating the gene ontology resource: design and implementation. Genome Res 2001, 11(8):1425-1433.