Using the semantically interoperable biospecimen repository application, caTissue
End user deployment lessons learned

Jack W. London, PhD and Devjani Chatterjee, PhD
Kimmel Cancer Center
Thomas Jefferson University
Philadelphia, PA USA

Abstract—The goal of the National Cancer Institute’s cancer Biomedical Informatics Grid initiative, or caBIG®, is the ability to share data and resources among cancer researchers. One means to achieving this goal is the development of semantically interoperable informatics tools based on common data models and controlled vocabularies. A tool for managing biospecimen repositories, caTissue, enables investigators to query for available tissues that are relevant to the needs of their research. For this functionality, the caTissue application data model must include annotation describing various specimen characteristics, and have this information accessible for query by the researcher end user. Having deployed caTissue over two years at Thomas Jefferson University, we report the lessons learned from our investigators’ use of this complex, semantically interoperable software application. Overall we have found that object model complexity and semantic completeness pose obstacles to end user accessibility that require effective strategies to overcome.

Keywords-semantic interoperability; tissue banks; controlled vocabularies

I. INTRODUCTION

The proposition that the progress of research can be advanced by the sharing of data and resources was formally recognized in the National Institutes of Health’s “Data Sharing” directive [1]. Consistent with this NIH dictate, in the summer of 2004 the National Cancer Institute announced its intention to build an infrastructure for creating, communicating and sharing biomedical informatics tools, data and other research resources, using common data standards and models. The foundation of this infrastructure would be cancer investigator access to an Internet-based grid, a cancer Biomedical Informatics Grid, or caBIG®, through which research data and the tools to access and analyze the data would be available [2]. The underlying technical objectives for informatics tool development were syntactic and semantic interoperability.

Advances in molecular techniques have resulted in high-throughput technologies for genotyping, evaluating expression and transcription, sequencing, proteomics, and other laboratory analyses. Using these advances is essential to researching the pathophysiology and mechanisms of complex diseases, such as cancer, and requires access to substantial numbers of biological samples and their associated detailed phenotypic data, including study participant demographics, diagnosis, treatment, and outcomes. This information about the source of the biospecimen can help in the selection of specimens specifically appropriate for the research question being investigated, and offers the possibility of subsequently relating experimental findings back to clinical observations (diagnoses, treatments, outcomes). Furthermore, the requirements of comparative research, data sharing, and inter-institutional analysis have led to the development of controlled vocabularies for research biospecimen annotation [3,4].

Given this necessity for human specimens for biomedical research, caBIG® has a work space dedicated to the development and deployment of biospecimen repository tools: “Tissue Banks and Pathology Tools” (TBPT). Work on a tool for biospecimen inventory management, tracking, and annotation began in 2005 within the TBPT work space. Developer groups at Washington University and the University of Pittsburgh were joined by adopting collaborators at the University of Pennsylvania, Indiana University, Yale University, and our group at Thomas Jefferson University (TJU). By the summer of 2007 a tool was available, caTissue Core, that had sufficient functionality for production deployment in biobanks. A full-function application, caTissue Suite, was released in January 2009.

This tool permits users to enter and retrieve data concerning the collection, storage, quality assurance, and distribution of biospecimens. caTissue is sufficiently scalable and configurable for deployment across biospecimen resources of varying size and function, including resources that manage multiple types of biospecimens (tissue, biofluids, nucleic acid). The tool provides search functionality that allows investigators to query, via the caGRID, for specimens available at participating institutions. The specimens are annotated with demographic data (age at accession, gender, race/ethnicity) and clinical data (diagnosis, pathological status).
This application has two primary end user communities: individuals who manage biospecimen repositories, and investigators who use specimens in their research. While the application’s software requirements and specifications were dictated by use case analysis of these user communities’ work flow, system design adhered to the dictates of syntactic and semantic interoperability. Deployment of caTissue here at Thomas Jefferson University has shown that consideration of end user needs should influence how semantic interoperability is implemented.

II. CATISSUE SUITE

The caTissue Suite application enforces semantic interoperability primarily by the use of caBIG® Common Data Elements, with attribute values stored in a “permissible values” table. The clinical diagnosis attribute applies to the “specimen collection group” object, an accession of a collection of specimens obtained at a single time point, and describes any clinical diagnostic information pertaining to the individual providing the specimens. caTissue uses a standard vocabulary for this very significant tissue annotation: the “Systematized Nomenclature of Medicine - Clinical Terms” (SNOMED CT). This is a comprehensive collection of medical terminology addressing most areas of clinical information, including diseases, findings, and procedures. SNOMED CT consists of approximately 800,000 terms representing over 350,000 concepts, arranged in a “type” hierarchy (“viral pneumonia” → “infectious pneumonia” → “pneumonia” → “lung disease”).
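The “type” hierarchy can be pictured with a small sketch. The Python below is only an illustration built from the four terms in the example chain above. It assumes a single parent per term and plain strings as keys, whereas SNOMED CT itself identifies concepts by numeric codes and allows a concept to have several parents. The names `IS_A`, `ancestors`, and `is_a` are hypothetical and do not come from caTissue.

```python
# Toy "is-a" map built only from the example chain in the text.
# Assumption: one parent per term, keyed by the term string.
# This is NOT how caTissue or SNOMED CT store the hierarchy.
IS_A = {
    "viral pneumonia": "infectious pneumonia",
    "infectious pneumonia": "pneumonia",
    "pneumonia": "lung disease",
}


def ancestors(term):
    """Return the broader terms above `term`, nearest first."""
    chain = []
    while term in IS_A:
        term = IS_A[term]
        chain.append(term)
    return chain


def is_a(term, broader):
    """True if `term` is `broader` itself or has `broader` as an ancestor."""
    return term == broader or broader in ancestors(term)


print(ancestors("viral pneumonia"))
print(is_a("viral pneumonia", "pneumonia"))
```

A query for the broader term “pneumonia” could use such a test to include specimens annotated with any narrower diagnosis.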

III. END USER PROBLEMS

As noted above, caTissue has two primary end user communities: tissue bankers and researchers. Tissue bankers rely on caTissue to manage their biobanks, which encompasses many use cases centered on accessioning, storing, and distributing specimens. Tissue bankers are familiar with the pathology domain and may use the caTissue application on a daily basis. Researchers use caTissue to identify and order specimens of interest. A researcher may or may not be familiar with specimen banks and may need to use caTissue only occasionally. For both these user groups, however, the ability to easily query the biobank contents is essential. We have found that two features of caTissue Suite significantly impede the query process: the large size and complexity of the caTissue object model, and the huge scale of SNOMED CT diagnostic terminology.

A. Object Model complexity

The large size and complexity of the object model make it difficult to construct caTissue database queries. It is assumed here that the tissue banker/researcher end user has no computer programming expertise and will use the graphical query tools the application provides, as opposed to writing code utilizing the APIs or directly querying the database with SQL. Even an end user with programming knowledge may not want to invest the time required to become adept at navigating caTissue’s object model, particularly if their need to query the biobank is infrequent. It is the comprehensiveness of the caTissue functionality that has resulted in this complicated schema, which in turn impedes the end user’s ability to query the information in the system.

B. SNOMED CT diagnoses

The comprehensiveness of SNOMED CT for clinical diagnosis also poses a problem for end user queries. Concerns with SNOMED CT because of its large scale have been previously noted [5]. The clinical diagnosis is a specimen annotation used very frequently in queries. But if the end user searches for specimens from individuals diagnosed with “ductal carcinoma in situ” or “DCIS,” no cases will be found, since the SNOMED CT diagnosis is “intraductal carcinoma in situ of breast.” Viewing the possible diagnoses of interest from the entire set of SNOMED CT is not a viable option.
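The DCIS mismatch can be made concrete with a minimal Python sketch. It is not the approach taken in this paper and does not use any caTissue API. The specimen records, the `SYNONYMS` table, and both function names are hypothetical. Only the stored diagnosis string is taken from the text.

```python
# Hypothetical specimen records; only the first diagnosis string comes from the text.
SPECIMENS = [
    {"id": "S1", "diagnosis": "intraductal carcinoma in situ of breast"},
    {"id": "S2", "diagnosis": "pneumonia"},
]

# Hypothetical local synonym table (an assumption, not part of caTissue or SNOMED CT).
SYNONYMS = {
    "dcis": "intraductal carcinoma in situ of breast",
    "ductal carcinoma in situ": "intraductal carcinoma in situ of breast",
}


def exact_search(text):
    """Match the user's text against the stored diagnosis string only."""
    wanted = text.strip().lower()
    return [s["id"] for s in SPECIMENS if s["diagnosis"] == wanted]


def synonym_search(text):
    """Map a familiar name to the stored term first, then match exactly."""
    wanted = text.strip().lower()
    return exact_search(SYNONYMS.get(wanted, wanted))


print(exact_search("DCIS"))    # no match: the stored term is worded differently
print(synonym_search("DCIS"))  # matches S1 through the synonym table
```

The exact search fails for both familiar names even though a matching specimen exists, which is the behavior an occasional user encounters.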

IV. SOLUTIONS

A. Canned parameterized queries

The caTissue development team at Washington University addressed caTissue’s difficult query construction problem by adding to caTissue Suite the capability of storing queries for later use. This allows an individual who is knowledgeable about caTissue’s object model and the application’s functionality to create a “canned” query for use by others. Furthermore, the ability to specify query parameters was also added.

B. Guided query interface

Selecting specimens with clinical diagnosis as a parameter is a common scenario. It is often the first action that a researcher performs with caTissue. We have addressed this problem at TJU by developing a web client front end that guides the user to a display of available specimens. This guided query interface is rigid: the user has access to only a certain subset of caTissue objects. In a sense, it is a canned query for discovering specimens based on anatomic site, specimen type, pathological status, and finally, clinical diagnosis.
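The idea of a canned, parameterized query that walks the user through a fixed order of choices can be sketched as follows. This Python example uses the standard `sqlite3` module and a single flat `specimen` table. The table, its column names, the demo rows, and the functions `next_choices` and `matching_specimens` are all assumptions made for illustration. The actual caTissue schema is far larger, and the TJU front end is not reproduced here.

```python
import sqlite3

# Guided order taken from the text; the column names themselves are hypothetical.
STEPS = ["anatomic_site", "specimen_type", "pathological_status", "clinical_diagnosis"]


def make_demo_db():
    """Build an in-memory table with invented demo rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE specimen (label TEXT, anatomic_site TEXT, specimen_type TEXT, "
        "pathological_status TEXT, clinical_diagnosis TEXT)"
    )
    conn.executemany(
        "INSERT INTO specimen VALUES (?, ?, ?, ?, ?)",
        [
            ("S1", "breast", "tissue", "malignant", "intraductal carcinoma in situ of breast"),
            ("S2", "breast", "tissue", "non-malignant", "fibroadenoma"),
            ("S3", "lung", "fluid", "malignant", "adenocarcinoma of lung"),
        ],
    )
    return conn


def _where(filters):
    # Only column names listed in STEPS are interpolated into the SQL text;
    # user-supplied values are always bound as named parameters.
    cols = [c for c in STEPS if c in filters]
    clause = " AND ".join(f"{c} = :{c}" for c in cols) or "1 = 1"
    return clause, {c: filters[c] for c in cols}


def next_choices(conn, filters):
    """Return the first unanswered step and the values still available for it."""
    remaining = [c for c in STEPS if c not in filters]
    if not remaining:
        return None, []
    col = remaining[0]
    clause, params = _where(filters)
    sql = f"SELECT DISTINCT {col} FROM specimen WHERE {clause} ORDER BY {col}"
    return col, [row[0] for row in conn.execute(sql, params)]


def matching_specimens(conn, filters):
    """Run the canned query with whatever parameters have been chosen so far."""
    clause, params = _where(filters)
    sql = f"SELECT label FROM specimen WHERE {clause} ORDER BY label"
    return [row[0] for row in conn.execute(sql, params)]


conn = make_demo_db()
print(next_choices(conn, {}))
print(next_choices(conn, {"anatomic_site": "breast", "specimen_type": "tissue"}))
print(matching_specimens(conn, {"anatomic_site": "breast", "pathological_status": "malignant"}))
```

Because the user only ever picks from values that exist in the bank, the diagnosis is selected from a short list instead of being typed against the full vocabulary. The cost is the rigidity noted above, since nothing outside the fixed steps can be queried.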

V. CONCLUSIONS

The NIH data sharing mandate requires semantic interoperability for research software applications. Recognized domain standards are logical choices for controlled vocabularies. However, end user data accessibility may be compromised by using large complex data models and comprehensive standard terminologies. In the case of the caBIG® caTissue biospecimen repository application, overly complex querying procedures were simplified by providing end users with canned queries or query front-ends built around particular retrievals.

REFERENCES

[1] “Final NIH Statement on Sharing Research Data,” NIH Notice NOT-OD-03-032, February 26, 2003.
[2] The Cancer Biomedical Informatics Grid (caBIG) [https://cabig.nci.nih.gov/]
[3] The National Biospecimen Network (NBN) blueprint [http://biospecimens.cancer.gov/biospecimen/network/index.asp]
[4] Becich MJ: The role of the pathologist as tissue refiner and data miner: the impact of functional genomics on the modern pathology laboratory and the critical roles of pathology informatics and bioinformatics. Mol Diagn 2000, 5(4):287.
[5] Veli N. Stroetmann (Ed.), Dipak Kalra, Pierre Lewalle, Alan Rector, Jean M. Rodrigues, Karl A. Stroetmann, Gyorgy Surjan, Bedirhan Ustun, Martti Virtanen, Pieter E. Zanstra. “Semantic Interoperability for Better Health and Safer Healthcare Research and Deployment Roadmap for Europe,” Deployment and Research Roadmap for Europe, European Commission 2009.