Medical imaging archiving: a comparison between several NoSQL solutions

Luís A. Bastião Silva, Louis Beroud, Carlos Costa and José Luis Oliveira

Medical images are stored in Picture Archiving and Communication Systems (PACS). To support common data and communication formats when medical images are exchanged between devices from different vendors, the DICOM (Digital Imaging and Communications in Medicine) standard was created. Currently, any equipment in medical institutes follows the DICOM standard to communicate, store, and visualize medical data. As a consequence, PACS require robust information and communication infrastructures to ensure that all these devices communicate in a secure and timely manner.

Abstract— The use of digital medical imaging systems has greatly increased in healthcare institutions, and they are currently valuable tools supporting medical decision and treatment procedures. The proliferation of digital modalities led to an explosion in the production of medical images, increasing the need for larger repositories that can hold all this data with high availability and performance. NoSQL databases have been replacing relational databases in some scenarios, due to their horizontal scalability and their flexibility to adapt to dynamic requirements. In this paper, we present an implementation of a medical imaging archive supported by both MongoDB and CouchDB. This implementation is compliant with the medical imaging standards, and the storage and query/retrieve performance of our different implementations was evaluated. We also discuss the strengths and weaknesses of the proposed implementations and present several scenarios that take advantage of the proposed solutions.

Typically, PACS use a Relational Database Management System (RDBMS) to support their archive systems [4]. There are already a few solutions for these systems that use NoSQL [4, 6], especially for indexing and retrieval. Nevertheless, they do not exploit the possibility of also storing the files inside the NoSQL databases, as blob stores. This type of solution can be used not only by the traditional archive used by medical staff, but also by researchers who are dealing with big data issues.

I. INTRODUCTION

During the last decade, the importance of data in the world has dramatically changed and, in many domains, the amount of data has hugely increased. Healthcare is one such scenario. Digital imaging laboratories are dealing with large amounts of data, which tend to increase, since the full patient's history might help medical doctors in future diagnosis. Furthermore, due to recent legislation initiatives, in some countries data from specific modalities should be stored for a longer period of time to assure traceability of records. In order to support this development, i.e. to manage a large amount of data, several technologies have become more attractive and gained higher acceptance [1, 2]. One such example has been the emergence of NoSQL (Not only SQL) databases, such as Cassandra, CouchDB, BigTable, HBase, MongoDB and Redis. These technologies were developed to handle large amounts of schema-less data and to provide high performance. At the same time, they have been increasingly adopted by industry and research applications [3-5].

In this paper, we present an open source PACS archive supported by two distinct NoSQL solutions: MongoDB and CouchDB. The main idea was to study different approaches for the PACS archive and to benchmark the performance of storing, querying and retrieving a large dataset of medical imaging files, exploring the differences between them in distinct scenarios.

II. RELATED WORK

NoSQL solutions have their own specificities regarding persistence, replication, availability and transactions [7], and they are not always the right solution for data storage problems. RDBMS should continue to be used when data validity and dynamic queries are frequent requirements, while NoSQL tends to be used when fast access to the data is required and scalability is a key point. The decision to use it or not depends on the application. CouchDB and MongoDB have already been compared in several scientific and research contexts. Regarding their characteristics, MongoDB was built for efficiency and also to support indexing systems, while CouchDB does not. Their query engine is built through map/reduce functions. Both are built to be robust and fault-tolerant document-based stores, i.e. designed to hold a large number of documents [8].

This work was partially funded by FCT – Fundação para a Ciência e a Tecnologia, under the grant agreement SFRH/BD/79389/2011. This work also has received support from the EU/EFPIA Innovative Medicines Initiative Joint Undertaking (EMIF grant n° 115372). Luís A. Bastião Silva is with University of Aveiro in Institute of Electronics and Telematics Engineering of Aveiro – IEETA, Campus Universitário de Santiago 3810-193 Aveiro - Portugal (corresponding author to provide phone: +351 91 64 27 877) Louis Beroud is with École Supérieure d'Informatique Électronique Automatique – ESIEA, 9, rue Vésale 75005 Paris France (email: [email protected]) Carlos Costa and Jose Luis Oliveira are with University of Aveiro in Institute of Electronics and Telematics Engineering of Aveiro – IEETA, Campus Universitário de Santiago 3810-193 Aveiro – Portugal.

978-1-4799-2131-7/14/$31.00 ©2014 IEEE

Rascovsky et al. [4] developed a CouchDB-based solution to support a medical imaging archive. In this case, CouchDB is highly suitable to store, retrieve and query DICOM files. In fact, document-based databases do not have the limitations of RDBMS databases. All metadata of a DICOM file can be stored without modifying any of its tags, because in NoSQL there is no need to define any kind of database schema. Also, the authors suggested huge advantages in the distribution and image management of PACS archives and that this should be considered in the next generation of PACS.

B. Core component: dataflow

Dicoogle PACS receives files through DICOM communications (via the C-STORE command), and locally the reception component sends them to the storage plugin. This plugin stores the file, as a binary file, with its own solution, i.e. in the file system, in MongoDB or in CouchDB. The result is a URI (Uniform Resource Identifier) that can be used to retrieve this file, depending on the solution and implementation. After storing, the Dicoogle core can retrieve the file by opening a stream on the file at the given URI. To allow other plugins to retrieve this file, it is necessary to choose a scheme for the URI. This is the scheme according to the supported plugin (the file identifier is set to the SOPInstanceUID in our MongoDB and CouchDB plugins):

Dicoogle [6] is a free and open source PACS archive solution that allows extracting all textual information from medical images and performing flexible queries over DICOM metadata. It extracts all DICOM attributes from the medical image file, such as Patient Name, Study Date, Accession Number, etc., and indexes all of them in a search engine (Apache Lucene in the core). It also permits performing free text queries over the data, as well as structured queries such as the ones supported by the DICOM standard. Moreover, it is a Peer-to-Peer PACS solution that allows establishing on-demand connections without a complex setup. Nonetheless, Dicoogle also relies on the file system to store the binary data and does not use a pure NoSQL solution.

PluginName://host:port/dbName/fileIdentifier  

For instance, a practical example for the MongoDB plugin is:

mongodb://dev-pacs-srv:1234/PACSDB/1.2.840.113619.2.81.290.1.5394.2.16.20131019.215944
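The URI scheme above can be decomposed mechanically. The following is a minimal sketch in Python, not the plugin's actual code: the function name `parse_storage_uri` and the returned dictionary keys are hypothetical, chosen only for illustration, and the only library used is the standard `urllib.parse` module.

```python
from urllib.parse import urlsplit


def parse_storage_uri(uri):
    """Split a URI of the form PluginName://host:port/dbName/fileIdentifier.

    Hypothetical helper for illustration; not part of the Dicoogle SDK.
    """
    parts = urlsplit(uri)
    # The path looks like "/dbName/fileIdentifier": drop the leading slash
    # and split at the first remaining slash.
    db_name, _, file_identifier = parts.path.lstrip("/").partition("/")
    if not (parts.scheme and parts.hostname and db_name and file_identifier):
        raise ValueError(f"not a valid storage URI: {uri!r}")
    return {
        "plugin": parts.scheme,          # e.g. "mongodb"
        "host": parts.hostname,
        "port": parts.port,              # int, or None when no port is given
        "db_name": db_name,
        "file_identifier": file_identifier,
    }


example = parse_storage_uri(
    "mongodb://dev-pacs-srv:1234/PACSDB/"
    "1.2.840.113619.2.81.290.1.5394.2.16.20131019.215944"
)
```

With such a decomposition, the core could in principle pick the plugin from the scheme and hand the remaining parts to it; how Dicoogle actually dispatches is not described here.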

There are several PACS archives already developed and many of them are open source, such as Conquest¹ or dcm4chee². However, it is extremely complex to develop new plugins for those platforms. Dicoogle is an open source PACS archive that makes it quite easy to implement and incorporate new plugins [6]. Also, the literature lacks an assessment of the performance impact of NoSQL solutions in this kind of system. It is important to know whether NoSQL PACS can deal with large DICOM repositories and whether they can be recommended for storing DICOM files, instead of traditional file system approaches.

[Figure 1 labels: Modalities; Physicians analysing in DICOM workstations; DICOM C-STORE; DICOM WADO; DICOM C-FIND; DICOM C-MOVE; Dicoogle Core]

III. A PACS ARCHIVE BASED ON NOSQL SOLUTIONS

A. NoSQL PACS interfaces

In this paper, the main idea was to develop a NoSQL PACS and measure the performance of several services. In order to implement the backend of the PACS in NoSQL, Dicoogle³ was used. It provides a Software Development Kit (SDK) to implement different kinds of plugins. There are three interfaces provided by Dicoogle (Figure 1): storage, index and query.

• Storage: provides an abstraction to store and retrieve DICOM files in any kind of storage system.
• Index: allows indexing the metadata of DICOM files.
• Query: allows searching over the information system and retrieving search results.

Figure 1: Dicoogle SDK: the plugin interfaces and the implemented plugins for MongoDB and CouchDB. [Diagram labels: Storage Plugin, store(DicomObject obj), with Store MongoDB and Store CouchDB; Index Plugin, index(DicomInputStream), with Index MongoDB and Index CouchDB; Query Plugin, query(String query), with Query MongoDB and Query CouchDB.]
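The three plugin roles (storage, index, query) can be pictured as three small interfaces. The sketch below is written in Python for brevity, whereas the Dicoogle SDK itself is a Java SDK; the class names, method signatures and the `memory://` scheme are assumptions made for illustration and do not reproduce the SDK's real API.

```python
from abc import ABC, abstractmethod


class StoragePlugin(ABC):
    """Stores and retrieves DICOM files in some storage system."""

    @abstractmethod
    def store(self, file_identifier: str, data: bytes) -> str:
        """Store the bytes and return a URI that locates them."""

    @abstractmethod
    def retrieve(self, uri: str) -> bytes:
        """Return the bytes previously stored at the URI."""


class IndexPlugin(ABC):
    """Indexes the metadata of a stored DICOM file."""

    @abstractmethod
    def index(self, uri: str, metadata: dict) -> None:
        """Record the metadata extracted from the file at the URI."""


class QueryPlugin(ABC):
    """Searches the indexed metadata."""

    @abstractmethod
    def query(self, query: str) -> list:
        """Return the URIs of the files matching the query."""


class InMemoryArchive(StoragePlugin, IndexPlugin, QueryPlugin):
    """Toy backend that keeps everything in dictionaries (illustration only)."""

    def __init__(self):
        self._files = {}
        self._index = {}

    def store(self, file_identifier, data):
        # Hypothetical scheme and database name, following the URI pattern.
        uri = f"memory://localhost:0/PACSDB/{file_identifier}"
        self._files[uri] = data
        return uri

    def retrieve(self, uri):
        return self._files[uri]

    def index(self, uri, metadata):
        self._index[uri] = dict(metadata)

    def query(self, query):
        # Only a single "Field:value" term is supported in this sketch.
        field, _, value = query.partition(":")
        return [u for u, meta in self._index.items() if meta.get(field) == value]
```

Because each role is a separate interface, a deployment could combine, for example, one backend for storage with another for indexing, which is the kind of mix the paper evaluates.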

The Dicoogle core also uses this stream to index the file's content. The index plugin is responsible for storing the information in its own way, but it should also be tested together with the query plugin, in order to improve query speed. To support the typical use of a PACS archive, it only needs to store the DIM (DICOM Information Model) fields. Dicoogle queries have a particular syntax based on the Lucene query syntax. Each query plugin has to parse the query and process it to return the matching values. The results returned by this method are the ones injected into the C-FIND responses.

Thus, the Dicoogle SDK reduces the complexity of developing a new PACS archive. The Dicoogle core manages all the operations between the plugins, including the DICOM communications, which makes it easier to test different NoSQL solutions.

¹ http://ingenium.home.xs4all.nl/dicom.html
² http://www.dcm4che.org/
³ http://www.dicoogle.com/

In the CouchDB plugin, when the Dicoogle core wants to index a file, the process is similar to the MongoDB one. However, in this plugin a Map containing all the metadata is created. After that, the plugin stores this HashMap as a document in a database (different from the one used by the storage plugin). In CouchDB, the indexing system is managed by the map/reduce functions.

1) MongoDB Plugin

The idea was to implement a PACS archive based on MongoDB. Thus, the MongoDB plugins represented in Figure 1 were developed. The first limitation is that MongoDB cannot store, by default, files larger than 16 MB, which ensures that a single document does not use a large amount of RAM or bandwidth. However, DICOM files can be larger than this limit. To circumvent this issue, MongoDB provides an API named GridFS, which stores files in several chunks (256 kB per chunk by default). GridFS uses one collection to store the file chunks and another one to store file metadata. The storage plugin relies on MongoDB GridFS. When it receives the DICOM file from the Dicoogle core, it reads it as a binary file and stores it with GridFS, which provides horizontal scalability through replication across several nodes.
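The chunking idea behind GridFS can be illustrated without a database. The sketch below is plain Python and does not call MongoDB or any driver; the function names are hypothetical, and the document fields (`length`, `chunkSize`, `files_id`, `n`, `data`) are a simplified subset of what GridFS stores in its two collections.

```python
CHUNK_SIZE = 256 * 1024            # chunk size quoted in the text
DOCUMENT_LIMIT = 16 * 1024 * 1024  # per-document limit quoted in the text


def split_into_chunks(file_id, data, chunk_size=CHUNK_SIZE):
    """Return one metadata document and the list of chunk documents."""
    file_doc = {"_id": file_id, "length": len(data), "chunkSize": chunk_size}
    chunk_docs = [
        # "n" is the position of the chunk within the file.
        {"files_id": file_id, "n": n, "data": data[offset:offset + chunk_size]}
        for n, offset in enumerate(range(0, len(data), chunk_size))
    ]
    return file_doc, chunk_docs


def reassemble(chunk_docs):
    """Concatenate the chunks in order to recover the original bytes."""
    ordered = sorted(chunk_docs, key=lambda chunk: chunk["n"])
    return b"".join(chunk["data"] for chunk in ordered)


# A 20 MiB payload (zero bytes standing in for a large DICOM file) is above
# the single-document limit, but every chunk stays far below it.
payload = bytes(20 * 1024 * 1024)
file_doc, chunk_docs = split_into_chunks("1.2.3", payload)
```

In this example the payload is split into 80 chunks of 256 KiB each, so no single stored document comes close to the 16 MB limit.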

To query the database, CouchDB uses map/reduce functions. These are predefined functions stored in design documents. The plugin accesses those functions by sending a GET request to DBname/_design/designDocumentName/_view/viewName. For example, several functions were developed for the query process, such as a map function and a reduce function that count the number of studies for each patient. First, the map function is executed. It produces a key-value list from all documents containing a patientName field. This key-value list is passed to the reduce function as input. This function sums all the values to produce the output. So, at the end, the result of this map/reduce function is the number of studies per patient, as requested.
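The map/reduce counting described above can be simulated in a few lines. CouchDB views are normally written in JavaScript and executed by the server; the Python sketch below only mimics the data flow. It assumes one document per study and a hypothetical document layout in which `patientName` is a top-level field.

```python
from collections import defaultdict


def map_studies(doc):
    """Map step: emit (patientName, 1) for documents that have the field."""
    if "patientName" in doc:
        yield doc["patientName"], 1


def reduce_count(values):
    """Reduce step: sum the values emitted for one key."""
    return sum(values)


def run_view(docs):
    """Group the emitted pairs by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_studies(doc):
            grouped[key].append(value)
    return {key: reduce_count(values) for key, values in grouped.items()}


# Hypothetical documents, one per study.
documents = [
    {"patientName": "Dupont", "modality": "XA"},
    {"patientName": "Dupont", "modality": "MR"},
    {"patientName": "Silva", "modality": "MR"},
    {"modality": "CT"},  # no patientName: skipped by the map step
]
studies_per_patient = run_view(documents)
```

Because the view is defined ahead of time, this style suits queries that are known in advance, in contrast with the dynamic queries discussed for MongoDB.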

When indexing a DICOM file's content, the Dicoogle core calls the storage plugin to retrieve a stream from the given URI. Then it sends this stream to the index plugin. Dcm4che is used to parse the DICOM object and pass all the information to the plugin. This information is stored in another MongoDB collection (i.e. a group of stored documents), but in the same database. Each indexed DICOM file corresponds to one document collection containing the binary data. For each tag of the file, it creates a field with the same key and value.

IV. RESULTS AND DISCUSSION

A. Performance measurements

In order to assess the performance of the NoSQL implementations versus a file system approach, we used a computer with 4 Intel Xeon X5650 2.67 GHz processors and 4 GB of RAM to support the NoSQL PACS archive. The client workstations were Intel machines at 2.67 GHz with 2 GB of RAM. Several clients were used, such as dcm4che2 and dcmtk. A dataset with 5 studies was used, containing 1102 DICOM files.

Dicoogle already has a default plugin based on the file system and Lucene, which is used in a regional PACS with more than 5000 exams monthly [9]. Thus, our tests were compared against this Dicoogle baseline. Due to the low coupling of the solution, it was possible to use a different backend for each service. For instance, it was possible to store only the information system's database and keep the files stored in MongoDB. The idea was to measure whether it was practicable to store these binary files inside the NoSQL databases.

Figure 3 compares the performance of the several solutions: Lucene (with file system), and MongoDB and CouchDB, both with and without file system. The main purpose of those tests was to compare the performance of the different plugins with diverse modalities. For instance, storing an XA study with 15 heavy files, which represents 383 Mbytes, is very different from storing an MR study with 613 small files, which represents 212 Mbytes.

In order to measure the query results and test the system's scalability, we tested the performance according to the number of results it can retrieve, measured with different systems. We prepared three distinct databases with different numbers of stored documents (small: 15733, medium: 31465 and large: 47197 documents).
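To make the XA versus MR contrast concrete, the average file size of each study can be computed from the figures given in the text. The short Python sketch below only does this arithmetic; the variable names are illustrative, and the comparison with the 16 MB document limit treats "Mbytes" and "MB" as the same unit, which is an approximation.

```python
# Study sizes and file counts as quoted in the text.
studies = {
    "XA": {"files": 15, "size_mbytes": 383},
    "MR": {"files": 613, "size_mbytes": 212},
}
DOCUMENT_LIMIT_MBYTES = 16  # MongoDB per-document limit mentioned earlier

# Average size of one file in each study.
average_file_size = {
    modality: study["size_mbytes"] / study["files"]
    for modality, study in studies.items()
}

# Does the average file exceed the single-document limit?
exceeds_limit = {
    modality: size > DOCUMENT_LIMIT_MBYTES
    for modality, size in average_file_size.items()
}

for modality, size in average_file_size.items():
    print(f"{modality}: {size:.2f} Mbytes per file on average")
```

On average, an XA file is about 25.5 Mbytes, above the 16 MB limit, while an MR file is about 0.35 Mbytes, which helps explain why the two modalities stress the storage plugins differently.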

MongoDB allows making dynamic queries. In fact, it is possible to create a BSON (Binary JSON) object that contains the query and then execute it with MongoDB. If the user performs a query inside Dicoogle, or via DICOM C-FIND, this query is propagated in a pre-defined format to the plugins (Figure 2). MongoDB supports queries in BSON format only, and the plugin is responsible for performing the translation. For instance, Figure 2 presents a query entered in Dicoogle and shows how it is translated into MongoDB. The query filters by a specific acquisition date and patient name.

AcquisitionDate:01012005 AND PatientName:Dupont

{$and:[{"AcquisitionDate":01012005}, {"PatientName":{"$regex":"^Dupont.*", "$options":"i"}}]}

Figure 2: Query translation from Dicoogle SDK to MongoDB
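The translation shown in Figure 2 can be approximated with a small function. This is a minimal sketch under stated assumptions, not the plugin's parser: it only handles `Field:value` terms joined by `AND`, the function name `translate_query` is hypothetical, and numeric values are kept as strings (so that leading zeros survive), whereas Figure 2 shows the date as a bare number.

```python
import re


def translate_query(query):
    """Translate 'Field:value AND Field:value' into a MongoDB-style filter."""
    clauses = []
    for term in query.split(" AND "):
        field, separator, value = term.strip().partition(":")
        if not (separator and field and value):
            raise ValueError(f"cannot parse term: {term!r}")
        if value.isdigit():
            # Exact match for numeric-looking values such as dates.
            # Assumption: stored as strings to preserve leading zeros.
            clauses.append({field: value})
        else:
            # Case-insensitive prefix match, as in Figure 2.
            pattern = "^" + re.escape(value) + ".*"
            clauses.append({field: {"$regex": pattern, "$options": "i"}})
    return {"$and": clauses}


mongo_filter = translate_query("AcquisitionDate:01012005 AND PatientName:Dupont")
```

The resulting dictionary has the same shape as the filter in Figure 2 and could be passed to a MongoDB driver's find operation; a full Lucene-style parser (OR, ranges, quoting) is outside the scope of this sketch.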

MongoDB allows indexing several fields, which improves the performance of the query process, i.e. documents are located faster. The only limitation is that at most 64 indexes can be set up per collection. Nevertheless, to support the PACS archive standards, this number of fields is enough, i.e. it covers the most frequently used queries for DICOM files (patient name, SeriesInstanceUID, series date, modality and a few others).

2) CouchDB

In the CouchDB plugin, when the storage interface receives a DICOM file from the Dicoogle core, it reads it as a binary file and creates a new document in the database. To store the DICOM file, the plugin creates an attachment to the document.
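A CouchDB document carrying a file as an attachment can be sketched as a JSON structure. The example below builds such a document in Python using CouchDB's inline attachment layout (an `_attachments` object whose entries hold a `content_type` and base64-encoded `data`). Using the SOPInstanceUID as the document `_id` and `image.dcm` as the attachment name are assumptions for illustration; the plugin may organise its documents differently, and no request is sent to a server here.

```python
import base64
import json


def build_couchdb_document(sop_instance_uid, dicom_bytes):
    """Build a document with the DICOM file as an inline attachment."""
    return {
        "_id": sop_instance_uid,  # assumption: identifier reused as document id
        "_attachments": {
            "image.dcm": {        # hypothetical attachment name
                "content_type": "application/dicom",
                "data": base64.b64encode(dicom_bytes).decode("ascii"),
            }
        },
    }


document = build_couchdb_document("1.2.3", b"DICM")
# The structure is plain JSON, ready to be used as a request body.
body = json.dumps(document)
```

Base64 encoding inflates the payload by roughly one third, which is one reason large files may be slower to store this way than on a file system.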


In a PACS archive, it is good to have a backup strategy that can be supported with more low-power nodes.

CouchDB uses a different approach, relying on peer-based replication. It is possible to always keep the same data in two instances of CouchDB by using continuous replication.

Finally, developing a PACS archive from scratch is very expensive for developers. By adopting the Dicoogle SDK, it was possible to implement both prototypes in a few weeks.

Figure 3: Storage (store file + index) performance metrics. [Plot: time of store and index (ms, axis from 0 to 90000) against size of the studies (7.83, 15.5, 45, 212 and 383 Mbytes), with one series each for Lucene + File System, MongoDB, MongoDB + File System, CouchDB and CouchDB + File System.]

V. CONCLUSION

In this study, we developed two open source PACS archives based on MongoDB⁴ and CouchDB⁵ and we compared their performance in real-case medical imaging scenarios. It was possible to conclude that MongoDB and CouchDB have quite similar performance when storing and retrieving DICOM files. They both have some drawbacks when storing big files, which occur in a few modalities, such as XA or high-resolution mammography. An interesting result regarding the behaviour of the three developed solutions (MongoDB, CouchDB and Lucene) was that the retrieve performance does not degrade significantly as the size of the database increases. In the future, in order to support more robust replication schemas, we intend to replace Lucene with another search engine, like Solr or Elasticsearch, and to combine these solutions with the NoSQL databases.

Figure 4 shows the time needed to retrieve different numbers of results with an increasing size of the data repository. It is possible to see that the time taken to retrieve results grows linearly, even considering the larger repositories.

[Plot for Figure 4: return of the results (ms, axis from 0 to 4000) for Lucene, MongoDB and CouchDB, for the small (483 results), medium (966 results) and large (1449 results) indexes. Type of index: small contains 15733 documents; medium contains 31465 documents; large contains 47197 documents.]

REFERENCES

[1] N. V. Chawla and D. A. Davis, "Bringing big data to personalized healthcare: A patient-centered framework," Journal of General Internal Medicine, vol. 28, pp. 660-665, 2013.
[2] J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, and A. H. Byers, "Big data: The next frontier for innovation, competition, and productivity," 2011.
[3] N. Leavitt, "Will NoSQL databases live up to their promise?," Computer, vol. 43, pp. 12-14, 2010.
[4] S. J. Rascovsky, J. A. Delgado, A. Sanz, V. D. Calvo, and G. Castrillón, "Informatics in Radiology: Use of CouchDB for Document-based Storage of DICOM Objects," Radiographics, vol. 32, pp. 913-927, 2012.
[5] R. C. Taylor, "An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics," BMC Bioinformatics, vol. 11, p. S1, 2010.
[6] C. Costa, C. Ferreira, L. Bastião, L. Ribeiro, A. Silva, and J. L. Oliveira, "Dicoogle-an open source peer-to-peer PACS," Journal of Digital Imaging, vol. 24, pp. 848-856, 2011.
[7] B. G. Tudorica and C. Bucur, "A comparison between several NoSQL databases with comments and notes," in Roedunet International Conference (RoEduNet), 2011 10th, 2011, pp. 1-5.
[8] E. Redmond and J. R. Wilson, "Seven Databases in Seven Weeks," Pragmatic Programmers, 2012.
[9] L. A. B. Silva, R. Pinho, L. S. Ribeiro, C. Costa, and J. L. Oliveira, "A Centralized Platform for Geo-Distributed PACS Management," Journal of Digital Imaging, pp. 1-9, 2013.

Figure 4: Query measurements with different number of results

B. Strengths and drawbacks

With the explosion of NoSQL databases, there are plenty of different data stores that can be used nowadays. Not all of them are able to store binary files; some, such as the Apache Lucene engine, only allow indexing metadata and searching over the data. MongoDB proved to be a very good solution to index the DICOM metadata, allowing quite fast storage and retrieval of information. Nonetheless, Lucene is the best solution if you want to query every DICOM tag or if you are using free text queries. CouchDB indexing, on the other hand, is a great solution if you already know every kind of query that will be processed.

Despite somewhat slower performance in the storage and retrieval of DICOM files, one big advantage of NoSQL databases for supporting the PACS archive is their replication strategies. MongoDB uses a master/slave replication system. The primary instance of MongoDB receives all write operations from clients. All other instances apply the same operations to their data sets. By default, the primary instance also handles read operations from clients, but reads can be directed to a secondary instance. If the primary is unavailable, a secondary will be selected as primary.

⁴ https://github.com/bioinformatics-ua/dicoogle-mongo-plugin
⁵ https://github.com/bioinformatics-ua/dicoogle-couchdb-plugin