Using Data Grids to Manage Distributed Data
Reagan W. Moore
San Diego Supercomputer Center
http://www.npaci.edu/DICE
[email protected]
National Partnership for Advanced Computational Infrastructure
Topics
• Data Grids - data sharing environments
  – Required capabilities
  – Storage Resource Broker
• Examples of data grids
  – Distributed data management for a project
  – Federated digital libraries for data sharing across projects
Data Grids
• Software systems that manage distributed data
• Control global name spaces for
  – Resources
  – Users
  – Files
  – Metadata context
• Provide standard operations on each name space
• Provide single sign-on authentication, collection management, latency management, replication, and federation
• Generic distributed data management technology
Worldwide Universities Network
David De Roure, University of Southampton
[email protected]
http://www.ecs.soton.ac.uk/~dder
• Implement data grid linking academic universities
• Support collaborative research and education
  – HASTAC: Humanities, Arts, Science and Technology Advanced Collaboratory
  – Geo-referenced social science data collections
  – Earth Science data collections
• Provide data grid registry to promote federation of international data grids
Foundation of the WUN Grid
• SDSC
• Manchester
• Southampton
• White Rose
• NCSA
• A functioning, general purpose international Grid
• A hub for federating other data grids
Manchester-SDSC mirror
Data Management Systems
• Data grid for managing distributed data
  – Latency management for bulk analyses of collections
  – Infrastructure independent name spaces for describing data, resources, users, and state information
• Digital library for managing data context
  – Curation services for managing collections
  – Descriptive metadata
• Persistent archive to manage technology evolution
  – Interoperability mechanisms between heterogeneous storage systems and user access mechanisms
Provide Context for Data
• Properties of files
  – Provenance - source
  – Descriptive attributes
  – Structure
• Organize properties as metadata in a collection hierarchy
  – Define operations on file properties
  – Manage state information - location, replicas, containers
• Separate context management from content management
  – Maintain consistency of context as operations are done on context
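The separation of context from content described on this slide can be illustrated with a small sketch. It is not the SRB implementation; the class names `FileContext` and `Collection`, and every field and method name, are hypothetical and chosen only to mirror the bullet points (provenance, descriptive attributes, structure, and state information such as replica locations).

```python
from dataclasses import dataclass, field


@dataclass
class FileContext:
    """Context kept for one file, stored apart from the file's bytes."""
    provenance: str                                    # source of the file
    descriptive: dict = field(default_factory=dict)    # descriptive attributes
    structure: str = "unknown"                         # e.g. a format name
    replicas: list = field(default_factory=list)       # state information


class Collection:
    """Hypothetical collection mapping a logical path to its context."""

    def __init__(self):
        self._context = {}

    def register(self, path, ctx):
        self._context[path] = ctx

    def context(self, path):
        return self._context[path]

    def record_replica(self, path, location):
        # When content is copied, the state information is updated as well,
        # so the context stays consistent with what exists on storage.
        ctx = self._context[path]
        if location not in ctx.replicas:
            ctx.replicas.append(location)

    def find(self, **attrs):
        # Query the collection by descriptive attributes only.
        return sorted(
            path for path, ctx in self._context.items()
            if all(ctx.descriptive.get(k) == v for k, v in attrs.items())
        )
```

The point of the design is that queries and state updates touch only the context records; the content itself is never opened.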
Managing Distributed Data
Data Access Methods (Web Browser, DSpace, OAI-PMH)
Storage Repository
Naming conventions provided by storage systems:
• Storage location
• User name
• File name
• File context (creation date,…)
• Access constraints
Storage Resource Broker Data Grid
Data Access Methods (Web Browser, DSpace, OAI-PMH)
Data Collection

Storage Repository                  Data Grid
• Storage location                  • Logical resource name space
• User name                         • Logical user name space
• File name                         • Logical file name space
• File context (creation date,…)    • Logical context (metadata)
• Access constraints                • Control/consistency constraints
Logical Name Spaces
• Storage resources
  – Logical names for managing collections of resources
• User names (user-name / domain / data grid)
  – Distinguished names for users to manage access controls
• Digital Entities (files, blobs, structured data, …)
  – Logical name space for global identifiers for files
• Context - Metadata attributes
  – Standard metadata attributes, Dublin Core
  – State information resulting from data grid operations
  – User-defined metadata
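A minimal sketch of how the four logical name spaces listed above might be held together follows. It is only an illustration under stated assumptions: `GridUser`, `NameSpaces`, and all method names are hypothetical and are not part of the SRB API. The one detail taken from the slide is the three-part user name (user-name / domain / data grid).

```python
from typing import NamedTuple


class GridUser(NamedTuple):
    """Distinguished user name: user-name / domain / data grid."""
    name: str
    domain: str
    grid: str

    def __str__(self):
        return f"{self.name}/{self.domain}/{self.grid}"


class NameSpaces:
    """Hypothetical container for the four logical name spaces."""

    def __init__(self):
        self.resources = {}   # logical resource -> physical storage systems
        self.users = set()    # distinguished user names
        self.files = {}       # logical file name -> [(storage system, path)]
        self.metadata = {}    # logical file name -> {attribute: value}

    def add_resource(self, logical_resource, physical_systems):
        self.resources[logical_resource] = list(physical_systems)

    def add_user(self, name, domain, grid):
        user = GridUser(name, domain, grid)
        self.users.add(user)
        return user

    def register_file(self, logical_name, system, path, **attributes):
        # One logical name may point at several physical copies.
        self.files.setdefault(logical_name, []).append((system, path))
        self.metadata.setdefault(logical_name, {}).update(attributes)

    def replicas_on(self, logical_name, logical_resource):
        # Resolve a logical file name against a logical resource name.
        members = set(self.resources.get(logical_resource, []))
        return [r for r in self.files.get(logical_name, []) if r[0] in members]
```

Because applications refer only to the logical names, a file can move between storage systems without its identifier changing.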
Storage Repository Virtualization
User Application
Archive, Database, File System
Storage Repository Virtualization (Standard Operations on Logical Name Space)
• Remote operations
• Unix file system
• Latency management
• Procedures
• Transformations
• Third party transfer
• Filtering
• Queries
Common set of operations for interacting with every type of storage repository
User Application
Archive, Database, File System
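The idea of a common set of operations over every type of storage repository can be sketched with an abstract interface and two drivers. This is an assumption-laden illustration, not SRB code: `Repository`, `MemoryRepository`, `FileSystemRepository`, and `transfer` are hypothetical names, and the in-memory driver merely stands in for an archive or database back end.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class Repository(ABC):
    """Operations every storage driver must provide (hypothetical)."""

    @abstractmethod
    def put(self, name: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, name: str) -> bytes: ...


class MemoryRepository(Repository):
    """Stand-in for an archive or database back end."""

    def __init__(self):
        self._blobs = {}

    def put(self, name, data):
        self._blobs[name] = bytes(data)

    def get(self, name):
        return self._blobs[name]


class FileSystemRepository(Repository):
    """Driver that stores each object as a file under a root directory."""

    def __init__(self, root):
        self._root = Path(root)

    def put(self, name, data):
        (self._root / name).write_bytes(data)

    def get(self, name):
        return (self._root / name).read_bytes()


def transfer(name, source, destination):
    # Works for any pair of drivers because both expose the same operations.
    destination.put(name, source.get(name))


# Usage: copy one object from the in-memory store to a scratch directory.
with tempfile.TemporaryDirectory() as scratch:
    archive = MemoryRepository()
    disk = FileSystemRepository(scratch)
    archive.put("run42.dat", b"simulation output")
    transfer("run42.dat", archive, disk)
    copied = disk.get("run42.dat")
```

The application code (`transfer`) never needs to know which kind of repository it is talking to, which is the property the slide attributes to storage repository virtualization.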
Federated Server Architecture
[Diagram. Components: Read Application; Logical Name Or Attribute Condition; two SRB servers; two SRB agents; MCAT; storage resources R1 and R2. Arrows numbered 1 to 6 trace a read request.]
Diagram labels: Peer-to-peer Brokering; Parallel Data Access; Server(s) Spawning; Data Access
1. Logical-to-Physical mapping
2. Identification of Replicas
3. Access & Audit Control
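The three numbered functions on this slide (logical-to-physical mapping, identification of replicas, access and audit control) can be sketched as a single catalog lookup followed by a read from the first reachable replica. The sketch is hypothetical throughout: `Catalog`, `AccessDenied`, and `read` are illustrative names, storage resources are modelled as plain dictionaries, and the real request flow between SRB servers and agents is not reproduced.

```python
class AccessDenied(Exception):
    """Raised when a user lacks read permission on a logical name."""


class Catalog:
    """Hypothetical stand-in for a metadata catalog."""

    def __init__(self):
        self.replicas = {}   # logical name -> [(resource, physical path)]
        self.readers = {}    # logical name -> users allowed to read
        self.audit = []      # (user, logical name, outcome)

    def register(self, logical_name, resource, path, readers):
        self.replicas.setdefault(logical_name, []).append((resource, path))
        self.readers.setdefault(logical_name, set()).update(readers)

    def locate(self, user, logical_name):
        # Access and audit control: every attempt is logged.
        allowed = user in self.readers.get(logical_name, set())
        outcome = "granted" if allowed else "denied"
        self.audit.append((user, logical_name, outcome))
        if not allowed:
            raise AccessDenied(logical_name)
        # Logical-to-physical mapping and identification of replicas.
        return list(self.replicas[logical_name])


def read(catalog, stores, user, logical_name):
    """Return the bytes of the first replica that is reachable.

    `stores` maps a resource name to a dict of {physical path: bytes};
    a resource missing from `stores` is treated as offline.
    """
    for resource, path in catalog.locate(user, logical_name):
        store = stores.get(resource)
        if store is not None and path in store:
            return store[path]
    raise FileNotFoundError(logical_name)
```

In this sketch a replica on an offline resource is skipped silently, so the caller sees one logical file regardless of which copy served the request.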
Data Abstraction
User Application
Archive at SDSC, Database at U Md, File System at NARA
Context Abstraction
• Logical name space
• Location independent identifier
• Persistent identifier
• Collection owned data
• Access controls
• Audit trails
• Checksums
• Descriptive metadata
Common naming convention and set of attributes for describing digital entities
Inter-realm authentication, Single sign-on system
User Application
Archive at SDSC, Database at U Md, File System at U Texas
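One item in the list above, checksums, lends itself to a short illustration: a checksum stored with the context lets any replica be verified no matter where it lives. The slide does not say which algorithm is used, so SHA-256 here is an assumption, and the function names are hypothetical.

```python
import hashlib


def checksum(data: bytes) -> str:
    # SHA-256 is an assumed choice; the slide names no algorithm.
    return hashlib.sha256(data).hexdigest()


def verify_replicas(expected: str, replicas: dict) -> dict:
    """Compare each replica's checksum with the value held in the context.

    `replicas` maps a site name to the bytes read from that site.
    Returns a dict of site name -> True if the copy is intact.
    """
    return {site: checksum(data) == expected for site, data in replicas.items()}
```

A caller would record `checksum(original)` when the file is registered and run `verify_replicas` later against copies held at each site.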
Latency Management - Bulk Operations
• Bulk register
  – Create a logical name for a file
• Bulk load
  – Create a copy of the file on a data grid storage repository
• Bulk unload
  – Provide containers to hold small files and pointers to each file location
• Bulk delete
  – Mark as deleted in metadata catalog
  – After specified interval, delete file
• Bulk metadata load
• Requests for bulk operations for access control setting, …
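The bulk register and bulk delete behaviour listed above (mark as deleted in the catalog, then remove after a specified interval) can be sketched as follows. `BulkCatalog` and its methods are hypothetical, and time is passed in explicitly as a number so the example stays deterministic; none of this reflects actual SRB commands.

```python
class BulkCatalog:
    """Hypothetical catalog supporting bulk register and two-phase delete."""

    def __init__(self, purge_after):
        self.purge_after = purge_after   # interval before real deletion
        self.entries = {}                # logical name -> physical path
        self.deleted_at = {}             # logical name -> time marked deleted

    def bulk_register(self, pairs):
        # One catalog update for many files instead of one call per file.
        pairs = list(pairs)
        self.entries.update(pairs)
        return len(pairs)

    def bulk_delete(self, names, now):
        # Phase 1: only mark the entries as deleted in the catalog.
        for name in names:
            if name in self.entries:
                self.deleted_at.setdefault(name, now)

    def visible(self):
        return sorted(n for n in self.entries if n not in self.deleted_at)

    def purge(self, now):
        # Phase 2: after the interval has elapsed, drop the entries for good.
        expired = [n for n, t in self.deleted_at.items()
                   if now - t >= self.purge_after]
        for name in expired:
            del self.entries[name]
            del self.deleted_at[name]
        return sorted(expired)
```

Batching the catalog updates is what addresses latency: the cost of a round trip is paid once per batch rather than once per file.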
SRB Latency Management
[Diagram: data moving from Source across the Network to Destination, annotated with the mechanisms below]
• Remote Proxies, Staging
• Data Aggregation, Containers
• Prefetch
• Replication
• Streaming
• Caching
• Server-initiated I/O
• Parallel I/O
• Client-initiated I/O
Data Grid Federation
• Link multiple independent data grids
  – Coordinate metadata between independent metadata catalogs
• Provide consistency and access constraints for each of the four logical name spaces (resources, users, files, metadata)
  – Peer-to-peer federations, data access
  – Replication federations, shared resources
  – Hierarchical federations, consistency constraints
• Tune data grid federation by implementing different consistency and access constraints
National Archives Persistent Archive
• NARA (MCAT): Principal copy stored at NARA with complete metadata catalog
• U Md (MCAT): Replicated copy at U Md for improved access, load balancing and disaster recovery
• SDSC (MCAT): Deep Archive at SDSC, no user access, but complete copy
Federation Environments
[Diagram relating named federation configurations to the constraints they impose; the connecting arrows are not recoverable from the extracted text.]
Federation families: Peer-to-Peer Data Grids; Replication Data Grids; Hierarchical Data Grids
Constraint types: Replication Constraints; Access Constraints; Consistency Constraints
Named configurations: Free Floating; Occasional Interchange; Replicated Data; Resource Interaction; User and Data Replica; Replicated Catalog; Nomadic; Snow Flake; Master Slave; Deep Archive
Properties shown in the diagram:
• User-ID sharing: No User-ID Sharing; Partial User-ID Sharing; One Shared User-ID; Complete User-ID Sharing
• Resource sharing: No Resource Sharing; Partial Resource Sharing; Complete Resource Sharing
• Metadata synchronization: No Metadata Synch; System Controlled Partial Synch; System Controlled Complete Synch
• Other: System Set Access Controls; System Managed Replication; Connection From Any Zone; Hierarchical Zone Organization; Super Administrator Zone Control
Generic Infrastructure
• SDSC developed the Storage Resource Broker (SRB) to support access to distributed data
  – Effort started in 1996 as a DARPA funded project
  – Now supports over 30 national/international projects
• Development team of 11 staff is led by
  – Michael Wan, data management systems
  – Arcot Rajasekar, information management systems
Data Grid Federation zoneSRB
[Layered architecture diagram, listed top to bottom]
Application
C, C++, Java Libraries; Linux I/O; Unix Shell; Java, NT Browsers; DLL / Python, Perl; HTTP, DSpace, OpenDAP; OAI, WSDL, WSRF
Federation Management
Consistency & Metadata Management / Authorization, Authentication, Audit
Logical Name Space; Latency Management; Data Transport; Metadata Transport
Catalog Abstraction
  Databases: DB2, Oracle, Sybase, Postgres, mySQL, Informix
Storage Repository Virtualization
  Databases: DB2, Oracle, Sybase, SQLserver, Postgres, mySQL, Informix
  Archives - Tape: HPSS, ADSM, UniTree, DMF, CASTOR, ADS
  ORB
  File Systems: Unix, NT, Mac OSX
Data Management Systems (Supported by Storage Resource Broker)
• Data collecting
  – Sensor systems, object ring buffers and portals
• Data organization
  – Collections, manage data context
• Data sharing
  – Data grids, manage heterogeneity of resources
• Data publication
  – Digital libraries, support discovery
• Data preservation
  – Persistent archives, manage technology evolution
• Data analysis
  – Processing pipelines, manage knowledge extraction
Storage Resource Broker Collections at SDSC (9/27/2004)

Collection                                                              GBs of data stored  Number of files  Number of users
Data Grid
NSF/ITR - National Virtual Observatory                                              53,778        9,507,399               80
NSF - National Partnership for Advanced Computational Infrastructure                22,165        5,156,765              380
Hayden Planetarium - Evolution of the Solar System visualizations                    7,201          113,600              178
NSF/NPACI - Joint Center for Structural Genomics                                     5,228          652,031               50
NSF/NPACI - Biology and Environmental collections                                    8,704           21,881               67
NSF - TeraGrid, ENZO Cosmology simulations                                         104,370          908,600            3,247
NIH - Biomedical Informatics Research Network                                        5,808        3,777,886              172
Digital Library
NLM - Digital Embryo image collection                                                  720           45,365               23
NSF/NPACI - Long Term Ecological Reserve                                               251            8,381               36
NSF/NPACI - Grid Portal                                                              1,917           49,665              392
NIH - Alliance for Cell Signaling microarray data                                      776           60,177               21
NSF - National Science Digital Library SIO Explorer collection                       2,122          758,233               27
NSF/NPACI - Transana education research video collection                                92            2,387               26
NSF/ITR - Southern California Earthquake Center                                     88,199        1,790,319               59
UCSDLib                                                                                128          203,930               29
Persistent Archive
NARA - Research Prototype Persistent Archive                                            89          254,470               58
NSF - National Science Digital Library persistent archive                            3,571       26,908,350              122
TOTAL                                                                               305 TB       50 million            4,967
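As a consistency check on the figures in this slide, the per-collection values can be summed and compared with the TOTAL row (305 TB, 50 million files, 4,967 users). The sketch below assumes that the numbers pair up with the collection names in the order listed, that the BIRN file count reads 3,777,886, and that 1 TB is taken as 1,000 GB; the short collection labels are abbreviations introduced here for brevity.

```python
# (label, GBs of data stored, number of files, number of users)
collections = [
    ("NVO", 53_778, 9_507_399, 80),
    ("NPACI", 22_165, 5_156_765, 380),
    ("Hayden Planetarium", 7_201, 113_600, 178),
    ("JCSG", 5_228, 652_031, 50),
    ("Biology and Environmental", 8_704, 21_881, 67),
    ("TeraGrid ENZO", 104_370, 908_600, 3_247),
    ("BIRN", 5_808, 3_777_886, 172),
    ("Digital Embryo", 720, 45_365, 23),
    ("Long Term Ecological Reserve", 251, 8_381, 36),
    ("Grid Portal", 1_917, 49_665, 392),
    ("Alliance for Cell Signaling", 776, 60_177, 21),
    ("NSDL SIO Explorer", 2_122, 758_233, 27),
    ("Transana", 92, 2_387, 26),
    ("SCEC", 88_199, 1_790_319, 59),
    ("UCSDLib", 128, 203_930, 29),
    ("NARA prototype archive", 89, 254_470, 58),
    ("NSDL persistent archive", 3_571, 26_908_350, 122),
]

total_gb = sum(row[1] for row in collections)
total_files = sum(row[2] for row in collections)
total_users = sum(row[3] for row in collections)

print(f"{total_gb / 1000:.0f} TB, "
      f"{total_files / 1e6:.0f} million files, "
      f"{total_users:,} users")
```

Under these assumptions the sums are 305,119 GB, 50,219,439 files, and 4,967 users, which agrees with the rounded totals on the slide.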
Conclusion
• Distributed data management systems can be built on generic data grid infrastructure
  – Data grids to support bulk access across remote sites
  – Integration of data grid and digital library capabilities to manage massive data collections
  – Federation of data grids to build international discipline-wide collections
For More Information
Reagan W. Moore
San Diego Supercomputer Center
[email protected]
http://www.npaci.edu/DICE
http://www.npaci.edu/DICE/SRB
http://www.npaci.edu/dice/srb/mySRB/mySRB.html