Using Data Grids to Manage Distributed Data

Reagan W. Moore
San Diego Supercomputer Center
http://www.npaci.edu/DICE
Sep 27, 2004

National Partnership for Advanced Computational Infrastructure

Topics
• Data Grids - data sharing environments
  – Required capabilities
  – Storage Resource Broker
• Examples of data grids
  – Distributed data management for a project
  – Federated digital libraries for data sharing across projects

Data Grids
• Software systems that manage distributed data
• Control global name spaces for
  – Resources
  – Users
  – Files
  – Metadata context
• Provide standard operations on each name space
• Provide single sign-on authentication, collection management, latency management, replication, and federation
• Generic distributed data management technology

Worldwide Universities Network
David De Roure, University of Southampton
http://www.ecs.soton.ac.uk/~dder
• Implement data grid linking academic universities
• Support collaborative research and education
  – HASTAC: Humanities, Arts, Science and Technology Advanced Collaboratory
  – Geo-referenced social science data collections
  – Earth Science data collections
• Provide data grid registry to promote federation of international data grids

Foundation of the WUN Grid
• SDSC
• Manchester
• Southampton
• White Rose
• NCSA
• A functioning, general purpose international Grid
• A hub for federating other data grids
[Figure label: Manchester-SDSC mirror]

Data Management Systems
• Data grid for managing distributed data
  – Latency management for bulk analyses of collections
  – Infrastructure independent name spaces for describing data, resources, users, and state information
• Digital library for managing data context
  – Curation services for managing collections
  – Descriptive metadata
• Persistent archive to manage technology evolution
  – Interoperability mechanisms between heterogeneous storage systems and user access mechanisms

Provide Context for Data
• Properties of files
  – Provenance - source
  – Descriptive attributes
  – Structure
• Organize properties as metadata in a collection hierarchy
  – Define operations on file properties
  – Manage state information - location, replicas, containers
• Separate context management from content management
  – Maintain consistency of context as operations are done on context

Managing Distributed Data
Data Access Methods (Web Browser, DSpace, OAI-PMH)
Storage Repository - naming conventions provided by storage systems
• Storage location
• User name
• File name
• File context (creation date,…)
• Access constraints

Storage Resource Broker Data Grid
Data Access Methods (Web Browser, DSpace, OAI-PMH)
Data Collection

Storage Repository                    Data Grid
• Storage location                    • Logical resource name space
• User name                           • Logical user name space
• File name                           • Logical file name space
• File context (creation date,…)      • Logical context (metadata)
• Access constraints                  • Control/consistency constraints

Logical Name Spaces
• Storage resources
  – Logical names for managing collections of resources
• User names (user-name / domain / data grid)
  – Distinguished names for users to manage access controls
• Digital Entities (files, blobs, structured data, …)
  – Logical name space for global identifiers for files
• Context - Metadata attributes
  – Standard metadata attributes, Dublin Core
  – State information resulting from data grid operations
  – User-defined metadata
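
As a rough illustration of how a logical file name space can stay independent of storage locations, the Python sketch below keeps a toy catalog that maps each logical name to its replicas, user-defined metadata, and owner. Every class, field, and path here is hypothetical and chosen for illustration; none of it is the SRB or MCAT interface.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    resource: str        # logical resource name, not a host address
    physical_path: str   # where the bytes live on that resource


@dataclass
class DigitalEntity:
    logical_name: str                              # global identifier
    owner: str                                     # user-name / domain / data grid
    replicas: list = field(default_factory=list)   # state information
    metadata: dict = field(default_factory=dict)   # descriptive context


class Catalog:
    """Toy metadata catalog keyed by logical file name (illustrative only)."""

    def __init__(self):
        self._entities = {}

    def register(self, logical_name, owner, resource, physical_path, **metadata):
        # Registering the same logical name again adds another replica.
        entity = self._entities.setdefault(
            logical_name, DigitalEntity(logical_name, owner)
        )
        entity.replicas.append(Replica(resource, physical_path))
        entity.metadata.update(metadata)
        return entity

    def resolve(self, logical_name):
        """Return every physical location recorded for a logical name."""
        return [(r.resource, r.physical_path)
                for r in self._entities[logical_name].replicas]

    def query(self, **conditions):
        """Return logical names whose metadata matches all conditions."""
        return sorted(
            name for name, e in self._entities.items()
            if all(e.metadata.get(k) == v for k, v in conditions.items())
        )
```

Because applications hold only the logical name, a file can gain or lose replicas without any application-side change.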

Storage Repository Virtualization
[Figure: User Application; storage repositories: Archive, Database, File System]

Storage Repository Virtualization (Standard Operations on Logical Name Space)
Common set of operations for interacting with every type of storage repository
• Remote operations
• Unix file system
• Latency management
• Procedures
• Transformations
• Third party transfer
• Filtering
• Queries
[Figure: User Application; storage repositories: Archive, Database, File System]
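
One way to picture a "common set of operations for interacting with every type of storage repository" is a driver interface that each repository type implements. The sketch below is a minimal Python illustration under that assumption; the class names, the two operations chosen, and the in-memory stand-in for a database or archive are all hypothetical and are not taken from SRB.

```python
from abc import ABC, abstractmethod
import os
import tempfile


class StorageDriver(ABC):
    """Operations every repository type must offer (hypothetical subset)."""

    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...

    @abstractmethod
    def read(self, path: str) -> bytes: ...


class FileSystemDriver(StorageDriver):
    """Stores each object as a file under a root directory."""

    def __init__(self, root: str):
        self.root = root

    def _full(self, path: str) -> str:
        return os.path.join(self.root, path.lstrip("/"))

    def write(self, path, data):
        full = self._full(path)
        os.makedirs(os.path.dirname(full), exist_ok=True)
        with open(full, "wb") as fh:
            fh.write(data)

    def read(self, path):
        with open(self._full(path), "rb") as fh:
            return fh.read()


class InMemoryDriver(StorageDriver):
    """Stand-in for a database or archive back end."""

    def __init__(self):
        self._blobs = {}

    def write(self, path, data):
        self._blobs[path] = bytes(data)

    def read(self, path):
        return self._blobs[path]


def copy(src: StorageDriver, dst: StorageDriver, path: str) -> int:
    """Copy between any two repositories using only the common interface."""
    data = src.read(path)
    dst.write(path, data)
    return len(data)


def demo() -> bytes:
    """Move one object from the in-memory store to a temporary directory."""
    mem, fs = InMemoryDriver(), FileSystemDriver(tempfile.mkdtemp())
    mem.write("/coll/a.dat", b"hello")
    copy(mem, fs, "/coll/a.dat")
    return fs.read("/coll/a.dat")
```

The `copy` function never inspects which kind of repository it is talking to, which is the point of the virtualization layer.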

Federated Server Architecture
[Figure: Read Application; Logical Name Or Attribute Condition; two SRB servers; two SRB agents; MCAT; replicas R1 and R2. Labels: Peer-to-peer Brokering, Parallel Data Access, Server(s) Spawning, Data Access; steps numbered 1 to 6]
1. Logical-to-Physical mapping
2. Identification of Replicas
3. Access & Audit Control
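
The three numbered functions on this slide (logical-to-physical mapping, identification of replicas, access and audit control) can be sketched as one lookup routine. The Python below is an illustrative assumption about how such a routine might be arranged, not the SRB server logic; the catalog layout, the `readers` set, and the function name `resolve_read` are hypothetical.

```python
from datetime import datetime, timezone

# Hypothetical catalog content: logical name -> replicas and read access list.
CATALOG = {
    "/zone/home/demo/scan.img": {
        "replicas": [("R1", "/archive/000/scan.img"), ("R2", "/disk/scan.img")],
        "readers": {"alice"},
    }
}


def resolve_read(user, logical_name, online_resources, catalog, audit):
    """Return (resource, physical path) of the first usable replica."""
    entry = catalog[logical_name]                    # 1. logical-to-physical mapping
    candidates = [r for r in entry["replicas"]       # 2. identification of replicas
                  if r[0] in online_resources]
    allowed = user in entry["readers"]               # 3. access control ...
    audit.append((datetime.now(timezone.utc), user, logical_name, allowed))  # ... and audit
    if not allowed:
        raise PermissionError(f"{user} may not read {logical_name}")
    if not candidates:
        raise LookupError(f"no replica of {logical_name} is online")
    return candidates[0]
```

For example, `resolve_read("alice", "/zone/home/demo/scan.img", {"R2"}, CATALOG, [])` falls through to the replica on R2 when R1 is not online, and every attempt, allowed or not, leaves an audit record.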

Data Abstraction
[Figure: User Application; storage repositories: Archive at SDSC, Database at U Md, File System at NARA]

Context Abstraction
Common naming convention and set of attributes for describing digital entities
• Logical name space
• Location independent identifier
• Persistent identifier
• Collection owned data
• Access controls
• Audit trails
• Checksums
• Descriptive metadata
• Inter-realm authentication
• Single sign-on system
[Figure: User Application; storage repositories: Archive at SDSC, Database at U Md, File System at U Texas]
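
Checksums are one of the context attributes listed on this slide. As a small sketch of how a checksum stored as context lets any replica be verified later, the Python below records a digest at registration time and compares it on demand. The choice of SHA-256 and the function names are assumptions made for the example; the slides do not say which algorithm is used.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Digest of the content (SHA-256 is an assumption for this sketch)."""
    return hashlib.sha256(data).hexdigest()


def register(context: dict, logical_name: str, data: bytes) -> None:
    """Record checksum and size as context for a logical name."""
    context[logical_name] = {"checksum": checksum(data), "size": len(data)}


def verify(context: dict, logical_name: str, data: bytes) -> bool:
    """Check retrieved bytes against the checksum held in the context."""
    return context[logical_name]["checksum"] == checksum(data)
```

Because the checksum lives with the logical name rather than with one storage system, the same check applies to a copy fetched from any repository.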

Latency Management - Bulk Operations
• Bulk register
  – Create a logical name for a file
• Bulk load
  – Create a copy of the file on a data grid storage repository
• Bulk unload
  – Provide containers to hold small files and pointers to each file location
• Bulk delete
  – Mark as deleted in metadata catalog
  – After specified interval, delete file
• Bulk metadata load
• Requests for bulk operations for access control setting, …
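
The two-step bulk delete described above (mark as deleted in the metadata catalog, then delete the file after a specified interval) can be sketched as follows. This is a minimal Python illustration; the class name `TrashCatalog`, its methods, and the use of seconds for the interval are hypothetical and not part of SRB.

```python
import time


class TrashCatalog:
    """Two-phase delete: mark in the catalog first, remove the file later."""

    def __init__(self, retention_seconds: float):
        self.retention = retention_seconds
        self.files = {}       # logical name -> physical path
        self.deleted_at = {}  # logical name -> time it was marked deleted

    def bulk_register(self, entries: dict) -> None:
        """Register many logical names in one request."""
        self.files.update(entries)

    def bulk_delete(self, logical_names, now=None) -> None:
        """Phase 1: only the catalog changes; no storage system is touched."""
        now = time.time() if now is None else now
        for name in logical_names:
            if name in self.files:
                self.deleted_at.setdefault(name, now)

    def purge(self, remove_file, now=None) -> list:
        """Phase 2: physically remove files whose interval has elapsed."""
        now = time.time() if now is None else now
        expired = [n for n, t in self.deleted_at.items()
                   if now - t >= self.retention]
        for name in expired:
            remove_file(self.files.pop(name))
            del self.deleted_at[name]
        return expired
```

Handling a whole list of names per request is what makes this a latency management technique: the catalog is contacted once per batch rather than once per file.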

SRB Latency Management
[Figure: Source, Network, and Destination, annotated with latency management mechanisms: Remote Proxies, Staging; Data Aggregation, Containers; Prefetch; Replication; Streaming; Caching; Server-initiated I/O; Parallel I/O; Client-initiated I/O]

Data Grid Federation
• Link multiple independent data grids
  – Coordinate metadata between independent metadata catalogs
• Provide consistency and access constraints for each of the four logical name spaces (resources, users, files, metadata)
  – Peer-to-peer federations, data access
  – Replication federations, shared resources
  – Hierarchical federations, consistency constraints
• Tune data grid federation by implementing different consistency and access constraints
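
To make the idea of independent data grids linked by access constraints more concrete, the sketch below models each data grid as a zone with its own catalog and its own list of zones it accepts users from. It is a loose Python illustration of a peer-to-peer style federation under stated assumptions; the `Zone` class, the `trusted_zones` field, and `federated_lookup` are hypothetical and do not describe how SRB federation is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class Zone:
    name: str
    files: dict = field(default_factory=dict)         # logical name -> physical path
    users: set = field(default_factory=set)           # local user names
    trusted_zones: set = field(default_factory=set)   # access constraint (assumed)

    def lookup(self, user_zone: str, user: str, logical_name: str):
        """Consult this zone's own catalog if the caller is allowed in."""
        local = user_zone == self.name and user in self.users
        remote = user_zone in self.trusted_zones
        if not (local or remote):
            raise PermissionError(f"{user}@{user_zone} has no access to zone {self.name}")
        return self.files.get(logical_name)


def federated_lookup(zones, user_zone, user, logical_name):
    """Ask each zone in turn; each catalog stays under its own control."""
    for zone in zones:
        try:
            path = zone.lookup(user_zone, user, logical_name)
        except PermissionError:
            continue
        if path is not None:
            return zone.name, path
    return None
```

Changing what each zone puts in `trusted_zones` is a crude stand-in for tuning the federation: a one-way trust gives one-way data access, while mutual trust gives a peer-to-peer arrangement.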

National Archives Persistent Archive
• NARA MCAT: Principal copy stored at NARA with complete metadata catalog
• U Md MCAT: Replicated copy at U Md for improved access, load balancing and disaster recovery
• SDSC MCAT: Deep Archive at SDSC, no user access, but complete copy

Federation Environments
[Figure: federation environments arranged by Replication Constraints, Access Constraints, and Consistency Constraints, grouped into Peer-to-Peer Data Grids, Replication Data Grids, and Hierarchical Data Grids. Environments named: Free Floating, Occasional Interchange, Replicated Data, Resource Interaction, User and Data Replica, Replicated Catalog, Nomadic, Snow Flake, Master Slave, Deep Archive. Link labels: Partial User-ID Sharing; Partial Resource Sharing; No Metadata Synch; System Set Access Controls; System Controlled Complete Synch; Complete User-ID Sharing; System Managed Replication; Connection From Any Zone; Complete Resource Sharing; Hierarchical Zone Organization; One Shared User-ID; System Controlled Partial Synch; No Resource Sharing; Super Administrator Zone Control; No User-ID Sharing]

Generic Infrastructure
• SDSC developed the Storage Resource Broker (SRB) to support access to distributed data
  – Effort started in 1996 as a DARPA funded project
  – Now supports over 30 national/international projects
• Development team of 11 staff is led by
  – Michael Wan, data management systems
  – Arcot Rajasekar, information management systems

Data Grid Federation zoneSRB
[Figure: layered architecture, listed from top to bottom]
• Application
• C, C++, Java Libraries; Linux I/O; Unix Shell; Java, NT Browsers; DLL / Python, Perl; HTTP, DSpace, OpenDAP; OAI, WSDL, WSRF
• Federation Management
• Consistency & Metadata Management / Authorization, Authentication, Audit
• Logical Name Space; Latency Management; Data Transport; Metadata Transport
• Catalog Abstraction
  – Databases: DB2, Oracle, Sybase, Postgres, mySQL, Informix
• Storage Repository Virtualization
  – Databases: DB2, Oracle, Sybase, SQLserver, Postgres, mySQL, Informix
  – Archives: Tape, HPSS, ADSM, UniTree, DMF, CASTOR, ADS
  – ORB
  – File Systems: Unix, NT, Mac OSX

Data Management Systems (Supported by Storage Resource Broker)
• Data collecting
  – Sensor systems, object ring buffers and portals
• Data organization
  – Collections, manage data context
• Data sharing
  – Data grids, manage heterogeneity of resources
• Data publication
  – Digital libraries, support discovery
• Data preservation
  – Persistent archives, manage technology evolution
• Data analysis
  – Processing pipelines, manage knowledge extraction

Storage Resource Broker Collections at SDSC (9/27/2004)

Collection                                                            GBs of data stored  Number of files  Number of Users
Data Grid
NSF/ITR - National Virtual Observatory                                            53,778        9,507,399               80
NSF - National Partnership for Advanced Computational Infrastructure              22,165        5,156,765              380
Hayden Planetarium - Evolution of the Solar System visualizations                  7,201          113,600              178
NSF/NPACI - Joint Center for Structural Genomics                                   5,228          652,031               50
NSF/NPACI - Biology and Environmental collections                                  8,704           21,881               67
NSF - TeraGrid, ENZO Cosmology simulations                                       104,370          908,600            3,247
NIH - Biomedical Informatics Research Network                                      5,808        3,777,886              172
Digital Library
NLM - Digital Embryo image collection                                                720           45,365               23
NSF/NPACI - Long Term Ecological Reserve                                             251            8,381               36
NSF/NPACI - Grid Portal                                                            1,917           49,665              392
NIH - Alliance for Cell Signaling microarray data                                    776           60,177               21
NSF - National Science Digital Library SIO Explorer collection                     2,122          758,233               27
NSF/NPACI - Transana education research video collection                              92            2,387               26
NSF/ITR - Southern California Earthquake Center                                   88,199        1,790,319               59
UCSDLib                                                                              128          203,930               29
Persistent Archive
NARA - Research Prototype Persistent Archive                                          89          254,470               58
NSF - National Science Digital Library persistent archive                          3,571       26,908,350              122
TOTAL                                                                             305 TB       50 million            4,967
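
As a quick consistency check on the collection figures, the Python sketch below sums the three numeric columns. It assumes the row-by-row reading of the table in which each collection takes the next value from each column in order, and it reads the BIRN file count as 3,777,886; both are assumptions about the extracted slide rather than facts stated by the author.

```python
# (GBs stored, number of files, number of users), in table order.
DATA_GRID = [
    (53778, 9507399, 80), (22165, 5156765, 380), (7201, 113600, 178),
    (5228, 652031, 50), (8704, 21881, 67), (104370, 908600, 3247),
    (5808, 3777886, 172),
]
DIGITAL_LIBRARY = [
    (720, 45365, 23), (251, 8381, 36), (1917, 49665, 392), (776, 60177, 21),
    (2122, 758233, 27), (92, 2387, 26), (88199, 1790319, 59), (128, 203930, 29),
]
PERSISTENT_ARCHIVE = [(89, 254470, 58), (3571, 26908350, 122)]

rows = DATA_GRID + DIGITAL_LIBRARY + PERSISTENT_ARCHIVE
total_gb = sum(r[0] for r in rows)
total_files = sum(r[1] for r in rows)
total_users = sum(r[2] for r in rows)

print(f"{total_gb:,} GB, {total_files:,} files, {total_users:,} users")
```

Under that reading the sums are 305,119 GB, 50,219,439 files, and 4,967 users, which agrees with the TOTAL row of 305 TB, 50 million files, and 4,967 users.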

Conclusion
• Distributed data management systems can be built on generic data grid infrastructure
  – Data grids to support bulk access across remote sites
  – Integration of data grid and digital library capabilities to manage massive data collections
  – Federation of data grids to build international discipline-wide collections

For More Information
Reagan W. Moore
San Diego Supercomputer Center
http://www.npaci.edu/DICE
http://www.npaci.edu/DICE/SRB
http://www.npaci.edu/dice/srb/mySRB/mySRB.html
