Working with High Performance Datasets & Collections
Ben Evans, L. Wyborn, J. Wang, C. Trenham, K. Druken, R. Yang, P. Larraondo, …, and many in the NCI team
Our partners: ANU, BoM, CSIRO and GA, and many community stakeholders
© National Computational Infrastructure 2015
Ben Evans, OzEWEX 2015
nci.org.au @NCInews
NCI: World-class, high-end computing services for research & innovation
Research Outcomes
What is NCI: • Australia’s most highly integrated e-infrastructure environment • Petascale supercomputer + highest performance research cloud + highest performance storage in the southern hemisphere • Comprehensive and integrated expert service • Experienced Advanced Computing Scientific & Technical Teams
Particular focus on Earth System, Environment and Water Management
Research Objectives
NCI is important to Australia because it: • is a national and strategic capability • enables research that would otherwise be impossible • enables delivery of world-class science • enables interrogation of big data, otherwise impossible • enables high-impact research that matters and informs public policy • attracts and retains world-class researchers for Australia • catalyses development of young researchers' skills
Communities and Institutions / Access and Services
Expertise, Support and Development · HPC Services · Virtual Laboratories / Data-intensive Services · Integration
Compute (HPC/Cloud) · Storage/Network Infrastructure
Australian Research Infrastructure Funding 2006-2015
• Two main tranches of funding:
  • National Collaborative Research Infrastructure Strategy (NCRIS): $542M for 2006-2011 ($75M for cyberinfrastructure)
  • Super Science Initiative: $901M for 2009-2013 ($347M for cyberinfrastructure)
• Annual operational funding of around $180M pa since 2014-15
  • All infrastructure programmes were designed to ensure that Australian research continues to be competitive and rank highly on an international scale.
[Chart: infrastructure funding, in millions of dollars per year, 2002-2014, by category: Compute, Data, Tools, Networks]
RDSI Phase 1 (2011): Infrastructure • The NCI Proposal in September 2011 was for a High Performance Data Node • The goal was to: • enable dramatic increases in the scale and reach of Australian research by providing nationwide access to enabling data collections; • specialise in nationally significant research collections requiring high-performance computational and data-intensive capabilities for their use in effective research methods; and • realise synergies with related national research infrastructure programs.
Example of Letter of Support for NCI HPD Node
• "will work with the partners to develop a shared data environment" … where … "there will be agreed standards to enable interoperability between the partners" • "it now makes sense to explore these new opportunities within the NCI partnership, rather than as a separate agenda that GA runs independently"
— Chris Pigram, CEO, Geoscience Australia, 26 July 2011
The DMP enables federated governance of the collection
Organisational Steward of the Data Collection
Mutually agreed plan on how the collection will be managed and published
NCI Data Collections Manager
The Research Data Storage Infrastructure
Progress on data ingest as of 16 October 2015: ~43 petabytes in 8 distributed nodes
[Chart: ingest per node, scale up to 10 PBytes]
Source: https://www.rds.edu.au/
RDS (Phase 2) targeted outcomes from this infrastructure
• Researchers are able to share, use and reuse significant collections of data that were previously either unavailable to them or difficult to access • Researchers will be able to access the data in a consistent manner which will support a general interface as well as discipline-specific access • Researchers will be able to use the consistent interface established/funded by this project for access to data collections at participating institutions and other locations, as well as data held at the Nodes
Source: https://www.rds.edu.au/project-overview
Integrated World-class Scientific Computing Environment
Data Services THREDDS
Server-side analysis and visualization
VDI: Cloud scale user desktops on data
10PB+ Research Data
Web-time analytics software
National Environment Research Data Collections (NERDC)
1. Climate/ESS Model Assets and Data Products
2. Earth and Marine Observations and Data Products
3. Geoscience Collections
4. Terrestrial Ecosystems Collections
5. Water Management Collections
http://geonetwork.nci.org.au

Data Collections                                    | Approx. Capacity
CMIP5, CORDEX                                       | ~3 PBytes
ACCESS products                                     | 2.4 PBytes
LANDSAT, MODIS, VIIRS, AVHRR, INSAR, MERIS          | 1.5 PBytes
Digital Elevation, Bathymetry, Onshore Geophysics   | 700 TBytes
Seasonal Climate                                    | 700 TBytes
Bureau of Meteorology Observations                  | 350 TBytes
Bureau of Meteorology Ocean-Marine                  | 350 TBytes
Terrestrial Ecosystem                               | 290 TBytes
Reanalysis products                                 | 100 TBytes
10+ PB of Data for Interdisciplinary Science
[Diagram: data holdings by domain, spanning partners GA, CSIRO, ANU, other national and international sources: CMIP5 3 PB; Atmosphere 2.4 PB; Earth Observation 2 PB; Ocean 1.5 PB; Weather 340 TB; Geophysics 300 TB; Astronomy (Optical) 200 TB; Bathymetry/DEM 100 TB; Marine Videos 10 TB; Water]
Managing 10+ PB of Data for Scalable In-situ Access
• Combined and integrated, the NCI collections are too large to move:
  • bandwidth limits the capacity to move them easily
  • the data transfers are too slow, complicated and too expensive
  • even if the data could be moved, few can afford to store 10 PB on spinning disk
• We need to change our focus to:
  • moving users to the data (for sophisticated analysis)
  • moving processing to the data
  • providing online applications to process the data in situ
  • improving the sophistication of users, with our help
• We called for a new form of system design where:
  • storage and various types of computation are co-located
  • systems are programmed and operated to allow users to interactively invoke different forms of analysis in situ over integrated large-scale data collections
Connecting HPC Infrastructure for Data-intensive Science • Highlighted the need for balanced systems to enable data-intensive science, including: • interconnecting processes and high throughput to reduce inefficiencies • the need to really care about placement of data resources • better communications between the nodes • I/O capability to match the computational power • close coupling of cluster, cloud and storage
NCI’s Integrated High Performance Environment
Earth System Grid Federation: Exemplar of an International Collaboratory for large scientific data and analysis
Ben Evans, Geoscience Australia, August 2015
'Big Data' vs 'High Performance Data' • Big Data is a relative term, where the volume, velocity and variety of data exceed an organisation's storage or compute capacity for accurate and timely decision making • We define High Performance Data (HPD) as data that is carefully prepared, standardised and structured so that it can be used in Data-Intensive Science on HPC (Evans et al., 2015)
1964: 1 KB = 2 m of tape or ~20 cards. 2014: a 4 GB thumb drive = ~8,000 km of tape or ~83 million cards
• Need to convert 'Big Data' collections into HPD by:
  • aggregating data into seamless high-quality data products
  • creating intelligent access to self-describing data arrays
http://www.sas.com/content/dam/SAS/en_us/doc/whitepaper1/big-data-meets-big-data-analytics-105777.pdf
2014: 20 PB of modern storage = ~32 trillion metres of tape or ~320 trillion cards
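The "self-describing data array" idea can be sketched in a few lines of plain Python (an illustrative toy in the spirit of netCDF/HDF5, not NCI's implementation; the class and attribute names are invented): the point is that metadata such as units travels with the values, so any subset remains interpretable on its own.

```python
# Toy self-describing array: metadata is carried alongside the data,
# as in netCDF/HDF5. Names here are illustrative, not a real API.

class SelfDescribingArray:
    def __init__(self, data, attrs):
        self.data = data          # a list standing in for an n-d array
        self.attrs = dict(attrs)  # CF-style metadata travels with the values

    def subset(self, start, stop):
        # A subset keeps the full metadata, so it stays interpretable alone.
        return SelfDescribingArray(self.data[start:stop], self.attrs)

precip = SelfDescribingArray(
    [0.0, 1.2, 3.4, 0.5],
    {"standard_name": "precipitation_amount", "units": "kg m-2"},
)
week = precip.subset(1, 3)
print(week.data, week.attrs["units"])   # [1.2, 3.4] kg m-2
```

In a real HPD collection the same role is played by CF/ACDD attributes inside netCDF-4 files, which is what makes "intelligent access" by generic tools possible.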
Next Gen and Performance Analysis of Earth Systems Codes — NCI, BoM and Fujitsu Collaborative Project 2014-16
Project A: ACCESS Optimisation
• Evaluate and improve computational methods and performance for ACCESS-Opt:
  • NWP: UM, APS3 (Global, Regional, City)
  • Seasonal Climate: ACCESS-GC2 (GloSea)
  • Data Assimilation: 4D-VAR (Atmosphere), EnKF (Ocean), DART (NCAR)
  • Ocean Forecasting and Research: MOM5, CICE/SIS, WW3, ROMS
  • Fully Coupled Earth System Model: ACCESS-CM2, ACCESS-ESM, CMIP5/6
Project B: Scalability of algorithms, hardware, and other earth systems and geophysics codes
• Tsunami: NOAA MOST and ComMIT
• Data Assimilation: NCAR DART
• Ocean: MOM6, MITgcm, MOM5 (WOMBAT), SHOC
• Water quality and biogeochemical models, particularly for eReefs
• Hydrodynamic and ecological models: Relocatable Coastal Modelling project (RECOM)
• Weather and convection research, non-ACCESS (e.g., WRF)
• Groundwater hydrology: ?
• Natural hazards: Tropical Cyclone (TCRM), volcanic ash code, ANUGA, EQRM
• Shelf reanalysis
• Onshore/offshore seismic data processing
• Earthquake and seismic waves
• 3D geophysics: gravimetric, magnetotelluric, AEM, inversion (forward and back)
• Earth observation satellite data processing
• Hydrodynamics, oil and petroleum
• Elevation, bathymetry, geodesy: data conversions, grids and processing
Data Platforms of today need to scale down to small users
High-Res, Multi-Decadal, Continental-Scale Analysis
Water Detection from Space
• 27 years of data from LS5 & LS7 (1987-2014)
• 25 m nominal pixel resolution
• Approx. 300,000 individual source scenes in approx. 20,000 passes
• Entire archive of 1,312,087 ARG25 tiles => 93×10^12 pixels
• can be processed in ~3 hours
c/- Geoscience Australia
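The scale of that processing rate is easy to underestimate; a quick back-of-envelope check, using only the figures quoted on this slide:

```python
# Back-of-envelope rate implied by the slide:
# ~93x10^12 pixels processed in ~3 hours.
pixels = 93e12
seconds = 3 * 3600
rate = pixels / seconds           # pixels processed per second
print(f"{rate:.2e} pixels per second")   # 8.61e+09 pixels per second
```

That is on the order of ten billion pixels per second, which is only feasible when the archive is already aggregated, tiled and co-located with the compute.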
Scaling down to the smaller users – e.g. AGDC
Do we enable individual scenes to be downloaded for locally hosted small-scale analysis? Or do we facilitate small-scale analysis in situ, on data sets that are dynamically updated?
Introducing the National Environmental Research Data Interoperability Platform (NERDIP)
[Architecture diagram, top to bottom:]
• Workflow Engines, Virtual Laboratories (VLs), Science Gateways: Biodiversity & Climate Change VL, Climate & Weather Systems Lab, eMAST Speddexes, eReefs, AGDC VL, All Sky Virtual Observatory, Open Nav Surface, VHIRL, Globe Claritas, VGL
• Tools: Ferret, NCO, GDL, Fortran, C, C++, Python, R, GDAL, GRASS, QGIS; Models: MPI, OpenMP; MatLab, IDL; Visualisation: Drishti
• Data Portals: ANDS/RDA Portal, AODN/IMOS Portal, TERN Portal, AuScope Portal, data.gov.au, Digital Bathymetry & Elevation Portal
• Services Layer (expose data models & semantics): OGC W*S (WMS, WFS, WCS, WPS, SOS), OPeNDAP, Direct Access
• Metadata Layer: ISO 19115, RIF-CS, DCAT, etc.; RDF, LD; fast "whole-of-library" catalogue
• HP Data Library Layer 2 / Data Library Layer 1: netCDF-4 Climate/Weather/Ocean, netCDF-4 EO, netCDF-CF, HDF-EOS, HDF5 (MPI-enabled and serial), libgdal EO, FITS, airborne geophysics line data, SEG-Y, LAS LiDAR, BAG
• Storage: Lustre, other storage (options)
NERDIP: Enabling Multiple Ways to Interact with the Data
[NERDIP architecture diagram repeated from the previous slide, annotated: the VLs and science gateways as 'Infrastructure to Lower Barriers to Entry', the tools layer for 'Ace Users', the portals for 'Data Discovery', and the services, metadata and library layers as the 'Data Platform']
Platforms Free Data from the "Prison of the Portals"
• Portals are for visiting, platforms are for building on
• Portals present aggregated content in a way that invites exploration, but the experience is pre-determined by the builder's decisions about what is necessary, relevant and useful
• Platforms put design decisions into the hands of users: there are innumerable ways of interacting with the data
• Platforms offer many more opportunities for innovation: new interfaces can be built, new visualisations framed, and ultimately new science rapidly emerges
Tim Sherratt, http://www.nla.gov.au/our-publications/staff-papers/from-portal-to-platform
NERDIP: Enabling Ace Users to Interact with the Data
[NERDIP architecture diagram repeated, highlighting the tools layer (Ferret, NCO, GDL, Fortran, C, C++, Python, R, GDAL, GRASS, QGIS, MPI/OpenMP models, MatLab, IDL, Drishti visualisation) as the path for expert 'ace' users onto the Data Platform]
NERDIP: Applications Replicating Ways of Interacting with the Data
[NERDIP architecture diagram repeated: applications in the VL, tools and portal layers reach the data through the common services layer]
NERDIP: Loosely Coupling Applications and Data via a Services Layer
[NERDIP architecture diagram repeated, partitioned into APPLICATION (VLs, gateways, tools and portals), the services and metadata layers as the coupling interface, and DATA MANAGEMENT (the data platform and storage), with focussed developers on each side]
NERDIP: Infrastructure to Lower Barriers to Entry
[NERDIP architecture diagram repeated: the platform layers (services, metadata, data libraries, Lustre and other storage) lower the barriers to entry for the VLs, tools and portals above]
Bring the collections together…
http://www.roughlydrafted.com/RD/RDM.Tech.Q2.07/BA1B46C4-4014-44DE-ACBB-61D49A926D00.html
Data size/format: potential 9.3 PBytes in NetCDF format

       | NetCDF                      | Non-NetCDF
Size   | 7717 TB (6799 TB available) | 2579 TB (2221 TB available); 1600 TB will be converted to NetCDF (bathymetry, Landsat, geophysics)
Domain | Climate models; weather models and obs; water obs; satellite imagery; other imagery | Satellite imagery; astronomy; medical; biosciences; social sciences; geodesy; video; geological models; hazards; phenology
Straightening out the data format standards
NCI Compliance Checking: Data & Metadata • Metadata: use the Attribute Convention for Dataset Discovery (ACDD) (previously the Unidata Dataset Discovery Conventions) • Data: based on the CF-Checker developed at the Hadley Centre for Climate Prediction and Research (UK) by Rosalyn Hatcher
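The core of an ACDD-style metadata check fits in a few lines of Python. This is a toy illustration of the idea, not NCI's checker: given a dataset's global attributes, it reports which of ACDD's highly recommended attributes (title, summary, keywords) are missing or empty.

```python
# Toy ACDD-style metadata check (illustrative, not NCI's checker).
# ACDD's "highly recommended" global attributes: title, summary, keywords.
HIGHLY_RECOMMENDED = ("title", "summary", "keywords")

def missing_acdd_attrs(global_attrs):
    """Return the highly recommended ACDD attributes absent from a dataset."""
    return [a for a in HIGHLY_RECOMMENDED
            if not str(global_attrs.get(a, "")).strip()]

attrs = {"title": "ARG25 surface reflectance", "keywords": "Landsat, EO"}
print(missing_acdd_attrs(attrs))   # ['summary']
```

A production checker additionally validates the recommended and suggested ACDD attributes and, on the data side, CF conventions (units, standard names, coordinate variables), which is what the CF-Checker does.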
Auditing the system for data standards compliance
Grid Diversity in CMIP5
Downstream communities may not wish to deal with different grids, but the modelling communities generate data appropriate to them.
[Figure: ocean model grid with a Mercator grid in the south and a tripolar grid in the north]
CMIP6 WGCM Infrastructure Panel recommendations
• Use netCDF-4 with lossless compression as the data format for CMIP6.
  • Lossless compression from zlib (settings deflate=2 and shuffle) is expected to give roughly a 2× decrease in data volumes (varying with data entropy or noisiness). This necessarily requires upgrading the entire toolchain (data production and consumption) to netCDF-4.
• Recommends the use of standard grids for datasets where native-grid data is not strictly required.
  • For example, the CLIVAR Ocean Model Development Panel (OMDP) may request the use of World Ocean Atlas (WOA) standard grids (1°×1°, 0.25°×0.25°) as the target grid of choice.
• No progress on adoption of standard calendars.
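The shuffle-plus-deflate combination can be illustrated with the standard-library `zlib` (a sketch of the idea, not the actual HDF5 filter implementation): the shuffle filter regroups the bytes of fixed-width values so that similar bytes, such as float sign/exponent bytes, sit together, which helps deflate find longer runs and matches.

```python
import struct
import zlib

def shuffle(data: bytes, width: int) -> bytes:
    """Byte shuffle: collect byte 0 of every value, then byte 1, and so on."""
    n = len(data) // width
    return bytes(data[j * width + i] for i in range(width) for j in range(n))

# Smoothly varying float64 values, loosely mimicking a climate field.
values = [20.0 + 0.001 * i for i in range(4096)]
raw = struct.pack(f"<{len(values)}d", *values)

plain = zlib.compress(raw, 2)                  # deflate level 2, as per CMIP6
shuffled = zlib.compress(shuffle(raw, 8), 2)   # shuffle first, then deflate
print(len(raw), len(plain), len(shuffled))     # shuffled is the smallest
```

In netCDF-4 itself the same effect comes from creating variables with `zlib=True, complevel=2, shuffle=True` (netCDF4-python), which applies the real HDF5 shuffle and deflate filters.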
Integrated World-class Scientific Computing Environment
Data Services THREDDS
Server-side analysis and visualization
VDI: Cloud scale user desktops on data
10PB+ Research Data
Web-time analytics software
Local (Direct) Methods for Accessing the Data — comparing file systems and data storage types

Mount           | File system | Resources
VDI:/short      | NFS-CEPH    | 2 cores, 32 GB
VDI:/g/data1    | NFS-Lustre  | 2 cores, 32 GB
VDI:/local      | Local SSD   | 2 cores, 32 GB
Raijin:/short   | Lustre      | 16 cores/node, 32 GB/node
Raijin:/g/data2 | Lustre      | 16 cores/node, 32 GB/node
IO Performance Tuning Metrics

Lustre       | MPI-IO            | HDF5                         | General
Stripe count | Data sieving      | Variable packing             | File type
Stripe size  | Collective buffer | Chunk pattern and cache size | File size
Alignment    | Transaction size  | Metadata cache               | Access pattern
             |                   | Compression                  | Concurrency
Serial Write Throughput
[Chart: write throughput (MB/s, 0-900) by interface (GTIFF, HDF5, NC_CLASSIC, NC4, NC4_CLASSIC) for GDAL_FILL, PURE_FILL, GDAL_NOFILL and PURE_NOFILL]
Serial Read Throughput
[Chart: read throughput (MB/s, 0-1000) by interface (GTIFF, HDF5, NC_CLASSIC, NC4, NC4_CLASSIC) for GDAL_FILL, PURE_FILL, GDAL_NOFILL and PURE_NOFILL]
Parallel Write Throughput
[Chart: independent write throughput (MB/s, 0-1800) vs Lustre stripe count (1-128) for HDF5, MPI-IO and POSIX; MPI size = 16, stripe size = 1M, block size = 8G, transfer size = 32M]
Low performance when using default parameters
Parallel Read Throughput
[Chart: independent read throughput (MB/s, 0-7000) vs Lustre stripe count (1-128) for HDF5, MPI-IO and POSIX; MPI size = 16, stripe size = 1M, block size = 8G, transfer size = 32M]
Low performance when using default parameters
Simple Comparison of File format access latency and compression
Trade-off of performance for storage capacity
Read throughput (MB/s), compressed source file (19 MB) vs uncompressed "normal" file (121 MB):

Transfer count: 1×5500  5×5500  10×5500 20×5500 40×5500 55×5500 100×5500 550×5500 1000×5500 5500×5500
Transfer size:  22 kB   110 kB  220 kB  440 kB  880 kB  1.21 MB 2.2 MB   12.1 MB  22 MB     121 MB
Raijin (src):   182.39  229.19  216.08  218.45  220.58  220.74  222      203.86   192.56    189.49
VDI (src):      199.8   248.23  235.42  238.15  239.28  241.25  228.93   219.08   217.09    220.75
Raijin (nrm):   479.77  790.84  804.79  848.09  888.82  887.62  889.31   800.65   710.48    544.84
VDI (nrm):      1473.45 3965.39 4521.17 5182.22 5785.31 4972.67 5898.41  3818.48  3162.54   2066.02

• Intelligent organisation of the data can close the gap.
• Different patterns of access can make the access performance worse.
• Better tuning for common/important use needs.
• Potential to store compressed data twice (with different packing) if the trade-off is justifiable/manageable.
HDF5 Chunk Cache
HDF5 Chunked Storage
• Data is stored in chunks of predefined size
  • the two-dimensional instance may be referred to as data tiling
• The HDF5 library writes/reads whole chunks
[Figure: contiguous vs chunked dataset layout]
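Why chunk shape matters follows from a little index arithmetic. The sketch below is generic (the 5500×5500 array and 1100×1100 chunks simply echo the benchmark slides): since HDF5 reads whole chunks, the cost of a subset read scales with the number of chunks it intersects.

```python
from math import ceil, floor

def chunks_touched(subset_start, subset_shape, chunk_shape):
    """Number of chunks a hyperslab read intersects (HDF5 reads whole chunks)."""
    n = 1
    for start, size, chunk in zip(subset_start, subset_shape, chunk_shape):
        first = floor(start / chunk)
        last = ceil((start + size) / chunk)   # exclusive chunk index
        n *= last - first
    return n

# A 5500x5500 dataset stored as 1100x1100 chunks:
print(chunks_touched((0, 0), (1, 5500), (1100, 1100)))        # 5: one full row
print(chunks_touched((0, 0), (1100, 1100), (1100, 1100)))     # 1: aligned tile
print(chunks_touched((550, 550), (1100, 1100), (1100, 1100))) # 4: unaligned tile
```

This is why the benchmark slides vary both subset shape and chunk cache size: a 1×5500 row read decompresses five full chunks to deliver one row of values, while an aligned 1100×1100 read decompresses exactly one.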
Subset Access — compression, subsetting and caches
Chunk shape: 1100×1100; subset shapes: 1×5500, 275×275, 1100×1100
[Chart: read throughput (MB/s, 0-400) vs deflate level (0-9) for each subset shape, with 4 MB and 32 MB chunk caches]
OPeNDAP
DAP2 → NetCDF on the wire
[Diagram: a NetCDF file (DataArray [...], shape, dtype, GeoTransform, projection, metadata) serialised to the OPeNDAP format and delivered over an HTTP connection]
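A DAP2 request is just an HTTP URL with a constraint expression appended, so subsetting happens server-side before any bytes travel: this is what makes in-situ access to a multi-petabyte archive practical from a laptop. A sketch of building such a URL (the server host, path and variable name are hypothetical):

```python
# Build a DAP2 subset URL: the server applies the constraint expression and
# returns only the requested hyperslab. Host/path/variable are hypothetical.
def dap2_url(base, variable, slices):
    """slices: (start, stride, stop) per dimension, DAP2-style inclusive."""
    constraint = variable + "".join(f"[{a}:{b}:{c}]" for a, b, c in slices)
    return f"{base}.dods?{constraint}"

url = dap2_url(
    "http://dap.example.org/thredds/dodsC/lsat/arg25.nc",   # hypothetical
    "reflectance",
    [(0, 1, 0), (1000, 1, 1099), (2000, 1, 2099)],  # 1 time step, 100x100 px
)
print(url)
```

Clients such as netCDF-based tools build these URLs automatically; the point is that the wire format carries the same self-describing array model as the file.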
Tile Map Servers — Serving Maps
[Diagram: a client (browser), a WMTS server and a THREDDS server exchanging requests in four numbered steps, with the WMTS server fetching map data from THREDDS and returning rendered tiles to the client]
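A tile request resolves a geographic point and zoom level to a tile column/row. The sketch below uses the common Web Mercator tiling (the GoogleMapsCompatible matrix set used by many WMTS servers); it is a generic illustration, not NCI's server code.

```python
import math

def tile_index(lon, lat, zoom):
    """Web Mercator (GoogleMapsCompatible) tile column/row for a point."""
    n = 2 ** zoom                                   # tiles per axis at this zoom
    col = int((lon + 180.0) / 360.0 * n)
    row = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    return col, row

# Canberra (~149.13 E, 35.28 S) at zoom level 8:
print(tile_index(149.13, -35.28, 8))
```

The WMTS server maps each (zoom, col, row) triple to a cached or freshly rendered tile, which is why browsers can pan across a petabyte-scale mosaic while only ever fetching a handful of small images.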
Examples of on-the-fly data delivery
Key Messages on Accessing High Performance Data • Data at the scales of today have to be built as shared global facilities based around national institutions. • Domain-neutral international standards for data collections and interoperability are critical for allowing complex interactions in HP environments, both within and between HPD collections. • Getting the most out of the data needs expertise around usability and performance tuning.
• No one can do it alone: no one organisation, no one group, no one country has the required resources or the expertise. • Shared collaborative efforts such as the Research Data Alliance, the Earth System Grid Federation (ESGF), the Belmont Forum, EarthServer, the Ocean Data Interoperability Platform (ODIP), EarthCube, GEO and OneGeology are needed to realise the full potential of the new data-intensive science infrastructures. • It now takes a 'village of partnerships' to raise an 'HPD data centre' in a Big Data world. http://www.onegeology.org/