Working with High Performance Datasets & Collections
Ben Evans, L. Wyborn, J. Wang, C. Trenham, K. Druken, R. Yang, P. Larraondo, …, and many in the NCI team
•  Our partners ANU, BoM, CSIRO and GA, and many community stakeholders


NCI: World-class, high-end computing services for research & innovation

What is NCI?
•  Australia's most highly integrated e-infrastructure environment
•  Petascale supercomputer + highest-performance research cloud + highest-performance storage in the southern hemisphere
•  Comprehensive and integrated expert service
•  Experienced advanced-computing scientific and technical teams

Particular focus on Earth systems, environment and water management

NCI is important to Australia because it:
•  is a national and strategic capability
•  enables research that would otherwise be impossible
•  enables delivery of world-class science
•  enables interrogation of big data that would otherwise be impossible
•  enables high-impact research that matters and informs public policy
•  attracts and retains world-class researchers for Australia
•  catalyses the development of young researchers' skills

[Diagram: layered capability stack – Research Outcomes; Research Objectives; Communities and Institutions / Access and Services; Expertise, Support and Development; HPC Services; Virtual Laboratories / Data-intensive Services; Integration; Compute (HPC/Cloud); Storage/Network Infrastructure]


Australian Research Infrastructure Funding 2006-2015
•  Two main tranches of funding:
   •  National Collaborative Research Infrastructure Strategy (NCRIS) – $542M for 2006-2011 ($75M for cyberinfrastructure)
   •  Super Science Initiative – $901M for 2009-2013 ($347M for cyberinfrastructure)
•  Annual operational funding of around $180M pa since 2014-2015
•  All infrastructure programmes were designed to ensure that Australian research continues to be competitive and rank highly on an international scale.

[Chart: annual funding (millions of dollars), 2002-2014, split across Compute, Data, Tools and Networks]

RDSI Phase 1 (2011): Infrastructure
•  The NCI proposal in September 2011 was for a High Performance Data Node
•  The goal was to:
   •  enable dramatic increases in the scale and reach of Australian research by providing nationwide access to enabling data collections;
   •  specialise in nationally significant research collections requiring high-performance computational and data-intensive capabilities for their use in effective research methods; and
   •  realise synergies with related national research infrastructure programs.


Example of Letter of Support for the NCI HPD Node

•  "will work with the partners to develop a shared data environment" … where … "there will be agreed standards to enable interoperability between the partners"
•  "it now makes sense to explore these new opportunities within the NCI partnership, rather than as a separate agenda that GA runs independently"

…Chris Pigram, CEO, Geoscience Australia, 26 July 2011


The DMP enables federated governance of the collection

[Diagram: the data management plan (DMP) – a mutually agreed plan on how the collection will be managed and published – links the organisational steward of the data collection with the NCI Data Collections Manager]

The Research Data Storage Infrastructure
Progress on data ingest as of 16 October 2015: ~43 petabytes across 8 distributed nodes (NCI node: ~10 PB)

Source: https://www.rds.edu.au/

RDS (Phase 2) targeted outcomes from this infrastructure

•  Researchers are able to share, use and reuse significant collections of data that were previously either unavailable to them or difficult to access
•  Researchers will be able to access the data in a consistent manner which will support a general interface as well as discipline-specific access
•  Researchers will be able to use the consistent interface established/funded by this project to access data collections at participating institutions and other locations, as well as data held at the Nodes

Source: https://www.rds.edu.au/project-overview


Integrated World-class Scientific Computing Environment

[Diagram: 10PB+ of research data at the centre, surrounded by data services (THREDDS), server-side analysis and visualisation, VDI (cloud-scale user desktops on the data), and web-time analytics software]

National Environment Research Data Collections (NERDC)
1. Climate/ESS model assets and data products
2. Earth and marine observations and data products
3. Geoscience collections
4. Terrestrial ecosystems collections
5. Water management collections
http://geonetwork.nci.org.au

Data Collections                                       Approx. Capacity
CMIP5, CORDEX                                          ~3 PB
ACCESS products                                        2.4 PB
LANDSAT, MODIS, VIIRS, AVHRR, INSAR, MERIS             1.5 PB
Digital Elevation, Bathymetry, Onshore Geophysics      700 TB
Seasonal Climate                                       700 TB
Bureau of Meteorology Observations                     350 TB
Bureau of Meteorology Ocean-Marine                     350 TB
Terrestrial Ecosystem                                  290 TB
Reanalysis products                                    100 TB

10+ PB of Data for Interdisciplinary Science

[Chart: holdings by domain – CMIP5 3 PB, Atmosphere 2.4 PB, Earth Observation 2 PB, Ocean 1.5 PB, Weather 340 TB, Geophysics 300 TB, Astronomy (Optical) 200 TB, Bathymetry/DEM 100 TB, Marine Videos 10 TB, plus Water – contributed by GA, CSIRO, ANU, and other national and international partners]


Managing 10+ PB of Data for Scalable In-situ Access
•  Combined and integrated, the NCI collections are too large to move:
   •  bandwidth limits the capacity to move them easily
   •  data transfers are too slow, too complicated and too expensive
   •  even if the data could be moved, few can afford to store 10 PB on spinning disk
•  We need to change our focus to:
   •  moving users to the data (for sophisticated analysis)
   •  moving processing to the data
   •  having online applications to process the data in situ
   •  improving the sophistication of users, with our help
•  We called for a new form of system design where:
   •  storage and various types of computation are co-located
   •  systems are programmed and operated to allow users to interactively invoke different forms of analysis in situ over integrated large-scale data collections

Connecting HPC Infrastructure for Data-intensive Science
•  This highlighted the need for balanced systems to enable data-intensive science, including:
   •  interconnecting processes and high throughput to reduce inefficiencies
   •  the need to really care about the placement of data resources
   •  better communications between the nodes
   •  I/O capability to match the computational power
   •  close coupling of cluster, cloud and storage

[Diagram: NCI's integrated high performance environment]

Earth System Grid Federation: Exemplar of an International Collaboratory for large scientific data and analysis

Ben Evans, Geoscience Australia, August 2015


'Big Data' vs 'High Performance Data'
•  Big Data is a relative term, where the volume, velocity and variety of data exceed an organisation's storage or compute capacity for accurate and timely decision making
•  We define High Performance Data (HPD) as data that is carefully prepared, standardised and structured so that it can be used in Data-Intensive Science on HPC (Evans et al., 2015)

1964: 1 KB = 2 m of tape or ~20 cards
2014: a 4 GB thumb drive = ~8,000 km of tape or ~83 million cards

•  We need to convert 'Big Data' collections into HPD by:
   •  aggregating data into seamless, high-quality data products
   •  creating intelligent access to self-describing data arrays
   (see the aggregation sketch below)

http://www.sas.com/content/dam/SAS/en_us/doc/whitepaper1/big-data-meets-big-data-analytics-105777.pdf

2014: 20 PB of modern storage = ~32 trillion metres of tape or ~320 trillion cards
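As a rough illustration of that aggregation step, the sketch below uses xarray (with dask) to combine many files into one seamless, self-describing dataset and write it back out as a single product. The directory, file layout and output name are hypothetical, not an NCI collection.

```python
# Hedged sketch only: paths and variable layout are placeholders.
import xarray as xr

# Lazily open and concatenate all daily files along their shared coordinates.
ds = xr.open_mfdataset("/g/data/example/daily/*.nc", combine="by_coords")

# The aggregate behaves as one self-describing array: named dimensions,
# coordinates and metadata travel with the data.
print(ds)

# Persist the aggregate as a single high-performance product.
ds.to_netcdf("aggregated_product.nc")
```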

Next-Gen and Performance Analysis of Earth Systems Codes
NCI, BoM and Fujitsu Collaborative Project, 2014-2016

Project A: ACCESS Optimisation
•  Evaluate and improve computational methods and performance for ACCESS-Opt
   •  NWP: UM, APS3 (Global, Regional, City)
   •  Seasonal climate: ACCESS-GC2 (GloSea)
   •  Data assimilation: 4D-VAR (Atmosphere), EnKF (Ocean), DART (NCAR)
   •  Ocean forecasting and research: MOM5, CICE/SIS, WW3, ROMS
   •  Fully coupled Earth system model: ACCESS-CM2, ACCESS-ESM, CMIP5/6




Project B: Scalability of algorithms, hardware, and other earth systems and geophysics codes
•  Tsunami – NOAA MOST and ComMIT
•  Data assimilation – NCAR DART
•  Ocean – MOM6, MITgcm, MOM5 (WOMBAT), SHOC
•  Water quality and biogeochemical models – particularly for eReefs
•  Hydrodynamic and ecological models – Relocatable Coastal Modelling project (RECOM)
•  Weather and convection research – non-ACCESS (e.g., WRF)
•  Groundwater hydrology – ?
•  Natural hazards – Tropical Cyclone (TCRM), volcanic ash code, ANUGA, EQRM
•  Shelf reanalysis
•  Onshore/offshore seismic data processing
•  Earthquake and seismic waves
•  3D geophysics: gravimetric, magnetotelluric, AEM, inversion (forward and back)
•  Earth observation satellite data processing
•  Hydrodynamics, oil and petroleum
•  Elevation, bathymetry, geodesy – data conversions, grids and processing

Data Platforms of today need to scale down to small users

High-Res, Multi-Decadal, Continental-Scale Analysis

Water Detection from Space
•  27 years of data from LS5 & LS7 (1987-2014)
•  25 m nominal pixel resolution
•  Approx. 300,000 individual source scenes in approx. 20,000 passes
•  Entire archive of 1,312,087 ARG25 tiles => 93 × 10^12 pixels
•  can be processed in ~3 hours

c/- Geoscience Australia

Scaling down to smaller users – e.g. AGDC

Do we enable individual scenes to be downloaded for locally hosted, small-scale analysis? Or do we facilitate small-scale analysis in situ, on datasets that are dynamically updated?

Introducing the National Environmental Research Data Interoperability Platform (NERDIP)

[Architecture diagram, top to bottom:]
•  Workflow engines, Virtual Laboratories (VLs) and science gateways: Biodiversity & Climate Change VL, Climate & Weather Systems Lab, eMAST/Speddexes, eReefs, AGDC VL, All Sky Virtual Observatory, Open Nav Surface, VHIRL, Globe Claritas, VGL
•  Tools: Ferret, NCO, GDL, Fortran, C, C++, Python, R, GDAL, GRASS, QGIS; models (MPI, OpenMP); MatLab, IDL; visualisation (Drishti)
•  Data portals: ANDS/RDA Portal, AODN/IMOS Portal, TERN Portal, AuScope Portal, data.gov.au, Digital Bathymetry & Elevation Portal
•  NERDIP services layer (exposes data models & semantics): OGC W*S (WMS, WFS, WCS, WPS, SOS), OPeNDAP, direct access; metadata layer (ISO 19115, RIF-CS, DCAT, etc.; RDF/LD); fast "whole-of-library" catalogue
•  HP Data Library Layer 2 / Data Library Layer 1: netCDF-4 Climate/Weather/Ocean, netCDF-4 EO, netCDF-CF, HDF-EOS, libgdal EO, FITS, airborne geophysics line data, SEG-Y, LAS LiDAR, BAG
•  Storage: Lustre (HDF5 MPI-enabled), HDF5 serial, other storage options

NERDIP: Enabling Multiple Ways to Interact with the Data

[Same NERDIP architecture diagram, annotated with the different routes to the data: infrastructure that lowers barriers to entry (virtual laboratories and science gateways), ace users working directly with the tools layer, data discovery through the portals, and the data platform itself]

NERDIP: Enabling Multiple Ways to Interact with the Data

[Same architecture diagram, highlighting the data portals route into NERDIP]

Platforms Free Data from the "Prison of the Portals"

•  Portals are for visiting, platforms are for building on
•  Portals present aggregated content in a way that invites exploration, but the experience is pre-determined by a set of decisions by the builder about what is necessary, relevant and useful
•  Platforms put design decisions into the hands of users: there are innumerable ways of interacting with the data
•  Platforms offer many more opportunities for innovation: new interfaces can be built, new visualisations framed, and ultimately new science rapidly emerges

Tim Sherratt, http://www.nla.gov.au/our-publications/staff-papers/from-portal-to-platform

NERDIP: Enabling Ace Users to Interact with the Data

[Same architecture diagram, highlighting ace users who work directly against the data platform through the tools layer (Ferret, NCO, Fortran/C/C++, Python, R, GDAL, GRASS, QGIS, MPI/OpenMP models, MatLab, IDL, Drishti)]

NERDIP: Applications Replicating Ways of Interacting with the Data

[Same NERDIP architecture diagram]

NERDIP: Loosely Coupling Applications and Data via a Services Layer

[Same architecture diagram, split by the services layer: application-focussed developers build above it (virtual laboratories, tools, data discovery and portals), while data-management-focussed developers work below it (the data library layers and storage)]

NERDIP: Infrastructure to Lower Barriers to Entry

[Same NERDIP architecture diagram]

Bring the collections together…

http://www.roughlydrafted.com/RD/RDM.Tech.Q2.07/BA1B46C4-4014-44DE-ACBB-61D49A926D00.html

Data size/format: a potential 9.3 PB in NetCDF format

NetCDF:
•  Size: 7717 TB (6799 TB available)
•  Domains: climate models; weather models and observations; water observations; satellite imagery; other imagery

Non-NetCDF:
•  Size: 2579 TB (2221 TB available)
•  1600 TB will be converted to NetCDF (bathymetry, Landsat, geophysics)
•  Domains: satellite imagery; astronomy; medical; biosciences; social sciences; geodesy; video; geological models; hazards; phenology

Straightening out the data format standards


NCI Compliance Checking: Data & Metadata
•  Metadata: use the Attribute Convention for Data Discovery (ACDD) (previously the Unidata Dataset Discovery Conventions)
•  Data: based on the CF-Checker developed at the Hadley Centre for Climate Prediction and Research (UK) by Rosalyn Hatcher
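As a rough sketch of what the metadata side of this checking looks for, the snippet below adds a handful of ACDD global attributes to an existing file with netCDF4-python. The file name and attribute values are placeholders, not an NCI product.

```python
# Hedged sketch: ACDD-style global attributes alongside CF conventions.
from netCDF4 import Dataset

nc = Dataset("example_product.nc", "a")        # hypothetical existing file
nc.Conventions = "CF-1.6, ACDD-1.3"
nc.title = "Example gridded product"
nc.summary = "Short human-readable description of the dataset."
nc.source = "Description of how the original data were produced."
nc.license = "Creative Commons Attribution 4.0 International (CC BY 4.0)"
nc.geospatial_lat_min, nc.geospatial_lat_max = -44.0, -10.0
nc.geospatial_lon_min, nc.geospatial_lon_max = 112.0, 154.0
nc.time_coverage_start = "1987-01-01T00:00:00Z"
nc.time_coverage_end = "2014-12-31T23:59:59Z"
nc.close()
```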

Auditing the system for data standards compliance

Grid Diversity in CMIP5
Downstream communities may not wish to deal with different grids, but the modelling communities generate data on the grids appropriate to them.

[Figure: example CMIP5 ocean grid – a Mercator grid in the south joining a tripolar grid in the north]

CMIP6 WGCM Infrastructure Panel recommendations
•  Use netCDF-4 with lossless compression as the data format for CMIP6.
   •  Lossless compression with zlib (settings deflate=2 and shuffle) is expected to give roughly a 2× decrease in data volumes (varying with the entropy or noisiness of the data). This necessarily requires upgrading the entire toolchain (data production and consumption) to netCDF-4; see the sketch below.
•  Recommends the use of standard grids for datasets where native-grid data is not strictly required.
   •  For example: the CLIVAR Ocean Model Development Panel (OMDP) may request the use of the World Ocean Atlas (WOA) standard grids (1°×1°, 0.25°×0.25°) as the target grid of choice.
•  No progress on the adoption of standard calendars.
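A minimal sketch of those output settings with netCDF4-python follows; the variable, dimensions and sizes are illustrative only, not a CMIP6 data request.

```python
# Hedged sketch: netCDF-4 output with lossless zlib deflate=2 + shuffle.
import numpy as np
from netCDF4 import Dataset

with Dataset("tas_example.nc", "w", format="NETCDF4") as nc:
    nc.createDimension("time", None)            # unlimited time dimension
    nc.createDimension("lat", 180)
    nc.createDimension("lon", 360)
    tas = nc.createVariable(
        "tas", "f4", ("time", "lat", "lon"),
        zlib=True, complevel=2, shuffle=True,    # lossless deflate level 2 + shuffle
        chunksizes=(1, 180, 360),                # one time step per chunk
    )
    tas[0, :, :] = np.random.rand(180, 360)      # placeholder field
```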

Integrated World-class Scientific Computing Environment

[Diagram: 10PB+ of research data at the centre, surrounded by data services (THREDDS), server-side analysis and visualisation, VDI (cloud-scale user desktops on the data), and web-time analytics software]

Local (Direct) Methods for Accessing the Data
Comparing file systems and data storage types:

Mount point        File system   Resources
VDI:/short         NFS-CEPH      2 cores, 32 GB
VDI:/g/data1       NFS-Lustre    2 cores, 32 GB
VDI:/local         Local SSD     2 cores, 32 GB
Raijin:/short      Lustre        16 cores/node, 32 GB/node
Raijin:/g/data2    Lustre        16 cores/node, 32 GB/node

IO Performance Tuning Metrics

Lustre         MPI-IO               HDF5                            General
Stripe count   Data sieving         Variable packing                File type
Stripe size    Collective buffer    Chunk pattern and cache size    File size
Alignment      Transaction size     Metadata cache                  Access pattern
                                    Compression                     Concurrency

Serial Write Throughput

[Chart: serial write throughput (MB/s) for GDAL_FILL, PURE_FILL, GDAL_NOFILL and PURE_NOFILL modes across the GTIFF, HDF5, NC_CLASSIC, NC4 and NC4_CLASSIC interfaces]

Serial Read Throughput

[Chart: serial read throughput (MB/s) for the same four modes across the same five interfaces]
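A minimal sketch of how a serial write comparison of this kind can be timed is shown below. This is not NCI's benchmark harness; the array size, output paths and the choice of GTiff vs netCDF-4 writers are illustrative only.

```python
# Hedged timing sketch: write the same array via GDAL's GeoTIFF driver and via
# netCDF4-python, and report approximate throughput.
import time
import numpy as np
from osgeo import gdal
from netCDF4 import Dataset

data = np.random.rand(4096, 4096).astype("float32")
nbytes = data.nbytes

def timed(label, fn):
    t0 = time.time()
    fn()
    dt = time.time() - t0
    print(f"{label}: {nbytes / dt / 1e6:.1f} MB/s")

def write_gtiff():
    drv = gdal.GetDriverByName("GTiff")
    ds = drv.Create("/tmp/test.tif", data.shape[1], data.shape[0], 1, gdal.GDT_Float32)
    ds.GetRasterBand(1).WriteArray(data)
    ds.FlushCache()
    ds = None  # close the dataset

def write_netcdf4():
    nc = Dataset("/tmp/test.nc", "w", format="NETCDF4")
    nc.createDimension("y", data.shape[0])
    nc.createDimension("x", data.shape[1])
    var = nc.createVariable("band1", "f4", ("y", "x"))
    var[:] = data
    nc.close()

timed("GTiff write  ", write_gtiff)
timed("netCDF4 write", write_netcdf4)
```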

Parallel Write Throughput

[Chart: independent parallel write throughput (MB/s) vs Lustre stripe count (1, 8, 16, 32, 64, 128) for the HDF5, MPI-IO and POSIX interfaces; MPI size = 16, stripe size = 1 MB, block size = 8 GB, transfer size = 32 MB. Throughput is low when default parameters are used.]
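The pattern behind the HDF5 curve above can be sketched with h5py's MPI-IO driver. This assumes mpi4py and an h5py build with parallel HDF5; the file path and sizes are placeholders, not the NCI benchmark itself.

```python
# Hedged sketch: independent parallel write with h5py's "mpio" driver.
from mpi4py import MPI
import h5py
import numpy as np

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()

n_per_rank = 1_000_000                      # elements written by each rank
local = np.full(n_per_rank, rank, dtype="f8")

# Collective file open and dataset creation; every rank participates.
with h5py.File("/g/data/example/parallel.h5", "w", driver="mpio", comm=comm) as f:
    dset = f.create_dataset("x", (nprocs * n_per_rank,), dtype="f8")
    # Each rank writes its own contiguous slice (independent I/O).
    start = rank * n_per_rank
    dset[start:start + n_per_rank] = local

# On Lustre, throughput also depends on striping of the target directory,
# e.g. set beforehand with `lfs setstripe -c 16 -S 1m <dir>` (as in the chart above).
```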

Parallel Read Throughput

[Chart: independent parallel read throughput (MB/s) vs Lustre stripe count (1-128) for the HDF5, MPI-IO and POSIX interfaces, with the same test parameters. Throughput is again low when default parameters are used.]

Simple Comparison of File format access latency and compression


Trade-off of Performance for Storage Capacity

Read throughput (MB/s), source file (19 MB, "src") vs normal file (121 MB, "nrm"), on Raijin and the VDI:

Transfer count   Transfer size   Raijin (src)   VDI (src)   Raijin (nrm)   VDI (nrm)
1 × 5500         22 kB           182.39         199.80      479.77         1473.45
5 × 5500         110 kB          229.19         248.23      790.84         3965.39
10 × 5500        220 kB          216.08         235.42      804.79         4521.17
20 × 5500        440 kB          218.45         238.15      848.09         5182.22
40 × 5500        880 kB          220.58         239.28      888.82         5785.31
55 × 5500        1.21 MB         220.74         241.25      887.62         4972.67
100 × 5500       2.2 MB          222.00         228.93      889.31         5898.41
550 × 5500       12.1 MB         203.86         219.08      800.65         3818.48
1000 × 5500      22 MB           192.56         217.09      710.48         3162.54
5500 × 5500      121 MB          189.49         220.75      544.84         2066.02

•  Intelligent organisation of the data can close the gap.
•  Different patterns of access can make access performance worse.
•  Better tuning is needed for common/important use cases.
•  There is potential to store compressed data twice (with different packing) if the trade-off is justifiable/manageable.

HDF5 Chunk Cache


HDF5 Chunked Storage

•  Data is stored in chunks of a predefined size
•  The two-dimensional case may be referred to as data tiling
•  The HDF5 library writes/reads a whole chunk at a time

[Figure: contiguous vs chunked dataset layout]
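For illustration, a chunked, compressed 2-D dataset can be created with h5py as sketched below; the array shape, chunk size and deflate level are example values only.

```python
# Hedged sketch: create a chunked ("tiled"), compressed 2-D dataset with h5py.
import h5py
import numpy as np

data = np.random.rand(5500, 5500).astype("f4")

with h5py.File("chunked.h5", "w") as f:
    f.create_dataset(
        "band",
        data=data,
        chunks=(1100, 1100),    # chunk (tile) shape
        compression="gzip",     # zlib/deflate, applied chunk by chunk
        compression_opts=2,     # deflate level 2
        shuffle=True,           # byte-shuffle filter improves compression
    )
```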

Subset Access – compression, subsetting and caches

[Chart: read throughput (MB/s) vs deflate level (0-9) for subset shapes 1×5500, 275×275 and 1100×1100, each with 4 MB and 32 MB chunk caches; chunk shape 1100×1100]
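A sketch of the read side of that comparison, reopening the illustrative file from the previous sketch with different HDF5 chunk-cache sizes and subset shapes, might look like this (wrap the calls in a timer to reproduce the figure; all values are placeholders):

```python
# Hedged sketch: subset reads under different chunk-cache sizes (4 MB vs 32 MB).
import h5py

def read_subsets(cache_bytes, shape):
    # rdcc_nbytes sets the per-dataset raw chunk cache (default is about 1 MB).
    with h5py.File("chunked.h5", "r",
                   rdcc_nbytes=cache_bytes, rdcc_nslots=10007) as f:
        dset = f["band"]
        rows, cols = shape
        total = 0.0
        # Walk the dataset in subset-sized windows (1x5500, 275x275, 1100x1100, ...).
        for r in range(0, dset.shape[0], rows):
            for c in range(0, dset.shape[1], cols):
                total += float(dset[r:r + rows, c:c + cols].sum())
        return total

for cache in (4 * 1024**2, 32 * 1024**2):
    for shape in ((1, 5500), (275, 275), (1100, 1100)):
        read_subsets(cache, shape)
```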

OPeNDAP

[Diagram: DAP2 – "NetCDF on the wire". A NetCDF file on the server is exposed as a data array (shape, dtype, GeoTransform, projection, metadata) in the OPeNDAP format and delivered to the client over an HTTP connection]
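A minimal client-side sketch of this on-the-wire access with netCDF4-python is shown below (it assumes a netCDF-C build with DAP support; the dataset path and variable name are placeholders):

```python
# Hedged sketch: remote subset access over OPeNDAP/DAP2.
from netCDF4 import Dataset

# Placeholder URL for a dataset published through a THREDDS server.
url = "http://dapds00.nci.org.au/thredds/dodsC/example/dataset.nc"

ds = Dataset(url)                   # opens the remote dataset, no file download
tas = ds.variables["tas"]           # assumed variable name
subset = tas[0, 100:200, 100:200]   # only this slice travels over HTTP
print(subset.shape)
ds.close()
```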


Tile Map Servers – Serving Maps

[Diagram: request flow (steps 1-4) between a client browser, a WMTS tile server and a THREDDS server sitting on the underlying data]

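The client end of that flow amounts to a WMTS GetTile request. A hedged KVP-style sketch using the requests library is shown below; the endpoint, layer and tile matrix set names are placeholders, not a real NCI service.

```python
# Hedged sketch: fetch one map tile via a WMTS GetTile (KVP) request.
import requests

endpoint = "http://example.nci.org.au/wmts"      # hypothetical WMTS server
params = {
    "SERVICE": "WMTS",
    "REQUEST": "GetTile",
    "VERSION": "1.0.0",
    "LAYER": "example_layer",
    "STYLE": "default",
    "FORMAT": "image/png",
    "TILEMATRIXSET": "WholeWorld_WebMercator",
    "TILEMATRIX": "6",
    "TILEROW": "38",
    "TILECOL": "57",
}
resp = requests.get(endpoint, params=params, timeout=30)
resp.raise_for_status()
with open("tile.png", "wb") as fh:
    fh.write(resp.content)                       # one rendered map tile
```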

Examples of on-the-fly data delivery


Key Messages on Accessing High Performance Data
•  Data at today's scales has to be built as shared global facilities based around national institutions.
•  Domain-neutral international standards for data collections and interoperability are critical for allowing complex interactions in HP environments, both within and between HPD collections.
•  It needs expertise in usability and performance tuning to ensure we get the most out of the data.
•  No one can do it alone. No one organisation, no one group, no one country has the required resources or the expertise.
•  Shared collaborative efforts such as the Research Data Alliance, the Earth System Grid Federation (ESGF), the Belmont Forum, EarthServer, the Ocean Data Interoperability Platform (ODIP), EarthCube, GEO and OneGeology are needed to realise the full potential of the new data-intensive science infrastructures.
•  It now takes a 'village of partnerships' to raise a 'HPD data centre' in a Big Data world.

http://www.onegeology.org/