Canopus: A Paradigm Shift Towards Elastic Extreme-Scale Data Analytics on HPC Storage

Tao Lu1, Eric Suchyta2, Dave Pugmire2, Jong Choi2, Scott Klasky2, Qing Liu1, Norbert Podhorszki2, and Mark Ainsworth3,2

1 Department of Electrical and Computer Engineering, New Jersey Institute of Technology
2 Computer Science and Mathematics Division, Oak Ridge National Laboratory
3 Division of Applied Mathematics, Brown University

Abstract

Scientific simulations on high performance computing (HPC) platforms generate large quantities of data. To bridge the widening gap between compute and I/O, and to enable data to be more efficiently stored and analyzed, simulation outputs need to be refactored, reduced, and appropriately mapped to storage tiers. However, a systematic solution to support these steps has been lacking in the current HPC software ecosystem. To that end, this paper develops Canopus, a progressive, JPEG-like data management scheme for storing and analyzing big scientific data. It co-designs data decimation, compression, and data storage, taking the hardware characteristics of each storage tier into consideration. With reasonably low overhead, our approach refactors simulation data into a much smaller, reduced-accuracy base dataset and a series of deltas that are used to augment the accuracy if needed. The base dataset and deltas are compressed and written to multiple storage tiers. Data saved on different tiers can then be selectively retrieved to restore the level of accuracy that satisfies data analytics. Thus, Canopus provides a paradigm shift towards elastic data analytics and enables end users to make trade-offs between analysis speed and accuracy on the fly. We evaluate the impact of Canopus on unstructured triangular meshes, a pervasive data model used by scientific modeling and simulations. In particular, we demonstrate the progressive data exploration of Canopus using the "blob detection" use case on fusion simulation data.

I. INTRODUCTION

In 2015, the authors worked with a team of computational scientists from Princeton Plasma Physics Laboratory to conduct a cutting-edge fusion simulation, running XGC1 [1], [2] on 300,000 cores on Titan1 to model ITER2, prospectively the largest fusion reactor in the world, designed to produce a 500 MW power output, a ten-fold increase compared to the input. The spatiotemporal resolution of the simulation initially called for a solution that could efficiently store and analyze 1 PB of particle and grid data per day. We estimated that retrieving the data from the parallel file system alone would take at least 5 hours, which was deemed too expensive3. Furthermore, the

total capacity of one file system partition on Titan was 14 PB; writing 1 PB per day, its capacity would be exceeded in less than two weeks. These data management challenges, unsustainable in terms of both throughput and capacity, forced the application scientists to degrade the fidelity of the run to accomplish the campaign in an acceptable time.

High precision and fine resolution largely account for the above-mentioned difficulties. Numerical calculations in scientific simulations often use high accuracy in both the spatial and temporal dimensions to mitigate error amplification over many iterations. Unfortunately, the large data volume generated may dramatically prolong downstream analytics. In collaboration with fusion and other application scientists, we identify the following opportunities to tackle this challenge. First, analytics do not always need to be performed at the same precision or resolution level as the simulation. Enforcing the highest accuracy is often excessive and unnecessary for data analytics; lower accuracy may suffice on a case-by-case basis. A range of accuracies, instead of a single accuracy, is particularly useful if the data need to be shared by a community of users who have different requirements. Second, on next-generation leadership systems such as Aurora4, Summit5, and Sierra6, the storage hierarchy is becoming increasingly deep with the emergence of new storage/memory technologies, such as high-bandwidth memory (HBM), non-volatile memory (NVRAM), solid-state drives (SSDs), and burst buffers7. Different storage tiers possess distinct I/O performance characteristics and constraints, which need to be balanced appropriately to accommodate the speed and size requirements of data analytics.

To facilitate data analysis in an efficient and flexible manner, we design and implement Canopus, a data management middleware based on the Adaptable I/O System (ADIOS) [3], [4], which refactors simulation results into a smaller base dataset and a series of auxiliary deltas. These refactored data are further compressed and mapped to storage tiers. Mindful of the varied accuracy and performance requirements of post-run analytics, Canopus allows the base dataset to be retrieved alone or in conjunction with the auxiliary deltas, to restore the data to target accuracy levels that may or may not be the full accuracy.

1 https://www.olcf.ornl.gov/titan/
2 https://www.iter.org/
3 Personal communication with C. Chang
4 http://aurora.alcf.anl.gov/
5 https://www.olcf.ornl.gov/summit/
6 https://asc.llnl.gov/coral-info
7 http://www.nersc.gov/users/computational-systems/cori/burst-buffer/
8 http://cgns.sourceforge.net/CGNSFiles.html

We evaluate the functions and performance of Canopus using three HPC simulation datasets generated by the scientific applications XGC1 and GenASiS [5] and by a CFD I/O kernel8, and we study how Canopus impacts downstream analytics. We conduct evaluations on Titan, utilizing tmpfs [6] and the Lustre [7] parallel file system to build a two-tier storage hierarchy. We highlight three features as an executive summary of our contributions:

• The design and implementation of Canopus allow simulation data to be refactored such that users can perform exploratory data analysis progressively, with the fidelity adjusted according to scientists' accuracy needs.
• The co-design of analytics and storage provides a new data management paradigm in HPC environments. Canopus refactors and maps data to appropriate storage tiers, taking advantage of the capacity and performance characteristics of each tier.
• We conducted thorough performance evaluations to understand the impact of Canopus on simulations and analytics. For blob detection on the fusion data, Canopus maintains the salient features of the simulation results even when the decimation ratio is as high as 16X. We demonstrate that Canopus can accelerate end-to-end data analytics by an order of magnitude when trading accuracy for speed is preferable.

The remainder of this paper is organized as follows. Section II builds upon this introduction, continuing the discussion of Canopus' motivation. Section III explains the detailed architecture and workflow implemented in Canopus. In Section IV, we describe the performance evaluation, including the datasets and testbed used, and investigate how data saved by Canopus at different resolutions affect nontrivial analytics on fusion simulation data. Section V overviews related work, followed by conclusions in Section VI.

II. MOTIVATION

This section explains the motivation of Canopus and compares it with existing schemes, to better contextualize the new functionality supported by Canopus and to frame the fundamental contributions of our work. We highlight several challenges that we must address and that Canopus helps to overcome.


A. Progressive Refinement



Motivation 1: Progressive refinement has been proven to be an effective method in scenarios where trade-offs between storage size and data accuracy are desirable.

Based on our domain knowledge as well as our discussions with application and data scientists, it is common that meaningful insights can be obtained through data analytics at a lower accuracy, rather than always at the full accuracy of the simulation results. Canopus is motivated by the concept of progressive refinement, with the intent of enabling users to perform data analysis at various accuracies. Depending on the problem at hand, users may deem that they do not require the full accuracy saved by the simulations, especially if there are high compute and/or I/O costs associated with high accuracy. As we discussed in Section I using the example of XGC1 reading times, this scenario exists today. Canopus does not force users to work on reduced-accuracy data. Instead, it affords users the flexibility of choosing the accuracy they desire and augmenting it on the fly. Canopus is largely inspired by JPEG 2000 [8], which was designed to cope with transferring images over congested network links using a wavelet-based compression technique and truncation, such that decoding the code stream still yields a signal resembling the original, but at a reduced resolution. For two-dimensional images, often not all bits in every pixel need to be retained for acceptable visual quality. If the network quality of service is low, transferring the full resolution takes a long time. Compromising quality for latency to obtain a quick view, then progressively promoting the quality, offers a better user experience.

HPC workflows already borrow from the idea of progressive refinement, using reduced-fidelity data access paradigms for better performance during simulation. We cite code coupling in fusion simulations as an example. XGCa [9], [10] is a version of XGC1 that assumes additional symmetry to reduce the computational cost and to increase the integration time that can be simulated, at the penalty of coarsening the simulation accuracy. Workflows exist that run XGCa and XGC1 in a coupled fashion, with XGCa "fast-forwarding" the system state to a later time when interesting turbulent phenomena occur. XGC1 then resumes from this point, because increased fidelity is needed to resolve the features of the turbulent physics. During production runs, XGC1 rarely writes its full particle information to disk, only when checkpointing or restarting is needed, due to the 10 TB data volume. More frequently, the simulation outputs a smaller data volume called f0, which includes a summary of particle characteristics such as the distribution of particle velocities, to mitigate the I/O bottleneck. In coupled jobs, data must be exchanged between the codes several times. Hence, for performance acceleration, f0, instead of the full dataset, is read by XGCa and then sampled to generate a particle realization.

B. Canopus vs. Multi-level Compression

Motivation 2: Naive multi-level compression is sub-optimal in data reduction ratio. Canopus exploits the correlation between different levels and the high compressibility of deltas to further reduce the storage footprint of the reduced data.

Progressive refinement can be achieved via multi-level compression, without saving deltas. One generates N different-accuracy datasets {L^i | N > i >= 0} (full details of the notation can be found in Section III-B), which are retrieved progressively from lower to higher accuracy until the analytics requirements are satisfied, beginning from L^{N-1} and going up through L^0 for the original, full-accuracy data. Our approach instead calculates differences between levels as delta^{i-j} = L^i - L^j. We observe that consecutive levels are correlated, and the deltas are

smoother than L^i (cf. Figure 4). Accordingly, the deltas should compress more easily than L^i. Using this as motivation, instead of storing all L^i, Canopus stores a low-accuracy dataset L^{N-1} together with {delta^{i-(i+1)} | N-1 > i >= 0}. Restoring the data to the k-th accuracy level then uses L^{N-1} and the subset of deltas {delta^{i-(i+1)} | k <= i < N-1}. (Full details of the algorithm are presented in Section III-C3.)
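For instance, which stored products a given target level touches follows directly from the set above; the toy helper below (a hypothetical illustration, not part of Canopus) simply enumerates them.

def pieces_needed(N, k):
    """Return the stored pieces required to restore accuracy level k, given N levels:
    the base L^{N-1} plus the deltas delta^{i-(i+1)} for k <= i < N-1."""
    assert 0 <= k < N
    pieces = ["L^%d (base)" % (N - 1)]
    # Deltas are applied from the coarsest pair down to level k.
    for i in range(N - 2, k - 1, -1):
        pieces.append("delta^{%d-%d}" % (i, i + 1))
    return pieces

# With N = 3 levels: full accuracy (k = 0) needs the base and both deltas.
print(pieces_needed(3, 0))  # ['L^2 (base)', 'delta^{1-2}', 'delta^{0-1}']
print(pieces_needed(3, 2))  # ['L^2 (base)'] -- the base alone suffices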

Figure 1: Canopus workflow. HPC simulations produce high-accuracy data that is refactored (decimation, compression) into a base dataset and deltas mapped onto the storage hierarchy (base = L^2 on ST2, delta^{1-2} on ST1, delta^{0-1} on ST0); analytics pipelines then retrieve and reconstruct the data at low (base), medium (base + delta^{1-2}), or high (base + delta^{1-2} + delta^{0-1}) accuracy.

C. Canopus vs. Adaptive Mesh Refinement

Motivation 3: Compared with adaptive mesh refinement, Canopus does not require a priori knowledge about the problem to be examined. It allows higher accuracy for any sub-region in the problem domain.

Canopus is distinguished from adaptive mesh refinement (AMR) [11], [12] as follows. AMR is a simulation-side numerical method that utilizes a priori knowledge to identify and improve the resolution of interesting sub-regions, with the goal of improving compute and memory efficiency. In contrast, Canopus is designed to support exploratory data analytics, where a priori knowledge may not be available, and it can augment the accuracy of any sub-region in the problem domain. Moreover, adding AMR capabilities to applications is intrusive and requires significant changes to simulations, while Canopus uses declarative ADIOS interfaces and requires minimal code changes. In addition, AMR generally suffers from limited scalability due to frequent communications between patches, while the refactoring in Canopus is done locally without communication and thus avoids this concern. Finally, even with AMR, as long as the simulation results are saved for future use, Canopus can be applied to those results to expedite future data analytics. Therefore, Canopus may supplement AMR.



D. Challenges

Although Canopus was inspired by existing technologies such as JPEG 2000, there is not a one-to-one analogy, and we must address several new design and implementation challenges. First, the context and problem domain of Canopus are inherently HPC science-oriented. Canopus targets big scientific data, aiming to mitigate the storage I/O bottleneck of data analysis for large scientific simulations. In this scenario, new concerns arise, such as how to co-design complex simulation data and storage in order to take advantage of the capacity and performance characteristics of each tier. Scientific data also necessitates fairly complicated data reduction procedures. Whereas JPEG images consist of gridded pixel values, which can be fairly easily compressed using the discrete cosine transform, Canopus targets a broader class of scientific data that uses structured and unstructured (e.g., triangular) meshes to discretize the problem domain, where the data quantities are stored as floating-point values over the mesh. The variability and complicated representation of the data make it hard to achieve high reduction ratios using a single compression algorithm.

Second, visualizing images is a common technique during data analysis; however, extracting insights from scientific data requires more than visualization. Descriptive, predictive, and prescriptive analytics are widely used to generate actionable results. The design of Canopus keeps these requirements in mind to avoid losing salient information in simulation results. For XGC1 data, Kress et al. [13] demonstrated that important features, which we will study in Section IV-D, can be completely erased during reduction. Hence, the ability to augment accuracy is important.

III. DESIGN AND IMPLEMENTATION

Canopus manages simulation results as soon as they are generated, aiming to accelerate forthcoming data analytics. In this section, we present our approaches to the reduction, placement, and retrieval of simulation data. Data reduction consists of refactoring and compression. Data retrieval includes reading, decompressing, and restoring data. We begin with a high-level presentation of Canopus, demonstrating its workflow and architecture. We then present the Canopus data reduction, placement, and retrieval procedures in detail. The source code of Canopus is available at https://github.com/ornladios/ADIOS/tree/sirius.

A. Canopus Overview

Design Principle: Rather than reinventing the wheel, Canopus leverages existing I/O stacks, including MPI and POSIX. Driven by the requirements of data analytics, Canopus trades accuracy for speed and provides a new paradigm for scientists to interact with their data.

Figure 1 illustrates the Canopus workflow at a glance. As introduced in Section I, the overall design is intended to provide a range of accuracies to users, using a base low-accuracy dataset accompanied by a series of deltas that can be applied to increase the accuracy level. The encoding process occurs to the left of the first arrow in Figure 1, and the data retrieval occurs to the right of the second arrow. The pyramid represents the levels of the storage hierarchy where the data products are written and retrieved, with faster, smaller tiers at the top and slower, larger ones toward the bottom. The encoding process occurs in three steps (detailed in Section III-C). First is refactoring: the series of variable-accuracy versions is generated from the simulation data.

Second, the refactored data is further reduced using floating-point compression algorithms. Finally, the compressed data is distributed across the storage hierarchy. We note that Canopus can be run to save data for post-processing, in situ [14], [15], [16], [17], or in transit [14], [18]. By in situ, we mean that Canopus runs on the same node as the simulation (using either the same core or a different core than the simulation process), whereas the in transit approach stages the data in memory to auxiliary nodes for processing. Switching transport modes is a runtime option, requiring no source code change or recompilation. Canopus builds upon a data model that consists of meshes and data, and therefore the data refactoring adopts mesh decimation to calculate low-accuracy datasets and deltas. When the refactored data products are placed onto storage tiers, the base dataset is placed onto a faster tier, and the deltas are placed onto larger but slower tiers. We recognize that if the refactoring is performed during the execution of a simulation running on many cores, its performance cost needs to be manageable compared to the total simulation time; Section IV-C briefly considers write performance. However, because Canopus specifically targets applications in which the simulation results are written once but analyzed a number of times (e.g., for parameter sensitivity studies), the performance and new functions in data retrieval are the primary metrics we seek to enhance. On the analytics side, Canopus allows users to select the desired level of accuracy, and the compressed data is then retrieved from the storage hierarchy as necessary. The retrieved data is decoded using the appropriate floating-point decompression algorithm, and upon decompression, the data is restored to the target accuracy level. Further discussion of these steps follows in Section III-E.

Figure 2 locates Canopus in the I/O stack. Canopus is implemented as a super I/O transport method in ADIOS and is plugged into the simulation and analytics via the ADIOS write and query interfaces, respectively. There are two primary reasons why we chose this approach. First, ADIOS provides a high-performance I/O interface on HPC systems, enabling I/O scalability on leadership-class systems. The interface includes several choices of I/O transport methods, such as MPI-IO, POSIX, etc., and Canopus can leverage these existing high-performance solutions without developing them from scratch. Second, ADIOS has already been used in production by a large number of petascale applications. Plugging Canopus into ADIOS requires no source code change from these applications.

Figure 2: Canopus architecture. Canopus (I/O, refactoring, compression, placement, retrieval, and restoration) sits between the ADIOS write API used by simulations and the ADIOS query API used by data analytics, and builds on the ADIOS kernel (buffering, metadata, scheduling, etc.) and its I/O transports (MPI, MPI_AGGREGATE, MPI_LUSTRE, POSIX, DataSpaces, FLEXPATH), which target storage tiers ranging from node-local storage (NVRAM, SSDs) and burst buffers to the remote parallel file system and campaign storage.



B. Notation

For explanatory convenience, this section introduces the notation used throughout the remainder of the paper. L^l denotes a data variable, and the superscript selects a particular accuracy level from a progression of levels:



N ≡ total number of levels
L^l, 0 <= l < N ≡ progressive data levels
delta^{l-m} ≡ L^l - L^m, where 0 <= l < m < N and m - l = 1

Indexing begins at 0, and decreasing l increases the accuracy. The full-accuracy data corresponds to L^0, and the base dataset corresponds to L^{N-1}. We mathematically clarify the delta calculation L^l - L^m in Section III-C2. Vertices and edges in the mesh at a certain level are denoted as follows:

G^l(V^l, E^l) ≡ mesh at level l
Vertices: V^l ≡ {V^l_i | 0 <= i < |V^l|}
Edges: E^l ≡ {E^l_{ij} | 0 <= i < |V^l|, j ∈ j(i)}

V^l_i denotes the i-th vertex of the l-th level, and E^l_{ij} is the bidirectional edge connecting vertex V^l_i to vertex V^l_j. The operator |·| retrieves the number of vertices or edges of a level. Using this notation, the decimation ratio between level l and the original is:

d_l = |V^0| / |V^l|

For simplicity, let us assume d_l = 2^l. With three levels of accuracy, Canopus generates a base dataset L^2 that is 25% of the full-accuracy size, along with delta^{1-2} and delta^{0-1} (Figure 1). In particular, delta^{1-2} is the difference between L^2 and L^1, and delta^{0-1} is the difference between L^1 and L^0. Canopus further compresses L^2, delta^{1-2}, and delta^{0-1} into L^2_c, delta^{1-2}_c, and delta^{0-1}_c, which are subsequently mapped onto the following storage tiers: ST2 (the fastest but smallest in capacity, e.g., NVRAM), ST1 (slower but larger in capacity, e.g., SSD), and ST0 (the slowest but largest in capacity, e.g., disks), respectively. Note that adjacent data levels are not necessarily mapped to adjacent physical tiers, because some physical tiers may not have sufficient capacity to accommodate the data. Next, data analytics has three options to examine the data: (1) it requests the lowest accuracy by quickly retrieving L^2_c from ST2 and decompressing it to obtain L^2; (2) it restores a higher accuracy by additionally retrieving and decompressing delta^{1-2}_c from ST1 and then

performing L^2 + delta^{1-2} = L^1; and (3) it restores the highest accuracy by further retrieving and decompressing delta^{0-1}_c from ST0 and then calculating L^2 + delta^{1-2} + delta^{0-1} = L^0. These options allow users to progressively augment the data accuracy by fetching the deltas level by level on the fly.

C. Data Refactoring

In general, Canopus supports various approaches to refactoring data, including byte splitting [19], block splitting [8], and mesh decimation [20], [21], [22] for both structured and unstructured meshes. In this paper, we focus on mesh decimation because 1) it can reduce data size aggressively (e.g., by a factor of 1000), which is needed for reducing PB-level data; 2) the majority of scientific simulations use mesh-based data models, and working on meshes instead of bytes allows us to leverage the inherent application semantics (e.g., node adjacency); and 3) it can generate a lower-accuracy dataset that is complete in geometry and can be directly consumed by analytics, as opposed to techniques such as block splitting [8]. Our data refactoring approach proceeds iteratively, where each iteration involves two steps: mesh decimation and delta calculation. These steps are followed by floating-point compression and then placement onto storage tiers. Each of these four actions is further detailed below. Furthermore, we describe data retrieval, including decompression and the final restoration process.

1) Mesh Decimation: To decimate unstructured meshes, Canopus adopts edge collapsing [23], [21], [22] to decimate level l and generate level l+1. To that end, we first insert all edges of a mesh into a priority queue. The priority of an edge is set to its length, and shorter edges are collapsed first to generate the lower resolution. In general, choosing the priority of an edge is application dependent and is left for future study. The decimation algorithm is described in Algorithm 1. It successively cuts the edge with the shortest length from the mesh until the pre-determined decimation ratio d_l is reached. Namely, the shortest edge E^l_{ij}, between vertices V^l_i and V^l_j at level l, is removed from the mesh structure. Next, a new vertex V^{l+1}_k = NewVertex(V^l_i, V^l_j) is constructed, and new edges between V^{l+1}_k and the adjacent vertices of V^l_i and V^l_j are constructed and inserted into the mesh structure and the priority queue. Canopus sets V^{l+1}_k to a linear combination (represented by NewVertex(·)) of V^l_i and V^l_j; for simplicity we use V^{l+1}_k = (V^l_i + V^l_j)/2 in this paper. Analogously, the decimated data is set to L^{l+1}_k = NewData(L^l_i, L^l_j), using the simple mean. The time complexity of decimation is dominated by the cost of the insert operation on the priority queue, which is typically O(log N), while finding the shortest edge is O(1). Note that the decimation is done locally without requiring communication with other processors, and is therefore embarrassingly parallel.

2) Delta Calculation: After decimation, instead of directly saving the set of levels {L^l | 0 <= l < N} to storage, Canopus writes a single low-resolution base level L^{N-1} and computes deltas, which are the differences between adjacent accuracy levels.

Algorithm 1 Mesh Decimation
Input: G^l(V^l, E^l), L^l, d_l
Output: G^{l+1}(V^{l+1}, E^{l+1}), L^{l+1}

vertices_cut ← 0
pqueue ← BuildPriorityQueue(G^l(V^l, E^l))
G^{l+1}(V^{l+1}, E^{l+1}) ← CopyMesh(G^l(V^l, E^l))
L^{l+1} ← CopyData(L^l)
while vertices_cut / |V^{l+1}| < 1.0 - 1.0/d_l do
  E_{ij} ← PriorityQueuePop(pqueue)
  RemoveEdge(G^{l+1}(V^{l+1}, E^{l+1}), E_{ij}, pqueue)
  V_k ← NewVertex(V_i, V_j)
  L^{l+1}_k ← NewData(L^l_i, L^l_j)
  RemoveVertex(G^{l+1}, V_i)
  RemoveVertex(G^{l+1}, V_j)
  for all V ∈ GetNeighbors(V_i) do
    AddEdge(G^{l+1}(V^{l+1}, E^{l+1}), V_k, V)
    UpdatePriorityQueue(G^{l+1}(V^{l+1}, E^{l+1}), pqueue)
  end for
  for all V ∈ GetNeighbors(V_j) do
    AddEdge(G^{l+1}(V^{l+1}, E^{l+1}), V_k, V)
    UpdatePriorityQueue(G^{l+1}(V^{l+1}, E^{l+1}), pqueue)
  end for
  vertices_cut ← vertices_cut + 1
end while
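A minimal Python sketch of this shortest-edge-first collapsing is given below. It follows the spirit of Algorithm 1 on a toy mesh described by vertex coordinates, per-vertex data, and an edge set, with heapq playing the role of the priority queue; it is an illustration under those simplifying assumptions, not the Canopus implementation, and it omits the triangle bookkeeping that the real code performs.

import heapq

def decimate(points, data, edges, ratio):
    """Collapse shortest edges until the vertex count drops by 'ratio' (simplified).
    points: {vid: (x, y)}, data: {vid: value}, edges: set of (vid_a, vid_b) pairs."""
    target = max(1, len(points) // ratio)

    def length2(a, b):
        (xa, ya), (xb, yb) = points[a], points[b]
        return (xa - xb) ** 2 + (ya - yb) ** 2

    heap = [(length2(a, b), a, b) for (a, b) in edges]
    heapq.heapify(heap)                       # the priority queue of Algorithm 1
    adj = {v: set() for v in points}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    next_id = max(points) + 1
    while len(points) > target and heap:
        _, a, b = heapq.heappop(heap)
        if a not in points or b not in points:
            continue                          # stale entry: an endpoint was already collapsed
        # NewVertex(.) and NewData(.): midpoint position and simple mean, as in the paper.
        k = next_id
        next_id += 1
        new_pt = ((points[a][0] + points[b][0]) / 2.0,
                  (points[a][1] + points[b][1]) / 2.0)
        new_val = (data[a] + data[b]) / 2.0
        neighbors = (adj[a] | adj[b]) - {a, b}
        for n in neighbors:                   # detach the collapsed endpoints
            adj[n].discard(a)
            adj[n].discard(b)
        for v in (a, b):
            del points[v], data[v], adj[v]
        points[k], data[k], adj[k] = new_pt, new_val, set()
        for n in neighbors:                   # reconnect and re-queue the new edges
            adj[k].add(n)
            adj[n].add(k)
            heapq.heappush(heap, (length2(k, n), k, n))
    return points, data, adj

# Toy usage: a unit square with one diagonal, decimated by 2X.
pts = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (1.0, 1.0), 3: (0.0, 1.0)}
vals = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}
E = {(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)}
print(decimate(pts, vals, E, 2))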

The rationale behind this is that in all three datasets we consider, we observe that the delta is less variable than L^l (Figure 4), and for a fixed target accuracy it should therefore be more efficiently compressible by libraries such as ZFP [24]. Equations (1)-(3) specify the delta calculation mathematically, Algorithm 2 presents the pseudocode, and Figure 3 illustrates the delta calculation for one cell of a triangular mesh.

delta^{l-(l+1)}_x = L^l_x - Estimate(L^{l+1}_i, L^{l+1}_j, L^{l+1}_k)   (1)
Estimate(·) ≡ α·L^{l+1}_i + β·L^{l+1}_j + γ·L^{l+1}_k   (2)
α + β + γ = 1   (3)

Estimate(·) is an estimation function that uses the data at level l+1 to predict level l, leveraging the fact that the two levels are expected to be highly correlated. We further assume that Estimate(·) is a linear combination of L^{l+1}_i, L^{l+1}_j, and L^{l+1}_k, with normalized coefficients α, β, and γ. That is, we anticipate that α·L^{l+1}_i + β·L^{l+1}_j + γ·L^{l+1}_k generates a reference value that is fairly close to L^l_x, and by saving delta^{l-(l+1)}_x (instead of L^l_x), which is in principle close to zero, the resulting data can be compressed well. For simplicity, we set α = β = γ = 1/3; the optimal form of Estimate(·) is left for future study.

Figure 3: Demonstration of delta calculation. Level-l vertices V^l_x, V^l_w, V^l_y, and V^l_z fall inside the level-(l+1) triangle formed by V^{l+1}_i, V^{l+1}_j, and V^{l+1}_k.

Algorithm 2 Delta Calculation
Input: G^l(V^l, E^l), L^l, G^{l+1}(V^{l+1}, E^{l+1}), L^{l+1}
Output: delta^{l-(l+1)}

for each triangle <V^{l+1}_i, V^{l+1}_j, V^{l+1}_k> in G^{l+1} do
  for each V^l_x that falls into this triangle do
    delta^{l-(l+1)}_x = L^l_x - α·L^{l+1}_i - β·L^{l+1}_j - γ·L^{l+1}_k
  end for
end for

Figure 4 shows three examples (XGC1, GenASiS, and CFD) of decimation and delta calculation. Each set of six panels shows the original data and its full-resolution mesh, the data and mesh decimated at 4X reduction, and the two sets of deltas used to restore the original.

3) Floating-point Compression: The refactored data can be further compressed in order to reduce storage capacity and data movement costs. As of 2016, Canopus has integrated ZFP [24] to perform compression. ZFP is a state-of-the-art floating-point data compressor for scientific data that de-correlates spatially correlated data by exploiting the local smoothness that typically exists between neighboring data elements, generating near-zero coefficients that can be compressed well. We are in the process of integrating other compression libraries, such as SZ [25] and FPC [26], into Canopus. As Figure 4 shows, the deltas (e.g., delta^{0-1}, delta^{1-2}) calculated between adjacent levels exhibit higher smoothness than the intermediate decimation results (e.g., L^1). Effectively, Canopus serves as a pre-conditioner that further prepares the data for compression and improves ZFP performance. Figure 5 compares the normalized data sizes of the following two approaches: 1) directly using ZFP to compress all levels, and 2) using ZFP to compress the base level along with a series of deltas. For example, with three levels of accuracy, the first approach compresses L^0, L^1, and L^2, while the second approach (used in Canopus) compresses L^2, delta^{1-2}, and delta^{0-1}. As shown in Figure 5, compressing deltas further improves the data compression ratio by 14% for XGC1 data and by up to 62.5% for GenASiS data. Both cases result in similar compression speed, which is therefore not shown.

Observation: Canopus serves as a pre-conditioner for compression algorithms, enabling better compression ratios.
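The numerical effect of Equations (1)-(3) can be seen on a single toy triangle: with α = β = γ = 1/3, the stored deltas hover near zero, which is exactly what makes them compress well. The sketch below illustrates the formulas only, under the assumption that the vertex-to-triangle membership is already known; it is not the Canopus code.

import numpy as np

def estimate(corner_vals, alpha=1/3.0, beta=1/3.0, gamma=1/3.0):
    # Equation (2): a normalized linear combination of the three coarse-level values.
    vi, vj, vk = corner_vals
    return alpha * vi + beta * vj + gamma * vk

def delta_for_triangle(fine_vals, corner_vals):
    """Equation (1): delta^{l-(l+1)}_x = L^l_x - Estimate(L^{l+1}_i, L^{l+1}_j, L^{l+1}_k)
    for every fine-level vertex x that falls inside the coarse triangle <i, j, k>."""
    return np.asarray(fine_vals) - estimate(corner_vals)

# Toy numbers: three coarse corner values and four fine vertices inside the triangle.
coarse = (1.0, 2.0, 3.0)           # L^{l+1}_i, L^{l+1}_j, L^{l+1}_k
fine = [1.9, 2.1, 2.0, 2.2]        # L^l_x values
d = delta_for_triangle(fine, coarse)
print(d)                           # values near zero, hence easier to compress
print(d + estimate(coarse))        # Algorithm 3 inverts the step and recovers the fine
                                   # values (up to floating-point rounding)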

D. Data Placement

Following compression, the base data and the deltas are placed onto storage tiers. The base data can be decimated to the point that it is small enough to fit into a high tier, e.g., NVRAM, and the additional deltas can be placed onto lower tiers that have larger capacities. If a storage tier does not have sufficient capacity, it is bypassed and the next tier is selected. Writing data efficiently is accomplished by utilizing the different I/O transports in ADIOS [3], which are provided through an abstraction layer that eliminates the need to consider storage characteristics within the applications. The I/O transport that best utilizes a specific storage tier is selected and configured in an external XML configuration file (e.g., using the ADIOS MPI_AGGREGATE transport for writing data to Lustre, and using ADIOS POSIX for writing data to local storage). In this paper, we build a two-tier storage hierarchy comprising DRAM-based tmpfs and the Lustre parallel file system on Titan, and the refactored data is written in parallel from each core, with the high tier written first, followed by the low tier. The I/O time measured in this paper is the total time spent writing both tiers.
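One simple way to express the fall-through placement policy described above is sketched below. The tier names, capacities, and product sizes are hypothetical, and the rule that a delta is never placed on a faster tier than the base is our own simplifying assumption, not a statement about the Canopus implementation.

def place(products, tiers):
    """Greedy placement: try the fastest tier first and fall through to the next
    (larger, slower) tier when its remaining capacity is insufficient.
    products: list of (name, size_bytes), base first, then deltas;
    tiers: list of dicts ordered fastest-to-slowest, each with 'name' and 'free'."""
    placement = {}
    start = 0                       # assumed policy: never place a delta above the base
    for name, size in products:
        for t in range(start, len(tiers)):
            if tiers[t]["free"] >= size:
                tiers[t]["free"] -= size
                placement[name] = tiers[t]["name"]
                start = t
                break
        else:
            raise RuntimeError("no tier can hold %s" % name)
    return placement

# Illustrative capacities for a 3-tier hierarchy (values are made up).
tiers = [{"name": "ST2 (NVRAM)", "free": 4 * 2**30},
         {"name": "ST1 (SSD)", "free": 64 * 2**30},
         {"name": "ST0 (PFS)", "free": 10 * 2**40}]
products = [("base L2", 3 * 2**30), ("delta1-2", 6 * 2**30), ("delta0-1", 12 * 2**30)]
print(place(products, tiers))
# With these numbers the base lands on ST2 and both deltas fall through to ST1.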

Figure 4: Data refactoring. For (a) XGC1 (dpot), (b) GenASiS (normVec magnitude), and (c) CFD (pressure), the panels show the original data L^0 and its mesh G^0(V^0, E^0), the decimated data L^2 and its mesh G^2(V^2, E^2), and the deltas delta^{1-2} and delta^{0-1}.

Figure 5: Canopus vs. direct compression. Normalized compressed size as a function of the total number of levels (1 to 4) for (a) XGC1 (dpot), (b) GenASiS (normVec magnitude), and (c) CFD (pressure), comparing direct ZFP compression of all levels against Canopus compression of the base level plus deltas.

E. Data Retrieval

Data analysis with Canopus is designed to take advantage of HPC storage hierarchies. The base dataset is stored on a fast storage tier to enable rapid data exploration (e.g., data inspection). Data retrieval starts from this lowest-accuracy base dataset, and if the accuracy suffices, data retrieval concludes. Otherwise, the next level of accuracy is restored using the current level and the associated deltas, likely read from a slower storage tier with larger capacity. The process is repeated until the data accuracy satisfies the user (cf. Figure 1). Note that this process can be automated if the criterion to terminate (e.g., the root mean square error between two adjacent levels) is known a priori; otherwise it is interactive and controlled by users. Even if a higher-than-base accuracy is ultimately required, another benefit of Canopus is that the initial analysis on the low-accuracy data can provide guidance to subsequent, higher-fidelity data explorations and facilitate focused data retrieval, e.g., reading smaller subsets of the high-accuracy data (for which we give an example in Section IV-D).

1) I/O and Decompression: Canopus uses the ADIOS read API to retrieve the refactored data. ADIOS provides a metadata-rich binary-packed (BP) data format, which helps to reduce the complexity of retrieving data across storage tiers. Global metadata maintains the location of the refactored data; users can access attributes (e.g., location, size, etc.) of the data via the ADIOS query interface, e.g., by calling dpot_info = adios_inq_var(file_handle, "dpot", l), and retrieve data via the ADIOS read interface, e.g., by calling adios_read_var(file_handle, "dpot", offsets, sizes, dpot, l). The retrieved data is then decompressed using the associated compression libraries.

2) Data Restoration: Canopus subsequently restores data to the desired level of accuracy by applying a set of deltas to the base data.

Algorithm 3 Restoration
Input: G^{l+1}(V^{l+1}, E^{l+1}), L^{l+1}, delta^{l-(l+1)}, G^l(V^l, E^l)
Output: L^l

Retrieve the mapping between G^{l+1}(V^{l+1}, E^{l+1}) and G^l(V^l, E^l)
for each triangle <V^{l+1}_i, V^{l+1}_j, V^{l+1}_k> in G^{l+1} do
  for each V^l_x that falls into this triangle do
    L^l_x = delta^{l-(l+1)}_x + Estimate(L^{l+1}_i, L^{l+1}_j, L^{l+1}_k)
  end for
end for

Namely, L^l_x = delta^{l-(l+1)}_x + Estimate(L^{l+1}_i, L^{l+1}_j, L^{l+1}_k), where vertex x falls into the triangle <V^{l+1}_i, V^{l+1}_j, V^{l+1}_k> (Figure 3). The complexity of restoration is O(n^2), where n is the number of vertices at level l. One critical step in restoration is to identify the set of vertices that fall into a given triangle <V^{l+1}_i, V^{l+1}_j, V^{l+1}_k> at level l+1. A brute-force approach that checks whether each V^l_n falls into the triangle can be expensive due to the potentially large number of vertices at level l and triangles at level l+1 in the mesh. To this end, Canopus stores the mapping between V^l_n and its enclosing triangle in ADIOS metadata during the refactoring phase, and this mapping information is subsequently used to accelerate data restoration. Algorithm 3 describes the restoration in detail.
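To make the per-triangle step concrete, the sketch below applies Algorithm 3 to a single coarse triangle, using barycentric coordinates as one standard way to perform the point-in-triangle test mentioned above. Canopus avoids this test at read time by storing the vertex-to-triangle mapping in metadata, so the code is illustrative only and the toy coordinates are assumptions.

def barycentric(p, a, b, c):
    """Barycentric coordinates of point p with respect to triangle (a, b, c)."""
    (x, y), (xa, ya), (xb, yb), (xc, yc) = p, a, b, c
    det = (yb - yc) * (xa - xc) + (xc - xb) * (ya - yc)
    wa = ((yb - yc) * (x - xc) + (xc - xb) * (y - yc)) / det
    wb = ((yc - ya) * (x - xc) + (xa - xc) * (y - yc)) / det
    return wa, wb, 1.0 - wa - wb

def inside(p, a, b, c, eps=1e-12):
    wa, wb, wc = barycentric(p, a, b, c)
    return wa >= -eps and wb >= -eps and wc >= -eps

def restore_triangle(corners_xy, corner_vals, fine_xy, deltas):
    """Algorithm 3 for one coarse triangle: L^l_x = delta_x + Estimate(corner values),
    with Estimate(.) the simple mean (alpha = beta = gamma = 1/3) used in the paper."""
    a, b, c = corners_xy
    est = sum(corner_vals) / 3.0
    return {x: deltas[x] + est for x, p in fine_xy.items() if inside(p, a, b, c)}

# Toy data: three coarse corners and two fine vertices, one inside and one outside.
corners_xy = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
corner_vals = [1.0, 2.0, 3.0]
fine_xy = {"x1": (0.2, 0.2), "x2": (0.9, 0.9)}   # x2 lies outside this triangle
deltas = {"x1": -0.1, "x2": 0.4}
print(restore_triangle(corners_xy, corner_vals, fine_xy, deltas))
# Only 'x1' is restored here (to about 1.9); 'x2' belongs to another triangle.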

IV. EVALUATION

In this section, we first evaluate the impact of Canopus on both simulations and data analytics. We then evaluate the end-to-end time of the analytics pipeline using the progressive data exploration capability offered by Canopus.

A. Applications

We use three applications and datasets to evaluate Canopus. Each of them includes floating-point quantities on an unstructured triangular mesh. Figure 4 provides visualizations of using Canopus for such quantities, showing two levels of mesh refinement, as well as variable quantities at these levels and the associated deltas.

1) XGC1: We apply Canopus to the use case of so-called blob transport [27], [28] in data from XGC1, the aforementioned plasma physics simulation code. These blobs are local over/under-densities in plasma quantities, which develop near the edge of the detector and dissipate confined energy from the system. We are particularly interested in examining behavior in the dpot variable, which measures how the electric potential deviates from the background. It is a 3D scalar field, organized into a discrete set of 2D planes at different positions around the rotation axis of the toroidally shaped reactor. We analyze the properties of blobs with and without Canopus encoding, as well as quantify the performance impacts. Figure 4a shows dpot data of one plane, each represented by a mesh of 41,087 triangles. We refer the reader to Figure 7, in which blobs are explicitly circled.

2) GenASiS: GenASiS [29] is a multi-physics code developed for the simulation of astrophysical systems involving nuclear matter. Figure 4b shows the magnetic field (normVec magnitude) surrounding a solar core collapse resulting in a supernova. The mesh of the dataset consists of 130,050 triangles.

3) CFD: Computational fluid dynamics (CFD) studies and analyzes the interaction of liquids with surfaces under given boundary conditions. We highlight Canopus outputs originating from a CFD simulation I/O kernel in Figure 4c. The color scale encodes pressure values near the front of a fighter jet. The deltas indicate that the most precision is needed along the interface of the material and the airflow. The mesh of the entire jet consists of 12,577 triangles.

B. Testbed and Assumptions

At the time the experiments were conducted, we did not have access to a production-level multi-tier storage system, such as one with burst buffers. To demonstrate the idea and feasibility of Canopus, we use DRAM-backed tmpfs and the Lustre parallel file system to emulate a two-tier storage system on Titan. The detailed system configuration of Titan can be found in [30]. All runs assume that the base dataset can always fit in tmpfs. However, in a production environment this may not be true, and we believe data migration and eviction will

Figure 6: Storage-to-compute trend and its impact on data refactoring. (a) Storage-to-compute trend (bytes per second per 1M flops) for large HPC systems, 2009-2024 [31]; (b) write-performance time fractions spent on decimation, delta calculation and compression, and I/O under high, medium, and low storage-to-compute ratios.

play an integral part, which needs to be developed in Canopus. We assume a proportional resource allocation: the space of the high-performance storage tiers is allocated in proportion to the simulation output sizes. If the size of the simulation data is s and the capacity ratio between the tmpfs tier and the Lustre file system tier is 1:x, then the space of tmpfs allocated to the simulation data is s/x. The performance of retrieving the base data is largely a reflection of the DRAM speed, and depending on the storage technologies used for the high-performance tiers, the performance achieved by Canopus may vary from the measurements here. Clearly, Canopus performs best on a system where the performance gap between tiers is pronounced.

C. Write Performance

Figure 6 evaluates the cost of decimation, delta calculation, and I/O under various system setups in terms of the ratio of storage to compute resources. Figure 6a begins by showing the storage-to-compute (bytes per second per 1M flops) trend since 2009 for leadership-class HPC systems in the U.S. The overall trend is that compute is becoming cheaper, and the gap between compute and storage has widened sharply over the last decade. Figure 6b shows a time breakdown of writing XGC1's dpot variable, using Canopus with a decimation ratio of two to refactor the original 20,694 double-precision mesh values. Under high (compute-bound), medium, and low (I/O-bound) storage-to-compute capabilities, we indicate the times spent on decimation, calculating/compressing the delta between two adjacent levels, and placing the refactored data (including both base and delta). For the compute-bound, medium, and I/O-bound scenarios, we assign 32, 128, and 512 cores, respectively, along with one storage target, to run XGC1. The medium case is chosen to reflect the capabilities of Titan, which has 300,000 cores and 2,016 storage targets.

Observation: Complex data refactoring incurs overhead on simulations. However, as compute becomes cheaper on next-generation HPC systems, the relative cost will go down accordingly. More importantly, this one-time cost on the simulation side will bring benefits to data analytics that will explore the data many times.

D. Impact on Data Analytics

Canopus allows users to select the level of accuracy that is sufficient for one's scientific discovery. The choice of accuracy

is mostly driven by two questions: how much faster can data analytics run with reduced accuracy, and what is the impact of reduced accuracy on data analytics? This section focuses on the second question. The concrete data analytics we demonstrate next is blob detection in XGC1 data, from which fusion scientists examine the electrostatic potential (dpot) in a fusion device and study the trajectory of high-energy particles. Herein, we use the blob detection function in OpenCV, an open-source computer vision library, to identify areas with high electric potential in a 2D plane of dpot data. It uses simple thresholding, grouping, and merging techniques to locate blobs. We test the following parameter settings, expressed as <minThreshold, maxThreshold, minArea> (a configuration sketch follows the list):

• Config1: <10, 200, 100>
• Config2: <150, 200, 100>
• Config3: <10, 200, 200>
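The paper names the detector but not its exact setup; the minimal sketch below shows how such a detector can be configured through OpenCV's Python bindings, together with the center-distance overlap test used in Figure 8d. The 8-bit scaling, the inversion, and the synthetic plane are our own illustrative assumptions rather than the preprocessing used in the paper.

import cv2
import numpy as np

def detect_blobs(field, min_thresh, max_thresh, min_area):
    """Configure OpenCV's SimpleBlobDetector with the <minThreshold, maxThreshold,
    minArea> triple used by the configurations listed above."""
    # Scale the floating-point field to an 8-bit image; by default the detector
    # favors dark blobs, so the field is inverted to make high-potential regions dark.
    scaled = (field - field.min()) / (field.max() - field.min() + 1e-12)
    img = 255 - (scaled * 255).astype(np.uint8)
    params = cv2.SimpleBlobDetector_Params()
    params.minThreshold = min_thresh
    params.maxThreshold = max_thresh
    params.filterByArea = True
    params.minArea = min_area
    detector = cv2.SimpleBlobDetector_create(params)
    return detector.detect(img)

def overlapped(kp_a, kp_b):
    # Two blobs overlap if the distance between their centers is less than the
    # sum of their radii (keypoint.size is a diameter).
    (xa, ya), (xb, yb) = kp_a.pt, kp_b.pt
    return ((xa - xb) ** 2 + (ya - yb) ** 2) ** 0.5 < (kp_a.size + kp_b.size) / 2.0

# Synthetic stand-in for one dpot plane: two smooth hot spots on a flat background.
yy, xx = np.mgrid[0:200, 0:200]
plane = (np.exp(-((xx - 60) ** 2 + (yy - 80) ** 2) / 200.0)
         + np.exp(-((xx - 140) ** 2 + (yy - 120) ** 2) / 300.0))
blobs = detect_blobs(plane, 10, 200, 100)      # Config1
print(len(blobs), [kp.pt for kp in blobs])     # typically the two synthetic hot spots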

Figure 7: A macroscopic view of blob detection. Panels (a)-(f) show the detected blobs (circled) on a dpot plane at accuracy levels L^0 through L^5.

Figure 8: A quantitative evaluation of blob detection: (a) number of blobs detected, (b) average blob diameter (pixels), (c) aggregate blob area (square pixels), and (d) blob overlap ratio, for Config1, Config2, and Config3 as a function of decimation ratio (None, 2, 4, 8, 16, 32).

Figure 7 illustrates the results of blob detection under various accuracy levels, L^0 through L^5. The circled areas in each plot are identified as high-energy blobs. Figure 8a indicates that as the accuracy decreases, the number of blobs captured by the detection algorithm decreases as a result of information loss. We notice that the blobs tend to expand first, then disappear once the potential falls below a certain threshold, as evidenced by the increased blob diameter and area in Figures 8b and 8c. A cluster of blobs may overlap and merge. These effects are caused by the averaging effect of the edge-collapsing technique adopted by Canopus to decimate unstructured meshes. Nevertheless, in Figure 7 most blobs in the full-accuracy data can still be detected using a moderately reduced accuracy, and blobs detected in the low-accuracy data still have a high overlap ratio with the blobs detected in the full-accuracy data (Figure 8d). Two blobs are defined as overlapping if the distance between their centers is less than the sum of their radii. The high overlap ratio suggests that the blobs detected in the low-accuracy data can still capture the areas with high electric potential to some degree. This can help scientists quickly scan for features at low accuracy, then zoom into areas with features by fetching a subset of the high-accuracy data.

Observation: Low-accuracy data will impact data analytics due to information loss. However, low accuracy can generate informative results that provide insights and guide subsequent discovery.

E. Progressive Data Exploration

Figure 9a measures the end-to-end time of the XGC1 analysis pipeline, comprising I/O, decompression, restoration, and blob detection phases. It is evident that I/O is the major bottleneck, with the rest incurring very low overhead. The baseline case in Figure 9a is the time spent analyzing the data at the highest accuracy (labeled as None, involving no decompression or restoration). For the other test cases, each measures the time spent constructing the next level of accuracy and then performing blob detection over the restored data. For example, at a decimation ratio of 4, the total time spent, approximately 0.82 seconds, is the time to retrieve and decompress L^2_c and delta^{1-2}_c, restore L^1, and perform blob detection on L^1. Figure 9b plots how long it takes to restore full-accuracy data from lower-accuracy data. For example, at a decimation ratio of 4, it takes 2.4 seconds to restore from L^2_c to L^0. Canopus can restore the full-accuracy data and reduce the data analysis time by up to 50% (compared to analyzing the full accuracy directly), owing to the I/O savings from fully utilizing the storage hierarchy and from pre-conditioning the data for the ZFP compressor. Similarly, Figure 10 and Figure 11 show the time usage of the Canopus phases (I/O, decompression, and restoration) for GenASiS and CFD.
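The measurements above exercise the retrieval pattern of Section III-E. The sketch below reproduces that control flow in miniature on synthetic 1D data, with linear interpolation standing in for the mesh-based Estimate(·) and a root-mean-square threshold on the most recent delta as an assumed automatic stopping criterion; it illustrates the pattern only and is not the Canopus implementation.

import numpy as np

def predict(coarse, fine_size):
    # Linear interpolation stands in for Estimate(.) on the triangular mesh.
    return np.interp(np.linspace(0.0, 1.0, fine_size),
                     np.linspace(0.0, 1.0, coarse.size), coarse)

def progressive_read(base, deltas, rmse_tol):
    """Apply deltas level by level (coarsest first); stop early once the most recent
    correction between adjacent levels falls below rmse_tol."""
    current = base
    for delta in deltas:                      # e.g., [delta^{1-2}, delta^{0-1}]
        current = predict(current, delta.size) + delta
        if np.sqrt(np.mean(delta ** 2)) < rmse_tol:
            break                             # further deltas need not be fetched
    return current

# Build a consistent toy hierarchy: L0 (full), L1, L2 (base) by subsampling.
x = np.linspace(0.0, 2.0 * np.pi, 4096)
L0 = np.sin(3.0 * x)
L1, L2 = L0[::2], L0[::4]
delta12 = L1 - predict(L2, L1.size)
delta01 = L0 - predict(L1, L0.size)

restored = progressive_read(L2, [delta12, delta01], rmse_tol=0.0)
print(np.allclose(restored, L0))              # True: applying all deltas restores L0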





Figure 9: XGC1. (a) End-to-end time (seconds) of the analytics pipeline (I/O, decompression, restoration, and blob detection) versus decimation ratio (None, 2, 4, 8, 16, 32); (b) time to restore full-accuracy data from the base dataset and deltas.

Figure 10: GenASiS. (a) Time usage of the Canopus phases (I/O, decompression, restoration) versus decimation ratio; (b) time to restore full-accuracy data from the base dataset and deltas.

Figure 11: CFD. (a) Time usage of the Canopus phases versus decimation ratio; (b) time to restore full-accuracy data from the base dataset and deltas.

V. RELATED WORK

Data management has been identified as one of the top research challenges at exascale [32]. It is recognized that efficient data reduction, storage, analysis, and visualization are crucial due to the worsening I/O bottleneck of future systems. For efficient data analysis, new data processing frameworks and I/O systems are required to facilitate in situ data processing before the data is at rest [33]. However, until recently, most data produced by scientific applications has been saved on mass storage first, then analyzed and visualized at a later time [34]. Data compression methods, which reduce data sizes while preserving information, are desired. Efforts have been made to enable efficient queries directly on compressed data [35]. A challenge is to develop a flexible data reduction mechanism that users can easily customize according to their data collection practices and analysis needs [33]. To avoid losing critical insights from simulation results, lossless compression methods [36], [37] were developed for scientific floating-point data. However, lossless compression usually achieves less than a 2X reduction ratio [38], greatly limiting its impact on reducing data footprints. Assessing the effects of data compression in simulations, Laney et al. [38] demonstrated that lossy compression mechanisms could achieve a 3 to 5X reduction ratio without causing significant changes to important physical quantities, which validated the feasibility and benefits of applying lossy compression at exascale. Lossy compression methods including ZFP [24] and SZ [25] have now been implemented to achieve high reduction ratios with low overhead. The preliminary work of this paper [39] also uses decimation techniques to reduce data. Mesh geometries are widely used to organize scientific data, and mesh compression algorithms [40] have been implemented to compress both the geometry data and the connectivity data. To enable representing functions and analyzing features

at multiple levels of detail, an algorithm for progressive compression of arbitrary triangular meshes has been implemented [41]. Largely geared towards uniform or block-structured meshes, ViSUS [42] and related projects [43] have undertaken efforts to use hierarchical Z-ordering to progressively refine grids. In addition, machine learning approaches have also been adopted for compressing motion vectors [44]. Considering that I/O constraints are making it increasingly difficult for scientists to save a sufficient amount of raw simulation data to persistent storage, in situ and in transit data analysis [15], [14], [45], [46], which tries to perform as much analysis as possible while the data are still in memory, has attracted research interest. A major challenge for achieving in situ analysis is that it leads to an increased run time of simulations that still need to complete within a fixed-time job allocation [46]. Additionally, not all analytical interests can be decided beforehand; scientists may change their interests based on observations of simulation results or based on new research. Therefore, in situ data analysis cannot fully replace traditional output-file-based analysis in the near future. To tackle the challenge of enabling in situ visualization under performance constraints, Dorier et al. [46] propose to score and screen out interesting data blocks for visualization instead of using the full dataset, so as to reduce the data processing time. These methods of screening potentially interesting data can also be useful for the data decimation process of Canopus. System-level optimizations have also been made to expedite scientific data analysis. Considering that query-driven data exploration induces heterogeneous access patterns that further stress the underlying storage system, MLOC [47] and PARLO [48] were proposed. MLOC implemented a parallel multi-level layout optimization framework for compressed scientific data to enable effective data exploration under various access patterns. PARLO integrated MLOC with ADIOS [3] to achieve run-time layout optimization. SDS [49] reorganizes data to match the read patterns of analytical tasks to accelerate read operations. BurstFS [50] utilizes burst buffers to support scalable and efficient aggregation of I/O bandwidth, accelerating the read operations of analytical tasks by an order of magnitude. We believe system-level optimizations and data compression should be co-designed to transparently benefit end users.

VI. CONCLUSION

This paper describes our efforts in enabling extreme-scale data analytics via progressive refactoring. It is motivated by a large-scale production run in which the data was too large to be effectively analyzed. To address this challenge, we design and implement Canopus, a data management middleware that allows simulation data to be refactored, compressed, and mapped to storage tiers. The key advantage of Canopus is that users can perform exploratory data analysis progressively without being forced to work on the highest-accuracy data. To achieve this, Canopus utilizes mesh decimation to generate a base dataset along with a series of deltas. Its co-design of analysis and storage provides a new data management paradigm for simulation-based science.

REFERENCES

[1] C. Chang, S. Ku, P. Diamond, Z. Lin, S. Parker, T. Hahm, and N. Samatova, "Compressed ion temperature gradient turbulence in diverted tokamak edge," Physics of Plasmas, vol. 16, no. 5, p. 056108, 2009.
[2] R. Hager and C. Chang, "Gyrokinetic neoclassical study of the bootstrap current in the tokamak edge pedestal with fully non-linear Coulomb collisions," Physics of Plasmas, vol. 23, no. 4, p. 042503, 2016.
[3] Q. Liu, J. Logan, Y. Tian, H. Abbasi, N. Podhorszki, J. Y. Choi, S. Klasky, R. Tchoua, J. Lofstead, R. Oldfield, M. Parashar, N. Samatova, K. Schwan, A. Shoshani, M. Wolf, K. Wu, and W. Yu, "Hello ADIOS: The challenges and lessons of developing leadership class I/O frameworks," Concurr. Comput.: Pract. Exper., vol. 26, no. 7, pp. 1453–1473, May 2014.
[4] "ADIOS: Adaptable I/O System," https://www.olcf.ornl.gov/center-projects/adios/.
[5] E. Endeve, C. Y. Cardall, R. D. Budiardja, and A. Mezzacappa, "Generation of magnetic fields by the stationary accretion shock instability," The Astrophysical Journal, vol. 713, no. 2, p. 1219, 2010.
[6] P. Snyder, "tmpfs: A virtual memory file system," in Proceedings of the Autumn 1990 EUUG Conference, 1990, pp. 241–248.
[7] J. Peter, "The Lustre storage architecture," Cluster File Systems, Inc., 2004.
[8] C. Christopoulos, A. Skodras, and T. Ebrahimi, "The JPEG2000 still image coding system: An overview," IEEE Transactions on Consumer Electronics, vol. 46, no. 4, pp. 1103–1127, Nov. 2000.
[9] S. Ku, C. Chang, and P. Diamond, "Full-f gyrokinetic particle simulation of centrally heated global ITG turbulence from magnetic axis to edge pedestal top in a realistic tokamak geometry," Nuclear Fusion, vol. 49, no. 11, p. 115021, 2009.
[10] E. S. Yoon and C. S. Chang, "A Fokker-Planck-Landau collision equation solver on two-dimensional velocity grid and its application to particle-in-cell simulation," Physics of Plasmas, vol. 21, no. 3, p. 032503, 2014. [Online]. Available: http://dx.doi.org/10.1063/1.4867359
[11] M. J. Berger and J. E. Oliger, "Adaptive mesh refinement for hyperbolic partial differential equations," Stanford, CA, USA, Tech. Rep., 1983.
[12] M. J. Berger and P. Colella, "Local adaptive mesh refinement for shock hydrodynamics," J. Comput. Phys., vol. 82, no. 1, pp. 64–84, May 1989. [Online]. Available: http://dx.doi.org/10.1016/0021-9991(89)90035-1
[13] J. Kress, R. M. Churchill, S. Klasky, M. Kim, H. Childs, and D. Pugmire, "Preparing for in situ processing on upcoming leading-edge supercomputers," Supercomputing Frontiers and Innovations, vol. 3, no. 4, pp. 49–65, 2016.
[14] J. C. Bennett, H. Abbasi, P. T. Bremer, R. Grout, A. Gyulassy, T. Jin, S. Klasky, H. Kolla, M. Parashar, V. Pascucci, P. Pebay, D. Thompson, H. Yu, F. Zhang, and J. Chen, "Combining in-situ and in-transit processing to enable extreme-scale scientific analysis," in High Performance Computing, Networking, Storage and Analysis (SC), 2012 International Conference for, Nov. 2012.
[15] F. Zheng, H. Abbasi, C. Docan, J. Lofstead, Q. Liu, S. Klasky, M. Parashar, N. Podhorszki, K. Schwan, and M. Wolf, "PreDatA: Preparatory data analytics on peta-scale machines," in 2010 IEEE International Symposium on Parallel Distributed Processing (IPDPS), Apr. 2010, pp. 1–12.
[16] S. Lasluisa, F. Zhang, T. Jin, I. Rodero, H. Bui, and M. Parashar, "In-situ feature-based objects tracking for data-intensive scientific and enterprise analytics workflows," Cluster Computing, vol. 18, no. 1, pp. 29–40, Mar. 2015. [Online]. Available: http://dx.doi.org/10.1007/s10586-014-0396-6
[17] F. Zhang, C. Docan, M. Parashar, S. Klasky, N. Podhorszki, and H. Abbasi, "Enabling in-situ execution of coupled scientific workflow on multi-core platform," in 2012 IEEE 26th International Parallel and Distributed Processing Symposium, May 2012, pp. 1352–1363.
[18] V. Bhat, M. Parashar, and S. Klasky, "Experiments with in-transit processing for data intensive grid workflows," in Proceedings of the 8th IEEE/ACM International Conference on Grid Computing, ser. GRID '07. Washington, DC, USA: IEEE Computer Society, 2007, pp. 193–200. [Online]. Available: http://dx.doi.org/10.1109/GRID.2007.4354133
[19] S. Klasky, E. Suchyta, M. Ainsworth, Q. Liu, et al., "Exacution: Enhancing scientific data management for exascale," in ICDCS'17, Atlanta, GA, 2017.
[20] W. J. Schroeder, J. A. Zarge, and W. E. Lorensen, "Decimation of triangle meshes," in Proceedings of the 19th Annual Conference on Computer Graphics and Interactive Techniques, ser. SIGGRAPH '92. New York, NY, USA: ACM, 1992, pp. 65–70. [Online]. Available: http://doi.acm.org/10.1145/133994.134010
[21] M. Garland and P. S. Heckbert, "Surface simplification using quadric error metrics," in Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques, ser. SIGGRAPH '97. New York, NY, USA: ACM Press/Addison-Wesley Publishing Co., 1997, pp. 209–216. [Online]. Available: http://dx.doi.org/10.1145/258734.258849
[22] H. Hoppe, "New quadric metric for simplifying meshes with appearance attributes," in Proceedings of the 10th IEEE Visualization 1999 Conference (VIS '99). Washington, DC, USA: IEEE Computer Society, 1999. [Online]. Available: http://dl.acm.org/citation.cfm?id=832273.834119
[23] M. Isenburg and J. Snoeyink, "Mesh collapse compression," in SCG '99, Miami Beach, Florida, USA, 1999.
[24] P. Lindstrom, "Fixed-rate compressed floating-point arrays," IEEE Transactions on Visualization and Computer Graphics, vol. 20, no. 12, pp. 2674–2683, Dec. 2014.
[25] S. Di and F. Cappello, "Fast error-bounded lossy HPC data compression with SZ," in Proceedings of the 2016 IEEE 30th International Parallel and Distributed Processing Symposium (IPDPS 2016), 2016, pp. 730–739.
[26] M. Burtscher and P. Ratanaworabhan, "FPC: A high-speed compressor for double-precision floating-point data," IEEE Transactions on Computers, vol. 58, no. 1, pp. 18–31, Jan. 2009.
[27] D. A. D'Ippolito, J. R. Myra, and S. J. Zweben, "Convective transport by intermittent blob-filaments: Comparison of theory and experiment," Physics of Plasmas, vol. 18, no. 6, p. 060501, Jun. 2011.
[28] S. Ku, R. M. Churchill, C. S. Chang, R. Hager, E. S. Yoon, M. Adams, E. D'Azevedo, and P. H. Worley, "Electrostatic gyrokinetic simulation of global tokamak boundary plasma and the generation of nonlinear intermittent turbulence," ArXiv e-prints, Jan. 2017.
[29] E. Endeve, C. Y. Cardall, R. D. Budiardja, and A. Mezzacappa, "Generation of magnetic fields by the stationary accretion shock instability," The Astrophysical Journal, vol. 713, no. 2, p. 1219, 2010. [Online]. Available: http://stacks.iop.org/0004-637X/713/i=2/a=1219
[30] "Titan at Oak Ridge Leadership Computing Facility," https://www.olcf.ornl.gov/titan/.
[31] S. Klasky, "CODAR: Center for Online Data Analysis and Reduction," http://www.ncic.ac.cn/codesign/codesign ppt/CODAR overview Oct 27 2016 Klasky.pdf, invited speech at HPCChina'16, Xi'an, China.
[32] A. Subcommittee, "Top ten exascale research challenges," 2014. [Online]. Available: https://science.energy.gov/~/media/ascr/ascac/pdf/meetings/20140210/Top10reportFEB14.pdf
[33] P. C. Wong, H. W. Shen, C. R. Johnson, C. Chen, and R. B. Ross, "The top 10 challenges in extreme-scale visual analytics," IEEE Computer Graphics and Applications, vol. 32, no. 4, pp. 63–67, Jul. 2012.
[34] L. Ionkov, M. Lang, and C. Maltzahn, "DRepl: Optimizing access to application data for analysis and visualization," in MSST'13, May 2013.
[35] R. Agarwal, A. Khandelwal, and I. Stoica, "Succinct: Enabling queries on compressed data," in NSDI'15, Oakland, CA, 2015.
[36] P. Lindstrom and M. Isenburg, "Fast and efficient compression of floating-point data," IEEE Transactions on Visualization and Computer Graphics, vol. 12, no. 5, pp. 1245–1250, Sep. 2006.
[37] N. Fout and K. L. Ma, "An adaptive prediction-based approach to lossless compression of floating-point volume data," IEEE Transactions on Visualization and Computer Graphics, vol. 18, no. 12, pp. 2295–2304, Dec. 2012.
[38] D. Laney, S. Langer, C. Weber, P. Lindstrom, and A. Wegener, "Assessing the effects of data compression in simulations using physically motivated metrics," Scientific Programming, vol. 22, no. 2, pp. 141–155, 2014.
[39] T. Lu, E. Suchyta, J. Choi, N. Podhorszki, S. Klasky, Q. Liu, D. Pugmire, M. Wolf, and M. Ainsworth, "Canopus: Enabling extreme-scale data analytics on big HPC storage via progressive refactoring," in 9th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 17), Santa Clara, CA: USENIX Association, 2017. [Online]. Available: https://www.usenix.org/conference/hotstorage17/program/presentation/lu
[40] M. Deering, "Geometry compression," in SIGGRAPH '95, New York, NY, USA, 1995.
[41] D. Cohen-Or, D. Levin, and O. Remez, "Progressive compression of arbitrary triangular meshes," in VIS '99, San Francisco, California, USA, 1999.
[42] V. Pascucci, G. Scorzelli, B. Summa, P.-T. Bremer, A. Gyulassy, C. Christensen, S. Philip, and S. Kumar, "The ViSUS visualization framework," in High Performance Visualization: Enabling Extreme-Scale Scientific Insight, ser. Chapman and Hall/CRC Computational Science, E. W. Bethel, H. C. (LBNL), and C. H. (UofU), Eds. Chapman and Hall/CRC, 2012, ch. 19. [Online]. Available: http://www.sci.utah.edu/publications/pascucci12/Pascucci Visus ch19.pdf
[43] S. Kumar, J. Edwards, P. T. Bremer, A. Knoll, C. Christensen, V. Vishwanath, P. Carns, J. A. Schmidt, and V. Pascucci, "Efficient I/O and storage of adaptive-resolution data," in SC14: International Conference for High Performance Computing, Networking, Storage and Analysis, Nov. 2014, pp. 413–423.
[44] T. Baby, Y. Kim, and A. Varshney, "Unsupervised learning applied to progressive compression of time-dependent geometry," Computers & Graphics, vol. 29, no. 3, pp. 451–461, 2005. [Online]. Available: https://doi.org/10.1016/j.cag.2005.03.021
[45] U. Ayachit, A. Bauer, B. Geveci, P. O'Leary, K. Moreland, N. Fabian, and J. Mauldin, "ParaView Catalyst: Enabling in situ data analysis and visualization," in ISAV'15, 2015.
[46] M. Dorier, R. Sisneros, L. B. Gomez, T. Peterka, L. Orf, L. Rahmani, G. Antoniu, and L. Bougé, "Adaptive performance-constrained in situ visualization of atmospheric simulations," in CLUSTER'16, Sep. 2016.
[47] Z. Gong, T. Rogers, J. Jenkins, H. Kolla, S. Ethier, J. Chen, R. Ross, S. Klasky, and N. F. Samatova, "MLOC: Multi-level layout optimization framework for compressed scientific data exploration with heterogeneous access patterns," in ICPP'12, Sep. 2012.
[48] Z. Gong, D. A. B. II, X. Zou, Q. Liu, N. Podhorszki, S. Klasky, X. Ma, and N. F. Samatova, "PARLO: Parallel run-time layout optimization for scientific data explorations with heterogeneous access patterns," in 2013 13th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, May 2013.
[49] B. Dong, S. Byna, and K. Wu, "Expediting scientific data analysis with reorganization of data," in Cluster'13, 2013.
[50] T. Wang, K. Mohror, A. Moody, K. Sato, and W. Yu, "An ephemeral burst-buffer file system for scientific applications," in SC'16, 2016.