Block-Structured Adaptive Mesh Refinement for

2 downloads 19 Views 6MB Size Report
for block-structured adaptive mesh refinement (AMR) that is suitable for extreme-scale parallelism. A central component of this approach are data structures that ...

Block-Structured Adaptive Mesh Refinement for Simulations on Extreme-Scale Supercomputers Block-Strukturierte Adaptive Gitterverfeinerung für Simulationen auf Hochparallelen Supercomputern

Der Technischen Fakultät der Friedrich-Alexander-Universität Erlangen-Nürnberg zur Erlangung des Doktorgrades Dr.-Ing. vorgelegt von Florian Schornbaum aus Schwabach

Als Dissertation genehmigt von der Technischen Fakultät der Friedrich-Alexander-Universität Erlangen-Nürnberg Tag der mündlichen Prüfung: 3. September 2018 Vorsitzender des Promotionsorgans: Prof. Dr.-Ing. Reinhard Lerch Gutachter: Prof. Dr. Ulrich Rüde Prof. Dr. Martin Berzins Prof. Dr. Gerhard Wellein

Abstract Dynamically adapting the space discretization during the runtime of a simulation can contribute to significantly reduce the time to solution. This thesis presents a novel approach for block-structured adaptive mesh refinement (AMR) that is suitable for extreme-scale parallelism. A central component of this approach are data structures that are designed such that the size of the meta data of each process remains bounded independent of the total number of processes. Moreover, all stages of the AMR algorithm only use distributed algorithms, meaning no data replication or central resources such as a master process are required. For the dynamic load balancing in particular, the thesis proposes to exploit the hierarchical nature of the block-structured domain partitioning by creating a lightweight, temporary copy of the distributed core data structure. This copy, however, does not contain any simulation data, but only acts as a proxy that provides topological information about the domain partitioning into blocks. Ultimately, this approach enables inexpensive, diffusion-based dynamic load balancing schemes. All proposed algorithms operate on the blocks that result from the domain partitioning. This concept and its realization enable the storage of arbitrary data. In consequence, the resulting software framework can be used for different simulation methods, including mesh-based and meshfree methods. In this thesis, fluid simulations based on the lattice Boltzmann method (LBM) are presented. To that end, the thesis introduces a novel parallelization scheme for the LBM on nonuniform grids. Because of its applicability for massively parallel simulations, the LBM has increasingly become an alternative method for large-scale non-stationary flow simulations. Using such LBM-based flow simulations, the excellent performance and scalability of the AMR implementation are demonstrated for two architecturally different petascale supercomputers. Benchmarks on an IBM Blue Gene/Q system with a mesh containing 3.7 trillion unknowns distributed to 458,752 processes confirm the applicability for future extreme-scale parallel machines. On the same system, the implementation of the LBM on nonuniform grids demonstrates an absolute performance of close to a trillion lattice cell updates per second when executed with almost two million threads. On an Intel Xeonbased supercomputer, LBM-based simulations on nonuniform grids that span the entire machine are shown to be capable of reaching a performance of less than one millisecond per time step. With 65,536 processor cores, one complete AMR cycle, including adaptive refinement and dynamic load balancing, is demonstrated to be executed in half a second for a mesh that consists of a total of 13.8 billion cells.

Zusammenfassung (Abstract in German) Das dynamische Anpassen der räumlichen Diskretisierung zur Laufzeit einer Simulation kann dazu beitragen, die Dauer der Simulation deutlich zu verkürzen. Die vorliegende Arbeit stellt einen neuen Ansatz zur blockstrukturierten adaptiven Gitterverfeinerung (AMR) vor, der für hochparallele Simulationen geeignet ist. Ein zentraler Bestandteil dieses Ansatzes sind Datenstrukturen, die so konzipiert sind, dass die Menge der MetaDaten jedes Prozesses unabhängig von der Gesamtzahl der Prozesse begrenzt bleibt. Darüber hinaus verwenden alle Stufen des AMR-Algorithmus nur verteilte Algorithmen, so dass keine replizierten Daten oder zentrale Ressourcen wie ein Master-Prozess benötigt werden. Insbesondere für die dynamische Lastverteilung schlägt die vorliegende Arbeit vor, den hierarchischen Charakter der blockstrukturierten Raumaufteilung auszunutzen und eine leichtgewichtige, temporäre Kopie der verteilten Kerndatenstruktur zu erstellen. Diese Kopie enthält jedoch keine Simulationsdaten, sondern fungiert ausschließlich als Proxy, der topologische Informationen über die blockstrukturierte Raumaufteilung zur Verfügung stellt. Unter anderem ermöglicht dieser Ansatz die Verwendung kostengünstiger, diffusionsbasierter Verfahren zur dynamischen Lastverteilung. Alle vorgestellten Algorithmen arbeiten auf Ebene der Blöcke, die aus der Raumaufteilung hervorgehen. Dieses Konzept und seine Umsetzung ermöglichen die Speicherung beliebig gearteter Daten. Das resultierende Software-Framework kann somit für verschiedene Simulationsmethoden genutzt werden, insbesondere sowohl für gitterbasierte als auch gitterfreie Methoden. In der vorliegenden Arbeit werden Strömungssimulationen auf Basis der Lattice-Boltzmann-Method (LBM) behandelt. Hierzu stellt die Arbeit ein neues Parallelisierungsverfahren für die LBM auf nicht-uniformen Gittern vor. Aufgrund ihrer Eignung für hochparallele Simulationen hat sich die LBM zunehmend zu einer alternativen Methode für groß angelegte instationäre Strömungssimulationen entwickelt. Anhand derartiger Strömungssimulationen auf Basis der LBM wird die ausgezeichnete Leistung und Skalierbarkeit der AMR-Implementierung für zwei architektonisch unterschiedliche Petascale-Supercomputer demonstriert. Benchmarks auf einem Blue Gene/QSystem von IBM mit einem Gitter aus 3,7 Billionen Unbekannten, verteilt auf 458.752 Prozesse, bestätigen die Eignung für zukünftige, massiv-parallele Rechensysteme. Auf dem gleichen System ist es der Implementierung der LBM für nicht-uniforme Gitter mit knapp zwei Millionen Threads möglich, fast eine Billion Gitterzellen pro Sekunde zu aktualisieren. Auf einem Supercomputer auf Basis von Intel Xeon-Prozessoren sind Simulationen mit der LBM auf nicht-uniformen Gittern, die gleichzeitig auf allen Knoten des Supercomputers laufen, in der Lage, einen Durchsatz von über eintausend Zeitschritten pro Sekunde zu erreichen. Mit 65.536 Prozessorkernen wird gezeigt, dass es für ein Gitter mit insgesamt 13,8 Milliarden Zellen möglich ist, einen kompletten AMR-Zyklus, einschließlich adaptiver Verfeinerung und dynamischer Lastverteilung, in einer halben Sekunde durchzuführen.

Acknowledgements This thesis would not exist had I not met several people during my years of study who sparked my interest in the field of high performance computing and computer simulation. Also, this thesis would not have been possible without access to some of the world’s fastest supercomputers, many hours of stimulating discussions with my colleagues, and the persistent support of my supervisor, family, especially both my parents, and friends. In particular, I would like to thank Prof. Ulrich Rüde for giving me the freedom to always pursue my own research interests, for giving me the opportunity to visit several international conferences in order to present my work, and for his continuing support that included many helpful and always inspiring conversations. Also, I would like to give special thanks to Christian Godenschwager and Martin Bauer for several years of very close and successful collaboration and the joint development of many parts, including the entire core functionality, of the current version of the waLBerla simulation framework. I would also like to express my gratitude to Klaus Iglberger who, as an engaged tutor during my time as an undergraduate student and as supervisor for my diploma thesis, contributed significantly to sparking my interest in C++ and scientific computing. Similarly, I have greatly benefited from the lectures and exercise classes focused on high performance computing and computer simulation given by Prof. Gerhard Wellein, Georg Hager, and Markus Stürmer. I would also like to thank David Staubach, Ehsan Fattahi, Regina Degenhardt, and Simon Bogner for their assistance regarding the implementation of the lattice Boltzmann method and its extension to nonuniform grids. I am also grateful to Prof. Henning Meyerhenke for valuable discussions about diffusion-based load balancing. Additionally, I would like to thank Matthias Markl, Christoph Rettinger, Sebastian Eibl, Nils Kohl, Michael Kuron, Tobias Schruff, Dominik Bartuschat, Kristina Pickl, Daniela Anderl, Sebastian Kuckuk, Björn Gmeiner, Tobias Preclik, Harald Köstler, Jan Götz, and Christian Feichtinger for supporting me with valuable discussions about many topics related to computer simulation during my time as a research assistant at the university. Similarly, I would also like to thank Frank Deserno and Iris Weiß for their technical and administrative support over the course of these years. Furthermore, I would like to express my gratitude to Prof. Martin Berzins and once more to Prof. Gerhard Wellein and Prof. Ulrich Rüde for taking their time to review my thesis. Since my work would not have been possible without access to large-scale supercomputers, I am also grateful to the Jülich Supercomputing Center and the Leibniz Rechenzentrum in Munich for providing me with access to their computational resources. Florian Schornbaum, February 2018

Contents 1 Introduction 1.1 Motivation . . . . . . . . . 1.2 Focus and Requirements 1.3 Related Work . . . . . . . 1.4 Contribution and Outline 2 The 2.1 2.2 2.3

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

1 1 2 4 7

Lattice Boltzmann Method 9 Principles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 Boundary Treatment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 Grid Refinement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

3 Parallelization Concepts and Realization 3.1 Domain Partitioning . . . . . . . . . . . . 3.2 Unique Block Identification . . . . . . . . 3.3 Support for Arbitrary Simulation Data . 3.4 Implementation Remarks . . . . . . . . . 4 Initialization 4.1 The Setup Procedure . . . . . . . . 4.2 Static Load Balancing . . . . . . . . 4.3 Space Filling Curves . . . . . . . . . 4.4 Storing the Domain Partitioning to 5 The 5.1 5.2 5.3 5.4 5.5

. . . .

. . . . . . . . . File

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

Lattice Boltzmann Method on Nonuniform Grids Grids and Parallel Synchronization . . . . . . . . . . Parallelization Concept . . . . . . . . . . . . . . . . . Implementation . . . . . . . . . . . . . . . . . . . . . . Communication Schemes . . . . . . . . . . . . . . . . Load Balancing Requirements . . . . . . . . . . . . .

6 Dynamic Domain Repartitioning 6.1 Four-Step Procedure . . . . . . . . . . . . . 6.2 Distributed Block-Level Refinement . . . . 6.3 Proxy Data Structure . . . . . . . . . . . . 6.4 Dynamic Load Balancing . . . . . . . . . . 6.5 Data Migration and Refinement . . . . . . 6.6 AMR for the Lattice Boltzmann Method

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . .

15 15 17 19 24

. . . .

27 27 33 37 40

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

45 45 47 48 52 56

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

59 59 60 63 67 78 81

Contents

7 Validation and Benchmarks 7.1 Supercomputing Systems . . . . . . . . . . . . . . . . . . 7.2 The Lattice Boltzmann Method on Nonuniform Grids . 7.3 Performance of Statically Refined Grids . . . . . . . . . 7.4 Adaptive Mesh Refinement . . . . . . . . . . . . . . . . . 7.5 An Example Application . . . . . . . . . . . . . . . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

87 88 89 93 102 117

8 Outlook and Conclusion 121 8.1 Additional Framework Capabilities . . . . . . . . . . . . . . . . . . . . . . . . 121 8.2 Results Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 A Appendix 127 A.1 Simulation of a Vocal Fold Phantom Geometry . . . . . . . . . . . . . . . . . 127 Acronyms

131

Bibliography

133

1 Introduction The introduction of this thesis is divided into four parts. The first part highlights the motivation behind the thesis which is rooted in the field of high performance computing (HPC) and computational fluid dynamics (CFD). Since CFD in general is a broad discipline with a wide variety of applications, Section 1.2 explains the focus of the thesis which is on adding support for grid refinement to an existing software framework that is based on the lattice Boltzmann method (LBM) and developing and implementing a domain partitioning that allows for adaptive mesh refinement (AMR) on current and future extreme-scale supercomputers. Section 1.3 provides an overview of related work on grid refinement for the LBM and software frameworks for large-scale block-structured adaptive mesh refinement (SAMR). The introduction then concludes in Section 1.4 with a description of the contribution of the present work and an outline of the thesis.

1.1 Motivation With the availability of modern computers and the continuing increase in their computational performance, today, the simulation of physical phenomena is an important part in many different sectors of industry. Simulations are used in order to replace experiments that are either more expensive than the simulation or even impossible to perform. They also allow to gain insight into processes that are observed in nature. Using these insights, simulations enable the improvement of current engineering processes and they can help to make predictions about process outcomes and future events. These simulations in, e.g., CFD, mechanics, or astronomy typically require a large amount of computational resources. Still, today, for many applications, there is demand for higher resolutions and/or a shorter time to solution, i.e., larger and faster simulations. In order to achieve an acceptable runtime for simulations with trillions of unknowns, the corresponding applications must rely on massively parallel execution on state-of-the-art supercomputers. For many computational models and methods, parallelization is realized by dividing the simulation domain into multiple parts and distributing these parts evenly in terms of associated workload, memory, or both among all available compute nodes. Depending on the underlying application, the actual size of the space may vary greatly and the simulation domain can span the entire known universe, the volume of all the air above the earth’s surface, or merely a small blood vessel in the human body. Despite these differences in the actual size of the space, using some sort of domain partitioning as a basis for the parallelization strategy can be applied to all of the previous examples.

1

1 Introduction This parallelization approach is also realized within the waLBerla framework [Bau15b, God13, Sch16] which serves as software basis for the thesis and whose HPC capabilities have been significantly extended by the present work. At its core, waLBerla is a general-purpose HPC software framework that is capable of supporting different numerical methods by providing generic, extensible concepts for domain partitioning, input and output, parallelization, and program flow control. Originally, however, waLBerla was specifically created for simulations based on the LBM [Fei09, Fei11]. In the last two decades, due to its suitability for massive parallelization and the incorporation of complex geometries, the LBM has gained popularity as an alternative to classical Navier-Stokes solvers for CFD [Aid10, Che98]. The waLBerla framework as well as many of the other existing codes for the LBM [Des01, Gro13, Has14b, Heu07, LB3D, Palabos, Ran13] are therefore designed for parallel computers where many show excellent scalability, i.e., their performance increases linearly with the number of processors employed. Going beyond just scalability, the carefully crafted architecture-aware implementation of the LBM in waLBerla shows excellent absolute performance which as a result can considerably reduce the time to solution and the power consumption for reaching a given computational objective [God13]. Furthermore, the waLBerla framework allows to model complex flow geometries like the blood flow within parts of the human coronary artery tree with up to 1012 lattice cells on current petascale supercomputers [God13]. Besides a wide variety of implementations of the LBM for different lattice models, collision operators, and target architectures, the waLBerla framework also supports the simulation of gas-fluid mixtures using a free surface model based on explicit interface reconstruction and a volume of fluid approach [Don11, Poh08]. Furthermore, coupling the flow simulation with a rigid body physics engine that is designed for massively parallel systems [Igl10, Pre14] allows for coupled simulations with a huge number of moving and interacting rigid bodies [Göt10, Göt12]. Applications simulated with waLBerla include additive manufacturing processes [Amm14], direct numerical simulations of particulate flows [Bar15, Bog15, Kur16], fully resolved bubbly flows [And14], riverbeds [Ret17, Sch14], and active swimmers [Pic14]. Recently, support for the phase field method was added to the framework, enabling massively parallel simulations of ternary eutectic directional solidification [Bau15a, Höt16]. Extending the waLBerla framework with further capabilities that can allow to significantly reduce the time to solution and enable applications that are currently infeasible is the motivation behind this thesis.

1.2 Focus and Requirements Originally, the waLBerla framework was designed for uniform Cartesian grids and a static, regular domain partitioning into blocks [Fei12]. The focus of this thesis is on adding support for nonuniform grids and a more flexible, less restrictive domain partitioning concept that allows for dynamic changes at runtime. This dynamic repartitioning must include the automatic migration of data and the execution of dynamic load balancing. Additionally,

2

1.2 Focus and Requirements the new data structures and the new domain partitioning scheme must be designed with scalability to millions of processes and trillions of unknowns on future extremescale supercomputers in mind. Besides scalability to massively parallel systems, the implementation of AMR must maintain the excellent performance and the high throughput1 the waLBerla framework achieves on static, uniform Cartesian grids [God13]. If only parts of the simulation domain require high resolution, many advanced methods in CFD rely on grid refinement in order to conserve computational resources in those regions where a high resolution is unnecessary. For a more accurate representation of geometry surfaces and in order to handle high velocity gradients at fluid-obstacle interfaces, these interfaces are typically modeled with a higher resolution than parts of the simulation that are entirely covered with the fluid phase. In wind tunnel simulations, for example, the finest grid is often only found in the surface region of the car or plane inside of the tunnel. For these types of simulation, resolving the entire domain with the finest resolution would lead to a significant increase in runtime, memory usage, and energy consumption. Here, static grid refinement can also help to increase the free space between the object in the tunnel and the boundaries of the simulation domain, thus reducing the influence of domain boundary conditions on the flow around the object while increasing the computational cost only slightly. Consequently, the lower the resolution of the grid towards the domain boundaries, the smaller the overhead for ultimately simulating the flow in a larger domain. For simulations with constantly moving obstacles or simulations that include large regions with unsteady, turbulent flow, however, static grid refinement cannot capture the inherently dynamic behavior of these simulations. Here, AMR can be used to dynamically adapt the resolution of the grid to the current state of the flow. For AMR to work on distributed parallel systems, the underlying data structures must support dynamic modifications to the grid, migration of data between processes, and dynamic load balancing. Even without grid refinement, the ability to dynamically repartition the grid and reassign work to different processes can be beneficial for simulations where the state of grid cells changes over the course of the simulation and the computational cost for updating a cell depends on its current state. Two-phase flow simulations2 where the simulation of one phase is responsible for the major part of the entire runtime are an example. Moreover, the ability to dynamically migrate and reassign data in order to achieve load balance can also prove to be a valuable component for the integration of fault tolerance (see Section 8.1.2). The underlying data structures should impose as few restrictions as possible on the dynamic load balancing algorithm, thus enabling a wide range of different balancing strategies, e.g., load balancing based on space filling curves (SFCs), on graph partitioning, or on diffusion schemes. Particularly, the data structures must allow for load balancing grids with different resolutions separately (see Section 5.5) and for fully distributed load balancing algorithms that neither require global information nor perform a global data exchange. Besides support for distributed algorithms, the data structures themselves must be stored distributedly. For a fixed number of grid cells that are assigned to one 1 2

number of grid cells updated per second bubbles of air rising in water / a droplet of water sliding down an inclined plane

3

1 Introduction process, the per process memory requirement of a simulation must remain unchanged and constant, independent of the total number of active processes. Distributed storage of all data structures is the foundation for scalability to possibly hundreds of millions of processes on future extreme-scale supercomputers. These distributed data structures should also allow for the efficient realization of a checkpoint/restart mechanism that, for example, enables the deliberate splitting of one simulation into multiple runs in order to overcome common time limits on supercomputers. The C++ implementation of the concepts, algorithms, and data structures that are presented in this thesis follows the open/closed software design principle that states that “software entities (classes, modules, functions, etc.) should be open for extension, but closed for modification” [Mar96, Mey88]. The source code of the algorithms and data structures responsible for the static as well as the dynamic domain partitioning does not have to be changed in order to support different, potentially new simulation methods. All core algorithms and data structures rely on common interfaces and implement generic concepts with key parts being customizable via user-provided callbacks. These callbacks allow to adapt the core algorithms and data structures to the specific requirements of the underlying simulation without any need to modify source code in the framework. Moreover, clean interfaces for all algorithms and callbacks enable the export to Python [Bau15b]. Algorithms that can be used for realizing AMR for the LBM on top of these generic data structures are discussed in Section 6.6. The thesis also provides a short outline on how to use the presented dynamic domain repartitioning implementation for a completely different, meshfree simulation method in Section 8.1.3. Many of the existing algorithms and all of the compute kernels written for the original static, regular domain partitioning of the waLBerla framework into blocks rely on this block-structured partitioning of the global grid. Working on these piecewise uniform Cartesian grids enables maximum utilization of the single instruction/multiple data (SIMD) capabilities of modern CPUs without much additional effort due to the structured access pattern on these uniform grids. In order to guarantee backwards compatibility and maximum re-utilization of existing code, the new domain partitioning algorithms including all underlying data structures developed in this thesis are also based on a block-structured partitioning approach. The interface for algorithms that work on these blocks/parts of the domain remains unchanged. As a result, existing compute kernels that exhibit high performance because of hand-crafted, architecture-aware optimizations [God13] are immediately available for simulations that also want to make use of the dynamic repartitioning capabilities (for AMR, fault tolerance, etc.).

1.3 Related Work The discussion of related work is divided into two subsections. The first subsection addresses software frameworks that are designed for AMR on large-scale systems and are also based on an underlying block-structured domain partitioning, while the second subsection lists related work on grid refinement specifically for the LBM.

4

1.3 Related Work

1.3.1 Block-Structured Adaptive Mesh Refinement

Software frameworks for SAMR have been available for the last three decades. Recently, many SAMR codes have been compared in terms of their design, capabilities, and limitations in [Dub14]. All codes covered in this survey can run on large-scale parallel systems, are written in C/C++ or Fortran, and are publicly available. Moreover, almost all codes can, among other approaches, make use of SFCs during load balancing. Some of the SAMR codes are focused on specific applications and methods, while others are more generic and provide the building blocks for a larger variety of computational models. The codes also differ in the extent to which their underlying data structures require the redundant replication and synchronization of meta data among all processes. Meta data that increases with the size of the simulation is often an issue on large-scale parallel systems, and eliminating this need for global meta data replication is a challenge that all SAMR codes are facing. Both BoxLib [Box] and Chombo [Ada15], with Chombo being a fork of BoxLib that started in 1998, are general SAMR frameworks that are not tied to a specific application. Both, however, rely on a patch-based AMR implementation that is subject to redundant replication of meta data across all processes. Another generic framework for SAMR is Cactus [Cactus, Goo03] with mesh refinement support through Carpet [Carpet, Sch04]. According to [Dub14], FLASH [Dub09, Flash] with AMR capabilities provided by the octree-based PARAMESH package [Mac00, Param] was among the first SAMR codes to eliminate redundant meta data replication. Besides FLASH, the authors of [Dub14] conclude that Uintah [Par06, Uintah] has gone the farthest in overcoming the limitations of replicating meta data and adapting the software basis to current and future architectures. [Lui11] describes the need to eliminate globally replicated meta data and presents and evaluates scalable algorithms for the parallel AMR and the dynamic load balancing within Uintah. Uintah employs task-based parallelism with dynamic task scheduling similar to Charm++ [Kal09]. Recently, the developers behind Enzo [Bry14, Enzo], an SAMR code mainy focused on astrophysical and cosmological simulations, have also analyzed that the increasing memory consumption due to data replication is the cause of Enzo’s most significant scaling bottlenecks [Bor12]. A subsequent redesign of the software basis resulted in Enzo-P/Cello [Bor12, Enzop]. Enzo-P/Cello uses a domain partitioning based on a fully distributed forest of octrees similar to the methods studied in this thesis. Moreover, Enzo-P/Cello is built on top of Charm++ and therefore program flow also follows a taskbased parallelism model similar to Uintah. The idea of first partitioning the simulation domain into blocks using an octree approach and later creating Cartesian meshes inside these blocks was recently also adopted by a new software project: ForestClaw [Bur14]. ForestClaw uses the p4est library [Bur11] for domain partitioning, a library that already demonstrated scalability to massively parallel systems and, being based on a distributed forest of octrees implementation, also shares similarities with the approach presented in this thesis. More details on and an in-depth analysis of many of these SAMR codes can be found in the survey in [Dub14] mentioned at the beginning of this subsection.

5

1 Introduction

1.3.2 The Lattice Boltzmann Method on Nonuniform Grids

In order to incorporate grid refinement into the LBM, node-based [Fil98, Kan00, Lag12, Pen06, Töl06, Yu02] and volume-based [Che06, Guz14, Has14b, Neu13, Roh06] approaches have been proposed. In node-based approaches, the distribution functions of the LBM are located at the nodes of the lattice cells. At the interface between two different grid resolutions, fine grid nodes either coincide with or are located exactly halfway between coarse nodes. For the LBM scheme of [Töl06], a parallel extension was proposed in [Fre08]. In node-based approaches, the non-equilibrium part of the distribution functions typically must be rescaled at the interface between different grid resolutions.

In volume-based approaches, the distribution functions of the LBM are located at the center of lattice cells. During refinement, coarse cells are uniformly subdivided into multiple finer cell. As a result, the centers of fine and coarse cells will not coincide. Volume-based grid refinement approaches for the LBM allow formulations that guarantee the conservation of mass and momentum without the need to rescale the non-equilibrium part of the distribution functions. For volume-based grid refinement, [Neu13] and [Has14b] presented parallelization approaches that are suitable for large-scale systems. In order to achieve scalability to these systems, both approaches are based on a tree partitioning of the simulation space. [Has14b] introduces the MUSUBI software that relies on a linearized octree, [Neu13] uses the Peano framework that is based on a more generalized spacetree concept [Bun10]. An octree-based domain partitioning approach was also proposed by [Fie12], however, not for incorporating grid refinement but for better fitting block-based data structures to complex geometries.

Specifically for the LBM, several AMR approaches have been published. [Töl06] employs octrees on a cell level in order to realize cell-based AMR. [Yu09] describes an SAMR implementation based on the PARAMESH package. Other SAMR approaches for the LBM are presented in [Neu13] and [Fak14], which both rely on an octree-like domain partitioning. [Fak14], however, solely focuses on the AMR methodology and does not provide information about parallelization capabilities of the underlying code. Recently, the AMR scheme of [Fak14] was further extended to support two-phase flow in [Fak16]. First results of a cell-based AMR implementation built on top of the p4est library are presented in [Lah16]. This implementation, however, does not yet support levelwise load balancing as it is necessary for balanced simulations in the context of AMR schemes for the LBM. A priori static grid refinement in parallel simulations based on the LBM is studied in [Fre08, Has14b, Sch11]. Other popular simulation codes based on the LBM such as Palabos [Lag12, Palabos], OpenLB [Fie12, Heu07, OpenLB], LB3D [Gro11, LB3D], HemeLB [Gro13], HARVEY [Ran13], or LUDWIG [Des01] are currently at a state that they either do not support grid refinement, only support grid refinement for 2D problems, or have not yet demonstrated large-scale simulations on nonuniform grids.

6

1.4 Contribution and Outline

1.4 Contribution and Outline The contribution of this thesis is a software framework for a forest of octrees-like blockstructured domain partitioning that is based on fully distributed data structures and algorithms that are the foundation for scalability to extreme-scale parallel systems. These algorithms and data structures are capable of dynamic repartitioning and rebalancing at runtime and meet all of the requirements discussed in Section 1.2. For the dynamic repartitioning process, the thesis suggests the use of a temporary, lightweight, shallow copy of the core data structure that acts as a proxy and only contains topological information without any additional computational data. This lightweight proxy enables inexpensive iterative load balancing schemes, e.g., fully distributed diffusion-based load balancing algorithms. Using this framework, the thesis presents a volume-based refinement approach for the LBM that consists of a distributed memory parallelization of the algorithm described in [Roh06] combined with the interpolation scheme suggested by [Che06]. The implementation is demonstrated to work on massively parallel supercomputers with both statically and dynamically refined grids. To the best knowledge of the author, the total number of cells that can be handled and the overall performance achieved in simulations with as well as without AMR significantly exceed the data published for the LBM on nonuniform grids to date [Fre08, Has14b, Lah16, Meh11, Neu13, Sch11, Yu09]. The remainder of this thesis is organized as follows. Since simulations based on the LBM are used to demonstrate the capabilities and performance of the presented algorithms and data structures, Chapter 2 provides a brief introduction to the LBM and its grid refinement extension. In Chapter 3, insights follow into the basic concepts of the framework’s generic parallelization and domain partitioning strategies, including details about their implementation. Next, Chapter 4 describes the initialization process for simulations that are based on the data structures introduced in the previous chapter. A special focus is put on the initial, static load balancing and the utilization of SFCs. Chapter 5 contains details about the distributed memory parallelization of an existing grid refinement algorithm for the LBM and its implementation on top of the framework’s generic data structures. A description of all algorithms and concepts that lead to a four-step dynamic domain repartitioning approach then follows in Chapter 6. Here, special focus is put on the utilization of a temporary, shallow copy of the core data structure and its implications especially on the dynamic load balancing algorithm. Consequently, the chapter also provides a detailed discussion about the applicability, the advantages, and the disadvantages of and the differences between dynamic load balancing algorithms based on SFCs and an algorithm that is based on a fully distributed diffusion approach. The chapter concludes with information on how AMR for the LBM is implemented based on the dynamic repartitioning capabilities introduced earlier in the chapter. Finally, Chapter 7 verifies the correctness of the parallelization of the grid refinement algorithm for the LBM for Couette and Poiseuille flow and presents weak and strong scaling benchmarks that demonstrate the performance of the implementation on two petascale supercomputers. Moreover, comprehensive benchmarks for the dynamic domain repartitioning implementation show the applicability and performance of the approach for massively parallel simulations. The thesis then concludes in Chapter 8 with a final summary of the presented work and an

7

1 Introduction outlook on future extensions and applications such as a checkpoint/restart mechanism, a non-invasive fault tolerance scheme, or the support for meshfree simulation methods. Parts of this thesis, particularly about the LBM, its static grid refinement extension, and the parallelization of the corresponding algorithm, have been first published by the same author in “Massively Parallel Algorithms for the Lattice Boltzmann Method on Nonuniform Grids” in the SIAM Journal on Scientific Computing 38-2 (2016) [Sch16], published by the Society for Industrial and Applied Mathematics (SIAM). A preprint of the article is available at arXiv.org [Sch15]. Consequently, large parts of Chapters 2 and 5 and Sections 1.3.2, 7.2, and 7.3 are taken from this article. Similarly, parts of Chapter 6 and Sections 1.3.1, 7.4, and 7.5 have been first published by the same author in “Extreme-Scale Block-Structured Adaptive Mesh Refinement” in the SIAM Journal on Scientific Computing 40-3 (2018) [Sch18]. A preprint of this article is available at arXiv.org [Sch17]. The reproduction of parts of both articles in derivative work by the same author is explicitly granted by the “SIAM Consent to Publish”3 . All the parts of the thesis that are taken from [Sch16] or [Sch18], either identical in content, extended, more detailed, or updated, are indicated accordingly at the beginning of the corresponding chapters.

3

8

in the versions as of March 2016 and April 2018 signed by the author

2 The Lattice Boltzmann Method This chapter is divided into three sections. While Section 2.1 describes the basic principles of the LBM, Section 2.2 explains how domain boundaries are treated in the implementation of the present work. The final Section 2.3 then illustrates the implications of nonuniform grids on the relaxation parameters and the time stepping scheme of the LBM. In summary, the current chapter is mostly about the mathematical principles of the LBM and its grid refinement extension while a detailed discussion about the implementation and parallelization of the involved algorithms follows in Chapter 5. Most of the chapter was first published in [Sch16] and is identical in content to parts of this article, with some minor additions in all three sections in order to provide a more detailed description of the method. Consequently, Figure 2.1 was extended with an illustration of the D3Q27 lattice model and Figure 2.2 was added for this thesis.

2.1 Principles The LBM is typically based on a Cartesian discretization of the simulation space and uses an explicit time stepping scheme that is well suited for extensive parallelization due to its data locality. In each time step, information is only exchanged between neighboring grid cells. Each cell stores N particle distribution functions (PDFs) fα ∶ Ω × T ↦ [0, 1],

α = 0, ..., N − 1,

where Ω ⊂ R3 and T = [0, tend ] are the physical and time domain, respectively. There exist various different lattice models for the LBM, some of the most common being D2Q9 [Qia92] for two-dimensional and D3Q19 [Qia92] and D3Q27 [He97b] for threedimensional simulations (see Figure 2.1). All benchmarks and simulations in Chapter 7 employ the D3Q19 model, with one exception: The simulation in Section 7.5 is based on the D3Q27 model. In general, the lattice Boltzmann equation can then be written as fα (xi + eα ∆t, t + ∆t) − fα (xi , t) = Cα (f ), with xi denoting the center of the i-th cell in the discretized simulation domain, eα denoting the discrete velocity set {eα ∣α = 0, . . . , N − 1}, t denoting the current time step, ∆t denoting the time step size, and Cα (f ) denoting the collision operator of the LBM. Algorithmically, the lattice Boltzmann equation is typically separated into a collision (2.1a)

9

2 The Lattice Boltzmann Method

D3Q27

D3Q19

D2Q9

Figure 2.1: Schematic visualization of one lattice cell for the three-dimensional D3Q27 and D3Q19 and the two-dimensional D2Q9 model. For D3Q19, no data is exchanged across cell corners, but only across edges and faces. While all simulations in Chapter 7 are performed in 3D, all later illustrations in this and the following chapters are in 2D in order to facilitate the schematic visualization of the methods.

and a streaming step (2.1b) f˜α (xi , t) = fα (xi , t) + Cα (f ), fα (xi + eα ∆t, t + ∆t) = f˜α (xi , t),

(2.1a) (2.1b)

with f˜α denoting the post-collision state of the distribution function. During streaming, PDF values are exchanged only between neighboring cells while the collision step is a local operation in each cell (see Figure 2.2). The three most commonly used collision schemes are the single-relaxation-time (SRT) model [Bha54], the two-relaxation-time (TRT) model [Gin08a, Gin08b], and the multiple-relaxation-time (MRT) model [d’H94, d’H02]. For the SRT model, the collision operator is given by Cα (f ) = −

∆t (fα (xi , t) − fαeq (xi , t)) = −ω(fα (xi , t) − fαeq (xi , t)), τ

where τ is the relaxation time, ω ∈]0, 2[ the dimensionless relaxation parameter, and fαeq (xi , t) is the equilibrium distribution. The relaxation time τ and the relaxation parameter ω are related to the kinematic viscosity ν with ∆t c2 1 1 ) = ( − ) ∆t, (2.2) 2 3 ω 2 √ where c = ∆x/∆t is the lattice velocity and cs = c/ 3 is the speed of sound for an isothermal fluid [Aid10, Che98]. For the incompressible LBM [He97a], the equilibrium distribution function can be calculated according to ν = c2s (τ −

fαeq (xi , t) = wα [ρ(xi , t) + ρ0 (

3eα ⋅ u(xi , t) 9(eα ⋅ u(xi , t))2 3u(xi , t)2 + − )] , c2 2c4 2c2

(2.3)

with wα denoting the lattice model-specific weighting factor corresponding to eα and ρ0 denoting the reference density. The incompressible LBM is used for all the simulations in this thesis. ρ(xi , t) and u(xi , t) are the fluid density and the fluid velocity, respectively.

10

2.1 Principles

collision

streaming

Figure 2.2: Schematic visualization of one collision step (2.1a) that modifies the PDFs locally and one streaming step (2.1b) that copies PDF values to neighboring cells.

They are calculated from the first two moments ρ(xi , t) = ∑ fαeq (xi , t),

(2.4)

α

u(xi , t) =

1 eq ∑ eα fα (xi , t). ρ0 α

(2.5)

For the TRT model, the distribution functions are split into a symmetric (even) and an asymmetric (odd) part: 1 1 fα± = (fα ± fα¯ ) , fαeq± = (fαeq ± fα¯eq ), 2 2 with α ¯ denoting the inverse direction of α and fα+ /fαeq+ and fα− /fαeq− denoting the symmetric and asymmetric part of the distribution function, respectively. The corresponding collision operator is given by Cα (f ) = −λe (fα+ − fαeq+ ) − λo (fα− − fαeq− ), with λe ∈]0, 2[ and λo ∈]0, 2[ denoting the even and the odd relaxation parameter of the TRT model. If ∆t , λe = λo = ω = τ the TRT model coincides with the SRT model. The kinematic viscosity is related to λe just like it is related to ω when using the SRT model. Hence, for the TRT model, ν = c2s (

1 1 − ) ∆t. λe 2

(2.6)

For the MRT model, the collision operator is described by ˆ C(f ) = −M−1 SM(f − f eq ), where M is a matrix that linearly transforms the distribution functions f to the velocity ˆ is a diagonal matrix containing the relaxation factors. By choosing S ˆ moments and S accordingly, the MRT model can be reduced to both TRT and SRT [Luo11]. In general, the MRT model is the most flexible, the most accurate, and numerically the most stable but computationally also the most expensive of the three models. The TRT

11

2 The Lattice Boltzmann Method model is more accurate and stable than the SRT model and despite being computationally more expensive in terms of operations required, on modern hardware, highly optimized compute kernels for the TRT model can reach the same performance than compute kernels for the SRT model that have been equally optimized [God13]. Recently, Geier et al. have proposed another collision operator that also uses multiple relaxation rates but is based on relaxing cumulants instead of moments [Gei15]. All the simulations in this thesis use the TRT collision model. In general, however, the grid refinement implementation presented in Chapter 5 is independent of the collision operator and thus works with an SRT, TRT, MRT, as well as a cumulants-based collision scheme. External forces can be incorporated into the LBM by including an additional term in the collision step (2.1a) f˜α (xi , t) = fα (xi , t) + Cα (f ) + ∆tFα (xi , t). There exist various different force models for the LBM [Moh10]. Depending on the force model, the calculation of the fluid velocity (2.5) and/or the velocity terms in (2.3) must be adapted. A constant body force F caused by a globally constant acceleration a can be incorporated into the incompressible LBM with [Luo98] Fα (xi , t) = 3wα

eα ⋅ a eα ⋅ F = 3wα ρ0 2 . 2 c c

(2.7)

Non-constant body forces can be incorporated with [Luo98] Fα (xi , t) = 3wα (

eα − u(xi , t) eα ⋅ u(xi , t) +3 eα ) ⋅ F(xi , t). c2 c2

In both cases, when calculating the fluid velocity independently of the collision step, an additional term must then be added to (2.5) [Gin08b], thus u(xi , t) =

∆tF(xi , t) 1 eq . ∑ eα fα (xi , t) + ρ0 α 2ρ0

For the evaluation of the equilibrium distribution (2.3) and Fα (xi , t) during the collision step, however, the calculation of u(xi , t) must not be changed, meaning an unmodified version of (2.5) must be used.

2.2 Boundary Treatment Domain boundaries and obstacles within the fluid must receive special treatment. For this purpose, every cell of the grid is either marked as a fluid cell, a boundary/obstacle cell, or a cell that is outside of the simulation domain and hence does not need to be considered during the simulation at all. Depending on which boundary marker is set for a certain cell, a different boundary treatment is performed during a post-collision/prestreaming step. For this boundary treatment, the thesis relies on the no slip, free

12

2.3 Grid Refinement

(1)

(2)

(3)

Figure 2.3: Boundary cells are marked in gray (1). After the collision step and prior to streaming, the boundary treatment algorithm stores PDF values inside the boundary cells itself (2). During streaming, no special treatment has to be performed for cells next to the boundary: All fluid cells can pull PDF values from all of their neighbors (3), even if the neighboring cell is a boundary cell.

slip, velocity, and pressure boundary conditions already implemented in the waLBerla framework [Göt12, Pic16]. During the boundary treatment after the collision step, PDF values are calculated depending on the corresponding boundary conditions and are saved inside the boundary cells. During the subsequent streaming step, these values are pulled by their neighboring fluid cells (see Figure 2.3). As a result of this approach, the algorithms performing the collision and the streaming step can remain unchanged since during collision and streaming, no special operations are necessary near boundary cells. Because of boundary treatment being a pre-streaming operation, even existing fused stream-collide compute kernels do not require any changes. Periodic boundaries are treated by copying PDF values from one side of the simulation domain to the other, and vice versa. For parallel simulations, the communication step that normally synchronizes neighboring subdomains that reside on different processes is used to process periodic boundaries by establishing a communication channel between opposing sides of the simulation domain.

2.3 Grid Refinement In Chapter 5, the thesis presents an efficient, highly optimized parallelization of the algorithm proposed by [Roh06] for the LBM on nonuniform grids. The algorithm presented in [Roh06] relies on a 2:1 balance between neighboring grid cells at the interface between two grid levels, thus ∆xL ∆x0 ∆xL+1 = ⇒ ∆xL = L , 2 2 with L denoting the grid level and L = 0 corresponding to the coarsest level. The jump in L between a cell and all of its neighboring cells is at most 1. A nonuniform grid conforming with these features is illustrated in Figure 3.1. Moreover, the algorithm applies acoustic scaling where the time step ratio is proportional to the grid spacing ratio [Has14a, Sch11]. Thus, due to the grid spacing ratio of 2:1, twice as many time steps must be executed on level L√+ 1 compared to level L. Hence, ∆tL = ∆t0 /2L . As a result, the speed of sound cs = c/ 3 with c = ∆x/∆t remains constant on each grid level. The kinematic viscosity ν

13

2 The Lattice Boltzmann Method (2.2) must also remain constant. In order to keep ν constant, the dimensionless relaxation parameter ω must be properly scaled to each grid level, thus 1 1 1 1 − ) ∆tL = c2s ( − ) ∆t0 ωL 2 ω0 2 2ω0 , ⇒ ωL = L+1 2 + (1 − 2L )ω0 2K+1 ωK or more generally for two levels K and L ωL = L+1 . 2 + (2K − 2L )ωK νL = ν0 ⇒ c2s (

(2.8a) (2.8b)

Since for the TRT model, the kinematic viscosity is related to λe just like it is related to ω when using the SRT model (2.6), (2.8a) and (2.8b) are also used for scaling λe to different grid levels. In order to choose λo , the thesis uses Λeo = (

1 1 1 1 − )( − ), λe 2 λo 2

(2.9)

with the “magic” parameter Λeo as suggested in [Gin08b]. The thesis proposes to keep Λeo constant on all levels and scale λo on each level accordingly, thus λo,L =

4 − 2λe,L . 2 + (4Λeo − 1)λe,L

(2.10)

Typically, implementations of the LBM are formulated in a dimensionless lattice space where the lattice cell size ∆x∗ = ∆x/∆x, the lattice time step size ∆t∗ = ∆t/∆t, and the lattice reference density ρ∗0 = ρ0 /ρ0 are all equal to 1. This approach is also realized by the implementation used in this thesis. In order to set up simulations and evaluate the results, physical quantities must then be transformed from physical space into dimensionless lattice units, and vice versa [Göt12, Lat08]. Consequently, when applying a force caused by an acceleration a, the parametrization/normalization of the acceleration is used to calculate ∗ the level-dependent lattice acceleration aL (in dimensionless lattice units) with ∗ aL =a

14

(∆tL )2 ∆xL

∗ ⇒ aL =

a0∗ 2L

or more generally

∗ ∗ . aL = 2K−L aK

(2.11)

3 Parallelization Concepts and Realization This chapter introduces the core concepts of the domain partitioning and presents details about their subsequent implementation in C++. Consequently, the chapter starts with an introduction of the domain partitioning principles in Section 3.1. Parts of this section, including individual parts of Figures 3.1 and 3.2, have been first published by the same author in [Sch16]. Section 3.2 then presents a unique identification scheme for the blocks that result from the domain partitioning described in the previous section. Details about the realization of the framework’s support for arbitrary, potentially user-defined simulation data follow in Section 3.3. The chapter then concludes in Section 3.4 with general remarks about the implementation of the core data structures. In summary, the chapter introduces the software architectural building blocks that provide the foundation for the algorithms in all the following chapters.

3.1 Domain Partitioning As mentioned in Section 1.2, the work of this thesis relies on a partitioning of the simulation domain into blocks. Reasons are backwards compatibility with algorithms written for the framework’s original regular domain partitioning and the high performance enabled by the structured access patterns on piecewise uniform grids. These blocks can be assigned arbitrary data, meaning each block can store an arbitrary number of different C++ data structures. These can be classes provided by the framework, other user-defined classes, containers of the C++ standard library, etc. A detailed description of this mechanism that allows support for arbitrary data follows in Section 3.3. In a parallel environment, one block is always assigned to exactly one process. By assigning a block to a specific process, this process becomes responsible for the part of the simulation space that corresponds to this block. One block cannot be assigned to multiple processes, but multiple blocks can be assigned to the same process. As a consequence, the blocks are the smallest entity that can be used for workload distribution, meaning load balancing is achieved by distributing all blocks to all available processes. The implementation of the domain partitioning corresponds to a forest of octrees-like partitioning of the simulation space that results from further subdividing the blocks of the initial regular domain partitioning into eight equally sized smaller blocks (see Figure 3.1). As a result, the simulation domain is still partitioned into blocks. However, blocks now correspond to leaves of the forest of octrees (see Figure 3.2). Since the implementation enforces 2:1-balanced octrees, the resulting block structure perfectly

15

3 Parallelization Concepts and Realization

Figure 3.1: Blocks of the initial regular domain partitioning can be further subdivided into equally sized smaller blocks: 2 × 2 = 4 in this 2D illustration and 2 × 2 × 2 = 8 in 3D simulations. Each block can be assigned a grid of the same size (last picture). As a result, transitions between different grid resolutions only occur on the boundary of blocks. For the massively parallel algorithm for the LBM on nonuniform grids, exactly this kind of grid allocation strategy is used. However, the core data structures also support arbitrarily sized grids for each block, enabling different grid structures for other kinds of simulations. Also, the blocks do not have to be cubes but can be any rectangular, similar cuboids.

Figure 3.2: Schematic 2D illustration of the domain partitioning introduced in Figure 3.1. The largest blocks, i.e., the blocks of the initial regular decomposition, act as roots for individual octrees. As such, they create a forest of octrees visualized on the left (note that quadtrees in this 2D illustration correspond to octrees in 3D). During simulation, this forest of octrees is not stored explicitly, but it is defined implicitly by assigning each block a unique block ID that allows to reconstruct the block’s exact location within the forest of octrees. Additionally, every block stores a pair of type for every neighbor, thereby creating a distributed adjacency graph for all blocks (illustrated on the right).

mirrors the 2:1-balanced grid structure required for the refinement scheme for the LBM described in Section 2.3 and Chapter 5. Details about the construction of this structure and its initial balancing follow in Chapter 4. Moreover, Chapter 6 will show that if blocks dynamically split or merge, 2:1 balance can be maintained by performing local communication only, without the need for global synchronization. As such, the 2:1-balanced octree-like partitioning proves to be a suitable domain partitioning scheme for dynamic, massively parallel simulations. An important feature of the implementation is the fact that every block is aware of all of its spatially adjacent neighbor blocks, effectively creating a distributed adjacency graph for all blocks (see Figure 3.2). As a result, not only algorithms that are typically associated with octrees but also algorithms that operate on more general graphs can be executed. The spatial partitioning of the simulation domain, however, geometrically always corresponds to a forest of octrees. Parent nodes or parent-child relationships do not need to be saved explicitly, the tree structure is implicitly defined by globally unique block identifiers (IDs). They allow to reconstruct a block’s exact location within the forest of octrees. A detailed explanation of this block identification scheme follows in the next Section 3.2. During the simulation, each process only knows about blocks that are stored locally and blocks that are neighboring these process-local blocks. For all non-local neighbor blocks,

16

3.2 Unique Block Identification this knowledge is only comprised of the globally unique block ID and a corresponding process rank. Generally, processes have no knowledge about blocks that are not located in their immediate process neighborhood. As a consequence, the memory usage of a particular process only depends on the number of blocks assigned to this process, and not on the global number of blocks that are available for the entire simulation. This kind of fully distributed data structure is the foundation for scalability to massively parallel systems. If two processes are assigned the same number of blocks with the same amount of data that is associated with each block, the memory requirements for both processes are identical and remain unchanged, regardless of whether the simulation runs with hundreds, thousands, or millions of processes. In addition to process-local blocks and information about their immediate neighbors, every process also stores general information about the domain partitioning, including, for example, the axis-aligned bounding box (AABB) that encloses the entire domain and the number of “root” blocks that result from the initial regular domain decomposition. Further details on the actual implementation of the corresponding data structures are presented in Section 3.4.

3.2 Unique Block Identification The forest of octrees-like partitioning discussed in the previous section is more complex than a completely regular, uniform decomposition of the domain, but still follows a well-defined, structured pattern that proves to be advantageous compared to a fully unstructured partitioning scheme. This well-defined, structured pattern allows to encode a block’s exact location and size in a single unique ID. Different space-saving, bitwise encoding approaches for element identification within an octree or a forest of octrees are already in use by other frameworks and are presented in [Bor12, Bur11, Rol12]. In the present work, the block ID is represented by a single number that corresponds to a particular bit stream. This bit stream consists of three parts: a single marker bit that constitutes the most significant bit, followed by an octree index that identifies a specific root block, followed by branch indices that describe the descent within the selected octree from its root down to a leaf that corresponds to a block. Every branch index consists of three bits that encode the associated octant in zyx order, with the x-bit being the least significant bit. If a block is further divided into eight smaller blocks, the three bits of the additional branch index are appended as least significant bits to the block IDs of the new blocks. The octree index that identifies a specific root block always consists of a fixed number of bits and results from a row-major/x-major ordering of all root blocks. The number of bits required for representing the octree index is determined by the total number of available root blocks. Leading zeros guarantee a fixed width for the octree index. The construction of the block ID, including octree index ordering and branch index representation, is illustrated in detail in Figure 3.3. Since the number of root blocks in x-, y-, and z-direction and the AABB that encloses the entire domain are fixed and known on every process, a block’s level and its exact location can be determined from its block ID as follows. First, a bitwise binary search combined

17

3 Parallelization Concepts and Realization 100 (4) 3 (011)

4 (100)

5 (101)

0 (000)

1 (001)

2 (010)

octree indices

000 001 010 011 000 001 010 011

y

000 x

011 001

000 001 010 011 000 001 010 011

001 (1) y

010 (2)

000 001 010 011 x

000 001 010 011

000 001 010 011

branch indices 010

101 (5)

1011 (1 / 011) 1001011010 (1 / 001 / 011 010) 1001000 (1 / 001 / 000)

000 001 010 011

1101010 (1 / 101 / 010) 1101000000 (1 / 101 / 000 000) 1010001 (1 / 010 / 001)

Figure 3.3: Since in this example the domain is decomposed into six root blocks, the octree index consists of three bits and ranges from 000 to 101 (picture on the top left). On the bottom right, block IDs for six different blocks are shown. In brackets, the construction of every block ID is illustrated in terms of “marker bit / octree index / branch indices”. While the number of bits used for the octree index is fixed for a given partitioning, the number of branch indices and hence the number of bits required for these branch indices depends on the level/size of a block. Root blocks (blocks on level 0) only consist of the marker bit and an octree index. For this 2D illustration, the z-bit in all branch indices is always equal to zero and blocks are subdivided into four instead of eight smaller blocks compared to the 3D implementation.

with a 8-bit lookup table is used for identifying the most significant bit [And05] and hence determining the number of significant bits (#significant-bits). Given the number of bits used for representing the octree index (#octree-index-bits), the number of branch indices can then be calculated with #branch-indices =

#significant-bits − 1 − #octree-index-bits 3

and the block ID can be decomposed into its individual parts using bitwise operations. The level of a block is identical to the number of branch indices that are contained in its block ID. Moreover, by knowing a block’s exact location, the AABB that encloses the block can be easily constructed from the AABB that encloses the entire domain. Consequently, only by knowing the IDs of potentially neighboring blocks, their exact spatial relations can be reconstructed. Ultimately, parts of an octree, an entire octree, or even the entire forest of octrees can be reconstructed from a set of block IDs. For storing block IDs, two different implementations are available. The first implementation uses built-in integral data types of C++. The fixed size of these built-in data types, however, limits the maximal number of levels that can be encoded. Therefore, a second implementation based on dynamic bitsets is also available. This implementation does not impose any restrictions on the number of available bits. However, operations on these dynamically growing bitsets are more expensive than performing the same operations on built-in integral data types. Eventually, simulations must select one of the two implementations at compile time. Independent of the actual implementation, a memory-efficient, space-saving representation

18

3.3 Support for Arbitrary Simulation Data that avoids as many leading zeros as possible is employed for the serialization of block IDs. Again, two different implementations exist for this serialization. One approach is to only use as many bytes as are needed to store all significant bits of a block ID, plus one additional leading byte for encoding the number of bytes that follow. As a result, the number of bytes required for the serialization of a block ID depends on the block ID’s number of significant bits. Consequently, even if an 8-byte/64-bit integral data type would be used for storing the block IDs in Figure 3.3, only three bytes would be required for serializing the block ID 1132, and only two bytes would be required for serializing 152 113210 = 10010110102 → 00000010 00000010 01011010 15210 = 11010102 → 00000001 01101010. Another approach for the serialization is to always use as many bytes as are needed for storing the largest possible block ID of a given domain partitioning. The current depth of the forest (= maximal depth of any octree) is known on all processes. This information is sufficient in order to determine the minimal number of bytes that is enough to hold any block ID of the current domain partitioning. For the domain partitioning in Figure 3.3, for example, two bytes would be used for the serialization of every block ID. Hence, the serialization of block IDs 1132 and 152 would results in 113210 = 10010110102 → 00000010 01011010 15210 = 11010102 → 00000000 01101010. The disadvantage of using a fixed number of bytes for the serialization is that in large simulations with blocks on many different levels, block IDs for blocks on low levels with only few significant bits may require more memory during serialization as compared to the leading byte encoding approach, while the advantage of using a fixed number of bytes is the memory saved due to the general absence of a leading byte. Ultimately, a memory-efficient serialization format for block IDs proves to be beneficial when storing the entire domain partitioning to a file (see Section 4.4) or when communicating block IDs between processes, either because they are used for addressing sender and receiver blocks or because they are sent to other processes in order to inform them about the current domain partitioning (see Chapter 6).

3.3 Support for Arbitrary Simulation Data As already mentioned in previous sections, the blocks that result from the domain partitioning act as containers for an arbitrary amount of data, and every data record can be of arbitrary type. In simulations with the LBM, the blocks mainly store the Cartesian grids that contain PDF values for every grid cell. Often, each block also stores an additional grid that contains one integral value for every cell. This integral value typically encodes the current state of the cell (fluid or boundary of a certain type). The Cartesian grid is usually represented by a four-dimensional array that can be switched

19

3 Parallelization Concepts and Realization

C++ Code Snippet 3.1: For data that is supposed to be registered at blocks, the following interface must be implemented and an instance of this implementation must be passed to the block storage data structure. An example implementation of this interface is outlined in Code Snippet 3.2. Example usage is illustrated in Code Snippet 3.3. Please note that this and the following C++ code snippets rely on compiler support for C++11, sometimes C++14. If C++11/C++14 support is not available, raw pointers or smart pointers from external libraries [Boost] must be used instead of the smart pointer functionality from the standard library. 1 2 3 4 5

template < typename T > class Blo ckData Handli ng { public : typedef T value_type ;

6

virtual ~ Block DataHa ndling () {}

7 8

virtual std :: unique_ptr initialize ( const Block & ) = 0;

9 10

virtual void serialize ( const Block & , const BlockDataID & , Buffer & ) = 0; virtual std :: unique_ptr deserialize ( const Block & , Buffer & ) = 0;

11 12 13

};

between a structure of arrays and an array of structures storage format and may provide additional functionality like mechanisms for aligned memory allocation and convenient access to ghost/halo layers. For rigid body physics simulations, every block typically stores a list1 that contains (references to) all bodies whose center is located within the AABB of the block. For coupled simulations, blocks will store both Cartesian grids for the fluid simulation and lists of rigid bodies for the rigid body mechanics. Data can not be added manually by explicitly registering objects/values at blocks, but is added implicitly by registering data handling objects at the block storage data structure. This distributed data structure that maintains all blocks then iterates all these blocks and uses the data handling objects in order to allocate data for every block. Consequently, data handling objects determine how data is created for a specific block. The interface for such data handling objects is outlined in Code Snippet 3.1. Once passed to the block storage data structure, the member function initialize is used for the allocation of the data. Hence, this function must return newly allocated data that is supposed to be stored at the block that is passed to initialize. This data is then managed by the block storage data structure. Since the block that is supposed to store the data is passed to initialize, access to general properties of the block (level, AABB, etc.) and other data that is already stored at the block is also available during the data’s initial allocation. Additionally, the interface also contains two member functions serialize and deserialize that must be implemented and are used by the framework in order to store the data into and restore the data from a binary buffer. This ability of the framework to serialize the data stored at a block is an essential requirement for the algorithm that migrates blocks between processes during the dynamic domain repartitioning procedure (see Chapter 6). Furthermore, serializing block data is also an essential part of both the checkpoint/restart mechanism and the support for resilience (see Chapter 8). For the serialization, the 1

For the actual implementation in C++, usually a std::vector is the container of choice.

20

3.3 Support for Arbitrary Simulation Data framework guarantees to process all data in the same order as it was initially added, meaning both functions serialize and deserialize are executed for all block data in the order of initial registration. Only when the data structure that maintains all blocks is destroyed, typically at the end of a simulation, the order of deallocation of block data is the reverse of the order of initial allocation, thereby following the C++ paradigm for construction/destruction order. If certain block data does not carry any additional information, for example because its content can be reconstructed from other block data2 , the serialization can be omitted. Consequently, in case certain block data can be restored without knowledge about its previous state, the implementation of serialize may remain empty and the implementation of deserialize is often very similar, or even identical, to the implementation of initialize. As a result, the amount of block data that must be exchanged (during communication), additionally kept in memory (for support for resilience), or stored on the disk (for checkpoint/restart) can be reduced without a loss of information. An example implementation of the interface from Code Snippet 3.1 is outlined in Code Snippet 3.2. Here, the data that will eventually be stored at every block is an std::vector that contains elements of some generic type T. The implementation of initialize dynamically allocates memory for the std::vector via std::make_unique. The initial size of this std::vector is set when the block data handling object is created in Line 7 and is therefore identical for every block. As a result, the allocation in Line 11 is independent of the particular block that will eventually store the std::vector. Hence, the parameter passed to initialize is not needed. After allocation, the newly created std::vector is returned by initialize. Since the framework will call this function for every block, the framework also takes care of actually registering the std::vector that is returned by initialize at the blocks. For serialization, the implementation assumes that the C++ stream operators are already implemented for the binary buffer and objects of type T (see Lines 21 and 36) and values of type size_type (see Lines 19 and 33). Code Snippet 3.3 illustrates how data is added to the blocks. The example also demonstrates how this data is accessed afterwards. First, an instance of a class derived from the interface BlockDataHandling must be passed to the block storage data structure. Code Snippet 3.3 assumes that this data structure is represented by the object blocks. Consequently, the code snippet starts with passing an instance of VectorBlockData to blocks in Line 4. As a result, the framework allocates an std::vector with an initial size of K elements for every block. Internally, these std::vector containers are immediately added to the blocks and a handle of type BlockDataID that must be used to later reference and access the data is returned. In Line 6, more data, this time an std::vector that stores double precision floating point values, is added to the blocks. Again, a corresponding handle of type BlockDataID is returned. The rest of Code Snippet 3.3 then illustrates how to access the data. Lines 11 and 15 demonstrate how the data is retrieved from a block by specifying the correct data type and passing the handle 2

Simulations with the LBM often store additional grids for macroscopic quantities like the velocity in order to provide fast access to those quantities, even though macroscopic quantities can be calculated from the PDF values stored in the main computational grid (see (2.4) and (2.5)).

21

3 Parallelization Concepts and Realization

C++ Code Snippet 3.2: Example implementation of the interface introduced in Code Snippet 3.1 for block data of type std::vector. For an example usage, see Code Snippet 3.3. The utilization of std::make_unique in Line 11 requires support for C++14. For code that must work with older C++ compilers, see the comment about smart pointers in the caption of Code Snippet 3.1. Moreover, without support for C++11, range-based for loops (Lines 20 and 35) must be exchanged with traditional loops. 1 2 3 4 5

template < typename T > class VectorBlockData : public BlockDataHandling < std :: vector > { public : typedef std :: vector :: size_type size_type ;

6 7

VectorBlockData ( const size_type initialSize ) : N ( initialSize ) {}

8 9 10 11 12

std :: unique_ptr < std :: vector > initialize ( const Block & ) // parameter { // not needed return std :: make_unique < std :: vector >( N ); }

13 14 15 16 17 18 19 20 21 22 23 24 25 26

void serialize ( const Block & block , const BlockDataID & id , Buffer & buffer ) { const std :: vector * v = block . getData < std :: vector >( id ); if ( v ) { buffer size (); for ( const auto & x : * v ) buffer (); size_type size (0); buffer >> size ; v - > resize ( size ); for ( auto & x : * v ) buffer >> x ; return v ; }

39 40 41 42

private : size_type N ; };

that was returned upon registration of the data. For access to the data, a raw pointer is returned. Internally, the dynamically allocated data is maintained by the block and kept within an std::unique_ptr that was passed to the block during allocation (see return type of initialize and deserialize in Code Snippet 3.1) and indicates the block’s sole ownership and full responsibility for the proper deallocation of the data. The end of the code snippet illustrates a convenient way of saving block data to a file by only passing the block data handle to the block storage data structure and specifying a file name. The block data handling object that was passed to the data structure in Line 4

22

3.3 Support for Arbitrary Simulation Data

C++ Code Snippet 3.3: Example of how arbitrary data is first added to and then accessed at the blocks. The object blocks represents the distributed data structure that maintains all blocks. The example code also uses the block data handling implementation for std::vector containers outlined in Code Snippet 3.2. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22

... // add block data BlockDataID intId = blocks . addBlockData ( VectorBlockData < int >( K ) ); BlockDataID doubleId = blocks . addBlockData ( VectorBlockData < double >( K ) ); ... // access block data for ( auto & block : blocks ) { std :: vector < int > * intVec = block . getData < std :: vector < int > >( intId ); // access data stored in intVec ... std :: vector < double > * doubleVec = block . getData < std :: vector < double > >( doubleId ); // access data stored in doubleVec ... } ... // save data to file blocks . saveBlockData ( " IntData . dat " , intId ); ...

is stored internally, linked to the block data handle that was returned upon registration, and therefore always available to the framework. Consequently, using the serialization functionality of the block data handling object, the block data handle is all that is required for the framework to create a snapshot of certain block data. Loading block data from a file works analogously. These mechanisms are also used internally by the framework for the checkpoint/restart functionality (see Section 8.1.1). In order to be able to store an arbitrary amount of data with support for arbitrary data types at every block, the framework could rely on the functionality of boost::any [Boost] or, if C++17 support is available, std::any from the standard library. The implementation of any, however, does not allow access via the base class type. Given a base class Base and a derived class Derived, access via the base class type fails when using any. Derived * derived = ... boost::any data = derived; Derived * pd = boost::any_cast( data ); /* works */ Base * pb1 = boost::any_cast( data ); /* works */ Base * pb2 = boost::any_cast( data ); /* runtime error */ UnrelatedType * p = boost::any_cast( data ); /* runtime error (expected) */ In order to store and access block data, the actual implementation is therefore based on the mechanisms presented in [Ner11]. This allows polymorphic, type-safe access. // assuming data of type Derived is stored at the block for block data handle id Derived * pd = block.getData( id ); /* works */

23

3 Parallelization Concepts and Realization

Base * pb1 = block.getData( id ); /* works */ Base * pb2 = block.getData( id ); /* works! */ UnrelatedType * p = block.getData( id ); /* runtime error (expected) */ The support for polymorphic, type-safe access enables the implementation of generic algorithms that only utilize base classes and have no knowledge about the actual derived type of the stored data. Moreover, this access even works with class hierarchies that contain multiple inheritance. Ultimately, the mechanisms and concepts presented in this section of the thesis constitute a significant advancement to the block data initialization functionality originally available in the framework [Fei12]. In order to enable dynamic changes to the block partitioning that occur during AMR like splitting and merging of blocks, the concept of block data handling objects must be extended beyond the basic functionality outlined in Code Snippet 3.1. These extensions are discussed in detail in Chapter 6.

3.4 Implementation Remarks Over the course of a simulation, different data structures are used for the representation of the block partitioning in order to adapt the implementation to the specific demands of the different phases of the simulation. These different phases are the initial domain partitioning that includes static load balancing, the dynamic repartitioning during runtime, and the actual simulation, i.e., the execution of compute kernels and the synchronization of simulation data between processes. During the initial domain partitioning phase, the block storage data structure is represented by class SetupBlockForest (blocks are managed by class SetupBlock). The main data structure that is active during the entire simulation phase and that is responsible for storing all distributed data is represented by classes BlockForest and Block. For the dynamic repartitioning phase, a temporarily constructed shallow copy of the main data structure that acts as a lightweight proxy is maintained by classes ProxyBlockForest and ProxyBlock. The specifics of and the differences between these three data structures are summarized in Table 3.1 and are discussed in greater detail in the following paragraphs. The initial domain partitioning, including the initial static load balancing, relies on a lightweight data structure that does not contain any actual simulation data. Instead, the blocks contain characteristic values for their expected workload/weight and memory consumption during the simulation. These characteristic values must be set by the application and are used by the static load balancing algorithms. The entire initialization phase is synchronously executed on all processes and the data structure used during that phase and represented by classes SetupBlockForest and SetupBlock is replicated on every process (→ the data structure is not stored distributedly / every process contains all blocks). Neighboring blocks are linked directly via pointers and the octrees that result from the successive partitioning of root blocks are stored explicitly via parent-child relationships. Moreover, the entire construction of the initial data structure, including the

24

3.4 Implementation Remarks Table 3.1: Characteristic of and differences between the data structure used during the initial domain partitioning phase (SetupBlockForest/SetupBlock), the data structured used during the simulation phase (BlockForest/Block), and the data structured that is temporarily created for the dynamic repartitioning (ProxyBlockForest/ProxyBlock).

SetupBlockForest / SetupBlock BlockForest / Block ProxyBlockForest / ProxyBlock

SetupBlockForest / SetupBlock BlockForest / Block ProxyBlockForest / ProxyBlock

simulation data (grids, unknowns, etc.)

neighbor addressing

– × –

pointers (block ID, process rank) (block ID, process rank)

distributed storage

octrees representation (for SFC construction, etc.)

– × ×

explicit (pointers) implicit (block ID) implicit (block ID)

initial static load balancing, can be executed independently from the actual simulation – it can even be executed on a completely different machine. The connection between the initial construction phase and the actual simulation is then established via a file that contains all information about the initial domain partitioning. Alternatively, the initial domain partitioning phase can be limited to only creating the coarse root blocks and leaving further octree-like block-level refinement to the fully distributed dynamic repartitioning phase during the simulation. Ultimately, the initial domain partitioning is used as basis in order to construct the main data structure represented by classes BlockForest and Block. Once the main data structure is set up, the data structure responsible for the initial domain partitioning and represented by classes SetupBlockForest and SetupBlock is destroyed. More details about the initialization phase, including insights into the initial static load balancing, follow in Chapter 4. The data structure used during the entire simulation phase (BlockForest/Block) and the data structure used during the dynamic repartitioning (ProxyBlockForest/ProxyBlock) are both stored distributedly across all active processes and comply with the depiction of distributed storage outlined at the end of Section 3.1. Every block is assigned to exactly one process. When a block is assigned to a process, the process becomes responsible for storing the block, including all associated block data, whereas no information, neither data nor meta data, is maintained about non-local blocks. Neighboring blocks are addressed via pairs of block IDs and process ranks. Topological information about the forest of octrees partitioning is implicitly available through the block IDs (cf. Section 3.2). The major difference between classes BlockForest and ProxyBlockForest is that the data structure that is represented by BlockForest stores all the simulation data, while the data structure that is represented by ProxyBlockForest acts as a lightweight proxy that contains none of this data. The characteristics of this proxy data structure and its special purpose during the dynamic repartitioning phase are covered in detail in Chapter 6. Since the three different implementations all represent the block-structured partitioning

25

3 Parallelization Concepts and Realization

Figure 3.4: Three different block-structured domain partitioning strategies that result in different topologies: 1) 2:1-balanced octrees (left), 2) 3:1-balanced general trees (middle), 3) rectilinear (right).

of the same simulation, the AABB that encloses the domain and the number of root blocks in x-, y-, and z-direction are shared by all three data structures. As a result, block IDs are globally valid and always represented by an instance of class BlockID. Consequently, depending on the current phase of the simulation, the same block can exist as an instance of class SetupBlock, Block, or ProxyBlock, and is, independent of its current representation, always referenced by the same block ID. During the simulation, only few algorithms depend on the topology that results from the block-structured partitioning. Certain layers of the communication, for example, must know how the blocks are arranged in space in order to guarantee a correct synchronization of the simulation data between neighboring blocks. Most algorithms, including all compute kernels, however, work on block-local data and are completely oblivious about neighboring blocks. Consequently, these algorithms work for any block-structured domain partitioning, independent of how blocks are exactly arranged in space (some example domain decompositions with different block-structured topologies are illustrated in Figure 3.4). This is why an additional layer of abstraction that includes abstract base classes for the block storage data structure (IBlockStorage), blocks (IBlock), and block IDs (IBlockID) is introduced for the domain partitioning. These abstract base classes act as interfaces for generic algorithms. In the present work that is based on a 2:1-balanced forest of octrees-like partitioning, class BlockForest is thus derived from class IBlockStorage and provides an implementation for the interface that is defined by IBlockStorage. The same is true for classes Block and BlockID, which are derived from classes IBlock and IBlockID, respectively. Functionality that is independent of the actual topology and hence identical for all blockstructured domain decompositions is therefore already implemented in the abstract base classes IBlockStorage and IBlock. Most particularly, this includes the mechanisms for storing arbitrary block data3 . As a result, all algorithms that iterate process-local blocks and work on block data can be implemented in terms of the abstract base classes only. A different block-structured domain partitioning strategy than the one introduced in this thesis, which may be required for future applications, is thus immediately compatible with almost all existing algorithms if its implementation is also derived from the abstract base classes. Ultimately, this design simplifies future extensions to the framework’s domain partitioning capabilities, enables high reusability of existing code, and thus significantly improves maintainability. 3

26

The actual implementation of class BlockDataHandling (cf. Code Snippet 3.1) is thus based on references to instances of class IBlock instead of Block.

4 Initialization The following chapter provides insight into the initial construction of the domain partitioning, including the initial, static load balancing. The chapter starts in Section 4.1 with a description of the core algorithm that controls the entire initialization procedure. A detailed look at the part of the algorithm responsible for load balancing, i.e., the distribution of all blocks to all available processes, follows in Section 4.2. Section 4.3 then illustrates the relation of SFCs to the forest of octrees domain partitioning and their use during the load balancing procedure.1 Finally, Section 4.4 describes the file format that allows to completely decouple the initial domain partitioning from the actual simulation.

4.1 The Setup Procedure As already mentioned in Section 3.4, the purpose of the entire initialization phase is to set up the data structures that are used during the actual simulation. The initialization phase therefore relies on a lightweight data structure that does not contain any actual simulation data. Instead, blocks of the initialization data structure only contain values that represent the expected workload and memory requirements. The initial load balancing is then based on these values. The allocation of actual simulation data is deferred until the construction of the distributed simulation data structure at the beginning of the simulation phase. The initialization phase is outlined in Figure 4.1. First, a uniform Cartesian partitioning is executed. The blocks that result from this initial partitioning act as roots for the subsequent octree decomposition. In order to better adapt to complex geometries, blocks of the initial partitioning that do not intersect with the simulation domain, i.e., blocks completely filled with empty space, can be discarded from the data structure. As a result, these blocks are not associated with any storage anymore and therefore completely vanish from the simulation. After these simulation-irrelevant blocks have been discarded, a static block-level refinement algorithm is executed. During this algorithm, a user-provided, simulation-dependent callback function determines which blocks require further refinement. A block marked for refinement by this function is uniformly split into eight smaller blocks, resulting in an octree-like domain partitioning. This callback function is called iteratively as long as the previous call resulted in additional blocks that were marked for refinement, allowing for blocks to be split multiple times. Consequently, this approach enables an arbitrary depth of the initial octree-like block decomposition. 1

Small parts of Section 4.3 were first published in [Sch18].

27

4 Initialization

(1)

(2)

(3)

(6)

(5)

(4)

DISK

DISK

Figure 4.1: Illustration of the main parts of the setup procedure. First, the space surrounding the simulation domain (marked grey) is uniformly partitioned into blocks (2). Blocks of this initial partitioning that do not intersect with the simulation domain are discarded from the data structure (3). Every remaining block can be further split into smaller blocks following an octree decomposition, resulting in a forest of octrees-like partitioning of the domain (4). Please note that although only one level of refinement is illustrated, the underlying iterative algorithm allows for blocks to be split multiple times. During the final stage of the setup procedure, the load balancing algorithm assigns a process rank to every block. In this example, the blocks are divided among two different process ranks (5). The setup procedure can be followed immediately by the simulation phase (which starts with the process-local allocation of block data) or the final state of this initial static setup can be saved to disk for later use (6).

During the following, last phase of the initialization stage, the static load balancing is executed. Here, user-defined, simulation-dependent callback functions are used in order to assign values to all blocks that correspond to the expected workload and memory requirements. These characteristic values, together with the block connectivity information provided by the octree-like domain partitioning, are the basis for the load balancing algorithm. The load balancing algorithm itself is another callback function that must adhere to a certain interface. The task of this load balancing algorithm is to assign a process rank to every block. A detailed description of the entire static load balancing process follows in Section 4.2. Once every block is assigned a process rank, the initialization phase is finished. This final state of the initial domain partitioning that consists of all blocks and their corresponding process associations can either be used immediately to proceed with the actual simulation or it can be stored to disk for later simulations. In either case, during the simulation phase, every process only allocates data for the blocks that are assigned to its rank. As already outlined in Section 3.1, no data or meta data is stored for non-local blocks. If the simulation phase immediately follows the initialization phase, the entire data associated with the initialization is destroyed before the process-local simulation data is allocated. The ability to store the final state of the initialization to disk allows to separate

28

4.1 The Setup Procedure

C++ Code Snippet 4.1: Example process flow for the initialization of the domain partitioning using an instance of class SetupBlockForest in order to set up the distributed data structure used during the simulation phase and represented by an instance of class BlockForest (cf. Table 3.1). Once an instance of class BlockForest is set up, the data associated with the setup phase can (and should) be freed, simulation data can be added, and computation can commence as illustrated in Code Snippet 3.3. 1 2 3 4

... // the initialization of the domain partitioning starts by creating // an instance of class SetupBlockForest SetupBlockForest sbf ( ... );

5 6 7 8 9 10

// registration of callback functions that control the subsequent initialization procedure // by determining which root blocks are excluded from the data structure and which blocks // require further refinement [both optional] sbf . r e g i s t e r R o o t B l o c k E x c l u s i o n C a l l b a c k ( ... ); sbf . r e g i s t e r B l o c k R e f i n e m e n t C a l l b a c k ( ... );

11 12 13

// construction of the forest of octrees domain partitioning (see Algorithm 4.1) sbf . initialize ( ... );

14 15 16 17 18

// registration of callback functions that control the subsequent load balancing procedure // by determining workload and memory consumption for every block [both optional] sbf . r e g i s t e r W o r k l o a d A s s i g n m e n t C a l l b a c k ( ... ); sbf . r e g i s t e r M e m o r y A s s i g n m e n t C a l l b a c k ( ... );

19 20 21 22

// execution of the load balancing procedure (see Algorithm 4.2, can be skipped for single process runs) auto loadBalancer = ... sbf . balanceLoad ( loadBalancer , ... );

23 24 25 26

// storing the domain partitioning to file [optional] // (typically only invoked if the initial domain partitioning is decoupled from the simulation phase) sbf . save ( " forest . dat " );

27 28 29 30 31 32

// initialization of the fully distributed data structure // that is used for the actual simulation BlockForest bf1 ( " forest . dat " , ... ); // initialization from file or ... BlockForest bf2 ( sbf , ... ); // ... immediately from an instance // of class SetupBlockForest

33 34 35 36 37

// the data stored for the instance of class SetupBlockForest can be freed manually [optional] // (it will be freed automatically once “sbf” goes out of scope and is destructed) sbf . clear (); ...

simulation setup from simulation execution. Consequently, it is possible to run the setup multiple times with different load balancing algorithms and for different partitioning and balancing configurations – it is even possible to run the setup on a completely different machine as compared to the actual simulation. A more detailed analysis of storing the setup to file follows in Section 4.4. Although the data structure used during initialization, including the entire forest of octrees connectivity information, is replicated among all processes (cf. Table 3.1), the potentially compute-intensive functionality encapsulated in the various callback functions may be implemented in a distributed parallel manner. Code Snippet 4.1 illustrates how the previously described setup process is represented in the implementation. After an instance of class SetupBlockForest, which contains the data structure used during setup,

29

4 Initialization is created, callback functions that control the domain partitioning can be registered (cf. Lines 9 and 10). A subsequent call to the function initialize generates the initial forest of octrees domain partitioning. As such, the implementation behind initialize corresponds to steps (2) to (4) of Figure 4.1. If the expected workload or the expected memory consumption varies between different blocks, Lines 17 and 18 allow to register callback functions that will be evaluated before the load balancing algorithm is executed. The load balancing algorithm is selected in Line 21 and passed to the function balanceLoad in the following line. The implementation of function balanceLoad corresponds to step (5) of Figure 4.1 and is discussed in more detail in the next section of this chapter. By making use of various user-provided, simulation-dependent callback functions, the setup procedure follows the open/closed software design principle mentioned in Section 1.2. Line 31 shows how an instance of class BlockForest, which represents the main, fully distributed data structure used during simulation, is constructed from an instance of class SetupBlockForest. Alternatively, the final state of the setup procedure can be saved as indicated in Line 26, the program can be exited at this point (not shown in Code Snippet 4.1), and the actual simulation can be started from file, thereby skipping the setup phase entirely, as shown in Line 30. Algorithm 4.1 outlines the algorithm that is executed when calling initialize in Line 13 of Code Snippet 4.1. This algorithm consists of two major parts. In the first part, the root blocks are created by performing a Cartesian partitioning of the simulation space. The number of root blocks in x-, y-, and z-direction are input parameters of the algorithm and are specified by the surrounding application. If a callback function for excluding root blocks was registered, root blocks marked for exclusion by this function are discarded from the data structure. Moreover, local block connectivity is set up by explicitly creating links to all direct neighbors of every root block. The algorithm setting up the connectivity takes domain periodicity, which is another application-specific input, into consideration. As a result, each root block has either exactly one distinct neighbor for every corner, edge, and face, or a root block can have no neighbor associated with a particular corner, edge, or face. A block has no neighbor for a particular corner, edge, or face in case of a non-periodic domain boundary or if the potential neighbor block was discarded from the data structure by the previous call to the root block exclusion function. Having set up the connectivity for all blocks marks the end of the first part of Algorithm 4.1. Hence, the second part of Algorithm 4.1 starts in line Line 11. This second part is focused on the initial, static, block-level refinement of the data structure. Consequently, this second part is only executed if the underlying application did register a callback function that can be used to determine if a block requires further refinement. At the beginning of each iteration of the following iterative procedure, all currently available blocks are passed to this callback function. If blocks are marked for refinement during execution of the callback function, these blocks are not immediately split up into smaller blocks, but first the block neighborhood is checked for violations of the 2:1 balance constraint. If a violation of the 2:1 balance constraint is detected, additional blocks are marked for refinement. This process is repeated until there are no more violations of 2:1 balance. Only then, all blocks marked for refinement are finally split into 8 smaller blocks of identical

30

4.1 The Setup Procedure

Algorithm 4.1: The algorithm outlines the initial construction of the data structure that manages the forest of octrees-like partitioning of the simulation space into blocks during simulation setup. It corresponds to steps (2) to (4) in Figure 4.1 and it represents the implementation behind the call to function initialize in Line 13 of Code Snippet 4.1. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

Function SetupBlockForestConstruction create all root blocks by performing an initial Cartesian space partitioning forall root blocks do if block is not excluded from data structure then /* application-specific ... */ allocate memory for the block /* ... callback function is evaluated ... */ end /* ... (default if no callback is available: no block is excluded) */ end forall blocks do set up neighborhood (with regard to domain periodicity) /* set up links to all ... */ end /* ... neighbor blocks */ if block refinement callback function is available then repeat forall blocks do call BlockRefinementCallback(block) /* marks the block if ... */ end /* ... it needs to be split */ if at least one block was marked to be split then repeat /* mark additional blocks for splitting to maintain 2:1 balance */ forall blocks do forall neighbor blocks of the current block do if neighbor block is marked for splitting and smaller than the current block then mark the current block for splitting end end end until no additional block was marked to be split anymore split marked blocks → create 8 new, smaller blocks adapt block neighborhoods /* adjust/initialize links to neighbor blocks */ end until no block was marked to be split anymore by the callback function in Line 14 end end

size. This process of automatically maintaining 2:1 balance by forcing additional blocks to split is also illustrated in Figure 4.2. Once these smaller blocks are created, block connectivity must be restored. For all new blocks, the links that refer to all direct neighbors must be initialized. For existing blocks that are now neighbors to new blocks, these links must be adapted. This process, which is triggered in Line 27 of Algorithm 4.1, is also illustrated in Figure 4.3. The basic idea of this process is to use spatial queries in order to determine all neighbors of some block X by checking which blocks are intersected by a specific set of point probes. A block is considered to be hit by a point probe if the point lies within the AABB of the block. In order to determine the topology of the neighborhood of block X, 56 point probes are created and checked against the AABBs of a set of potential neighbor blocks. These point probes exactly coincide with the centers of imaginary, smaller neighbor blocks. Hence, 4 point probes are created for every face of block X, 2 for every edge, and 1 for every corner, resulting in 4 × 6 + 2 × 12 + 1 × 8 = 56 probes. The set of all potential neighbor blocks of block X is constructed using the topology information from the previous state of the data

31

4 Initialization

(1)

(2.1)

(3.1)

(4.1)

(2.2)

(3.2)

(4.2)

(3.3)

(4.3)

(4.4)

Figure 4.2: This figure outlines 3 iterations of the block-level refinement procedure, which is part of the initial setup of the data structure and starts with a uniform Cartesian partitioning as illustrated in (1). In this example, the callback function marks exactly one block for refinement (the block surrounded by a dashed line) at the beginning of each iteration (2.1), (3.1), and (4.1). In order to maintain 2:1 balance, additional blocks are forced to split in (3.3), (4.3), and (4.4). Note that in the actual implementation, first all blocks are only marked to be split, and once no more additional blocks are marked, all blocks that need to be split are split in one go at the end of each iteration.

structure. This set is bounded and, in the worst case2 , consists of a maximum of 455 blocks/AABBs that are checked against all 56 point probes. If such a point probe that corresponds to block X hits one of the potential neighbor blocks, this block is registered accordingly at block X as a face, edge, or corner neighbor. Note that multiple point probes may hit the same neighbor block, but no neighbor will be missed, i.e., the procedure guarantees that each direct neighbor of block X is hit by at least one point probe. Also note that this entire algorithm that relies on point probes and spatial queries in order to set up block neighborhood information is reused during the creation of the proxy data structure in the distributed SAMR procedure presented in Chapter 6. The only difference during the SAMR procedure is the way how the set of potential neighbor blocks is created, which, because of the distributed nature of the data structure at that point, requires communication with neighboring processes. If the initialization procedure presented in this section is part of a simulation that is executed in either a thread-parallel only or a hybrid MPI/thread-parallel mode, the algorithm for setting up block neighborhood information takes full advantage of the parallel execution by distributing the work evenly to all available threads (each block can 2

In the worst case, block X resulted from a block Y that was split and that had 56 smaller neighbor blocks, all of which were also split.

32

4.2 Static Load Balancing

(1.1)

(1.2)

(2)

(3)

Figure 4.3: This figure demonstrates how point probes together with spatial queries are used in order to set up block neighborhood information. In this 2D example, the entire set of point probes consists of a total of 12 probes, 2 for each of the 4 edges, and 1 for each of the 4 corners (1.2). These point probes hit 8 different blocks (2), which are then assigned as the corresponding edge and corner neighbors (3).

be processed independently). The same is true for all other major parts of Algorithm 4.1: The loops in Lines 3, 8, 13, and 18, as well as the splitting of all marked blocks in Line 26 can be executed in parallel by different threads.

4.2 Static Load Balancing After the domain partitioning into blocks is finished, static load balancing is executed. The goal of this initial load balancing is to assign a process rank to each block. Every block must be assigned exactly one rank, but multiple blocks may be assigned the same rank. Later when process i creates the BlockForest data structure from a SetupBlockForest data structure, process i only constructs an instance of class Block for every block in the setup data structure whose process rank is equal to i. All blocks whose rank is not equal to i are allocated on other processes. Neighborhood information in instances of class Block then consists of block ID & process rank pairs (cf. Table 3.1). For the initial, static load balancing of the SetupBlockForest data structure, the framework specifies the main program flow as outlined in Algorithm 4.2. The main part of this initial balancing, the actual load balancing algorithm, is realized via a callback function that must be registered beforehand (see Code Snippet 4.1). Moreover, before the actual load balancing algorithm is executed, all blocks are assigned workload and expected memory consumption values. These values are assigned by pre-registered, simulation-dependent callback functions. If no such callbacks are available, default values are assigned that result in all blocks having the same weight. The input for the load balancing function are the SetupBlockForest data structure as well as the number of processes the blocks are supposed to be distributed to. The number of processes can be identical to the current number of processes if the actual simulation follows immediately after the initialization stage. If, however, the result of the block structure initialization is saved to file for later usage, the number of processes passed to the load balancing function will be equal to the targeted number of processes for the eventual simulation. One possibility for the implementation of the load balancing function is to use SFCs. A detailed description of how SFCs can be used for the partitioning of

33

4 Initialization

Algorithm 4.2: Program flow as set by the framework in order to achieve load balancing of the SetupBlockForest data structure. This algorithm represents the implementation of the function that is called in Line 22 of Code Snippet 4.1. In addition to performing the actual load balancing by calling the load balancer in Line 6 (≙ loadBalancer of Line 21 in Code Snippet 4.1), the framework performs some additional consistency checks and optional operations. For some cases, these optional operations allow to adjust and improve the block to process mapping. 1 2 3 4 5 6 7 8 9 10

Function SetupBlockForestLoadBalancing forall blocks do call WorkloadAssignmentCallback(block) /* assign expected workload and ... call MemoryAssignmentCallback(block) /* ... memory consumption to each block end /* (default values are assigned if no callback functions are registered) call LoadBalancingAlgorithm(blocks, number of processes) perform consistency checks /* Is every block assigned to one of the available processes? /* Does the block distribution obey all per-process memory limits? etc. optional: breadth-first search for remapping process ranks optional: incorporate empty processes into process network end

*/ */ */ */ */

the SetupBlockForest data structure follows in the next Section 4.3. Another possibility is to use existing graph partitioning libraries like METIS [Kar99] for the implementation of the load balancing function. For this to work, the graph representation of the SetupBlockForest data structure must be transformed into a representation that can be used as an input to a library like METIS. Every SetupBlock corresponds to a node in the graph. Edges in the graph are described by the links of each block to all of its neighbor blocks. After successful partitioning of the graph by the third party library, the result must be transferred back to the SetupBlockForest data structure. This is achieved by simply setting the process rank of each SetupBlock according to the partition index of the corresponding node in the graph representation of the partitioning library. Once the load balancing algorithm is finished, some additional checks and optional operations are performed as outlined in Algorithm 4.2. These checks for example ensure that every block is assigned a valid process rank. It is also possible to pass a process memory limit to Algorithm 4.2. If such a limit is passed to the algorithm, the framework can check whether the block distribution satisfies this memory limit. Aside from these consistency checks, Algorithm 4.2 can also perform optional operations that adjust and improve the block to process mapping in some cases. One of these optional operations is the integration of empty processes into the process network. An empty process is a process that does not contain any blocks. Processes can remain empty if initially a simulation contains fewer blocks than processes. This situation may occur if a simulation is expected to see an increase in its resource requirements during runtime, for example because of dynamic refinement that is triggered only once turbulent flow is starting to build up. Normally, the process graph is implicitly defined by the connections between neighboring blocks (see figures (1) to (3) in Figure 4.4). Consequently, processes that are initially empty would not be connected to processes that contain blocks. This is a problem for dynamic load balancing algorithms which are based on diffusion of workload to neighboring processes. These empty processes would not be considered by the diffusion algorithm, and hence remain empty and unused for the entire simulation.

34

4.2 Static Load Balancing

3

4

0

1

4

3

1

2

3

4

0

1

7

0

1

3

1

2

0

1 4

2

2

3

2

(2)

(1)

5

4

7

5

1

3

0

1

(3)

2

0

7

1 4

6

3 6

5

3

4

3

(4.2)

5

2

7

(4.1)

Figure 4.4: The numbers inside the blocks of figure (1) illustrate the process association of the blocks. Figure (2) shows all the connections between the blocks. Figure (3) then shows the process graph that follows from this block partitioning. Since no block that is assigned to process 0 is a neighbor of a block assigned to process 2 in (2), there is no connection between processes 0 and 2 in figure (3). Assuming that the corresponding simulation runs on 8 processes, the remaining 3 processes that do not contain any blocks can be kept unconnected from the other processes (4.1), or they can be integrated into the process network by explicitly registering them as neighbors of other processes that contain blocks (4.2). These 3 currently block-less processes are then distributed uniformly among the other processes, causing the block to process mapping to be changed as compared to (1).

0

0

1

0

1 2

0

1

(2.1)

(2.2)

(2.3)

4 3

2

0

(1)

1

0

1

2 3

2 4

3

3

5

1

7

0

1

4

3 0

1

3

(2.5)

2

0

4

(2.4)

1 2

(3.2)

5

7

3

5

3

2

0

1

2

3

1

4

6

7 6

1 2

3

(2.6)

5

0

4

7

4

(3.1)

Figure 4.5: This illustration shows how both the breadth-first search reordering depicted in (2.1) to (2.6) and the inclusion (3.2)/exclusion (3.1) of empty processes affect the block to process mapping. The process graph in (1) corresponds to a block partitioning that is identical to the one in (1) and (2) of Figure 4.4.

35

4 Initialization

Algorithm 4.3: BFS-based remapping of process ranks. For a connected graph, the loop in Line 4 isn’t required. However, since blocks can be completely removed from the data structure (cf. Figure 4.1), the BFS-based remapping algorithm must be able to handle potentially disconnected graphs that consist of multiple parts, hence the requirement for the outer loop in Line 4. 1 2 3 4 5 6 7 8 9 10 11 12 13 14

Function SetupBlockForestProcessRemapping c=0 declare list l (initially empty) while not all process ranks have been remapped do pick an unmapped process and add it as head to l while l is not empty do remove process p from head of l register c as new process rank for p c=c+1 append all unmapped neighbor processes of p as tail to l end end apply process rank remapping end

This is the reason why every process additionally stores a list of all neighboring processes, thereby explicitly defining the process connectivity graph. Most, often even all, of these neighboring processes can be determined by consulting the neighborhood information stored for all blocks. Processes that initially do not contain any blocks, however, must be explicitly added to this list of neighboring processes. Only then they are connected to the process graph as illustrated by Figure 4.4, and only then they are considered during the distributed diffusion-based load balancing. Empty processes are uniformly inserted into the process graph, meaning they are not just registered as neighbors of the same process, but evenly distributed among the other processes.

Another optional operation that can be executed after the load balancing algorithm is finished is a breadth-first search (BFS) on the process graph. Here, the breadth-first traversal order is used to remap the process ranks that have been assigned by the load balancing algorithm. The idea is to have control over placing neighboring processes close to each other with respect to the hardware topology, independent of the numbering scheme of the load balancer. A scheduling policy that is supported by all major MPI implementations is to place MPI processes in ascending order such that first all available slots on the first host are used, followed by scheduling the remaining processes to the slots on the second host, etc. As a result, MPI ranks n and n + 1 then belong to two processes that are placed close to each other with respect to the hardware topology. This is why the goal of the BFS-based rank remapping is to assign ranks that are close together to processes which are neighbors in the process connectivity graph. Given the above mentioned scheduling policy, these processes are then likely to end up close to each other with respect to the machine topology. This placement is beneficial for next-neighbor communication, which is expected to be the major form of communication for any distributed-parallel simulation algorithm that is running on top of the software framework presented in this thesis. The BFS-based remapping of process ranks is illustrated in Figure 4.5 and outlined in Algorithm 4.3.

36

4.3 Space Filling Curves As outlined in Algorithm 4.3, the BFS-based remapping of process ranks consists of two steps. The first step, the graph traversal, runs in O(NP) time, with NP denoting the total number of processes. Similarly, the second step, which consists of adapting the process rank that is set at each block, runs in O(NB) time, with NB denoting the total number of blocks. The same is true for the inclusion of empty processes into the process connectivity graph. Here, the first step consist of adapting the list of neighbors that is explicitly stored for every process, while the second step also takes care of remapping all block-toprocess associations. The optional callback functions at the beginning of Algorithm 4.2 for calculating the expected workload and memory consumption are implementationdependent but often implemented such that they also run in O(NB) time. The runtime of the load balancer, however, heavily depends on the concrete implementation. The consistency checks that follow the load balancer are again of O(NB) complexity.

4.3 Space Filling Curves In general, SFCs map multidimensional data to one dimension while preserving good spatial locality of the data. As such, they can be used to define a global ordering for all the octants of an octree. Consequently, SFCs allow to construct an ordered list of all the blocks stored in the SetupBlockForest data structure. The SFC-based load balancing routines can make use of either Morton [Mor66] or Hilbert [Hil91] order. With Hilbert order, consecutive blocks are always connected via faces, whereas with Morton order, several consecutive blocks are only connected via edges or corners, with some consecutive blocks not being connected to each other at all. Hilbert order, therefore, results in better spatial locality for consecutive blocks than Morton order (see Figure 4.6). 7

8

5

6 0

(1.1)

3

4

1

2

0

9 1

2

3

4

(1.2)

9

6

8

7 0

(2.1)

9

5

6

7

8

(1.3)

5 4

3

1

2

(2.2)

0

5 1

2

4

3

8

7

9

6

(2.3)

Figure 4.6: Figures (1.1) to (1.3) illustrate Morton order-based traversal. Quadrants (2D)/Octants (3D) are always visited in the same order. As a result, consecutive blocks with regard to the curve might not be neighbors regarding the domain partitioning: blocks 4 and 5 are an example. Figures (2.1) to (2.3), in comparison, demonstrate Hilbert order-based traversal which leads to a better preservation of spatial locality. Here, consecutive blocks are guaranteed to be connected via faces.

37

4 Initialization

Algorithm 4.4: Algorithm used to construct an SFC-ordered list of the blocks stored in the SetupBlockForest data structure. Note that only blocks stored at leaves are added to the list since only those blocks make up the actual domain partitioning (cf. Figure 4.6). Also note that this algorithm is used for both Morton as well as Hilbert order-based tree traversal. The only difference between these two variants is the order in which the child nodes are pushed onto the stack in Line 11, and the order in which the root blocks are evaluated in Line 4 (see Figure 4.7 for an illustration of the evaluation order). 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

Function SFCConstruction declare list l (initially empty) // for storing the the SFC declare stack s (initially empty) forall root blocks do push root block on s while s is not empty do pop block b from s if b has no children then append b to l // b is a leaf node else push all children of b to s end end end return l end

During the construction of the SetupBlockForest data structure (cf. Algorithm 4.1), a SetupBlock is not deleted after it was marked for refinement and eight smaller blocks were created. In fact, the block is kept in the data structure and a parent-child relationship is established between this block and the newly created, smaller blocks. As a result, the complete tree structure is explicitly stored and always available for the SetupBlockForest data structure. For the construction of the distributed BlockForest data structure, only SetupBlocks that are stored as leaves are of interest and are used to create corresponding Blocks. However, for the SFC-based balancing, having the full tree structure still available is an advantage since both the Morton as well as the Hilbert order-based curve can be constructed by a depth-first tree traversal. This depth-first traversal of the SetupBlockForest data structure, which results in an SFC-ordered list of all blocks, is illustrated in Algorithm 4.4. For Morton order, child nodes are always visited in the same order while descending the tree, whereas for Hilbert order, the order in which child nodes are traversed depends on their position within the tree. For the construction of the Hilbert curve, lookup tables exist that specify the exact traversal order [Cam03]. Consequently, the construction of a SFC based on Hilbert instead of Morton order only results in an insignificant overhead. Both Morton as well as Hilbert order-based SFCs are only defined on octrees. The domain partitioning, however, is represented by a forest of octrees where each octree is the result of further refining one of the initial root blocks (cf. Figure 3.2). In order to obtain a global ordering of all blocks, the curves that are defined on each individual octree are stitched together to form one SFC that covers the entire data structure. This way, the last block that is traversed by the SFC in one octree is followed by the first block that is traversed by the SFC in the next octree. In Algorithm 4.4, this is achieved by iterating all root blocks and thus evaluating one octree after the other.

38

4.3 Space Filling Curves

y

x

(1)

y

x

(2)

Figure 4.7: Illustration of the evaluation order of root blocks during the construction of the global SFC. The evaluation order on the left side (1) is used in combination with Morton order-based tree traversal. The evaluation order on the right side (2) is used when the individual trees are traversed in Hilbert order. Please note that only root blocks are shown in both figures. In practice, roots blocks are often partitioned into smaller blocks which are the result of an octree-like decomposition and which represent the actual domain partitioning (cf. Figures 3.1 and 3.2). Also note that for 3D, the evaluation patterns are extended to the third dimension in a similar manner.

The exact order in which the root blocks are evaluated is illustrated in Figure 4.7. If the octrees are traversed in Morton order, the evaluation order of the root blocks mirrors this z-like traversal pattern. Similarly, if the octrees are traversed in Hilbert order, the evaluation order of the root blocks resembles a Hilbert order-like pattern. This Hilbert order-like pattern results in better spatial locality, and it can be used independent of the specific traversal order of the individual octrees. Hence, it could also be used even if the octrees are traversed in Morton order. However, using the z-like evaluation order for the root blocks in case the corresponding octrees are traversed in Morton order has one key advantage: In that case, the resulting global traversal order mirrors the construction scheme of the block ID (cf. Figure 3.3), meaning here the global ordering of all blocks can be obtained by lexicographically sorting all blocks by their block IDs based on their bit representation.3 In other words, executing Algorithm 4.4 with the root block evaluation order of Figure 4.7(1) and Morton order-based tree traversal leads to the exact same result as lexicographically sorting all blocks by their block IDs. For the initial, static load balancing of the SetupBlockForest data structure, there is no advantage in sorting the blocks over executing Algorithm 4.4. During the dynamic load balancing presented in Section 6.4.1, however, it is beneficial to be able to obtain the global SFC-based order of all blocks without the need to explicitly reconstruct any tree structure. Regardless of whether this SFC-ordered list of all blocks is constructed by sorting all block IDs or by a tree construction followed by a tree traversal, O(NlogN) time is required on average, with N denoting the total number of blocks. In order to use the SFC-ordered list of all blocks for the initial load balancing, the list must be divided into NP contiguous parts of equal weight, with NP denoting the number of requested processes. Finding an optimal solution for this constrained partitioning problem is of exponential complexity, which is infeasible in practice. Dynamic programming and memoization can help to considerably speed up this process, however at the expense of memory. Consequently, the dynamic programming approach also becomes infeasible for partitioning the list of blocks to thousands of processes. Therefore, the greedy algorithm 3

Example of lexicographically sorted block IDs: 1000 < 1001000 < 1010001 < 1011

39

4 Initialization

Algorithm 4.5: Algorithm used to partition the SFC-ordered list of all blocks into NP contiguous parts of equal weight (NP being the number of processes). Because of the greedy approach, this algorithm will run in O(N) time (N being the number of blocks), but it will not always find an optimal solution. It is, however, guaranteed to find an optimal partitioning in case all blocks possess the same weight. 1 2 3 4 5 6 7 8 9 10 11 12

Function SFCBalancing calculate total weight of all blocks forall processes do required process weight = remaining total weight / remaining processes while blocks remaining and |required process weight - current process weight - weight of next block| ≤ |required process weight - current process weight| do assign the next block in the curve to the current process adapt current process weight and remaining total weight end end end

outlined in Algorithm 4.5 is used. Because of the greedy approach, this algorithm runs in O(N) time (N being the number of blocks). Hence, Algorithm 4.5 is capable of very efficiently partitioning millions of blocks to potentially also millions of processes. Moreover, if all blocks share the same weight, the algorithm is guaranteed to find an optimal solution. This also remains true if some blocks possess a zero weight.4 However, if the blocks possess varying weights, Algorithm 4.5 will most likely not find an optimal partitioning. The quality of the solution found by Algorithm 4.5 then mainly depends on how strong block weights are varying and how many blocks there are compared to the number of processes. The more blocks, the better the quality of the partitioning in general. If there are eight times as many blocks in total as there are processes, for example, the partitioning generated by Algorithm 4.5 will typically come very close to the optimal partitioning.

4.4 Storing the Domain Partitioning to File As already mentioned in Section 4.1 and illustrated in Figure 4.1, the initialization of the block partitioning does not have to be followed immediately by the simulation. It is also possible to save the domain partitioning, including the block to process mapping, to file. This allows to run the domain partitioning with different block sizes and the load balancing with different parameters independent of the actual simulation. The initial domain partitioning together with the initial, static load balancing can be seen as a preprocessing step that can be executed separately from the simulation. This also allows to run the entire initialization process on a standard desktop computer prior to executing the corresponding simulation on a large-scale supercomputer. Once a good domain partitioning is found using the local workstation, the simulation running on the supercomputer only has to load the partitioning from file and it can start with the actual computation almost immediately after startup. This way, it is also possible to reuse the 4

40

Assigning a zero weight to a block can be useful if the block is temporarily deactivated and thus currently not taking part in the simulation.

4.4 Storing the Domain Partitioning to File

Header AABB of the simulation domain number of root blocks in x-, y-, and z-direction domain periodicity number of processes ... Process/Block Data 1st Process number of process local blocks BlockID of 1st block, BlockID of 2nd block, ... number of neighbor processes rank of 1st neighbor, rank of 2nd neighbor, ... 2nd Process number of process local blocks BlockID of 1st block, BlockID of 2nd block, ... number of neighbor processes rank of 1st neighbor, rank of 2nd neighbor, ... ...

Figure 4.8: File format for storing the domain partitioning, including the block to process mapping. The fixed-sized header contains general information about the domain partitioning, while the file body is divided into as many parts as there are processes. For each process, the file contains a list of block IDs and process ranks. The block IDs correspond to all process local blocks, while the process ranks represent all neighboring processes. This information is enough to reconstruct all the missing data, like for example the AABB of each block or the neighbor blocks of each block.

same initial partitioning multiple times for simulations that only differ in terms of their parametrization: The same simulation could, for example, be executed with different boundary conditions, different material properties, different fluid properties, etc. All those times, the entire initialization process wouldn’t have to be rerun, but the initial state could be loaded from the same file that only has to be created once. The file format is illustrated in Figure 4.8. The file consists of a fixed-sized header that contains general information about the domain partitioning and a body that contains information about the blocks and their process mapping. Besides information about periodicity of the domain and the total number of processes, the file header also stores the AABB of the domain and the number of root blocks in x-, y-, and z-direction. Using this data, it is possible to reconstruct a block’s AABB and hence its exact location within the domain from just its block ID (see Section 3.2). Since the file header only contains general information which essentially consists of a few numbers, its size is less than 1 KiB. The main information about the partitioning is stored in the file body, which is divided into as many parts as there are processes. For each process, the file only stores a list of

41

4 Initialization all block IDs that correspond to the blocks of this process and a list of process ranks that represent all the neighboring processes (potentially including empty neighbor processes that do not yet contain any blocks). The blocks’ AABBs are reconstructed from the IDs, while the neighborhood information of each block is reconstructed using point probing as described in Figure 4.3. For the point probing, the set of potential neighbor blocks is composed of all other local blocks and all the blocks contained on neighboring processes. As a result, block neighborhood information does not have to be stored explicitly in the file, but can be efficiently reconstructed when the file is loaded. Consequently, the entire file consists almost exclusively of lists of block IDs and process ranks. The number of bytes used for storing a block ID as well as the number of bytes used for storing a process rank depend on the total number of blocks and processes, respectively. Ultimately, the resulting file size will be small compared to the scale of the corresponding simulation. A file containing a domain partitioning into 100,000 blocks on 10,000 processes could be as small as 1.3 MiB: Assuming there are 1,000 root blocks, each of which can be refined up to 4 times for a maximum of 5 different levels, storing a block ID requires 3 bytes (1 marker bit + 10 bits for the octree index + 12 bits for the branch indices = 23 bits). If all blocks are fully refined, this setup allows for 4,096,000 blocks. If 100,000 blocks are distributed to 10,000 processes, each process stores 10 blocks on average. Moreover, 2 bytes are required for storing process ranks. Assuming that each process has, on average, 50 neighbor processes, the file body consists of (2 bytes for the number of blocks + 10 × 3 bytes for the block IDs + 2 bytes for the number of neighbor processes + 50 × 2 bytes for the neighbor process ranks) × 10, 000 = (2 + 10 × 3 + 2 + 50 × 2) × 10, 000 bytes = 134 × 10, 000 bytes = 1.3 MiB. A file that contains the information of ten million blocks distributed to one million processes may require less than 200 MiB: Assuming 10,000 root blocks, each of which can be refined up to 5 times for a maximum of 6 different levels, storing a block ID requires 4 bytes (1 + 14 + 15 bits = 30 bits). This setup allows for up to 328 million blocks. If 10 million blocks are distributed to one million processes, each process again stores 10 blocks on average. This time, however, 3 bytes are required for storing process ranks. Assuming, again, 50 neighbors for each process, the file body will consist of (2 + 10 × 4 + 2 + 50 × 3) × 1, 000, 000 bytes = 194 × 1, 000, 000 bytes = 185 MiB. All the data is serialized to a byte stream before it is saved to file. Similarly, when loading the data from file, it is interpreted as a stream of bytes. This serialization process is endian-independent. Numbers that occupy more than one byte in memory are decomposed into a stream of bytes using bit masks and bit shifts. When reading the data from file, single bytes are recombined into one number using the same approach in reverse. As a

42

4.4 Storing the Domain Partitioning to File result, files can be created as well as loaded on any system, independent of the endianness of the underlying architecture. When loading the domain partitioning from file, accessing the file from every process simultaneously usually results in bad performance. If hundreds or thousands of processes are running in parallel, this approach can become infeasible. Therefore, a master-slave approach was implemented where only one process first reads the data from file and then broadcasts the entire byte stream to all other processes. In practice, loading the file, broadcasting its contents, and constructing the BlockForest data structure required for the simulation only takes 10 to 30 seconds when using a million processes on a current IBM Blue Gene/Q supercomputer. For simulations that use around ten thousand processes, the entire procedure is typically finished in a couple of milliseconds. For future simulations that involve many millions of processes, parallel MPI I/O or a combination of the master-slave approach and MPI I/O may be required for reading the data from file. Since the data is stored per process, the file format is perfectly suited for usage with parallel MPI I/O.

43

5 The Lattice Boltzmann Method on Nonuniform Grids This chapter introduces the concepts and algorithmic details behind the parallelization of an existing grid refinement approach for the LBM. The chapter starts in Section 5.1 with a short illustration of how the data required for the LBM is maintained by the block structure and how this data is efficiently synchronized in parallel simulations. Section 5.2 then introduces the conceptual foundation for the parallelization of the grid refinement approach for the LBM that was proposed by [Roh06]. Details on the implementation follow in the next Sections 5.3 and 5.4. The final Section 5.5 focuses on the specific requirements for load balancing an LBM-based simulation on nonuniform grids. Major parts of this chapter were first published in [Sch16] and are therefore identical in content to parts of this article, with only some minor changes and additions in all of the following sections, including the addition of Figure 5.9.

5.1 Grids and Parallel Synchronization For simulations based on the LBM, each block stores a uniform grid of the same size as illustrated in Figure 3.1. Consequently, a globally nonuniform grid always consists of piecewise uniform grids, and transitions between different grid levels are always 2:1 balanced. Depending on the specifics of the simulation, a single block typically stores a uniform grid that contains between 1,000 and 500,000 cells. Depending on the lattice model in use, each of these cells stores multiple PDFs: 9 for the D2Q9, 19 for the D3Q19, and 27 for the D3Q27 model (cf. Figure 2.1). In a parallel environment, the framework takes care of the necessary communication between the blocks by providing an extensible, feature rich communication layer that can be adapted to the needs and communication patterns of the underlying simulation. Generally, communication is organized into two steps: packing data into and unpacking data from buffers which are exchanged between processes. If two blocks reside on the same process, the communication layer allows data to be copied directly without a call to MPI and without the intermediate exchange of a buffer. For the LBM, communication is only performed between spatially adjacent blocks. For the rest of this chapter, two spatially directly adjacent blocks are referred to as being connected via a corner if both blocks only intersect in one isolated point. If both blocks intersect in a line, they are referred to as being connected via an edge. Analogously, if they intersect in a surface, they are referred to as being connected via a face.

45

5 The Lattice Boltzmann Method on Nonuniform Grids

Figure 5.1: Illustration of three neighboring blocks in 2D and the corresponding communication in one direction (towards the block in the upper right corner) after the collision step. In this 2D example with the D2Q9 lattice model, each block contains a grid of size 2 × 2 with one additional ghost layer. On edges, three PDF values must be exchanged. On corners, only one PDF value is communicated. Essentially, only those PDF values that are streaming into the block during the subsequent streaming step must be communicated.

For parallel simulations with the LBM, the grid assigned to each block is extended by at least one additional ghost layer that is used to synchronize the data. During communication, PDF values stored in the outermost inner cells are copied to the corresponding ghost layers of neighboring blocks (see Figure 5.1). Depending on the lattice model, exchanging PDF values with fewer neighbors may be sufficient. If D3Q19 is used, PDF values must be exchanged only with neighboring blocks connected via a face or an edge, but not across a corner. Additionally, in most simulations, not all PDF values must be communicated, but only those streaming into the neighboring block as illustrated in Figure 5.1. For D3Q19, out of the 19 PDF values stored in each cell, five PDF values must be exchanged for every cell on a face-to-face connection and only one PDF value for every cell on an edge-to-edge connection. On average, this reduces the amount of data that must be communicated by a factor of 4. It should be noted that the modular design of the communication layer allows to adapt all communication steps to the specific needs of the application. If an application requires more PDF data to be present in ghost layers, always also more data can be exchanged between processes. For the forest of octrees-like domain partitioning, the communication scheme consists of three different communication patterns: (1) data exchange between two blocks of the same size on the same level, (2) sending data to a larger block on a coarser level (fine-to-coarse communication), and (3) sending data to a smaller block on a finer level (coarse-to-fine communication). These three communication patterns are also illustrated in Figure 5.2. Details on the implementation of the three communication patterns specifically for the LBM on nununiform grids follow in Sections 5.4.1, 5.4.2, and 5.4.3.

46

5.2 Parallelization Concept

(1)

(2)

(3)

Figure 5.2: For the forest of octrees-like domain partitioning, three different communication patterns exist: data exchange between blocks of the same size (1), fine-to-coarse communication (data packed on a small and unpacked on a larger block) (2), and coarse-to-fine communication (data packed on a large and unpacked on a smaller block) (3).

With and without refinement, always only one message is exchanged between two processes. This message may contain the data of multiple blocks. When operating in hybrid OpenMP and MPI mode, messages to/from different processes can be sent and received in parallel by different threads. Packing data into buffers prior to sending and unpacking data after receiving can also be done in parallel by different threads if OpenMP is activated. The OpenMP implementation of the communication module follows a task parallelism model. First, a unique work package is created for every face, every edge, and every corner of every block. These work packages are then processed in parallel by all available threads. As a result, as long as all compute kernels are also parallelized using multiple threads (most likely following a data parallelism model), all major parts of the simulation are executed thread-parallel.

5.2 Parallelization Concept Figure 5.3 schematically illustrates the algorithm proposed by [Roh06] and its parallel implementation in this thesis. Initially (1), the simulation is in a post-streaming state. The algorithm starts with performing a collision in coarse and fine cells (2). Cells in ghost layers are not included in the collision. The content of two coarse cells is then transferred to four ghost layers in the fine grid as indicated by the arrow between steps (2) and (3). Next, streaming is performed in both grids, including the two innermost ghost layers of the fine grid (4). In steps (5) and (6), another collision and streaming step is performed in the fine grid only. Streaming again includes the two innermost ghost layers. Finally, PDF values in these two innermost ghost layers are merged and transferred to the coarse grid between steps (6) and (7). However, not all PDF values are transferred from the fine to the coarse cell, but only those streaming into the coarse grid (marked with a hollow, white arrow). The ghost layer of the coarse grid is not required. As a consequence, four ghost layers are only needed for fine blocks which are neighbors to coarser blocks during the coarse-to-fine communication step (3). But four layers of cells are never communicated. Section 5.4 will show that at most two layers are transferred during coarse-to-fine communication. For fine-to-coarse communication and communication between blocks of the same size, significantly less data must be transferred.

47

5 The Lattice Boltzmann Method on Nonuniform Grids

(1)

(2)

(3)

(4)

(5)

(6)

(7)

(1)

(2)

(3)

(4)

(5)

(6)

(7)

Figure 5.3: The series of images above the dashed line illustrates one time step on the coarse grid and two corresponding time steps on the fine grid. The images below the dashed line show the exact same process, but here the data distribution of the parallel implementation is illustrated. Since grid level transitions coincide with block boundaries, this figure also visualizes cells in ghost layers that are involved in the algorithm. Essentially, the conversion from a coarse cell to a fine cell and back again to a coarse cell illustrated in the figures above the dashed line is realized by including the ghost cells of the fine grid in the regular computation in steps (4) to (6) in the parallel implementation. Cells that change their content from one step to the next are marked with a gray, surrounding box. Once step (7) is reached, the state of the algorithm is again identical to step (1).

5.3 Implementation Algorithm 5.1 illustrates the basic structure of the recursive procedure that represents the program flow for the LBM on nonuniform grids. As such, Algorithm 5.1 is the algorithmic representation of the parallelization scheme presented in the lower two thirds of Figure 5.3.

48

5.3 Implementation

Algorithm 5.1: This recursive algorithm illustrates the program flow for the LBM on nonuniform grids. As such, it is the algorithmic representation of the parallelization scheme presented in Figure 5.3. 1 2 3 4 5 6 7

Function NonuniformTimeStep(level L) forall blocks on level L do CollisionStep ::::::::::: if L ≠ finest level then recursively call NonuniformTimeStep(L + 1) end if L ≠ coarsest level then call Explosion(L, L − 1) end call Communication(L) forall blocks on level L do StreamingStep :::::::::::: if L ≠ finest level then call Coalescence(L, L + 1)

8 9 10 11 12

end if L = coarsest level then return end forall blocks on level L do CollisionStep ::::::::::: if L ≠ finest level then recursively call NonuniformTimeStep(L + 1) end call Communication(L) forall blocks on level L do StreamingStep :::::::::::: if L ≠ finest level then call Coalescence(L, L + 1)

13 14 15 16 17 18 19 20 21 22 23 24

/* recursive call */

/* coarse-to-fine communication */ /* initiated by fine level */ /* equal-level communication */ /* streaming step */ /* fine-to-coarse communication */ /* initiated by coarse level */

/* ending of the algorithm */ /* collision step */ /* recursive call */ /* equal-level communication */ /* streaming step */ /* fine-to-coarse communication */ /* initiated by coarse level */

end

25 26

/* collision step */

end

To improve readability, functions that perform the collision or streaming step of the LBM are highlighted with a wavy underline. Functions that involve communication are underlined with a solid line. The algorithm is called with L equal to zero. As a result, the simulation is advanced by one coarse time step (L equal to zero corresponds to the coarsest level). Function Explosion performs coarse-to-fine communication. In its most basic implementation, PDF values from coarse grid cells are homogeneously distributed to fine cells according to fα,f ine (xj , t) = fα,coarse (xi , t), (5.1) where xj are all the cells on the fine grid that correspond to xi on the coarse grid. Function Coalescence performs fine-to-coarse communication. Here, PDF values stored in the fine grid are transferred to the coarse grid according to fα,coarse (xi , t) =

1 n ∑ fα,f ine (xj , t), n j=1

(5.2)

where n is the number of fine cells that correspond to one coarse cell (n = 4 in 2D and n = 8 in 3D). Function Communication takes care of transferring PDF data between grids

49

5 The Lattice Boltzmann Method on Nonuniform Grids on the same level. Boundary conditions are executed as a pre-streaming operation (cf. Section 2.2) and are realized at the beginning of function StreamingStep. Both the streaming and the collision step are only performed in the interior part of the grid, but not in ghost layers, with one exception: For blocks with at least one larger neighbor block with a coarser grid, the treatment of boundary conditions and streaming also include the two innermost ghost layers as depicted in steps (4) and (6) of Figure 5.3. The actual implementation includes further optimizations as outlined in Algorithm 5.2. On the finest grid level, the first streaming and the second collision step (cf. Lines 10 and 17 of Algorithm 5.1) can be combined into one fused stream-collide operation (cf. Line 17 of Algorithm 5.2). Combining the streaming and collision step significantly reduces the amount of data that is transferred between CPU and main memory. Not combining both steps requires to load and store the data from and to memory twice. Fusing both steps allows, for each cell, to read PDF values from neighboring cells (= streaming), perform the collision step, and store the result back to memory [Hag16]. Since most time steps are executed on the finest level, typically the majority of the workload is generated by the blocks on this level. Thus, improving the performance of the algorithm on the finest level is essential for achieving good overall performance. Function StreamCollide also includes the treatment of boundary conditions as a pre-streaming step. When hybrid parallel execution with OpenMP is used, the functions CollisionStep, StreamingStep, and StreamCollide, which are always called on an entire block, follow a non-uniform memory access (NUMA)-aware data parallelism model. The work that must be performed for every cell is then distributed evenly among all available threads. Employing a data parallelism model for the compute kernels for the LBM when using a block-based domain partitioning is also suggested in [Heu09, Kra11]. In general, the aforementioned three functions may include any performance optimizations as long as the employed optimization strategies only affect the current collision, streaming, or fused stream-collide step. Simulations typically rely on compute kernels for the LBM that make use of a “structure of arrays” data layout, loop splitting to reduce the number of concurrent store streams, and SIMD processing [Hag16]. Optimization techniques that do not allow separate collision and streaming steps like for example the AA pattern [Bai09], which requires an adapted collision followed by a fused stream-collide-stream operation, are not straightforwardly compatible with the grid refinement scheme presented in this chapter. As a consequence, Algorithm 5.2 uses a two-grid approach for storing the PDF data and performing the streaming step. As one further step of optimization, communication is executed asynchronously. In particular, this permits to overlap computation and communication. To this end, every communication step is separated into two phases. During the first phase, all message data is packed into a buffer and the communication is initiated by calling a non-blocking, asynchronous send function of MPI. Simultaneously, all the matching MPI receive functions are scheduled. In the second phase, the communication is finalized by waiting for the MPI receive functions to complete. The actual transfer of data is executed between these two phases. As a result, when reaching the second phase, the data that has already arrived can be unpacked and computation can continue immediately without stalling. The first

50

5.3 Implementation

Algorithm 5.2: Optimized version of Algorithm 5.1. As such, this algorithm also illustrates an implementation that represents the parallelization scheme presented in Figure 5.3. However, compared to Algorithm 5.1, this optimized version uses asynchronous communication and fused stream-collide operations in order to achieve better performance. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38

Function AsynchronousNonuniformTimeStep(level L) forall blocks on level L do CollisionStep ::::::::::: if L ≠ coarsest level then call InitiateExplosion(L, L − 1) end call InitiateCommunication(L) if L ≠ finest level then recursively call AsynchronousNonuniformTimeStep(L + 1) // recursive call call InitiateCoalescence(L, L + 1) end if L ≠ coarsest level then call FinishExplosion(L, L − 1) call ExplosionInterpolation(L) // interpolation of exploded values end call FinishCommunication(L) if L = finest level and L ≠ coarsest level then // fused streaming & collision step forall blocks on level L do StreamCollide :::::::::::: else forall blocks on level L do StreamingStep :::::::::::: if L ≠ finest level then call FinishCoalescence(L, L + 1) end if L = coarsest level then return end forall blocks on level L do CollisionStep ::::::::::: end call InitiateCommunication(L) if L ≠ finest level then recursively call AsynchronousNonuniformTimeStep(L + 1) // recursive call call InitiateCoalescence(L, L + 1) end call FinishCommunication(L) forall blocks on level L do StreamingStep :::::::::::: if L ≠ finest level then call FinishCoalescence(L, L + 1) end end

communication between blocks on the same grid level, for example, is initiated in Line 6. The matching completion of this communication operation happens in Line 15. While the corresponding data is transferred between processes, all finer blocks are processed as a result of the recursive call in Line 8, coalescence is initiated, and explosion is finished. Finally, the homogeneous distribution of PDF values from one coarse cell to multiple finer cells (5.1) is improved by a mass and momentum conserving interpolation scheme presented in [Che06]. After all necessary PDF values were sent from the coarse grid, this interpolation is carried out by function ExplosionInterpolation in Line 13. The algorithm proposed in this section allows grid level transitions to be present at

51

5 The Lattice Boltzmann Method on Nonuniform Grids fluid-obstacle interfaces if these interfaces are treated with simple bounce back (no slip) or symmetric (free slip) boundary conditions. To this end, the treatment of boundary conditions is extended to the two innermost ghost layers on fine blocks at the interface to coarse neighbors. Furthermore, cells marked as fluid or obstacle cells must be marked consistently across the interface region where coarse and fine cells overlap. If, at the interface between different grid levels, a coarse cell is marked as being a fluid/obstacle cell, all eight corresponding fine cells must also be marked as fluid/obstacle cells.

5.4 Communication Schemes Although the asynchronous communication scheme of Algorithm 5.2 helps to improve performance, it is still crucial for parallel performance to implement all communication patterns with as little data transfer as possible. This is especially important on systems where no MPI installation is available that is capable of performing asynchronous communication. On these systems, communication is deferred until the MPI wait is executed, at which point a blocking, synchronous communication is performed. The following sections describe how the parallel communication is optimized in all three possible situations, i.e., when communication occurs between blocks on the same level, when data is transferred from coarse to fine grids, and vice versa.

5.4.1 Communication Between Blocks on the Same Level For the communication between blocks on the same level, the algorithm first must check whether or not there exists a larger block with a coarser grid in direct proximity of the two blocks that need to communicate. A block C is referred to as being in direct proximity of two blocks A and B if there exists a non-empty intersection, A ∩ B ∩ C ≠ ∅, of all three blocks. If an intersection of three blocks exists in 3D, the intersection volume is either a line or just a single point. In 2D, if three blocks intersect, they always intersect in a point. The remainder of this section assumes that A and B are adjacent blocks on the same level that need to communicate. Since every block in the BlockForest data structure knows the block IDs of all of its direct neighbors (cf. Section 3.4 and Table 3.1), checking the neighborhood of a block for the existence of a larger block with a coarser grid is a strictly process-local operation. If no larger blocks are detected in direct proximity of blocks A and B, i.e., if all blocks C with A ∩ B ∩ C ≠ ∅ are either smaller blocks with a finer grid or blocks on the same level as A and B, then the same communication scheme as used for simulations without grid refinement is applied. Both blocks A and B exchange one ghost layer and for each cell only those PDF values that stream into the neighboring block are communicated as depicted in Figure 5.1. If, however, there exists a larger block C with A ∩ B ∩ C ≠ ∅ that contains a coarser grid than blocks A and B, then Sections 5.2 and 5.3 showed that the two innermost ghost layers of A and B must be included in the streaming step and in the treatment of

52

5.4 Communication Schemes

Figure 5.4: In direct proximity of a larger neighbor block with a coarser grid, for simulations in 2D, blocks on the same level exchange two layers of cells when communicating across a corner (3D: corners and edges). When communicating across an edge (3D: face), two layers of cells are only transferred on the corner of the edge. On the inside of an edge, only one layer is used and only those PDF values that are about to stream into the neighboring block are sent. If PDF values stream across cell corners in 3D like with the lattice model D3Q27, two layers are not enough and three instead of two layers of cells must be used for the communication across block corners, edges, and face edges.

boundary conditions. Additionally, the implementation must guarantee that after two consecutive streaming steps and boundary condition treatments on A and B, all PDF values in the two ghost layers that are about to be sent to the coarse neighbor C are still valid. As a result, the communication with a neighbor on the same level must be adapted in direct proximity of a larger neighbor block C that contains a coarser grid. If PDF data is transferred across a corner or an edge from block A to B, now M layers of cells, with M = 2 for D3Q19 and M = 3 for D3Q27, are communicated, and for each cell, all PDF values are sent. For communication across a face, however, M layers of cells are only used along the edges of the face. On the inside of the face, only one layer is used. Moreover, for these interior cells, only those PDF values that are about to stream into the neighboring block are sent. This approach is illustrated in Figure 5.4 for two-dimensional simulations (note that edges in 2D correspond to faces in 3D, and M = 2 for D2Q9). In summary, communication between blocks on the same level is, in general, quite similar to the well optimized communication scheme used in simulations without refinement. Only in direct proximity of larger neighbor blocks with coarser grids, slightly more data must be exchanged. Since, for 3-dimensional simulations, the amount of data exchanged between blocks is dominated by communication across faces, ensuring that these transfers operate with the best possible efficiency is crucial for high performance. In conclusion, communication between blocks on the same level never requires to send more than two (D2Q9, D3Q19) or three (D3Q27) layers of cells. In fact, almost always only one layer of cells that only contains those PDF values streaming into the neighboring block are sent.

5.4.2 Coarse-to-Fine Block Communication During coarse-to-fine communication, coarse blocks send corner cells and the appropriate parts of edges and faces to their finer neighbor blocks. Some coarse cells must be sent to multiple fine neighbors. Thus, parts of the coarse block that must be transferred will

53

5 The Lattice Boltzmann Method on Nonuniform Grids

Figure 5.5: During coarse-to-fine communication, in order to also have valid data in all corners and edges of neighboring fine blocks (cells highlighted in gray), certain coarse cells must be sent to multiple fine neighbors. Thus, the parts of the coarse block that must be transferred will overlap.

Figure 5.6: For coarse-to-fine communication, two layers of cells are sent from the interior part of the coarse grid to the four ghost layers of the fine grid. This communication includes all PDF values stored in these cells.

overlap as illustrated in Figure 5.5. For coarse-to-fine communication, the content of two cell layers is sent from the coarse grid to the ghost layers of the fine grid (see Figure 5.6). On the fine grid, these PDF values are distributed to all four ghost layers. As mentioned at the end of Section 5.3, the implementation of this distribution uses a mass and momentum conserving interpolation scheme from [Che06]. During this coarse-to-fine communication, all PDF values are transferred from the coarse to the fine grid. Since the treatment of boundary conditions is also performed on the two innermost ghost layers of the fine grid, no PDF values can be omitted. If only those PDF values are communicated that stream into the fine block, having a simple no slip boundary condition for an obstacle that spans across a grid level transition would lead to an inconsistent state. However, if grid level transitions only occur within regions that entirely consist of fluid cells, sending only one coarse cell instead of two and sending only those PDF values that stream into the fine block is sufficient for a correct implementation.

54

5.4 Communication Schemes Careful benchmarking did not show any significant differences in overall simulation performance between sending two layers of cells and sending only one sparse layer for the coarse-to-fine communication. The reason is that for most simulations, the total communication time is dominated by the communication between blocks on the same level. This is because the bulk of communication volume is typically equal-level communication. Most time steps are performed on the finest grid and thus the majority of communication takes place between same sized blocks on the finest level.

5.4.3 Fine-to-Coarse Block Communication Fine-to-coarse communication is a post-streaming communication step. In contrast to coarse-to-fine communication and communication between blocks on the same level, here, data is read from ghost layers and written into the interior part of the grid (cf. steps (6) and (7) in Figure 5.3). This approach models the fine-to-coarse and coarse-to-fine cell conversions that are present in the grid refinement scheme of [Roh06]. The coarse grid needs information from the two innermost ghost layers of the fine grid. In order to reduce the amount of data that is communicated, all the fine cells that correspond to one coarse cell are already aggregated according to (5.2) while packing the data on the sending side. On the receiving side, i.e., on the coarse grid, these aggregated PDF values are then copied to the appropriate part of the grid. The data of some fine cells, however, never needs to be transferred to the coarse grid because of another neighboring fine block whose communication already includes this data (see Figure 5.7). The following rule applies: If a fine block is connected to a coarse block via a face, these two blocks never communicate across edges or corners. Analogously, if a fine and a coarse block are connected via an edge, there is no communication across corners for these two blocks. As a result, all connections of Figure 5.2(2) that are not also available in Figure 5.2(3) do not perform any communication. In Algorithm 5.2, coarse-to-fine as well as fine-to-coarse communication is only performed on connections illustrated in Figure 5.2(3).

Figure 5.7: If PDF values are sent from fine to coarse blocks, they must be mapped to the right part of the coarse block. Data of cells highlighted in gray never has to be sent. In contrast to coarse-to-fine communication and communication between blocks on the same level, here, PDF values are sent from ghost layers of fine blocks to the interior part of coarse blocks.

55

5 The Lattice Boltzmann Method on Nonuniform Grids

Figure 5.8: During fine-to-coarse communication, PDF values in the coarse cell that originate from the interior part of the coarse block must not be overwritten. Only PDF values that stream into the coarse block (highlighted in black) are sent from ghost layers of fine neighbors. Some PDF values arrive multiple times from different fine source blocks (highlighted with dashed lines).

Also, the implementation must never send all PDF values stored in the fine cells, but only those streaming into the coarse block as illustrated in Figure 5.8. All other PDF values in the coarse cell originate from the interior part of the coarse block and must not be overwritten. Also note that as a result of this parallelization scheme, some of the PDF values that arrive at the coarse block are sent redundantly from different fine source blocks (highlighted with dashed lines in Figure 5.8).

5.5 Load Balancing Requirements A naive load balancing scheme for the LBM might assign each block a weight that corresponds to the number of cells stored on the block as well as the block’s refinement level. Since for blocks on a fine level, twice as many time steps are executed as for blocks on the next coarser level, the weight of a fine block must be twice the weight of an equivalent block on the next coarser level. This approach of assigning block weights works well for LBM-based simulations without grid refinement since for each time step, all blocks can be processed completely independently. Synchronization is only necessary either at the beginning or at the end of each time step. However, distributing all blocks equally and according to their weights is only a suitable load balancing strategy for LBM-based simulations without grid refinement. For the parallel LBM on nonuniform grids based on Algorithm 5.2, this is not a feasible approach anymore. Here, blocks that reside on different levels cannot be processed completely independently. At fixed stages during the algorithm, blocks of different levels must interact with each other via fine-to-coarse and coarse-to-fine communication. For best performance, a suitable load balancing scheme must take into account the structure of the algorithm, including all points of communication.

56

5.5 Load Balancing Requirements

(1)

(1.1)

3

2

1

0

0 1 0 1

0

0 (2.1)

(1.2)

3

3 3 2 2

1

2

(1.3)

3 3

3 2

3

2 2 2 1

1

1 0 0

1 1

0 0

0

1

(2.2)

Figure 5.9: This figure illustrates levelwise load balancing using SFCs for a simulation with four processes. First (1), one global SFC is constructed following Algorithm 4.4. This list of blocks is then split into multiple lists, each of which only contains blocks of the same level as depicted in (1.1) to (1.3). Each list is then partitioned separately according to Algorithm 4.5, resulting in the partitioning to four processes as illustrated in (2.1) and (2.2). Each process ends up with a balanced share of blocks from each level, which perfectly fits the program flow of Algorithm 5.2.

Thus, the load balancing scheme that best fits Algorithm 5.2 processes all levels separately. The result is a levelwise balancing approach where for each level, all blocks that reside on this level are distributed to all available processes. Typically, most of the work is generated by the finest grid levels. As a consequence, perfectly distributing the work generated by these levels is crucial for maximal performance. In principle, many load balancing algorithms are suitable for this kind of problem and even a different load balancing algorithm might be used for each level. In practice, the current implementation does not vary the algorithm from level to level. Current load balancing algorithms for the LBM are either based on SFCs using Morton or Hilbert order or on graph partitioning using a library like METIS [Kar99]. When performing graph-based balancing, all blocks are first sorted by level. Afterwards, the graph partitioner is called multiple times, once for each level. Using SFCs for levelwise load balancing is illustrated in Figure 5.9. Essentially, one global SFC is constructed the same way as described in Section 4.3 and illustrated in Algorithm 4.4. The resulting list of blocks is then split into multiple lists, each of which only contains blocks of the same level. Finally, each list is partitioned according to Algorithm 4.5. Ultimately, levelwise load balancing helps to minimize process idle times for the LBM on nonuniform grids since the levelwise balancing scheme perfectly fits the program flow of Algorithm 5.2. This need for levelwise load balancing was also pointed out, but not yet implemented, in [Has14a]. Even for blocks on the same level with the same number of cells, their weights may vary since blocks that only consist of fluid cells typically generate more work than blocks that contain many cells covered by obstacles. Cells covered by obstacles can be skipped by most parts of the algorithm and hence generate practically no work. Consequently, setting the block weights to a value that is proportional to the number of fluid cells proves to be a good choice. Since the weight of each block is calculated by a user-defined callback

57

5 The Lattice Boltzmann Method on Nonuniform Grids function, also more complex schemes that also take into account the work generated by non-fluid cells as proposed by [Fie12] can be employed. Using the approach of relating the weight of a block only to the number of fluid cells, high performance and excellent scalability on current supercomputers has been demonstrated in [God13] for a simulation of the human coronary artery tree using the data structures developed in this thesis.

58

6 Dynamic Domain Repartitioning The following chapter introduces the different steps of the SAMR process developed for this thesis that are necessary to dynamically adapt the domain partitioning and rebalance and redistribute the simulation data. The implementation of these algorithms strictly follows the open/closed software design principle (see Section 1.2). Consequently, key components of the algorithms are customizable and extensible via user-provided callback functions. These callbacks are fundamental to the software architecture of the SAMR process that is presented in the following sections. They allow to adapt the core algorithms to the specific requirements of the underlying simulation without any need to modify source code in the AMR framework. Section 6.1 provides an overview of the fundamental algorithm that is responsible for the program flow of the entire SAMR procedure. Sections 6.2 to 6.5 then introduce the necessary steps of the SAMR process in detail. Finally, Section 6.6 shows how this implementation of SAMR can be used for nonuniform LBM-based simulations that rely on the parallelization introduced in Chapter 5. Parts of this chapter were first published in [Sch18]. However, compared to [Sch18], all these parts have been extended in content and with additional, new figures and algorithms for this thesis in order to provide a more detailed description of the implemented methods.

6.1 Four-Step Procedure The entire AMR pipeline consists of four distinct steps. In the first step, blocks are marked for refinement and coarsening. In the second step, this block-level refinement information is used in order to create a second, lightweight, proxy data structure that only contains this new topological information but no simulation data. The proxy blocks can be assigned weights that correspond to the expected workload generated by the actual simulation data. In the third step of the AMR procedure, the proxy data structure is then load balanced and proxy blocks are redistributed between all available processes. In the fourth and final step, the actual, still unmodified data structure containing the simulation data is adapted according to the state of the proxy data structure, including refinement/coarsening and redistribution of the simulation data. This four-step AMR procedure is outlined in Algorithm 6.1. Please note that the reevaluation of block weights that potentially results in a redistribution of all blocks can be triggered without the need of block-level refinement or coarsening, meaning the entire pipeline can be forced to be executed without any blocks being marked for refinement

59

6 Dynamic Domain Repartitioning

Algorithm 6.1: Program flow of one AMR cycle. The entire AMR pipeline consists of four distinct steps. First, blocks are marked for refinement/coarsening in Line 2. Then, a proxy data structure representing the new domain partitioning is created in Line 4 and subsequently balanced in Line 5. Finally, the actual simulation data is adapted and redistributed in Line 6. The algorithms that correspond to these four steps are outlined in Algorithms 6.2, 6.3, 6.4, and 6.9. Corresponding illustrations are presented in Figures 6.2, 6.3, 6.5, and 6.8. 1 2 3 4 5 6 7 8 9 10 11

Function DynamicRepartitioning call BlockLevelRefinement /* blocks are marked for refinement/coarsening */ /* (by a user-provided callback and the framework that enforces 2:1 balance) */ while blocks marked for refinement/coarsening exist or block weights must be reevaluated and blocks must be redistributed do call ProxyInitialization /* construction of the proxy data structure */ call DynamicLoadBalancing /* redistribution of proxy blocks */ call DataMigration /* migration and refinement/coarsening of the ... */ /* ... actual simulation data according to the state of the proxy structure */ if multiple AMR cycles are allowed then call BlockLevelRefinement /* same as Line 2 */ end end end

Figure 6.1: Example domain partitioning for a simulation that stores a grid of the same size (4 × 4) on each block and is distributed among two processes (as indicated by the separation of the data into two parts in the figures in the middle and on the right). This example partitioning illustrates the initial state of a simulation before AMR is triggered. The AMR pipeline then starts with distributed block-level refinement/coarsening in Figure 6.2, continues with the creation and load balancing of a lightweight proxy data structure in Figures 6.3 and 6.5, and ends with the actual migration and refinement/coarsening of the grid data in Figure 6.8.

or coarsening. This is useful for simulations where the workload associated with certain blocks can change during runtime such that dynamic rebalancing is desired. Please also note that the implementation allows multiple AMR cycles before the simulation resumes. As a result, blocks can be split/merged multiple times during one AMR cycle. The four steps of the AMR scheme are discussed in more detail in the following four sections. The entire AMR pipeline is also illustrated in Figures 6.2, 6.3, 6.5, and 6.8, starting with the initial domain partitioning outlined in Figure 6.1.

6.2 Distributed Block-Level Refinement The objective of the block-level refinement and coarsening phase is to assign a target level to each block such that Ltarget ∈ {Lcurrent − 1, Lcurrent , Lcurrent + 1}. In order to assign

60

6.2 Distributed Block-Level Refinement target levels to all blocks, the block-level refinement and coarsening phase is divided into two steps. First, an application-dependent callback function is evaluated on every process. The task of this function is to assign a target level to every block handled by the process. As such, initially marking blocks for refinement or coarsening is a perfectly distributed operation that can be executed in parallel on all processes. These new target levels, as set by an application-dependent callback, might violate the 2:1 balance constraint of the domain partitioning. Consequently, after the applicationdependent callback was evaluated, the framework guarantees 2:1 balance by first accepting all blocks marked for refinement and subsequently potentially forcing additional blocks to split in order to maintain 2:1 balance. Afterwards, blocks marked for coarsening are accepted for merging if, and only if, all the small blocks that merge to the same large block are marked for coarsening and if 2:1 balance is not violated. Consequently, blocks marked for refinement by the application-dependent callback are guaranteed to be split, whereas blocks marked for coarsening are only merged into a larger block if they can be grouped together according to the 2:1-balanced octree structure. This process of guaranteeing 2:1 balance can be achieved by exchanging the target levels of all process-local blocks with all neighboring processes. Afterwards, this information can be used to check if process-local blocks must change their target levels, i.e., check if they must be forced to split or are allowed to merge, due to the state of neighboring blocks. This process of exchanging block target levels with neighbors must be repeated multiple times. The number of times every process must communicate with all of its neighbors is, however, limited by the depth of the forest of octrees, i.e., the number of levels available in the block partitioning. Consequently, the runtime of this algorithm only depends on the number of levels in use and the number of neighboring processes, but it is independent of the total number of processes. The algorithm is outlined in Algorithm 6.2. First, the application-dependent callback is evaluated in Line 2 in order to set target levels for all blocks. These target levels are then checked for consistency. For each block, it must hold that Lcurrent − 1 ≤ Ltarget ≤ Lcurrent + 1 and Ltarget ≥ 0. The first loop in Line 4 forces additional blocks to split in order to guarantee 2:1 balance. Here, levels is the number of levels available in the block structure, which is equivalent to the level of the finest block plus one1 . If the current level of a block i is too small compared to the target level of one of its direct neighbor blocks j, then the target level of i is adapted, meaning if Li,current < Lj,target − 1 (for at least one neighbor block j), then Li,target = Li,current + 1. The second loop in Line 14 accepts blocks for merging if 2:1 balance is not violated. In each iteration of this second loop, first current target levels are exchanged. Then, blocks which are possible candidates to merge are determined. A block i is a possible candidate for merging if (a) it is recorded in the list of blocks that intend to merge (see Lines 8 to 13) and (b) none of its neighbor blocks j are smaller, meaning if Li,current ≥ Lj,target (for all neighbor blocks j). The information about blocks that are possible candidates for 1

For a root block, it holds that L = 0. And if the domain partitioning only contains root blocks, then the number of levels = 1.

61

6 Dynamic Domain Repartitioning

Algorithm 6.2: Algorithm for setting block target levels in order to signal the intent to split or merge. Desired target levels are set by an application-dependent callback function in Line 2. The implementation that follows makes sure that all blocks marked for splitting are guaranteed to be split and 2:1 balance of the domain partitioning is maintained. Note that this algorithm is called from/used by Algorithm 6.1. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25

Function BlockLevelRefinement call BlockLevelRefinementCallback perform consistency checks for the target levels set by the callback function for i = 0 ; i < levels − 1 ; i = i + 1 do // mark blocks for refinement/splitting communicate block target levels to neighbor processes mark additional blocks for splitting to maintain 2:1 balance by setting their target levels accordingly end foreach block b do if Lb,target < Lb,current then // b is marked for coarsening record b’s intent to merge set Lb,target = Lb,current end end for i = 0 ; i < levels − 1 ; i = i + 1 do // accept blocks for coarsening/merging communicate block target levels to neighbor processes determine blocks that are potential candidates for merging communicate these candidates to all neighbor processes decide which blocks are allowed to merge and set their target levels accordingly foreach block b do if Lb,target < Lb,current then // b is allowed to merge remove b from the list of blocks that intend to merge end end end end

merging is then exchanged between neighboring processes. If a block i is a candidate for merging and if (a) this block has 7 neighbors that belong to the same octet as i and (b) all these 7 other octet member blocks are also candidates for merging, then block i is allowed to merge and Li,target = Li,current − 1. It they do exist, the 7 neighbor blocks that belong to the same octet as block i all have the exact same block ID as i, except for the three least significant bits. The block-level refinement and coarsening phase is also illustrated in Figure 6.2. Note that the iterative process of evaluating block neighborhoods multiple times means that the medium sized blocks in Figure 6.2(4) are only accepted for coarsening after their smaller neighbor blocks were accepted for coarsening in Figure 6.2(3). Since this perfectly distributed algorithm causes every process to possess only local, but no global knowledge, every process must assume that on distant, non-neighbor processes blocks are marked to be split or to be merged. Consequently, all processes must continue with the next step in the four-step AMR procedure, even if there are no changes to the block partitioning. The actual implementation of Algorithm 6.2 therefore can optionally make use of two global reductions of a boolean variable as a means of optimizing the execution time as follows: Immediately after the application-dependent callback is executed, the first reduction can be used to abort the entire AMR procedure early if no blocks have been marked for refinement or coarsening. Even if some blocks are marked for coarsening, they

62

6.3 Proxy Data Structure

(1)

(2)

(3)

(4)

Figure 6.2: Starting from the domain partitioning outlined in Figure 6.1, during the distributed block-level refinement/coarsening phase, an application-dependent callback function determines which blocks must be split and which blocks may be merged (figure on the bottom left). Note that blocks are marked for coarsening independent of their neighbors. After the evaluation of the callback function, all blocks marked for refinement are accepted (figure (1) on the right) and additional blocks are registered for refinement by an iterative process in order to maintain 2:1 balance (2). Finally, another iterative procedure accepts blocks for coarsening if the entire octet of blocks (quartet in 2D) is marked for coarsening and if 2:1 balance is not violated (figures (3) and (4)).

all might violate the requirements that are necessary for merging. Therefore, a second reduction at the end of Algorithm 6.2 provides another opportunity to terminate the AMR process early. On current petascale supercomputers, the benefit of aborting the AMR algorithm, and thus not having to construct the proxy data structure and execute the load balancing procedure, outweighs the cost of the two global reductions.

6.3 Proxy Data Structure The AMR implementation of this thesis is characterized by using a light-weight proxy data structure to manage load balancing and the dynamic data redistribution in an efficient way. For this, note that the first phase of the AMR procedure outlined in the previous section only assigns target levels to all blocks, but it does not yet apply any changes to the block partitioning. During the second step of the AMR procedure, these target levels are used in order to create a second block partitioning that conforms with the new topology as defined by the target levels. This second data structure acts as a proxy for the actual, still unmodified simulation data structure. Therefore, this thesis will use the terms proxy data structure/proxy blocks and actual data structure/actual blocks in order to clearly distinguish between the two data structures whenever deemed necessary to prevent ambiguity. In the implementation, the proxy data structure is represented by classes ProxyBlockForest and ProxyBlock, while the actual simulation data structure is represented by classes BlockForest and Block (see Section 3.4 and Table 3.1).

63

6 Dynamic Domain Repartitioning

Figure 6.3: Using the results from the block-level refinement phase outlined in Figure 6.2, a lightweight proxy data structure that conforms with the new topology is created. As a result, every process stores the current, unmodified actual data structure that maintains the simulation data (figure in the middle) as well as the temporarily created proxy data structure that does not store any simulation data (figure on the right). Additionally, every block in each of these two data structures maintains a link/links (which are not visualized in the illustration) to the corresponding block(s) in the other data structure. Note that initially the proxy data structure is in an unbalanced state. In this example, most of the blocks, including all of the smallest blocks, are initially assigned to the same process. Therefore, load balancing the proxy data structure is the next step in the AMR pipeline (see Figure 6.5).

The proxy data structure only manages the process association and the connectivity information of all of its blocks, but it does not store any simulation data. Furthermore, during creation of the proxy data structure, links are established between all proxy blocks and their corresponding actual blocks. Consequently, every proxy block always knows the process where its corresponding actual block is located, and vice versa. Particularly during the third step of the AMR procedure when proxy blocks might migrate to different processes, maintaining these bilateral links is vital for the success of the entire AMR scheme. These links are represented by a target process that is stored for each actual block. This is the process owning the corresponding proxy block. Eventually, after the load balancing, this is the process the actual block must migrate its data to, hence the term target process. Additionally, there is a source process stored for each proxy block. Analogously, this is the process on which the corresponding actual block is located on, meaning, after the load balancing, this is the process from which the actual block data must be fetched, hence the term source process. If an actual block corresponds to eight smaller proxy blocks due to being marked for refinement, the actual block stores eight distinct target processes, one for each proxy block. Similarly, if eight actual blocks correspond to one larger proxy block due to all actual blocks being marked for coarsening, the proxy block stores eight source processes. The creation of the proxy data structure is illustrated in Figure 6.3. Note that for eight blocks to be merged into one larger block, the eight smaller blocks do not have to be located on the same process. Also, the creation of all proxy blocks, including the initialization of target and source processes for proxy and actual blocks, is a process-local operation. Only when setting up the connectivity information for the proxy blocks so that each proxy block knows the block IDs and process ranks of all of its neighbor blocks, communication with neighboring processes is required. Consequently, the runtime of this second step of the AMR procedure, the creation of the proxy data structure, is, just as the previous block-level refinement and coarsening phase, also independent of the total number or processes. It only depends on the number of local neighbor processes.

64

6.3 Proxy Data Structure

Algorithm 6.3: Algorithm for creating the proxy data structure from the actual simulation data structure. This algorithm is called from/used by Algorithm 6.1. In order to create the proxy data structure, the algorithm uses the target levels that were assigned to each actual block by Algorithm 6.2. The implementation does not just create proxy blocks, but also initializes the connectivity information for the proxy blocks so that each proxy block knows the block IDs and process ranks of all of its neighbor blocks. Additionally, the process neighborhood information for the proxy data structure is set up. Note that empty neighbor processes from the actual simulation data structure must also be included. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

16 17 18

19 20 21 22 23 24 25 26 27 28 29 30

31

32

Function ProxyInitialization foreach process local block b do if Lb,target = Lb,current then // no change, the target level is equal to b’s current level create proxy block p for block b on current process set target process of block b to current process set source process of proxy p to current process else if Lb,target > Lb,current then // split create 8 smaller proxy blocks from b on current process set corresponding 8 target processes on block b to current process set source process of all 8 proxy blocks to current process else // Lb,target < Lb,current → merge if the three least significant bits of b’s block ID are all zero then create larger proxy block p from b on current process set target process of block b to current process initialize the 8 source processes of proxy p by looking at all neighbor blocks of b and finding the other 7 blocks (and their process ranks) that belong to the same octet than b else foreach neighbor block n of b do if n belongs to the same octet than b and the three least significant bits of n’s block ID are all zero then // n is the block for which the condition in Line 12 is true, meaning ... // ... the corresponding merged proxy block is created on n’s process set target process of block b to the process rank of n end end end end end if changing the maximum number of levels at runtime is allowed then synchronize the number of levels available in the proxy block structure among all processes // requires a global reduction end send block IDs of all local proxy blocks to all neighbors forward the block IDs received from every neighbor process to all other neighbor processes // the process rank corresponding to every block ID is also forwarded use this information from the 1st and 2nd process neighborhood to set up the proxy block connectivity (≙ register the block IDs and process ranks of neighbors at each proxy block) set up process neighborhood information for the proxy data structure // may differ from the process neighborhood of the actual simulation data structure – please note that the process neighborhood of the actual simulation data strcuture was still used in Lines 28 and 29 end

The creation of the proxy data structure is also outlined in Algorithm 6.3. The construction of all proxy blocks from the blocks in the actual simulation data structure happens in the main loop in Line 2. Here, the target levels from the actual blocks are used to construct appropriate proxy blocks. Additionally, the links between proxy and actual blocks are initialized by setting up source (for proxy blocks) and target (for actual blocks) process

65

6 Dynamic Domain Repartitioning

0

0

1

2

0

1

2

2

0

1

1

2

(1.1)

(1.2)

0

1

0

0

2 2

2

1

2

(2)

Figure 6.4: The numbers inside the blocks represent the processes the blocks are currently located on. In the actual simulation data structure on the left (1.1), process 2 is not a neighbor of process 0. Only process 1 is a neighbor of process 0. As indicated by the figure in the middle (1.2), four small blocks are marked to be merged. In the proxy data structure, the resulting larger block is assigned to process 0. In order to set up the block neighborhood information for this new block, in addition to blocks from processes 0 and 1, blocks from process 2 must also be taken into account, although process 2 and process 0 do not contain any directly connected blocks in (1.1). In the proxy data structure, the process neighborhood is different to the process neighborhood of the corresponding actual simulation data structure.

values. For each block, setting up the information about neighbor blocks, their block IDs and process ranks, is not yet possible at this point in the algorithm. This initialization of the proxy block connectivity must be deferred until later in the algorithm when all proxy blocks are completely constructed. If the AMR pipeline is configured to allow the dynamic increase in the number of available levels, a global synchronization is necessary in order to inform every process about the number of levels available in the proxy data structure. An increase in the number of currently available levels happens if an actual block on the finest level is marked to be split. This synchronization is realized via a global reduction in Line 26. It is necessary for every process to know the number of globally available levels in the data structure, for example, for the correct implementation of Algorithm 6.2 (Lines 4 and 14). After all proxy blocks are completely constructed, Algorithm 6.3 sends their IDs to all neighbor processes (Line 28). Every process then forwards all the block IDs received from each neighbor process, together with the process ranks the IDs originated from, to all other neighbor processes (Line 29). As a result, the information about every block in the form of a pair does not only travel to direct neighbor processes, but also to all neighbors of neighbors. With all of this information, it is possible to set up the proxy block connectivity (Line 30). For each process-local proxy block, all its neighboring blocks are determined using the point probing approach described at the end of Section 4.1 and illustrated in Figure 4.3. The set of potential neighbor blocks consists of all other local proxy blocks and all proxy blocks whose information was received during the previous two-stage block information exchange. The necessity of this two-stage block information exchange is illustrated in Figure 6.4. This two-stage approach is only necessary since at this point in the algorithm, the information about neighbor processes is still coming from the actual simulation data structure. Therefore, the final task of Algorithm 6.3 is to generate the process neighborhood information for the proxy data structure. Primarily, neighbor processes are determined by examining the process ranks of the neighbor blocks of all process-local proxy blocks.

66

6.4 Dynamic Load Balancing However, as explained in Section 4.2, empty processes that currently do not contain any blocks must not be forgotten. If an empty process is not registered with at least one other regular process, the empty process is not considered by diffusion-based algorithms that operate on the process graph. Consequently, empty processes maintained as neighbors in the actual simulation data structure are also maintained as neighbors in the proxy data structure. Moreover, due to blocks being merged, a process can end up without any proxy blocks, even if the process does contain some actual blocks. Such an empty process emerges if it only contains actual blocks that are all marked to be merged and where the resulting proxy blocks are located on other processes. These empty processes (they are considered empty only from the perspective of the proxy data structure) must also be considered when setting up the process neighborhood information. Once the generation of the process neighborhood information is finished, the proxy data structure is fully constructed and can be used in the next, third step of the SAMR pipeline.

6.4 Dynamic Load Balancing The third step in the SAMR pipeline is the dynamic load balancing phase. Here, the goal is to redistribute the proxy blocks among all processes such that the proxy data structure is in balance. Similar to the block-level refinement phase outlined in Algorithm 6.2, the dynamic load balancing procedure is also divided into two parts, where first a simulationdependent, possibly user-provided callback function is executed. Here, the main task of the callback function is to determine a target process for every proxy block. Hence, this callback function represents the actual load balancing algorithm. Once this callback is finished and returns, the second part of the dynamic load balancing procedure, the framework part, takes over the program control and transfers the proxy blocks to their just assigned target processes. During this migration process, the framework also maintains the bilateral links between proxy blocks and actual blocks. Note that this means that proxy blocks contain records of target as well as source processes. The target process record of a proxy block determines to which process the proxy block must be transfered during the dynamic load balancing. The source process record points to the process where the corresponding actual block is still residing. On the other hand, actual blocks only contain records of target processes. The target process record of an actual block points to the process where the corresponding proxy block is currently residing. If this proxy block is transfered to another process during the dynamic load balancing, the target process record of the actual block must be adapted in order to maintain the bilateral link. Apart from assigning target processes to proxy blocks, the callback function at the beginning of the load balancing procedure must perform two more tasks. It must also inform the local process about all the other processes from which proxy blocks are expected to arrive and it must return whether or not another pass of the entire dynamic load balancing procedure must be performed. Requesting another pass of the entire dynamic load balancing procedure enables iterative balancing schemes, as will be used

67

6 Dynamic Domain Repartitioning

Figure 6.5: Exemplary illustration of how the proxy data structure created in Figure 6.3 may be balanced. During dynamic load balancing, an application-dependent callback function determines a target process for every proxy block. Blocks with target processes different to their current processes are highlighted in the figures on the left and in the middle. The framework then takes care of transferring these proxy blocks to their target processes. This transfer includes an update of all links between these proxy blocks and their counterparts in the simulation data structure. As illustrated in the figure, this process of first determining target processes for the proxy blocks and then migrating them accordingly can be repeated multiple times, enabling iterative, diffusion-based load balancing schemes. In this example, balance is achieved after two passes of the load balancing procedure. Nevertheless, globally load balancing the proxy data structure in just one pass is also possible.

in Section 6.4.2. Executing the dynamic load balancing procedure multiple times means proxy blocks are also exchanged multiple times. Transferring a proxy block to another process, however, only requires to send a few bytes of data: its block ID, the source process of its corresponding actual block, and the block IDs of its neighbors. As such, the transfer of these lightweight proxy blocks is an inexpensive communication operation, especially compared to the migration of an actual block with all of its simulation data, which typically requires to send orders of magnitude more bytes of data. The realization of the core load balancing algorithm as a callback function allows to perfectly adapt the dynamic load balancing process to the requirements of the underlying simulation. It is also possible to augment each proxy block with additional data that is also transfered when proxy blocks migrate to other processes. The additional proxy data can, for example, be used to encode the workload/weight of a block. This proxy block data is realized as a container of std::any objects. Generally, the extensibility of the proxy blocks with additional data is similar to the support for arbitrary simulation data presented in Section 3.3. Similar to the interface outlined in Code Snippet 3.1, appropriate initialization and serialization functions must be provided for all proxy block data. The serialization functions are required when a proxy block is transfered to another process. The dynamic load balancing procedure is illustrated in Figure 6.5 and outlined in more detail in Algorithm 6.4. As mentioned above, the entire load balancing procedure may consist of multiple passes. This iterative approach is realized by wrapping the implementation of the entire algorithm in one large loop that spans from Line 2 to Line 9. In order to update the information that each proxy block stores about its neighbors, neighboring processes exchange the proxy blocks’ target process records prior to the migration of the blocks. Also before the migration of the proxy blocks, the links between proxy and actual blocks are updated by setting the target process records of the actual blocks to the target process values of the corresponding proxy blocks. This update requires point-to-point communication between the process that currently stores the proxy block and the process that still holds the corresponding actual block.

68

6.4 Dynamic Load Balancing

Algorithm 6.4: Algorithm for load balancing the proxy data structure that was created by Algorithm 6.3. First, a simulation-dependent, possibly user-provided callback function determines where proxy blocks need to be transferred in order to achieve balance. In the second part of the algorithm, the framework takes care of migrating the proxy blocks and restoring a consistent state for the proxy data structure. Note that the algorithm allows multiple passes of the entire load balancing procedure, enabling iterative, potentially diffusion-based, balancing schemes. Also note that this algorithm is called from/used by Algorithm 6.1. 1 2 3

4 5 6 7 8 9 10

Function DynamicLoadBalancing repeat call DynamicLoadBalancingCallback /* must assign a target process to every ... */ /* ... proxy block, inform the local process about all the other processes from ... */ /* ... which proxy blocks are expected to arrive, and return whether or not ... */ /* ... another pass of the entire dynamic load balancing procedure must be performed */ inform neighbor processes about the target processes of all local proxy blocks ... ... then: adapt the neighborhood records of proxy blocks update bilateral links between proxy blocks and actual blocks ... ... then: transfer proxy blocks to their target processes update process neighborhood information of the proxy data structure until the callback in Line 3 does not request anymore iterations end

For the migration of the proxy blocks in Line 7, the implementation relies on (a) the target process records set for all proxy blocks and (b) the information on each process about all other processes from which proxy blocks are expected to arrive. The transfer of the proxy blocks is then realized by point-to-point communication between the respective processes. Once the proxy blocks have arrived at their target processes, the process neighborhood information of the proxy data structure is updated. Spatial attention must again be payed to temporarily empty processes that currently do not contain any proxy blocks. Once the process neighborhood information is updated, the proxy data structure is in a consistent state and computation can continue with either another pass of the dynamic load balancing procedure or the next step in the SAMR algorithm, the final migration of the actual simulation data. Since the framework part of Algorithm 6.4 only uses next-neighbor as well as point-to-point communication, its runtime is independent of the total number of processes. It mainly depends on the number of neighbor processes and the number of blocks per process. Consequently, the runtime and scalability of the entire dynamic load balancing procedure is mainly determined by the runtime and scalability of the callback function. Ultimately, for large-scale simulations, the chosen dynamic load balancing algorithm determines the scalability of the entire AMR scheme. Two different dynamic load balancing approaches will be presented in the following. Section 6.4.1 outlines a balancing scheme based on SFCs that requires global data synchronization among all processes, whereas Section 6.4.2 presents a balancing algorithm built on a fully distributed, diffusion-based redistribution scheme. The runtime of the latter is independent of the total number of processes. Since the dynamic load balancing algorithms only operate on proxy blocks, the terms “proxy blocks” and “blocks” are used synonymously in Sections 6.4.1 and 6.4.2.

69

6 Dynamic Domain Repartitioning

6.4.1 Space Filling Curves Regardless of whether Morton or Hilbert order as outlined in Section 4.3 is used during the SFC-based dynamic load balancing phase, the construction of the curve as implemented in this thesis requires a global information exchange among all processes. This global data synchronization is usually best realized with an allgather operation [Bur11]. If levelwise balancing is not required, then the splitting and merging of blocks can only change the local number of blocks on each process, but it can never corrupt the global order where (a) each process only owns one contiguous piece of the SFC-ordered list of all blocks and (b) consecutive processes also own consecutive pieces of this list.2 This special property of SFCs is illustrated in Figure 6.6. If, in addition, all blocks share the same weight, it is immediately obvious that the global synchronization of the number of blocks stored on each process is enough to determine where blocks need to migrate in order to achieve a balanced redistribution. Consequently, this approach results in the synchronization and global replication of one number (typically 1 byte is enough) per process. If the blocks must be balanced per level, the SFC is used to construct one list of blocks for every level. Load balance is then achieved by distributing each list separately (see Figure 5.9). As illustrated in Figure 6.6, this per-level ordering can get corrupted as a result of dynamic refinement and coarsening. Consequently, restoring the order for every 2

Note that this implies that the optional remapping of process ranks during the initialization of the domain partitioning presented in Section 4.2 must not be executed.

3 0

0

1

1

2

2

2

3

2 0

3

0

1

1

(1.1)

0

1

1

2 (2.1)

3

3

3

3

(1.2)

0 0

3

2

3

3

1

2 0

0

1

1

1 0

0

0

0

(2.2)

Figure 6.6: Example of a SFC-based distribution to 4 processes (the small numbers inside the blocks represent the processes the blocks are located on). In (1.1) and (1.2), the blocks are distributed to the 4 processes following the global ordering as given by the curve. If blocks split/merge due to refinement/coarsening (indicated by the gray boxes in (1.1)), the new blocks always align themselves correctly (1.2): Blocks along the new curve will always still be in order, but rebalance may be required. If, however, levelwise balancing is required where blocks on different levels must be balanced separately (as is illustrated in (2.1) and (2.2)), new blocks that are the result of refinement or coarsening may fall out of line (2.2): The lists of blocks for every level are not in order anymore. In contrast to (1.2), rebalancing (2.2) also requires a complete reordering, which is a far more expensive operation in terms of the amount of information that must be synchronized between all processes.

70

6.4 Dynamic Load Balancing Table 6.1: Typical amount of data that must be globally replicated and synchronized to all processes (via a single allgather operation) when using the SFC-based dynamic load balancing implementation. blocks must be balanced per level

every block has the same weight

no

yes

yes

1 byte per process

2-4 bytes per block

no

1-4 bytes per block

3-8 bytes per block

level and the subsequent rebalancing is not cheap anymore and requires an allgather of all block IDs (typically 2 to 4 bytes per block). If Morton order needs to be restored, each process must sort all block IDs lexicographically based on their bit representation (see Section 4.3). If Hilbert order is desired, the tree structure must be reconstructed from the block IDs and a Hilbert order-based list must be created as outlined in Section 4.3. Balancing the list of block IDs then follows the implementation of Algorithm 4.5. If, additionally, blocks are allowed to have individual weights, the amount of data that must be synchronized increases further by 1 to 4 bytes for every block. In case blocks have individual weights, it also makes no big difference if levelwise balancing is required or not since all these individual weights must be synchronized between all processes. The amount of data that must be globally replicated in the different cases is summarized in Table 6.1. Ultimately, the SFC-based load balancing algorithm always needs a global synchronization regardless of whether levelwise balancing or individual block weights are required. If blocks have individual weights or if levelwise balancing is required, the memory consumption per process increases linearly with the number of globally available blocks. Moreover, the construction of the SFC-ordered list requires O(NlogN) time (N being the number of blocks). If all blocks share the same weight and if per-level balancing is not necessary, SFC-based dynamic load balancing may, however, still be feasible with several millions of processes. In this special case, the memory consumption per process is of order O(NP), with NP denoting the number of processes. The rebalancing/repartitioning, which does not require any sorting, but only the calculation of the number of blocks that must be assigned to each process, can also be done in O(NP) time. Most notably, in this special case, runtime as well as memory consumption are both independent of the number of globally available blocks. Moreover, although not implemented for this thesis, the global synchronization of the entire curve can be avoided if the generation and the partitioning of the curve is done in parallel by an algorithm based on a distributed sorting network [Lui11]. This approach, however, comes at the expense of having to perform a significant amount of successive, pairwise communication exchanges (O(log 2 NP) on average) for each process.

6.4.2 Distributed Diffusion The idea of diffusion-based load balancing schemes is to use a process motivated by physical diffusion in order to determine, for every process x, the neighboring processes to

71

6 Dynamic Domain Repartitioning which local blocks of x must be transferred. Executing this diffusion-based rebalancing and then migrating proxy blocks accordingly is repeated multiple times, leading to an iterative load balancing procedure that allows blocks to migrate to distant processes. Diffusion-based load balancing schemes may not always achieve a perfect global balance, but they can be expected to quickly (≙ few iterations) eliminate processes with high load. Eliminating peak loads is essential to avoid bottlenecks and to achieve good parallel efficiency. Each step of the diffusion-based algorithm only relies on next-neighbor communication. As a result, if the number of iterations is fixed, the runtime as well as the memory consumption is independent of the total number of processes as well as the number of globally available blocks. Whereas the SFC-based balancing scheme relies on the octreelike domain partitioning, the diffusion algorithm operates on the distributed process graph, which is represented by the process neighborhood information that is maintained by the block data structure. If, for two processes i and j, there exists at least one block on process i which is directly connected via a face, edge, or corner with a block on process j, the two processes i and j are referred to as being connected/neighbors. Additionally, as mentioned in previous sections and chapters, initially empty processes may also be registered as neighbors of regular processes. The implementation of the diffusion-based balancing algorithm consists of two main parts. In the first part, a diffusion scheme originally proposed by [Cyb89] is used to determine the flow of process load fij that is required between all neighboring processes i and j in order to achieve load balance among all processes. In the second part, the results from the first part are then used to decide which blocks must be moved to which neighbor process by matching block weights to process inflow and outflow information. If load balancing per level is required, all data is calculated for each level based only on the blocks of the corresponding level. However, besides calculating data separately for each level, program flow is otherwise identical to an application that does not require per-level balancing. Consequently, all algorithms presented in this section can easily be adapted to perform load balancing per level, all operations simply have to be carried out on a per-level basis. The general algorithm is outlined in Algorithm 6.5. The diffusion scheme starts by calculating the process load wi of every process by summing up the weights of all processlocal blocks. The flow between all neighboring processes is then determined by an iterative process. First, every process i calculates the current flow fij′ between itself and every neighbor process j with [Cyb89] fij′ = αij ⋅ (wi − wj ), where αij follows the definition of [Boi90] which allows αij to be determined with nextneighbor communication only. As suggested by [Boi90], αij =

72

1 , max(di , dj ) + 1

(6.1)

6.4 Dynamic Load Balancing

Algorithm 6.5: Algorithm for performing dynamic load balancing based on a distributed, next-neighbor only diffusion scheme. As such, this algorithm represents a possible implementation for the dynamic load balancing callback function used in Line 3 of Algorithm 6.4. First, the algorithm determines the flow of process load fij that is required between all neighboring processes i and j in order to achieve load balance among all processes. These fij are then used to decide which blocks must be moved to which neighbor process. This decision is realized by either a pull or a push scheme outlined in Algorithms 6.6 and 6.7. 1 2 3 4 5 6 7

Function DiffusionLoadBalancing calculate process weight wi // ≙ sum of all local block weights ≙ process load determine number of neighbor processes di exchange di with all neighbor processes foreach neighbor process j do fij = 0 // ≙ flow between current process i and process j αij = max(d1,d )+1 i

9 10 11 12 13 14 15 16 17 18 19 20

j

end repeat /* calculate flow fij ... */ exchange wi with all neighbor processes /* ... that is required ... */ wi′ = wi /* ... between all ... */ foreach neighbor process j do /* ... neighbor processes ... */ ′ = α ⋅ (w ′ − w ) fij /* ... i and j ... */ ij j i ′ fij += fij /* ... in order to ... */ ′ wi −= fij /* ... achieve balance */ end until predefined max. number of iterations is reached // “flow” iterations call DiffusionPush(fij ) or DiffusionPull(fij ) in order to determine which blocks are sent to which neighbor process inform neighbor processes about whether or not blocks are about to be sent to them so that they know about all the processes from which blocks are expected to arrive

8

end

with di and dj denoting the number of neighbor processes of i and j, respectively. The current flow fij′ is then used to adapt the process loads wi and wj of processes i and j. Note that no blocks are yet exchanged. This procedure of calculating flows and adapting process loads accordingly is repeated for a fixed number of iterations. For the remainder of this thesis, these iterations are referred to as “flow iterations” as opposed to the number of “main iterations” of the dynamic load balancing procedure outlined in Algorithm 6.4 (Lines 2 to 9). Consequently, the diffusion-based load balancing approach represents an iterative load balancing scheme with nested iterations: For every main iteration, first a fixed number of flow iterations is executed and then proxy blocks are exchanged between neighboring processes. The resulting dynamic load balancing scheme therefore can be seen as a nested diffusion approach. The flow fij , which is used to determine the blocks that must be exchanged between processes i and j, eventually results from the summation of all fij′ . A value for fij greater than zero indicates outflow, whereas fij < 0 indicates inflow. This process of calculating fij values is also illustrated in detail in Figure 6.7. In order to determine which blocks are sent to which neighbor process, two different approaches are proposed that are referred to as “push” and “pull” scheme for the remainder of this thesis. When using the push scheme, overloaded processes decide which blocks to push to which neighbor, whereas when using the pull scheme, underloaded processes choose neighbors from which blocks are requested. A major challenge arises from the fact that although every connection to a neighbor process is assigned a certain outflow or inflow

73

6 Dynamic Domain Repartitioning

2 4

12 2

α = 1⁄4

14

2 2.0

12

2.0

20 (2.1)

α = 1⁄3

4 12

12

2

α = 1⁄3

20

α = 1⁄4

α = 1⁄4

(1)

2.7

0.5

13.5

14

4

12.7

12

0.5

(2.2)

1.2

5.8

0.4

0.9

7.3

13.1

13.5

0.4

11.2 (2.5)

13.1

12

0.3

11.4

0.4

(2.3)

1.2

7

(2.4)

1.0

7.3

2 0.3

12

0.4

0.7

7.6

0.9

11.8

5.1

0.4

0.6 12.7

4.2

0.4

1.5

0.7 4.0

4

20 (3)

4.4 8.3

4

Figure 6.7: Illustration of Algorithm 6.5. Figure (1) shows 5 processes and how they are connected to each other. The numbers inside the circles represent the processes’ weights wi . As illustrated on the left, each process typically contains multiple blocks – a process with wi = 12 might contain 4 blocks with weights 2, 2, 4, and 4. Figure (1) also outlines the αij values (6.1) that correspond to each process-process connection. Figures (2.1) to (2.5) illustrate 5 “flow” iterations (Lines 9 to 17 in Algorithm 6.5). Here, the numbers ′ on the process-process connections represent the fij values that correspond to each iteration. After each iteration, the process weights are adjusted, but no blocks are yet exchanged. Figure (3) then illustrates the state after all flow iterations are finished. Now, the numbers on the process-process connections represent ′ the final fij values, which result from the summation of all fij values. These fij values are then used by Algorithms 6.6 and 6.7 in order to determine which blocks are sent to which neighbor process (Line 18 in Algorithm 6.5). Note that this approach can lead to a desired outflow that surpasses that current weight of the process, as can be seen for the process in the lower right corner: Currently, the process possesses blocks with an accumulated weight of 4, but the algorithm determined that the process must pass on blocks with an accumulated weight of 4.4. Consequently, in certain cases, it’s impossible to achieve balance with only next-neighbor block exchange. Because of this reason, Algorithm 6.5 is called several times in succession by Algorithm 6.4. Note that in this example, perfect balance means that each process contains blocks with an accumulated weight of 10. However, perfect balance can only be achieved if the weights of the individual blocks even allow a partitioning into 5 times an exact weight of 10. Also note that although optimal balance may not be achieved by executing Algorithm 6.5 only once, this one execution is enough to considerably reduce the initial peak load of 20.

value, these flows are almost always impossible to satisfy because they rarely match with the weights of the available blocks. Only entire blocks can be migrated. It is impossible to only send part of a block to another process. For SAMR codes, computation is usually more efficient when choosing larger blocks. For mesh-based methods like the LBM, larger

74

6.4 Dynamic Load Balancing

Algorithm 6.6: Schematic program flow of the “push” approach for determining the blocks that must be sent to neighbor processes on the basis of the inflow/outflow values fij . Note that this algorithm is one of the two potential algorithms that are called at the end of Algorithm 6.5. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

Function DiffusionPush(fij ) outflow i = 0 // accumulated process outflow foreach neighbor process j do if fij > 0 then outflow i += fij end while outflow i > 0 and a process j with fij > 0 exists do pick process j with largest value for fij construct a list of all the blocks that can be moved to process j and are not yet marked for migration to another process out of the previous list, pick block k that is the best fit for migration to process j and whose weight wk ≤ outflow i if such a block k exists then mark block k for migration to process j fij −= wk outflow i −= wk else fij = 0 end end end

blocks result in less communication overhead. With larger blocks, the compute kernels may also take better advantage of SIMD instructions. Therefore, in practice, large-scale simulations are often configured to use only few blocks per process. However, in these cases where only few blocks are assigned to each process, the weight of any block might exceed every single outflow/inflow value. Therefore, the central idea of the push scheme is to first calculate the accumulated process outflow outflowi of every process i by summing up all fij > 0. If the weight wk of a block k is less than or equal to outflowi , the block k is a viable candidate for being pushed to a neighbor process. Moreover, all fij > 0 are only used as clues in order to determine which neighbor process j is a viable candidate for receiving blocks. Similarly, the pull scheme first calculates the accumulated process inflow inflowi of every process i by summing up all fij < 0. Neighbor blocks are viable candidates for being requested if their weights do not exceed the accumulated process inflow. Here, all fij < 0 are used as clues in order to determine which neighbor process j is a viable candidate for providing blocks. The push scheme is outlined in Algorithm 6.6. Inside the main loop (Line 6), the algorithm picks the neighbor process j that is currently associated with the highest outflow value. If the algorithm is able to identify a block that can be sent to process j, then the accumulated process outflow outflowi as well as the flow fij towards process j is decreased by the weight of the block (Lines 12 and 13). If multiple blocks on process i are viable candidates for being sent to process j, the algorithm picks the block that is the best fit for migration to process j. A block is considered a good fit for migration if its connection to process i is weak and/or its connection to process j is strong. If, for example, a block m is connected to only one other block on process i, but to two blocks on process j, then m is considered a better choice for migration to j than a block n which is connected to two blocks on

75

6 Dynamic Domain Repartitioning

Algorithm 6.7: Schematic program flow of the “pull” approach for determining the blocks that must be sent to neighbor processes on the basis of the inflow/outflow values fij . Note that this algorithm is one of the two potential algorithms that are called at the end of Algorithm 6.5. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29

Function DiffusionPull(fij ) inflow i = 0 // accumulated process inflow foreach neighbor process j do if fij < 0 then inflow i −= fij end send a list of all local blocks (block IDs only) and their corresponding weights to all neighbor processes while inflow i > 0 and a process j with fij < 0 exists do pick process j with smallest value for fij // ≙ largest inflow construct a list of all remote blocks that can be fetched from process j and are not yet candidates for migration to the current process i out of the previous list, pick remote block k that is the best fit for migration from process j and whose weight wk ≤ inflow i if such a remote block k exists then locally bookmark remote block k as candidate for migration from process j to the current process i fij += wk inflow i −= wk else fij = 0 end end send a request to every neighbor process containing a list of all the remote blocks that process i wants to fetch foreach local block k do if block k is requested by only one neighbor process j then mark block k for migration to process j else if block k is requested by multiple neighbor processes then out of these processes, mark block k for migration to the neighbor process j with the largest value for fij else // block k is requested by no neighbor process block k remains on process i end end end

process i and to no block on process j. Moreover, the type of the connection (face, edge, corner) is also considered while determining the connection strength. The pull scheme is outlined in Algorithm 6.7. The first part of the pull scheme is conceptually identical to the push algorithm. Inside the main loop (Lines 7 to 18), the algorithm identifies blocks that must be fetched from neighbor processes in order to satisfy the accumulated process inflow. The main difference to the pull scheme is that, here, neighboring processes must first exchange information about their process-local blocks and their corresponding weights (Line 6). In the second part of the pull procedure, the blocks that have been identified as blocks that should be fetched are then requested from the corresponding neighbor processes. Every process must comply with these requests. A process is only allowed to deny such a request if the same block is requested by multiple neighbors. In that case, only one request can be satisfied (Line 24). Ultimately, however, all blocks that are requested are passed on to the appropriate neighbor processes.

76

6.4 Dynamic Load Balancing Eventually, the application can choose the diffusion-based balancing algorithm that uses the push scheme, the pull scheme, or both the push and the pull scheme in an alternating fashion. Consequently, the program flow of the entire dynamic load balancing procedure with alternating push/pull schemes and, for example, four main iterations for the controlling outer Algorithm 6.4 consists of the execution of ... 1. ... Algorithms 6.5 and 6.6 (push) followed by the migration of proxy blocks and the corresponding update of the proxy structure in Lines 4 to 8 of Algorithm 6.4, ... 2. ... Algorithms 6.5 and 6.7 (pull) followed by another migration of proxy blocks, ... 3. ... Algorithms 6.5 and 6.6 (push) followed by a third migration of the blocks, and ... 4. ... Algorithms 6.5 and 6.7 (pull) followed by the fourth and final block migration. For the diffusion-based balancing approach, however, finding a good, or even the optimal number of main iterations for the controlling outer Algorithm 6.4 is a challenge. If too few main iterations are performed, the block partitioning might still be in an unbalanced state, which potentially could be improved significantly by performing only one more iteration. On the other hand, if too many main iterations are performed, i.e., if further iterations are performed although globally the partitioning is already in balance, then these additional iterations will have no meaningful effect, but only consume valuable runtime. The actual implementation of Algorithm 6.5 therefore makes use of a global reduction for calculating the total simulation load (≙ global sum of all block weights) at the beginning of the algorithm. Knowledge about the total simulation load enables the algorithm to decide locally if a process is currently overloaded and whether or not balancing is required. Another, second global reduction then allows to synchronize this information among all processes. If all processes report that they are in a sufficiently balanced state already, the execution of Algorithm 6.5 is terminated early, meaning (a) no more blocks are marked for being moved to a neighboring process and (b) the controlling outer Algorithm 6.4 is notified that no further main iterations are required. In practice, a fixed, sufficiently large number of main iterations is set as an upper bound, but only a smaller, variable number of iterations is actually executed. Determining if all processes are sufficiently balanced is, however, another challenge. The naive approach of comparing the local process loads directly to the globally average process load is almost always meaningless. If a simulation on two processes contains three blocks, each of which possesses the weight 1, the average process load is (1+1+1)/2 = 1.5, meaning in case of optimal balance, one process, the process containing two blocks, exceeds this average load by 33 %. Therefore, the implementation must determine a suitable upper bound for the process load such that if all processes stay below this bound, the simulation can be considered to be sufficiently balanced. In order to calculate a suitable upper bound for the process load, this thesis suggests to use the globally average process load in combination with information about the block that globally generates the most load. The exact algorithm is outlined in Algorithm 6.8. If all blocks share the same weight, this algorithm will find the optimal upper bound for the process load.

77

6 Dynamic Domain Repartitioning

Algorithm 6.8: Algorithm for calculating an upper bound for the process load based on the successive addition of the maximal block weight (Lines 10 to 13). This upper bound is then used to determine if the global block partitioning is sufficiently balanced and the diffusion can be terminated or if further iterations are required. Note that this algorithm is called at the beginning of the actual implementation of Algorithm 6.5. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22

Function TerminateDiffusionEarly calculate process weight wi // ≙ sum of all local block weights ≙ process load w = AllReduceSum(wi ) // ≙ global load wblock,max,i = 0 // ≙ max. local block weight foreach local block k do wblock,max,i = max(wblock,max,i , wk ) end wblock,max = AllRedcueMax(wblock,max,i ) // ≙ max. global block weight wavg = w/number-processes // ≙ average process load wbound = 0 while wbound < wavg do wbound += wblock,max end overloaded i = false // determine if local process i is overloaded if wi > wbound then overloaded i = true unbalanced = AllReduceOr(overloaded i ) if unbalanced then return false else return true // all processes are sufficiently balanced → terminate diffusion end end

Ultimately, using Algorithm 6.8 leads to a variable number of main diffusion-based balancing iterations. Applications only need to define a maximum number of iterations. However, this functionality that allows to terminate the diffusion early and that relies on reductions in Algorithm 6.8 is completely optional. But the benefits of terminating the entire load balancing procedure early can easily amortize the cost of the three global reductions (see Figures 7.16 and 7.19 in Section 7.4.2). Section 7.4 demonstrates that the diffusion-based load balancing scheme, which is designed to be independent of the total number of processes as well as the total number of blocks, shows applicability for extreme-scale parallel simulations that involve trillions of unknowns.

6.5 Data Migration and Refinement The fourth and final step in the SAMR pipeline is the actual refinement, coarsening, and migration of all simulation data. This final step is also illustrated in Figure 6.8. The central idea of this data migration phase is to use the load-balanced state of the proxy data structure in order to adapt the simulation data structure accordingly. Because of the bilateral links between proxy blocks and actual blocks, the refinement, coarsening, and migration of the underlying simulation data is performed in one single step. In order to be able to perform the migration of the actual block data in one step, the interface that must be implemented for all simulation data and which was presented in

78

6.5 Data Migration and Refinement

Figure 6.8: This figure illustrates the fourth and final step in the SAMR pipeline. After successfully balancing the proxy data structure (see Figure 6.5), links between all proxy blocks and their counterparts in the simulation data structure enable the refinement/coarsening of simulation data, including the migration of simulation data to different processes, in one final step. The entire actual simulation data structure is adjusted based on the current state of the proxy data structure. Afterwards, the temporarily created proxy data structure is destroyed. Also note that in order to merge multiple fine blocks into one coarse block, the fine blocks do not have to be located on the same source process.

Code Snippet 3.1 must be extended with further functionality that handles the splitting and merging of blocks. This final interface for block data handling objects is outlined in Code Snippet 6.1. Function serializeCoarseToFine is used when a block is split into eight smaller blocks. The function’s signature is similar to the serialize function from Code Snippet 3.1. The only difference is that serializeCoarseToFine must write its output to an std::vector that contains multiple buffers. This std::vector will always contain exactly eight Buffer objects, each of which must be filled with the data that corresponds to one of the resulting, smaller blocks. The data inserted into the first Buffer object of the std::vector must correspond to the small block whose three least significant bits of the block ID are equal to 000. The data inserted into the second buffer must correspond to the block whose three least significant bits of the block ID are equal to 001, etc. The refinement of the data does not have to happen during this serialization operation. The actual refinement of the data can often follow the deserialization. The signature of function serializeFineToCoarse is identical to serialize. However, this function is called for blocks who are merged into a larger block. Consequently, the Buffer object must be filled with the data that corresponds to the entire small block, but the data may already be coarsened during this serialization operation. The matching deserialization functions are called during the initialization of the newly created blocks on their target processes. Note that function deserializeCoarseToFine receives one of the eight buffers filled in function serializeCoarseToFine. Function deserializeFineToCoarse, on the other hand, receives the buffers filled by eight, smaller blocks. Here, the first buffer in the std::vector again corresponds to the block whose block ID ended on 000, the second buffer corresponds to the block whose block ID ended on 001, etc. Just like initialize and deserialize (see Code Snippet 3.1), both deserialization functions return the newly allocated and initialized data, which is then added to the block by the framework. The final migration procedure is outlined in Algorithm 6.9. If some small blocks that are the result of a larger block being split remain on the same process as the large source block, these small blocks are not directly constructed from the data of the large block, but from data that was first inserted into and is then extracted from a Buffer object.

79

6 Dynamic Domain Repartitioning

C++ Code Snippet 6.1: Final implementation for class BlockDataHandling. For data that is registered at the blocks and that is supposed to be dynamically refined and coarsened, this interface must be implemented and an instance of this implementation must be passed to class BlockForest upon initialization. In case a certain simulation requires only the dynamic migration of blocks, but no dynamic splitting or merging, only the functions from Code Snippet 3.1 must be implemented and the implementation of the four functions below can be left empty. 1 2 3 4

template < typename T > class Blo ckData Handli ng { public :

5

[...] // all typedefs and functions from Code Snippet 3.1

6 7

virtual void s e r i a l i z e C o a r s e T o F i n e ( const Block & , const BlockDataID & , std :: vector < Buffer > & ) = 0; virtual void s e r i a l i z e F i n e T o C o a r s e ( const Block & , const BlockDataID & , Buffer & ) = 0;

8 9 10 11 12

virtual std :: unique_ptr d e s e r i a l i z e C o a r s e T o F i n e ( const Block & , Buffer & ) = 0; virtual std :: unique_ptr d e s e r i a l i z e F i n e T o C o a r s e ( const Block & , std :: vector < Buffer > & ) = 0;

13 14 15 16 17

};

The same is true for a large block that is the result of smaller blocks being merged, some (or all) of which already are on the same process than the large block. Ultimately, the process-local splitting and merging of blocks always also involves an intermediate buffer, but in contrast to blocks that migrate to different processes, this buffer is not transfered via MPI. Also note that in order to keep overall memory consumption as low as possible, blocks are deleted, and hence the memory they occupied is freed, once all their data is serialized to Buffer objects. Besides the fast and inexpensive migration of proxy blocks during the dynamic load balancing phase, the refinement, coarsening, and migration of data in one single step proves to be another advantage provided by the proxy data structure. In mesh-based methods like the LBM, octree-based refinement results in eight times the number of grids cells (≙ eight times the amount of memory) in refined regions. As a result, if dynamic refinement of the simulation data occurs before load balancing is performed, every process must have eight times the amount of memory available as is currently allocated by the simulation in case a process is requested to refine all of its data. Consequently, in order to avoid running out of memory, only 1⁄8 of the available memory could be used for the actual simulation and 7⁄8 of the available memory would always have to be kept in reserve for the refinement procedure. The existence of the load-balanced proxy data structure, however, allows an actual block to determine whether or not parts of its data (or even its entire data) end up on a different process after refinement. In case the refined data ends up on another process (possibly even on multiple other processes), the implementation of the data migration algorithm enables the sending source processes to only send the data that is required for the receiving target processes to correctly initialize the refined data. Allocation of the memory required

80

6.6 AMR for the Lattice Boltzmann Method

Algorithm 6.9: Algorithm for the refinement, coarsening, and migration of the actual simulation data, following the final state of the proxy data structure after its dynamic load balancing outlined in Algorithm 6.4. This algorithm represents the final step in the AMR cycle as outlined in Algorithm 6.1. The actual implementation of this function contains the execution of preregistered callback functions at three different points in the algorithm. These callbacks can be registered to be executed at the beginning prior to the serialization and destruction of the blocks, in the middle of the function prior to the creation of the new blocks, or at the end after the new blocks have been created. Consequently, these callbacks enable the execution of arbitrary, and potentially necessary, simulation-dependent pre- and post-processing functionality. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Function DataMigration foreach actual block b that must be split, merged, or sent to another process do forall data d of block b do call matching serialize function that corresponds to d end delete block b including all of its data // ≙ memory deallocation end exchange serialized block data between processes use current state of proxy data structure to adapt actual simulation data structure, including the creation of actual blocks according to their proxy counterparts foreach newly created actual block b do forall data d that must be assigned to block b do call matching deserialize function that corresponds to d end end end

for the refined data, allocation of the corresponding fine blocks, as well as the actual refinement of the data only happen on the target processes. In Figure 6.8, for example, the four medium-sized blocks that result from the large block in the upper center being split end up on both processes, three remain on the original process, while one is assigned to the other process. Ultimately, the four-step SAMR approach proposed in this thesis significantly reduces the amount of memory overhead. Almost the entire available memory can be used for simulation data at all times without the risk of running out of memory during the refinement phase.

6.6 AMR for the Lattice Boltzmann Method Since all benchmarks, including an exemplary application in Chapter 7, are based on the LBM, the following three sections focus on details of the AMR implementation for the LBM. Section 6.6.1 describes criteria that can be used in order to decide whether or not blocks must be marked for refinement or may be marked for coarsening. The next Section 6.6.2 briefly discusses the implications the LBM has for the dynamic load balancing phase. The last Section 6.6.3 outlines how data migration is handled and how coarse-to-fine and fine-to-coarse cell conversions are performed in case refinement or coarsening is triggered.

81

6 Dynamic Domain Repartitioning

6.6.1 Refinement Criteria In order to decide whether certain blocks must be marked for refinement or coarsening, various possibilities exist that depend on the specific characteristics of the simulation. If simulations contain fully resolved particles/objects within the flow, often a higher resolution is desired at these flow-obstacle interfaces. The same is true for multiphase flows. For the simulation of bubbles rising under buoyancy, for example, a higher resolution is often desired for the fluid-gas interface [Töl06, Yu09]. As a consequence, blocks covering these interface regions are kept at some predefined, fine resolution, while blocks farther away from interfaces are allowed to contain coarser grids. In vortex-dominant, turbulent flows, the dimensionless vorticity can be used as a refinement criterion. The example application in Section 7.5 uses a refinement criterion closely related to the vorticity of the flow. Here, first the gradients of all components of the dimensionless flow velocity are evaluated in x-, y-, and z-direction for each cell. Since the characteristic length scale in dimensionless lattice space is equal to 1 (last paragraph in Section 2.3), evaluating the gradients only requires to perform subtractions, all divisions by 1 can be omitted. For every cell, the algorithm then adds up all of these dimensionless velocity gradients. If the resulting sum exceeds a predefined upper limit, the block that contains the examined cell is marked for refinement. Similarly, if this sum falls below a predefined lower limit for all the cells of a block, the block is marked for potential coarsening. More information together with a detailed comparison of different refinement criteria for vortexdominant flow, including criteria based on the dimensionless vorticity and the Q criterion, is available in [Fak14]. Sometimes, additional constraints must be met. For nontrivial domain boundary conditions, for example, it may be that they cannot be easily combined with nonuniform grids. In this case, there must not be a grid level transition at the boundary, but the entire domain boundary must be kept on the same fixed level of refinement. Consequently, if flow-related refinement criteria require at least one block at a particular domain boundary to be refined, all the other blocks at the same boundary must also be refined. Similarly, a block at such a domain boundary can only be coarsened if all other blocks at the same boundary are also marked for coarsening. These constraints can be met by performing a single communication operation, typically a reduction, that only includes all those processes that contain blocks on domain boundaries. Note that these constraints must also be dealt with during the initial, static block-level refinement process outlined in Section 4.1. The framework forcing blocks to be split in order to maintain 2:1 balance can also cause some, but not all blocks on such boundaries to be split. That is why the actual implementation of the block-level refinement algorithm outlined in Algorithm 6.2 can be configured to repeat itself if during its execution a block was forced to be split by the framework. Since a block can be forced to be split on any process, but all processes must collectively decided whether or not Algorithm 6.2 must be repeated, another global reduction of a boolean variable is required at that point. Repeating the execution of Algorithm 6.2 causes the callback function at the beginning to be called once more, allowing the callback to check if all constraints are met, or if additional blocks need to be

82

6.6 AMR for the Lattice Boltzmann Method refined. Ultimately, this causes the block-level refinement algorithm to iterate until either no more blocks are marked for refinement by the callback function or no more blocks are forced to be split by the framework during the preservation of 2:1 balance.

6.6.2 Load Balancing Since the algorithm presented in Chapter 5 is used for the LBM on nonuniform grids, all the requirements for the static load balancing stated in Section 5.5 also apply for the dynamic load balancing used during AMR. Most notably, the blocks must be balanced on a per-level basis. Consequently, if the SFC-based dynamic load balancing implementation is used, 2 to 8 bytes must be globally synchronized for every block among all processes (see Table 6.1). After this global allgather operation, the SFC-ordered lists of blocks (more precisely: block IDs) must be reconstructed for each level, either by recreating the tree structure from the block IDs (for Hilbert order) or by lexicographically sorting the block IDs (for Morton oder). Following the reordering, these lists then must be partitioned into as many parts of equal weight as there are processes. This reconstruction, including the subsequent partitioning, of the SFC-ordered lists requires O(NlogN) time, with N denoting the number of blocks. Consequently, the SFC-based approach can be expected to become less favorable the larger the simulation and the more blocks are used. If, on the other hand, the diffusion-based balancing approach presented in Section 6.4.2 is used, the runtime of the dynamic load balancing phase is independent of the global number of blocks as well as the total number of processes. Ultimately, the runtime of the balancing mainly depends on the number of iterations used for the diffusion scheme. However, in order to be applicable for the LBM, the algorithms presented in Section 6.4.2 must execute all their calculations separately for each level. This approach is illustrated in Algorithm 6.10 for the calculation of the fL,ij values, which are the fij values calculated for each level L. The other algorithms that are used during the diffusion-based balancing process have to be adapted similarly. Note that this snippet corresponds to the loop from Lines 9 to 17 in Algorithm 6.5. Also note that the number of exchanged messages does not change, only the communication volume increases. However, although the volume changes, it is still only a few bytes of data. Hence, the communication time can be expected to be dominated by latency. Consequently, for a moderate number of levels (≤ 10), the increase in communication volume is of no significant consequence for the communication time. Also note that only blocks from those levels that are not yet sufficiently balanced take part in the current main iteration of the diffusion scheme. This is achieved by performing per-level evaluations also in Algorithm 6.8. Here, the three global reductions remain. However, instead of reducing single values, arrays of values are reduced, where each array contains one entry from each level. Similarly, an array of booleans is returned by the function, with each entry indicating if the corresponding level is in a sufficiently balanced state or if another iteration is required.

83

6 Dynamic Domain Repartitioning

Algorithm 6.10: Algorithm snippet that demonstrates the per-level calculation of the flow of workload between neighboring processes. As such, this snippet corresponds to the loop between Lines 9 and 17 in Algorithm 6.5. In contrast to Algorithm 6.5, here the process load wL,i and the flow fL,ij are calculated separately for each level L, but only for those levels that are not yet in a sufficiently balanced state. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

... repeat exchange wL,i (for all levels L whose blocks are not yet in a sufficiently balanced state) with all neighbor processes foreach level L do if blocks on level L are not yet in a sufficiently balanced state then ′ wL,i = wL,i foreach neighbor process j do ′ ′ fL,ij = αij ⋅ (wL,i − wL,j ) ′ fL,ij += fL,ij ′ wL,i −= fL,ij end end end until predefined max. number of ‘flow’ iterations is reached ...

6.6.3 Refinement Implementation If a block is marked for refinement during the AMR procedure, the unmodified grid data is serialized and sent to the target process(es). Only during deserialization when deserializeCoarseToFine is called for all newly created fine blocks (see Code Snippet 6.1), this data is distributed to and interpolated on the newly allocated, finer grids. Similarly, if multiple blocks are about to be merged, coarsening of the grid data is already performed on the sending side during serialization in serializeFineToCoarse. As a consequence, the coarse data is serialized, sent, and merged on the target process by deserializeFineToCoarse. The dynamic refinement as well as the dynamic coarsening of an entire block follow exactly the same mechanisms, i.e., formulas, that are utilized when transferring PDF data from coarse to fine grids (and vice versa) during execution of the algorithm for the LBM on nonuniform grids (see Section 5.3). Consequently, merging the PDF data from eight fine cells in order to calculate the PDF data required for the corresponding coarse cell follows (5.2). Similarly, the PDF data stored in a coarse cell is distributed to the corresponding eight finer cells according to (5.1). This homogeneous distribution can be further improved by additionally applying a mass and momentum conserving interpolation scheme as presented in [Che06]. For splitting one cell into eight finer cells as well as for merging eight fine cells into one coarser cell, special attention is necessary in the presence of non-fluid (obstacle/boundary) cells which might lead to non-fluid to fluid cell conversions. If a coarse, non-fluid cell that does not contain any valid PDF data is split into eight finer cells, some of these finer cells may end up as being marked as fluid cells because of the higher resolution on the fine grid. Similarly, if eight fine cells are merged into one coarser, fluid cell, some of the fine,

84

6.6 AMR for the Lattice Boltzmann Method source cells could represent non-fluid cells that do not contain valid PDF data. Moreover, even if a block x is not directly affected by AMR and does not change its resolution, but one of its neighbor blocks is split or merged, some cells at x’s borders may be forced to change their type in order to have a consistent cell type correspondence in the region of overlap between x’s grid and the newly created neighboring grid. For example, if the neighboring grid is coarsened and a coarse cell in the region of overlapping ghost layers is marked as a non-fluid cell, then all eight corresponding fine cells on block x must also be converted to non-fluid cells, and this must happen although x itself was neither refined nor coarsened. When merging eight fine cells into one coarse, fluid cell, fine cells that are non-fluid cells are disregarded by (5.2), with n being adjusted accordingly. If, however, a fine, fluid cell results from the splitting of a coarse, non-fluid cell, this fine cell must be initialized with valid PDF data. The simplest approach is to initialize this cell with the data of the next closest fluid cell that contains valid information. The conversion from non-fluid to fluid cells, including the initialization of the newly emerged fluid cells with valid data, can also follow more elaborate approaches as discussed, for example, in [Pen16].

85

7 Validation and Benchmarks The following chapter presents benchmark applications that validate the correctness of the implemented algorithms and evaluate their performance on two architecturally different petascale supercomputers. In order to enable reproducibility of the results and provide transparency with regard to the actual implementation, all the benchmark applications, together with all algorithms presented in the thesis, have been integrated into the waLBerla framework and are available as part of the software. The waLBerla software framework is available under an open source license and can be freely downloaded at http://walberla.net. The first Section 7.1 of this chapter introduces the system specifications of both supercomputers that were used for the performance evaluation. This section also explains how simulations that rely only on MPI are compared to simulations that make use of hybrid MPI/OpenMP execution. The next Section 7.2 verifies the accuracy and correctness of the parallelization scheme for the LBM on nonuniform grids presented in Chapter 5. Section 7.3 then evaluates the performance of the implementation of this parallelization scheme in order to confirm that the data structures and algorithm are fully capable of efficiently performing massively parallel simulations. Having confirmed the efficiency and applicability of the algorithm, the following Section 7.4 presents detailed performance results for another synthetic benchmark application that is also performing LBM-based simulations on nonuniform grids, but is additionally executing the entire AMR pipeline, causing the simulation to dynamically change the resolution and structure of the underlying grid. Here, special focus is put on the differences between as well as the implications of using the dynamic load balancing algorithm based on SFCs presented in Section 6.4.1 and the diffusion scheme presented in Section 6.4.2. Finally, Section 7.5 demonstrates the applicability of the AMR procedure by outlining performance results for a direct numerical simulation of highly dynamic, turbulent flow. Parts of Sections 7.1 to 7.5, including many of the results presented in Sections 7.2 to 7.5, were first published in [Sch16] and [Sch18]. In order to provide more insight into the applicability and performance of the implemented algorithms, Sections 7.2 to 7.4 contain additional benchmarks not available in [Sch16] or [Sch18]. Compared to [Sch16], which presents two different Poiseuille flow scenarios in order to verify the correctness of the parallelization scheme, Section 7.2 is extended with an additional evaluation of Couette flow. In Section 7.3, the results on SuperMUC are updated with more recent measurements and the results for the SuperMUC phase 2 system as well as the evaluation of different communication schemes are added for this thesis. Furthermore, Section 7.4 adds an evaluation of the cost of the global reductions that can optionally be executed during the diffusion-based dynamic load balancing procedure.

87

7 Validation and Benchmarks

7.1 Supercomputing Systems In order to evaluate the performance of massively parallel simulations, the benchmarks are run on two petascale supercomputers: JUQUEEN, an IBM Blue Gene/Q system located at the Jülich Supercomputing Center (Germany), and SuperMUC, an x86-based system built on Intel Xeon CPUs located at the Leibniz Rechenzentrum in Munich (Germany). JUQUEEN provides 458,752 PowerPC A2 processor cores running at 1.6 GHz, with each core capable of 4-way simultaneous multithreading. Based on the observations in [God13], in order to achieve maximal performance, all of the following benchmarks make full use of the simultaneous multithreading on JUQUEEN by placing either four processes or four threads of the same process on each core. The SuperMUC system features fewer, but more powerful, cores than JUQUEEN. It is built out of 18,432 Intel Xeon E5-2680 processors running at 2.7 GHz, which sums up to a total of 147,456 cores. The system is divided into 18 islands, where each island contains 1,024 Intel Xeon E5-2680 CPUs (≙ 8,192 cores). During its time of operation, the SuperMUC system was extended by another 6,144 Intel Xeon E5-2697v3 CPUs. This extension is referred to as the SuperMUC phase 2 system. Each of these E5-2697v3 CPUs contains 14 cores running at 2.6 GHz, which sums up to a total of 86,016 cores for the phase 2 system. Again, 1,024 CPUs are grouped together to form one islands, for a total of 6 islands. During normal system operation, simulations are restricted to at most 65,536 and 14,336 (phase 2) cores, with more cores being available to single simulations only during on-site scaling workshops. Consequently, some of the following benchmarks on SuperMUC do not contain results above 65,536 cores. The 458,752 PowerPC A2 processor cores of JUQUEEN provide a peak performance of 5.9 petaflops, while the Intel Xeon CPUs of SuperMUC provide a peak performance of 3.2 and 3.6 (phase 2) petaflops. On both JUQUEEN and SuperMUC, all simulations make use of double-precision floating-point arithmetic and the highest compiler optimization level available: level 5 for the IBM XL compiler on JUQUEEN and level 3 for the Intel compiler on SuperMUC. All of the following graphs and tables report the number of processor cores, i.e., the actual amount of hardware resources in use, not the number of processes. For a fixed number of cores, a benchmark can either be run with α processes (MPI only) or make use of hybrid execution with α/β processes and β OpenMP threads per process. On SuperMUC, α is always chosen to be equal to the number of cores. Since the simultaneous multithreading capability of JUQUEEN is fully used by all benchmarks, α is chosen to be equal to four times the number of cores when running benchmarks on JUQUEEN. Consequently, an MPI only simulation running on 512 cores of SuperMUC also consists of 512 processes, whereas an MPI only simulation running on 512 cores of JUQUEEN consists of 2,048 processes. If hybrid execution with 4 threads per process is used, a simulation on 512 cores of SuperMUC only consists of 128 processes, whereas a hybrid simulation that also makes use of 4 threads per process on 512 cores of JUQUEEN consists of 512 processes (with all 4 threads of each process being bound to the same PowerPC A2 core).

88

7.2 The Lattice Boltzmann Method on Nonuniform Grids For the performance benchmarks, the corresponding tables and graphs will state the parallelization strategy in use: MPI only or hybrid execution with typically four (β = 4) or eight (β = 8) threads per process. Since hybrid simulations use β times fewer processes, they allocate β times more cells for each process. As a result, for a given benchmark scenario and for a fixed number of cores (≙ fixed amount of hardware resources), the amount of work remains constant, independent of the parallelization strategy in use. If not explicitly stated otherwise, each benchmark was run three to five times over the course of different days and the median of all results was picked as the final measurement.

7.2 The Lattice Boltzmann Method on Nonuniform Grids The following subsections present two validation scenarios that verify the accuracy and correctness of the parallelization approach for the LBM on nonuniform grids from Chapter 5. A more extensive discussion on the general accuracy of the method is presented in [Roh06], which originally introduced the scheme for the LBM on nonuniform grids that serves as a basis for the parallelization approach that is presented in this thesis. In order to verify the correctness of the parallel implementation, the next two subsections show results for Couette and Poiseuille flow. In contrast to [Roh06], these simulations use the TRT instead of the SRT model. Concerning the two relaxation parameters of the TRT model, the thesis propose to keep Λeo (2.9) constant on all levels, scale λe according to ω in (2.8a) and (2.8b), and calculate λo from Λeo and λe according to (2.10) (see Section 2.3). In order to enable comparability of different simulations, velocities are scaled during evaluation such that the maximum velocity given by the corresponding analytical solution is always equal to 1. For all simulations, Λeo is set to 3⁄16, the Reynolds number is fixed to 10, and Ω = [0, 1] × [0, 1] × [0, 1]. As a consequence, the surface area of any cut perpendicular to the x-axis is either 1 for simulations with a rectangular or 1/4 ⋅ π for simulations with a circular cross section (cf. Figure 7.1). The volumetric flow rate Q is calculated as Q = u ¯ ⋅ A, with u ¯ denoting the mean flow velocity through the cross-sectional area A. In the volumetric approach for the LBM used in this thesis, flow velocities are evaluated at cell centers. Weighted discrete L1 , L2 , and L∞ norms of the error in the velocity are calculated by iterating all fluid cells in the domain Ω and √ 1 2 L = ∑ wi ⋅ ∆ui , L = ∑ wi ⋅ (∆ui )2 , L∞ = max ∆ui , i

i

i

where wi = (∆xi )3 is a grid level-dependent weighting factor that is equal to the volume of cell i and ∆ui = ∥ui − ui,exact ∥, with ui denoting the flow velocity computed by the simulation and ui,exact denoting the corresponding velocity given by the analytical solution. Since entire cells are marked as either fluid or obstacle and since the simulations employ bounce back boundary conditions, boundaries are always located halfway in-between two cell centers. All simulations are run in 3D and make use of the D3Q19 lattice model.

89

7 Validation and Benchmarks

Figure 7.1: Domain partitioning and flow profiles in the zy-plane (with z from left to right and y from bottom to top) for simulations of plane Couette (left), plane Poiseuille (middle), and pipe Poiseuille flow (right) with grid refinement and four different grid levels. Only the block partitioning is shown. Blocks on the coarsest level (L = 0) are colored in dark blue, while blocks on the finest level (L = 3) are colored in red. During simulation, each block consists of 10 × 10 × 10 cells. In the flow profile, high velocities are depicted in red, zero velocities in dark blue.

7.2.1 Couette Flow The simulation of plane Couette flow (cf. first illustration in Figure 7.1) uses no slip bounce back boundary conditions at y = 0, velocity bounce back boundary conditions with velocity umax = (umax , 0, 0) at y = 1, and periodic boundaries in x- and z-direction. For plane Couette flow, a linear velocity profile along the y-axis with flow induced in x-direction is expected. The fluid must be at rest at y = 0 and moving with velocity umax at y = 1. The volumetric flow rate Q = u ¯ ⋅ A = 1/2 ⋅ umax ⋅ A = 1/2 ⋅ umax . Without grid refinement, the simulation converges to the analytical solution, i.e., the volumetric flow rate as well as the flow velocities evaluated at all cell centers match the analytical solution once the steady state is reached. If grid refinement is added to the simulation, this convergence behavior to the steady state solution over time is retained. When refining only at the plate on the bottom (y = 0), or at the moving plate on the top (y = 1), or in the middle between both plates, or at the bottom as well as at the top plate, convergence over time remains unchanged and the simulation converges to the analytical solution, i.e., L∞ drops to values close to machine accuracy. The agreement with the analytical solution as well as the convergence behavior to the steady state for a simulation where both the bottom and the top plate are refined three times are also illustrated in the first graph of Figures 7.2 and 7.3.

7.2.2 Poiseuille Flow For the simulation of Poiseuille flow, the fluid is driven with a body force caused by a constant acceleration (2.7) that induces a flow in x-direction. The acceleration is scaled to different grid levels according to (2.11). All walls are fixed and set to no slip boundary conditions. After convergence to the steady state, a parabolic flow profile that is in accordance with Poiseuille’s law is expected. Here, two different setups are used (cf. second and third illustration in Figure 7.1). By having no slip bounce back boundary conditions at y = 0 and y = 1 and periodic boundaries in x- as well as z-direction, plane

90

normalized flow velocity

7.2 The Lattice Boltzmann Method on Nonuniform Grids 1 0.75 0.5 0.25 0

0

0.25

0.5

0.75

1

Couette flow

0

0.25

0.5

0.75

1

Poiseuille flow (plates)

0

0.25

0.5

0.75

1

Poiseuille flow (pipe)

Figure 7.2: Velocity profiles once the steady state is reached, evaluated along a line through the middle of the channel, parallel to the y-axis. The black line follows the analytical solution, the grey circles correspond to the velocities calculated by the simulation (only every 2nd data point is plotted). The graphs correspond to the three simulations depicted in Figure 7.1. 1

1

L∞ norm

10−4

10−1

10−8 10−2

10−12 10−16

0

40 ⋅ 103

80 ⋅ 103

Couette flow

120 ⋅ 103

10−3

0

6.6 ⋅ 103

13.3 ⋅ 103

20 ⋅ 103

Poiseuille channel (circular profile)

Figure 7.3: L∞ evaluated over the time of the simulation (simulation time in time steps on the finest grid). The black line corresponds to a simulation with the entire domain refined to the finest resolution. The grey circles correspond to a simulation with grid refinement with four different grid resolutions and the finest level only at fluid-boundary interfaces (graphs correspond to the first and third illustration of Figure 7.1). The same agreement can be seen when evaluating L1 and L2 norms (not visualized here).

Poiseuille flow is simulated. For pipe Poiseuille flow, a circular profile in the yz-plane and periodic boundaries in x-direction are chosen. For plane Poiseuille flow, the flow rate Q = u ¯ ⋅ A = 2/3 ⋅ umax ⋅ A = 2/3 ⋅ umax . As with the simulation of plane Couette flow, using the TRT model and Λeo equal to 3⁄16, simulations on globally uniform grids return the analytical solution once the steady state s reached. Introducing grid refinement with fine grids only at y = 0 and y = 1 does not affect the flow rate and does not change convergence. The simulation still returns the analytical solution once the steady state is reached, the convergence behavior is equivalent to that of plane Couette flow, and L∞ drops to values close to machine accuracy. For pipe Poiseuille flow (circular cross section), the volumetric flow rate Q = 1/2 ⋅ umax ⋅ A = 1/8 ⋅ u max ⋅ π. For the LBM with no slip conditions at curved boundaries, the expected convergence order is discussed in, e.g., [Gin08b]. Similarly, the accuracy at the refinement interface is studied in [Roh06]. For the benchmarks in this section, the circular profile is approximated by a staircase and the well-established mass-conserving first order bounce

91

7 Validation and Benchmarks

∥u − uexact ∥

5 ⋅ 10−3 4 ⋅ 10−3

L=0

−3

3 ⋅ 10

L=1

L=1

2 ⋅ 10−3 1 ⋅ 10−3

L=2

0

L=3 1/3

0

L=2 L=3 2/3

1

1/3

0

global refinement

2/3

1

refinement only near the wall

Figure 7.4: The error ∥u − uexact ∥ of simulations of Poiseuille flow through a channel with circular cross section once the steady state is reached, evaluated along a line through the middle of the channel, parallel to the y-axis. The first graph shows the error for simulations where the entire domain is refined either to level 0, 1, 2, or 3. The second graph shows the error for simulations where the finest level (1, 2, or 3) is only used to resolve the fluid-wall interface. Towards the center of the channel, the grid resolution becomes increasingly coarser, always reaching level 0 at the center (cf. third illustration in Figure 7.1). Small jumps in the velocity that are noticeable in the graph on the right are due to errors at the refinement interface and conform with the observations reported in [Roh06]. Table 7.1: Comparison of the accuracy of different simulations of pipe Poiseuille flow. The simulation was executed with different resolutions and with refining the entire domain to a certain level (global refinement), with grid refinement to different levels only close to the wall of the pipe (wall refinement), and with grid refinement to different levels only in the center of the pipe (center refinement). The corresponding refinement level is stated in brackets. On the coarsest level, the pipe consists of 60 cells in diameter, leading to 120, 240, and 480 cells for level 1, 2, and 3, respectively. The linear decrease of the error conforms with the first order bounce back boundary conditions in use. flow rate error

L1

L2

L∞

(0) (1) (2) (3)

7.96 ⋅ 10−3 4.98 ⋅ 10−3 1.86 ⋅ 10−3 1.05 ⋅ 10−3

3.48 ⋅ 10−3 2.06 ⋅ 10−3 0.77 ⋅ 10−3 0.42 ⋅ 10−3

4.28 ⋅ 10−3 2.43 ⋅ 10−3 0.90 ⋅ 10−3 0.50 ⋅ 10−3

19.2 ⋅ 10−3 8.92 ⋅ 10−3 4.59 ⋅ 10−3 2.87 ⋅ 10−3

wall refinement (1) wall refinement (2) wall refinement (3)

5.14 ⋅ 10−3 1.96 ⋅ 10−3 1.13 ⋅ 10−3

2.17 ⋅ 10−3 0.84 ⋅ 10−3 0.49 ⋅ 10−3

2.54 ⋅ 10−3 0.98 ⋅ 10−3 0.57 ⋅ 10−3

8.92 ⋅ 10−3 4.59 ⋅ 10−3 2.87 ⋅ 10−3

center refinement (1) center refinement (2) center refinement (3)

7.95 ⋅ 10−3 7.92 ⋅ 10−3 7.92 ⋅ 10−3

3.46 ⋅ 10−3 3.40 ⋅ 10−3 3.39 ⋅ 10−3

4.27 ⋅ 10−3 4.21 ⋅ 10−3 4.20 ⋅ 10−3

19.2 ⋅ 10−3 19.2 ⋅ 10−3 19.2 ⋅ 10−3

global global global global

refinement refinement refinement refinement

back boundary condition is used. Consequently, errors that converge linearly with the resolution of the simulation are expected. When uniformly refining the global grid, the expected behavior can be observed. Errors decrease linearly the higher the resolution of the underlying grid, i.e., the more cells are used for modeling the flow inside the pipe. Grid refinement can help to avoid resolving the entire domain with a finer grid. If finer grids are used only near the wall of the pipe (see third illustration in Figure 7.1), convergence over time remains unchanged (cf. second graph in Figure 7.3) and comparable accuracy with simulations that uniformly resolve the entire pipe with fine grids can be obtained (see Figure 7.4). Refining only the center of the pipe has no impact on accuracy. In conclusion,

92

7.3 Performance of Statically Refined Grids the accuracy of the simulation is mainly determined by the resolution of the grid used to resolve the fluid-wall interface at the inside wall of the pipe. Results for all of these simulations of pipe Poiseuille flow are summarized in Table 7.1. Accuracy is evaluated after convergence of the flow to a steady state. For calculating the flow rate Q, the flow through a cross sectional area in the middle of the channel parallel to the yz-plane is evaluated. The relative error compared to the analytical solution of Q is shown in the second column of Table 7.1. L1 , L2 , and L∞ norms are listed in columns 3, 4, and 5. Ultimately, the results presented here in Section 7.2 confirm the applicability and correctness of the algorithms proposed in Chapter 5.

7.3 Performance of Statically Refined Grids For the evaluation of the performance of the parallel algorithm for the LBM on nonuniform grids, a synthetic benchmark is used that simulates lid-driven cavity flow in 3D (D3Q19) with Ω = [0, 1] × [0, 1] × [0, 1]. The benchmark uses a velocity bounce back boundary condition at z = 1 in order to simulate the moving lid. No slip bounce back boundary conditions are used for all other domain boundaries. The regions where the moving lid meets the stationary domain boundaries are refined three times (see Figure 7.5). The properties of the resulting refinement structure are listed in Table 7.2. The setup is proportionally extended in y-direction when the number of processes is increased to make sure that these properties remain fixed throughout all performance benchmarks.

Figure 7.5: Three slightly different visualizations of the lid-driven cavity performance benchmark. Only the block partitioning is shown. Blocks on the coarsest level (L = 0) are colored in dark blue, while blocks on the finest level (L = 3) are colored in red. The flow profile is visualized for the xz-plane (with x from left to right and z from bottom to top). High velocities are depicted in red, zero velocities in dark blue. When increasing the number of processes, the setup is proportionally extended in y-direction. Consequently, if the number of processes is doubled, twice as many blocks are created in y-direction. Table 7.2: The memory requirements and workload of each grid level as well as the amount of space covered by every level in the lid-driven cavity performance benchmark. The memory requirements are proportional to the number of blocks (the finest level accounts for 59.81 % of all blocks).

domain coverage ratio workload share memory/block share

L=0

L=1

L=2

L=3

77.78 % 1.10 % 6.54 %

16.67 % 3.76 % 11.22 %

4.17 % 15.02 % 22.43 %

1.39 % 80.13 % 59.81 %

93

7 Validation and Benchmarks

7.3.1 Weak and Strong Scaling Weak scaling is realized by assigning four blocks of the finest level to every process. As a result, each process additionally holds one or two blocks of level 2, one or no block of level 1, and one or no block of the coarsest level. On average, each process will hold 6.6875 blocks, independent of the total number of processes. When increasing the number of processes, the number of cells per block (and process) remains constant. As a consequence, the global number of cells increases linearly with the number of processes. Strong scaling is realized by assigning only one block of the finest level to each process. As a result, each process additionally holds either one or no block from the other levels. On average, each process will hold 1.6719 blocks. The more processes are used, the fewer cells are allocated per block (and process). As a consequence, the global number of cells remains constant, independent of the total number of processes. For the weak scaling benchmark, performance is reported in terms of million fluid lattice cell updates per second (MFLUP/S). For ideal weak scaling, MFLUP/S must scale linearly with the number of processor cores. For the strong scaling benchmark, the number of time steps executed each second on the finest grid is reported. The more time steps are executed each second, the higher the throughput of the simulation and the shorter the time to solution for a fixed problem size. For ideal strong scaling, the number of time steps executed each second must scale linearly with the number of processor cores. Both weak and strong scaling benchmarks also report MFLUP/S divided by the number of processor cores as million fluid lattice cell updates per second and core (MFLUP/SC). Consequently, MFLUP/SC is proportional to parallel efficiency. If MFLUP/SC remains constant, parallel efficiency is equal to 1. All measurements use the same compute kernels as described in [God13]. These compute kernels make use of SIMD instructions on the IBM Blue Gene/Q architecture of JUQUEEN as well as the x86-based hardware of SuperMUC. Since, in waLBerla, the performance of the fastest TRT kernel is identical to the performance of the fastest SRT kernel [God13], all results reported here apply for both collision schemes, TRT and SRT. SuperMUC For evaluating weak scaling performance on SuperMUC, the benchmark uses 3.34 million cells per core. As a consequence, with 147,456 cores of SuperMUC, simulations that consist of up to 493 billion cells can be executed. These simulations achieve up to 727,963 MFLUP/S. Since the algorithms do not involve global communication and since the underlying data structures are fully distributed to all processes, almost perfect weak scaling can be observed (see first graph in Figure 7.6). Hybrid simulations using either four or eight threads per process are slightly faster than simulations that rely on MPI only (see Table 7.3). For the evaluation of strong scaling performance, the global number of cells is fixed to 8.56 million. The same simulation runs on 512 up to 65,536 cores. In the largest simulation

94

7.3 Performance of Statically Refined Grids

2.5

MFLUP/SC

10

5

4 3

104

2 103

1 0

3

106 MFLUP/SC

5

MFLUP/S

MFLUP/S

MFLUP/SC

6

28

210

212 214 cores

1200

2

900

1.5

600

1

300

0.5 0

216

time steps / sec MFLUP/SC

29 210 211 212 213 214 215 216 cores

time steps / sec

7

0

Figure 7.6: Weak (left) and strong (right) scaling performance on SuperMUC. For evaluating weak scaling performance, the benchmark uses 3.34 million cells/core. For evaluating strong scaling performance, the global number of cells is set to approx. 8.56 million. Only the best results of three different parallelization strategies (MPI only, MPI+OpenMP with 4 threads/process, MPI+OpenMP with 8 threads/process) are plotted. Detailed results are listed in Table 7.3. Table 7.3: Detailed results for weak scaling (3.34 million cells/core) and strong scaling (8.56 million cells in total) performance on SuperMUC. The table lists results for simulations that rely on MPI only and for hybrid simulations with four and eight threads per process. weak scaling (MFLUP/SC) cores 128 256 512 1 024 2 048 4 096 8 192 16 384 32 768 65 536 147 456

MPI only

OpenMP (4)

OpenMP (8)

4.96897 4.95482 4.90876 4.90767 4.87695 4.81926 4.87076 4.77174 4.39070 4.20160 4.12766

5.22079 5.15403 4.99421 5.01152 5.00139 4.98401 4.98326 4.97621 4.96548 4.74741 4.76084

5.18481 5.03051 4.99775 4.98892 4.99353 4.97058 4.97622 4.98115 4.96384 4.94107 4.93682

strong scaling (time steps / sec) MPI only

OpenMP (4)

OpenMP (8)

143.36 222.69 340.16 503.34 736.12 968.84 1085.10 1093.83

181.63 321.17 521.96 797.72 1057.21 1207.26 1318.49 1327.91

178.33 296.40 515.85 778.51 947.05 966.24 1181.75 1189.09

with 65,536 cores, each core is responsible for only 131 cells. As expected, parallel efficiency decreases when more processes are used and when fewer cells are assigned to each core. However, throughput constantly increases up to 1,327 time steps per second on the finest grid (see second graph in Figure 7.6), enabling simulations that perform up to 4.8 million time steps per hour compute time. Hybrid simulations are generally faster (cf. Table 7.3) since for hybrid simulations fewer processes with larger blocks per process can be used. As a result, the total amount of communicated data is smaller compared to simulations that use MPI parallelization only. SuperMUC Phase 2 Results on the phase 2 system are similar. The benchmark again uses 3.34 millions cells per core when evaluating the weak scaling performance. Consequently, the simulation consists of 287 billions cells when executed on all 86,016 cores. This simulation achieves

95

7 Validation and Benchmarks

105

3

104

2 1

MFLUP/SC

MFLUP/SC

4

0

2.5

MFLUP/S

MFLUP/S

MFLUP/SC

5

106

103 28

210

212 cores

214

2

1200

1.5

800

1

400

0.5 0

216

1600 time steps / sec MFLUP/SC

29 210 211 212 213 214 215 216 cores

time steps / sec

6

0

Figure 7.7: Weak (left) and strong (right) scaling performance on the SuperMUC phase 2 system. For evaluating weak scaling performance, the benchmark uses 3.34 million cells/core. For evaluating strong scaling performance, the global number of cells is set to approx. 7.49 million. Only the best results of three different parallelization strategies (MPI only, MPI+OpenMP with 7 threads/process, MPI+OpenMP with 14 threads/process) are plotted. Detailed results are listed in Table 7.4. Table 7.4: Detailed results for weak scaling (3.34 million cells/core) and strong scaling (7.49 million cells in total) performance on the SuperMUC phase 2 system. The table lists results for simulations that rely on MPI only and for hybrid simulations with seven and fourteen threads per process. weak scaling (MFLUP/SC)

strong scaling (time steps / sec)

cores

MPI only

OpenMP (7)

OpenMP (14)

MPI only

OpenMP (7)

OpenMP (14)

112 224 448 896 1 792 3 584 7 168 14 336 28 672 57 344 86 016

4.32947 4.21321 4.23272 4.20763 4.21934 4.14270 4.17555 4.12525 4.10896 3.88681 3.77745

4.24799 4.19012 4.15239 4.13524 4.12759 4.11629 4.10989 4.10849 4.11895 4.11847 4.09939

— 2.92791 2.85473 2.85354 2.84195 2.82523 2.82758 2.80416 2.80484 2.80156 2.80991

118.78 169.61 277.40 412.28 717.25 1017.70 1237.86 1278.48 1450.26

165.41 304.50 540.39 930.63 1033.25 1216.12 1362.40 1478.44 1471.24

— 213.89 370.93 532.26 580.23 677.03 801.14 1004.86 1026.03

up to 352,613 MFLUP/S. Again, almost perfect weak scaling can be observed (see first graph in Figure 7.7). However, on the phase 2 system fewer cells are updated per second and core (cf. Table 7.4) when compared to the results on SuperMUC listed in Table 7.3. For 3.34 million cells per core, performance for the LBM on SuperMUC is limited by the available memory bandwidth [God13]. SuperMUC provides 102.4 GiB/s per node (≙ 6.4 GiB/s per core), while the phase 2 system provides 137 GiB/s per node (≙ 4.9 GiB/s per core). Consequently, the increase in memory bandwidth per node does not keep up with the increase from 16 to 28 cores per node. Ultimately, the lower MFLUP/SC values measured on the phase 2 system match with the decrease in memory bandwidth per core. For the evaluation of strong scaling performance, the global number of cells is fixed to 7.49 million and the same simulation is run on 448 up to 86,016 cores. As expected, parallel efficiency again decreases when more processes are used and when fewer cells are assigned to each core. However, throughput constantly increases up to 57,344 cores. Here,

96

7.3 Performance of Statically Refined Grids the simulation reaches 1,478 time steps per second on the finest grid (see second graph in Figure 7.7), enabling simulations that perform up to 5.3 million time steps per hour compute time. In this simulation with 57,344 cores, each core is responsible for only 131 cells. With so few cells, other factors besides the memory bandwidth become important. Due to its newer architecture including a faster interconnect with lower latencies, the phase 2 system is able to achieve a higher maximal number of time steps per second. JUQUEEN On JUQUEEN, qualitatively the same behavior can be observed. However, in contrast to SuperMUC, JUQUEEN provides many more processor cores. The simulations that run on the entire machine make use of 1.835 million threads. Since these large simulations require data structures that consist of millions of blocks, the benchmark makes use of the two-stage initialization process and hence loads the results of the initialization phase from file instead of creating the domain partitioning and performing load balancing at runtime (see Chapter 4). For the evaluation of weak scaling performance, the benchmark uses 1.93 million cells per core, resulting in almost one trillion cells (886 billion) for the largest simulation. Similar to the results on SuperMUC, almost perfect weak scaling can be observed (see first graph in Figure 7.8), with hybrid simulations using four threads per process being significantly faster than simulations that rely on MPI only and simulations that use eight threads per process (see Table 7.5). With 882,852 MFLUP/S for the largest simulation on 458,752 cores, 16.8 trillion PDFs are updated each second. For the evaluation of strong scaling performance, the global number of cells is fixed to 233 million and the same simulation is run on 512 up to 458,752 cores. In simulations on 458,752 cores, each core is responsible for only 508 cells. Again, parallel efficiency is decreasing when more processes are used and when fewer cells are assigned to each core.

MFLUP/SC

105 10

4

1.5 1

103

0.5 0

106

102 26

28 210 212 214 216 218 cores

time steps / sec

1.5

MFLUP/SC

80 60

1

40

0.5 0

100

20 210

212

214 cores

216

218

time steps / sec

2

MFLUP/S

MFLUP/SC

MFLUP/SC

2.5

MFLUP/S

3

0

Figure 7.8: Weak (left) and strong (right) scaling performance on JUQUEEN. For evaluating weak scaling performance, the benchmark uses 1.93 million cells/core. For evaluating strong scaling performance, the global number of cells is set to approx. 233 million. Only the best results of three different parallelization strategies (MPI only, MPI+OpenMP with 4 threads/process, MPI+OpenMP with 8 threads/process) are plotted. Detailed results are listed in Table 7.5.

97

7 Validation and Benchmarks Table 7.5: Detailed results for weak scaling (1.93 million cells/core) and strong scaling (233 million cells in total) performance on JUQUEEN. The table lists results for simulations that rely on MPI only and for hybrid simulations with four and eight threads per process. weak scaling (MFLUP/SC) cores 32 64 128 256 512 1 024 2 048 4 096 8 192 16 384 32 768 65 536 131 072 262 144 458 752

strong scaling (time steps / sec)

MPI only

OpenMP (4)

OpenMP (8)

MPI only

OpenMP (4)

OpenMP (8)

1.72690 1.72024 1.70207 1.70720 1.69481 1.70529 1.67337 1.66438 1.65579 1.62633 1.62369 1.60442 1.58830 1.58418 —

2.06960 2.05401 2.04498 2.00083 1.98645 1.98478 1.97916 1.97675 1.97816 1.95565 1.94555 1.94437 1.93697 1.92665 1.92446

1.70424 1.75702 1.67617 1.71336 1.70756 1.67012 1.67471 1.67879 1.67131 1.65981 1.67588 1.66758 1.65841 1.65170 1.64429

4.083 6.804 10.464 15.983 20.527 26.633 34.335 39.391 46.231 51.478 —

4.937 7.961 12.460 21.644 30.990 45.913 58.519 70.840 78.146 81.019 89.376

4.942 8.641 13.280 23.128 30.129 46.443 65.057 73.794 81.712 89.843 98.017

Throughput, however, constantly increases up to 98 time steps per second on the finest grid (see second graph in Figure 7.8). In order to achieve maximal throughput, a hybrid parallelization approach must be employed (cf. Table 7.5).

Comparison with Simulations on Globally Uniform Grids Compared to benchmarks for simulations on uniform grids [God13], the nonuniform LBM scheme that uses statically refined grid structures looses a factor of 2 to 2.5 in terms of MFLUP/S numbers. This is due to the fact that in the uniform LBM scheme, there is significantly less overhead. The uniform scheme does not need to perform interpolation, it communicates less data, and it can always make use of fused stream-collide compute kernels. Moreover, in [God13], strong scaling performance is only reported for a simulation of blood flow in the coronary artery tree. In this kind of simulation, blocks are less densely populated with fluid cells, resulting in less data that needs to be processed and communicated. Throughput in this type of simulation is typically two to three times higher than in simulations with blocks that are densely populated with fluid cells. When comparing the number of time steps that can be executed in one second by the nonuniform LBM scheme presented in this thesis to the throughput in uniform simulations with blocks that are densely populated with fluid cells, a decrease by a factor of 3 to 4 can be observed. However, this loss in efficiency is more than compensated by the fact that, in general, nonuniform simulations based on locally refined grid structures generate significantly less workload and require much less memory than simulations that uniformly resolve the entire domain with the highest resolution (see Section 7.5). With more than 1,000 time steps per second, the benchmark achieves considerably higher throughput rates on SuperMUC than on JUQUEEN. This is also in good agreement with previous observations in [God13].

98

7.3 Performance of Statically Refined Grids

7.3.2 Communication Strategies In addition to introducing the basic concepts of the algorithm for the LBM on nonuniform grids, Chapter 5 also proposes various optimizations for the communication that are aimed at improving the performance of the implementation. Instead of using a synchronous communication scheme (see Algorithm 5.1), Section 5.3 outlines the structure of the final algorithm which includes asynchronous communication (see Algorithm 5.2). Section 5.4 illustrates the implementation of the corresponding communication routines which perform as little data transfer as possible. When data is exchanged between blocks on the same level (see Section 5.4.1), only one layer of cells is exchanged for the interior part of faces (in 3D, and edges in 2D). Moreover, only the PDF values that are about to stream into the neighboring block are sent. If the framework would not allow for such a fine-grain adjustment of the communication routines, a full synchronization of all ghost layers would have to be executed. In order to evaluate the performance gains due to these algorithmic optimizations, another weak scaling is performed using the lid-driven cavity benchmark introduced at the beginning of Section 7.3. For each core count, the benchmark is executed four times. First, the benchmark is executed with default settings. Here, the optimized communication routines as well as the asynchronous communication scheme are used during the simulation. The results of this first run serve as a performance baseline. After that, the benchmark is executed three more times: once with the optimized communication routines but with the synchronous communication scheme, once with the asynchronous communication scheme but with the unoptimized communication routines that perform a full synchronization of all ghost layers, and once with both the full synchronization of all ghost layers as well as the synchronous communication scheme. For all of these three additional benchmark runs, the runtime overhead compared to the first run is measured. SuperMUC The results for SuperMUC are outlined in Figure 7.9. Detailed results are also listed in Table 7.6. On SuperMUC, performing synchronous communication increases the runtime by 2.8 % for a simulation with 53.5 ⋅ 103 cells/core. In a significantly larger simulation that contains 836 ⋅ 103 cells/core, the increase in runtime due to synchronous communication is almost identical with 2.3 %. Consequently, the performance gains due to asynchronous communication are small, but present for all benchmark runs on 128 up to 65,536 cores. Performing a full synchronization of all ghost layers has a larger impact on performance. Here, for a simulation with 53.5 ⋅ 103 cells/core, the total runtime increases by 21 % on average. If the number of cells processed by each core increases to 836 ⋅ 103 , the overhead introduced by performing a full synchronization of all ghost layers gets smaller, but is still 8.7 % on average. Finally, if both the synchronous communication scheme as well as the communication routines that perform full synchronization are used, total simulation time will, on average, increase by 27 % or 13 %, respectively. Ultimately, the various optimizations that target the communication aspects of the algorithm for the nonuniform

99

7 Validation and Benchmarks

53.5 ⋅ 103 cells/core

overhead in %

50

836 ⋅ 103 cells/core

40

full / sync. full / async. sparse / sync.

30 20 10 0

27 28

210

212 cores

214

216

27 28

210

212 cores

214

216

Figure 7.9: Overall increase in total simulation time for various benchmarks on SuperMUC compared to a benchmark run that makes use of the asynchronous communication scheme together with optimized communication routines that transfer as little data as possible (≙ sparse/async., baseline against which the increase in runtime of the other benchmark runs is measured). “full” indicates full synchronization of all ghost layers, whereas “sparse” indicates the usage of the optimized communication routines. Similarly, “sync.” indicates synchronous communication, whereas “async.” indicates the usage of the asynchronous communication scheme. Corresponding detailed results are listed in Table 7.6. Table 7.6: Precise percentage values that correspond to the graphs in Figure 7.9. 53.5 ⋅ 103 cells/core sparse

full

836 ⋅ 103 cells/core sparse

full

cores

sync.

async.

sync.

sync.

async.

sync.

128 256 512 1 024 2 048 4 096 8 192 16 384 32 768 65 536

6.8 % 5.3 % 1.5 % 1.3 % 2.5 % 1.8 % 1.2 % 1.2 % 2.2 % 4.0 %

22 % 20 % 20 % 17 % 20 % 19 % 18 % 19 % 21 % 35 %

28 % 26 % 22 % 20 % 23 % 22 % 22 % 24 % 35 % 45 %

4.5 % 6.0 % 2.1 % 1.8 % 2.1 % 0.8 % 2.0 % 1.1 % 1.2 % 1.7 %

9.5 % 9.7 % 8.7 % 6.4 % 6.8 % 8.9 % 8.8 % 9.1 % 9.6 % 9.4 %

13 % 15 % 12 % 11 % 12 % 12 % 12 % 12 % 13 % 17 %

avg.:

2.8 %

21 %

27 %

2.3 %

8.7 %

13 %

LBM help to significantly reduce the runtime of the entire simulation. Not using these algorithmic enhancements can increase the runtime by almost 50 % for simulations that employ a large number of cores and a relatively small number of cells that are processed by each core.

JUQUEEN Results for the same series of benchmarks executed on JUQUEEN are outlined in Figure 7.10 and Table 7.7. Contrary to the results on SuperMUC, for a relatively small number of cells assigned to each core, the synchronous communication scheme is faster than the asynchronous scheme. However, if the number of cells is increased and hence more data must be communicated, using the asynchronous scheme again results in the best

100

7.3 Performance of Statically Refined Grids

23.4 ⋅ 103 cells/core

overhead in %

20

324 ⋅ 103 cells/core

15

full / sync. full / async. sparse / sync.

10 5 0 −5

28

210

212 214 cores

216

218

28

210

212 214 cores

216

218

Figure 7.10: Same series of benchmarks as presented in Figure 7.9, however, for simulations executed on JUQUEEN instead of SuperMUC. On JUQUEEN, using the synchronous communication scheme increases the performance for simulations that contain a relatively small number of cells assigned to each core. This increase in performance when using the synchronous communication scheme is indicated by a negative overhead compared to the performance baseline, which corresponds to a simulation that uses the default settings (≙ asynchronous communication). For larger simulations, similar results as on SuperMUC can be observed. Here, the asynchronous communication scheme also leads to the shortest simulation runtime. Table 7.7: Precise percentage values that correspond to the graphs in Figure 7.10. 23.4 ⋅ 103 cells/core sparse cores

full

324 ⋅ 103 cells/core sparse

full

sync.

async.

sync.

sync.

async.

sync.

128 256 512 1 024 2 048 4 096 8 192 16 384 32 768 65 536 131 072 262 144

−0.2 % −3.2 % −3.5 % −3.2 % −3.6 % −3.7 % −2.4 % −2.7 % −1.7 % −1.3 % 0.1 % 1.0 %

9.8 % 8.4 % 8.9 % 11 % 9.1 % 9.9 % 8.9 % 8.9 % 11 % 12 % 12 % 13 %

12 % 8.4 % 7.5 % 8.2 % 7.5 % 7.7 % 8.8 % 8.4 % 9.2 % 9.9 % 9.0 % 15 %

4.1 % 2.5 % 2.5 % 1.1 % 0.7 % 0.3 % 1.3 % 1.8 % 1.0 % 1.3 % 0.9 % 0.3 %

7.6 % 6.1 % 9.3 % 8.1 % 8.3 % 8.0 % 8.5 % 9.5 % 9.4 % 9.6 % 9.4 % 9.1 %

12 % 11 % 12 % 10 % 11 % 9.6 % 10 % 11 % 11 % 11 % 11 % 12 %

avg.:

−2.0 %

10 %

9.3 %

1.5 %

8.6 %

11 %

performance. In these simulations, similar to the results on SuperMUC, the advantage of the asynchronous communication scheme is small, but present for all benchmark runs on 128 up to 262,144 cores. Performing a full synchronization of all ghost layers is always slower than employing the optimized communication routines that transfer significantly less data. Independent of the number of cells that are assigned to each core, performing a full synchronization of all ghost layers introduces an overhead of approximately 10 % for all core counts that have been tested. Ultimately, the runtime gain when using the various optimizations for the communication is smaller on JUQEEN as compared to SuperMUC. Still, activating all these optimizations on JUQUEEN will almost always yield in a consistent and noticeable speed-up of the simulation.

101

7 Validation and Benchmarks

7.4 Adaptive Mesh Refinement After evaluating the accuracy and performance of the algorithm for the LBM on nonuniform grids in Sections 7.2 and 7.3, this section evaluates the four-step AMR procedure introduced in Chapter 6. In order to assess the performance of the AMR procedure, a benchmark application that simulates lid-driven cavity flow based on the LBM is used. The initial setup of this simulation is identical to the benchmark application outlined at the beginning of Section 7.3. Some regions of the domain are refined three times, resulting in a total of four different levels of refinement. Moreover, the benchmark employs the D3Q19 model that results in 19 unknowns (≙ PDF values) per cell. At some predefined point in time, AMR is artificially triggered by marking all blocks on the finest level for coarsening and simultaneously generating an equal number of finest cells by marking some of the coarser neighbor blocks for refinement. Some more blocks are automatically marked for refinement by the AMR framework in order to preserve 2:1 balance. Consequently, the region of finest resolution moves slightly inwards. Ultimately, 72 % of all cells change their size during this AMR process. The intent is to put an unusually high amount of stress on the dynamic repartitioning procedure. The domain partitioning before and after AMR is triggered is illustrated in Figure 7.11. During this repartitioning, the average number of blocks assigned to each process increases by 33 %. As a result, the amount of data (≙ number of cells) also increases by the same factor, since every block, independent of its level, stores a grid of the same size. Similar to the benchmark used in Section 7.3, when increasing the number of processes by a factor of 2, the size of the domain is also extended by a factor of 2 in y-direction, doubling the number of blocks on each level. As a result, the distribution characteristics (see Table 7.8) remain constant, independent of the number of processes in use. For a fixed number of grid cells stored at each block, increasing the number of processes then corresponds to a weak scaling scenario of the simulation where the global number of blocks and cells increases linearly with the number of processes. To also evaluate the performance of the AMR pipeline for varying amounts of data per process (as is the case when strong scaling a simulation), the benchmarks are executed multiple times with different numbers of cells stored at each block. The stated average number of cells per core always correspond to the state of the simulation after AMR was executed.

Figure 7.11: Block partitioning of the benchmark application before (figure on the left) and after (right) AMR is triggered. The initial setup before AMR is triggered is identical to Figure 7.5. Table 7.8 summarizes corresponding memory and workload distribution statistics.

102

7.4 Adaptive Mesh Refinement Table 7.8: Statistics about the distribution of workload and memory among the four available levels in the simulation illustrated in Figure 7.11. Even though the finest cells only cover a small portion of the domain, cells on the finest level account for most of the generated workload and memory consumption (→ most blocks are located on the finest level). This is true before and after the dynamic repartitioning. L=0

L=1

L=2

L=3

domain coverage ratio workload share memory/block share

77.78 % 1.10 % 6.54 %

16.67 % 3.76 % 11.22 %

4.17 % 15.02 % 22.43 %

1.39 % 80.13 % 59.81 %

initially

domain coverage ratio workload share memory/block share

66.67 % 0.78 % 4.23 %

22.23 % 4.13 % 11.27 %

9.72 % 28.94 % 39.44 %

1.39 % 66.15 % 45.07 %

after AMR

Table 7.9: Average and maximal number of blocks assigned to each process for the benchmark application illustrated in Figure 7.11. The numbers in the third and fourth column correspond to the state of the simulation after AMR was triggered, while the numbers in the second column correspond to the initial state of the simulation. These numbers are independent of the total number of processes. If more processes are used, the number of blocks on each level is increased proportionally. As a consequence, the average as well as the maximal number of blocks per process remain identical for any total number of processes. avg. blocks/proc. (max. blocks/proc.) load balancing level 0 1 2 3

initially 0.383 0.656 1.313 3.500

(1) (1) (2) (4)

before 0.328 0.875 3.063 3.500

( 1) ( 9) (11) (16)

after 0.328 0.875 3.063 3.500

(1) (1) (4) (4)

The average and maximal number of blocks assigned to each process are listed in Table 7.9. Before load balancing is executed during the AMR procedure, the distribution of blocks (more precisely: proxy blocks) is highly irregular with some processes containing far more blocks on certain levels than the average number of blocks per process would suggest. Some processes, for example, are assigned nine blocks on level 1 at the beginning of the AMR procedure. Only after load balancing, a perfect distribution is achieved with no single process containing more blocks than expected. As a result, no process ends up with more than one block on level 1. In all of the benchmarks that follow, the time required for one complete AMR cycle is mainly characterized by the time that is required for the dynamic load balancing (≙ 3rd step in the AMR pipeline) and the migration of the simulation data, which includes the refinement and the coarsening of the data (≙ 4th step). The runtime of the 1st and the 2nd step, the block-level refinement phase and the construction of the proxy data structure, is independent of the number of processes in use. It is almost constant in each series of measurements. Consequently, apart from presenting the time that is required for the entire AMR procedure, detailed results are only presented for the 3rd and the 4th step, the dynamic load balancing stage and the migration of the simulation data.

103

7 Validation and Benchmarks

7.4.1 Space Filling Curves The performance of the entire AMR procedure is greatly influenced by the algorithm that is chosen during the dynamic load balancing stage. Consequently, the performance of the AMR pipeline is evaluated twice. In this section, the performance of the AMR algorithm is evaluated when using the load balancing approach based on SFCs introduced in Sections 6.4.1 and 4.3. In the next section, the same series of benchmarks is run again with a diffusion-based load balancing approach. Since LBM-based simulations require per-level balancing, a global synchronization of all block IDs using an MPI allgather operation is necessary when using the SFC-based balancing algorithm from Section 6.4.1.

SuperMUC On SuperMUC, there is barely any difference between using Morton or Hilbert order for the SFC-based balancing algorithm (see Table 7.10). If a small difference in execution time can be measured, the AMR procedure that uses the Hilbert order-based balancing is a bit slower due to the necessary tree reconstruction. For Morton order, all blocks are sorted lexicographically by their block IDs (cf. Section 4.3). Consequently, only detailed results for benchmarks that use Morton order-based balancing are presented. Results for Hilbert order-based balancing are very similar. The benchmarks always make use of hybrid parallel execution with four OpenMP threads per MPI process. This hybrid parallelization approach results in a fast execution time of the AMR procedure (faster as compared to MPI only, see Section 7.4.3) and also allows for the implementation of the LBM on nonuniform grids to reach high performance (cf. Table 7.3 of Section 7.3). Figure 7.12 shows detailed results when using SFC-based dynamic balancing. The time required for the load balancing stage only depends on the total number of globally available proxy blocks, but is independent of the amount of simulation data stored in each actual block. Consequently, the time required for the load balancing stage is identical in all three scenarios. As expected, the runtime of the SFC-based balancing algorithm increases Table 7.10: Time required for the execution of the entire AMR pipeline, including dynamic load balancing and data migration, on SuperMUC (in seconds). The table compares SFC-based load balancing using Hilbert order with SFC-based load balancing using Morton order. Times are listed for four benchmark series that only vary in the amount of data assigned to each block (and therefore each core). 62.1 ⋅ 103 cells/core

210 ⋅ 103 cells/core

497 ⋅ 103 cells/core

971 ⋅ 103 cells/core

cores

Hilbert

Morton

Hilbert

Morton

Hilbert

Morton

Hilbert

Morton

512 1 024 2 048 4 096 8 192 16 384 32 768 65 536

0.15 0.17 0.18 0.20 0.22 0.28 0.33 0.57

0.15 0.17 0.18 0.20 0.20 0.26 0.29 0.42

0.40 0.43 0.45 0.48 0.50 0.63 0.75 1.04

0.40 0.43 0.45 0.47 0.48 0.61 0.71 0.93

0.95 1.00 1.06 1.11 1.13 1.40 1.66 1.96

0.95 1.01 1.05 1.10 1.11 1.37 1.60 1.88

1.77 1.86 1.96 2.01 2.06 2.56 3.01 3.50

1.75 1.86 1.95 2.02 2.04 2.50 2.93 3.38

104

7.4 Adaptive Mesh Refinement

62.1 ⋅ 103 cells/core

0.5

time in sec.

0.8 time in sec.

time in sec.

0.4 0.3 0.2 0.1 0

4 3.5 3 2.5 2 1.5 1 0.5 0

210 ⋅ 103 cells/core

1

29

210

211

212 213 cores

214

215

0.6 0.4 0.2 0

216

29

210

211

212 213 cores

214

215

216

971 ⋅ 103 cells/core

complete AMR cycle data migration dynamic load balancing (SFC – Morton order)

29

210

211

212 213 cores

214

215

216

Figure 7.12: Detailed runtime for the entire AMR cycle on SuperMUC when using SFC-based dynamic load balancing for three different series of benchmarks that only vary in the amount of data allocated for each block. A breakdown of all times used to generate these graphs is available in Table 7.11. Table 7.11: Breakdown of all times (in seconds) from the series of benchmarks outlined in Figure 7.12. Note that although there are additional measurements provided in this table for a benchmark that uses 497 ⋅ 103 cells/core, there are no corresponding graphs in Figure 7.12. complete AMR cycle 103 )

cores 512 1 024 2 048 4 096 8 192 16 384 32 768 65 536

load balancing

cells/core (in 62.1 210 497

971

cells/core (in 62.1 210 497

0.15 0.17 0.18 0.20 0.20 0.26 0.29 0.42

1.75 1.86 1.95 2.02 2.04 2.50 2.93 3.38

0.003 0.007 0.009 0.016 0.020 0.043 0.052 0.140

0.40 0.43 0.45 0.47 0.48 0.61 0.71 0.93

0.95 1.01 1.05 1.10 1.11 1.37 1.60 1.88

data migration

103 )

0.003 0.007 0.008 0.017 0.020 0.045 0.053 0.133

0.003 0.007 0.008 0.016 0.019 0.042 0.055 0.150

971 0.003 0.006 0.009 0.017 0.020 0.043 0.052 0.138

cells/core (in 103 ) 62.1 210 497 971 0.08 0.09 0.09 0.09 0.09 0.12 0.14 0.19

0.24 0.25 0.27 0.28 0.28 0.36 0.46 0.53

0.59 0.62 0.65 0.68 0.69 0.89 1.10 1.28

1.09 1.15 1.20 1.26 1.25 1.66 2.08 2.41

with the number of processes due to the allgather operation and the subsequent sorting of all block IDs. If the amount of data per block increases, the time required for the migration stage increases proportionally. Furthermore, migration time also increases with the number of processes since SFC-based balancing results in a global reassignment of all blocks regardless of the blocks’ previous process associations. As a consequence, some of

105

7 Validation and Benchmarks the data must be migrated between distant processes, and this distance increases the more processes are utilized. On SuperMUC, more bandwidth is available for communication within the same compute island (1 island ≙ 8,192 cores) than for communication between different islands. Ultimately, SFC-based dynamic load balancing shows good performance (140 ms on 65,536 cores), with the runtime of the entire AMR cycle being dominated by the data migration stage, which includes the refinement and coarsening of the cell data. Besides the visualization of the results in Figure 7.12, a breakdown of all times is also presented in Table 7.11. For 13.8 billion cells (≙ 261 billion unknowns) on 65,536 cores, the entire AMR procedure is finished in less than one second. JUQUEEN On JUQUEEN, the benchmark also makes use of hybrid parallel execution. Here, the benchmark even uses eight threads per MPI process in order to reduce to total number of processes. Fewer processes result in a smaller number of globally available blocks, since the more threads are assigned to one process, the more cells are stored at each block. Using fewer processes for the same number of allocated cores also results in more memory available for one process. Dealing with fewer processes, fewer blocks, and more memory per process is crucial for the SFC-based balancing performance on JUQUEEN (see Section 7.4.3). Moreover, for hybrid parallel execution with eight threads per process, the implementation of the LBM on nonuniform grids still achieves good performance. Different to the results on SuperMUC, Table 7.12 shows that Morton order-based balancing is significantly faster than Hilbert order-based balancing. Here, the reconstruction of the forest of octrees and its subsequent traversal required for Hilbert order-based balancing results in a notable penalty on the Blue Gene/Q architecture. Table 7.12: Time required for the execution of the entire AMR pipeline, including dynamic load balancing and data migration, on JUQUEEN (in seconds). The table compares SFC-based load balancing using Hilbert order with SFC-based load balancing using Morton order. Times are listed for three benchmark series that only vary in the amount of data assigned to each block (and therefore each core).

cores 256 512 1 024 2 048 4 096 8 192 16 384 32 768 65 536 131 072 262 144 458 752

106

31.1 ⋅ 103 cells/core

127 ⋅ 103 cells/core

429 ⋅ 103 cells/core

Hilbert

Morton

Hilbert

Morton

Hilbert

Morton

0.48 0.57 0.60 0.75 0.88 1.16 1.51 2.12 3.35 5.82 10.63 18.54

0.47 0.53 0.57 0.71 0.78 0.90 1.04 1.26 1.65 2.44 4.05 7.17

1.02 1.18 1.23 1.39 1.52 1.68 2.23 2.78 3.93 6.50 11.41 19.22

0.97 1.06 1.20 1.25 1.32 1.43 1.61 1.88 2.20 3.09 4.73 7.80

2.83 2.64 2.93 3.12 3.14 3.25 3.85 4.73 5.82 8.63 13.57 21.62

2.85 3.02 2.92 3.04 2.90 3.18 3.37 3.76 3.97 5.08 6.77 10.22

8 7 6 5 4 3 2 1 0

31.1 ⋅ 103 cells/core

time in sec.

time in sec.

7.4 Adaptive Mesh Refinement

28

210

212

214 cores

216

218

8 7 6 5 4 3 2 1 0

127 ⋅ 103 cells/core

28

210

212

214 cores

216

218

429 ⋅ 103 cells/core

12 time in sec.

10 8

complete AMR cycle data migration dynamic load balancing (SFC – Morton order)

6 4 2 0

28

210

212

214 cores

216

218

Figure 7.13: Detailed runtime for the entire AMR cycle on JUQUEEN when using SFC-based dynamic load balancing for three different series of benchmarks that only vary in the amount of data allocated for each block. A breakdown of all times used to generate these graphs is available in Table 7.13.

Figure 7.13 illustrates the performance of the AMR procedure on JUQUEEN when employing SFC-based balancing using Morton order. Results when using Hilbert orderbased balancing are very similar, even though the runtime is notably longer. This section therefore refrains from presenting corresponding graphs and solely focuses on Morton order-based balancing. As expected, the runtime of the balancing algorithm is independent of the amount of data stored on each block, but it increases significantly the more processes are used. JUQUEEN has considerably more cores than SuperMUC and the disadvantage of a global synchronization based on an MPI allgather operation becomes clearly visible in the timings. This approach does not scale to extreme numbers of processes and, as a consequence, will not be feasible if the number of cores continues to increase. The time required for the migration procedure is again, as expected, proportional to the amount of data stored on each block. In contrast to SuperMUC, the communication network on JUQUEEN shows homogeneous performance across almost the entire system. As a result, for the same amount of data per process, the runtime of the migration procedure is almost completely independent of the number of processes. Ultimately, in large simulations, the runtime of the entire AMR algorithm is dominated by the SFC-based dynamic balancing step. A breakdown of all times is presented in

107

7 Validation and Benchmarks Table 7.13: Breakdown of all times (in seconds) from the series of benchmarks outlined in Figure 7.13. complete AMR cycle cores 256 512 1 024 2 048 4 096 8 192 16 384 32 768 65 536 131 072 262 144 458 752

103 )

cells/core (in 31.1 127 429 0.47 0.53 0.57 0.71 0.78 0.90 1.04 1.26 1.65 2.44 4.05 7.17

2.85 3.02 2.92 3.04 2.90 3.18 3.37 3.76 3.96 5.08 6.77 10.22

0.97 1.06 1.20 1.25 1.32 1.43 1.61 1.88 2.20 3.09 4.73 7.80

load balancing

data migration

103 )

cells/core (in 31.1 127 429

cells/core (in 103 ) 31.1 127 429

0.02 0.06 0.06 0.11 0.18 0.30 0.39 0.59 0.97 1.76 3.35 6.31

0.02 0.07 0.08 0.12 0.19 0.33 0.43 0.62 1.01 1.86 3.54 6.52

0.16 0.16 0.16 0.16 0.15 0.16 0.16 0.18 0.17 0.18 0.18 0.33

0.02 0.06 0.07 0.12 0.18 0.30 0.40 0.60 0.99 1.80 3.44 6.33

0.49 0.48 0.53 0.51 0.47 0.48 0.50 0.58 0.52 0.59 0.58 0.74

1.57 1.53 1.56 1.61 1.40 1.48 1.53 1.77 1.58 1.82 1.79 2.19

Table 7.13. On JUQUEEN, SFC-based AMR with up to 197 billion cells (≙ 3.7 trillion unknowns) is possible for simulations on all 458,752 cores. Executing one full AMR cycle will then take 10 seconds to complete.

7.4.2 Diffusion-Based Load Balancing As mentioned at the beginning of the previous section, this section evaluates the runtime of the AMR procedure when using the diffusion-based load balancing approach introduced in Section 6.4.2. Similar to comparing Morton and Hilbert order-based SFCs, two different configurations for the diffusion algorithm are compared in this section. The first configuration exclusively uses the push scheme, with 15 flow iterations during each execution of the push algorithm. The other configuration alternates between calling the push and the pull algorithm for every execution of the diffusion procedure outlined in Algorithm 6.5. Here, each call to the push or pull algorithm only executes 5 flow iterations. For the rest of Chapter 7, these two configurations are referred to as “push” and “push/pull” configuration, respectively. For all subsequent benchmarks, both variants always converge towards perfect balance as shown in Table 7.9, for every number of cores and on both systems, SuperMUC and JUQUEEN. When using the “push” configuration, executing fewer flow iterations (5, 8, 10, 12, etc.) does not always result in perfect balance, hence the 15 flow iterations that are used for this configuration. SuperMUC On SuperMUC, again hybrid parallelization with four threads per process is used for all benchmarks. Figure 7.14 shows the number of main iterations1 that are required for the diffusion procedure until perfect balance is achieved on SuperMUC. The number 1

number of main iterations ≙ number of times Algorithm 6.5 is executed during the dynamic load balancing stage

108

7.4 Adaptive Mesh Refinement 10

iterations

8 6

push push/pull

4 2 0

29

210 211 212 213 214 215 216 cores

Figure 7.14: Number of main iterations that are required for the diffusion procedure until perfect balance is achieved on SuperMUC. Table 7.14: Time required for the execution of the entire AMR pipeline, including dynamic load balancing and data migration, on SuperMUC (in seconds). The table compares diffusion-based load balancing using the push with diffusion-based load balancing using the push/pull scheme. Times are listed for four benchmark series that only vary in the amount of data assigned to each block (and therefore each core). 62.1 ⋅ 103 cells/core

210 ⋅ 103 cells/core

497 ⋅ 103 cells/core

971 ⋅ 103 cells/core

cores

push

push/pull

push

push/pull

push

push/pull

push

push/pull

512 1 024 2 048 4 096 8 192 16 384 32 768 65 536

0.14 0.16 0.18 0.20 0.21 0.23 0.24 0.27

0.15 0.17 0.18 0.20 0.21 0.23 0.26 0.26

0.36 0.38 0.41 0.42 0.43 0.46 0.48 0.53

0.36 0.39 0.40 0.43 0.44 0.46 0.49 0.53

0.86 0.91 0.93 0.96 0.96 1.00 1.02 1.11

0.89 0.88 0.94 0.98 0.98 1.00 1.08 1.08

1.53 1.60 1.63 1.65 1.65 1.69 1.75 1.80

1.58 1.62 1.65 1.69 1.69 1.80 1.86 1.90

of iterations increases very little as the number of processes/utilized cores increases exponentially. The push/pull version, on average, requires one more iteration than the push only version. Moreover, executing the pull algorithm is more expensive than executing the push algorithm (see Algorithms 6.6 and 6.7). However, the push/pull version performs considerably fewer flow iterations (5 instead of 15). Ultimately, both versions result in almost identical times for the entire AMR procedure as shown in Table 7.14. Consequently, a more detailed analysis of the performance of the AMR procedure when using the diffusion-based approach focuses on the push/pull version only. These detailed results are presented in Figure 7.15. Again, the time required for the balancing algorithm is independent of the amount of simulation data stored on each actual block. Contrary to SFC-based balancing, however, the time required for the diffusion-based balancing increases much slower and mainly depends on the number of main iterations required for the diffusion approach. If the number of main iterations is identical (as is the case for 215 and 216 cores, for example), the time required for the dynamic load balancing stage also remains almost identical. Furthermore, since diffusion-based balancing requires only communication between neighboring processes, the time required for the data migration stage proves to be virtually independent of the total number of processes.

109

7 Validation and Benchmarks

62.1 ⋅ 103 cells/core

0.3 0.2 0.1 0

29

210

211

212 213 cores

214

210 ⋅ 103 cells/core

0.8 time in sec.

time in sec.

0.4

215

0.6 0.4 0.2 0

216

29

210

211

212 213 cores

214

215

216

971 ⋅ 103 cells/core

2.5 time in sec.

2 complete AMR cycle data migration dynamic load balancing (diffusion – push/pull)

1.5 1 0.5 0

29

210

211

212 213 cores

214

215

216

Figure 7.15: Detailed runtime for the entire AMR cycle on SuperMUC when using diffusion-based dynamic load balancing for three different series of benchmarks that only vary in the amount of data allocated for each block. A breakdown of all times used to generate these graphs is available in Table 7.15. Table 7.15: Breakdown of all times (in seconds) from the series of benchmarks outlined in Figure 7.15. Note that although there are additional measurements provided in this table for a benchmark that uses 497 ⋅ 103 cells/core, there are no corresponding graphs in Figure 7.15. complete AMR cycle 103 )

cores 512 1 024 2 048 4 096 8 192 16 384 32 768 65 536

load balancing

cells/core (in 62.1 210 497

971

cells/core (in 62.1 210 497

0.15 0.17 0.18 0.20 0.21 0.23 0.26 0.26

1.58 1.62 1.65 1.69 1.69 1.80 1.86 1.90

0.014 0.024 0.031 0.046 0.052 0.069 0.085 0.087

0.36 0.39 0.40 0.43 0.44 0.46 0.49 0.53

0.89 0.87 0.93 0.98 0.98 1.01 1.08 1.08

data migration

103 )

0.013 0.023 0.030 0.045 0.051 0.071 0.086 0.088

0.013 0.022 0.031 0.045 0.052 0.070 0.088 0.086

971 0.013 0.023 0.031 0.045 0.052 0.072 0.086 0.087

cells/core (in 103 ) 62.1 210 497 971 0.07 0.07 0.07 0.07 0.07 0.07 0.07 0.07

0.20 0.20 0.20 0.20 0.20 0.20 0.23 0.23

0.53 0.51 0.51 0.52 0.51 0.51 0.54 0.56

0.88 0.87 0.88 0.88 0.87 0.89 0.97 1.01

Ultimately, the time required for the dynamic load balancing always remains below 90 ms. For simulations where large amounts of data must be communicated during the final migration phase, the runtime of the AMR algorithm remains almost constant. A breakdown of all times is also available in Table 7.15. For 13.8 billion cells (≙ 261 billion unknowns) on 65,536 cores, the entire AMR procedure completes in half a second. This is

110

all-reduce overhead in %

7.4 Adaptive Mesh Refinement 30 20 10 0 -10 -20

29

210

211

212 213 cores

214

215

216

Figure 7.16: Average overhead in runtime on SuperMUC for the diffusion algorithm (Algorithm 6.5 with five flow iterations and the push scheme) when three global reductions (via Algorithm 6.8) are included at the beginning of the diffusion algorithm. The bars show the minimal and maximal overhead that was measured during a series of multiple benchmark runs. A negative overhead indicates a faster execution of the version including the global reductions.

almost twice as fast when compared to SFC-based balancing. As indicated by Figure 7.14 and stated at the end of Section 6.4.2, the diffusion-based dynamic load balancing algorithm makes use of a variable number of main diffusion iterations by default. As a consequence, Algorithm 6.5 is only executed as many times as are needed for the block partitioning to achieve balance. In order to enable a variable number of main diffusion iterations, three global reductions (via the execution of Algorithm 6.8) must by included at the beginning of the diffusion algorithm. Another series of benchmarks aims at evaluating the overhead that is introduced by these reductions. For these benchmarks, only one main diffusion iteration is performed, meaning Algorithm 6.5 (using five flow iterations and the push scheme) is only executed once, its runtime is measured, and the benchmark is terminated. First, this execution of the diffusion algorithm includes Algorithm 6.8 and the three global reductions. Immediately afterwards, the benchmark is run again, but this time without the inclusion of Algorithm 6.8. Based on these two consecutive runs, the overhead that is introduced by the reductions is calculated. The two runs themselves are repeated multiple times. The average overhead for one iteration of the diffusion algorithm when including the reductions is outlined in Figure 7.16. This graph also includes bars that show the minimal and maximal overhead that was measured. A negative overhead indicates that for the corresponding execution of the benchmark, the version including the reductions was measured to be faster. Ultimately, the results indicate that the time required for the reductions is significantly less than the fluctuations in the runtime for the diffusion algorithm between multiple benchmarks runs. The overhead of approx. 2.5 % (≙ average overhead for all measurements across all core counts) that is introduced by the reductions for the diffusion algorithm has an even smaller impact on the runtime of the entire AMR procedure, considering that the dynamic load balancing is only one part of the AMR pipeline (see Figure 7.15). Consequently, the results on SuperMUC confirm that the global reductions introduced via Algorithm 6.8 can be activated without hesitation.

111

7 Validation and Benchmarks

JUQUEEN On JUQUEEN, the benchmarks now also use four threads per process, different from the eight threads that were employed for SFC-based balancing in Section 7.4.1. For diffusion-based balancing, the total number of processes does not need to be restricted and the version utilizing four instead of eight threads per process results in the best performance for the entire simulation, including the algorithm for the LBM on nonuniform grids (cf. Table 7.5 of Section 7.3). Similar to the results on SuperMUC, Figure 7.17 shows that the number of main iterations required to achieve perfect balance slightly increases while the number of processes increases exponentially. As expected due to the results on SuperMUC, the version utilizing the push/pull scheme requires, on average, one more iteration than the push only configuration. Just as on SuperMUC and as outlined in Table 7.16, the runtime of the entire AMR procedure shows hardly any difference between the push/pull and the push only version. 12

iterations

10 8

push push/pull

6 4 2 0

29

210 211 212 213 214 215 216 217 218 cores

52 87 45

28

Figure 7.17: Number of main iterations that are required for the diffusion procedure until perfect balance is achieved on JUQUEEN. Table 7.16: Time required for the execution of the entire AMR pipeline, including dynamic load balancing and data migration, on JUQUEEN (in seconds). The table compares diffusion-based load balancing using the push with diffusion-based load balancing using the push/pull scheme. Times are listed for three benchmark series that only vary in the amount of data assigned to each block (and therefore each core).

112

31.1 ⋅ 103 cells/core

127 ⋅ 103 cells/core

429 ⋅ 103 cells/core

cores

push

push/pull

push

push/pull

push

push/pull

256 512 1 024 2 048 4 096 8 192 16 384 32 768 65 536 131 072 262 144 458 752

0.60 0.71 0.80 0.91 1.03 1.09 1.18 1.23 1.35 1.36 1.48 1.55

0.61 0.69 0.84 0.91 1.09 1.19 1.22 1.32 1.38 1.49 1.51 1.57

1.05 1.22 1.34 1.44 1.56 1.65 1.74 1.85 1.86 1.79 1.94 2.06

1.10 1.20 1.38 1.43 1.62 1.75 1.76 1.92 1.98 2.10 2.10 2.11

2.17 2.40 2.48 2.52 2.67 2.67 2.80 2.87 3.02 3.04 3.15 3.26

2.23 2.27 2.47 2.52 2.76 2.78 2.87 2.92 3.03 3.33 3.25 3.52

7.4 Adaptive Mesh Refinement

31.1 ⋅ 103 cells/core

2

1.5 1 0.5 0

127 ⋅ 103 cells/core

2.5 time in sec.

time in sec.

2

28

210

212

214 cores

216

1.5 1 0.5 0

218

28

210

212

214 cores

216

218

429 ⋅ 103 cells/core

5 time in sec.

4 complete AMR cycle data migration dynamic load balancing (diffusion – push/pull)

3 2 1 0

28

210

212

214 cores

216

218

Figure 7.18: Detailed runtime for the entire AMR cycle on JUQUEEN when using diffusion-based dynamic load balancing for three different series of benchmarks that only vary in the amount of data allocated for each block. A breakdown of all times used to generate these graphs is available in Table 7.17. Table 7.17: Breakdown of all times (in seconds) from the series of benchmarks outlined in Figure 7.18. complete AMR cycle cores 256 512 1 024 2 048 4 096 8 192 16 384 32 768 65 536 131 072 262 144 458 752

103 )

cells/core (in 31.1 127 429 0.61 0.69 0.84 0.91 1.09 1.19 1.22 1.32 1.38 1.49 1.51 1.57

2.23 2.27 2.47 2.53 2.76 2.78 2.87 2.92 3.03 3.33 3.25 3.52

1.10 1.20 1.38 1.43 1.62 1.75 1.76 1.92 1.98 2.10 2.10 2.11

load balancing

data migration

103 )

cells/core (in 31.1 127 429

cells/core (in 103 ) 31.1 127 429

0.18 0.22 0.33 0.37 0.52 0.61 0.63 0.72 0.77 0.88 0.88 0.95

0.20 0.23 0.35 0.40 0.54 0.64 0.66 0.75 0.81 0.92 0.92 0.99

0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.11

0.20 0.23 0.33 0.39 0.54 0.63 0.64 0.75 0.80 0.91 0.90 0.98

0.30 0.30 0.30 0.28 0.28 0.28 0.27 0.28 0.29 0.29 0.29 0.32

0.84 0.84 0.86 0.78 0.77 0.78 0.77 0.79 0.81 0.83 0.81 0.93

The scaling characteristics for the entire AMR procedure are similar to the results on SuperMUC. Detailed results when using the push/pull scheme on JUQUEEN are presented in Figure 7.18. Results for the push only version are almost completely identical and therefore omitted from those graphs. As expected, the time required for the dynamic

113

all-reduce overhead in %

7 Validation and Benchmarks 10 5 0 -5 -10

28

210

212

214 cores

216

218

Figure 7.19: Average overhead in runtime on JUQUEEN for the diffusion algorithm (Algorithm 6.5 with five flow iterations and the push scheme) when three global reductions (via Algorithm 6.8) are included at the beginning of the diffusion algorithm. The bars show the minimal and maximal overhead that was measured during a series of multiple benchmark runs. A negative overhead indicates a faster execution of the version including the global reductions.

balancing stage is independent of the amount of simulation data stored on each actual block. Moreover, the very slow increase in runtime for the balancing algorithm (compared to the exponential increase in the number of cores and processes) again conforms with the increase in main iterations required for the diffusion scheme to achieve perfect global balance. For a fixed amount of data assigned to each block, the runtime of the data migration stage, however, remains nearly unchanged from 256 up to 458,752 cores. A corresponding breakdown of all times on JUQUEEN is available in Table 7.17. Consequently, for massively parallel simulations, the diffusion-based dynamic load balancing approach promises superior scaling characteristics as compared to the SFC-based load balancing scheme evaluated in Section 7.4.1. When using the diffusion-based approach, executing one entire AMR cycle for a simulation that consists of 197 billion cells (≙ 3.7 trillion unknowns) and runs on all 458,752 cores of JUQUEEN only takes 3.5 seconds to complete, as opposed to the 10 seconds when using the SFC-based balancing algorithm. As on SuperMUC, the runtime overhead generated by the global reductions, which are necessary to allow for a variable number of diffusion iterations, is evaluated with another series of benchmarks. These benchmarks are identical to the benchmarks on SuperMUC. One main iteration of the diffusion algorithm (Algorithm 6.5 with five flow iterations and the push scheme) is executed, its runtime is measured, and the benchmark is terminated. This measurement is performed once with and once without the global reductions. The two corresponding runs of the benchmark are themselves repeated multiple times. The resulting overhead of the version using the reductions is outlined in Figure 7.19. On JUQUEEN, the runtime of one iteration of the diffusion algorithm is significantly longer than the runtime required for the execution of the three reduction operations. This is true for all measurements up to 458,752 cores. Consequently, the average overhead for all measurements across all core counts of the version including the reductions is effectively 0 %. The graphs in Figure 7.19 are the result of fluctuations in the runtime of the diffusion algorithm between different benchmark runs. Given enough samples, the average overhead can be expected to result in very close to 0 % for each core count.

114

7.4 Adaptive Mesh Refinement

7.4.3 Comparison between Different Strategies Finally, Figures 7.20 and 7.21 provide a direct comparison between different parallelization and dynamic load balancing strategies on SuperMUC and JUQUEEN, respectively. The larger the number of processor cores, the greater the performance advantage of the diffusion-based balancing approach over the SFC-based balancing algorithm. A hybrid parallelization strategy with multiple threads per process results in a smaller total number of processes. As a result, algorithms whose runtime increases proportionally to the number of processes benefit from an MPI/OpenMP hybrid parallelization. As expected, if the AMR procedure uses the SFC-based balancing algorithm and relies on MPI only parallelization, the resulting runtime is greater as compared to using the same AMR pipeline in combination with MPI/OpenMP hybrid parallelization. This is generally true with only one exception. On SuperMUC, when using the SFC-based balancing scheme, MPI only parallelization is faster than MPI/OpenMP hybrid parallelization in case each block contains a large amount of data (971 ⋅ 103 cells/core). Here the data migration stage dominates the runtime (cf. Figure 7.12), but the current implementations of the algorithms that serialize, send, and receive the block data do not yet fully support thread-parallel execution. 62.1 ⋅ 103 cells/core

1.2

1.25 time in sec.

time in sec.

1 0.8 0.6 0.4 0.2 0

29

210

211

212 213 cores

214

215

216

0.75 0.5 0

29

210

211

212 213 cores

214

215

216

971 ⋅ 103 cells/core

3.75 time in sec.

1

0.25

4.5

SFC – Morton order (MPI only)

3

SFC – Morton order (MPI+OpenMP → 4 threads/proc.)

2.25 1.5

diffusion – push/pull (MPI+OpenMP → 4 threads/proc.)

0.75 0

210 ⋅ 103 cells/core

1.5

29

210

211

212 213 cores

214

215

216

Figure 7.20: Runtime of one entire AMR cycle, including dynamic load balancing and data migration, on SuperMUC. Results are compared for different parallelization and balancing strategies.

115

7 Validation and Benchmarks

31.1 ⋅ 103 cells/core

20

16 time in sec.

time in sec.

16 12 8 4 0

28

210

212

214 cores

216

218

8 4 0

28

210

212

214 cores

216

218

SFC – Morton order (MPI only)

12 time in sec.

12

429 ⋅ 103 cells/core

15

9

SFC – Morton order (MPI+OpenMP → 8 threads/proc.)

6

diffusion – push/pull (MPI+OpenMP → 4 threads/proc.)

3 0

127 ⋅ 103 cells/core

20

28

210

212

214 cores

216

218

Figure 7.21: Runtime of one entire AMR cycle, including dynamic load balancing and data migration, on JUQUEEN. Results are compared for different parallelization and balancing strategies.

On JUQUEEN, the advantages of the diffusion-based balancing approach are obvious. AMR that relies on the SFC-based balancing scheme proposed in Sections 6.4.1 and 4.3 suffers from the scheme’s O(N log N) scaling properties, whereas, in direct comparison, AMR that relies on diffusion-based balancing shows nearly constant runtime. For the SFC-based balancing and MPI only parallelization, some results are missing in the graphs of Figure 7.21. The corresponding simulations cannot complete the AMR procedure successfully since they run out of memory during the global allgather synchronization of the balancing stage. In order to successfully use the SFC-based balancing algorithm on all 458,752 cores of JUQUEEN, an MPI/OpenMP hybrid parallelization scheme must be used in order to reduce the number of processes. As a consequence, more memory is available for each individual process and the allgather operation can be executed successfully. Ultimately, the diagrams in Figure 7.21 show the superior performance and scaling characteristics of a fully distributed AMR pipeline that relies on the diffusion-based instead of the SFC-based dynamic load balancing algorithm. However, since the diffusion-based balancing approach represents an iterative, local balancing scheme, perfect balance cannot be guaranteed, as opposed to a global, SFCbased balancing scheme. Consequently, although the chosen benchmark puts a lot of pressure on the repartitioning procedure by triggering refinement/coarsening for more than two thirds of the global mesh, other scenarios might not result in perfect balance for

116

7.5 An Example Application the two configurations introduced in Section 7.4.2 (“push” with 15 and “push/pull” with 5 flow iterations). The advantage of the iterative, fully distributed balancing approach is, however, that it makes the memory requirement of the entire AMR pipeline completely independent of the total number of processes. The memory required by a process here only depends on the amount of simulation data that is assigned to this process, but it is independent of the global amount of data and the total number of processes. In consequence, as the results in this and the previous section indicate, this approach can potentially scale to millions of processes and beyond and is thus well prepared for the upcoming generation of exascale supercomputers. Furthermore, the results of various different simulations that were performed during the development and evaluation of the diffusion-based scheme indicate that even if perfect balance is not achieved, the peak workloads are generally greatly reduced already after the first few main iterations of the diffusion algorithm. For the LBM-based simulations that are studied with the waLBerla framework, the sizes of the blocks are always chosen such that each process ends up with only few blocks per level. Typically, each process will contain no more than around four blocks of each level. As a result, most of the time required for one iteration of the LBM is spent executing the compute kernel that updates the 19 (for the D3Q19 model) or 27 (D3Q27) values stored in each cell of the grid contained within each block. Less time is spent for communication and the synchronization of data between neighboring blocks. These kinds of simulations that only contain very few blocks per process (with hundreds to thousands of cells per block) do not face the same partitioning quality challenges that unstructured, cell-based codes are facing where for each process thousands of individual cells must be kept as compact agglomerations with low surface to volume ratios. For the LBM-based simulations with few blocks per process, the partitioning quality is mainly determined by the balance/imbalance of the number of blocks per process. As shown in Section 7.4.2, few iterations (≤ 10) of the iterative scheme have been enough for the benchmark application to eliminate all those imbalances.

7.5 An Example Application In order to demonstrate the capability of the presented algorithms, this section finally turns to an application-oriented example. Figures 7.22 and 7.23 (with further illustrations available in Appendix A.1) illustrate a phantom geometry of the vocal fold as it is used to study the voice generation within the human throat [Bec09]. This direct numerical simulation with a Reynolds number of 2,500 uses the LBM with the D3Q27 lattice and the TRT collision model. The simulation runs on 3,584 cores of the SuperMUC phase 2 system and starts with a completely uniform partitioning of the entire domain into a total of 23.8 million fluid cells. AMR with a refinement criterion based on velocity gradients as described in Section 6.6.1 causes the simulation to end up with 308 million fluid cells distributed to 5 different levels.

117

7 Validation and Benchmarks

Figure 7.22: 3D simulation using a phantom geometry of the human vocal fold. The figure corresponds to time step 60,000 (≙ time step 960,000 on the finest level) and shows the domain partitioning into blocks. Each block consists of 34 × 34 × 34 cells. The different colors of the blocks depict their process association. However, note that the color transitions are too faint to differentiate between the 3,584 different processes. The figure also shows the flow field on a cutting plane in the middle of the channel. Additional visualizations of the flow field on cutting planes can be found in Figure 7.23 and Appendix A.1.

Figure 7.23: Visualizations of the flow field on two cutting planes through the middle of the simulation domain as viewed from the top and the side. Both flow fields belong to the state of the simulation depicted in Figure 7.22. The grid overlay shows the block partitioning, not the individual lattice cells. Moreover, these figures do not show the entire domain. The actual domain extends slightly further to the left and a lot further to the right (to ensure a non-reflecting outflow boundary).

The simulation runs for approx. 24 hours (including constant output for later visualization and analysis) and spans 180,000 time steps on the coarsest and 2,880,000 time steps on the finest grid. Although the AMR procedure (cf. Algorithm 6.1) is executed in every time step of the 180,000 coarse time steps, actual dynamic refinement/coarsening of the grid data is only triggered 537 times. Ultimately, executing the entire AMR pipeline, including dynamic load balancing and the migration of the data, only happens every 335 time steps

118

7.5 An Example Application (on average). The time spent executing the AMR procedure accounts for 17 % of the total runtime. 95 % of that time is spent on the first AMR pipeline stage and the evaluation of the refinement criterion, i.e., the decision of whether blocks require refinement or are candidates for coarsening. Consequently, only 5 % of the time spent for AMR (≙ 1 % of the total runtime) is consumed by dynamic load balancing and data migration. Of the 308 million fluid cells at the end of the simulation, almost 50 % are located on the second finest level. Over the course of the simulation, the total number of blocks increases from 612 (all on the coarsest level) to 8030 (distributed among all 5 levels). During the final phase of the simulation, 311 times less memory is required and 701 times less workload is generated as compared to the same simulation with the entire domain refined to the finest level. Such a simulation with a uniform, fine grid would consist of approx. 96 billion fluid cells. In terms of memory alone, this simulation would exceed the available memory for 3,584 cores on the SuperMUC phase 2 system. Assuming the simulation would fit into the memory and achieve a sustained performance of 4.1 MFLUP/SC (cf. Table 7.4), it would take approx. 220 days to complete. The fact that 17 % of the total runtime are consumed by the execution of the AMR procedure, 5 % of which are eventually consumed by the dynamic load balancing and the migration of the data between processes, is very specific for the simulation outlined in this section. Providing a general statement about the impact of the SAMR scheme presented in this thesis on the performance of any arbitrary simulations is impossible. The impact on performance always depends on a variety of application-specific factors: the time needed for executing the compute kernels of the simulation, the time needed for evaluating the simulation-specific refinement criteria, the frequency in which the AMR procedure is actually executed, etc.

119

8 Outlook and Conclusion This chapter concludes the thesis. First, Section 8.1 provides an outlook on additional capabilities for the simulation framework that are enabled by the data structures and algorithms presented in the previous seven chapters. Most of these additional capabilities are currently in active development, with some of them being already available as part of the open source code that can be downloaded from http://walberla.net. Finally, Section 8.2 summarizes the main results and contributions of the thesis.

8.1 Additional Framework Capabilities The mechanisms that are used during the four-stage AMR procedure as well as the AMR procedure itself enable the implementation of additional capabilities like a checkpoint/restart mechanism and the support for resilience, both briefly illustrated in the following two Sections 8.1.1 and 8.1.2. Since the entire SAMR implementation follows the open/closed principle and therefore does not restrict itself to a specific simulation method, Section 8.1.3 outlines how other methods completely different to the LBM can run on top of the presented data structures. Finally, Section 8.1.4 proposes future extensions for the load balancing implementation.

8.1.1 Checkpoint/Restart Since for all block data there also exist corresponding serialization functions that are required for migrating blocks during the AMR procedure (see Section 6.5 and Code Snippets 6.1 and 3.1), a checkpoint/restart mechanism can be implemented without much additional effort. Checkpoint/restart mechanisms can be useful to support fault tolerance through periodic snapshots of the application to disk. Moreover, they can also be used for simulations whose runtime exceeds the 24 and 48 hour limits often enforced by the job scheduling systems on large-scale supercomputers. Long running simulations can create a snapshot of the application data when reaching these runtime limits. When a snapshot has been saved to disk, the simulation can terminate and be resumed later by loading the previously created snapshot. The checkpoint is created by a serialization of all block data and storage to disk using parallel MPI I/O. For the serialization, the serialize function of class BlockDataHandling is called for all block data (cf. Code Snippet 3.1). The returned Buffer object internally holds a byte stream that can easily be saved to a file. Similarly, the implementation of the

121

8 Outlook and Conclusion restart mechanism consists of reading the data from disk, followed by the deserialization of the block data. Here, the deserialize function of class BlockDataHandling is invoked. Consequently, the entire implementation of the checkpoint/restart mechanism consists of calling the serialization/deserialization functions available for all block data together with fitting MPI functions that handle parallel I/O. The current topology of the distributed block partitioning is also saved to a file as outlined in Section 4.4. This file is loaded during application restart and used for initializing the data structure prior to deserializing the block data.

8.1.2 Resilience Similar to the checkpoint/restart implementation, the block data serialization functionality together with the redistribution mechanism of the AMR pipeline allow to implement support for resilience [Koh17, Koh18]. The goal is to enable simulations that continue to run properly, with fewer resources, and without any need to restart the application, in case one process or even multiple processes crash due to CPU or node-level failure. The key concept of this implementation that uses the data structures and algorithms presented in this thesis is to omit writing any data to disk, but instead create periodic, redundant, in-memory snapshots that allow to restore a previous state of the simulation at runtime, similar to the functionality available in Charm++ [Zhe12]. During creation of such a snapshot, every process X stores its own, current state as well as the current state of one other process Y . As a result, the snapshot’s data is stored redundantly. One possible strategy is for process X to store its own state as well as the state of process Y = (X + N /2) mod N , with N being the total number of processes. If process Y fails, all processes fall back to the state of the previous snapshot by restoring their own data, with process X additionally also restoring the data of Y . The restoration of the data is immediately followed by the execution of one AMR cycle that ensures load balance of the simulation on fewer processes. Consequently, in the best-case scenario, up to half of all processes can fail simultaneously, with the simulation continuing to run properly. These redundant in-memory snapshots, however, occupy 2⁄3 of the memory, leaving only 1⁄3 of the available memory to the actual simulation. This strategy, on the other hand, completely omits the time-intensive generation of checkpoints to disk and only requires much faster pairwise point-to-point inter-process communication during snapshot creation. Ultimately, mainly functionality already available from the AMR implementation discussed in Chapter 6 is used. Additionally, a fault-tolerant implementation of MPI is required that signals failed processes to the application instead of shutting down execution [Bla13].

8.1.3 Other Applications and Simulation Methods Since the framework’s underlying data structures support data of arbitrary type, simulations different from grid-based methods can be implemented on top of the AMR

122

8.1 Additional Framework Capabilities functionality presented in Chapter 6. For rigid body physics simulations, for example, every block can store a list of all bodies whose center is located within the AABB of the block. Such a simulation must provide callback functions that can be used for serializing/deserializing this list of bodies. Then, for example, the number of bodies stored at a block can be used as refinement/coarsening criterion. If the number of bodies exceeds some predefined upper limit, the block is marked for refinement. Similarly, if the accumulated number of bodies contained in an octet of blocks falls below a particular lower limit, the blocks are marked for a potential merge. This approach results in an adaptively and dynamically load-balanced rigid body physics simulation. A corresponding implementation based on the parallel algorithms presented in [Igl10, Pre14, Pre15] is currently in development for the waLBerla framework. This implementation will enable stand-alone rigid body physics simulations as well as more complex simulations that couple fluid with rigid body dynamics and make full use of the AMR capabilities introduced in this thesis.

8.1.4 Load Balancing The four-step SAMR procedure for the dynamic domain repartitioning presented in Chapter 6 includes one step that is dedicated to load balancing. In this step, the SAMR framework, however, only manages the data structures that are about to be changed because of decisions made by the load balancing algorithm. The actual load balancing algorithm is not part of the SAMR framework code, but must be provided via a callback function during the setup stage of a simulation. Sections 6.4.1 and 6.4.2 outline two different load balancing approaches and how they can work on top of the underlying block-structured domain partitioning. Since the actual load balancing algorithm is injected into the SAMR pipeline as a callback function, other dynamic load balancing strategies can be incorporated without the need to modify framework code. Section 7.4 shows that both the SFC-based and the diffusion-based dynamic load balancing approach outlined in Sections 6.4.1 and 6.4.2 work well with LBM-based simulations as they are typically set up with the waLBerla framework. As stated at the end of Section 7.4.3, these LBM-based simulations have a small number of blocks per process and hence the partitioning quality is mainly determined by the balance/imbalance of this number of blocks per process. Future work that builds on the SAMR pipeline presented in this thesis will further study and analyze the current partitioning quality and its influence on different simulations. Consequently, future work will also investigate the applicability of dynamic load balancing algorithms provided by other, specialized libraries like ParMETIS [ParME, Sch02], Zoltan [Bom12, Zoltan], or PT-Scotch [Che08, PTSco]. Furthermore, as an alternative to the SFC-based algorithm outlined in Section 6.4.1, future work may also evaluate a different approach that uses the principles of a distributed sorting network for generating and partitioning the curve in parallel [Lui11].

123

8 Outlook and Conclusion

8.2 Results Summary One of the core contributions of this thesis is an approach for SAMR that exploits the hierarchical nature of a block-structured domain partitioning by using a lightweight, temporary copy of the core data structure during the AMR process. The temporarily created data structure does not contain any of the simulation data and only acts as a proxy for the actual data structure. This proxy data structure was shown to enable inexpensive, iterative, diffusion-based dynamic load balancing schemes that do not require to communicate the actual simulation data during the entire load balancing phase. The key concepts behind the distributed data structures and algorithms that are the basis for the SAMR procedure have also been presented. The domain partitioning was shown to be based on a forest of octrees-like space partitioning into blocks, where each block only knows the location (≙ process rank) and ID of every direct neighbor block, but has no knowledge about all the other blocks in the simulation. This leads to a distributed graph structure that is stored in a perfectly distributed manner, i.e., meta data memory consumption is completely independent of the total number of processes. Ultimately, this allows simulations to efficiently scale to extreme-scale parallel machines. Using these data structures and algorithms as a basis, the thesis introduced a distributed memory parallelization approach for the grid refinement technique for the LBM originally proposed by [Roh06]. In order for this parallelization to be suitable for massively parallel simulations, the thesis suggested an asynchronous communication scheme that requires only minimal communication and a level-wise load balancing strategy that perfectly fits the structure of the algorithms involved. Since all presented benchmarks and simulations make use of the TRT model instead of the SRT or MRT model often found in literature, a method for scaling the two relaxation parameters of the TRT collision model across different grid resolutions was proposed. The correctness of the introduced parallelization scheme for the LBM on nonuniform grids was verified for Couette and Poiseuille flow. Furthermore, not only near-perfect weak scalability on two petascale supercomputers but also an absolute performance of close to a trillion cell updates per second for a simulation with almost two million concurrent threads and a total number of 886 billion cells was demonstrated. When strong scaling a benchmark application with several millions of cells on the Intel-based SuperMUC system, the simulation was able to reach a performance of less than one millisecond per time step. A hybrid parallel execution model where multiple OpenMP threads are executed per MPI process did prove to be essential for best performance, especially when strong scaling simulations to massively parallel systems. Moving from completely static to fully dynamic simulations, it was demonstrated that an entire AMR cycle can be executed in half a second for a mesh that consists of 13.8 billion cells when using 65,536 processor cores. The applicability of the SAMR algorithm was also confirmed for meshes with up to 197 billion cells, distributed to almost half a million cores. As such, the proposed SAMR approach demonstrates state-of-the-art scalability. To the best knowledge of the author, the scale as well as the performance of the benchmarks

124

8.2 Results Summary presented in this thesis significantly exceed the data previously published for LBM-based simulations capable of AMR [Fre08, Has14b, Lah16, Meh11, Neu13, Sch11, Yu09]. All concepts and algorithms presented in this thesis have been implemented in the waLBerla software framework. Moreover, the LBM-based benchmark applications used in Chapter 7 were also added to the framework and are available as part of the software. Consequently, being included in the waLBerla software package, all the work presented in this thesis is available under an open source license and can be freely downloaded at http://walberla.net. As outlined in more detail in the previous section, future work will use the distributed data structures combined with the presented SAMR algorithm for meshfree simulation methods that work fundamentally different to the LBM. Future work will also see the integration and evaluation of additional dynamic load balancing algorithms based on existing, specialized libraries [Bom12, Che08, Sch02]. Since this thesis concentrated on the parallelization of one specific grid refinement scheme for the LBM, future work might also look into the evaluation of other refinement algorithms and other interpolation schemes [Fak14, Fak16, Qi16].

125

A Appendix A.1 Simulation of a Vocal Fold Phantom Geometry Both Figures 7.22 and 7.23 of Section 7.5 only depict the simulation of a phantom geometry of the vocal fold for a single point in time at time step 60,000. Figure A.1 provides additional visualizations of the same simulation. Figures A.2 and A.3 show the evolution of the simulation over time. Finally, Figure A.4 provides a better, unobstructed insight into the underlying block partitioning of the simulation domain.

Figure A.1: State of the simulation at coarse time steps 10,000 and 180,000 (≙ end of the simulation).

127

A Appendix

Figure A.2: Evolution of the simulation introduced in Section 7.5 over time. The figures represent 2D slices through the simulation as viewed from the side. They show the state of the simulation during the build-up phase at coarse time steps 4,000, 6,000, 11,000, 20,000, 36,000, and 60,0000. The grid overlay shows the block partitioning, not the individual lattice cells. Each block consists of 34 × 34 × 34 cells. Moreover, the actual simulation extends slightly further to the left and a lot further to the right.

128

A.1 Simulation of a Vocal Fold Phantom Geometry

Figure A.3: Similar to Figure A.2, these figures represent 2D slices through the simulation introduced in Section 7.5 as viewed from the top. They show the state of the simulation during the build-up phase at coarse time steps 4,000, 6,000, 8,000, 12,000, 20,000, 28,000, 44,000, and 60,0000.

129

A Appendix

Figure A.4: View inside the block partitioning that corresponds to the simulation introduced in Section 7.5. A part of the domain is intentionally cut out of the visualization in order to enable an unobstructed view inside the simulation. The different colors of the blocks depict their process association. The color transitions are, however, too faint to differentiate between the 3,584 different processes in use. The three figures show the state of the simulation at coarse time steps 30,000, 90,000, and 180,000.

130

Acronyms AABB axis-aligned bounding box . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 AMR adaptive mesh refinement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 BFS breadth-first search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 CFD computational fluid dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 CPU central processing unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 D2Q9 two-dimensional lattice model with 9 directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 D3Q19 three-dimensional lattice model with 19 directions (corners missing) . . . . . . . . . 9 D3Q27 three-dimensional lattice model with 27 directions . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 HPC high performance computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 I/O input/output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 ID identifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 LBM lattice Boltzmann method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 MFLUP/S million fluid lattice cell updates per second . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 MFLUP/SC million fluid lattice cell updates per second and core. . . . . . . . . . . . . . . . . . . 94 MPI message passing interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 MRT multiple-relaxation-time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 NUMA non-uniform memory access . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 PDF particle distribution function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 SAMR block-structured adaptive mesh refinement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 SFC space filling curve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 SIMD single instruction/multiple data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 SRT single-relaxation-time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 TRT two-relaxation-time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

131

Bibliography [Ada15]

M. Adams, P. Colella, D. T. Graves, J. N. Johnson, H. S. Johansen, N. D. Keen, T. J. Ligocki, D. F. Martin, P. W. McCorquodale, D. Modiano, P. O. Schwartz, T. D. Sternberg, and B. Van Straalen. Chombo Software Package for AMR Applications – Design Document. Technical report, Lawrence Berkeley National Laboratory, 2015.

[Aid10]

C. K. Aidun and J. R. Clausen. Lattice-Boltzmann method for complex flows. Annual Review of Fluid Mechanics, 42:439–472, 2010.

[Amm14]

R. Ammer, M. Markl, U. Ljungblad, C. Körner, and U. Rüde. Simulating fast electron beam melting with a parallel thermal free surface lattice Boltzmann method. Computers & Mathematics with Applications, 67(2):318–330, 2014.

[And05]

S. E. Anderson. Bit Twiddling Hacks. https://graphics.stanford.edu/~seander/bithacks.html. 2018-01-31.

[And14]

D. Anderl, M. Bauer, C. Rauh, U. Rüde, and A. Delgado. Numerical simulation of adsorption and bubble interaction in protein foams using a lattice Boltzmann method. Food & Function, 5(4):755–763, 2014.

[Bai09]

P. Bailey, J. Myre, S.D.C. Walsh, D.J. Lilja, and M.O. Saar. Accelerating Lattice Boltzmann Fluid Flow Simulations Using Graphics Processors. In Proceedings of the International Conference on Parallel Processing, ICPP ’09, pages 550–557, 2009.

[Bar15]

D. Bartuschat and U. Rüde. Parallel Multiphysics Simulations of Charged Particles in Microfluidic Flows. Journal of Computational Science, 8(0):1–19, 2015.

[Bau15a]

M. Bauer, J. Hötzer, M. Jainta, P. Steinmetz, M. Berghoff, F. Schornbaum, C. Godenschwager, H. Köstler, B. Nestler, and U. Rüde. Massively Parallel Phase-field Simulations for Ternary Eutectic Directional Solidification. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’15, pages 8:1–8:12. ACM, 2015.

[Bau15b]

M. Bauer, F. Schornbaum, C. Godenschwager, M. Markl, D. Anderl, H. Köstler, and U. Rüde. A Python extension for the massively parallel multiphysics simulation framework waLBerla. International Journal of Parallel, Emergent and Distributed Systems, 31(6):529–542, 2015.

133

Bibliography [Bec09]

S. Becker, S. Kniesburges, S. Müller, A. Delgado, G. Link, M. Kaltenbacher, and M. Döllinger. Flow-structure-acoustic interaction in a human voice model. The Journal of the Acoustical Society of America, 125(3):1351–1361, 2009.

[Bha54]

P. L. Bhatnagar, E. P. Gross, and M. Krook. A Model for Collision Processes in Gases. I. Small Amplitude Processes in Charged and Neutral One-Component Systems. Physical Review, 94:511–525, 1954.

[Bla13]

W. Bland, A. Bouteiller, T. Herault, G. Bosilca, and J. Dongarra. Post-failure recovery of MPI communication capability: Design and rationale. International Journal of High Performance Computing Applications, 27(3):244– 254, 2013.

[Bog15]

S. Bogner, S. Mohanty, and U. Rüde. Drag correlation for dilute and moderately dense fluid-particle systems using the lattice Boltzmann method. International Journal of Multiphase Flow, 68(0):71–79, 2015.

[Boi90]

J. E. Boillat. Load balancing and Poisson equation in a graph. Concurrency: Practice and Experience, 2(4):289–313, 1990.

[Bom12]

E. G. Boman, U. V. Catalyurek, C. Chevalier, and K. D. Devine. The Zoltan and Isorropia Parallel Toolkits for Combinatorial Scientific Computing: Partitioning, Ordering, and Coloring. Scientific Programming, 20(2):129–150, 2012.

[Boost]

Boost C++ libraries. http://www.boost.org. 2018-01-31.

[Bor12]

J. Bordner and M. L. Norman. Enzo-P / Cello: Scalable Adaptive Mesh Refinement for Astrophysics and Cosmology. In Proceedings of the Extreme Scaling Workshop, BW-XSEDE ’12, pages 4:1–4:11. University of Illinois at Urbana-Champaign, 2012.

[Box]

BoxLib. https://ccse.lbl.gov/BoxLib/. 2017-07-31.

[Bry14]

G. L. Bryan, M. L. Norman, B. W. O’Shea, T. Abel, J. H. Wise, M. J. Turk, D. R. Reynolds, D. C. Collins, P. Wang, S. W. Skillman, B. Smith, R. P. Harkness, J. Bordner, J. Kim, M. Kuhlen, H. Xu, N. Goldbaum, C. Hummels, A. G. Kritsuk, E. Tasker, S. Skory, C. M. Simpson, O. Hahn, J. S. Oishi, G. C. So, F. Zhao, R. Cen, and Y. Li. ENZO: An Adaptive Mesh Refinement Code for Astrophysics. The Astrophysical Journal Supplement Series, 211(2):1–52, 2014.

[Bun10]

H.-J. Bungartz, M. Mehl, T. Neckel, and T. Weinzierl. The PDE framework Peano applied to fluid dynamics: an efficient implementation of a parallel multiscale fluid dynamics solver on octree-like adaptive Cartesian grids. Computational Mechanics, 46(1):103–114, 2010.

134

Bibliography [Bur11]

C. Burstedde, L. C. Wilcox, and O. Ghattas. p4est: Scalable Algorithms for Parallel Adaptive Mesh Refinement on Forests of Octrees. SIAM Journal on Scientific Computing, 33(3):1103–1133, 2011.

[Bur14]

C. Burstedde, D. Calhoun, K. Mandli, and A. R. Terrel. ForestClaw: Hybrid forest-of-octrees AMR for hyperbolic conservation laws. Advances in Parallel Computing, 25:253–262, 2014.

[Cactus]

Cactus. http://cactuscode.org/. 2018-01-31.

[Cam03]

P. M. Campbell, K. D. Devine, J. E. Flaherty, L. G. Gervasio, and J. D. Teresco. Dynamic Octree Load Balancing Using Space-Filling Curves. Technical Report CS-03-01, Williams College Department of Computer Science, 2003.

[Carpet]

Carpet. https://carpetcode.org/. 2018-01-31.

[Che98]

S. Chen and G. D. Doolen. Lattice Boltzmann method for fluid flows. Annual Review of Fluid Mechanics, 30(1):329–364, 1998.

[Che06]

H. Chen, O. Filippova, J. Hoch, K. Molvig, R. Shock, C. Teixeira, and R. Zhang. Grid refinement in lattice Boltzmann methods based on volumetric formulation. Physica A: Statistical Mechanics and its Applications, 362(1):158–167, 2006.

[Che08]

C. Chevalier and F. Pellegrini. PT-Scotch: A tool for efficient parallel graph ordering. Parallel Computing, 34(6):318–331, 2008.

[Cyb89]

G. Cybenko. Dynamic load balancing for distributed memory multiprocessors. Journal of Parallel and Distributed Computing, 7(2):279–301, 1989.

[Des01]

J.-C. Desplat, I. Pagonabarraga, and P. Bladon. LUDWIG: A parallel Lattice-Boltzmann code for complex fluids. Computer Physics Communications, 134(3):273–290, 2001.

[d’H94]

D. d’Humières. Generalized Lattice-Boltzmann Equations. Rarefied Gas Dynamics: Theory and Simulations, pages 450–458, 1994.

[d’H02]

D. d’Humières. Multiple-relaxation-time lattice Boltzmann models in three dimensions. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 360(1792):437–451, 2002.

[Don11]

S. Donath. Wetting Models for a Parallel High-Performance Free Surface Lattice Boltzmann Method. PhD thesis, FAU Erlangen-Nürnberg, 2011.

[Dub09]

A. Dubey, K. Antypas, M. K. Ganapathy, L. B. Reid, K. Riley, D. Sheeler, A. Siegel, and K. Weide. Extensible component-based architecture for FLASH, a massively parallel, multiphysics simulation code. Parallel Computing, 35(10–11):512–522, 2009.

135

Bibliography [Dub14]

A. Dubey, A. Almgren, J. Bell, M. Berzins, S. Brandt, G. Bryan, P. Colella, D. Graves, M. Lijewski, F. Löffler, B. O’Shea, E. Schnetter, B. Van Straalen, and K. Weide. A survey of high level frameworks in block-structured adaptive mesh refinement packages. Journal of Parallel and Distributed Computing, 74(12):3217–3227, 2014.

[Enzo]

Enzo. http://enzo-project.org/. 2018-01-31.

[Enzop]

Enzo-P/Cello. http://client64-249.sdsc.edu/cello/. 2018-01-31.

[Fak14]

A. Fakhari and T. Lee. Finite-difference lattice Boltzmann method with a block-structured adaptive-mesh-refinement technique. Physical Review E, 89(3):033310, 2014.

[Fak16]

A. Fakhari, M. Geier, and T. Lee. A mass-conserving lattice Boltzmann method with dynamic grid refinement for immiscible two-phase flows. Journal of Computational Physics, 315:434–457, 2016.

[Fei09]

C. Feichtinger, J. Götz, S. Donath, K. Iglberger, and U. Rüde. Parallel Computing: Numerics, Applications, and Trends, chapter WaLBerla: Exploiting Massively Parallel Systems for Lattice Boltzmann Simulations, pages 241–260. Springer London, 2009.

[Fei11]

C. Feichtinger, S. Donath, H. Köstler, J. Götz, and U. Rüde. WaLBerla: HPC software design for computational engineering simulations. Journal of Computational Science, 2(2):105–112, 2011.

[Fei12]

C. Feichtinger. Design and Performance Evaluation of a Software Framework for Multi-Physics Simulations on Heterogeneous Supercomputers. PhD thesis, FAU Erlangen-Nürnberg, 2012.

[Fie12]

J. Fietz, M. J. Krause, C. Schulz, P. Sanders, and V. Heuveline. Euro-Par 2012 Parallel Processing, chapter Optimized Hybrid Parallel Lattice Boltzmann Fluid Flow Simulations on Complex Geometries, pages 818–829. Springer Berlin Heidelberg, 2012.

[Fil98]

O. Filippova and D. Hänel. Grid Refinement for Lattice-BGK Models. Journal of Computational Physics, 147(1):219–228, 1998.

[Flash]

FLASH. http://flash.uchicago.edu/site/flashcode/. 2018-01-31.

[Fre08]

S. Freudiger, J. Hegewald, and M. Krafczyk. A parallelisation concept for a multi-physics lattice Boltzmann prototype based on hierarchical grids. Progress in Computational Fluid Dynamics, 8(1–4):168–178, 2008.

[Gei15]

M. Geier, M. Schönherr, A. Pasquali, and M. Krafczyk. The cumulant lattice Boltzmann equation in three dimensions: Theory and validation. Computers & Mathematics with Applications, 70(4):507–547, 2015.

136

Bibliography [Gin08a]

I. Ginzburg, F. Verhaeghe, and D. d’Humières. Study of simple hydrodynamic solutions with the two-relaxation-times lattice Boltzmann scheme. Communications in Computational Physics, 3(3):519–581, 2008.

[Gin08b]

I. Ginzburg, F. Verhaeghe, and D. d’Humières. Two-relaxation-time Lattice Boltzmann scheme: About parametrization, velocity, pressure and mixed boundary conditions. Communications in Computational Physics, 3(2):427–478, 2008.

[God13]

C. Godenschwager, F. Schornbaum, M. Bauer, H. Köstler, and U. Rüde. A Framework for Hybrid Parallel Flow Simulations with a Trillion Cells in Complex Geometries. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC ’13, pages 35:1–35:12. ACM, 2013.

[Goo03]

T. Goodale, G. Allen, G. Lanfermann, J. Massó, T. Radke, E. Seidel, and J. Shalf. The Cactus Framework and Toolkit: Design and Applications. In Vector and Parallel Processing – VECPAR’2002, 5th International Conference, Lecture Notes in Computer Science. Springer, 2003.

[Göt10]

J. Götz, K. Iglberger, M. Stürmer, and U. Rüde. Direct numerical simulation of particulate flows on 294912 processor cores. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–11. IEEE Computer Society, 2010.

[Göt12]

J. Götz. Massively Parallel Direct Numerical Simulation of Particulate Flows. PhD thesis, FAU Erlangen-Nürnberg, 2012.

[Gro11]

D. Groen, O. Henrich, F. Janoschek, P. Coveney, and J. Harting. Lattice-Boltzmann Methods in Fluid Dynamics: Turbulence and Complex Colloidal Fluids. In Juelich Blue Gene/P Extreme Scaling Workshop 2011. Juelich Supercomputing Centre, 2011.

[Gro13]

D. Groen, J. Hetherington, H. B. Carver, R. W. Nash, M. O. Bernabeu, and P. V. Coveney. Analysing and modelling the performance of the HemeLB lattice-Boltzmann simulation environment. Journal of Computational Science, 4(5):412–422, 2013.

[Guz14]

S. M. Guzik, T. H. Weisgraber, P. Colella, and B. J. Alder. Interpolation methods and the accuracy of lattice-Boltzmann mesh refinement. Journal of Computational Physics, 259(0):461–487, 2014.

[Hag16]

G. Hager, J. Treibig, J. Habich, and G. Wellein. Exploring performance and power properties of modern multi-core chips via simple machine models. Concurrency and Computation: Practice and Experience, 28(2):189–210, 2016.

[Has14a]

M. Hasert. Multi-scale Lattice Boltzmann simulations on distributed octrees. PhD thesis, RWTH Aachen, 2014.

137

Bibliography [Has14b]

M. Hasert, K. Masilamani, S. Zimny, H. Klimach, J. Qi, J. Bernsdorf, and S. Roller. Complex fluid simulations with the parallel tree-based Lattice Boltzmann solver Musubi. Journal of Computational Science, 5(5):784–794, 2014.

[He97a]

X. He and L.-S. Luo. Lattice Boltzmann Model for the Incompressible Navier-Stokes Equation. Journal of Statistical Physics, 88(3):927–944, 1997.

[He97b]

X. He and L.-S. Luo. Theory of the lattice Boltzmann method: From the Boltzmann equation to the lattice Boltzmann equation. Physical Review E, 56(6):6811–6817, 1997.

[Heu07]

V. Heuveline and J. Latt. The OpenLB Project: An Open Source and Object Oriented Implementation of Lattice Boltzmann Methods. International Journal of Modern Physics C, 18(04):627–634, 2007.

[Heu09]

V. Heuveline, M. J. Krause, and J. Latt. Towards a hybrid parallelization of lattice Boltzmann methods. Computers & Mathematics with Applications, 58(5):1071–1080, 2009.

[Hil91]

D. Hilbert. Ueber die stetige Abbildung einer Linie auf ein Flächenstück. Mathematische Annalen, 38:459–460, 1891.

[Höt16]

J. Hötzer, P. Steinmetz, Ma. Jainta, S. Schulz, M. Kellner, B. Nestler, A. Genau, A. Dennstedt, M. Bauer, H. Köstler, and U. Rüde. Phase-field simulations of spiral growth during directional ternary eutectic solidification. Acta Materialia, 106:249–259, 2016.

[Igl10]

K. Iglberger. Software Design of a Massively Parallel Rigid Body Framework. PhD thesis, FAU Erlangen-Nürnberg, 2010.

[Kal09]

L. V. Kale and G. Zheng. Advanced Computational Infrastructures for Parallel and Distributed Adaptive Applications, chapter Charm++ and AMPI: Adaptive Runtime Strategies via Migratable Objects, pages 265–282. John Wiley & Sons, Inc., 2009.

[Kan00]

D. Kandhai, W. Soll, S. Chen, A. Hoekstra, and P. Sloot. FiniteDifference Lattice-BGK methods on nested grids. Computer Physics Communications, 129(1–3):100–109, 2000.

[Kar99]

G. Karypis and V. Kumar. A Fast and Highly Quality Multilevel Scheme for Partitioning Irregular Graphs. SIAM Journal on Scientific Computing, 20(1):359–392, 1999.

[Koh17]

N. Kohl, J. Hötzer, F. Schornbaum, M. Bauer, C. Godenschwager, H. Köstler, B. Nestler, and U. Rüde. A Scalable and Extensible Checkpointing Scheme for Massively Parallel Simulations. arXiv:1708.08286 [cs.DC], 2017.

138

Bibliography [Koh18]

N. Kohl, J. Hötzer, F. Schornbaum, M. Bauer, C. Godenschwager, H. Köstler, B. Nestler, and U. Rüde. A Scalable and Extensible Checkpointing Scheme for Massively Parallel Simulations. The International Journal of High Performance Computing Applications, 2018.

[Kra11]

M. J. Krause, T. Gengenbach, and V. Heuveline. Hybrid Parallel Simulations of Fluid Flows in Complex Geometries: Application to the Human Lungs. In Euro-Par 2010 Parallel Processing Workshops, volume 6586 of Lecture Notes in Computer Science, pages 209–216. Springer Berlin Heidelberg, 2011.

[Kur16]

M. Kuron, G. Rempfer, F. Schornbaum, M. Bauer, C. Godenschwager, C. Holm, and J. de Graaf. Moving charged particles in lattice Boltzmann-based electrokinetics. The Journal of Chemical Physics, 145(21):214102, 2016.

[Lag12]

D. Lagrava, O. Malaspinas, J. Latt, and B. Chopard. Advances in multi-domain lattice Boltzmann grid refinement. Journal of Computational Physics, 231(14):4808–4822, 2012.

[Lah16]

M. Lahnert, C. Burstedde, C. Holm, M. Mehl, G. Rempfer, and F. Weik. Towards Lattice-Boltzmann on Dynamically Adaptive Grids – Minimally-Invasive Grid Exchange in ESPResSo. In Proceedings of the ECCOMAS Congress 2016, VII European Congress on Computational Methods in Applied Sciences and Engineering, pages 1–25, 2016.

[Lat08]

J. Latt. Choice of units in lattice Boltzmann simulations. Technical report, LBMethod.org, 2008.

[LB3D]

LB3D. http://ccs.chem.ucl.ac.uk/lb3d. 2018-01-31.

[Lui11]

J. Luitjens. The scalability of parallel adaptive mesh refinement within Uintah. PhD thesis, The University of Utah, 2011.

[Luo98]

L.-S. Luo. Unified Theory of Lattice Boltzmann Models for Nonideal Gases. Physical Review Letters, 81(8):1618–1621, 1998.

[Luo11]

L.-S. Luo, W. Liao, X. Chen, Y. Peng, and W. Zhang. Numerics of the lattice Boltzmann method: Effects of collision models on the lattice Boltzmann simulations. Physical Review E, 83(5):056710, 2011.

[Mac00]

P. MacNeice, K. M. Olson, C. Mobarry, R. de Fainchtein, and C. Packer. PARAMESH: A parallel adaptive mesh refinement community toolkit. Computer Physics Communications, 126(3):330–354, 2000.

[Mar96]

R. C. Martin. The Open-Closed Principle. C++ Report, 1996.

[Meh11]

M. Mehl, T. Neckel, and P. Neumann. Navier-Stokes and LatticeBoltzmann on octree-like grids in the Peano framework. International Journal for Numerical Methods in Fluids, 65(1–3):67–86, 2011.

139

Bibliography [Mey88]

B. Meyer. Object-Oriented Software Construction. Prentice Hall, 1988.

[Moh10]

A. A. Mohamad and A. Kuzmin. A critical evaluation of force term in lattice Boltzmann method, natural convection problem. International Journal of Heat and Mass Transfer, 53(5–6):990–996, 2010.

[Mor66]

G. M. Morton. A Computer Oriented Geodetic Data Base; and a New Technique in File Sequencing. Technical report, IBM Ltd., 1966.

[Ner11]

C. Neri. Twisting the RTTI System for Safe Dynamic Casts of void* in C++. http://www.drdobbs.com/229401004. 2018-01-31.

[Neu13]

P. Neumann and T. Neckel. A dynamic mesh refinement technique for Lattice Boltzmann simulations on octree-like grids. Computational Mechanics, 51(2):237–253, 2013.

[OpenLB] OpenLB. http://optilb.org/openlb/. 2018-01-31. [Palabos]

Palabos. http://www.palabos.org/. 2018-01-31.

[Param]

PARAMESH. http://opensource.gsfc.nasa.gov/projects/paramesh/index.php. 2018-01-31.

[ParME]

ParMETIS. http://glaros.dtc.umn.edu/gkhome/metis/parmetis/overview. 2018-01-31.

[Par06]

S. G. Parker. A component-based architecture for parallel multi-physics PDE simulation. Future Generation Computer Systems, 22(1–2):204–216, 2006.

[Pen06]

Y. Peng, C. Shu, Y.T. Chew, X.D. Niu, and X.Y. Lu. Application of multi-block approach in the immersed boundary-lattice Boltzmann method for viscous fluid flows. Journal of Computational Physics, 218(2):460–478, 2006.

[Pen16]

C. Peng, Y. Teng, B. Hwang, Z. Guo, and L.-P. Wang. Implementation issues and benchmarking of lattice Boltzmann method for moving rigid particle simulations in a viscous flow. Computers & Mathematics with Applications, 72(2):349–374, 2016.

[Pic14]

K. Pickl, M. Hofmann, T. Preclik, H. Köstler, A.-S. Smith, and U. Rüde. Parallel Simulations of Self-propelled Microorganisms. In Parallel Computing: Accelerating Computational Science and Engineering (CSE), volume 25 of Advances in Parallel Computing, pages 395–404. IOS Press, 2014.

[Pic16]

K. Pickl. Fully Resolved Simulation of Self-propelled Microorganisms on Supercomputers. PhD thesis, FAU Erlangen-Nürnberg, 2016.

[Poh08]

T. Pohl. High Performance Simulation of Free Surface Flows Using the LatticeBoltzmann Method. PhD thesis, FAU Erlangen-Nürnberg, 2008.

140

Bibliography [Pre14]

T. Preclik. Models and Algorithms for Ultrascale Simulations of Non-smooth Granular Dynamics. PhD thesis, FAU Erlangen-Nürnberg, 2014.

[Pre15]

T. Preclik and U. Rüde. Ultrascale simulations of non-smooth granular dynamics. Computational Particle Mechanics, 2(2):173–196, 2015.

[PTSco]

PT-Scotch. http://www.labri.fr/perso/pelegrin/scotch/. 2018-01-31.

[Qi16]

J. Qi, H. Klimach, and S. Roller. Implementation of the compact interpolation within the octree based Lattice Boltzmann solver Musubi. Computers & Mathematics with Applications, 2016.

[Qia92]

Y. H. Qian, D. D’Humières, and P. Lallemand. Lattice BGK Models for Navier-Stokes Equation. EPL (Europhysics Letters), 17(6):479–484, 1992.

[Ran13]

A. Randles. Modeling Cardiovascular Hemodynamics Using the Lattice Boltzmann Method on Massively Parallel Supercomputers. PhD thesis, Harvard University, 2013.

[Ret17]

C. Rettinger, C. Godenschwager, S. Eibl, T. Preclik, T. Schruff, R. Frings, and U. Rüde. Fully Resolved Simulations of Dune Formation in Riverbeds. In Proceedings of the International Supercomputing Conference, ISC 2017, pages 3–21. Springer International Publishing, 2017.

[Roh06]

M. Rohde, D. Kandhai, J. J. Derksen, and H. E. A. van den Akker. A generic, mass conservative local grid refinement technique for lattice-Boltzmann schemes. International Journal for Numerical Methods in Fluids, 51(4):439–468, 2006.

[Rol12]

S. Roller, J. Bernsdorf, H. Klimach, M. Hasert, D. Harlacher, M. Cakircali, S. Zimny, K. Masilamani, L. Didinger, and J. Zudrop. High Performance Computing on Vector Systems 2011, chapter An Adaptable Simulation Framework Based on a Linearized Octree, pages 93–105. Springer Berlin Heidelberg, 2012.

[Sch02]

K. Schloegel, G. Karypis, and V. Kumar. Parallel static and dynamic multi-constraint graph partitioning. Concurrency and Computation: Practice and Experience, 14(3):219–240, 2002.

[Sch04]

E. Schnetter, S. H. Hawley, and I. Hawke. Evolutions in 3D numerical relativity using fixed mesh refinement. Classical and Quantum Gravity, 21(6):1465–1488, 2004.

[Sch11]

M. Schönherr, K. Kucher, M. Geier, M. Stiebler, S. Freudiger, and M. Krafczyk. Multi-thread implementations of the lattice Boltzmann method on non-uniform grids for CPUs and GPUs. Computers & Mathematics with Applications, 61(12):3730–3743, 2011.

141

Bibliography [Sch14]

T. Schruff, F. Schornbaum, C. Godenschwager, U. Rüde, R. M. Frings, and H. Schüttrumpf. Numerical Simulation Of Pore Fluid Flow And Fine Sediment Infiltration Into The Riverbed. In Proceedings of the International Conference on Hydroinformatics, HIC 2014, 2014.

[Sch15]

F. Schornbaum and U. Rüde. Massively Parallel Algorithms for the Lattice Boltzmann Method on Non-uniform Grids. arXiv:1508.07982 [cs.DC], 2015.

[Sch16]

F. Schornbaum and U. Rüde. Massively Parallel Algorithms for the Lattice Boltzmann Method on Nonuniform Grids. SIAM Journal on Scientific Computing, 38(2):C96–C126, 2016.

[Sch17]

F. Schornbaum and U. Rüde. Extreme-Scale Block-Structured Adaptive Mesh Refinement. arXiv:1704.06829 [cs.DC], 2017.

[Sch18]

F. Schornbaum and U. Rüde. Extreme-Scale Block-Structured Adaptive Mesh Refinement. SIAM Journal on Scientific Computing, 40(3):C358–C387, 2018.

[Töl06]

J. Tölke, S. Freudiger, and M. Krafczyk. An adaptive scheme using hierarchical grids for lattice Boltzmann multi-phase flow simulations. Computers & Fluids, 35(8–9):820–830, 2006.

[Uintah]

Uintah. http://uintah.utah.edu/. 2018-01-31.

[Yu02]

D. Yu, R. Mei, and W. Shyy. A multi-block lattice Boltzmann method for viscous fluid flows. International Journal for Numerical Methods in Fluids, 39(2):99–120, 2002.

[Yu09]

Z. Yu and L.-S. Fan. An interaction potential based lattice Boltzmann method with adaptive mesh refinement (AMR) for two-phase flow simulation. Journal of Computational Physics, 228(17):6456–6478, 2009.

[Zhe12]

G. Zheng, X. Ni, and L. V. Kalé. A scalable double in-memory checkpoint and restart scheme towards exascale. In Proceedings of IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012), pages 1–6, 2012.

[Zoltan]

Zoltan. http://www.cs.sandia.gov/Zoltan. 2018-01-31.

142