International Journal of Computational Fluid Dynamics

ISSN: 1061-8562 (Print) 1029-0257 (Online) Journal homepage: http://www.tandfonline.com/loi/gcfd20

Adaptive mesh refinement and load balancing based on multi-level block-structured Cartesian mesh Takashi Misaka, Daisuke Sasaki & Shigeru Obayashi To cite this article: Takashi Misaka, Daisuke Sasaki & Shigeru Obayashi (2017): Adaptive mesh refinement and load balancing based on multi-level block-structured Cartesian mesh, International Journal of Computational Fluid Dynamics, DOI: 10.1080/10618562.2017.1390085 To link to this article: http://dx.doi.org/10.1080/10618562.2017.1390085

Published online: 12 Nov 2017.

Submit your article to this journal

View related articles

View Crossmark data

Full Terms & Conditions of access and use can be found at http://www.tandfonline.com/action/journalInformation?journalCode=gcfd20 Download by: [Tohoku University]

Date: 12 November 2017, At: 22:40

INTERNATIONAL JOURNAL OF COMPUTATIONAL FLUID DYNAMICS, https://doi.org/./..

Adaptive mesh refinement and load balancing based on multi-level block-structured Cartesian mesh Takashi Misakaa , Daisuke Sasakib and Shigeru Obayashic

Downloaded by [Tohoku University] at 22:40 12 November 2017

a Frontier Research Institute for Interdisciplinary Sciences, Tohoku University, Sendai, Japan; b Kanazawa Institute of Technology, Hakusan, Ishikawa, Japan; c Institute of Fluid Science, Tohoku University, Sendai, Japan

ABSTRACT

ARTICLE HISTORY

We developed a framework for a distributed-memory parallel computer that enables dynamic data management for adaptive mesh refinement and load balancing. We employed simple data structure of the building cube method (BCM) where a computational domain is divided into multi-level cubic domains and each cube has the same number of grid points inside, realising a multi-level blockstructured Cartesian mesh. Solution adaptive mesh refinement, which works efficiently with the help of the dynamic load balancing, was implemented by dividing cubes based on mesh refinement criteria. The framework was investigated with the Laplace equation in terms of adaptive mesh refinement, load balancing and the parallel efficiency. It was then applied to the incompressible Navier– Stokes equations to simulate a turbulent flow around a sphere. We considered wall-adaptive cube refinement where a non-dimensional wall distance y+ near the sphere is used for a criterion of mesh refinement. The result showed the load imbalance due to y+ adaptive mesh refinement was corrected by the present approach. To utilise the BCM framework more effectively, we also tested a cube-wise algorithm switching where an explicit and implicit time integration schemes are switched depending on the local Courant-Friedrichs-Lewy (CFL) condition in each cube.

Received May Accepted September

1. Introduction Resolving flow features with appropriate mesh resolution is critically important especially for unsteady computational fluid dynamics (CFD) simulation such as large-eddy simulation (LES). Adaptive mesh refinement (AMR) is one of most effective techniques to adapt mesh resolution to local flow features such as shock, vortex, and shear layer (Berger and Colella 1989). The key elements of Cartesian mesh-based AMR are (1) data structure, (2) parallelisation, and (3) refinement criteria. The points (1) and (2) are deeply related to the code design; therefore, the development of an AMR code involves tough coding work and time-consuming test runs. Due to the paradigm of computing clusters with distributedmemory multi-core processors, one have to consider the parallelisation between computing nodes and within a node, which further complicates the development of an AMR code. To alleviate this, open AMR libraries such as Chombo (Adams et al. 2015), SAMURAI (Gunney et al. 2016) and so on have been developed. However, it is not always easy to fully use those libraries especially when a code needs to be modified for our own purpose including the change of numerical schemes, inter-grid data transfer, and parallelisation strategy. CONTACT Takashi Misaka

[email protected]

© Informa UK Limited, trading as Taylor & Francis Group

KEYWORDS

Adaptive mesh reﬁnement; load balancing; block-structured Cartesian mesh; large-eddy simulation

Cartesian mesh-based CFD simulation is recently gaining attentions due to its capabilities of fast mesh generation around complex geometries and efficient solution procedures. The building cube method (BCM) is one of Cartesian mesh-based approaches that employ a simple block-structured mesh called ‘cube’ (Nakahashi 2005). A flow field is divided by an assemblage of multi-level cuboids, where those cubes join with the size ratio of 1/2, 1 or 2. Each cube has the same number of mesh points so that the local resolution is determined by the cube size. The cubes are generated by the geometry-adaptive refinement (Ishida, Takahashi, and Nakahashi 2008), which reduces the total number of mesh points compared to a uniform Cartesian mesh. Intensive developments have been conducted by Nakahashi et al., which results in several flow solvers based on compressible/incompressible Navier–Stokes equations, Euler equations, and linearised Euler equations for aeroacoustic analysis (Fukushima et al. 2016; Nakahashi 2005; Sakai et al. 2013; Su et al. 2012). The BCM is also effective for large-scale computing as it is demonstrated by Onishi et al. (2014). Tools have also been developed such as a fast mesh generator (Ishida, Takahashi, and Nakahashi 2008) and data compression techniques for post-processing (Sakai et al. 2013). There are many approaches to handle curved wall

Downloaded by [Tohoku University] at 22:40 12 November 2017

2

T. MISAKA ET AL.

surface in a Cartesian mesh. Focusing on techniques used in the BCM, efforts are put into a staircase representation, low Reynolds number immersed boundary method (IBM), the IBM combined with a logarithmic wall model, and a coupling of a grid-less solver (Su, Yamamoto, and Nakahashi 2012) or an unstructured mesh solver. The simplicity of the BCM data structure is expected to be effective for the dynamic data management required for the AMR because an AMR code tends to become complex when it is applied to three-dimensional complex geometries. In addition, the AMR should be used along with the dynamic load balancing on a distributedmemory machine such that the number of mesh points assigned to each MPI rank can be varied during the AMR simulation, which can be implemented easily using cubes of the BCM. Our motivation is to develop an AMR code for unsteady flow simulation such as LES or direct numerical simulation (DNS). For this purpose, we developed a framework for a distributed-memory parallel computer that enables dynamic data management for AMR and dynamic load balancing. We employ an approach of the BCM as a basic data structure. We added a capability of solution AMR that may work efficiently with the help of the dynamic load balancing. The code is applied to the Laplace equation to investigate the effect of AMR and load balancing. We then apply it to an incompressible flow field, where the wall-adaptive cube refinement is conducted using a non-dimensional wall distance y+ near an object as a criterion of the mesh refinement. The simple data structure of the BCM enables the mixed implementation of different numerical schemes with various fidelities, where the load balancing would work efficiently in such situation. This paper is organised as follows, Section 2 describes the approach for the dynamic cube management based on the BCM. It is demonstrated using the Laplace equation in Section 3. An example of the incompressible Navier–Stokes equations is shown in Section 4. Section 5 concludes the paper.

2. Dynamic cube management 2.1. Data structure in building cube method The data management strategies of the BCM are documented in the original paper (Nakahashi 2005). It is explained briefly to make this paper self-consistent. The BCM basically employs an equally-spaced Cartesian mesh for the simplicities in the mesh generation, in high-order schemes and in the post-processing. To adapt such equally-spaced mesh to complex geometries and local flow scales, the BCM employs multi-level sub-mesh called ‘cube’ so that the local mesh size is determined by

Figure . An example of ‘cubes’and mesh cells of the building cube method.

the size of the cube. Figure 1 shows an example of a BCM mesh, where each cube is discretised by 163 mesh points. With the BCM approach, it is capable to (1) treat complex geometry, (2) use locally dense mesh depending on the local flow scales, (3) implement a high-order scheme easily, (4) parallelise the flow solver, and (5) handle huge data in the post-processing. The present study is mainly focusing on the fourth point, where the load-balance in parallel computations and the solution AMR are realised with the help of the simple data structure of the BCM. Because the computation in each cube is independent, cube-level parallelisation and mesh refinement become simple as mentioned above. Another feature to be mentioned is that the data compression in pre-/post-processing because the total number of mesh points tends to become very large if a complex geometry is handled by a Cartesian mesh. The BCM mesh only keeps the following information: (1) number of cubes, (2) size and coordinates of the front-lower-left corner for each cube, (3) neighbouring-cube information for each cube, (4) number of division of a cube, and (5) cell in–out information. The data (5) may become large because it is flag information for each mesh point, while other data are for each cube, i.e. the size is small. This cell in–out information is compressed by the runlength technique (Nakahashi 2005). The data compression in post-processing has also been proposed to reduce the file output during the simulation (Sakai et al. 2013). 2.2. Cube management for dynamic load balancing To realise dynamic cube management, we employ a space-filling curve, especially, the Peano–Hilbert curve, which is often used for ordering scattered data. The use of a space-filling curve in the unstructured Cartesian CFD solver is well documented in Aftosmis, Berger, and Murman (2004). Here, we simply apply this technique for cubes of the BCM. Figure 2 shows an

INTERNATIONAL JOURNAL OF COMPUTATIONAL FLUID DYNAMICS

3

Figure . Schematic of the load balancing based on wall-clock time.

Downloaded by [Tohoku University] at 22:40 12 November 2017

Figure . The Peano–Hilbert space-ﬁlling curve for cube reﬁnement and load balancing.

example of the Peano–Hilbert space-filling curve, where two-dimensional multi-level cubes are mapped into one-dimensional ordered information. A good point of this space-filling curve is that the domain decomposition for parallel computation can be done by simply locate dividers along the one-dimensional ordered information. The decomposed domain in the original two-dimensional space has nice property, i.e. each sub-domain is sufficiently agglomerated realising relatively small cut surface (Aftosmis, Berger, and Murman 2004). Figure 2 also shows how cube refinement is conducted in the case of three MPI ranks. Once a certain cube is divided into four small cubes (eight small cubes in three dimensions) for mesh refinement, a number of cubes assigned to a certain MPI rank increases. This change can be naturally handled by using a space filling curve. After the cube refinement, the load balancing is done to level the number of cubes assigned to each MPI rank. In this study, we consider the dynamic load balancing using the space-filling curve as mentioned above. That is, the dividers for separating each MPI rank are set simply based on the wall-clock time elapsed in each MPI rank. The AMR and the IBM for solid wall boundaries cause the load imbalance between MPI ranks. We count the wall-clock time for calculation and one-to-one MPI data transfer on each MPI rank. Suppose that the wall-clock time for each MPI rank denotes ti (i = 1, N ), where N is the total number of MPI ranks. We define the ratio of the wall-clock time as follows:

Figure . A computation domain and a series solution of the Laplace equation on a cross-sectional slice.

where ri is the inverse of ti normalised by the sum of them. Using ri , the position of the divider eci along the spacefilling curve is defined by i eci = ncube,all k=1 N

rk

j=1 r j

,

(2)

where ncube,all is the total number of cubes. The procedure of load balancing is schematically shown in Figure 3 in the case of N = 3. Before the load balancing, the measured wall-clock time in MPI rank 2 is smaller than that of MPI rank 1 and 3, while the number of cubes assigned in each MPI rank is the same. The position of dividers eci has the relationship; ec1 = ec2 − ec1 + 1 = ec3 − ec2 + 1 before the load balancing. For the load balancing, the divider positions are modified based on Equation (2) resulting in the increased number of cubes in MPI rank 2 as in Figure 3. By repeating this process, the wall-clock time on each MPI rank is sufficiently levelled. In incompressible flow simulations, the load balancing is carried out every a few hundred time steps.

2.3. Adaptive cube refinement 1/ti ri = N , j=1 1/t j

(1)

The AMR is conducted simply by dividing each cube into eight cubes in three dimensions (one cube into four cubes

4

T. MISAKA ET AL.

Downloaded by [Tohoku University] at 22:40 12 November 2017

Figure . The wall-clock time and the number of cubes with/without load balancing: (a) cubes, MPI ranks and (b) , cubes, MPI ranks.

in two dimensions). The solution AMR requires a criterion for selecting meshes to be refined, such as vorticity magnitude, entropy gradient and so on. We consider in this paper the gradient of a variable in the case with the Laplace equation and the non-dimensional wall distance y+ in the incompressible Navier–Stokes equations case, which will be described in the following sections. The adaptive cube refinement is conducted independently in each computing node; therefore, the number of cubes assigned to each node varies during the computation. The number of mesh points tends to increase especially near the solid wall in the wall-bounded flow simulation with a Cartesian mesh. The load imbalance due to the AMR and the IBM can be alleviated by the approach described in the previous section.

should be up to infinity as an analytical solution; however, we consider 50 terms here after the convergence study concerning the number of terms. Figure 4 shows the domain and the series solution on a cross-sectional slice. The Laplace equation is discretised by a secondorder central difference scheme and the resulting finite difference equations are solved by the successive-overrelaxation (SOR) iteration method to obtain a numerical solution. The numerical solution is then compared with the series solution, especially, by the integration of the difference between numerical and series solutions within each cube.

3. The Laplace equation in three dimensions

Figure 5 shows the wall-clock time with/without load balancing for (a) 512 cubes with 32 MPI ranks, and (b) 4,096 cubes with 64 MPI ranks. Here, the wallclock time for the SOR iteration and one-to-one communication using MPI_Get for halo data exchange are only counted. The load imbalance is artificially generated by varying the number of SOR iteration in each MPI rank using a random number generator: nmax = int(50.0∗ rand()+1.0), where nmax denotes the maximum number of iterations. MPI/OpenMP hybrid parallelisation was employed with a fixed number of 12 threads per MPI rank on the SGI ICE X system. As shown in Figure 5, the wall-clock time is deviated before load balancing, while the number of cubes is the same for all MPI ranks. After the load balancing, the deviation of the wall-clock time is reduced and the number of cube for each MPI rank is varied. The effect of load balancing is pronounced for the case where the number of cubes per MPI rank is large, i.e. 16 cubes per MPI rank in

3.1. Governing equations and the numerical methods We consider a field obtained by a solution of the threedimensional Laplace equation, i.e. ∇ 2 φs = 0. As for boundary conditions, a constant value φs (x, y, 1) = 1 is set on the top boundary, and φs (x, y, z) = 0 is set to zero on other boundaries. In this system, a series solution of the Laplace equation in a cubic domain can be obtained analytically as follows (which is used as a reference solution to validate a numerical solution): 50 16 sinh γmn z φs x, y, z = 2 sin mπ x sin nπ y, (3) π m,n=1 mn sinhγmn

√ where γmn = (mπ )2 +(nπ )2 . The cubic domain has sides of unit length for simplicity. The summation in Equation (3)

3.2. Load balancing

INTERNATIONAL JOURNAL OF COMPUTATIONAL FLUID DYNAMICS

5

arbitrary number of nodes were chosen for MPI parallelisation. 3.3. Adaptive cube refinement based on the gradient

Downloaded by [Tohoku University] at 22:40 12 November 2017

Figure . The wall-clock time with/without load balancing for different number of total MPI ranks.

Figure 5(a) and 64 cubes per MPI rank in Figure 5(b) before load balancing. In the present simulation, the minimum number of cubes in each MPI rank was set to 3; therefore, the freedom of moving the cubes between MPI ranks is limited if the number of cubes per MPI rank is small. Figure 6 shows the wall-clock time with and without load balancing versus the number of MPI ranks. The wall-clock time here includes the time for SOR iteration, MPI_Get, and MPI_Allreduce for collecting the residual. We used SGI ICE X and NEC LX406Re-2, where a MPI/OpenMP hybrid approach is employed with a fixed 12 threads per MPI rank. The result again appears such that the load balancing is more effective when the number of cube assigned to each MPI rank is large. The irregular scaling is observed because the load imbalance is artificially created using a random number as mentioned above. Table 1 shows the specification of computers used in this study. SGI ICE X and NEC LX406Re-2 are used for the solution of the Laplace equation as shown before, while SGI UV2000 is employed for the incompressible flow simulation described in Section 4. Because SGI UV2000 is a cluster with a large shared memory, an

Table . Speciﬁcation of computing resources. node (CPU, memory) NEC LXRe- Intel Xeon E-v ( core) × , GB SGI UV Intel Xeon E-v ( core) × , TB Intel Xeon E-v ( core) × , TB Intel Xeon E-v ( core) × , TB SGI ICE X Intel Xeon E-v ( core) × , GB Intel Xeon E-v ( core) × , GB Intel Xeon E-v ( core) × , GB

No. of nodes nodes unit units units nodes nodes nodes

Figure 7 shows the numerical solution, the difference between the numerical and series solutions, the derivative of numerical solution integrated in each cube, and MPI rank assign to each cube on a cross-sectional slice. Solid black lines show cube boundaries. In this configuration, the difference between numerical and series solutions becomes large near top corners as shown in Figure 7, and cubes near the corner should be refined by the AMR. Figure 8 shows the errors of the numerical solutions during mesh refinement with two different criteria, i.e. the maximum value of |∇φn | in each cube in Figure 8(b) and the cube-integrated |∇φn | in Figure 8(c), where |∇φn | denotes the magnitude of the gradient of φn . For the reference, the error of uniformly refined case is also plotted. The magnitude of ∇φn has large value near the top corners; therefore, cubes in the corners are continuously refined. It is shown from the plot that this approach is not efficient to reduce the error. On the other hand, the refinement index defined by the integral of the gradient within a cube efficiently reduces the error compared to the uniformly refined case and the case of the other refinement criterion. That is, the AMR realised the smaller error with the same number of mesh points (the smaller number of mesh points to achieve a certain value of the error). In Figure 8, the distributions of refined cubes by the two criteria are also shown where approximately 5000 cubes are used for calculation in both cases.

4. Incompressible Navier–Stokes simulation 4.1. Governing equations and the numerical methods For flow simulation, we employ the incompressible Navier–Stokes equations: ∂u j ∂u j uk 1 ∂ p ∂ + =− + (ν + νt ) 2S jk , ∂t ∂xk ρ0 ∂x j ∂xk ∂u j = 0, ∂x j

(4) (5)

where u j and p represent the velocity components in three spatial directions ( j, k = 1, 2, or 3), the pressure deviation from the reference state p = p0 + p , respectively. S jk = (∂u j /∂xk + ∂uk /∂x j ) /2 denotes the strain rate tensor. The summation convention is used for the

Downloaded by [Tohoku University] at 22:40 12 November 2017

6

T. MISAKA ET AL.

Figure . Distributions on a cross-sectional slice: (a) numerical solution φn , (b) the diﬀerence between numerical and series solutions, (c) the derivative of the numerical solution integrated within each cube and (d) MPI rank of each cube.

Figure . Error during adaptive mesh reﬁnement using diﬀerent mesh adaptation indices: (a) the maximum value of |∇φn | in each cube and (b) cube-integrated |∇φn |, where |∇φn | denotes the magnitude of the gradient of the numerical solution φ.

INTERNATIONAL JOURNAL OF COMPUTATIONAL FLUID DYNAMICS

Downloaded by [Tohoku University] at 22:40 12 November 2017

Figure . Wall treatment based on immersed boundary method.

velocity components u j . In this study, a typical value of air density is employed: ρ0 = 1.2 kg/m3 . The kinematic viscosity in Equation (4) is defined by the sum of molecular viscosity and eddy viscosity obtained by a subgrid-scale model (Kobayashi 2005). The above equations are discretised by the fullyconservative fourth-order central difference scheme (Morinishi 1998), where the staggered arrangement of velocity components is employed. Time integration is performed by the third-order low-storage Runge–Kutta scheme (Williamson 1980). It would be effective to switch numerical schemes in each cube based on the local Courant-Friedrichs-Lewy (CFL) condition, because the minimum mesh spacing becomes very small in AMR and the CFL condition limits the size of time step. The switching of numerical schemes and an IBM on a Cartesian grid cause the load imbalance between processors; therefore, the load balancing introduced in this study would be effective. In an explicit cube (CFL = 0.5). The velocity-pressure iterative solver of Hirt et al. (1972) is used for obtaining a pressure field. In the pressure solver, the magnitude of velocity divergence is used as a threshold to stop the iteration. Therefore, the velocity divergence is kept smaller than a pre-defined threshold during time integration. Information exchange between neighbouring cubes with different size requires some interpolations techniques, i.e. a tri-linear interpolation for small to large cubes and the second-order Lagrange interpolation for large to small cubes are employed in this study. The conservation of mass and momentum fluxes by the interpolation was investigated carefully by Manhart (2004). Because the interpolation coefficients can be calculated in advance, the computational cost of the inter-cube interpolation during time stepping is minimised.

4.2. Immersed boundary method There are several approaches to handle curved wall surface in a Cartesian mesh. In this study, we employ an IBM that use image points (Mittal 2005). Figure 9 schematically shows mesh cells and image points (IPs) located near the wall. Because we employ the staggered arrangement of velocity components, wall boundary conditions are applied to cell faces as shown in the figure. The

Figure . The inﬂuence of wall y+ on the velocity distribution near-wall boundary.

8

T. MISAKA ET AL.

Downloaded by [Tohoku University] at 22:40 12 November 2017

Figure . Cube distributions during adaptive mesh reﬁnement based on a non-dimensional wall distance y+ .

position of IPs is defined based on local mesh size, i.e. the IP height from the wall is 1.8 times larger value than local √ mesh spacing, such that the distance is larger than 3x. The size of cubes varies locally on the wall surface during the AMR; therefore, the height of IPs also changes depending on the local mesh spacing. In the case of non-slip wall boundary, we assume a linear velocity distribution from IP down to the wall. And the velocity value at the height of cell face centre is set to the velocity boundary condition of a staggered velocity component. Note that there exist rooms for improving the modelling of the velocity distribution between the IP and the wall, for example, a quadratic distribution and a distribution defined by a log-low model. Because the high-order spatial schemes require several mesh points to calculate the derivative of flow quantities, the cell face

boundary conditions are also applied to the cell inside an object (ghost cell faces). Again we use a linear distribution from the IPs down to the wall, i.e. the negative velocity value is applied on the cell faces inside the object. The boundary condition for pressure is similarly applied as that for velocity components, but the quantity is assigned on the cell centre, not cell faces. Zeroth-order extrapolation is considered for pressure boundary condition; therefore, pressure value on the IP is directly applied to boundary cells and ghost cells inside the object. Figure 10 shows the distribution of a crossflow velocity component for various minimum mesh widths and Reynolds numbers. The current IBM in combination with spatiotemporal schemes results in oscillations around the wall surface when the normalised wall distance y+ is large. The systematic study showed that the wall y+

Figure . Flow ﬁelds and cube distributions during adaptive mesh reﬁnement for (a) Re = .E+, (b) Re = .E+ with y+ < adaptive mesh reﬁnement and (c) Re = .E+ with y+ < adaptive mesh reﬁnement.

Figure . Surfaces colored by pressure coeﬃcient during adaptive mesh reﬁnement for (a) x = .E−, cubes, (b) x = .E−, cubes and (c) x = .E−, cubes.

INTERNATIONAL JOURNAL OF COMPUTATIONAL FLUID DYNAMICS

9

x = −5d, while the outlet boundary is x = 10d, where d denotes a diameter of the sphere. In the Cartesian mesh with immersed boundary treatment, the mesh resolution near the wall surface is critically important; therefore, we conduct mesh refinement for improving mesh resolution near the wall. We employ a refinement criterion of a non-dimensional wall distance y+ defined by Equation (6). The near-wall cubes are refined so that the maximum value of y+ becomes smaller than a pre-defined value in the order of one, O(1).

Downloaded by [Tohoku University] at 22:40 12 November 2017

y+ =

Figure . Time-averaged pressure coeﬃcient Cp along with the experimental values (Achenbach ).

should be in the order of one to avoid such oscillations. This is one of the reasons to employ the AMR based on y+ . 4.3. Adaptive cube refinement based on y+ We consider an incompressible flow around a sphere of unit diameter to demonstrate the present BCM framework coupled with a Navier–Stokes flow solver. The uniform velocity is specified in the inlet boundary and the convection boundary condition is set to the outlet boundary. The Neumann boundary condition is applied to the other outer boundaries. The inlet boundary is located at

ρu+ x , u+ = τw /ρ, μ

(6)

where x denotes the mesh spacing near the wall and shear stress τw is evaluated using the velocity value at an image point of the IBM. With this approach, the boundary layer profile is resolved by a constant number of mesh points regardless of Reynolds numbers. In addition, we apply the condition that the cube size change for neighbouring cubes is 1/2, 1, or 2 based on the rule of the BCM. Figure 11 shows the cross-sectional distribution of y+ near the surface of a sphere. Black solid lines show the boundaries of cubes. In this case, a threshold of y+ = 5 is considered, i.e. a cube which has the maximum y+ value larger than 5 is divided in to eight cubes. As shown in Figure 11, the value of y+ deceases as the cubes are divided. Figure 12 shows the flow fields and cube distributions on a cross-sectional slice during AMR, where black solid lines show cube boundaries. Figure 12(a) is the velocity

Figure . Reynolds stresses in comparison with the experiment (Grandemange, Gohlke, and Cadot ): (a) u2 x along a line passes through y = , z = . (b) u2 z along a line passes through y = , z = .

Downloaded by [Tohoku University] at 22:40 12 November 2017

10

T. MISAKA ET AL.

Figure . Inﬂuence of mesh resolution near wall: (a) time average streamwise velocity with mesh reﬁnement (y+ < ), (b) that with uniform cube arrangement around the sphere and (c) the comparison of the velocity proﬁles.

distribution with the initial cube for Re = 104 . Restating from this flow field, the adaptive cube refinement is conducted so that the y+ becomes smaller than five near the wall surface of the sphere. The acceleration region with thin boundary layer and the wake are refined as in Figure 12(b). The Reynolds number is increased in Figure 12(c), where the cubes are further refined in the acceleration and wake regions. The threshold for y+ is kept a constant value of five. Figure 13 shows the sphere surfaces coloured by pressure coefficient during AMR. Minimum mesh spacing varies as (a) x = 9.8E−3, (b) x = 2.4E−3 and (c) x = 1.2E−4 with 1220, 2382 and 8913 cubes, respectively. Note that the IBM is employed in this study; therefore, the small bumps on the surface do not directly affect the near-wall velocity distribution. As shown in Figure 13(b), cubes near the strong acceleration region are preferentially refined (see also Figure 11(b)). Figure 14 shows the comparison of pressure coefficient between the experiment by Achenbach (1972) and the time-averaged results from the present simulation for Re = 104 and 2 × 104 . The angle θ starts from negative x direction. They agree well between θ = 0 and 140 degrees, while highly unsteady wake results in the discrepancy after θ = 140. The lowest peak of Cp for Re = 104 is slightly larger than the experiment and the case of Re = 2 × 104 . The difference is due to one level larger mesh spacing in Re = 104 compared to that of Re = 2 × 104 . Figure 15 shows Reynolds stresses in comparison with the experiment (Grandemange, Gohlke, and Cadot 2014) to check turbulent statistics in the wake. Figure 15(a)

Figure . The wall-clock time for the ﬂow calculation and one-toone communication as well as the number of cubes during adaptive mesh reﬁnement with load balancing.

shows u2 x calculated from the streamwise velocity fluctuation along a line passes through y = 0, z = 0.5, while Figure 15(b) is u2 z form the vertical velocity fluctuation along a line passes through y = 0, z = 0. In Figure 15, we also plotted the case with mesh refinement in the wake region bounded by x = 0 and 3d. The magnitudes of u2 x and u2 z are comparable to the experiment; however, the 2 peaks of u2 x and uz are slightly moved backward compared to the experiment. This tendency is independent of the mesh resolution in the wake. Figure 16 shows the influence of wall cube size, where the time average streamwise velocity with mesh refinement (y+ < 5) is shown in Figure 16(a), and that with uniform cube arrangement around the sphere in

Downloaded by [Tohoku University] at 22:40 12 November 2017

INTERNATIONAL JOURNAL OF COMPUTATIONAL FLUID DYNAMICS

11

Figure . Switching of time-stepping schemes: (a) the distribution of explicit cubes in red and implicit cubes in blue and (b) a corresponding ﬂow ﬁeld.

Figure 16(b). In addition, the comparison of the streamwise velocity profiles on the top of the sphere is shown in Figure 16(c). Although the average velocity of fully separated region appears slightly different, the velocity profiles just after the separation in Figure 16(c) are almost identical. Figure 17 shows the wall-clock time for the flow calculation and the one-to-one communication as well as the number of cubes before/after mesh refinement. It is shown that the number of cubes assigned to each MPI rank deviates from 300 up to 500 cubes after AMR and load balancing. On the other hand, the wall-clock time is almost levelled with the help of load balancing. The wall-clock time and the number of cubes are both uniformly distributed in MPI ranks before load balancing. The number of cubes increased 6.8 times during AMR, while the wall-clock time became 5.7 times larger than the case before AMR. Figure 18 shows the distribution of explicit (red) and implicit (blue) cubes, and the corresponding velocity distribution on a slice. By checking the local CFL number in each cube, explicit and implicit time integration schemes are switched during time integration. The explicit/implicit cubes are switched frequently due to the unsteadiness of the wake; however, the time integration proceeds stably. The mixed implementation of different numerical schemes with various fidelities and the load balancing of such solver become easy due to the simplicity of the BCM, which would be a potential benefit of the current approach.

5. Conclusions We developed a framework for a distributedmemory parallel computer that enables dynamic data management for AMR and load balancing. We employed data structure of the BCM where a computational domain is divided into multi-level cubic domains and each cube has the same number of grid points inside, realising a multi-level block-structured Cartesian mesh. Solution AMR, which worked efficiently with the help of the dynamic load balancing, was implemented by dividing cubes based on mesh refinement criteria. The framework was investigated with the Laplace equation in terms of AMR, load balancing and the parallel efficiency. The error reduction of AMR was confirmed using an analytic series solution of the Laplace equation. The results showed the effect of the load balancing to reduce the overall computational time as well as the solution AMR to reduce the error of the numerical solution. It was also shown that the number of cubes assigned to each MPI rank should be large to conduct load balancing efficiently. It was then applied to the incompressible Navier– Stokes equations to simulate a flow around a sphere. We considered wall-adaptive cube refinement where a nondimensional wall distance y+ near an object was used for the criterion of mesh refinement. For a fixed value of y+ as a threshold, the AMR was conducted by increasing the Reynolds number. This strategy fits well with the immerse boundary method in a Cartesian mesh because the mesh

12

T. MISAKA ET AL.

resolution criteria for the IBM are automatically met. The load imbalance due to y+ AMR was also corrected by the present approach. To utilise the BCM framework more effectively, we also tested a cube-wise algorithm switching where an explicit and implicit time integration schemes were switched depending on the local CFL number.

Acknowledgments We thank computational resources provided by the Institute of Fluid Science, Tohoku University (SGI UV2000), Cyberscience Center, Tohoku University (NEC LX402Re-2) and by the Institute of Statistical Mathematics (SGI ICE X).

Downloaded by [Tohoku University] at 22:40 12 November 2017

Disclosure statement No potential conflict of interest was reported by the authors.

Funding This work was partially supported by Japan Society for the Promotion of Science (JSPS) KAKENHI Grant-in-Aid for Scientific Research on Innovative Areas [grant number 16H015290].

References Achenbach, E. 1972. “Experiments on the Flow Past Spheres at Very High Reynolds Numbers.” Journal of Fluid Mechanics 54: 565–575. Adams, M., P. Colella, D. T. Graves, J. N. Johnson, N. D. Keen, T. J. Ligocki, D. F. Martin, et al. 2015. “Chombo Software Package for AMR Applications: Design Document.” Lawrence Berkeley National Laboratory Technical Report LBNL-6616E. Aftosmis, M. J., M. J. Berger, and S. M. Murman. 2004. “Applications of Space-Filling Curves to Cartesian Methods in CFD.” 42nd AIAA Aerospace Sciences Meeting and Exhibit, Reno, NV, January 5–8, 2004. Berger, M. J., and P. Colella. 1989. “Local Adaptive Mesh Refinement for Shock Hydrodynamics.” Journal of Computational Physics 82: 64–84. Fukushima, Y., T. Misaka, S. Obayashi, D. Sasaki, and K. Nakahashi. 2016. “Wavenumber Optimized Immersed Boundary Method for AeroAcoustic Analysis Based on Cartesian Mesh Method.” AIAA Journal 54 (10): 2988–3001.

Grandemange, M., M. Gohlke, and O. Cadot. 2014. “Statistical Axisymmetry of the Turbulence Sphere Wake.” Experiments in Fluids 55: 1838-1-11. Gunney, B. T. N., and R. W. Anderson. 2016. “Advances in Patch-Based Adaptive Mesh Refinement Scalability.” Journal of Parallel and Distributed Computing 89: 65–84. Hirt, C. W., and J. L. Cook. 1972. “Calculating ThreeDimensional Flows around Structures and Over Rough Terrain.” Journal of Computational Physics 10: 324–340. Ishida, T., S. Takahashi, and K. Nakahashi. 2008. “Efficient and Robust Cartesian Mesh Generation for Building-Cube Method.” Journal of Computational Science and Technology 2 (4): 435–446. Kawamura, T., H. Takami, and K. Kuwahara. 1986. “Computation of High Reynolds Number Flow Around a Circular Cylinder with Surface Roughness.” Fluid Dynamic Resrach 1:145–162. Kobayashi, H. 2005. “The Subgrid-Scale Models based on Coherent Structures for Rotating Homogeneous Turbulence and Turbulent Channel Flow.” Physics of Fluids 17: 045104-1-12. Manhart, M. 2004. “A Zonal Grid Algorithm for DNS of Turbulent Boundary Layer.” Computers & Fluids 33: 435–461. Mittal, R., and G. Iaccarino. 2005. “Immersed Boundary Methods.” Annual Review of Fluid Mechanics 37: 239–261. Morinishi, Y., T. S. Lund, O. V. Vasilyev, and P. Moin. 1998. “Fully Conservative Higher Order Finite Difference Schemes for Incompressible Flow.” Journal of Computational Physics 143: 90–124. Nakahashi, K. 2005. “High-Density Mesh Flow Computations with Pre-/Post-Data Compressions.” 17th AIAA Computational Fluid Dynamics Conference, Toronto, ON, June 6–9, 2005. Onishi, K., K. Tsubokura, S. Obayashi, and K. Nakahashi. 2014. “Vehicle Aerodynamics Simulation for the Next Generation on the K Computer: Part 2 Use of Dirty CAD Data with Modified Cartesian Grid Approach.” SAE International Journal of Passenger Cars- Mechanical Systems 7: 528–537. Sakai, R., D. Sasaki, S. Obayashi, and K. Nakahashi. 2013. “Wavelet-Based Data Compression for Flow Simulation on Block-Structured Cartesian Mesh.” International Journal for Numerical Methods in Fluids 73(5): 462–476. Spalart, P. R., R. D. Moser, and M. M. Rogers. 1991. “Spectral Methods for the Navier-Stokes Equations with one Infinite and Two Periodic Directions.” Journal of Computational Physics 96: 297–324. Su, X., S. Yamamoto, K. Nakahashi. 2012. “Analysis of a Meshless Solver for High Reynolds Number Flow.” International Journal for Numerical Methods in Fluids 72: 505–527. Williamson, J. H. 1980. “Low-Storage Runge-Kutta Schemes.” Journal of Computational Physics 35: 48–56.

ISSN: 1061-8562 (Print) 1029-0257 (Online) Journal homepage: http://www.tandfonline.com/loi/gcfd20

Adaptive mesh refinement and load balancing based on multi-level block-structured Cartesian mesh Takashi Misaka, Daisuke Sasaki & Shigeru Obayashi To cite this article: Takashi Misaka, Daisuke Sasaki & Shigeru Obayashi (2017): Adaptive mesh refinement and load balancing based on multi-level block-structured Cartesian mesh, International Journal of Computational Fluid Dynamics, DOI: 10.1080/10618562.2017.1390085 To link to this article: http://dx.doi.org/10.1080/10618562.2017.1390085

Published online: 12 Nov 2017.

Submit your article to this journal

View related articles

View Crossmark data

Full Terms & Conditions of access and use can be found at http://www.tandfonline.com/action/journalInformation?journalCode=gcfd20 Download by: [Tohoku University]

Date: 12 November 2017, At: 22:40

INTERNATIONAL JOURNAL OF COMPUTATIONAL FLUID DYNAMICS, https://doi.org/./..

Adaptive mesh refinement and load balancing based on multi-level block-structured Cartesian mesh Takashi Misakaa , Daisuke Sasakib and Shigeru Obayashic

Downloaded by [Tohoku University] at 22:40 12 November 2017

a Frontier Research Institute for Interdisciplinary Sciences, Tohoku University, Sendai, Japan; b Kanazawa Institute of Technology, Hakusan, Ishikawa, Japan; c Institute of Fluid Science, Tohoku University, Sendai, Japan

ABSTRACT

ARTICLE HISTORY

We developed a framework for a distributed-memory parallel computer that enables dynamic data management for adaptive mesh refinement and load balancing. We employed simple data structure of the building cube method (BCM) where a computational domain is divided into multi-level cubic domains and each cube has the same number of grid points inside, realising a multi-level blockstructured Cartesian mesh. Solution adaptive mesh refinement, which works efficiently with the help of the dynamic load balancing, was implemented by dividing cubes based on mesh refinement criteria. The framework was investigated with the Laplace equation in terms of adaptive mesh refinement, load balancing and the parallel efficiency. It was then applied to the incompressible Navier– Stokes equations to simulate a turbulent flow around a sphere. We considered wall-adaptive cube refinement where a non-dimensional wall distance y+ near the sphere is used for a criterion of mesh refinement. The result showed the load imbalance due to y+ adaptive mesh refinement was corrected by the present approach. To utilise the BCM framework more effectively, we also tested a cube-wise algorithm switching where an explicit and implicit time integration schemes are switched depending on the local Courant-Friedrichs-Lewy (CFL) condition in each cube.

Received May Accepted September

1. Introduction Resolving flow features with appropriate mesh resolution is critically important especially for unsteady computational fluid dynamics (CFD) simulation such as large-eddy simulation (LES). Adaptive mesh refinement (AMR) is one of most effective techniques to adapt mesh resolution to local flow features such as shock, vortex, and shear layer (Berger and Colella 1989). The key elements of Cartesian mesh-based AMR are (1) data structure, (2) parallelisation, and (3) refinement criteria. The points (1) and (2) are deeply related to the code design; therefore, the development of an AMR code involves tough coding work and time-consuming test runs. Due to the paradigm of computing clusters with distributedmemory multi-core processors, one have to consider the parallelisation between computing nodes and within a node, which further complicates the development of an AMR code. To alleviate this, open AMR libraries such as Chombo (Adams et al. 2015), SAMURAI (Gunney et al. 2016) and so on have been developed. However, it is not always easy to fully use those libraries especially when a code needs to be modified for our own purpose including the change of numerical schemes, inter-grid data transfer, and parallelisation strategy. CONTACT Takashi Misaka

[email protected]

© Informa UK Limited, trading as Taylor & Francis Group

KEYWORDS

Adaptive mesh reﬁnement; load balancing; block-structured Cartesian mesh; large-eddy simulation

Cartesian mesh-based CFD simulation is recently gaining attentions due to its capabilities of fast mesh generation around complex geometries and efficient solution procedures. The building cube method (BCM) is one of Cartesian mesh-based approaches that employ a simple block-structured mesh called ‘cube’ (Nakahashi 2005). A flow field is divided by an assemblage of multi-level cuboids, where those cubes join with the size ratio of 1/2, 1 or 2. Each cube has the same number of mesh points so that the local resolution is determined by the cube size. The cubes are generated by the geometry-adaptive refinement (Ishida, Takahashi, and Nakahashi 2008), which reduces the total number of mesh points compared to a uniform Cartesian mesh. Intensive developments have been conducted by Nakahashi et al., which results in several flow solvers based on compressible/incompressible Navier–Stokes equations, Euler equations, and linearised Euler equations for aeroacoustic analysis (Fukushima et al. 2016; Nakahashi 2005; Sakai et al. 2013; Su et al. 2012). The BCM is also effective for large-scale computing as it is demonstrated by Onishi et al. (2014). Tools have also been developed such as a fast mesh generator (Ishida, Takahashi, and Nakahashi 2008) and data compression techniques for post-processing (Sakai et al. 2013). There are many approaches to handle curved wall

Downloaded by [Tohoku University] at 22:40 12 November 2017

2

T. MISAKA ET AL.

surface in a Cartesian mesh. Focusing on techniques used in the BCM, efforts are put into a staircase representation, low Reynolds number immersed boundary method (IBM), the IBM combined with a logarithmic wall model, and a coupling of a grid-less solver (Su, Yamamoto, and Nakahashi 2012) or an unstructured mesh solver. The simplicity of the BCM data structure is expected to be effective for the dynamic data management required for the AMR because an AMR code tends to become complex when it is applied to three-dimensional complex geometries. In addition, the AMR should be used along with the dynamic load balancing on a distributedmemory machine such that the number of mesh points assigned to each MPI rank can be varied during the AMR simulation, which can be implemented easily using cubes of the BCM. Our motivation is to develop an AMR code for unsteady flow simulation such as LES or direct numerical simulation (DNS). For this purpose, we developed a framework for a distributed-memory parallel computer that enables dynamic data management for AMR and dynamic load balancing. We employ an approach of the BCM as a basic data structure. We added a capability of solution AMR that may work efficiently with the help of the dynamic load balancing. The code is applied to the Laplace equation to investigate the effect of AMR and load balancing. We then apply it to an incompressible flow field, where the wall-adaptive cube refinement is conducted using a non-dimensional wall distance y+ near an object as a criterion of the mesh refinement. The simple data structure of the BCM enables the mixed implementation of different numerical schemes with various fidelities, where the load balancing would work efficiently in such situation. This paper is organised as follows, Section 2 describes the approach for the dynamic cube management based on the BCM. It is demonstrated using the Laplace equation in Section 3. An example of the incompressible Navier–Stokes equations is shown in Section 4. Section 5 concludes the paper.

2. Dynamic cube management 2.1. Data structure in building cube method The data management strategies of the BCM are documented in the original paper (Nakahashi 2005). It is explained briefly to make this paper self-consistent. The BCM basically employs an equally-spaced Cartesian mesh for the simplicities in the mesh generation, in high-order schemes and in the post-processing. To adapt such equally-spaced mesh to complex geometries and local flow scales, the BCM employs multi-level sub-mesh called ‘cube’ so that the local mesh size is determined by

Figure . An example of ‘cubes’and mesh cells of the building cube method.

the size of the cube. Figure 1 shows an example of a BCM mesh, where each cube is discretised by 163 mesh points. With the BCM approach, it is capable to (1) treat complex geometry, (2) use locally dense mesh depending on the local flow scales, (3) implement a high-order scheme easily, (4) parallelise the flow solver, and (5) handle huge data in the post-processing. The present study is mainly focusing on the fourth point, where the load-balance in parallel computations and the solution AMR are realised with the help of the simple data structure of the BCM. Because the computation in each cube is independent, cube-level parallelisation and mesh refinement become simple as mentioned above. Another feature to be mentioned is that the data compression in pre-/post-processing because the total number of mesh points tends to become very large if a complex geometry is handled by a Cartesian mesh. The BCM mesh only keeps the following information: (1) number of cubes, (2) size and coordinates of the front-lower-left corner for each cube, (3) neighbouring-cube information for each cube, (4) number of division of a cube, and (5) cell in–out information. The data (5) may become large because it is flag information for each mesh point, while other data are for each cube, i.e. the size is small. This cell in–out information is compressed by the runlength technique (Nakahashi 2005). The data compression in post-processing has also been proposed to reduce the file output during the simulation (Sakai et al. 2013). 2.2. Cube management for dynamic load balancing To realise dynamic cube management, we employ a space-filling curve, especially, the Peano–Hilbert curve, which is often used for ordering scattered data. The use of a space-filling curve in the unstructured Cartesian CFD solver is well documented in Aftosmis, Berger, and Murman (2004). Here, we simply apply this technique for cubes of the BCM. Figure 2 shows an

INTERNATIONAL JOURNAL OF COMPUTATIONAL FLUID DYNAMICS

3

Figure . Schematic of the load balancing based on wall-clock time.

Downloaded by [Tohoku University] at 22:40 12 November 2017

Figure . The Peano–Hilbert space-ﬁlling curve for cube reﬁnement and load balancing.

example of the Peano–Hilbert space-filling curve, where two-dimensional multi-level cubes are mapped into one-dimensional ordered information. A good point of this space-filling curve is that the domain decomposition for parallel computation can be done by simply locate dividers along the one-dimensional ordered information. The decomposed domain in the original two-dimensional space has nice property, i.e. each sub-domain is sufficiently agglomerated realising relatively small cut surface (Aftosmis, Berger, and Murman 2004). Figure 2 also shows how cube refinement is conducted in the case of three MPI ranks. Once a certain cube is divided into four small cubes (eight small cubes in three dimensions) for mesh refinement, a number of cubes assigned to a certain MPI rank increases. This change can be naturally handled by using a space filling curve. After the cube refinement, the load balancing is done to level the number of cubes assigned to each MPI rank. In this study, we consider the dynamic load balancing using the space-filling curve as mentioned above. That is, the dividers for separating each MPI rank are set simply based on the wall-clock time elapsed in each MPI rank. The AMR and the IBM for solid wall boundaries cause the load imbalance between MPI ranks. We count the wall-clock time for calculation and one-to-one MPI data transfer on each MPI rank. Suppose that the wall-clock time for each MPI rank denotes ti (i = 1, N ), where N is the total number of MPI ranks. We define the ratio of the wall-clock time as follows:

Figure . A computation domain and a series solution of the Laplace equation on a cross-sectional slice.

where ri is the inverse of ti normalised by the sum of them. Using ri , the position of the divider eci along the spacefilling curve is defined by i eci = ncube,all k=1 N

rk

j=1 r j

,

(2)

where ncube,all is the total number of cubes. The procedure of load balancing is schematically shown in Figure 3 in the case of N = 3. Before the load balancing, the measured wall-clock time in MPI rank 2 is smaller than that of MPI rank 1 and 3, while the number of cubes assigned in each MPI rank is the same. The position of dividers eci has the relationship; ec1 = ec2 − ec1 + 1 = ec3 − ec2 + 1 before the load balancing. For the load balancing, the divider positions are modified based on Equation (2) resulting in the increased number of cubes in MPI rank 2 as in Figure 3. By repeating this process, the wall-clock time on each MPI rank is sufficiently levelled. In incompressible flow simulations, the load balancing is carried out every a few hundred time steps.

2.3. Adaptive cube refinement 1/ti ri = N , j=1 1/t j

(1)

The AMR is conducted simply by dividing each cube into eight cubes in three dimensions (one cube into four cubes

4

T. MISAKA ET AL.

Downloaded by [Tohoku University] at 22:40 12 November 2017

Figure . The wall-clock time and the number of cubes with/without load balancing: (a) cubes, MPI ranks and (b) , cubes, MPI ranks.

in two dimensions). The solution AMR requires a criterion for selecting meshes to be refined, such as vorticity magnitude, entropy gradient and so on. We consider in this paper the gradient of a variable in the case with the Laplace equation and the non-dimensional wall distance y+ in the incompressible Navier–Stokes equations case, which will be described in the following sections. The adaptive cube refinement is conducted independently in each computing node; therefore, the number of cubes assigned to each node varies during the computation. The number of mesh points tends to increase especially near the solid wall in the wall-bounded flow simulation with a Cartesian mesh. The load imbalance due to the AMR and the IBM can be alleviated by the approach described in the previous section.

should be up to infinity as an analytical solution; however, we consider 50 terms here after the convergence study concerning the number of terms. Figure 4 shows the domain and the series solution on a cross-sectional slice. The Laplace equation is discretised by a secondorder central difference scheme and the resulting finite difference equations are solved by the successive-overrelaxation (SOR) iteration method to obtain a numerical solution. The numerical solution is then compared with the series solution, especially, by the integration of the difference between numerical and series solutions within each cube.

3. The Laplace equation in three dimensions

Figure 5 shows the wall-clock time with/without load balancing for (a) 512 cubes with 32 MPI ranks, and (b) 4,096 cubes with 64 MPI ranks. Here, the wallclock time for the SOR iteration and one-to-one communication using MPI_Get for halo data exchange are only counted. The load imbalance is artificially generated by varying the number of SOR iteration in each MPI rank using a random number generator: nmax = int(50.0∗ rand()+1.0), where nmax denotes the maximum number of iterations. MPI/OpenMP hybrid parallelisation was employed with a fixed number of 12 threads per MPI rank on the SGI ICE X system. As shown in Figure 5, the wall-clock time is deviated before load balancing, while the number of cubes is the same for all MPI ranks. After the load balancing, the deviation of the wall-clock time is reduced and the number of cube for each MPI rank is varied. The effect of load balancing is pronounced for the case where the number of cubes per MPI rank is large, i.e. 16 cubes per MPI rank in

3.1. Governing equations and the numerical methods We consider a field obtained by a solution of the threedimensional Laplace equation, i.e. ∇ 2 φs = 0. As for boundary conditions, a constant value φs (x, y, 1) = 1 is set on the top boundary, and φs (x, y, z) = 0 is set to zero on other boundaries. In this system, a series solution of the Laplace equation in a cubic domain can be obtained analytically as follows (which is used as a reference solution to validate a numerical solution): 50 16 sinh γmn z φs x, y, z = 2 sin mπ x sin nπ y, (3) π m,n=1 mn sinhγmn

√ where γmn = (mπ )2 +(nπ )2 . The cubic domain has sides of unit length for simplicity. The summation in Equation (3)

3.2. Load balancing

INTERNATIONAL JOURNAL OF COMPUTATIONAL FLUID DYNAMICS

5

arbitrary number of nodes were chosen for MPI parallelisation. 3.3. Adaptive cube refinement based on the gradient

Downloaded by [Tohoku University] at 22:40 12 November 2017

Figure . The wall-clock time with/without load balancing for different number of total MPI ranks.

Figure 5(a) and 64 cubes per MPI rank in Figure 5(b) before load balancing. In the present simulation, the minimum number of cubes in each MPI rank was set to 3; therefore, the freedom of moving the cubes between MPI ranks is limited if the number of cubes per MPI rank is small. Figure 6 shows the wall-clock time with and without load balancing versus the number of MPI ranks. The wall-clock time here includes the time for SOR iteration, MPI_Get, and MPI_Allreduce for collecting the residual. We used SGI ICE X and NEC LX406Re-2, where a MPI/OpenMP hybrid approach is employed with a fixed 12 threads per MPI rank. The result again appears such that the load balancing is more effective when the number of cube assigned to each MPI rank is large. The irregular scaling is observed because the load imbalance is artificially created using a random number as mentioned above. Table 1 shows the specification of computers used in this study. SGI ICE X and NEC LX406Re-2 are used for the solution of the Laplace equation as shown before, while SGI UV2000 is employed for the incompressible flow simulation described in Section 4. Because SGI UV2000 is a cluster with a large shared memory, an

Table . Speciﬁcation of computing resources. node (CPU, memory) NEC LXRe- Intel Xeon E-v ( core) × , GB SGI UV Intel Xeon E-v ( core) × , TB Intel Xeon E-v ( core) × , TB Intel Xeon E-v ( core) × , TB SGI ICE X Intel Xeon E-v ( core) × , GB Intel Xeon E-v ( core) × , GB Intel Xeon E-v ( core) × , GB

No. of nodes nodes unit units units nodes nodes nodes

Figure 7 shows the numerical solution, the difference between the numerical and series solutions, the derivative of numerical solution integrated in each cube, and MPI rank assign to each cube on a cross-sectional slice. Solid black lines show cube boundaries. In this configuration, the difference between numerical and series solutions becomes large near top corners as shown in Figure 7, and cubes near the corner should be refined by the AMR. Figure 8 shows the errors of the numerical solutions during mesh refinement with two different criteria, i.e. the maximum value of |∇φn | in each cube in Figure 8(b) and the cube-integrated |∇φn | in Figure 8(c), where |∇φn | denotes the magnitude of the gradient of φn . For the reference, the error of uniformly refined case is also plotted. The magnitude of ∇φn has large value near the top corners; therefore, cubes in the corners are continuously refined. It is shown from the plot that this approach is not efficient to reduce the error. On the other hand, the refinement index defined by the integral of the gradient within a cube efficiently reduces the error compared to the uniformly refined case and the case of the other refinement criterion. That is, the AMR realised the smaller error with the same number of mesh points (the smaller number of mesh points to achieve a certain value of the error). In Figure 8, the distributions of refined cubes by the two criteria are also shown where approximately 5000 cubes are used for calculation in both cases.

4. Incompressible Navier–Stokes simulation 4.1. Governing equations and the numerical methods For flow simulation, we employ the incompressible Navier–Stokes equations: ∂u j ∂u j uk 1 ∂ p ∂ + =− + (ν + νt ) 2S jk , ∂t ∂xk ρ0 ∂x j ∂xk ∂u j = 0, ∂x j

(4) (5)

where u j and p represent the velocity components in three spatial directions ( j, k = 1, 2, or 3), the pressure deviation from the reference state p = p0 + p , respectively. S jk = (∂u j /∂xk + ∂uk /∂x j ) /2 denotes the strain rate tensor. The summation convention is used for the

Downloaded by [Tohoku University] at 22:40 12 November 2017

6

T. MISAKA ET AL.

Figure . Distributions on a cross-sectional slice: (a) numerical solution φn , (b) the diﬀerence between numerical and series solutions, (c) the derivative of the numerical solution integrated within each cube and (d) MPI rank of each cube.

Figure . Error during adaptive mesh reﬁnement using diﬀerent mesh adaptation indices: (a) the maximum value of |∇φn | in each cube and (b) cube-integrated |∇φn |, where |∇φn | denotes the magnitude of the gradient of the numerical solution φ.

INTERNATIONAL JOURNAL OF COMPUTATIONAL FLUID DYNAMICS

Downloaded by [Tohoku University] at 22:40 12 November 2017

Figure . Wall treatment based on immersed boundary method.

velocity components u j . In this study, a typical value of air density is employed: ρ0 = 1.2 kg/m3 . The kinematic viscosity in Equation (4) is defined by the sum of molecular viscosity and eddy viscosity obtained by a subgrid-scale model (Kobayashi 2005). The above equations are discretised by the fullyconservative fourth-order central difference scheme (Morinishi 1998), where the staggered arrangement of velocity components is employed. Time integration is performed by the third-order low-storage Runge–Kutta scheme (Williamson 1980). It would be effective to switch numerical schemes in each cube based on the local Courant-Friedrichs-Lewy (CFL) condition, because the minimum mesh spacing becomes very small in AMR and the CFL condition limits the size of time step. The switching of numerical schemes and an IBM on a Cartesian grid cause the load imbalance between processors; therefore, the load balancing introduced in this study would be effective. In an explicit cube (CFL = 0.5). The velocity-pressure iterative solver of Hirt et al. (1972) is used for obtaining a pressure field. In the pressure solver, the magnitude of velocity divergence is used as a threshold to stop the iteration. Therefore, the velocity divergence is kept smaller than a pre-defined threshold during time integration. Information exchange between neighbouring cubes with different size requires some interpolations techniques, i.e. a tri-linear interpolation for small to large cubes and the second-order Lagrange interpolation for large to small cubes are employed in this study. The conservation of mass and momentum fluxes by the interpolation was investigated carefully by Manhart (2004). Because the interpolation coefficients can be calculated in advance, the computational cost of the inter-cube interpolation during time stepping is minimised.

4.2. Immersed boundary method There are several approaches to handle curved wall surface in a Cartesian mesh. In this study, we employ an IBM that use image points (Mittal 2005). Figure 9 schematically shows mesh cells and image points (IPs) located near the wall. Because we employ the staggered arrangement of velocity components, wall boundary conditions are applied to cell faces as shown in the figure. The

Figure . The inﬂuence of wall y+ on the velocity distribution near-wall boundary.

8

T. MISAKA ET AL.

Downloaded by [Tohoku University] at 22:40 12 November 2017

Figure . Cube distributions during adaptive mesh reﬁnement based on a non-dimensional wall distance y+ .

position of IPs is defined based on local mesh size, i.e. the IP height from the wall is 1.8 times larger value than local √ mesh spacing, such that the distance is larger than 3x. The size of cubes varies locally on the wall surface during the AMR; therefore, the height of IPs also changes depending on the local mesh spacing. In the case of non-slip wall boundary, we assume a linear velocity distribution from IP down to the wall. And the velocity value at the height of cell face centre is set to the velocity boundary condition of a staggered velocity component. Note that there exist rooms for improving the modelling of the velocity distribution between the IP and the wall, for example, a quadratic distribution and a distribution defined by a log-low model. Because the high-order spatial schemes require several mesh points to calculate the derivative of flow quantities, the cell face

boundary conditions are also applied to the cell inside an object (ghost cell faces). Again we use a linear distribution from the IPs down to the wall, i.e. the negative velocity value is applied on the cell faces inside the object. The boundary condition for pressure is similarly applied as that for velocity components, but the quantity is assigned on the cell centre, not cell faces. Zeroth-order extrapolation is considered for pressure boundary condition; therefore, pressure value on the IP is directly applied to boundary cells and ghost cells inside the object. Figure 10 shows the distribution of a crossflow velocity component for various minimum mesh widths and Reynolds numbers. The current IBM in combination with spatiotemporal schemes results in oscillations around the wall surface when the normalised wall distance y+ is large. The systematic study showed that the wall y+

Figure . Flow ﬁelds and cube distributions during adaptive mesh reﬁnement for (a) Re = .E+, (b) Re = .E+ with y+ < adaptive mesh reﬁnement and (c) Re = .E+ with y+ < adaptive mesh reﬁnement.

Figure . Surfaces colored by pressure coeﬃcient during adaptive mesh reﬁnement for (a) x = .E−, cubes, (b) x = .E−, cubes and (c) x = .E−, cubes.

INTERNATIONAL JOURNAL OF COMPUTATIONAL FLUID DYNAMICS

9

x = −5d, while the outlet boundary is x = 10d, where d denotes a diameter of the sphere. In the Cartesian mesh with immersed boundary treatment, the mesh resolution near the wall surface is critically important; therefore, we conduct mesh refinement for improving mesh resolution near the wall. We employ a refinement criterion of a non-dimensional wall distance y+ defined by Equation (6). The near-wall cubes are refined so that the maximum value of y+ becomes smaller than a pre-defined value in the order of one, O(1).

Downloaded by [Tohoku University] at 22:40 12 November 2017

y+ =

Figure . Time-averaged pressure coeﬃcient Cp along with the experimental values (Achenbach ).

should be in the order of one to avoid such oscillations. This is one of the reasons to employ the AMR based on y+ . 4.3. Adaptive cube refinement based on y+ We consider an incompressible flow around a sphere of unit diameter to demonstrate the present BCM framework coupled with a Navier–Stokes flow solver. The uniform velocity is specified in the inlet boundary and the convection boundary condition is set to the outlet boundary. The Neumann boundary condition is applied to the other outer boundaries. The inlet boundary is located at

ρu+ x , u+ = τw /ρ, μ

(6)

where x denotes the mesh spacing near the wall and shear stress τw is evaluated using the velocity value at an image point of the IBM. With this approach, the boundary layer profile is resolved by a constant number of mesh points regardless of Reynolds numbers. In addition, we apply the condition that the cube size change for neighbouring cubes is 1/2, 1, or 2 based on the rule of the BCM. Figure 11 shows the cross-sectional distribution of y+ near the surface of a sphere. Black solid lines show the boundaries of cubes. In this case, a threshold of y+ = 5 is considered, i.e. a cube which has the maximum y+ value larger than 5 is divided in to eight cubes. As shown in Figure 11, the value of y+ deceases as the cubes are divided. Figure 12 shows the flow fields and cube distributions on a cross-sectional slice during AMR, where black solid lines show cube boundaries. Figure 12(a) is the velocity

Figure . Reynolds stresses in comparison with the experiment (Grandemange, Gohlke, and Cadot ): (a) u2 x along a line passes through y = , z = . (b) u2 z along a line passes through y = , z = .

Downloaded by [Tohoku University] at 22:40 12 November 2017

10

T. MISAKA ET AL.

Figure . Inﬂuence of mesh resolution near wall: (a) time average streamwise velocity with mesh reﬁnement (y+ < ), (b) that with uniform cube arrangement around the sphere and (c) the comparison of the velocity proﬁles.

distribution with the initial cube for Re = 104 . Restating from this flow field, the adaptive cube refinement is conducted so that the y+ becomes smaller than five near the wall surface of the sphere. The acceleration region with thin boundary layer and the wake are refined as in Figure 12(b). The Reynolds number is increased in Figure 12(c), where the cubes are further refined in the acceleration and wake regions. The threshold for y+ is kept a constant value of five. Figure 13 shows the sphere surfaces coloured by pressure coefficient during AMR. Minimum mesh spacing varies as (a) x = 9.8E−3, (b) x = 2.4E−3 and (c) x = 1.2E−4 with 1220, 2382 and 8913 cubes, respectively. Note that the IBM is employed in this study; therefore, the small bumps on the surface do not directly affect the near-wall velocity distribution. As shown in Figure 13(b), cubes near the strong acceleration region are preferentially refined (see also Figure 11(b)). Figure 14 shows the comparison of pressure coefficient between the experiment by Achenbach (1972) and the time-averaged results from the present simulation for Re = 104 and 2 × 104 . The angle θ starts from negative x direction. They agree well between θ = 0 and 140 degrees, while highly unsteady wake results in the discrepancy after θ = 140. The lowest peak of Cp for Re = 104 is slightly larger than the experiment and the case of Re = 2 × 104 . The difference is due to one level larger mesh spacing in Re = 104 compared to that of Re = 2 × 104 . Figure 15 shows Reynolds stresses in comparison with the experiment (Grandemange, Gohlke, and Cadot 2014) to check turbulent statistics in the wake. Figure 15(a)

Figure . The wall-clock time for the ﬂow calculation and one-toone communication as well as the number of cubes during adaptive mesh reﬁnement with load balancing.

shows u2 x calculated from the streamwise velocity fluctuation along a line passes through y = 0, z = 0.5, while Figure 15(b) is u2 z form the vertical velocity fluctuation along a line passes through y = 0, z = 0. In Figure 15, we also plotted the case with mesh refinement in the wake region bounded by x = 0 and 3d. The magnitudes of u2 x and u2 z are comparable to the experiment; however, the 2 peaks of u2 x and uz are slightly moved backward compared to the experiment. This tendency is independent of the mesh resolution in the wake. Figure 16 shows the influence of wall cube size, where the time average streamwise velocity with mesh refinement (y+ < 5) is shown in Figure 16(a), and that with uniform cube arrangement around the sphere in

Downloaded by [Tohoku University] at 22:40 12 November 2017

INTERNATIONAL JOURNAL OF COMPUTATIONAL FLUID DYNAMICS

11

Figure . Switching of time-stepping schemes: (a) the distribution of explicit cubes in red and implicit cubes in blue and (b) a corresponding ﬂow ﬁeld.

Figure 16(b). In addition, the comparison of the streamwise velocity profiles on the top of the sphere is shown in Figure 16(c). Although the average velocity of fully separated region appears slightly different, the velocity profiles just after the separation in Figure 16(c) are almost identical. Figure 17 shows the wall-clock time for the flow calculation and the one-to-one communication as well as the number of cubes before/after mesh refinement. It is shown that the number of cubes assigned to each MPI rank deviates from 300 up to 500 cubes after AMR and load balancing. On the other hand, the wall-clock time is almost levelled with the help of load balancing. The wall-clock time and the number of cubes are both uniformly distributed in MPI ranks before load balancing. The number of cubes increased 6.8 times during AMR, while the wall-clock time became 5.7 times larger than the case before AMR. Figure 18 shows the distribution of explicit (red) and implicit (blue) cubes, and the corresponding velocity distribution on a slice. By checking the local CFL number in each cube, explicit and implicit time integration schemes are switched during time integration. The explicit/implicit cubes are switched frequently due to the unsteadiness of the wake; however, the time integration proceeds stably. The mixed implementation of different numerical schemes with various fidelities and the load balancing of such solver become easy due to the simplicity of the BCM, which would be a potential benefit of the current approach.

5. Conclusions We developed a framework for a distributedmemory parallel computer that enables dynamic data management for AMR and load balancing. We employed data structure of the BCM where a computational domain is divided into multi-level cubic domains and each cube has the same number of grid points inside, realising a multi-level block-structured Cartesian mesh. Solution AMR, which worked efficiently with the help of the dynamic load balancing, was implemented by dividing cubes based on mesh refinement criteria. The framework was investigated with the Laplace equation in terms of AMR, load balancing and the parallel efficiency. The error reduction of AMR was confirmed using an analytic series solution of the Laplace equation. The results showed the effect of the load balancing to reduce the overall computational time as well as the solution AMR to reduce the error of the numerical solution. It was also shown that the number of cubes assigned to each MPI rank should be large to conduct load balancing efficiently. It was then applied to the incompressible Navier– Stokes equations to simulate a flow around a sphere. We considered wall-adaptive cube refinement where a nondimensional wall distance y+ near an object was used for the criterion of mesh refinement. For a fixed value of y+ as a threshold, the AMR was conducted by increasing the Reynolds number. This strategy fits well with the immerse boundary method in a Cartesian mesh because the mesh

12

T. MISAKA ET AL.

resolution criteria for the IBM are automatically met. The load imbalance due to y+ AMR was also corrected by the present approach. To utilise the BCM framework more effectively, we also tested a cube-wise algorithm switching where an explicit and implicit time integration schemes were switched depending on the local CFL number.

Acknowledgments We thank computational resources provided by the Institute of Fluid Science, Tohoku University (SGI UV2000), Cyberscience Center, Tohoku University (NEC LX402Re-2) and by the Institute of Statistical Mathematics (SGI ICE X).

Downloaded by [Tohoku University] at 22:40 12 November 2017

Disclosure statement No potential conflict of interest was reported by the authors.

Funding This work was partially supported by Japan Society for the Promotion of Science (JSPS) KAKENHI Grant-in-Aid for Scientific Research on Innovative Areas [grant number 16H015290].

References Achenbach, E. 1972. “Experiments on the Flow Past Spheres at Very High Reynolds Numbers.” Journal of Fluid Mechanics 54: 565–575. Adams, M., P. Colella, D. T. Graves, J. N. Johnson, N. D. Keen, T. J. Ligocki, D. F. Martin, et al. 2015. “Chombo Software Package for AMR Applications: Design Document.” Lawrence Berkeley National Laboratory Technical Report LBNL-6616E. Aftosmis, M. J., M. J. Berger, and S. M. Murman. 2004. “Applications of Space-Filling Curves to Cartesian Methods in CFD.” 42nd AIAA Aerospace Sciences Meeting and Exhibit, Reno, NV, January 5–8, 2004. Berger, M. J., and P. Colella. 1989. “Local Adaptive Mesh Refinement for Shock Hydrodynamics.” Journal of Computational Physics 82: 64–84. Fukushima, Y., T. Misaka, S. Obayashi, D. Sasaki, and K. Nakahashi. 2016. “Wavenumber Optimized Immersed Boundary Method for AeroAcoustic Analysis Based on Cartesian Mesh Method.” AIAA Journal 54 (10): 2988–3001.

Grandemange, M., M. Gohlke, and O. Cadot. 2014. “Statistical Axisymmetry of the Turbulence Sphere Wake.” Experiments in Fluids 55: 1838-1-11. Gunney, B. T. N., and R. W. Anderson. 2016. “Advances in Patch-Based Adaptive Mesh Refinement Scalability.” Journal of Parallel and Distributed Computing 89: 65–84. Hirt, C. W., and J. L. Cook. 1972. “Calculating ThreeDimensional Flows around Structures and Over Rough Terrain.” Journal of Computational Physics 10: 324–340. Ishida, T., S. Takahashi, and K. Nakahashi. 2008. “Efficient and Robust Cartesian Mesh Generation for Building-Cube Method.” Journal of Computational Science and Technology 2 (4): 435–446. Kawamura, T., H. Takami, and K. Kuwahara. 1986. “Computation of High Reynolds Number Flow Around a Circular Cylinder with Surface Roughness.” Fluid Dynamic Resrach 1:145–162. Kobayashi, H. 2005. “The Subgrid-Scale Models based on Coherent Structures for Rotating Homogeneous Turbulence and Turbulent Channel Flow.” Physics of Fluids 17: 045104-1-12. Manhart, M. 2004. “A Zonal Grid Algorithm for DNS of Turbulent Boundary Layer.” Computers & Fluids 33: 435–461. Mittal, R., and G. Iaccarino. 2005. “Immersed Boundary Methods.” Annual Review of Fluid Mechanics 37: 239–261. Morinishi, Y., T. S. Lund, O. V. Vasilyev, and P. Moin. 1998. “Fully Conservative Higher Order Finite Difference Schemes for Incompressible Flow.” Journal of Computational Physics 143: 90–124. Nakahashi, K. 2005. “High-Density Mesh Flow Computations with Pre-/Post-Data Compressions.” 17th AIAA Computational Fluid Dynamics Conference, Toronto, ON, June 6–9, 2005. Onishi, K., K. Tsubokura, S. Obayashi, and K. Nakahashi. 2014. “Vehicle Aerodynamics Simulation for the Next Generation on the K Computer: Part 2 Use of Dirty CAD Data with Modified Cartesian Grid Approach.” SAE International Journal of Passenger Cars- Mechanical Systems 7: 528–537. Sakai, R., D. Sasaki, S. Obayashi, and K. Nakahashi. 2013. “Wavelet-Based Data Compression for Flow Simulation on Block-Structured Cartesian Mesh.” International Journal for Numerical Methods in Fluids 73(5): 462–476. Spalart, P. R., R. D. Moser, and M. M. Rogers. 1991. “Spectral Methods for the Navier-Stokes Equations with one Infinite and Two Periodic Directions.” Journal of Computational Physics 96: 297–324. Su, X., S. Yamamoto, K. Nakahashi. 2012. “Analysis of a Meshless Solver for High Reynolds Number Flow.” International Journal for Numerical Methods in Fluids 72: 505–527. Williamson, J. H. 1980. “Low-Storage Runge-Kutta Schemes.” Journal of Computational Physics 35: 48–56.