Acceleration of Compressible Flow Simulations with Edge using Implicit Time Stepping

EVELYN OTERO

Licentiate Thesis Stockholm, Sweden 2012

TRITA AVE 2012:25 ISSN 1651-7660

KTH School of Engineering Sciences SE-100 44 Stockholm, Sweden

Academic dissertation which, with the permission of KTH Royal Institute of Technology, is submitted for public examination for the degree of Licentiate of Technology in Aeronautics on Monday 11 June 2012, at 10.15, in room E2, Kungliga Tekniska högskolan, Lindstedsvägen 3 (entrance level), Stockholm. © Evelyn Otero, June 2012. Printed by Universitetsservice US AB.


Preface

The present Licentiate work was carried out at the Department of Aeronautical and Vehicle Engineering at the Royal Institute of Technology (KTH) in Stockholm, Sweden, from August 2010 until June 2012. In Sweden a Licentiate of Technology is an intermediate academic degree obtained half-way between an MSc and a PhD degree. While less formal than a doctoral dissertation, the examination for the degree includes writing a thesis and a public thesis defense.

The funding was provided by the FoT Flygteknik 10-12 Nationell Aerodesignmetodik project, and the financial support is gratefully acknowledged. I would like to express my gratitude to my three supervisors, Arthur Rizzi, Jesper Oppelstrup, and Peter Eliasson, who gave me the opportunity to work on this interesting project in a friendly atmosphere. I have appreciated their continuous availability and support, but also their human side. I am also grateful to my colleagues for their good advice during the progress of my work. Finally, I would like to thank my parents, Carlos Otero and MariCarmen Sola, who have supported and encouraged me throughout my life abroad, giving me strength all the way to this Licentiate thesis.

Evelyn Otero
Stockholm, June 2012


Abstract

Computational fluid dynamics (CFD) has become a significant tool routinely used in design and optimization in the aerospace industry. Typical flows may be characterized by high-speed and compressible flow features and, in many cases, by massive flow separation and unsteadiness. Accurate and efficient numerical solution of time-dependent problems is hence required, and the efficiency of the standard dual-time stepping methods used for unsteady flows in many CFD codes has been found inadequate for large-scale industrial problems. This has motivated the present work, in which a major effort is made to replace the explicit relaxation methods with implicit time integration schemes. The CFD flow solver considered in this work is Edge, a node-based solver for unstructured grids based on a dual, edge-based formulation. Edge is the Swedish national CFD tool for computing compressible flow, used at the Swedish aircraft manufacturer SAAB, and developed at FOI, lately in collaboration with external national and international partners. The work is initially devoted to the implementation of an implicit Lower-Upper Symmetric Gauss-Seidel (LU-SGS) type of relaxation in Edge with the purpose of speeding up the convergence to steady state. The convergence of LU-SGS was first accelerated by basing the implicit operator on a flux splitting method of matrix dissipation type. An increase of the diagonal dominance of the system matrix was the principal motivation. The code has then been optimized by means of the performance tools Intel VTune and CrayPAT, improving the runtime. It was found that the ordering of the unknowns significantly influences the convergence. Thus, different ordering techniques have been investigated. Finding the optimal ordering method is a very hard problem and the results obtained are mostly illustrative. Finally, to improve the convergence speed on the stretched computational grids used for boundary layers, LU-SGS has been combined with the line-implicit method.


Dissertation

This Licentiate thesis is divided into two main parts. The first part gives an overview of the research area with a summary of the work related to the appended papers. The second part collects the papers as follows:

Paper A
E. Otero, P. Eliasson, J. Oppelstrup. "Convergence Acceleration of the CFD code Edge by LU-SGS". Presented at the CEAS 2011 conference, 24-28 Oct 2011, Venice, Italy, and at the Aerodays 2011 conference, 30 Mar - 01 Apr 2011, Madrid, Spain.

Paper B
E. Otero, P. Eliasson, J. Oppelstrup. "Performance analysis of the LU-SGS algorithm in the CFD code Edge". Preprint.

Paper C
E. Otero, P. Eliasson, J. Oppelstrup. "Implementation of Implicit LU-SGS method with Line-implicit scheme on Stretched Unstructured Grids". Preprint.


Division of Work Between Authors

Paper A: E. Otero wrote the code in collaboration with P. Eliasson, performed the tests and wrote the paper.

Paper B: E. Otero implemented the algorithms, performed the tests and wrote the paper under the supervision of P. Eliasson and J. Oppelstrup.

Paper C: E. Otero developed the theory, implemented the code in collaboration with P. Eliasson, and wrote the paper.


Contents

Part I: Overview and Summary

Introduction
Background
    Governing equations
    Inviscid flux discretization
    Multigrid and Runge-Kutta steady state time integration
    Line-implicit explicit Runge-Kutta
    Implicit time stepping
LU-SGS linear solver
    LU-SGS algorithm
    Convergence condition
Results and Discussion
    Paper A – Convergence Acceleration of the CFD code Edge by LU-SGS
    Paper B – Performance analysis of the LU-SGS algorithm in the CFD code Edge
    Paper C – Implementation of Implicit LU-SGS method with Line-implicit scheme on Stretched Unstructured Grids
Conclusions
References

Part II: Appended Papers

Part I

Overview and Summary


Introduction

Computational fluid dynamics (CFD) has become a significant tool routinely used in the aerodynamic design and optimization of aircraft. CFD tools are used from the conceptual phase to production as preliminary tests of a specific aircraft design, and CFD simulations have largely replaced experiments as a source of aerodynamic data. However, the computing time can become very costly, so convergence acceleration techniques are needed to speed up CFD solvers.

The flow is often characterized by high speed and compressibility and, in many cases, by massive flow separation and unsteadiness. Accurate and efficient numerical time-dependent computations are therefore called for, as Fig. 1 illustrates. For unsteady problems, a dual-time stepping method is often used, but the efficiency of standard dual-time stepping methods is inadequate for large-scale industrial problems. A more rapidly convergent method would speed up most of the existing CFD solvers used in industry and at research agencies. This has motivated the present Licentiate work, in which an effort is made to replace the explicit time stepping methods with implicit time integration schemes.

The flow solver considered in this project is the Swedish flow solver Edge [1], illustrated in Fig. 2, developed at FOI, lately in collaboration with external national and international partners. Edge uses an edge-based formulation [1] on unstructured node-centered finite-volume grids based on a dual grid supplied by a preprocessor. Acceleration techniques [1] have been implemented in Edge, such as Multigrid (MG) based on the Full Approximation Scheme (FAS) [2], which treats the nonlinear equation system using explicit Runge-Kutta (RK) time stepping [1] as smoother. For viscous problems, a line-implicit method [3] is available to improve convergence on the very anisotropic grids necessary for boundary layers.

Implicit methods remove the stability restriction on the time step. Therefore, implicit methods are widely used in attempts to accelerate the convergence of steady-state problems as well as to improve the efficiency of unsteady solvers by advancing the solution with substantially larger time (or pseudo-time) steps. The present work focuses on an implicit LU-SGS type of relaxation. Lower-Upper Symmetric Gauss-Seidel (LU-SGS) [4] [5] [6] is a linear iterative solver with low memory requirements based on approximations to the Jacobian matrix. The first step is the definition of a proper block matrix implicit operator. It controls, by parameters in the artificial dissipation model, the contributions to the diagonal


blocks of the system matrix [6] to make them "heavier". Then optimization of the code was performed using information from performance tools [7] [8] and finally, the effect of different node orderings [9] [10] was investigated. The convergence is still inadequate for stretched grids, so a first version of LU-SGS combined with line-implicit [3] was implemented and tested for stretched meshes.

Figure 1: Time dependent simulation problem (from the MUSAF colloquium, http://www.cerfacs.fr/musaf/)

Figure 2: Complete process of the CFD solver Edge


Background

Governing equations

The Navier-Stokes equations are the basic equations modeling fluid flow in CFD. These equations can be simplified for inviscid flow by neglecting the viscosity, which leads to the Euler equations at the core of this project. The equations must be treated in conservation form for correct capturing of discontinuities such as shock waves. The conservation-law form of the Navier-Stokes equations concerns the mass, momentum and energy and can be formulated as in Eq. 1,

$$\frac{\partial U}{\partial t} + \nabla\cdot F = 0 \qquad (1)$$

where $U = (\rho,\ \rho u,\ \rho v,\ \rho w,\ \rho E)^T$ is the vector of conserved variables, with $E$ the total energy, $\rho$ the density, and $u, v, w$ the velocity components. The pressure $p$ is computed from the conserved variables as needed. Moreover $F = F_I - F_V$, where $F_I$ is the matrix of inviscid fluxes and $F_V$ the matrix of viscous fluxes,

$$F_I = \begin{pmatrix}
\rho u & \rho v & \rho w \\
\rho u^2 + p & \rho u v & \rho u w \\
\rho u v & \rho v^2 + p & \rho v w \\
\rho u w & \rho v w & \rho w^2 + p \\
u(\rho E + p) & v(\rho E + p) & w(\rho E + p)
\end{pmatrix}
\qquad
F_V = \begin{pmatrix}
0 & 0 & 0 \\
\tau_{xx} & \tau_{xy} & \tau_{xz} \\
\tau_{xy} & \tau_{yy} & \tau_{yz} \\
\tau_{xz} & \tau_{yz} & \tau_{zz} \\
(\tau W)_x - q_x & (\tau W)_y - q_y & (\tau W)_z - q_z
\end{pmatrix}$$

with $W = (u, v, w)^T$, the stress tensor $\tau_{ij} = \mu\left(\frac{\partial W_i}{\partial x_j} + \frac{\partial W_j}{\partial x_i}\right) - \frac{2}{3}\mu\frac{\partial W_k}{\partial x_k}\delta_{ij} = 2\mu S - \frac{2}{3}\mu\,\mathrm{div}(W)\,I$, and $q$ the conductive heat flux according to Fourier's law.

The integral form of the Navier-Stokes equations (1) over a control volume $\Omega$ bounded by the surface $S$ reads as Eq. 2,

$$\int_\Omega \frac{\partial U}{\partial t}\, d\Omega + \oint_S F\, n\, dS = 0 \qquad (2)$$

where $n$ is the normal vector to the surface of the control volume.

Edge solves the Reynolds-Averaged Navier-Stokes (RANS) equations [1] for compressible flow, which resolve only the mean flow motion and rely on a turbulence model to represent the sub-grid scale motion through the Reynolds stresses. Figure 3 gives a visualization of how a typical problem is set up for computation. We identify the computational domain where the data will be computed and the aerodynamic object inside it, in this 2D case an airfoil. Boundary conditions then have to be defined along the airfoil and along the outer boundary limiting the domain, depending on the flow. Finally, aerodynamic data such as the free-stream velocity and the angle of attack have to be provided.

Figure 3: Configuration for 2D Simulations (RANS or Euler)

The unsteady RANS flow model in Edge is a Method-of-Lines discretization of Eq. 2. This means that space is discretized first, producing a large set of ordinary differential equations,

$$V\frac{dU}{dt} + R(U) = 0 \qquad (3)$$

which may be solved for the steady state or for the unsteady flow. $V$ is the volume and $R$ contains parts pertaining to the convective processes, the viscous forces, the artificial dissipation, and the turbulence model. If the time-stepping process converges as time goes on and the residual $R$ tends to zero, the solution tends to a stable steady flow. The accuracy of the solution is determined by a tolerance limit $tol$ in Eq. 4,

$$\|R(U^n)\| \le tol \qquad (4)$$

The time dependent problem of Eq. 3 corresponds to solving the RANS equations,

$$\frac{\partial U}{\partial t} + \nabla\cdot F_I + \nabla\cdot F_V = Q \qquad (5)$$

with $Q$ an additional vector of source terms and the viscous flux matrix $F_V$ including the negative signs. Equation 5 is then integrated over a control volume $\Omega$ bounded by the surface $S$ (Fig. 4),

$$\int_\Omega \frac{\partial U}{\partial t}\, d\Omega + \oint_S \left(F_I + F_V\right) n\, dS = \int_\Omega Q\, d\Omega \qquad (6)$$

A control volume is defined by the so-called dual grid, indicated in Fig. 4 with dashed lines. This control volume surrounds each node of the generated grid, the nodes being connected by edges. A surface vector $nS$ is associated with each edge, with $n$ the normalized normal vector and $S$ the area. For each edge, the surface vector is oriented from the node with the first index to the node with the second index. Summing all the surface vectors over a closed volume gives the null vector (Eq. 7),

$$\sum_i n_i S_i = 0 \qquad (7)$$

This definition is essential for the finite-volume method, where the fluxes are computed for each surface of the control volumes. The method is conservative in the sense that the flux entering a given volume is equal to the flux leaving the adjacent volume. A finite-volume discretization, Eq. 8, is obtained by applying the integral form (Eq. 6) to the control volume $V_0$ surrounding the node $v_0$, as in Fig. 4,

$$\frac{d}{dt}\left(U_0 V_0\right) + \sum_{k=1}^{m_0} F_{I_{0k}}\, n_{0k} S_{0k} + \sum_{k=1}^{m_0} F_{V_{0k}}\, n_{0k} S_{0k} = Q_0 V_0 \qquad (8)$$


Figure 4: Dual grid (dashed lines) forms control volume

Neglecting the viscous fluxes $F_{V_{0k}}$ in Eq. 8 leads to the Euler equations to be solved,

$$\frac{d}{dt}\left(U_0 V_0\right) + \sum_{k=1}^{m_0} F_{I_{0k}}\, n_{0k} S_{0k} = 0 \qquad (9)$$

Inviscid flux discretization

In a finite-volume method one also needs to specify the numerical methods for the evaluation of the fluxes. Edge [1] uses two schemes for the inviscid flux $F_{I_{0k}}$, based on the central discretizations of Eqs. 10 and 14 but with different types of dissipation $d$. These correspond to the Jameson flux approximation with artificial dissipation, Eq. 11, and to the upwind flux difference splitting (DFS) with a "built-in" dissipation, Eq. 15. The Jameson scheme for the inviscid flux $F_{I_{01}}$ reads

$$F_{I_{01}} = F_I\left(\frac{U_0 + U_1}{2}\right) - d_{01} \qquad (10)$$

The added dissipation $d_{01}$ (Eq. 11) is governed by a pressure switch which activates a second-order difference dissipation $\varepsilon^{(2)}$ near shocks while turning off a fourth-order difference dissipation $\varepsilon^{(4)}$,

$$d_{01} = \left[\varepsilon_{01}^{(2)}\left(U_1 - U_0\right) - \varepsilon_{01}^{(4)}\left(\nabla^2 U_1 - \nabla^2 U_0\right)\right]\varphi_{01}\lambda_{01} \qquad (11)$$


with the factor $\varphi_{01}$ introduced to account for the stretching in the grid, and the local spectral radius

$$\lambda_{01} = \left(|u_{01}\cdot n_{01}| + c_{01}\right) S_{01} \qquad (12)$$

where $u_{01}$, $c_{01}$, $S_{01}$, and $n_{01}$ denote the flow velocity, speed of sound, area, and normal of the cell face, respectively. Moreover, $\nabla^2$ is the undivided Laplacian operator, which reads

$$\nabla^2 U_0 = \sum_{k=1}^{m_0}\left(U_k - U_0\right) = -m_0 U_0 + \sum_{k=1}^{m_0} U_k \qquad (13)$$

The flexibility of this method comes from the controllable amount of artificial dissipation applied through the $\varepsilon$-factors. Other advantages of this scheme are its simplicity and robust convergence. However, since the spectral radius corresponds to the maximum eigenvalue of the flux Jacobian, the Jameson model is perhaps unnecessarily dissipative.

The Roe DFS (Eq. 14) is an upwind scheme [11] where a natural dissipation $d_{01}$, Eq. 15, appears with the discretization using the Roe matrix $A = R\Lambda^* R^{-1}$,

$$F_{I_{01}} = \frac{1}{2}\left(F_I(U_0) + F_I(U_1)\right) - d_{01} \qquad (14)$$

$$d_{01} = \frac{1}{2}\, R\,|\Lambda^*|\, R^{-1}\left(U_1 - U_0\right) \qquad (15)$$

The matrix $R$ contains the right eigenvectors of the flux Jacobian. The diagonal matrix $\Lambda^*$ contains the eigenvalues adjusted with the Harten entropy fix [12], as shown in Eq. 16, to prevent expansion shocks where an eigenvalue vanishes,

$$|\lambda_i^*| = \begin{cases} \dfrac{\lambda_i^2 + \delta^2}{2\delta}, & |\lambda_i| \le \delta \\[4pt] |\lambda_i|, & |\lambda_i| > \delta \end{cases} \qquad (16)$$

The variable $\delta$ is usually around 5% of the spectral radius. The flexibility of this method comes from the modification of the eigenvalues $\lambda_i$ in $\Lambda^*$ through the parameter $\delta$. By proper choice of $\delta$, the first-order scheme can be turned into a second-order one with better accuracy.

The Matrix dissipation

From these two discretization schemes, a generalized method has been implemented to keep the advantages of both: the good accuracy of the Roe upwind scheme and the good convergence of the Jameson scheme. The matrix dissipation method [13] [11] applies the Roe matrix $|A|$ from the upwind scheme only at the shock, through $\varepsilon_{01}^{(2)}$, for higher accuracy, while keeping the strong dissipation of the spectral radius $\lambda_{01}$ outside the shock through $\varepsilon_{01}^{(4)}$, Eq. 17. This matrix dissipation method has been of strong interest as a new tool to increase the diagonal dominance of the system matrix, as explained in the section on the LU-SGS linear solver.

$$d_{01} = |A|\,\varepsilon_{01}^{(2)}\left(U_1 - U_0\right) - \lambda_{01}\,\varepsilon_{01}^{(4)}\left(\nabla^2 U_1 - \nabla^2 U_0\right) \qquad (17)$$
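The entropy fix of Eq. 16 is a one-line operation on each eigenvalue. Below is a small, stand-alone Fortran sketch of it (the module and function names are illustrative, not from the Edge source); increasing δ makes |Λ*| heavier, which is exactly the handle used later to strengthen the diagonal blocks of the implicit operator.

```fortran
! Sketch of the Harten entropy fix of Eq. 16: eigenvalues smaller in magnitude
! than delta are replaced by a smooth parabolic bound away from zero.
module entropy_fix_mod
  implicit none
contains
  pure function entropy_fix(lambda, delta) result(lam_fixed)
    real(8), intent(in) :: lambda, delta
    real(8) :: lam_fixed
    if (abs(lambda) <= delta) then
       lam_fixed = (lambda**2 + delta**2) / (2.0d0*delta)
    else
       lam_fixed = abs(lambda)
    end if
  end function entropy_fix
end module entropy_fix_mod
```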

Figs. 5 and 6 show the flow around an RAE airfoil for inviscid and viscous transonic flow simulations, respectively, at Mach 0.8 and 0° angle of attack using the matrix dissipation method. As can be observed, in both cases the shock waves are sharp. Moreover, in the viscous case (Fig. 6) the interaction with the boundary layer is visible. The residual convergence follows that of the Jameson scheme dissipation. Hence the method has been validated and used for further implementations.

Figure 5: Matrix dissipation. Euler computations on RAE airfoil.

Figure 6: Matrix dissipation. RANS computations on RAE airfoil.


After this description of the spatial discretization, the solution techniques are presented: the temporal discretization schemes and the solvers.

Multigrid and Runge-Kutta steady state time integration

The time integration method used in Edge is an explicit q-stage Runge-Kutta scheme [1],

$$\begin{aligned}
U^1 &= U^n + \alpha_1 \Delta t\, V^{-1} R(U^n),\\
U^2 &= U^n + \alpha_2 \Delta t\, V^{-1} R(U^1),\\
&\;\;\vdots\\
U^q &= U^n + \alpha_q \Delta t\, V^{-1} R(U^{q-1}),\\
U^{n+1} &= U^q.
\end{aligned} \qquad (18)$$

Multigrid acceleration with Runge-Kutta as smoother has been implemented in Edge. It employs a hierarchy of grids of different mesh sizes, from fine to coarse (Fig. 7). Figure 7 shows four grid levels created by merging the cells progressively from the finest to the coarsest level. This process is called agglomeration: control volumes of the finest grid are combined in such a way that the flux conservation property of the finite-volume discretization is preserved. The agglomeration of control volumes is carried out at the geometric level, and new data needs to be reconstructed from edges and nodes for each level.

Figure 7: Multigrid agglomeration

Multigrid [2] [14] uses restriction and prolongation operations (Fig. 8). The restriction transfers the solution or residual from one level to the next coarser grid level. The prolongation operation interpolates the solution from one level to the next finer level. Both restriction and prolongation operators have to be designed to respect the conservation property of the finite-volume scheme. We know that

Figure 8: Multigrid steps

for elliptic problems the idea of this technique consists in transferring the low-frequency components of the residual onto the coarser grids, where they become high-frequency. This implies an acceleration, since iterative methods are known for their high convergence speed on the high-frequency components of the error. However, the exact effect of Multigrid on hyperbolic problems is still under investigation. There are two mechanisms active. One advects the residual out through the boundaries, and that may proceed faster on coarser grids by way of their less restrictive CFL condition. But any time-stepping-like scheme needs dissipation for stability, so the built-in, carefully crafted artificial dissipation engages the elliptic Multigrid convergence mechanism. Which of the two is the more important remains under debate. The Multigrid approach implemented for the non-linear systems considered in CFD is based on the FAS scheme [2] [14] developed by Jameson for its robustness and good computing time. The Navier-Stokes problem is posed on the finest level $L$ of the Multigrid process,

$$N^L\left(U^L\right) = 0 \qquad (19)$$

We then define the forcing function $F^l$ such that the problem to be solved on the coarser levels $l$ becomes

$$N^l\left(U^l\right) = F^l \qquad (20)$$

The definition of the forcing function $F^l$ corresponds to

$$F^l = \hat I^{l}_{l+1}\left[F^{l+1} - N^{l+1}\left(U^{l+1}\right)\right] + N^l\left(I^{l}_{l+1} U^{l+1}\right) \qquad (21)$$


with $I^{l}_{l+1}$ and $\hat I^{l}_{l+1}$ the restriction operators, of the unknowns and of the residuals respectively, from grid level $l+1$ to the coarser grid level $l$. By the restriction operator $I^{l}_{l+1}$ a grid function $u^{l+1}$ on grid level $l+1$ is mapped to $u^l$ on level $l$ as

$$I^{l}_{l+1}:\; u^l_j = \sum_{i\in N_j^{l+1}} \frac{V_i^{l+1}\, u_i^{l+1}}{V_j^l}, \qquad V_j^l = \sum_{i\in N_j^{l+1}} V_i^{l+1}$$

which guarantees conservation of the variables, $V$ being the cell volume. For residuals, we have simply

$$\hat I^{l}_{l+1}:\; R^l_j = \sum_{i\in N_j^{l+1}} R^{l+1}_i$$

The corresponding prolongation interpolates a grid function $u^l$ to level $l+1$,

$$I^{l+1}_{l}:\; u^{l+1}_i = u^l_j, \quad i \in N_j^{l+1}$$

We note that $I^{l}_{l+1} I^{l+1}_{l}$ is the identity, and $I^{l+1}_{l} I^{l}_{l+1}$ is an averaging, mass-conserving injection. For a solution candidate $U^L$ (Eq. 20) on the finest level, one solves on the coarser level $L-1$ for a correction $c^{L-1}$ which, when prolonged and added to $U^L$ to become the next iterate $U^L_{new}$, gives a smaller residual $R^L(U^L_{new})$, with $U^L_{new} = U^L + I^{L}_{L-1} c^{L-1}$. The problem to solve on level $L-1$ becomes

$$R^{L-1}\left(I^{L-1}_{L} U^L + c^{L-1}\right) = -\hat I^{L-1}_{L} R^L\left(U^L\right)$$

for $c^{L-1}$. When the problem on level $L-1$ is in turn solved by computing a correction on a coarser grid $L-2$, etc., recursively, the MG algorithm results. This works only if the restricted residual can be represented accurately on the coarser grid. Therefore, a smoothing operation is applied prior to sending the solution to the coarse grid. In Edge, Runge-Kutta "time-stepping" is the standard smoother, as seen in Eq. 18. However, it has been shown in Paper B that the Multigrid technique has almost no effect for stretched meshes. The convergence to steady state on stretched meshes can be accelerated by combining this explicit scheme with a line-implicit method.
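The volume-weighted restriction above is simple to state in code. The sketch below computes coarse values as volume-weighted averages of the agglomerated fine nodes; it is only an illustration, and the array names (coarse_of, vol_fine, etc.) are not those used by the Edge preprocessor.

```fortran
! Sketch of the conservative restriction: coarse value = volume-weighted
! average of the agglomerated fine values (illustrative names, not Edge code).
program restriction_sketch
  implicit none
  integer, parameter :: nfine = 6, ncoarse = 3
  integer :: coarse_of(nfine)            ! agglomeration map: fine node -> coarse node
  real(8) :: u_fine(nfine), vol_fine(nfine)
  real(8) :: u_coarse(ncoarse), vol_coarse(ncoarse)
  integer :: i, j

  coarse_of = [1, 1, 2, 2, 3, 3]         ! pairs of fine nodes merge into one coarse node
  u_fine    = [1.d0, 2.d0, 3.d0, 4.d0, 5.d0, 6.d0]
  vol_fine  = [1.d0, 1.d0, 2.d0, 2.d0, 1.d0, 3.d0]

  vol_coarse = 0.d0
  u_coarse   = 0.d0
  do i = 1, nfine
     j = coarse_of(i)
     vol_coarse(j) = vol_coarse(j) + vol_fine(i)            ! coarse volume = sum of fine volumes
     u_coarse(j)   = u_coarse(j) + vol_fine(i) * u_fine(i)  ! accumulate V_i * u_i
  end do
  u_coarse = u_coarse / vol_coarse                          ! volume-weighted average

  print *, 'coarse values:', u_coarse
end program restriction_sketch
```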

Line-implicit explicit Runge-Kutta In order to resolve the strong velocity gradients in boundary and shear layers of viscous flows, stretched meshes are necessary close to solid boundaries and in wakes. This area corresponds to the boundary layer indicated in Fig. 9.


Figure 9: Boundary layer

The line-implicit method operates in regions of stretched grid keeping the explicit Runge-Kutta everywhere else. This means that the flow equations are integrated implicitly in time only along structured lines in regions of stretched grids (Fig.10), for example in the boundary layer. As we can see in Fig.10, a refined structured mesh envelops the airfoil where the structured lines are emphasized.

Figure 10: Stretched mesh. Implicit lines

Fig. 11 shows the aspect ratio of the cells. The mesh width Δx along the airfoil is much larger than Δy in the direction normal to the wall. The interest of using an implicit scheme for lines normal to the wall comes from the inefficiency of explicit schemes, whose time step is limited by the smallest cell size, here Δy, and the Courant-Friedrichs-Lewy (CFL) number (Eq. 22). An implicit scheme can instead provide convergence acceleration through large time steps governed by the largest edge size, Δx, and a flexible CFL number (Eq. 23).

The convergence condition of LU-SGS is related to the block diagonal dominance of the system matrix,

$$\|A_{ii}\| > \sum_{j=1,\, j\neq i}^{n} \|A_{ij}\|, \qquad \forall i \in \{1,\dots,n\} \qquad (32)$$

with $A_{ii}$ a block matrix contained in $A$,

$$A = \begin{pmatrix}
A_{11} & A_{12} & \cdots & A_{1n} \\
A_{21} & \ddots & & \vdots \\
\vdots & & \ddots & \vdots \\
A_{n1} & \cdots & \cdots & A_{nn}
\end{pmatrix} \qquad (33)$$

From this investigation, Dwight [6] analyses the effect of this property on the LU-SGS convergence. He concludes that making the system matrix more diagonally dominant makes it more likely that the linear solver will converge.
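The criterion of Eq. 32 can be checked row by row. The sketch below does this for a small block matrix, using the Frobenius norm of each block; it is an illustration only (the program and variable names are not from Edge), and other block norms could equally be used.

```fortran
! Sketch of a row-wise check of the block diagonal dominance of Eq. 32,
! using the Frobenius norm of each block (illustrative, not an Edge routine).
program block_dominance_sketch
  implicit none
  integer, parameter :: n = 3, nb = 2       ! 3x3 system of 2x2 blocks
  real(8) :: a(nb, nb, n, n)
  real(8) :: diag_norm, offdiag_sum
  integer :: i, j

  call random_number(a)
  a = 0.1d0 * a
  do i = 1, n
     a(1, 1, i, i) = a(1, 1, i, i) + 2.0d0  ! make the diagonal blocks "heavier"
     a(2, 2, i, i) = a(2, 2, i, i) + 2.0d0
  end do

  do i = 1, n
     diag_norm = sqrt(sum(a(:, :, i, i)**2))
     offdiag_sum = 0.0d0
     do j = 1, n
        if (j /= i) offdiag_sum = offdiag_sum + sqrt(sum(a(:, :, i, j)**2))
     end do
     print '(a,i0,a,f8.4,a,f8.4)', 'row ', i, ': ||A_ii|| = ', diag_norm, &
           '  sum ||A_ij|| = ', offdiag_sum
  end do
end program block_dominance_sketch
```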


Results and Discussion

This chapter summarizes the results presented in the appended papers A, B and C. The main motivation behind all three papers is convergence acceleration. Paper A treats the definition of a proper implicit operator by increasing the diagonal dominance of the system matrix. Paper B searches for code optimization through performance tools and investigates different ordering techniques. Paper C combines the implicit LU-SGS with the line-implicit method for computations on stretched grids.

Paper A – Convergence Acceleration of the CFD code Edge by LU-SGS The aim of this paper is to investigate the best construction of the implicit operator for diagonal dominance.

Flux Jacobian discretization

The discretization methods investigated for the inviscid flux Jacobians are based on a first-order flux-difference splitting (FDS) scheme (Eq. 34),

$$F_{in} = \frac{1}{2}\left(F(U_n) + F(U_i)\right) - d_{in} \qquad (34)$$

The definition of the dissipation $d_{in}$ distinguishes the two methods implemented in this paper. Moreover, a splitting technique [17] of the flux Jacobian matrices $A = \partial F_I/\partial U$ from the linear system of Eq. 27 has been applied to both methods to increase the diagonal dominance of the system matrix. This splitting technique has its origin in Jameson and Turkel [18] and Yoon and Kwak [19], and has been successfully implemented in the CFD solver COBALT [17]. Table 1 gives a comparison between a scalar dissipation and a matrix dissipation with their corresponding flux Jacobian splittings. As one can observe, the scalar type dissipation is defined by the spectral radius $\lambda_{max}$ of the flux Jacobian matrix $A$, corresponding to a strong dissipation. However, the parameter $\beta$ applied to the spectral radius, which affects the diagonal dominance, offers a degree


Table 1: Implicit operator definition

Scalar dissipation:
- Flux discretization: $F_{i,n} = \frac{1}{2}\left(F(U_n) + F(U_i) - \beta\lambda_{max}\left(U_n - U_i\right)\right)$, with the spectral radius $\lambda_{max} = |u\cdot n| + c$
- Flux Jacobian splitting: $A^{\pm} = \frac{1}{2}\left(A \pm I\beta\lambda_{max}\right)$, i.e. $A^{-}_{i,n} = \dfrac{\partial F_{i,n}}{\partial U_n} = \dfrac{1}{2}\left(\dfrac{\partial F}{\partial U_n} - I\beta\lambda_{max}\right)$ and $A^{+}_{i,n} = \dfrac{\partial F_{i,n}}{\partial U_i} = \dfrac{1}{2}\left(\dfrac{\partial F}{\partial U_i} + I\beta\lambda_{max}\right)$

Matrix dissipation:
- Flux discretization: $F_{i,n} = \frac{1}{2}\left(F(U_n) + F(U_i) - |A''|\left(U_n - U_i\right)\right)$, with the Roe matrix $|A''| = R\,|\Lambda''|\,R^{-1}$
- Flux Jacobian splitting: $A^{\pm} = \frac{1}{2}\left(A \pm |A''|\right)$, i.e. $A^{-}_{i,n} = \dfrac{\partial F_{i,n}}{\partial U_n} = \dfrac{1}{2}\left(\dfrac{\partial F}{\partial U_n} - |A''|_{i,n}\right)$ and $A^{+}_{i,n} = \dfrac{\partial F_{i,n}}{\partial U_i} = \dfrac{1}{2}\left(\dfrac{\partial F}{\partial U_i} + |A''|_{i,n}\right)$

Linear system (both methods):

$$\left(\frac{1}{\Delta t}V_i + \sum_{n\in N(i)} A^{+}_{i,n}\right)\Delta U_i = -R(U) - \sum_{n\in N(i)} A^{-}_{i,n}\,\Delta U_n$$

of freedom for the best trade-off between linear convergence and non-linear convergence. For the matrix type dissipation we find the same construction, with the difference that the spectral radius $\lambda_{max}$ has been replaced by the Roe matrix $|A''| = R|\Lambda''|R^{-1}$. This technique is also found in Nejat and Ollivier-Gooch [20] in the construction of an LU-SGS preconditioner for GMRES. In our implementation the flexibility for improving the diagonal dominance comes this time from the eigenvalue matrix, controlled by the same entropy fix technique defined in Eq. 16. Moreover, the main advantage of this method would be an improvement of the non-linear convergence by applying Roe's FDS discretization also to the RHS. This would mean a close correlation between the computed Jacobian and the RHS, and thus that the system would get closer to a Newton scheme. However, second-order accuracy is generally necessary for the RHS, which would then degrade the quadratic convergence. Hence the convergence speed of the implicit LU-SGS has been studied for different discretization techniques and parameters.

Results

The scalar and matrix dissipation models for the flux Jacobian computations have been compared, based on a flux splitting method. For that, a study was conducted for three parameters: the CFL number, the number of linear iterations (Nswp), and the β coefficient. Moreover, computations with first- and second-order upwind schemes have been compared, all for inviscid transonic Mach 0.8 flow over a NACA 0012 airfoil at 1.25° angle of attack. The impact of these parameters was the same for both methods, so we analyse the results from only one case, the scalar dissipation model. From the variation of the CFL number, as in Fig. 14a, the number of iterations is reduced by a factor of 2.5 compared to the explicit Runge-Kutta scheme for tolerances in {10−4 ... 10−10}. Fortunately, the rate is not very sensitive to the choice of CFL in [100:10000]. Fig. 14b shows the convergence for different numbers of linear iterations (Nswp). There are two distinct phases: the initial one for "large" residuals > 10−5, and the asymptotic phase with very small residuals. The initial convergence is clearly faster for larger Nswp, but the ultimate convergence rate seems only slightly influenced by Nswp.


Figure 14: CFL influence (a). Inner iterations (Nswp) influence (b)

Moreover, regarding the matrix dissipation type, an extra flexible parameter is the entropy fix, i.e. the percentage of the spectral radius (Eq. 16). The main impact of this variable is its strong correlation with the maximum usable CFL number (Fig. 15). This means that an increase of the CFL number will probably require extra dissipation on the LHS. However, as observed for the cases with CFL = 10000, the convergence is decelerated by a too high dissipation of 90% of the spectral radius.

In order to choose between the two schemes, we compare their convergence rates. Figure 16 clearly shows that the convergence with a 1st order upwind dissipation (matrix type) is twice as fast as with a scalar dissipation and almost 4 times faster than the current method (explicit scheme).


Figure 15: Entropy fix limiting CFL

Figure 16: RHS second order Upwind scheme

Computations on stretched meshes have then been tested as a first step towards later RANS simulations. The grids have a normal distance from the airfoil to the first interior node of 10−5 and 10−6 chord lengths. The corresponding result in Paper A gives a convergence twice as fast as the line-implicit scheme, but for first-order accuracy. Moreover, analysis of the CPU time has shown the following results:
• The computations run on Euler meshes give a speed similar to the explicit Runge-Kutta scheme, of the order of 1.


• For RANS grids (stretched meshes), the deceleration compared to a line-implicit method can vary from a factor of 2 (for first-order accuracy) to 5 or much more (for second-order schemes).

Discussion

The implicit LU-SGS implementation has given a convergence acceleration in iterations of the order of 4 with respect to the current (explicit) method for Euler grids. The fastest convergence in number of iterations has been obtained with a 1st order upwind (matrix) dissipation on the left hand side (LHS) of the linear system with CFL = 1000. Computations on RANS grids give a convergence acceleration in iterations of the order of 2 for first-order accuracy, but a deceleration with second-order accuracy, which can grow with grid refinement to 10 times slower. Since we are mainly interested in second-order accurate RANS simulations on stretched meshes, more work has to be done on efficiency optimization. In the following, other techniques are investigated for convergence improvement, this time affecting not the construction of the linear system itself but the code implementation and the node ordering.

Paper B – Performance analysis of the LU-SGS algorithm in the CFD code Edge In the next paper, the code performance is analyzed and ordering techniques are investigated.

Performance tools: CrayPAT and Intel Vtune

Performance tools are essential to analyse the behaviour of a code and, in particular, its interaction with the computer resources. Relevant pieces of information for speed-up are the hotspots of the code, the cache memory utilization and the peak performance. The idea is first to detect the most time-consuming parts of the code - the hotspots - with the Intel Vtune software [7] and to optimize them. Once the hotspots have been optimized, we look at memory access patterns to avoid bottlenecks such as the one illustrated in Fig. 17. The memory access pattern is the single most important factor for performance. It is analyzed with the CrayPAT tool [8], which provides information about the cache memory utilization. A cache miss means that the data needed by the CPU is not in cache and thus requires a main memory access, which uses part of the available memory bandwidth. Too many memory accesses or cache misses will cause a memory bottleneck, since the CPU will have to wait for the memory accesses, Fig. 17. Several techniques exist for efficient management of data storage and access [21].


Figure 17: Memory access

Results

Intel Vtune directly identified as the first hotspot the Fortran 90 intrinsic function MATMUL, used for the 5 x 5 matrix multiplications, which takes almost 50% of the total CPU time. The direct method in Fig. 18 refers to an explicit one-by-one assignment of the 25 entries of the resulting matrix. The loop technique reduces this to a single assignment inside a double loop over the rows and columns. One can observe that the direct method is the most efficient, with only 12% of the total CPU time, reducing by 80% the computing time of the original MATMUL call. The overall CPU time is reduced by a factor of 1.5 compared to the original implementation.
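The idea is sketched below: for fixed 5 x 5 blocks the MATMUL intrinsic can be replaced by a hand-written loop (or, as in the "direct" method above, by writing out all 25 assignments explicitly), removing call overhead and giving the compiler a fixed-size kernel to optimize. The sketch shows the loop variant only and is purely illustrative; the routine names are not taken from the Edge source.

```fortran
! Sketch of replacing the MATMUL intrinsic by an explicit 5x5 multiplication
! (illustrative routine names, not taken from the Edge source).
module matmul5_sketch
  implicit none
contains
  subroutine matmul5_loop(a, b, c)
    real(8), intent(in)  :: a(5,5), b(5,5)
    real(8), intent(out) :: c(5,5)
    integer :: i, j, k
    do j = 1, 5
       do i = 1, 5
          c(i, j) = 0.0d0
          do k = 1, 5
             c(i, j) = c(i, j) + a(i, k) * b(k, j)  ! same result as MATMUL(a, b)
          end do
       end do
    end do
  end subroutine matmul5_loop
end module matmul5_sketch

program test_matmul5
  use matmul5_sketch
  implicit none
  real(8) :: a(5,5), b(5,5), c(5,5)
  call random_number(a)
  call random_number(b)
  call matmul5_loop(a, b, c)
  print *, 'max difference vs MATMUL:', maxval(abs(c - matmul(a, b)))
end program test_matmul5
```

The fully unrolled "direct" variant of the thesis simply writes each of the 25 entries c(i,j) as an explicit sum of five products, which removes the inner loop overhead entirely.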

Figure 18: CPU time vs. Matrix multiplication methods

This optimization increases the fraction of peak performance from 7.4% to 11.2% and gives an overall performance twice that of the non-optimized version (Fig. 19).


[Plot: Log10(Res) versus CPU time, curves for the optimized and non-optimized versions]

Figure 19: Convergence optimization in time

Parallelization analysis

Before considering parallelization for distributed-memory machines using the Message Passing Interface (MPI), we have to understand the data dependencies of the solution algorithms. LU-SGS comes from the classical Gauss-Seidel method, whose data dependencies are linked to the ordering of the unknowns and equations. For Edge, the ordering is associated with the graph of the grid, so the ordering of unknowns and equations follows from the node numbering. Edge is parallelized by domain decomposition, so the Gauss-Seidel iteration only runs into trouble at inter-domain boundaries. Figure 20 gives a clear illustration of the parallelization issue for a simple 1D case, with nodes numbered sequentially along the line, and two processors. Inside each subdomain, the algorithm runs without interruption. However, whenever the loop reaches a subdomain boundary, synchronization barriers are set, blocking all the other processors. As we see in Fig. 20, during the forward sweep, processor 2 cannot start to compute in node i until processor 1 has finished updating node i-1. The corresponding observation holds for the backward sweep, where processor 1 would have to wait for processor 2. Analogous problems appear for any domain decomposition. We understand, then, that the original LU-SGS algorithm does not parallelize in a straightforward way by domain decomposition. As an alternative without this parallelization issue, the Jacobi iteration (Eq. 35) is parallelizable with no data dependency,

$$D x^{n+1} = b - (L + U)\,x^{n} \qquad (35)$$
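The difference in data dependency can be made concrete with a few lines of code. The sketch below applies one Jacobi sweep (Eq. 35) and one symmetric Gauss-Seidel sweep to the same small system: the Jacobi update only reads the previous iterate, while Gauss-Seidel immediately reuses freshly updated unknowns. The code is illustrative only, not the Edge implementation.

```fortran
! Sketch contrasting one Jacobi sweep (Eq. 35) with one symmetric Gauss-Seidel
! sweep on a small diagonally dominant system (illustrative, not Edge code).
program jacobi_vs_sgs_sketch
  implicit none
  integer, parameter :: n = 4
  real(8) :: a(n,n), b(n), x_jac(n), x_old(n), x_sgs(n)
  integer :: i

  a = -1.0d0
  do i = 1, n
     a(i, i) = 4.0d0               ! diagonally dominant test matrix
  end do
  b = 1.0d0
  x_old = 0.0d0;  x_jac = 0.0d0;  x_sgs = 0.0d0

  ! one Jacobi sweep: D x_new = b - (L+U) x_old, all updates use old values only
  do i = 1, n
     x_jac(i) = (b(i) - dot_product(a(i,:), x_old) + a(i,i)*x_old(i)) / a(i,i)
  end do

  ! one symmetric Gauss-Seidel sweep: forward then backward, reusing new values
  do i = 1, n
     x_sgs(i) = (b(i) - dot_product(a(i,:), x_sgs) + a(i,i)*x_sgs(i)) / a(i,i)
  end do
  do i = n, 1, -1
     x_sgs(i) = (b(i) - dot_product(a(i,:), x_sgs) + a(i,i)*x_sgs(i)) / a(i,i)
  end do

  print *, 'after one sweep, Jacobi   :', x_jac
  print *, 'after one sweep, sym. G-S :', x_sgs
end program jacobi_vs_sgs_sketch
```

Because the Jacobi sweep never reads values written in the same sweep, each node update is independent and the loop can be distributed over processors with a single data exchange per iteration.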


Figure 20: Data dependency. 1D case

As Fig. 21 shows, the Jacobi process consists in updating the unknowns at each node and not using the new data until the next iteration. In this way the processors can compute independently of each other, with synchronization and inter-domain data transport only once per iteration, just like explicit time-stepping. It is a folklore theorem that "Gauss-Seidel is twice as fast as Jacobi", and Fig. 22, which compares the runtimes of the Jacobi iteration and LU-SGS, is consistent with this expectation. Clearly we wish to retain the LU-SGS convergence speed also after domain decomposition. Parallelization of LU-SGS would proceed as follows. The idea is

Figure 21: No Data dependency. Jacobi iteration

to keep the original LU-SGS inside each domain and break the algorithm at the inter-processor boundaries. The processors would then not wait for updated values but use the old ones from the previous iteration. This idea has already been implemented by Löhner et al. in [5], who refer to it as a hybrid approach for exchanging inter-processor data. It leads to a pure Jacobi method in the extreme case of one node per domain. The degree of convergence deceleration suffered by the proposed scheme compared to a sequential LU-SGS will be looked into in the planned parallelization of LU-SGS. Figure 22 shows residual vs. runtime for the Jacobi iteration


and LU-SGS for different numbers of sweeps. The lower Nswp within each solver gives the best convergence, namely 1 Nswp for LU-SGS and 5 for the Jacobi iteration, and LU-SGS is almost three times faster than Jacobi.

Figure 22: Convergence in time for Jacobi and LU-SGS (LU-SGS with Nswp = 1 and 5; Jacobi iteration with Nswp = 9 and 15)

LU-SGS has therefore been selected as the smoother for Multigrid. Moreover, its data dependency necessitates a study of how the ordering influences the convergence.

Ordering techniques

From the data dependency it follows that the ordering of the nodes, through which the forward and backward sweeps are executed, will influence the LU-SGS convergence. It is natural to look for the optimal ordering, but it is very hard to find for unstructured grids. The idea has therefore been to limit the search to numberings which make neighbouring nodes have small number differences. We start with variants of graph theory methods developed for low fill-in in Gaussian elimination on sparse matrices, for instance by minimizing the profile or skyline of the matrix. Figure 23 illustrates the definition of both terms. The profile contains all the lower-triangular elements in a row from the first non-zero to the main diagonal. The skyline holds all upper-triangular elements in a column from the first non-zero to the main diagonal. Quite a few results from graph theory have found application to sparse matrix computations. In our setting, which never creates the "big" matrix for elimination, the concept would provide spatial and temporal data locality, which means using data stored close together in memory (spatial locality) or reusing data as much as possible (temporal locality). The Cuthill-McKee (CM) algorithm [9] is a graph theory method which consists in grouping into levels the nodes connected to each other in the mesh. The nodes are ordered by increasing degree, the degree being the number of neighbouring nodes. Reversing the final CM


Figure 23: Profile structure (envelope) of a sparse matrix

ordering produces the Reverse CM, which may be the most used profile reduction ordering method [9]. Elizabeth Cuthill and James McKee originally published the algorithm in 1969 to reduce the bandwidth of a sparse symmetric matrix. The choice of starting node is a degree of freedom, and we add a physical approach to CM which consists in first numbering the nodes along the strong boundaries, namely the airfoil wall and the outer boundary (Fig. 3). In this way the levels created will follow the chosen starting boundary, as shown in Fig. 24, making the numbering follow layers.
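The Cuthill-McKee idea can be sketched in a few lines: a breadth-first traversal from a chosen start node, visiting unvisited neighbours in order of increasing degree, and reversing the resulting permutation for Reverse CM. The code below is an illustration on a tiny graph; the arrays and names are not those of the implementation used with Edge.

```fortran
! Minimal Cuthill-McKee sketch: breadth-first numbering from a start node,
! taking unvisited neighbours in order of increasing degree; reversing the
! permutation gives Reverse CM (illustrative code, not the Edge implementation).
program cuthill_mckee_sketch
  implicit none
  integer, parameter :: n = 6
  logical :: adj(n, n), visited(n)
  integer :: degree(n), order(n), queue(n)
  integer :: head, tail, j, k, v, best

  adj = .false.                        ! small undirected test graph: chain + one branch
  call link(1,2); call link(2,3); call link(3,4); call link(4,5); call link(3,6)
  do v = 1, n
     degree(v) = count(adj(v, :))
  end do

  visited = .false.
  head = 1;  tail = 1
  queue(1) = 1                         ! start node, e.g. on a strong boundary
  visited(1) = .true.
  k = 0
  do while (head <= tail)
     v = queue(head);  head = head + 1
     k = k + 1;  order(k) = v          ! next CM number is assigned to v
     do                                ! append unvisited neighbours by increasing degree
        best = 0
        do j = 1, n
           if (adj(v, j) .and. .not. visited(j)) then
              if (best == 0 .or. degree(j) < degree(best)) best = j
           end if
        end do
        if (best == 0) exit
        tail = tail + 1;  queue(tail) = best
        visited(best) = .true.
     end do
  end do

  print *, 'CM ordering        :', order
  print *, 'Reverse CM ordering:', order(n:1:-1)

contains
  subroutine link(a, b)
    integer, intent(in) :: a, b
    adj(a, b) = .true.;  adj(b, a) = .true.
  end subroutine link
end program cuthill_mckee_sketch
```

Starting the traversal from several boundary nodes at once, as described above for the airfoil wall, simply means seeding the queue with all of those nodes before the loop.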

Figure 24: Level structure. CM at strong boundaries

Results: Structured meshes

Figure 25 defines the pattern used for the CM algorithm, for which different versions have been implemented by reversing the node ordering, following Fig. 26. Apart from the best-case convergences shown in Fig. 27, the ordering methods easily resulted in no convergence at all. In fact, small modifications of the starting vector of the CM algorithm strongly degraded or directly stopped the convergence.


Figure 25: pattern

Figure 26: Zoom on the airfoil wake. Numbering order path 1-5 and starting vector definitions: 1-257, 1-208

Figure 27: Ordering techniques convergences. Structured meshes.


Results: Unstructured meshes

Unstructured grids are usually generated by advancing-front (AFM) or Delaunay (DM) methods, and we compare such orderings with the CM variants. Figure 28 illustrates the CM ordering path starting at the strong boundaries, namely the wall boundary (cf. zoom) and the outer boundary. As we can observe, CM follows exactly the pattern creating level structures (Fig. 24). We can notice in Fig. 29

Figure 28: CM ordering starting at the strong boundaries

that the shapes from CM and advancing front look almost identical. The difference is that the nodes which are created when the advancing fronts collide and seem "randomly" numbered in AF have been brought into the pattern by CM. However,

Figure 29: Advancing-front ordering shape(left). CM outer-wall (right)


whereas we expected convergence acceleration from this cleaner technique, AFM gave faster convergence, as shown in Fig. 30. From Fig. 30 we observe that the fastest CM convergence is obtained by starting at the wall and reversing the ordering; the reversal seems to have an effect on the convergence speed, making it faster than the original CM-wall case. Concerning the optimized ordering shape found in Fig. 29, namely CM outer-wall reversed, it does not give a faster convergence but, surprisingly, an even slower one. As one would hope, a random numbering gives the slowest convergence.

Figure 30: Ordering techniques convergences for unstructured grids

Paper C – Implementation of Implicit LU-SGS method with Line-implicit scheme on Stretched Unstructured Grids

LU-SGS with line-implicit method

By combining the LU-SGS solver with a line-implicit method we expect to improve the convergence on the stretched grids needed for the resolution of boundary layers and wakes in RANS computations. The idea is similar to the Runge-Kutta line-implicit implementation, but replaces the Runge-Kutta Multigrid smoother by LU-SGS. There are many possible variants. The current implementation is a consistent splitting of the Jacobian into a sum of contributions from neighbour nodes on the lines, which produces block-tridiagonal matrices, and the rest. The system obtained by implicit Euler time discretization is then solved by approximate factorization, using Gaussian elimination for the block-tridiagonal matrices in step one and LU-SGS in step two. The combined algorithm looks as follows in the regions of stretched grid. Equations 36 and 37 give a first


theoretical approach investigated for solving this new problem,

$$\left(I + \frac{\Delta t}{V}\frac{\partial R_1}{\partial U}\right)\Delta U^1 = -\frac{\Delta t}{V}\,R \qquad (36)$$

$$\left(I + \frac{\Delta t}{V}\frac{\partial R_2}{\partial U}\right)\Delta U = \Delta U^1 \qquad (37)$$

with $R_1$ and $R_2$ the residuals of the line-implicit and LU-SGS parts respectively. As can be seen, two linear systems have to be solved. The first one, from the line-implicit method, applies only to the stretched domain and is solved by a backward Euler method. The solution found then becomes the RHS of the second linear system, which is solved by LU-SGS. Fig. 31 shows that for a moderately stretched grid the convergence of LU-SGS is comparable to Runge-Kutta line-implicit. However, LU-SGS deteriorates strongly for more stretched meshes (10−6), where the line-implicit LU-SGS is much more efficient.
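A small numerical sketch of the two-step solve of Eqs. 36-37 is given below on 2 x 2 model "operators": the first system is solved for ΔU¹ with the line-implicit part of the Jacobian, and ΔU¹ then becomes the right-hand side of the second system solved with the remaining part. The matrices and the direct 2 x 2 solve are purely illustrative; Edge uses block-tridiagonal elimination along lines and LU-SGS sweeps instead.

```fortran
! Sketch of the approximate factorization of Eqs. 36-37 on a 2x2 model problem:
! step 1 solves the line-implicit system for dU1, step 2 uses dU1 as the RHS of
! the LU-SGS system (illustrative matrices and a direct 2x2 solve, not Edge code).
program split_solve_sketch
  implicit none
  real(8), parameter :: dt_over_v = 0.5d0
  real(8) :: eye(2,2), j1(2,2), j2(2,2), m1(2,2), m2(2,2)
  real(8) :: r(2), du1(2), du(2)

  eye = reshape([1.d0, 0.d0, 0.d0, 1.d0], [2,2])
  j1  = reshape([2.d0, 1.d0, 1.d0, 2.d0], [2,2])   ! stands in for dR1/dU (line part)
  j2  = reshape([1.d0, 0.d0, 0.d0, 1.d0], [2,2])   ! stands in for dR2/dU (the rest)
  r   = [1.d0, -1.d0]                              ! stands in for the residual R

  m1 = eye + dt_over_v * j1
  m2 = eye + dt_over_v * j2
  du1 = solve2(m1, -dt_over_v * r)    ! Eq. 36: (I + dt/V dR1/dU) dU1 = -(dt/V) R
  du  = solve2(m2, du1)               ! Eq. 37: (I + dt/V dR2/dU) dU  = dU1

  print *, 'dU1 =', du1
  print *, 'dU  =', du

contains
  function solve2(a, b) result(x)     ! direct solve of a 2x2 system by Cramer's rule
    real(8), intent(in) :: a(2,2), b(2)
    real(8) :: x(2), det
    det = a(1,1)*a(2,2) - a(1,2)*a(2,1)
    x(1) = ( b(1)*a(2,2) - a(1,2)*b(2)) / det
    x(2) = ( a(1,1)*b(2) - b(1)*a(2,1)) / det
  end function solve2
end program split_solve_sketch
```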

Figure 31: Second order computations on stretched grid. 10−5 (left) and 10−6 chord length (right)


Conclusions

An LU-SGS solver has successfully been implemented in Edge. It gives significant convergence acceleration compared to explicit schemes for isotropic grids. Performance optimization of the code produced convergence rates five times faster in iterations and twice as fast in time as the explicit schemes. Moreover, LU-SGS has shown a strong sensitivity to the ordering technique, and finding the optimal ordering of an unstructured grid is a hard problem; the results obtained are mostly illustrative rather than conclusive in the search for a good numbering scheme. A first implementation of LU-SGS with the line-implicit method shows that the combination improves the convergence for highly stretched grids. Further study has to be done on the best combination of line-implicit smoothing with LU-SGS for the extension to RANS simulations.

References

[1] Edge, theoretical formulation. Technical Report 03-2870, FOI, 2010.

[2] A. Borzì. Introduction to multigrid methods.

[3] P. Eliasson, P. Weinerfelt, and J. Nordström. Application of a line-implicit scheme on stretched unstructured grids. 47th AIAA Aerospace Sciences Meeting, January 2009. Paper 2009-163.

[4] J.S. Kim and O.J. Kwon. An efficient implementation of implicit operator for block LU-SGS method. Computational Fluid Dynamics, pages 154-159, July 2005.

[5] H. Luo, D. Sharov, J. D. Baum, and R. Löhner. Implementation of unstructured grid GMRES+LU-SGS method on shared-memory, cache-based parallel computers. AIAA, 2000. Paper 2000-0927.

[6] R. P. Dwight. Efficiency improvements of RANS-based analysis and optimization using implicit and adjoint methods on unstructured grids. Faculty of Engineering and Physical Sciences, University of Manchester, 2006.

[7] Intel® VTune™ Amplifier XE 2011 getting started tutorials for Linux* OS. Technical Report 324207-003US, Intel Software Development Tools. http://developer.intel.com.

[8] Cray. Using Cray® Performance Analysis Tools. Technical Report S-2376-41, Cray Inc., 2006, 2007.

[9] A. George and J. W-H. Liu. Computer Solution of Large Sparse Positive Definite Systems. Prentice-Hall, Englewood Cliffs, New Jersey, 1981.

[10] S. Le Borne. Ordering techniques for two- and three-dimensional convection-dominated elliptic boundary value problems.

[11] J. E. Anker and J. F. Mayer. A preconditioned solution scheme for the simulation of turbomachinery flow at arbitrary Mach numbers. AIAA, 2004.

[12] A. Madrane and E. Tadmor. Entropy stability of Roe-type upwind finite volume methods on unstructured grids. Proceedings of Symposia in Applied Mathematics, Volume 67.2, 2009.

[13] K. Frew and D.W. Zingg. On artificial dissipation models for viscous airfoil computations. AIAA, 1996. Paper 96-1970.

[14] Brussels Free University and FFA Sweden. Study note on multigrid methods, 1990.

[15] A. Jameson and S. Yoon. Lower-upper implicit schemes with multiple grids for the Euler equations. AIAA, Vol. 25, No. 7, pages 929-935, 1987.

[16] D.G. Feingold and R.S. Varga. Block diagonally dominant matrices and generalizations of the Gerschgorin circle theorem. Pacific J. Math., pages 1241-1250, 1962.

[17] M.J. Grismer, W.Z. Strang, R.F. Tomaro, and F.C. Witzeman. Cobalt: A parallel, implicit, unstructured Euler/Navier-Stokes solver. Advances in Engineering Software, Vol. 29, No. 3-6, pages 365-373, 1998.

[18] A. Jameson and E. Turkel. Implicit schemes and LU decompositions. Mathematics of Computation, Vol. 37, No. 156, pages 385-397, 1981.

[19] S. Yoon and D. Kwak. Implicit Navier-Stokes solver for three-dimensional compressible flows. AIAA, Vol. 30, No. 11, pages 2653-2659, 1992.

[20] A. Nejat and C. Ollivier-Gooch. A high-order accurate unstructured GMRES algorithm for inviscid compressible flows. AIAA.

[21] S. Goedecker and A. Hoisie. Performance Optimization of Numerically Intensive Codes. SIAM, 2001.

Part II

Appended Papers


Convergence Acceleration of the CFD code Edge by LU-SGS E. Otero Dept. Aeronautics, KTH, SWEDEN P. Eliasson FOI, SWEDEN Keywords: compressible CFD, convergence acceleration, implicit time-stepping, matrix dissipation

Abstract

Edge is a flow solver for unstructured grids based on a dual grid and edge-based formulation. The standard dual-time stepping methods for compressible unsteady flows are inadequate for large-scale industrial problems. This has motivated the present work, in which an implicit Lower-Upper Symmetric Gauss-Seidel (LU-SGS) type of relaxation has been implemented in the code Edge with multigrid acceleration. Two different types of dissipation, a scalar and a matrix model, have been constructed which increase the diagonal dominance of the system matrix but not the numerical viscosity of the computed solution. A parametric study demonstrates convergence acceleration by a factor of three for inviscid transonic flows compared to explicit Runge-Kutta smoothing for multigrid acceleration.

1 Introduction

Convergence acceleration techniques are needed to speed up Computational fluid dynamics (CFD) solvers, which have a major role in industries and research agencies for aerodynamic design and optimization of

aircraft. The flow is often characterized by high-speed and compressible flow features and, in many cases, by massive flow separation and unsteadiness. Accurate and efficient numerical solution by means of time-dependent computations is hence required. The efficiency of standard dual-time stepping methods used for unsteady flows in many CFD codes is inadequate for large-scale industrial problems. Edge [1] is the Swedish national CFD tool for compressible flow, used at the Swedish aircraft manufacturer SAAB, and developed at FOI, lately in collaboration with external national and international partners. Edge uses an edge-based formulation on unstructured node-centered finite-volume grids based on a dual grid supplied by a preprocessor. Multigrid acceleration has been implemented in Edge as a strategy to speed up the convergence rate [1], based on the FAS (Full Approximation Scheme) [2] as a nonlinear approach. It treats the nonlinear equation system using explicit Runge-Kutta time stepping [1] as smoother. The line-implicit method of acceleration in Edge combines the explicit multistage Runge-Kutta scheme with an implicit integration along lines in regions where the computational grid is highly stretched. Implicit methods can efficiently handle problems with disparate scales, since they have no stability restriction on


the time-step; see e.g. [3] for a detailed analysis. Therefore, implicit methods are widely used to accelerate the convergence of steady-state problems as well as to improve the efficiency of unsteady solvers by advancing the solution with substantially larger time (or pseudo-time) steps. This has motivated the present work to speed up the computations by implicit time-stepping using iterative solvers such as LU-SGS [4][5][6][7]. The work is devoted to the definition of a proper block matrix implicit operator. Increasing the weight of the diagonal blocks of the system matrix is beneficial for the convergence of the linear iteration [7]. Hence, linearizations which produce the coefficients of the linear system Left Hand Side (LHS) have been implemented, based on two different types of implicit operators, a scalar and a matrix model. Suitable combinations of LHS and Right Hand Side (RHS) dissipation models show significant convergence acceleration compared to explicit schemes. Since several parameters are important for controlling the convergence, an extensive parametric study has been conducted.

2 Governing Equations

Edge solves the compressible Reynolds-Averaged Navier-Stokes (RANS) equations, expressed as in Eq. (1),

$$\frac{\partial U}{\partial t} + \nabla\cdot F_I + \nabla\cdot F_V = Q \qquad (1)$$

where $U = (\rho,\ \rho u,\ \rho v,\ \rho w,\ \rho E)$ is the vector of conserved variables, $F_I$ and $F_V$ the inviscid and viscous flux matrices respectively, and $Q$ the source vector. Equation (2) shows the corresponding integral form for a control volume $\Omega$ bounded by the surface $S$,

$$\int_\Omega \frac{\partial U}{\partial t}\, d\Omega + \oint_S \left(F_I + F_V\right) n\, dS = \int_\Omega Q\, d\Omega \qquad (2)$$

A finite-volume discretization, Eq. (3), is obtained by applying the integral formulation to the control volume surrounding the node $v_0$, as illustrated in Fig. 1.

Fig. 1 Dual grid forms control volume

$$\frac{d}{dt}\left(U_0 V_0\right) + \sum_{k=1}^{m_0} F_{I_{0k}}\, n_{0k} S_{0k} + \sum_{k=1}^{m_0} F_{V_{0k}}\, n_{0k} S_{0k} = Q_0 V_0 \qquad (3)$$

with $m_0$ the number of neighbours and $S_{0k}$ the surfaces enclosing the control volume for node $v_0$. $F_{I_{0k}}$ and $F_{V_{0k}}$ are the inviscid and viscous flux vectors respectively, computed on the edge connecting nodes $v_0$ and $v_k$. This paper focuses on the unsteady (2D) Euler equations, Eq. (4), where the viscous flux is neglected,

$$\frac{d}{dt}\left(U_0 V_0\right) + \sum_{k=1}^{m_0} F_{I_{0k}}\, n_{0k} S_{0k} = 0 \qquad (4)$$

3 Dissipation models

Second order schemes are non-dissipative, so dissipation must be added to stabilize the time-stepping. Edge considers two schemes [1] for the inviscid flux $F_{I_{0k}}$, both based on a central discretization, with dissipation terms of either Jameson artificial dissipation type or upwind flux difference splitting type.


3.1 Jameson flux approximation

Equation (5) shows the inviscid flux computation across the cell face between nodes $v_0$ and $v_1$ using variable averaging,

$$F_{I_{01}} = F_I\left(\frac{U_0 + U_1}{2}\right) - d_{01} \qquad (5)$$

The construction of the artificial dissipation term $d_{01}$, Eq. (6), is based on the Jameson pressure switch in the $\varepsilon$-factors, which activate the second-order difference dissipation and turn off the fourth-order difference dissipation in the vicinity of shocks,

$$d_{01} = \left[\varepsilon_{01}^{(2)}\left(U_1 - U_0\right) - \varepsilon_{01}^{(4)}\left(\nabla^2 U_1 - \nabla^2 U_0\right)\right]\varphi_{01}\lambda_{01} \qquad (6)$$

The factor $\varphi_{01}$ is introduced to account for the stretching in the grid. This scheme is called scalar because of the scalar coefficient of the variable differences, the local spectral radius $\lambda_{01}$ (see Eq. (7)),

$$\lambda_{01} = \left(|u_{01}\cdot n_{01}| + c_{01}\right) S_{01} \qquad (7)$$

where $u_{01}$, $c_{01}$, $S_{01}$, and $n_{01}$ denote the flow velocity, speed of sound, area, and normal of the cell face, respectively. Since the spectral radius corresponds to the maximum eigenvalue, the Jameson model is known to be very dissipative near the shocks. On the other hand it is characterized by its simplicity, a controllable amount of artificial dissipation, and robust convergence.

3.2 Upwind scheme

The upwind scheme is of Roe flux difference splitting type. The inviscid flux formulation can be expressed as for the basic central scheme but with a flux averaging (Eq. (8)),

$$F_{I_{01}} = \frac{1}{2}\left(F_I(U_0) + F_I(U_1)\right) - d_{01} \qquad (8)$$

The upwind dissipation $d_{01}$ uses the "Roe" matrix $A_R = A(\tilde U)$, $A_R = R\,\tilde\Lambda\, R^{-1}$, where $\tilde U$ is the Roe average of the states $U_0$ and $U_1$, as shown in Eq. (9),

$$d_{01} = \frac{1}{2}\, R\,|\Lambda^*|\, R^{-1}\left(U_1 - U_0\right) \qquad (9)$$

The matrix $R$ contains the right eigenvectors of the flux Jacobian. The diagonal matrix $\Lambda^*$ contains the eigenvalues adjusted with an entropy fix, as shown in Eq. (10), to prevent expansion shocks where an eigenvalue vanishes,

$$|\lambda_i^*| = \begin{cases} \dfrac{\lambda_i^2 + \delta^2}{2\delta}, & |\lambda_i| \le \delta, \\[4pt] |\lambda_i|, & |\lambda_i| > \delta, \end{cases} \qquad (10)$$

$\delta$ is taken as a small fraction of the spectral radius, usually around 5%. This expression is applied to the linear and non-linear eigenvalues independently, through two different $\delta$ variables. Moreover, a second-order upwind scheme is available in Edge by applying limiters to the absolute eigenvalue matrix $\Lambda^*$ of Eq. (9). Despite its slower convergence compared to the Jameson scheme, this method is preferable for its better accuracy.

3.3 Matrix dissipation

Matrix dissipation [8] [9] is a generalization of the two previous dissipation models. This model offers a combination of high accuracy around the shocks and good convergence, coming from the upwind and Jameson schemes respectively. The matrix dissipation replaces the spectral radius $\lambda_{01}$ by the absolute Roe matrix $R\,|\Lambda^*|\,R^{-1}$

CEAS 2011 The International Conference of the European Aerospace Societies

E. Otero, P. Eliasson

from Eq. (9). Equation (11) shows the final expression of this combination.



 

(4)  ε01  2U 1   2U 0 λ01  01

 Fi Edges

(11)

This formulation offers dissipation control through the entropy fix parameters of the Roe matrix. One can notice that the strong dissipation of the spectral radius is kept in the smooth regions. 4

ΔU n  U n  1  U n Ri 

(2) U 1  U 0 R Λ* R 1 d 01  ε01



with,

The LHS and RHS term definitions are in the core of the paper, controlling convergence and accuracy respectively. A special effort is made to separate the parameters affecting the LHS from the ones related to the RHS. Hence the work can be focused in the increase of the diagonal dominance without affecting the accuracy of the solution.
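To make the blending in Eq. (11) concrete, the following sketch (Python; illustrative only, not the Edge implementation) builds the dissipation on a face from a given flux Jacobian A: the second-difference part is multiplied by the entropy-fixed absolute Roe matrix, while the fourth-difference part keeps the scalar spectral radius. For simplicity the absolute matrix is obtained here from a numerical eigendecomposition rather than the closed-form Roe eigenvectors, and face area and metric factors are omitted; the switches eps2 and eps4 are assumed to come from the Jameson pressure switch.

```python
import numpy as np

def absolute_roe_matrix(A, delta_fraction=0.05):
    """R |Lambda*| R^-1 with the entropy fix of Eq. (10) applied to the
    eigenvalue moduli. Numerical eigendecomposition used for illustration."""
    lam, R = np.linalg.eig(A)
    lam = np.abs(lam.real)
    rho = lam.max()                                   # spectral radius
    delta = delta_fraction * rho
    lam_star = np.where(lam >= delta, lam, (lam**2 + delta**2) / (2.0 * delta))
    return (R @ np.diag(lam_star) @ np.linalg.inv(R)).real, rho

def matrix_dissipation(U0, U1, d2U0, d2U1, A, eps2, eps4, phi=1.0):
    """Face dissipation in the spirit of Eq. (11): the second difference is
    scaled by the absolute Roe matrix, the fourth difference by the scalar
    spectral radius."""
    absA, lam01 = absolute_roe_matrix(A)
    return phi * (eps2 * (absA @ (U1 - U0)) - eps4 * lam01 * (d2U1 - d2U0))
```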

4 Implicit time-stepping

A non-linear steady state problem for inviscid flow is solved by time marching the unsteady Euler equations [1] from Eq. (4). The resulting equations in Eq. (12) are discretized in time by an implicit Backward Euler (“BE”) method. Equation (13) corresponds to the final discretized system.

$$V\,\frac{dU}{dt} + R(U) = 0 \qquad (12)$$

$$V\,\frac{U^{n+1} - U^{n}}{\Delta t} + R(U^{n+1}) = 0 \qquad (13)$$

After linearization of Eq. (13),

$$R(U^{n+1}) \approx R(U^{n}) + \frac{\partial R(U^{n})}{\partial U}\,\Delta U^{n} \qquad (14)$$

the linear system Eq. (15) appears and is to be solved approximately,

$$\underbrace{\left[\frac{V}{\Delta t} + \frac{\partial R(U^{n})}{\partial U}\right]}_{M\ (\mathrm{LHS})}\ \underbrace{\Delta U^{n}}_{x} = \underbrace{-R(U^{n})}_{b\ (\mathrm{RHS})} \qquad (15)$$

with $\Delta U^{n} = U^{n+1} - U^{n}$ and $R_i = \sum_{\mathrm{Edges}} F_i$.

The LHS and RHS term definitions are at the core of the paper, controlling convergence and accuracy respectively. A special effort is made to separate the parameters affecting the LHS from the ones related to the RHS. Hence the work can focus on increasing the diagonal dominance without affecting the accuracy of the solution.

5 LU-SGS linear solver

The linear system has a sparse coefficient matrix, but direct elimination methods require excessive computing resources for multi-million grid-point geometries.

5.1 Algorithm description

The Lower-Upper Symmetric Gauss-Seidel (LU-SGS) linearization technique [4] solves the system very approximately but has a small memory requirement. The challenge is in reducing the computational time by increasing the convergence rate through the definition of a proper approximation to the exact matrix $M$. The algorithm [7] is based on a number of forward and backward sweeps, Eq. (16), where $L$ is the block lower triangular part, $D$ the block diagonal, and $U$ the block upper triangular part of $M$. In the current implementation, a block is the set of unknowns per node.

$$Mx = (L + D + U)x = b$$
$$(D + L)x^{*} = b - Ux^{n}$$
$$(D + U)x^{n+1} = b - Lx^{*} \qquad (16)$$

A fixed number of sweeps is performed in a time-step. The time marching process continues until the residual of the non-linear system vanishes as in Eq. (17).


$$\left\|R(U^{n})\right\| \leq \mathrm{tol} \qquad (17)$$

Note that $M$ controls the convergence rate but does not influence the converged solution. $M$ should be a sufficiently accurate approximation to the exact derivative to make the linearization accurate, and, for convergence of the sweeps, it should be block-diagonally dominant. Since the diagonal elements come from the artificial dissipation model, the two requirements conflict. The rest of the paper focuses on the choice and construction of the matrix $M$ for convergence acceleration.

5.2 LHS dissipation models

The linearizations, i.e. the coefficients in the linear system left hand side, are based on two different types of implicit operators, a scalar and a matrix model.

5.2.1 Scalar dissipation

This first implementation of the implicit operator (LHS) is based on the Yoon and Kwak [10] flux Jacobian splitting, Eq. (18),

$$A^{\pm} = \frac{1}{2}\left(A \pm I\,\beta\,\lambda_{max}\right) \qquad (18)$$

with $A = \dfrac{\partial F_I}{\partial U}$ the flux Jacobian matrix, $\lambda_{max} = \left|u\cdot n\right| + c$ the spectral radius of $A$, and $\beta$ a constant parameter to adapt. The flux function considered for the LHS can be expressed by Eq. (19) for node $v_i$ and its neighbor $v_n$, with the corresponding splitting when differentiating it, Eq. (20).

$$F_{i,n} = \frac{1}{2}\left[F(U_n) + F(U_i) - \beta\,\lambda_{max}\left(U_n - U_i\right)\right] \qquad (19)$$

$$\frac{\partial F(U_i,U_n)}{\partial U_n} = \frac{1}{2}\left[\left(\frac{\partial F}{\partial U}\right)_{n} - I\,\beta\,\lambda_{max}\right], \qquad \frac{\partial F(U_i,U_n)}{\partial U_i} = \frac{1}{2}\left[\left(\frac{\partial F}{\partial U}\right)_{i} + I\,\beta\,\lambda_{max}\right] \qquad (20)$$

5.2.2 Matrix dissipation

This extension [11] is based on the previous splitting method with the only difference that the flux derivative corresponds to the derivative of the exact RHS of Eq. (8). This means that the Jacobian matrix is now expressed as in Eq. (21),

$$A^{\pm} = \frac{1}{2}\left(A \pm \left|A''\right|\right), \qquad \left|A''\right| = R\left|\Lambda''\right|R^{-1} \qquad (21)$$

with $\left|A''\right|$ the absolute Roe matrix, and the derivatives are expressed by Eq. (22),

$$\frac{\partial F(U_i,U_n)}{\partial U_n} = \frac{1}{2}\left[\left(\frac{\partial F}{\partial U}\right)_{n} - \left|A''\right|_{i,n}\right], \qquad \frac{\partial F(U_i,U_n)}{\partial U_i} = \frac{1}{2}\left[\left(\frac{\partial F}{\partial U}\right)_{i} + \left|A''\right|_{i,n}\right] \qquad (22)$$

ignoring the derivatives of the $\left|A''\right|$ matrix. The new notation for the eigenvalue matrix $\Lambda''$ instead of $\Lambda^{*}$, as seen in Eq. (9), underlines the possibility of acting on the dissipation matrices of the LHS and RHS independently.
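The sketch below (Python; the per-edge interface and names are illustrative assumptions, not the Edge code) shows the edge contribution of Eq. (20) to the blocks of $M$ in Eq. (15): the own-node block gains $+\tfrac{1}{2}\beta\lambda_{max} I$, which is what builds up block diagonal dominance, while the coupling block loses it.

```python
import numpy as np

def edge_lhs_blocks(A_i, A_n, lam_max, beta=1.0):
    """Per-edge LHS blocks of Eq. (20) for the flux F(U_i, U_n) seen from node i.
    A_i, A_n : flux Jacobians dF/dU at the two nodes (nvar x nvar)
    lam_max  : spectral radius |u.n| + c on the edge"""
    I = np.eye(A_i.shape[0])
    dF_dUi = 0.5 * (A_i + beta * lam_max * I)   # accumulated into the diagonal block of node i
    dF_dUn = 0.5 * (A_n - beta * lam_max * I)   # off-diagonal block coupling i to n
    return dF_dUi, dF_dUn
```

The full $M$ is then obtained by adding $V/\Delta t$ to every diagonal block and looping over all edges with the appropriate normal orientation; the matrix dissipation variant of Eq. (22) is obtained, under the same assumptions, by replacing $\beta\lambda_{max} I$ with $|A''|_{i,n}$.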

6 Numerical results

Several simulations have been run to assess the influence of the multitude of parameters in the matrix dissipation LU-SGS scheme. The accelerated methods are compared to the present standard of FAS multigrid with explicit Runge-Kutta smoothing. All the cases have been computed for transonic inviscid flow over a NACA 0012 airfoil, on an inviscid mesh and a


mesh for RANS simulation. The computations have been converged to residuals of 10^-10.

6.1 Inviscid mesh

The first cases have been computed on an inviscid mesh for inviscid transonic Mach 0.8 flow at 1.25° angle of attack. The results show the influence of different parameters on the convergence rate compared to explicit schemes, and give a better comprehension of the models used. The parametric study is based on a first order upwind scheme for the RHS.

6.1.1 LHS with scalar dissipation

Figure 2 shows convergence with full multigrid for CFL in (10, 10^5) and the number of sweeps per time-step Nswp set to 5. The number of iterations is reduced by a factor of 2.5 compared to the explicit Runge-Kutta scheme for tolerances in 10^-4 ... 10^-10. The rate is not very sensitive to the choice of CFL in (100, 10000).

Fig. 2 Convergence dependence on CFL

Figure 3 shows convergence for different sweep numbers (Nswp) and CFL = 1000. There are two distinct phases: the initial phase for "large" residuals > 10^-5, and the asymptotic phase with very small residuals. One may surmise that the non-linearities are negotiated in the first phase and that, in the latter phase, the residual is very close to a linear function, so the scheme is actually an iterative solution of a linear system. The initial convergence is clearly faster for larger Nswp, but the ultimate convergence rate seems only slightly influenced by Nswp. In all three cases, the number of iterations is reduced by a factor of 1.5 compared to Runge-Kutta.

Fig. 3 Convergence dependence on number of sweeps Nswp

The coefficient β from Eq. (18) has been varied, and the value of 1, giving full upwinding in the system matrix, is found to be optimal; β > 1 causes a convergence degradation. For design calculations, a second order accurate scheme is necessary. In Fig. 4, LU-SGS shows more rapid convergence for a 1st order upwind residual scheme compared to a 2nd order scheme with its smaller dissipation, as expected. Moreover, one observes an increased convergence rate after reaching a residual of the order of 10^-5 (see also Fig. 2), contrary to the explicit Runge-Kutta, which keeps a linear convergence. Additional computations show that this LU-SGS behavior disappears when the multigrid acceleration is switched off.


Fig. 4 Influence of space discretization scheme

6.1.2 LHS with matrix dissipation

Similar observations have been made using matrix dissipation in the LHS. The entropy fix control described in Eq. (10) seems not to have much effect on the convergence rate, except for mitigating the CFL number limitation. As observed in Fig. 5, an entropy fix of 90% is needed to reach a CFL of 10000. Bigger values decrease the convergence, as noticed for the LHS scalar dissipation.

Fig. 5 Entropy limiting CFL

6.1.3 Synthesis

Figures 6 to 8 compare the convergence accelerations using the "optimal" LHS dissipation operator for each RHS scheme, namely the Jameson, matrix, and 2nd order upwind schemes. The explicit Runge-Kutta shows in all cases a convergence rate about twice as slow. The computations have been run for a CFL of 1000 and Nswp = 2.

Fig. 6 RHS Jameson dissipation

Fig. 7 RHS Matrix dissipation

Fig. 8 RHS 2nd order Upwind scheme


6.2 Stretched mesh for RANS flow

As a transition to viscous flow simulations, inviscid transonic Mach 0.754 flow at 2.57° angle of attack has been computed on a mesh made for RANS simulation. Figure 9 gives an example of a computation with matrix dissipation LHS and a 1st order upwind scheme RHS. The comparison is done with the line-implicit method. The grid has a normal distance from the airfoil to the first interior node of 10^-5 chord lengths, about 310 nodes on the airfoil, and a varying number of quadrilateral cells in the wall normal direction with about 35 layers on average. The triangular grid outside of these layers has about 51 K nodes. The maximum aspect ratio of the stretching is 1.5 × 10^3. Four levels of full multigrid W-cycles are used in all calculations. Directional coarsening with coarsening ratios 1:4 or 1:2 along the lines is used. Previous computations have shown a negligible gain from increasing the CFL number beyond 10^4. For an optimized configuration with CFL = 10000, implicit LU-SGS converges twice as fast as the line-implicit method.

Fig. 9 Convergence on a stretched grid

7 Conclusions

An implicit LU-SGS method has been successfully implemented, reducing the iteration count by a factor of 3 compared to the explicit schemes. Two LHS dissipation models, of scalar and matrix type, have been constructed to improve the diagonal dominance of the system matrix. First computations on the inviscid mesh have revealed optimal configurations

corresponding to a CFL of 10000 and matrix LHS dissipation. More study is necessary on stretched grids for an extension to RANS and turbulence models. Efficient implementation on computational clusters also remains.

References

[1] "Edge - Theoretical Formulation", Defence and Security, Systems and Technology, FOI, 2010.
[2] "Study Note on Multigrid Methods", Free University, Brussels and FFA Sweden, 1990.
[3] Eliasson P., Weinerfelt P., and Nordström J., "Application of a line-implicit scheme on stretched unstructured grids", 47th AIAA Aerospace Sciences Meeting Including The New Horizons Forum and Aerospace Exposition, Orlando, 2009.
[4] Kim J.S. and Kwon O.J., "An efficient implementation of implicit operator for block LU-SGS method", Computational Fluid Dynamics, 2005, pp. 154-159.
[5] Yoon S. and Kwak D., "Implicit methods for the Navier-Stokes equations", Computing Systems in Engineering, Vol. 1, No. 2-4, 1990, pp. 535-547.
[6] Luo H., Sharov D., Baum J.D. and Löhner R., "Parallel unstructured grid GMRES+LU-SGS method for turbulent flows", American Institute of Aeronautics and Astronautics, 2003.
[7] Dwight R.P., "Efficiency improvements of RANS-based analysis and optimization using implicit and adjoint methods on unstructured grids", Ph.D. Dissertation, University of Manchester, 2006.
[8] Anker J.E. and Mayer J.F., "A preconditioned solution scheme for the simulation of turbomachinery flow at arbitrary Mach numbers", 42nd AIAA Aerospace Sciences Meeting and Exhibit, Reno, 2004.
[9] Otero E., "Increasing the Efficiency of the CFD Tool Edge for Aircraft Design", Aerodays 2011, Madrid.
[10] Grismer M.J., Strang W.Z., Tomaro R.F. and Witzeman F.C., "Cobalt: A parallel, implicit, unstructured Euler/Navier-Stokes solver", Advances in Engineering Software, Vol. 29, No. 3-6, 1998, pp. 365-373.
[11] Nejat A. and Ollivier-Gooch C., "A High-Order Accurate Unstructured GMRES Algorithm for Inviscid Compressible Flows", American Institute of Aeronautics and Astronautics.


Performance analysis of the LU-SGS algorithm in the CFD code Edge

Evelyn Otero
Department of Aeronautical and Vehicle Engineering, Royal Institute of Technology, SE-100 44 Stockholm, Sweden

and Peter Eliasson
Department of Aeronautics and Autonomous Systems, Swedish Defense Research Agency (FOI), SE-164 90 Stockholm, Sweden

B.1 Introduction

Solving linear systems can require a large amount of data storage and computing time in diverse computational science and engineering applications. Two general classes of linear system solvers can be considered depending on the problem treated, namely direct and iterative solvers. Direct linear system solvers are well known for their accuracy; however, they require a large amount of memory. This constraint leads to the second class, the iterative methods, whose attraction comes from their robustness and speed for large problems. The LU-SGS (lower-upper symmetric Gauss-Seidel) [B.1-3] linear iterative solver is especially known for its low memory requirement, based on approximations of the Jacobians. Edge [B.4] is the Swedish flow solver developed at FOI; it works on unstructured grids and is based on a dual grid and an edge-based finite-volume formulation. It computes through an explicit Runge-Kutta time integration with multigrid acceleration. For viscous problems, a line-implicit method [B.5] is available, giving a strong convergence acceleration. Motivated by the efficiency of this last technique, an implicit LU-SGS solver has been implemented to speed up the convergence to steady state. In order to gain a maximum of speed-up from the new solver, it was important to analyze the performance of the LU-SGS algorithm. The convergence efficiency of a solver can be improved either by optimizing the code directly through high performance computing techniques or by working with external parameters. In this paper two performance tools, CrayPAT [B.6] and Intel Vtune [B.7], are used to analyze and optimize the performance of the code, providing useful information about memory access and time consumption. A preliminary study of the parallelization capabilities has additionally been carried out, in view of later parallel


versions. Then ordering techniques have been investigated as possible parameters for convergence acceleration.

B.2 Performance analysis tools

Basic timings have provided insufficient information for code optimization. Tools for performance analysis have therefore been selected to analyze the code Edge, namely CrayPAT [B.6] and Intel Vtune [B.7]. These tools work on Cray XE machines and Intel processors, respectively.

B.2.1 CrayPAT

Edge has been installed on a Cray XE machine which provides the Cray Performance Tools. CrayPAT [B.6] has been used to obtain information about the cache utilization and the peak performance by module or subroutine called during the execution of Edge. The setting of the variable PAT_RT_HWPC defines the data provided. One starts by setting it to 1, which gives the following information:
• User time
• FLOPS
• Computational intensity (ops/cycle)
• TLB utilization
• D1 (level 1 cache) hit/miss ratios
• D1 cache utilization (misses)
If a low peak performance is observed, PAT_RT_HWPC can be set to 2 to obtain more information about the level 1 and 2 data caches regarding their hits/misses, refills and bandwidth.

B.2.2 Intel Vtune Amplifier XE

The Intel(R) VTune(TM) Amplifier XE [B.7] is an Intel(R) Parallel Studio XE tool, offering a graphical user interface (GUI) with the following information:
• The most time-consuming (hot) functions in the application and/or on the whole system
• Sections of code that do not effectively utilize available processor time
• The best sections of code to optimize for sequential performance and for threaded performance


• Synchronization objects that affect the application performance
• Whether, where, and why the application spends time on input/output operations
• The performance impact of different synchronization methods, different numbers of threads, or different algorithms
• Thread activity and transitions
• Hardware-related bottlenecks in the code
The documentation [B.7] gives a complete specification. For the Edge analysis, the main feature of interest has been the localization of hotspots related to the corresponding lines in the code. The software was installed on a quad-core Intel Xeon machine. These tools suggest modifications of the code implementation for higher performance, e.g. to avoid bottlenecks. One can then consider external factors which could play a major role in the convergence rate. Depending on the type of solver, ordering techniques can have more or less influence on the speed of convergence.

B.3 LU-SGS vs. Jacobi iteration

The LU-SGS solver has been compared to the Jacobi iteration method. The main difference between the two algorithms concerns the data dependency. The LU-SGS algorithm [B.3] is based on a number of forward and backward sweeps, Eq. (1), where L is the block lower triangular part, D the block diagonal, and U the block upper triangular part of M. A block is the set of unknowns per node.

$$Mx = (L + D + U)x = b$$
$$(D + L)x^{*} = b - Ux^{n}$$
$$(D + U)x^{n+1} = b - Lx^{*} \qquad (1)$$

The corresponding block Jacobi iteration is given by Eq. (2).

$$Dx^{n+1} = b - (L + U)x^{n} \qquad (2)$$

The LU-SGS sweep immediately re-uses the data computed at one node at the next node. Unless the nodes are numbered in some special way, this introduces data dependency between loop iterations (i.e. between nodes), as shown in Fig. B.1. This implies that parallelization by domain decomposition is not straightforward. On the other hand, the corresponding loop construct of the block Jacobi iteration, Eq. (2), is data independent, since components of x^{n+1} are not re-used in iteration n. This means that the result is independent of the node numbering and that domain decomposition requires no extra synchronization.


Figure B.1: Data dependency in Gauss-Seidel iteration
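The loop structures of Eqs. (1) and (2) can be made concrete with a small scalar experiment (Python; a 1D Laplacian model problem, not the Edge equations, so the iteration counts are only indicative): the Jacobi update reads only the old iterate, while the symmetric Gauss-Seidel sweep re-uses freshly computed values within the sweep. Note that one symmetric sweep costs about as much as two Jacobi sweeps.

```python
import numpy as np

def jacobi_iterations(A, b, tol=1e-8, max_iter=100000):
    D = np.diag(A)
    x = np.zeros_like(b)
    for k in range(1, max_iter + 1):
        x = (b - (A @ x - D * x)) / D          # uses only the previous iterate
        if np.linalg.norm(b - A @ x) < tol:
            return k
    return max_iter

def sgs_iterations(A, b, tol=1e-8, max_iter=100000):
    n = len(b)
    x = np.zeros_like(b)
    for k in range(1, max_iter + 1):
        for i in range(n):                     # forward sweep re-uses fresh values
            x[i] = (b[i] - A[i, :i] @ x[:i] - A[i, i+1:] @ x[i+1:]) / A[i, i]
        for i in reversed(range(n)):           # backward sweep
            x[i] = (b[i] - A[i, :i] @ x[:i] - A[i, i+1:] @ x[i+1:]) / A[i, i]
        if np.linalg.norm(b - A @ x) < tol:
            return k
    return max_iter

n = 50
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # 1D discrete Laplacian
b = np.ones(n)
print("Jacobi sweeps:       ", jacobi_iterations(A, b))
print("Symmetric GS sweeps: ", sgs_iterations(A, b))
```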

Note that there are node graphs which can be numbered to obtain loop independence, like the red-black ordering for discrete Laplacians on structured 2D grids, which requires only a single extra synchronization point. That Gauss-Seidel converges twice as fast as Jacobi for the latter case is well known. No such theoretical result is available for the Edge equations, so we must resort to running test cases. The test case has been computed for inviscid transonic Mach 0.8 flow at 1.25° angle of attack over a NACA 0012 airfoil, on the isotropic mesh (Fig. B.2). The computations have been converged to residuals of 10^-10.

Figure B.2: Inviscid mesh for NACA 0012 airfoil at 1.25° angle of attack

Figures B.3a and B.3b show the convergence in iterations and in time for both solvers. Note that one inner iteration (sweep) of LU-SGS requires the same amount of computation as two in the Jacobi algorithm. Nswp is the number of sweeps per time-step. First consider convergence per iteration, Fig. B.3a. If this were a linear system of equations, we would expect the same residual for any combination of Nswp and niter with the same product. This is clearly not what happens. Indeed the ultimate convergence rate for LU-SGS as well as Jacobi seems independent of


Nswp. One explanation for this can be that already one sweep is a good enough smoother to make the multigrid iteration work once the residual has been reduced below a threshold around 10^-5. It is thus tempting to conclude that the "knee" in the convergence curves is associated with a change in the residual reduction mechanism. Two alternative theories have been put forward. One considers that the pre-knee mechanism is associated with advection and the post-knee process is dominated by diffusion. The other ascribes the pre-knee behavior to non-linearity, with the post-knee iterations operating on an effectively linear set of equations. Note also that the multigrid iteration is well understood to be efficient for "parabolic" residual propagation but less so for "hyperbolic", because all schemes built on time-stepping have some built-in dissipation: pure advection is never at hand. This lends some support to the advection-diffusion mechanism. The two schemes converge to a log10 residual of -10 after about the same number of iterations, for Nswp = 5 and Nswp = 15 respectively. With respect to time (Fig. B.3b), the lower Nswp are fastest, namely Nswp = 1 for LU-SGS and 5 for the Jacobi iteration. In overall time we can observe that LU-SGS(1) is almost 3 times faster than Jacobi(5), consistent with the factor in computational complexity.

(a) Convergence in iterations

(b) Convergence in time

Figure B.3: Convergence comparisons between Jacobi and LU-SGS

LU-SGS is thus preferred over Jacobi as a smoother for multigrid. However, its data dependency necessitates the study of its dependence on node ordering, which is the next thing to investigate.

B.4 Ordering methods

B.4.1 Motivation

That the numbering influences convergence is shown by the example in Fig. B.4, which compares the Cuthill-McKee ordering, well known as an ordering scheme for


elliptic finite element equations, with the ordering from the generation of the unstructured triangular grid by an advancing front technique. Inadequate orderings can even cause non-convergence. Therefore special attention will be given to the implementation of different ordering techniques for the LU-SGS solver. For a grid with N nodes, there are on the order of N! different numberings, so finding the one which gives the best convergence rate is a daunting task. The idea is to analyze a number of deterministic orderings associated with boundary conditions and the flow physics, in the hope of finding characteristics of suitable numberings for this specific LU-SGS solver.


Figure B.4: Ordering impact in LU-SGS

B.4.2 Reducing fill-in

The original Cuthill-McKee (CM) algorithm and its reverse version RCM are well known for minimizing the matrix profile, Fig. B.5. Their origin can be traced to the standard breadth-first search algorithm used in graph algorithms, regrouping the nodes by levels, Fig. B.6. All nodes belonging to the same level have the same distance to the starting node, the distance being defined as the number of edges traversed from the starting node. The only difference with the classic breadth-first search (BFS) is that the nodes on a level are numbered by increasing degree. The degree of a node is the number of nodes adjacent to it. Reversing the resulting ordering produces the RCM. The determination of a starting node has to be treated separately. It can be chosen among those with the lowest degree or as a so-called pseudo-peripheral node. Such a node is found through an algorithm constructing level structures (Fig. B.6) sequentially from the node with the smallest degree in the last level. The node belonging to the biggest level structure L_max is kept as the final starting node. More details about the Cuthill-McKee and other numbering schemes for the direct solution of linear systems can be found in [B.8].


Figure B.5: Profile matrix [B.8]

Figure B.6: Level structure
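A minimal sketch of the (reverse) Cuthill-McKee procedure described above is given below (Python; the adjacency structure, the starting-set handling and the pseudo-peripheral-node search are simplified assumptions). Nodes are visited level by level with a breadth-first search, neighbours being appended by increasing degree; reversing the result gives RCM. Passing several starting nodes reproduces, in spirit, the boundary-seeded variants of the next subsection. SciPy's `scipy.sparse.csgraph.reverse_cuthill_mckee` provides a production implementation for sparse matrices.

```python
from collections import deque

def cuthill_mckee(adjacency, start_nodes, reverse=False):
    """Cuthill-McKee ordering grown from a set of starting nodes.
    adjacency[i] is the list of nodes connected to node i by an edge."""
    degree = {i: len(adjacency[i]) for i in range(len(adjacency))}
    order, visited = [], set()
    queue = deque(sorted(start_nodes, key=degree.get))
    visited.update(queue)
    while queue:
        node = queue.popleft()
        order.append(node)
        # unvisited neighbours are appended by increasing degree (BFS levels)
        for nbr in sorted(adjacency[node], key=degree.get):
            if nbr not in visited:
                visited.add(nbr)
                queue.append(nbr)
    # any disconnected nodes are appended in their original numbering
    order += [i for i in range(len(adjacency)) if i not in visited]
    return order[::-1] if reverse else order

# toy example: a 2x3 structured patch numbered row by row
adj = {0: [1, 3], 1: [0, 2, 4], 2: [1, 5], 3: [0, 4], 4: [1, 3, 5], 5: [2, 4]}
print(cuthill_mckee(adj, start_nodes=[0]))                 # CM from a corner node
print(cuthill_mckee(adj, start_nodes=[0], reverse=True))   # RCM
```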

B.4.3 Graph theory, physics oriented

Based on the graph-theory methods, variants have been implemented which take into account the strong boundaries of the computational domain. The Cuthill-McKee algorithm is started after numbering a selection of nodes instead of a single node. The starting nodes can be chosen to include the solid boundary (airfoil surface), the outer boundary (free stream), or both of them. The CM numbering will follow the starting-vector pattern (for example an airfoil profile) and number the grid by approximate layers growing from the starting set. In this way one can control the direction of the numbering with respect to the flow direction. In a 2D simulation, the sense of the direction can also be controlled by reversing the ordering at each level. Figure B.7 illustrates ordering directions along or against the flow for a structured C-grid. In this domain, the boundaries are defined as the wall


boundary along the airfoil and the outer boundary limiting the computational domain. In the first case, Fig. B.7a, the lexicographic ordering happens to follow the streamlines of the flow. Reversing the lexicographic ordering (from "Latin" to "Chinese"), Fig. B.7b, makes the ordering cross the streamlines. Each level of connected nodes is represented by the dotted lines.

(a) Lexicographic ordering, also following the advection

(b) Un-convective ordering

Figure B.7: C-type mesh orderings

Different cases have been run to visualize the corresponding effects on the convergence rates regarding the ordering direction and the mesh topology (cf. Section B.5, Ordering techniques).

B.5 Numerical results and discussion

In this section the results of code modifications indicated by the performance tools and convergence rate dependence on the ordering are presented. Ordering is considered here as an external tool for convergence acceleration of the LU-SGS solver.

B.5.1 Performance tools

Edge with the original LU-SGS solver has been run with Intel Vtune [B.7] and CrayPAT [B.6] to identify the hotspots and to measure the cache misses and peak performance, respectively. Intel Vtune identified as the first hotspot the FORTRAN90 intrinsic function MATMUL used for the 5-by-5 matrix multiplications of LU-SGS, taking almost 50% of the total CPU time. Figure B.8 compares specialized matrix multiplication code with MATMUL. The "direct" method refers to the one-by-one explicit assignment of the 5-by-5 entries of the resulting matrix. The "loop" technique reduces this to a single assignment inside a double loop over the rows and columns. The chart clearly points out the direct method as the most efficient algorithm, with only 12% of the total CPU time. The CPU time is reduced by a factor of 1.5 compared to the original implementation.


Figure B.8: CPU time vs. matrix multiplication methods

CrayPAT has shown cache misses of the order of 0.2%, and a clear peak performance improvement has been obtained, as Fig. B.9 illustrates. The LU-SGS solver, initially at 7.4%, has reached an 11.2% peak performance just by replacing the MATMUL call with the direct matrix multiplication method.

Figure B.9: Peak performance comparison

As Fig. B.10 shows, the optimized code runs twice as fast as the original. After having optimized the code, we investigate acceleration of the LU-SGS solver convergence rate by external ordering techniques.



Figure B.10: Convergence optimization in time

B.5.2 Ordering techniques

The different ordering techniques have been tested for two cases, on a structured and on an unstructured mesh. The computations have been run with a second order upwind scheme on the right hand side of the linear system, using an upwind-type/matrix dissipation on the left hand side (paper A).

Structured meshes

The ordering from the generation of the structured mesh is a lexicographic ordering. Figure B.11 shows the resulting ordering shape, starting down along the wake and the airfoil wall boundary up to the outer layer of the C-type mesh.

Figure B.11: C-type mesh


The exact path taken by this ordering through the mesh is shown by vectors. Figure B.12 gives a vector-form illustration, with a zoom on the trailing edge and wake interaction area (Fig. B.13). We notice a unique, well-defined closed path along the wake and airfoil wall. No discontinuity or change in direction or sense is observable up to and including the last layer. We define Figs. B.12 and B.13 as the lexicographic ordering pattern. Figure B.13 illustrates precisely the path taken by the numbering process from 1 to 6 and the starting vector definitions, namely from node 1 to 257 and from node 1 to 208. These have been used for the implementation of the following ordering techniques.

Figure B.12: Pattern

Figure B.13: Zoom airfoil-wake

In addition to this original ordering, algorithms based on CM [B.8] methods have been tested and compared. Figure B.14 shows the convergence results for the best techniques found. As can be observed, the three methods give similar convergence speeds; however, the lexicographic ordering is the fastest. These best convergences


come from a Cuthill-McKee algorithm starting along the wake and airfoil wall in two slightly different ways, from node 1 to 208 and from node 1 to 257, reversing the first 49 nodes at each level (Fig. B.13). Convergence has been shown to be very sensitive to ordering. In fact, even small modifications of the starting vector of the CM algorithm strongly degraded or directly stopped the convergence.

Figure B.14: Convergences of LU-SGS for different ordering techniques

Unstructured meshes

The original ordering method coming from the unstructured mesh generation corresponds this time to an advancing front technique (Fig. B.15), with the airfoil in the center. This technique is well known to start at the defined boundaries, here the airfoil wall and the outer boundary. It then advances by numbering the mesh until the fronts collide, as can be observed from the free nodes remaining.

Figure B.15: Advancing-front ordering shape


The CM algorithm has been run for different starting vectors. Before going into the convergence analysis, the different techniques employed are presented. Two ways of visualizing the orderings are used. The first one is in two dimensions (x, y) and provides information about the ordering path, with its sense and the level structures; the axes (x, y) define the computational domain. Figure B.16 illustrates the ordering path starting at the strong boundaries, namely at the wall boundary (cf. zoom) and at the outer boundary. As we can observe, the CM algorithm follows exactly the pattern, creating level structures (Fig. B.6).

Figure B.16: CM ordering starting at the strong boundaries

Moreover, special interest has been taken in the ordering path around the airfoil wall, where a reversing-order technique has been applied as in the advancing front method, Fig. B.17.

Figure B.17: Ordering at the airfoil (Advancing front and CM outer-wall reverse)


The second visualization technique is 3D (x, y, z), with z as the node number. This method provides a unique shape for each ordering technique, useful for comparison. Figure B.18 gives the Cuthill-McKee ordering starting at the wall boundary together with its reverse version, namely the RCM.

Figure B.18: CM starting ordering at wall (left) and the RCM version (right)

These shapes can then be compared to the advancing front ordering (Fig. B.15). However, if we reproduce the shapes of the previous orderings starting at both boundaries, outer and wall (Fig. B.16), one can reproduce the advancing front ordering shape from Fig. B.15, with the optimization of removing the free nodes, as shown in Fig. B.19.

Figure B.19: CM ordering shape starting at the outer and wall boundary

A deeper comparison is done by zooming in on the internal tornado shapes, as illustrated in Fig. B.20. The denomination "outer-wall reverse" defines the Cuthill-McKee starting vector at the outer and wall boundaries with the ordering sense reversed as in Fig. B.17. As we can notice, the shapes look almost identical, with the only difference that the free nodes have been arranged properly in the general shape. However, whereas we expected convergence acceleration from this cleaner


Figure B.20: Shape comparisons. Zoom for Advancing Front and CM outer-wall reverse

technique, the advancing front method provides faster convergence. Figure B.21 shows the convergence of the cases presented. The denomination of the CM orderings corresponds to the definition of the starting vectors. From this figure we observe that the fastest convergence among the new orderings is given by CM starting at the wall but reversing the ordering. In fact the reversing technique seems to have an effect on the convergence speed, being faster than the original CM-wall case. Concerning the optimized ordering shape found in Fig. B.19, namely the CM outer-wall reverse, it does not give a faster convergence but surprisingly an even slower one. Moreover, an additional random ordering has been tested, giving as expected the slowest convergence. Many other techniques related to CM or to the general idea of advancing through the connecting nodes have been tried. Other types of techniques have also been tested, such as trying to follow the flow advection or reordering by increasing coordinates. However, none of them has given better convergence, and the ones presented in this paper are the best cases, namely CM starting at the strong boundaries. Hence the investigation of ordering methods has shown a strong impact on the LU-SGS convergence, where a small alteration of the order could stop the convergence. The LU-SGS convergence efficiency is thus strongly dependent on the ordering used, more or less so depending on whether the mesh is structured or unstructured. In fact the orderings for structured grids were much more prone to break the convergence. The final conclusion is that the best ordering methods were those coming originally from the mesh generation techniques, such as the advancing front. Additional tests on a Delaunay mesh have confirmed this conclusion, where the original ordering always gives the best convergence.


Figure B.21: Ordering techniques convergences for unstructured grids

Conclusion

The LU-SGS solver has been tuned for high performance computing and with external parameters such as ordering techniques. Modified algorithms for the small matrix multiplications increased the fraction of peak performance from 7% to 11% and gave an overall twice faster runtime. The ordering is important, but none of the algorithms investigated has given clear advantages on the test cases with structured and unstructured grids. Additionally, a preliminary parallelization study has been carried out, showing its complexity for an LU-SGS algorithm compared to the Jacobi iteration, which on the other hand converges more slowly. This last analysis will be considered for the parallelized version.

References

[B.1] J.S. Kim and O.J. Kwon. An efficient implementation of implicit operator for block LU-SGS method. Computational Fluid Dynamics, pp. 154-159, July 2005.
[B.2] H. Luo, D. Sharov, J. D. Baum and R. Löhner. Implementation of unstructured grid GMRES+LU-SGS method on shared-memory, cache-based parallel computers. AIAA, 2000. Paper 2000-0927.
[B.3] R. P. Dwight. Efficiency improvements of RANS-based analysis and optimization using implicit and adjoint methods on unstructured grids. Faculty of Engineering and Physical Sciences, University of Manchester, 2006.
[B.4] Edge, theoretical formulation. Technical Report 03-2870, FOI, 2010.


[B.5] P. Eliasson, P. Weinerfelt and J. Nordström. Application of a line-implicit scheme on stretched unstructured grids. 47th AIAA Aerospace Sciences Meeting, January 2009. Paper 2009-163.
[B.6] Cray, Using Cray® Performance Analysis Tools. Technical Report S-2376-41, Cray Inc., 2006, 2007.
[B.7] Intel(R) VTune(TM) Amplifier XE 2011 Getting Started Tutorials for Linux* OS. Technical Report 324207-003US, Intel Software Development Tools. http://developer.intel.com.
[B.8] A. George and J. W-H Liu. Computer Solution of Large Sparse Positive Definite Systems. Prentice-Hall, Englewood Cliffs, New Jersey, 1981.


Implementation of Implicit LU-SGS method with Line-implicit scheme on Stretched Unstructured Grids

Evelyn Otero
Department of Aeronautical and Vehicle Engineering, Royal Institute of Technology, SE-100 44 Stockholm, Sweden

and Peter Eliasson
Department of Aeronautics and Autonomous Systems, Swedish Defense Research Agency (FOI), SE-164 90 Stockholm, Sweden

Abstract

A first implementation of a line-implicit Lower-Upper Symmetric Gauss-Seidel (LU-SGS) scheme in Edge is considered in this paper for convergence acceleration on stretched meshes. The motivation comes from the lack of efficiency of the implicit LU-SGS when running on RANS meshes. Edge is a flow solver for unstructured grids based on a dual grid and edge-based formulation. The line-implicit method has been implemented in the code Edge as an acceleration method for explicit schemes on stretched meshes. This method works only in regions of stretched grid, where the flow equations are integrated implicitly in time along the structured lines. The combination with an implicit LU-SGS should remove the time step restriction of explicit schemes and accelerate the convergence to steady state. The results show that for highly stretched meshes, line-implicit LU-SGS can perform much faster than LU-SGS.

C.1 Introduction

Implicit schemes are becoming more and more widespread in the computational fluid dynamics (CFD) field for reducing simulation time. They can then be a preferable choice with respect to explicit schemes, even though the latter are easier to implement. The choice between an explicit and an implicit scheme arises when dealing with time-dependent problems or non-linear problems solved by time-stepping. Depending on the problem, the time step will have an influence either on the accuracy of the


solution for time-dependent problems or on the convergence to steady state for non-linear problems. In fact we speak about a physical time step and a pseudo-time step, respectively. In both cases the interest of an implicit method is to allow larger time steps without the stability restriction of an explicit scheme. Since this paper treats non-linear problems rather than time-accurate problems, we do not consider the accuracy of the solution per iteration but focus only on the speed of convergence to steady state. An implicit Lower-Upper Symmetric Gauss-Seidel (LU-SGS) [C.4-6] solver has been implemented in the code Edge to accelerate the convergence to steady state. The LU-SGS approximate factorization method is attractive because of its good stability properties and limited memory requirement. This method considers a linearization which is solved inexactly by a single step of a symmetric Gauss-Seidel method through a forward and a backward sweep. The new challenge in this paper is the combination of the implicit LU-SGS with the line-implicit method to speed up the convergence on stretched grids. Edge [C.1] is a flow solver for unstructured grids based on a dual grid and an edge-based, node-centered finite-volume discretization. Edge uses an explicit multistage Runge-Kutta time integration solver [C.1]. Multigrid (MG) is implemented as a strategy to speed up the convergence rate, based on the FAS (Full Approximation Scheme/Storage) [C.2] for non-linear problems. Another acceleration technique is the line-implicit method [C.3] for unstructured stretched meshes used in RANS computations. This method is originally combined with an explicit multistage Runge-Kutta scheme, and integrates implicitly in time along lines in regions where the computational grid is highly stretched. The rest of the computational domain is treated with an explicit scheme. In the following sections, the mathematical problem to be solved is first formulated, followed by the method of resolution, the LU-SGS solver. Then the line-implicit method is presented together with its combination with LU-SGS. The results give a first overview of this hybrid scheme for unstructured stretched meshes.

C.2 Problem formulation

C.2.1 Flow equations

The Reynolds-Averaged Navier-Stokes (RANS) equations are solved in the code Edge. They correspond to Eq. (1),

$$\frac{\partial U}{\partial t} + \nabla\cdot F_I + \nabla\cdot F_V = Q \qquad (1)$$

with $U = (\rho\ \ \rho u\ \ \rho v\ \ \rho w\ \ \rho E)^T$ the vector of conserved variables, $F_I$ and $F_V$ the inviscid and viscous flux matrices respectively, and $Q$ the source vector. An integral formulation of these equations over a control volume $\Omega$ bounded by a surface $S$ leads to Eq. (2).


$$\int_{\Omega} \frac{\partial U}{\partial t}\, d\Omega + \int_{S} \left(F_I + F_V\right)\cdot n\, dS = \int_{\Omega} Q\, d\Omega \qquad (2)$$

This integral formulation is then applied to a control volume surrounding node $v_0$ (Fig. C.1), corresponding to the finite-volume discretization, Eq. (3). A control volume is defined by the dual grid (dashed lines), on whose surfaces the flux vectors are computed.

Figure C.1: Dual grid forms control volume

$$\frac{d}{dt}\left(U_0 V_0\right) + \sum_{k=1}^{m_0} F_{I_{0k}}\, n_{0k}\, S_{0k} + \sum_{k=1}^{m_0} F_{V_{0k}}\, n_{0k}\, S_{0k} = Q_0 V_0 \qquad (3)$$

Since this paper treats an inviscid problem, the viscous fluxes $F_{V_{0k}}$ of Eq. (3) are neglected, giving the Euler equations, Eq. (4).

$$\frac{d}{dt}\left(U_0 V_0\right) + \sum_{k=1}^{m_0} F_{I_{0k}}\, n_{0k}\, S_{0k} = 0 \qquad (4)$$

The inviscid flux computations are based on central schemes of the form of Eq. (5), with a numerical dissipation $d$ of artificial type or of upwind flux difference splitting type [C.1].

$$F_{I_{01}} = F_I\!\left(\frac{U_0 + U_1}{2}\right) - d_{01} \qquad (5)$$


C.2.2 Implicit time stepping

The resolution of the Euler equations, Eq. 6, is carried out to solve our non-linear steady-state problem, using a time-marching approach. The Euler equations are solved in time by an implicit Backward Euler method, Eq. 7.

$$V\,\frac{dU}{dt} + R(U) = 0 \qquad (6)$$

$$V\,\frac{U^{n+1} - U^{n}}{\Delta t} + R(U^{n+1}) = 0 \qquad (7)$$

After linearization of Eq. (7) by Eq. (8),

$$R(U^{n+1}) \approx R(U^{n}) + \frac{\partial R(U^{n})}{\partial U}\,\Delta U^{n} \qquad (8)$$

we end up with a linear system of the form $Mx = b$, Eq. (9), to be solved approximately. The system matrix $M$ is of block type due to the flux Jacobians. In the current implementation, the block size is the set of unknowns per node.

$$\underbrace{\left[\frac{V}{\Delta t} + \frac{\partial R(U^{n})}{\partial U}\right]}_{M}\ \underbrace{\Delta U^{n}}_{x} = \underbrace{-R(U^{n})}_{b} \qquad (9)$$

In the rest of the paper, $M$ will be referred to as the left hand side (LHS) and $b$ as the right hand side (RHS).

C.3 LU-SGS solver

The Lower-Upper Symmetric Gauss-Seidel (LU-SGS) solver [C.4] is implemented as a smoother for multigrid, solving the system very approximately but with a small memory requirement. The algorithm consists of a number of forward and backward sweeps shown in Eq. (10),

$$Mx = (L + D + U)x = b$$
$$(D + L)x^{*} = b - Ux^{n}$$
$$(D + U)x^{n+1} = b - Lx^{*} \qquad (10)$$

where $L$ is the block lower triangular part, $D$ the block diagonal, and $U$ the block upper triangular part of $M$. An important requirement for the linear convergence is to increase the block diagonal dominance of the matrix $M$. On the other hand, $M$ has to be sufficiently accurate to ensure the non-linear convergence.


A flux Jacobian splitting of matrix dissipation type (paper A) has been used for the construction of the implicit operator $M$. Equation (11) defines the left hand side flux Jacobians of $M$,

$$\frac{\partial F(U_i,U_n)}{\partial U_n} = \frac{1}{2}\left[\frac{\partial F}{\partial U_n} - \left|A''\right|_{i,n}\right], \qquad \frac{\partial F(U_i,U_n)}{\partial U_i} = \frac{1}{2}\left[\frac{\partial F}{\partial U_i} + \left|A''\right|_{i,n}\right] \qquad (11)$$

with $\left|A''\right| = R\left|\Lambda''\right|R^{-1}$ the absolute Roe matrix, $R$ the matrix of the right eigenvectors of the flux Jacobian, and the diagonal matrix $\Lambda''$ containing the eigenvalues of the flux Jacobian matrix $A''$ adjusted with an entropy fix.

C.4 LU-SGS with line-implicit

The line-implicit method [C.3] consists of integrating the flow equations implicitly in time along structured lines in stretched regions of the grid, usually in the boundary layer. Figure C.2 shows the implicit lines (emphasized) in the structured stretched area surrounded by an unstructured grid.

Figure C.2: Implicit lines (emphasized) on stretched grid

C.4.1 Theoretical formulation

The mathematical formulation for the combined line-implicit LU-SGS solver has been derived as follows. We start by defining three residuals,

$$R1_k = \sum_{j \in L(k)} F_{kj} \qquad (12)$$

$$R_k = \sum_{j \in N_b(k)} F_{kj} \qquad (13)$$

$$R2_k = R_k - R1_k \qquad (14)$$


with k ∈ [1..N ] the node number where the residual R is computed, and N is the total number of nodes. F kj is the flux between nodes k and j. Then Nb (k) is the set of nodes connected to node k by edges, and L(k) is the subset of Nb (k) belonging to an implicit line. Then we can express the overall residuals as juxtapositions of the block residuals,

$$R1 = \left\{R1_k\right\}_{k=1\ldots N} \qquad (15)$$
$$R = \left\{R_k\right\}_{k=1\ldots N} \qquad (16)$$

By approximate factorization, we then solve the following linear system,

$$\left(I + \frac{\Delta t}{V}\frac{\partial R1}{\partial U}\right)\left(I + \frac{\Delta t}{V}\frac{\partial R2}{\partial U}\right)\Delta U = -\frac{\Delta t}{V}\, R(U^{n}) \qquad (17)$$

with $R1$ and $R2$ the residuals for the line-implicit and LU-SGS solvers, respectively.

C.4.2 Implementation

The implementation splits the problem of Eq. (17) into two steps, solving the first linear system, Eq. (18), by the line-implicit method and the second one, Eq. (19), by the LU-SGS solver.

$$\left(I + \frac{\Delta t}{V}\frac{\partial R1}{\partial U}\right)\Delta U_1 = -\frac{\Delta t}{V}\, R \qquad (18)$$

$$\left(I + \frac{\Delta t}{V}\frac{\partial R2}{\partial U}\right)\Delta U = \Delta U_1 \qquad (19)$$

The nodes are sequentially ordered along the implicit lines. The system matrix of Eq. (18) then becomes block tri-diagonal, so the first system, Eq. (18), can be solved by Gaussian elimination. The second linear system, Eq. (19), is solved by LU-SGS according to Eq. (10); as can be observed, its RHS comes from the solution of the first system, Eq. (18). This scheme is easily implemented by replacing the standard multi-stage Runge-Kutta scheme with a one-stage scheme, so that it becomes the backward Euler scheme. In the LU-SGS sweeps, the fluxes on the edges along the implicit lines (Fig. C.3) are then neglected in the formulation of the LHS matrices, so the result is a consistent approximate factorization applied to the implicit time step. Both methods, combined and separate, are applied on grids made for RANS computations at high Reynolds number. The thickness of the cells on the airfoil surface is of the order of a millionth of the chord length.


Figure C.3: LU-SGS region of computation. Elimination of the edges on implicit lines
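A sketch of the two-step solve of Eqs. (18)-(19) is given below (Python with dense blocks and assumed helper names such as `line_blocks` and `lusgs_solve`; the actual Edge implementation operates on edge-based data structures in Fortran). Because the nodes are numbered sequentially along each implicit line, the line system is block tri-diagonal and can be solved by a block Thomas algorithm (forward elimination and back substitution); its solution then becomes the right hand side of the LU-SGS step.

```python
import numpy as np

def block_thomas(lower, diag, upper, rhs):
    """Block tri-diagonal solve by Gaussian elimination along one implicit line.
    lower[k] couples row k to k-1 (lower[0] unused), upper[k] couples row k to
    k+1 (upper[-1] unused); diag[k] and rhs[k] are the diagonal block and RHS."""
    n = len(diag)
    d = [blk.copy() for blk in diag]
    r = [blk.copy() for blk in rhs]
    for k in range(1, n):                        # forward elimination
        factor = lower[k] @ np.linalg.inv(d[k - 1])
        d[k] = d[k] - factor @ upper[k - 1]
        r[k] = r[k] - factor @ r[k - 1]
    x = [None] * n
    x[-1] = np.linalg.solve(d[-1], r[-1])
    for k in range(n - 2, -1, -1):               # back substitution
        x[k] = np.linalg.solve(d[k], r[k] - upper[k] @ x[k + 1])
    return x

def line_implicit_lusgs_step(lines, line_blocks, rhs, lusgs_solve):
    """Approximate factorization of Eq. (17):
    step 1 (Eq. 18): block tri-diagonal solves along each implicit line,
    step 2 (Eq. 19): LU-SGS sweeps on the remaining, line-edge-free operator."""
    dU1 = {}
    for line in lines:                           # each line is a list of node ids
        lower, diag, upper = line_blocks(line)   # assumed: I + dt/V dR1/dU blocks
        sol = block_thomas(lower, diag, upper, [rhs[i] for i in line])
        for node, blk in zip(line, sol):
            dU1[node] = blk
    return lusgs_solve(dU1)                      # assumed: applies Eq. (19) sweeps
```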

C.5 Numerical results

A first version of the line-implicit LU-SGS has been tested for inviscid transonic Mach 0.754 flow at 2.57° angle of attack, on a mesh for RANS simulation. The computations have been run for two refined grids where the normal distance from the airfoil to the first interior node is 10^-5 and 10^-6 chord lengths, respectively. There are 310 nodes on the airfoil, and a varying number of quadrilateral cells in the wall normal direction with about 35 and 45 layers on average. The triangular grid outside of these layers has about 51 K and 54 K nodes, respectively. The maximum aspect ratio of the boundary layer cells is 15 × 10^3. All calculations are performed with four levels of full multigrid W-cycles. Directional coarsening with coarsening ratios 1:4 or 1:2 along the lines is used. Figure C.4 shows the mesh for the 10^-5 chord length case. We can clearly identify the stretched structured grid around the airfoil, surrounded by an unstructured grid.

Figure C.4: Computational grid for NACA0012 (10−5 chord length)


We first analyze the computations on the less stretched mesh (Fig. C.5), with a 10^-5 chord length. The combined line-implicit LU-SGS solver does not give any improvement compared to the implicit LU-SGS solver, and both of them are relatively close to the line-implicit Runge-Kutta convergence. However, the line-implicit Runge-Kutta remains the most efficient solver, with a convergence 1.3 times faster.


Figure C.5: Convergences on stretched grid 10−5 . (Second order accuracy)

The story changes on more stretched grids. Figure C.6 shows the line-implicit LU-SGS solver being 2.6 times faster than the implicit LU-SGS. We note that the line-implicit Runge-Kutta still gives the fastest convergence.


Figure C.6: Convergences on stretched grid 10−6 . (Second order accuracy)


C.6 Discussion

From the previous results we observe that LU-SGS has a good convergence for moderately stretched meshes, which can be explained by the strong efficiency of LU-SGS on isotropic grids (paper A). Hence this first implementation of the line-implicit LU-SGS can hardly provide a faster convergence there. However, the use of a line-implicit LU-SGS solver seems to be of higher interest when computing on more stretched grids. In fact the implicit LU-SGS is strongly affected by the refinement of the grid, becoming for example 4 times slower from the 10^-5 mesh to the 10^-6 mesh. This can be compared to the line-implicit LU-SGS convergence ratio of the order of 1.5 between the two meshes. Thus this last result is very promising for highly stretched meshes.

C.7 Conclusion

This paper compares the implicit LU-SGS solver with a variant combining it with a line-implicit method. The motivation comes from the lack of efficiency of LU-SGS for stretched unstructured grids. The strong efficiency provided by a line-implicit LU-SGS method appears when computing on highly stretched grids. For this kind of mesh, the implicit LU-SGS suffers from a significant deterioration in convergence speed. Thus this first implementation of the combined line-implicit LU-SGS is very promising as a solution for the implicit LU-SGS on stretched grids. Since the original line-implicit Runge-Kutta still gives the fastest convergence, more work has to be done concerning the theoretical formulation and the related implementation.

References

[C.1] Edge, theoretical formulation. Technical Report dnr 03-2870, FOI, 2010.
[C.2] Alfio Borzì. Introduction to multigrid methods.
[C.3] P. Eliasson, P. Weinerfelt and J. Nordström. Application of a line-implicit scheme on stretched unstructured grids. 47th AIAA Aerospace Sciences Meeting, January 2009. Paper 2009-163.
[C.4] J.S. Kim and O.J. Kwon. An efficient implementation of implicit operator for block LU-SGS method. Computational Fluid Dynamics, pp. 154-159, July 2005.
[C.5] H. Luo, D. Sharov, J. D. Baum and R. Löhner. Implementation of unstructured grid GMRES+LU-SGS method on shared-memory, cache-based parallel computers. AIAA, 2000. Paper 2000-0927.


[C.6] R. P. Dwight. Efficiency improvements of RANS-based analysis and optimization using implicit and adjoint methods on unstructured grids. Faculty of Engineering and Physical Sciences, University of Manchester, 2006.