Unstructured Mesh Computational Mechanics on DM Parallel Platforms
K. McManus, C. Walshaw, M. Cross and S. Johnson

Parallel machines offer the computational power necessary to solve increasingly demanding industrial computational mechanics problems. Many structured grid codes have been successfully parallelised, but there is a move towards unstructured mesh codes, which can provide greater modelling flexibility. Domain decomposition methods may be used to map unstructured mesh codes onto Distributed Memory (DM) Multi Instruction Multi Data (MIMD) hardware using a Single Program Multi Data (SPMD) paradigm. The success of this method depends upon achieving a good load balance. Obtaining a well-balanced partition of an unstructured mesh is non-trivial, but good partitions may be computed quickly using a number of methods. We present results which demonstrate the effectiveness of this strategy for homogeneous models. However, inhomogeneous models such as casting, aeroelasticity and contact analysis present difficult load balancing problems which require more sophisticated balancing techniques.

1. Introduction

Unstructured mesh codes provide the flexibility required for modelling complex geometries. The University's research into the modelling of metals casting processes has led to the development of a control volume code in which a cell-centred, co-located Rhie and Chow SIMPLE algorithm [1], for the solution of the Navier-Stokes equations modelling flow and heat with solidification, has been integrated with a node-based control volume solver for the elastic stress-strain equations [2,3]. Iterative solvers are used with false time stepping for transient problems. This entire code has been parallelised using domain decomposition with explicit message passing in Fortran77 for use on Distributed Memory (DM) Multi Instruction Multi Data (MIMD) platforms.

2. Parallelisation by Domain Decomposition

In parallelising an existing code there are a number of desirable objectives:

i) Minimise the changes to the original algorithms. The parallel code should ideally produce identical results to the serial code. This is a necessary requirement for user acceptance of the parallel code.

ii) Minimise the visibility of the parallel code. The parallel code should be hidden from both the serial code developers and the parallel code users. This allows ease of maintenance of the parallel code alongside the serial code and avoids deterring users from the parallel code.

iii) Portability to most DM MIMD platforms. The code needs to be able to make good use of the majority of available parallel hardware.

iv) Maximise parallel efficiency. The parallel code must show significant speed-up over the serial code. The primary motivation for parallelisation is to reduce run-time.

v) Scalability to massively parallel DM platforms. This is the direction in which high flop-per-Dollar supercomputers are being developed, and cost effectiveness is rapidly becoming an overriding criterion.

These objectives have been achieved through parallelisation by domain decomposition using a Single Program Multi Data (SPMD) programming paradigm. The unstructured mesh partitioning code JOSTLE [4] has been integrated into the software to provide an element-based partition of the mesh into P sub-domains mapped onto P processors. A node partition is subsequently derived from the element partition. The elements and nodes within each partition are referred to as core elements and nodes. The mesh of each sub-domain is extended with a layer of overlap elements and nodes along the boundaries between the partitions. The extent of the overlap layer is determined by the data dependency of the code being parallelised. Each processor calculates only the values of variables for its core elements and nodes. When needed, variable values are swapped into the overlaps from the processor on which they have been calculated. An exception to this scheme arises when it is quicker to operate locally on the overlaps than to communicate the values from neighbouring processors.
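To make the overlap scheme concrete, the sketch below shows one way such an overlap update might look. The paper's code uses explicit message passing in Fortran77; this illustration is written in C with MPI, and the Neighbour structure, index lists and routine name are assumptions made for the purpose of the example rather than the actual implementation.

```c
#include <mpi.h>
#include <stdlib.h>

/* Hypothetical per-neighbour exchange lists: for each neighbouring
 * sub-domain, send_idx lists the local core entries whose values that
 * neighbour keeps as overlaps, and recv_idx lists the local overlap
 * entries to be filled from that neighbour.                          */
typedef struct {
    int     rank;        /* rank of the neighbouring processor        */
    int     nsend;       /* number of core values to send             */
    int     nrecv;       /* number of overlap values to receive       */
    int    *send_idx;    /* local indices of core values to send      */
    int    *recv_idx;    /* local indices of overlap values to fill   */
    double *sbuf, *rbuf; /* pre-allocated message buffers             */
} Neighbour;

/* Swap freshly computed core values of 'var' into the overlap layer.
 * Every receive is posted before any send, so the exchange cannot
 * deadlock whatever order the neighbours appear in.                  */
void overlap_update(double *var, Neighbour *nbr, int nnbr, MPI_Comm comm)
{
    MPI_Request *req = malloc(2 * nnbr * sizeof *req);

    for (int n = 0; n < nnbr; n++)                    /* post receives */
        MPI_Irecv(nbr[n].rbuf, nbr[n].nrecv, MPI_DOUBLE,
                  nbr[n].rank, 0, comm, &req[n]);

    for (int n = 0; n < nnbr; n++) {                  /* pack and send */
        for (int i = 0; i < nbr[n].nsend; i++)
            nbr[n].sbuf[i] = var[nbr[n].send_idx[i]];
        MPI_Isend(nbr[n].sbuf, nbr[n].nsend, MPI_DOUBLE,
                  nbr[n].rank, 0, comm, &req[nnbr + n]);
    }

    MPI_Waitall(2 * nnbr, req, MPI_STATUSES_IGNORE);  /* complete all  */

    for (int n = 0; n < nnbr; n++)                    /* unpack        */
        for (int i = 0; i < nbr[n].nrecv; i++)
            var[nbr[n].recv_idx[i]] = nbr[n].rbuf[i];

    free(req);
}
```

The send and receive index lists correspond to the data structures, described in section 4, that record the order and direction in which data is exchanged between processors.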

3. Parallel Iterative Solvers

Variation between the serial and parallel versions of an algorithm is sometimes inevitable. For an iterative solver the principal change to the serial algorithm is the order of coefficient evaluation. A Jacobi solver is order independent, so the parallel solution variables remain identical to those of the serial solution at each step of the procedure. It is, however, impractical to parallelise a Gauss-Seidel iterative solver identically for an irregular problem. This algorithm is dependent upon the order of evaluation of the coefficients and must be modified to achieve an effective parallel scheme. The resulting parallel algorithm becomes a hybrid of Gauss-Seidel and Jacobi in that a small number of old variable values persist in the overlaps until the end of each iteration. Results on irregular problems so far indicate that variations between the serial and parallel solution variables and differences in the number of iterations required to converge are both insignificant. The diagonally preconditioned conjugate gradient solver is an explicit scheme that, like the Jacobi method, uses only old variable values within each iteration. It might therefore be expected to give identical results from the serial and parallel versions. However, rounding errors occur in the summation involved in the formation of the direction vector, and these errors are different for the serial and parallel implementations. As the solution is highly sensitive to the direction vector, these rounding errors give rise to small differences between the serial and parallel solutions. In this case both the serial and parallel results are equally valid solutions of the original algorithm.
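A sketch of the resulting hybrid sweep is given below, reusing the hypothetical Neighbour structure and overlap_update routine from the sketch in section 2 (again an illustration in C with MPI, not the authors' Fortran77 routine, and the compressed-row matrix layout is an assumption). Core unknowns are updated in place in node order, so core neighbours already visited in the sweep contribute new values (Gauss-Seidel), while values held in the overlaps remain at their previous-iteration state (Jacobi) until the exchange at the end of the sweep.

```c
/* One iteration of the hybrid Gauss-Seidel/Jacobi scheme described
 * above (a sketch, not the authors' routine).  The solution array x
 * holds core values followed by overlap values; off-diagonal
 * coefficients are stored in compressed-row form.                  */
void hybrid_gs_iteration(int ncore,            /* core unknowns       */
                         const int *row_ptr,   /* CSR row pointers    */
                         const int *col_idx,   /* CSR column indices  */
                         const double *val,    /* off-diagonal coeffs */
                         const double *diag,   /* diagonal coeffs     */
                         const double *rhs,
                         double *x,            /* core + overlap      */
                         Neighbour *nbr, int nnbr, MPI_Comm comm)
{
    for (int i = 0; i < ncore; i++) {
        double sum = rhs[i];
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum -= val[k] * x[col_idx[k]];   /* overlap entries stale */
        x[i] = sum / diag[i];                /* core entry updated    */
    }
    overlap_update(x, nbr, nnbr, comm);      /* refresh overlap layer */
}
```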

4. Code Changes

Extension of the mesh to provide overlaps is accommodated by simple extension of the existing data structures. New variables are, however, required by the parallel code to keep account of these extensions. More significantly, data structures are needed to record the order and direction in which data is exchanged between processors. Mapping of the original mesh to the partitioned mesh requires a global-sized data structure that has to be distributed among the processors to remain scalable.

Figure 1: Shell structure of the parallel code (layers, from the outside in: unchanged routines, parallelised routines, parallel library, comms library, harness)

The SPMD paradigm allows a single source code parallel program to be developed and maintained as a serial code. A shell structure, illustrated in Figure 1, has been used to build layers of visibility within the code. Around the outside of the shell are the majority of the original routines, which remain unchanged. At the next level in are the routines from the original code that have been modified to function in parallel. Most of these routines are changed only slightly, in that additional subroutine calls have been included and array dimensions and loop lengths are changed. The parallel utilities library provides locationless and directionless routines which form a barrier to the visibility of the parallel implementation. At this level there is no concept of master or slave processor, or indeed of processor number, position or communication channel. It is felt that serial code developers should have little problem with this view of parallelism. The mesh partitioning, overlap generation and sub-domain renumbering routines at this level require extensive data structures and globally dimensioned variables. Embedding these routines in the parallel code may not always be possible, owing to memory restrictions, in which case they may be used to pre-process the problem files. The communication library provides a portability interface and a barrier to the visibility of the parallel machine. This library consists of a very simple set of communication routines used by the utility library to present a uniform functionality on all machines. The innermost level is the native communication harness provided by the machine (PVM, MPI, ctoolset, etc.). Only the most primitive send and receive functions are necessary at this level, so guaranteeing portability to the majority of hardware platforms. More sophisticated harness calls may, however, be used to implement parallel library functions to allow efficient use of the hardware where possible.
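To make the layering concrete, a minimal sketch is given below, again in C over MPI (the paper's own libraries are written in Fortran77 and are certainly richer; the routine names here are invented for illustration). The portability layer asks nothing of the native harness beyond initialisation and a primitive send and receive, while the utility level builds a "locationless and directionless" operation, here a global sum, on top of those primitives.

```c
#include <mpi.h>

/* Thin portability layer (names hypothetical): the only services
 * required of the native harness are initialisation, a primitive
 * send and a primitive receive.                                     */
static int my_pe, n_pe;                 /* set once at start-up       */

void comms_init(int *argc, char ***argv)
{
    MPI_Init(argc, argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_pe);
    MPI_Comm_size(MPI_COMM_WORLD, &n_pe);
}

void comms_send(const void *buf, int nbytes, int dest)
{
    MPI_Send(buf, nbytes, MPI_BYTE, dest, 0, MPI_COMM_WORLD);
}

void comms_recv(void *buf, int nbytes, int source)
{
    MPI_Recv(buf, nbytes, MPI_BYTE, source, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
}

/* Utility library level: a global sum built only on the primitives
 * above, so it ports to any harness providing those two calls.  The
 * application routines that call it see neither processor numbers
 * nor communication channels.                                       */
double util_global_sum(double local)
{
    double value = local, incoming;

    if (my_pe != 0) {
        comms_send(&value, sizeof value, 0);     /* send my part      */
        comms_recv(&value, sizeof value, 0);     /* receive the total */
    } else {
        for (int p = 1; p < n_pe; p++) {         /* accumulate        */
            comms_recv(&incoming, sizeof incoming, p);
            value += incoming;
        }
        for (int p = 1; p < n_pe; p++)           /* redistribute      */
            comms_send(&value, sizeof value, p);
    }
    return value;
}
```

A global sum accumulated in this fixed processor order will in general round differently from the equivalent serial loop over the whole mesh, which is precisely the source of the small serial/parallel differences noted for the conjugate gradient solver in section 3.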

5. Mesh Partitioning

Distributing an unstructured mesh across a parallel machine at run time, so that the computational load is evenly balanced and the amount of interprocessor communication is minimised, is non-trivial. It is well known that this problem is NP-complete, so in recent years much attention has been focused on developing efficient partitioning algorithms. Many methods have been developed that partition a graph corresponding to the communication requirements of the mesh; such graph-based methods have the advantage of geometric independence. The mesh partitioning code JOSTLE [4] aims to address the problems of massive parallelism, machine topology, parallel partitioning and dynamic load balancing. Massive parallelism implies a massive problem, so it is important that the partitioning itself becomes a parallel process. The topology of the parallel machine becomes increasingly relevant as the number of processors increases, since there will inevitably be a distance-based communication cost on a massively parallel machine. A parallel partitioning scheme is therefore required that can rapidly distribute the problem to the processors and then work in parallel as much as possible. JOSTLE uses a three-stage process to achieve these requirements, the first stage of which is unavoidably serial, but the subsequent stages may be carried out in parallel. An O(N) initial partition (where N is the number of nodes in the graph), based on either the Greedy algorithm [6] or geometric sorting, provides an inexpensive and reasonably good starting point. Node clustering using a recursive greedy algorithm then coarsens the graph in order to reduce the order of the problem. A number of optimisation heuristics can then be applied to re-partition the reduced graph. Nodes are migrated across the partition boundaries using high-level heuristics which use the distance from a partition centre (in a graphical sense) to minimise the surface energy of each partition. Load balance is then restored, and low-level node-based heuristics are used to smooth out the partition interfaces on the original graph. The reduction and optimisation codes may be used to re-partition a mesh in a parallel dynamic load balancing scheme.
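For illustration, the sketch below shows the kind of O(N) graph-growing pass that a Greedy-style initial partition performs: each sub-domain is grown breadth-first from a seed node until it holds roughly N/P nodes. It is only a schematic of the idea behind [6]; JOSTLE's actual initial partition, clustering and optimisation stages are considerably more sophisticated.

```c
#include <stdlib.h>

/* Greedy-style initial partition (sketch).  Graph in CSR form: the
 * neighbours of node i are adj[xadj[i]] .. adj[xadj[i+1]-1].
 * On return part[i] holds the sub-domain 0..nparts-1 of node i.     */
void greedy_partition(int nnodes, const int *xadj, const int *adj,
                      int nparts, int *part)
{
    int *queue  = malloc(nnodes * sizeof *queue);   /* BFS frontier   */
    int target  = (nnodes + nparts - 1) / nparts;   /* nodes per part */
    int next_seed = 0;

    for (int i = 0; i < nnodes; i++)
        part[i] = -1;                               /* unassigned     */

    for (int p = 0; p < nparts; p++) {
        int head = 0, tail = 0, count = 0;

        while (next_seed < nnodes && part[next_seed] != -1)
            next_seed++;                            /* first free node */
        if (next_seed == nnodes)
            break;

        part[next_seed] = p;                        /* seed the part  */
        queue[tail++] = next_seed;
        count = 1;

        while (head < tail && count < target) {     /* grow breadth-first */
            int v = queue[head++];
            for (int k = xadj[v]; k < xadj[v + 1] && count < target; k++) {
                int w = adj[k];
                if (part[w] == -1) {
                    part[w] = p;
                    queue[tail++] = w;
                    count++;
                }
            }
        }
    }

    /* Sweep up stray nodes left in exhausted components (a real
     * implementation would balance these as well).                  */
    for (int i = 0; i < nnodes; i++)
        if (part[i] == -1)
            part[i] = nparts - 1;

    free(queue);
}
```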

6. Performance

Figure 2: Speedup against number of processors for a range of mesh sizes (3,000, 10,000, 30,000 and 60,000 elements) on an i860-based Transtech Paramid.

The parallel efficiency, or speed-up, is strongly dependent on the calculation to communication ratio of the parallel machine. Using T800 transputers the efficiency of the parallel code remains high even for small problems with a ratio of core to overlap elements around unity [5]. The Transtech i860 processing nodes available at the University of Greenwich use a T800 as a communication co-processor. This system incurs a significant communication start-up latency, which results in a rather poor calculation to communication ratio. Consequently the achievable parallel performance using these i860 processors is restricted, but it nevertheless shows a significant speed-up given a suitable problem size.

The test case used is meshed with 3,000, 10,000, 30,000 and 60,000 triangular elements, running a solidification problem involving flow, heat and stress.
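The speed-up and parallel efficiency quoted here are taken in the usual sense (the definitions below are the standard ones and are not stated explicitly in the paper):

\[ S(P) = \frac{T(1)}{T(P)}, \qquad E(P) = \frac{S(P)}{P}, \]

where T(P) is the run-time of the same problem on P processors, so that ideal behaviour gives S(P) = P and E(P) = 1.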

7. Partitioning for Multi-Physical Modelling

Figure 3: Foil mesh partitioned for four processors

Figure 3 shows a simple foil mesh partitioned into four sub-domains, each containing the same number of elements. This partition has achieved a balance of elements for each processor, but it is necessary to balance the load across all solvers. In an aeroelastic problem only flow is solved in the space around the foil and only stress is solved within the foil. To achieve a load balance the nature of these physical domains must be incorporated into the partitioning algorithm. A more balanced partition may appear more like that shown in Figure 4.

Figure 4: Foil mesh partitioned with solver balancing

For a casting problem the massive migration of elements from the liquid to the solid domain places a limit on the achievable load balance. These problems, along with adaptive and moving meshes, need to be addressed if a dynamic load balancing scheme is to be successful.
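As a small illustration of the point, the sketch below (in C, with hypothetical per-element solver flags; it is not the authors' method) measures the balance seen by each solver separately. A partition that balances total elements, as in Figure 3, can still leave one solver's elements concentrated on a few processors, which is what a solver-weighted partition such as Figure 4 sets out to remove.

```c
#include <stdio.h>

#define NSOLVERS 2          /* e.g. 0 = flow, 1 = stress              */
#define MAXPARTS 64         /* assumes nparts <= MAXPARTS             */

/* For each solver, report max processor load over the average load.
 * active[e][s] is a hypothetical flag saying whether solver s does
 * any work on element e; a value of 1.0 means perfect balance.       */
void solver_imbalance(int nelem, const int *part, int nparts,
                      const int active[][NSOLVERS])
{
    double load[MAXPARTS][NSOLVERS] = {{0.0}};

    for (int e = 0; e < nelem; e++)
        for (int s = 0; s < NSOLVERS; s++)
            if (active[e][s])
                load[part[e]][s] += 1.0;   /* work for solver s here  */

    for (int s = 0; s < NSOLVERS; s++) {
        double max = 0.0, total = 0.0;
        for (int p = 0; p < nparts; p++) {
            if (load[p][s] > max) max = load[p][s];
            total += load[p][s];
        }
        printf("solver %d: imbalance %.2f\n", s,
               total > 0.0 ? max * nparts / total : 1.0);
    }
}
```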

Acknowledgements

The work presented in this paper was funded by the EPSRC.

8. References

[1] Patankar, S.V.: Numerical Heat Transfer and Fluid Flow, Hemisphere, 1980.
[2] Cross, M., Bailey, C., Chow, P., Pericleous, K. and Fryer, Y.D.: Towards an integrated control volume unstructured mesh code for the simulation of all of the macroscopic processes involved in shape casting; Numerical Methods in Industrial Forming Processes (NUMIFORM 92), Balkema (1992), 787-792.
[3] Fryer, Y.D., Bailey, C., Cross, M. and Lai, C-H.: A control volume procedure for solving the elastic stress-strain equations on an unstructured mesh; Applied Mathematical Modelling, 15 (1991), 639-645.
[4] Walshaw, C., Cross, M. and Everett, M.G.: A parallelisable algorithm for optimising unstructured mesh partitions; Tech. Rep. 95/IM/03, CMS Press, University of Greenwich, London, 1995, submitted for publication.
[5] Jones, B.W., McManus, K., Cross, M., Everett, M.G. and Johnson, S.: Parallel unstructured mesh CFD codes: a role for recursive clustering techniques in mesh decomposition; Parallel Computational Fluid Dynamics: New Trends and Advances, Elsevier Science B.V. (1995), 207-214.
[6] Farhat, C.: A simple and efficient automatic FEM domain decomposer; Comp. & Struct., 28 (1988), 579-602.

Addresses:

Mr K. McManus, Dr C. Walshaw, Prof M. Cross, Dr S. Johnson
Centre for Numerical Modelling and Process Analysis, University of Greenwich, London SE18 6PF, UK.
email: [k.mcmanus, ..]@gre.ac.uk
URL: http://www.gre.ac.uk/~[k.mcmanus, ..]
