Parallelizing Molecular Dynamics Codes using Parti Software Primitives*

R. Das†    J. Saltz‡

Abstract

This paper is concerned with the implementation of the molecular dynamics code CHARMM on massively parallel distributed-memory computer architectures using a data-parallel approach. The implementation is carried out by creating a set of software tools that provide an interface between the parallelization issues and the sequential code. Large practical MD problems are solved on the Intel iPSC/860 hypercube. The overall solution efficiency is compared with that obtained when the implementation is done using data replication.

1 Introduction

The rapid growth in microprocessor technology over the last decade has led to the development of massively parallel hardware capable of solving large computational problems. Given the relatively slow increase in large mainframe supercomputer capabilities, it is now generally acknowledged that the most feasible and economical means of solving extremely large computational problems in the future will be with highly parallel distributed-memory architectures. Large molecular dynamics (MD) codes such as CHARMM [3], GROMOS [10], and AMBER [11] have parts that are extremely computationally intensive. Implementing these codes on massively parallel machines not only reduces the total solution time but also allows one to solve very large problems.

There are two software issues that need to be addressed when one considers how to effectively parallelize MD codes. The first is the development of efficient and inherently parallelizable algorithms for the inter-particle force calculations, which consume the bulk of the computation time in these codes. A number of parallel algorithms have been designed and implemented for both SIMD and MIMD architectures [7, 1, 8, 9]. The second is an implementation issue. Most often, the programmer is required to explicitly distribute large arrays over multiple local processor memories and keep track of which portions of each array reside on which processors. In order to access a given element of a distributed array, communication between the appropriate processors must be invoked. The programmer should be relieved of such machine-dependent and architecture-specific tasks. Such low-level implementation issues have severely limited the growth of parallel computational applications, much in the same way that vectorized applications were inhibited early on, prior to the development of efficient vectorizing compilers.

* Support for the authors has been provided by a pending DARPA contract, and by the National Aeronautics and Space Administration under NASA contract NAS1-18605 while these authors were in residence at ICASE, NASA Langley Research Center.
† Institute for Advanced Computer Studies, Computer Science Department, University of Maryland, College Park, MD.
‡ Institute for Advanced Computer Studies, Computer Science Department, University of Maryland, College Park, MD.


We have parallelized the MD code CHARMM using the Parti software primitives. This is a data-parallel implementation for a distributed-memory architecture. Several different optimizations have been performed to improve single-node performance and to reduce the communication time; most of these optimizations have been incorporated into the software primitives. This paper is less concerned with the individual development of the solver or of the parallelizing primitives, and more with the interaction between these two efforts resulting from a specific application: the implementation of the MD code CHARMM on the Intel iPSC/860.

2 Molecular Dynamics

Molecular dynamics is a technique for simulating the thermodynamic and dynamic properties of liquid and solid systems. For each timestep of the simulation two separate calculations are performed. First, there is the bonded and non-bonded force calculation for each atom. The second part is the integration of Newton's equations of motion for each atom. In most MD codes the bulk of the time (a little more than 90%) is spent in the long-range, i.e. non-bonded, force calculation; hence this is the part that needs to be parallelized efficiently. This is an O(N^2) calculation, where N is the number of atoms: every atom interacts with every other atom, but usually a cutoff distance Rc is specified and interactions beyond it are neglected. The non-bonded force calculation consists of two distinct parts. For each atom, first the pairlist needs to be generated, and then the van der Waals and electrostatic force calculation has to be performed. The pairlist generation is not performed every iteration but only every n iterations, where n is a parameter that can be fixed by the user.

The MD code that we used was CHARMM (Chemistry at HARvard Macromolecular Mechanics) [3], which was developed at Harvard University for biomolecular simulations. It is a relatively efficient program which uses empirical energy functions to model molecular systems. The code is written in Fortran and is about 110,000 lines long. The program is capable of performing a wide range of analyses. Some of the important simulation routines are dynamic analysis, trajectory manipulation, energy calculation and minimization, and vibrational analysis. It also performs statistical analysis, time series and correlation function analysis, and spectral analysis.
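
To make the structure of the non-bonded calculation concrete, the following is a small, self-contained sketch in Python (purely illustrative; CHARMM itself is Fortran, and none of the names, parameter values, or unit conventions below come from CHARMM). It shows the O(N^2) pairlist construction with a cutoff, and a van der Waals (Lennard-Jones) plus electrostatic force evaluation over that list; the Coulomb constant is taken as 1 for simplicity.

    import random

    def build_pairlist(coords, r_list):
        # O(N^2) pairlist generation: keep every pair closer than the list cutoff.
        n = len(coords)
        pairs = []
        for i in range(n):
            xi, yi, zi = coords[i]
            for j in range(i + 1, n):
                dx, dy, dz = coords[j][0] - xi, coords[j][1] - yi, coords[j][2] - zi
                if dx * dx + dy * dy + dz * dz < r_list * r_list:
                    pairs.append((i, j))
        return pairs

    def nonbonded_forces(coords, charges, pairs, eps=0.1, sigma=1.0):
        # Lennard-Jones plus Coulomb forces accumulated over the precomputed pairlist.
        forces = [[0.0, 0.0, 0.0] for _ in coords]
        for i, j in pairs:
            dx = coords[i][0] - coords[j][0]
            dy = coords[i][1] - coords[j][1]
            dz = coords[i][2] - coords[j][2]
            r2 = dx * dx + dy * dy + dz * dz
            inv_r2 = 1.0 / r2
            s6 = (sigma * sigma * inv_r2) ** 3
            # Scalar factors f such that the force vector on atom i is f * (dx, dy, dz).
            f_lj = 24.0 * eps * (2.0 * s6 * s6 - s6) * inv_r2
            f_coul = charges[i] * charges[j] * inv_r2 ** 1.5
            f = f_lj + f_coul
            for k, d in enumerate((dx, dy, dz)):
                forces[i][k] += f * d
                forces[j][k] -= f * d
        return forces

    # Toy run: 64 atoms in a box; in a real MD code the pairlist is rebuilt only
    # every n timesteps, while the forces are evaluated every timestep.
    random.seed(0)
    coords = [[random.uniform(0.0, 10.0) for _ in range(3)] for _ in range(64)]
    charges = [random.choice([-0.4, 0.2]) for _ in range(64)]
    pairs = build_pairlist(coords, r_list=4.0)
    forces = nonbonded_forces(coords, charges, pairs)
    print(len(pairs), "pairs in the list")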

3 Parti Primitives

The PARTI primitives (Parallel Automated Runtime Toolkit at ICASE) [2] are designed to ease the implementation of computational problems on parallel architecture machines by relieving the user of low-level machine-specific issues. The primitives enable the distribution and retrieval of globally indexed but irregularly distributed data sets over the numerous local processor memories.

In distributed-memory machines, large data arrays need to be partitioned between the local memories of the processors. These partitioned data arrays are called distributed arrays. Long-term storage of distributed array data is assigned to specific memory locations in the distributed machine. A processor that needs to read an array element must fetch a copy of that element from the memory of the processor on which that element is stored. Alternatively, a processor may need to store a value into an off-processor distributed array element. Thus each element in a distributed array is assigned to a particular processor, and in order to access a given element of the array we must know the processor on which it resides and its local address in that processor's memory. We therefore build a translation table which, for each array element, lists the host processor and the local address.
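
As an illustration of the translation-table concept only, here is a minimal Python sketch under two stated assumptions: the names (TranslationTable, build_translation_table, dereference) are hypothetical rather than the actual PARTI interface, and the table is kept whole in one place, whereas in PARTI the translation table is itself distributed across the processors.

    class TranslationTable:
        # Maps a global array index to (owner processor, local offset).
        def __init__(self, owner, local):
            self.owner = owner    # owner[g] = processor that stores global element g
            self.local = local    # local[g] = offset of g within that processor's memory

    def build_translation_table(owned_indices_per_proc, n_global):
        # Build the table from each processor's list of locally owned global indices.
        owner = [None] * n_global
        local = [None] * n_global
        for p, owned in enumerate(owned_indices_per_proc):
            for offset, g in enumerate(owned):
                owner[g] = p
                local[g] = offset
        return TranslationTable(owner, local)

    def dereference(table, g):
        # Answer: on which processor does global element g live, and at what local address?
        return table.owner[g], table.local[g]

    # Toy usage: 10 array elements distributed irregularly over 2 processors.
    table = build_translation_table([[0, 2, 4, 6, 8], [1, 3, 5, 7, 9]], 10)
    print(dereference(table, 7))    # -> (1, 3)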

One of the primitives handles the initialization of distributed translation tables, and another primitive is used to access them.

In distributed-memory MIMD architectures there is typically a non-trivial communication latency or startup cost. For efficiency, information to be transmitted should be collected into relatively large messages, and the cost of fetching array elements can be reduced by precomputing what data each processor needs to send and to receive. In irregular problems, such as the calculation of inter-particle forces or sparse matrix algorithms, it is not possible to predict at compile time what data must be prefetched. This lack of information is dealt with by transforming the original parallel loop into two constructs called an inspector and an executor [6]. The PARTI primitives can be used directly by programmers to generate inspector/executor pairs. The inspector produces a communication schedule, which is essentially a pattern of communication for gathering or scattering data; the executor contains embedded PARTI primitives that gather or scatter data according to that schedule. The primitives are designed to minimize their effect on the source code, so that the final parallel code remains as close as possible to the original sequential code. The primitives issue instructions to gather, scatter or accumulate data according to a specified schedule, and latency or startup cost is reduced by packing many small messages with the same destination into one large message.
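
The inspector/executor transformation can be pictured with the following hedged Python sketch: a single process stands in for several processors, and the function names and data layout are illustrative assumptions, not the PARTI primitives themselves. The inspector examines the global indices a loop will reference and builds a schedule grouped by owning processor, so that all requests to one processor can travel as one large message; the executor then gathers the off-processor elements into a local buffer that the loop body reads.

    def inspector(needed_globals, owner, local_offset, my_rank):
        # Build a gather schedule: for each remote owner, the (global index, local
        # offset) pairs to request. Grouping by owner enables message aggregation.
        schedule = {}
        for g in needed_globals:
            p = owner[g]
            if p != my_rank:
                schedule.setdefault(p, []).append((g, local_offset[g]))
        return schedule

    def executor_gather(schedule, remote_memories):
        # Carry out the schedule: copy each off-processor element into a local
        # buffer keyed by global index; the parallel loop then reads this buffer.
        buffer = {}
        for p, items in schedule.items():
            for g, offset in items:
                buffer[g] = remote_memories[p][offset]
        return buffer

    # Toy usage: globals 0-5, even indices owned by processor 0, odd by processor 1.
    owner = {g: g % 2 for g in range(6)}
    local_offset = {g: g // 2 for g in range(6)}
    remote = {1: [11.0, 13.0, 15.0]}                 # processor 1's local array
    sched = inspector([1, 2, 5], owner, local_offset, my_rank=0)
    print(executor_gather(sched, remote))            # -> {1: 11.0, 5: 15.0}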

4 Implementation and Results

We initially carry out a preprocessing phase to generate appropriately partitioned data structures that are read in by each processor. The parallelized code looks very similar to the sequential code except that it has a number of Parti primitive calls embedded in it. Initially, all parallel irregular loops were transformed into inspector/executor pairs. We then performed optimizations associated with reusing copies of off-processor data and with vectorizing communications [5]. The entire energy calculation portion of CHARMM has been parallelized; this involves both the internal (bond, angle, etc.) energy calculations and the external (non-bonded) energy calculations.

We present some of the results that we have obtained on the Intel iPSC/860. In the tables depicting the results, we report the total time required to carry out 100 CHARMM timesteps (including 4 pairlist regenerations), the communication time associated with 100 timesteps, and the cost associated with a single pairlist regeneration. The first test case was the TIP3P configuration (the three-site water model of Jorgensen), which contains 216 water molecules (648 atoms) in a box. Table 1 shows the timings obtained for 4 to 32 processors; in this case the atoms were partitioned by simple blocking. We compare these to the data presented in Table 2, for the same case on the same numbers of processors, where the atom partitioning was obtained by first partitioning the iterations of the non-bonded loop into blocks and then deriving the atom distribution from the loop distribution (a sketch of this idea is given below). We can see that partitioning the iterations of the non-bonded loop reduces the total time, owing to better load balancing in the non-bonded force calculation; as expected, the pairlist regeneration time remains virtually unchanged.

The second test case was a simulation of carboxy-myoglobin hydrated with 3830 water molecules (14026 atoms in total). The list generation cutoff was 14 angstroms. The timings obtained for 100 steps are shown in Table 3. The internal energy calculation is not very expensive and accounts for about 1% of the total energy calculation time. The two most computationally intensive parts of the code are the non-bonded energy calculation and the atom pairlist generation.
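
The following sketch illustrates only the idea behind deriving the atom distribution from the non-bonded loop partitioning, namely assigning each processor a contiguous range of atoms carrying roughly equal pairlist work rather than an equal number of atoms. The balancing rule and the names used here are assumptions for illustration; the paper does not spell out the exact heuristic.

    def partition_by_work(pair_counts, n_procs):
        # Split the atoms (equivalently, the non-bonded loop iterations) into
        # contiguous blocks of roughly equal pairlist work; return each atom's owner.
        total = sum(pair_counts)
        target = total / n_procs
        owner, p, done = [], 0, 0.0
        for count in pair_counts:
            if done >= target * (p + 1) and p < n_procs - 1:
                p += 1
            owner.append(p)
            done += count
        return owner

    # Toy usage: with pairs (i, j) stored only for j > i, atoms early in the list
    # carry longer pairlists, so a plain block split of atoms is poorly balanced.
    pair_counts = list(range(15, -1, -1))     # 16 atoms with decreasing work
    print(partition_by_work(pair_counts, n_procs=4))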


    Number of     Total time per    Comm. time per    List regen. time
    processors    100 iterations    100 iterations    per generation
        4              111               3.8                5.0
        8               62               5.3                2.8
       16               40               9.1                1.6
       32               27              13.7                1.1

                                Table 1
    Timings of the dynamic analysis on the Intel iPSC/860 for TIP3P with
    BLOCK partitioning (times in seconds).

In our implementation, every processor generates the pairlist for the atoms that have been assigned to it. A communication phase precedes the pairlist generation; during this phase the current atomic coordinates are exchanged. To simplify the implementation and to reduce the volume of communication, we have replicated some of the data structures used during pairlist generation. For the smaller TIP3P problem, pairlist generation is relatively inexpensive and has only a modest effect on the total cost (note that pairlist generation is performed only occasionally; in our base sequential code it was carried out every 25 steps). In the larger carboxy-myoglobin simulation, pairlist generation proved to be extremely expensive. The reasons for the high cost do not appear to be fundamental; these costs are, in part, prompting us to redesign the address translation mechanisms in our software.

We use incremental scheduling [5] to reduce the volume of data to be communicated, by reusing live data that has already been fetched for previous loops. Because we use incremental scheduling, we can fetch all of the required data, i.e. the off-processor atom coordinates that will be referenced. On the other hand, in the current implementation we scatter off-processor data during the calculation of energies and forces. This allows the ordering of operations in the parallel code to more closely approximate the ordering of operations in the sequential code (both parallelization and vectorization of loops with accumulations typically require a reordering of operations within the loop). We could potentially achieve a very significant saving in our communication overhead by delaying the off-processor accumulation until the end of the energy calculation loops. This saving would require a further compromise in the sequential operation ordering, but at least one group has found that this modification does not have an adverse effect on the computational results [4].

Our results can be compared with those presented by Brooks in [4]. That implementation replicates all the data structures on all the processors. For 64 processors on the Intel iPSC/860, Brooks reports a total time of 1521.1 seconds for 1000 timesteps of the carboxy-myoglobin benchmark; this compares with the 368 seconds we required to carry out 100 iterations of the same problem on 64 processors of an iPSC/860. While the drawback of the replicated approach is that the limited memory on each processor node restricts the size of problem that can be fit into the machine, at present that group is able to obtain better results.
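
To illustrate the incremental-scheduling idea in isolation, here is a hedged Python sketch (hypothetical names, not the PARTI calls): when a later loop needs off-processor data, only the indices that are neither owned locally nor already live from an earlier gather are placed on the new schedule, which is what reduces the communication volume.

    def incremental_schedule(needed_globals, owner, my_rank, live_copies):
        # Return only the off-processor indices that must actually be fetched:
        # locally owned elements and copies already live from earlier loops are skipped.
        return [g for g in needed_globals
                if owner[g] != my_rank and g not in live_copies]

    # Toy usage: loop 1 fetched copies of globals 4 and 7; loop 2 references 4, 7, 9, 2.
    owner = {g: (0 if g < 4 else 1) for g in range(10)}
    live_copies = {4: 0.81, 7: 0.15}
    print(incremental_schedule([4, 7, 9, 2], owner, my_rank=0, live_copies=live_copies))
    # -> [9]   (4 and 7 are reused live copies; 2 is local)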

    Number of     Total time per    Comm. time per    List regen. time
    processors    100 iterations    100 iterations    per generation
        4              105               3.1                5.0
        8               56               5.9                2.8
       16               39              10.1                1.8
       32               25              13.0                1.1

                                Table 2
    Timings of the dynamic analysis on the Intel iPSC/860 for TIP3P with the
    atom partitioning based on iteration partitioning (times in seconds).

    Number of     Total time per    Comm. time per    List regen. time
    processors    100 iterations    100 iterations    per generation
       16              724               102               75.4
       32              509               173               43.4
       64              368               191               29.9

                                Table 3
    Timings of the dynamic analysis on the Intel iPSC/860 for carboxy-myoglobin
    hydrated with water (times in seconds).


References

[1] A. I. Mel'cuk, R. C. Giles, and H. Gould, Molecular dynamics simulation of liquids on the Connection Machine, Computers in Physics, (May/June 1991), pp. 311-318.
[2] H. Berryman, J. Saltz, and J. Scroggs, Execution time support for adaptive scientific algorithms on distributed memory architectures, Concurrency: Practice and Experience, 3 (1991), pp. 159-178.
[3] B. R. Brooks, R. E. Bruccoleri, B. D. Olafson, D. J. States, S. Swaminathan, and M. Karplus, CHARMM: A program for macromolecular energy, minimization, and dynamics calculations, Journal of Computational Chemistry, 4 (1983), p. 187.
[4] B. R. Brooks and M. Hodoscek, Parallelization of CHARMM for MIMD machines, Chemical Design Automation News, 7 (1992), p. 16.
[5] R. Das, R. Ponnusamy, J. Saltz, and D. Mavriplis, Distributed memory compiler methods for irregular problems - data copy reuse and runtime partitioning, in Compilers and Runtime Software for Scalable Multiprocessors, J. Saltz and P. Mehrotra, eds., Elsevier, Amsterdam, The Netherlands, 1992.
[6] R. Mirchandaney, J. H. Saltz, R. M. Smith, D. M. Nicol, and K. Crowley, Principles of runtime support for parallel processors, in Proceedings of the 1988 ACM International Conference on Supercomputing, July 1988, pp. 140-152.
[7] P. Tamayo, J. P. Mesirov, and B. M. Boghosian, Parallel approaches to short range molecular dynamics simulations, in Supercomputing '91, Albuquerque, NM, Nov. 1991.
[8] S. Plimpton and G. Heffelfinger, Scalable parallel molecular dynamics on MIMD supercomputers, in SHPCC92, Williamsburg, VA, April 1992.
[9] T. W. Clark, J. A. McCammon, and L. R. Scott, Parallel molecular dynamics, in Proceedings of the Fifth SIAM Conference on Parallel Processing for Scientific Computing, Houston, TX, March 1991.
[10] W. F. van Gunsteren and H. J. C. Berendsen, GROMOS: Groningen molecular simulation software, technical report, Laboratory of Physical Chemistry, University of Groningen, Nijenborgh, The Netherlands, 1988.
[11] P. K. Weiner and P. A. Kollman, AMBER: Assisted model building with energy refinement. A general program for modeling molecules and their interactions, Journal of Computational Chemistry, 2 (1981), p. 287.