4. Quantum Chemical Reaction Dynamics

Mark Wu, Aron Kuppermann, and Paul Messina, California Institute of Technology

4.1 Introduction

This section describes the research efforts and progress made in the implementation and performance of the quantum mechanical reactive scattering application (3D-REACT) across the CASA heterogeneous computer environment, whose nodes can reside at geographically different locations. This is a scientifically and technically important application. The formalism uses a symmetrized hyperspherical coordinate technique for the calculation of accurate cross sections in reactive collisions of an atom with a diatomic molecule [1-3]. The goal was to quantify the requirements for efficient network implementation of the parallel 3D-REACT application, which performs explicit communication. In addition to quantifying these requirements for different problem sizes and different machines, we wished to develop an analytical model to explain the behavior of the 3D-REACT application in the CASA network environment.

In this section, the issues involved in developing this distributed application in a high-speed network environment are first identified. The decomposition of the calculation was based on the diversity of tasks performed by its major components and was organized so as to achieve high efficiency by overlapping communication with computation. The experience acquired in implementing the application and the results of performance measurements are then described. The network bandwidth requirements were analyzed, and it was shown that the distribution of this application across disparate supercomputing platforms was feasible and that superlinear speedup was possible. Various network and computational science issues related to using distributed supercomputer environments for chemical reaction dynamics applications are also discussed.

4.2 Research Goals

The key goal of the CASA quantum chemical reaction dynamics application was to characterize the role that wide-area networks with very high bandwidth can play in providing new capabilities for state-of-the-art ab initio calculations of the cross sections of some scientifically and technologically important chemical reactions. The network and computational science issues related to parallel computations using the large-scale computers on the CASA gigabit testbed were studied. The question arose as to whether the enhanced computing power implied by this heterogeneous configuration could be used effectively to reduce the wall-clock time demanded by 3D-REACT. In this section, we address these issues by discussing procedures for the distribution and efficient computation of 3D-REACT in the CASA gigabit testbed environment, which consisted of multiple computers that had different architectures and were geographically separated from each other but were connected by a high-speed network. The overall emphasis was to

CASA Final Report

9

Tech. Report CACR-123

• assess the suitability of each computer for the different computational tasks required by the application,
• explore mechanisms for exchanging data among processors in order to hide network latency and overlap communication with computation,
• investigate the possibility of superlinear speedup by using the most suitable computer architecture for the individual task components,
• increase the available computing resources by allowing simultaneous access to geographically separated computers,
• serve as a vehicle for exploring issues in the use of high-speed networks and computers for distributed computing, and
• generate new results that have scientific and technological significance.

4.3 Scientific Background and Network Approach

Quantum mechanical reaction dynamics calculations predict the cross sections and rates of chemical reactions from first principles. This application, 3D-REACT, provides an understanding of the details of chemical reactions at the molecular level. While experimental studies yield important data, a significant fraction of the interesting elementary reactions in nature involve short-lived molecules that are difficult to isolate and study in the laboratory. The prediction of the rates and cross sections of such chemical reactions is a primary goal of computational chemistry. These data, in addition to their scientific interest, are important for the modeling of complex chemical systems such as combustion, plasmas, and the atmosphere.

The computational method involves the use of symmetrized hyperspherical coordinates [1-3] for calculating atom-diatom cross sections. These coordinates consist of a distance ρ, called the hyperradius, and five hyperangles denoted collectively as ω. The former is an overall mass-scaled size parameter: large values of ρ correspond to separated reagents or products, whereas small ones, of the order of the equilibrium interatomic distance, correspond to reagents or products in close interaction with each other. The five hyperangles describe the orientation of the triatomic system in space, the angle between the diatomic internuclear vector and the vector from the center of mass of the diatom to the atom, and the ratio of the lengths of these two vectors.

The code for solving the six-dimensional Schrödinger equation for the triatomic system consists of three major parts: the local hyperspherical surface function (LHSF) calculation, the logarithmic derivative (LOG-D) propagation, and the asymptotic analysis (ASY). The number of channels (rotational and vibrational states of the diatomic molecule) determines the size of the matrices to be manipulated. In the first part, a set of five-dimensional LHSFs is generated; the time-consuming step is the diagonalization of full symmetric matrices with dimensions on the order of twice the number of channels, so this phase requires the most memory. The LHSFs are local and need to be calculated at about 50 values of the hyperradius. Each set of eigenpairs is used to obtain a set of dense symmetric matrices, called propagation matrices. The second part, LOG-D, reads the LHSFs and propagation matrices and performs the logarithmic derivative propagation of the coupled-channel equations resulting from the expansion of the Schrödinger equation in the LHSFs.
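The three-phase structure just described can be summarized schematically as follows. This is an illustrative sketch only; the production code is Fortran, and all function names and return values below are hypothetical stand-ins.

```python
# Schematic of the 3D-REACT phase structure (illustration only).

N_SECTORS = 50   # roughly 50 values of the hyperradius rho (Section 4.3)

def compute_lhsf_and_propagation_matrices(sector):
    # Stand-in for the LHSF phase: diagonalization of full symmetric
    # matrices of dimension ~2x the number of channels, followed by
    # construction of the dense propagation matrices for this sector.
    return {"sector": sector, "overlap": None, "propagation": [None] * 4}

def propagate_log_derivative(surface_data, energy, partial_wave, symmetry):
    # Stand-in for the LOG-D phase: logarithmic-derivative propagation of
    # the coupled-channel equations through all sectors (dominated by
    # matrix inversions and multiplications).
    return ("propagated", energy, partial_wave, symmetry, len(surface_data))

def asymptotic_analysis(propagated):
    # Stand-in for the ASY phase: extract the scattering information; cheap.
    return {"cross_sections": propagated}

def run_3d_react(energies, partial_waves, symmetries):
    surface_data = [compute_lhsf_and_propagation_matrices(s)
                    for s in range(N_SECTORS)]
    results = {}
    for e in energies:
        for j in partial_waves:
            for sym in symmetries:
                results[(e, j, sym)] = asymptotic_analysis(
                    propagate_log_derivative(surface_data, e, j, sym))
    return results

print(len(run_3d_react(energies=[1.5], partial_waves=[0], symmetries=["A1"])))
```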


This propagation involves large numbers of matrix inversions and matrix multiplications (of the order of 1,000 for each energy, partial wave, and symmetry). The resulting matrices are stored on disk. Finally, in the third phase (ASY), these matrices are read from disk and an asymptotic analysis is performed for each of them. This phase is not computationally intensive and is, in general, done sequentially. All aspects of the physics and chemistry can then be extracted from the results of the asymptotic analysis. The number of partial waves for which the calculation must be done is typically about 30 to 50, the number of symmetries two to six, and the number of energies about 100. Versions of the programs had been implemented for shared-memory vector supercomputers, such as the CRAY 2, CRAY Y-MP, and CRAY C-90, as well as for parallel computers with distributed memory, such as the Intel Touchstone Delta and the Intel Paragon.

Several characteristics of 3D-REACT make it both a difficult and an ideal application for a heterogeneous computer environment. The calculation is computation intensive, requires a large amount of memory, and generates large outputs. While a very small amount of data (on the order of a few Kbytes) is read in for initialization, a large amount of data (on the order of several hundred Mbytes to several Gbytes) is output from the LHSF calculation. As an example, consider the H + D2 reaction at a total energy of 1.5 eV. This computation is an 800-channel problem, which required 64 Mwords of core memory and generated about 400 Mbytes of data. On a CRAY C-90 it took about 3.5 hours to compute the LHSFs and 1.5 hours to perform the LOG-D propagation of 8 energies for a typical partial wave.

Today's scientific and engineering computing environments contain elements that are diverse not only in CPU and memory capacity but also in architecture and specialized functions. In addition to increasing the processor and memory resources available to this application, heterogeneous distributed supercomputing provides several other advantages over stand-alone supercomputers. Because of the characteristics of the numerical algorithms in the LHSF and LOG-D codes, the LHSF code executes most efficiently on vector processing systems like the CRAYs, while the LOG-D code executes much faster on massively parallel machines like the Intel Delta and Paragon. Heterogeneous distributed supercomputing allows each part of 3D-REACT to be run on the architecture for which it is best suited. The high-speed network that links such diverse architectures allows the combined system to be viewed as a meta-computer.
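To convey the scale of the LOG-D phase, the following back-of-the-envelope count uses the ranges quoted above; the specific mid-range choices of 40 partial waves and 4 symmetries are illustrative assumptions.

```python
# Rough scale of the LOG-D phase, using the ranges quoted in the text with
# illustrative mid-range choices (40 partial waves, 4 symmetries).
matrix_ops_per_case = 1_000   # inversions/multiplications per (E, J, symmetry)
partial_waves = 40            # typically 30 to 50
symmetries = 4                # typically 2 to 6
energies = 100                # about 100

total_matrix_ops = matrix_ops_per_case * partial_waves * symmetries * energies
print(f"{total_matrix_ops:.1e} dense-matrix operations")   # ~1.6e7
```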

4.4 Implementation and Performance of the Codes

4.4.1 Surface Function (LHSF) and Logarithmic Derivative (LOG-D) Codes

Table 4.1 summarizes the performance of both the LHSF code (including the calculation of the propagation matrices) and the LOG-D code on 64 processors of the Intel Delta and Paragon, 128 processors of the Paragon, a 1-CPU CRAY Y-MP, and an 8-CPU CRAY C-90 for a 1,000-channel problem. The speeds on the CRAYs were measured using the hardware performance monitor through the PERFMON and PERFPRT subroutines. The speeds on the Delta and Paragon were estimated from the absolute measured speeds on the CRAYs and the measured relative speeds of the Delta and Paragon with respect to the CRAYs for the same codes. These measurements corresponded to the H + D2 reaction. The LHSFs were calculated at 51 values of ρ, from 2.0 bohr to 12.0 bohr in steps of 0.2 bohr; the corresponding overlap matrices were obtained between consecutive values of ρ, and the propagation matrices were obtained every 0.05 bohr. The number of primitive functions used permitted us to generate 1,000 LHSFs, sufficient to achieve accurate cross sections up to a total energy of 1.5 eV.

As can be seen from Table 4.1, the speeds of the LHSF and LOG-D codes on the CRAYs were about the same, and the CRAY C-90 was about 1.8 times faster than the CRAY Y-MP. Large speed differences between the LHSF and LOG-D codes were found on the Delta and Paragon: the LOG-D code was about 11 times faster than the LHSF code on these two machines. This was due to the less efficient eigensolver routine in the LHSF code, compared with the very efficient BLAS 3 matrix inverter developed for the LOG-D code.

4.4.2 Network Implementation

The 3D-REACT application can be distributed in several different ways over the computers linked by the network, to achieve different levels of concurrency based on functional considerations specific to the LHSF and LOG-D codes. In the simplest case, which has been tested extensively, the LHSF and LOG-D codes run on different computers and continually exchange information through the network. In a typical situation the CRAY C-90 was used to calculate the LHSFs and the propagation matrices, and the Intel Delta was used to integrate the corresponding coupled differential equations with the LOG-D code. The rationale for this choice was the performance figures given in Table 4.1, which suggest that such a choice may lead to superlinear speedup.

The efficiency of an application distributed across a network is affected by network latency and communication delays. When designing algorithms intended for network applications, it is important to overlap data transmission with computation so that the communication time can be masked. Figure 4.1 shows the utilization chronogram of this application for the case of two machines A and B, in which machine A generated the LHSFs and machine B performed the LOG-D propagation and ASY analysis. The two horizontal axes display the execution times of the LHSF and LOG-D codes, respectively; vertical arrows between the two axes indicate the times at which exchanges of information between the two codes take place. To study the bandwidth requirements for efficient operation of the distributed 3D-REACT, we had to estimate the flow of data, the network latencies, and the time within which data transmission had to be carried out.

Table 4.1 Performance of the Distributed Codes for a 1,000-Channel Problem (speeds in Megaflops)

Code     Delta (64 nodes)   Paragon (64 nodes)   Paragon (128 nodes)   CRAY Y-MP (1 CPU)   CRAY C-90 (1 CPU)   CRAY C-90 (8 CPUs)
LHSF     115                147                  255                   230                 410                 2400
LOG-D    1280               1600                 2800                  250                 430                 2432


Figure 4.1. Utilization chronogram of the distributed calculation (time axes for the LHSF and LOG-D codes, data-transfer flags, and the convergence test).

Let us consider, for example, the 1,000-channel test calculation for the H + D2 system described above. For two computers, each having the sustained speed of a single CRAY Y-MP CPU (~0.2 Gflops), 2.4 minutes were required to calculate the LHSFs for each hyperradius and each symmetry. About 80 Mbytes of data were generated for each sector and had to be transmitted over the network; these data were used as input for the LOG-D code. Machine A calculated a set of LHSFs at a given hyperradius and sent the overlap matrix and four propagation matrices, one at a time, to machine B, where the LOG-D propagations were performed. It took 2.4 minutes for machine B to propagate these matrices for two energies, one partial wave, and one symmetry. Machine B started the LOG-D propagation as soon as it received the first set of matrices; after that, data were pipelined from machine A to machine B every 2.4 minutes. If we assume a network efficiency of 60%, a dedicated network of 7 Mbytes/sec would be sufficient to hide the network latency and communication time for this 1,000-channel example.

The asymptotic analysis was performed as soon as machine B finished the propagation for all the hyperradii, in order to check the convergence of the calculation. Machine B then requested a new set of LHSFs from machine A. Starting from the smallest hyperradius, this process was repeated several times until a good set of surface functions was obtained, as determined from the convergence of the results of the ASY analysis. At that point, both machines had a complete set of accurate LHSFs and propagation matrices, and the LOG-D propagation could be carried out independently on each machine, without further communication, for all energies, partial waves, and symmetries needed to complete the study of the reaction. The two machines were assigned different sets of energies for this final stage of the calculation.

It was necessary to know the network transmission rate and latency before planning experiments in a dedicated mode.
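A rough check of the bandwidth estimate above, using only the figures quoted in the text (the 60% efficiency and 7 Mbytes/sec are the assumed values from the example), shows that the per-sector transfer fits comfortably inside one propagation cycle:

```python
# Consistency sketch of the bandwidth estimate for the 1,000-channel example.
data_per_sector_mb = 80.0      # Mbytes sent per hyperradius sector
cycle_time_s = 2.4 * 60.0      # LOG-D propagation time per sector (seconds)
network_mb_per_s = 7.0         # assumed dedicated network speed
efficiency = 0.6               # assumed network efficiency

transfer_time_s = data_per_sector_mb / (network_mb_per_s * efficiency)
print(f"{transfer_time_s:.0f} s to transfer vs. {cycle_time_s:.0f} s of computation")
# ~19 s << 144 s, so one sector's matrices can be transferred while the
# LOG-D propagation of the previous sector is still running.
```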


Table 4.2 Network Performance Measures

Network Link                        Latency (ms)   Transmission Rate (Mbytes/sec)
JPL CRAY Y-MP – Caltech Delta       2.8            5.0
SDSC CRAY C-90 – Caltech Delta      3.2            4.8
SDSC CRAY C-90 – Caltech Paragon    2.0            26.2
LANL CRAY Y-MP – Caltech Paragon    2.4            25.4

Table 4.2 shows the performance of the network between the Caltech Delta or Paragon and the JPL and LANL CRAY Y-MPs or the SDSC CRAY C-90. It was measured by running the complete distributed 3D-REACT code with all the computations bypassed: a precalculated set of LHSFs stored on disk was read by the code and transmitted through the network by calling EXPRESS subroutines. The transmission rates were measured by timing the round trips of messages of known sizes, and the network latency was obtained by comparing the transmission times of small messages with those of large messages. A transmission rate of about 5 Mbytes/sec and a latency of about 3 ms were obtained for the links between the Delta and the CRAYs. For the link between the Paragon and the SDSC C-90, a transmission rate of 26.2 Mbytes/sec and a latency of 2 ms were measured, and for the link between the Caltech Paragon and the LANL CRAY Y-MP, a transmission rate of 25.4 Mbytes/sec and a latency of 2.4 ms. The limiting factor in the network transmission rates for the Delta or Paragon to CRAY links was the internal mesh routing speed to the HIPPI node rather than the speed of the network itself.
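A transmission rate and a latency can be backed out of round-trip timings of a small and a large message as sketched below. The two timing values in the example are illustrative only, chosen to be roughly consistent with the ~5 Mbytes/sec and ~3 ms figures reported above; they are not measured CASA numbers.

```python
# One way to extract a transmission rate and latency from round-trip times
# of a small and a large message, in the spirit of the procedure described
# above.  Round-trip model assumed here: t = 2*latency + 2*size/rate.

def rate_and_latency(small_bytes, t_small_s, large_bytes, t_large_s):
    rate = 2.0 * (large_bytes - small_bytes) / (t_large_s - t_small_s)
    latency = t_small_s / 2.0 - small_bytes / rate
    return rate, latency

# Illustrative timings only:
rate, latency = rate_and_latency(1_000, 0.0064, 1_000_000, 0.406)
print(f"rate = {rate/1e6:.1f} Mbytes/sec, latency = {latency*1e3:.1f} ms")
```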

4.5 Analytical Model for the Distributed Calculation Speedup

Heterogeneous network computing focuses on the placement of a single complex application on a coordinated network of diverse hardware, software, and administrative resources. In general, heterogeneous applications have computation and communication behaviors distinct from those of applications designed for individual parallel or sequential computers. To develop performance-efficient heterogeneous applications, many factors must be taken into account. Ideally, portions of the application that exhibit distinct computational structures should be implemented on the architectures that support them in the most performance-efficient manner. An analytical model that explains the affinities between application tasks and possible platforms is very helpful for optimizing the machine-dependent parameters that affect the whole computation. In this section, we describe a model for the 3D-REACT application on heterogeneous computing environments, based on a functional decomposition scheme.

Consider a distributed application C involving two separate codes, C1 and C2, which we wish to run on M machines (A1, A2, A3, ..., An for C1 and B1, B2, B3, ..., Bm for C2, where n + m = M) connected by a HIPPI network so as to achieve high efficiency. In an application like 3D-REACT, C1 and C2 are the LHSF evaluation and the LOG-D propagation codes, respectively. The first set of LHSFs must be available before the LOG-D propagation can start. Let us define t_i^(a) as the time for machine i (i = A1, A2, A3, ..., B1, B2, B3, ...) to perform the entire calculation alone. In the concurrent use of the M machines, let t_i^(c) be the total elapsed time for machine i when C1 is run on A1, A2, A3, ... and C2 on B1, B2, B3, .... The speedup in using the M machines concurrently is defined as

G = min( t_A1^(a), t_A2^(a), t_A3^(a), ..., t_B1^(a), t_B2^(a), t_B3^(a), ... ) / max( t_A1^(c), t_A2^(c), t_A3^(c), ..., t_B1^(c), t_B2^(c), t_B3^(c), ... )    (1)

It is easy to see that the condition for superlinear speedup is G > M. By assigning each part of a distributed application to the computer for which it is best suited, it is possible in principle to achieve superlinear speedup. We used a data decomposition scheme within the LHSF and LOG-D propagation codes and a functional decomposition between the M machines. Figure 4.1 illustrates the utilization chronogram for the simplest case, which involves only two machines. At t = 0, machine A is turned on and initiates the calculation of the first set of LHSFs and propagation matrices for hyperspherical sector 1. Machine A ends the calculation of the LHSFs and propagation matrices for sector 1 at t1, repeatedly sends a signal to machine B until it is acknowledged, and initiates the transmission of data to that machine. Transmission of the data for sector 1 ends at t2, and machine B signals receipt to machine A. Machine A then starts the LHSF and propagation-matrix evaluations for sector 2. At t3, machine A ends the calculation for sector 2, signals machine B, waits for an acknowledgment, and initiates the transmission of data. Machine B receives the data for sector 2 and signals receipt to machine A at t4. The operations occurring from t2 through t4 constitute a cycle, which is repeated until all sectors are finished. The duration of such a cycle is τ_B, the time of the LOG-D propagation for one sector on machine B. Typically, there are a total of n sectors, with n ≅ 50. After machine A transmits the data for sector n, it stops. After machine B ends the propagation for the nth sector, it stops the LOG-D propagation and the ASY analysis is initiated; the latter takes a negligible amount of time. From this diagram, we conclude that the elapsed time for machine A is

t_A^(c) = T + (n − 1)τ_B    (2)

where

T = t_A + Δt    (3)

Δt = t_2 − t_1    (4)

The quantity t_A = t_1 is the LHSF and propagation-matrix evaluation time for one sector on machine A, and Δt is the corresponding network transmission time. The elapsed time for machine B, which starts at the time t = 0 at which machine A was turned on, is

t_B^(c) = T + nτ_B = t_A^(c) + τ_B    (5)

As a result of this expression, we conclude that

max( t_A^(c), t_B^(c) ) = t_B^(c) = t_A + Δt + nτ_B    (6)

The total network utilization time is nΔt. Only Δt (i.e., the time required to transmit the data for one sector) is not hidden by the concurrent calculations. The time for each machine to perform the entire calculation alone can be expressed as

t_A^(a) = n t_A + n τ_A    (7)

where τ_A is the LOG-D time for one sector on machine A alone. Similarly,

t_B^(a) = n t_B + n τ_B    (8)

where t_B is the LHSF and propagation-matrix evaluation time for one sector on machine B alone. Substituting (6)–(8) into (1) results in

G = min( n t_A + n τ_A, n t_B + n τ_B ) / ( t_A + Δt + n τ_B )    (9)

It is simple to extend equation (9) to the general case involving M machines. In the general situation, machine A_i (i = 1, 2, 3, ...) performs the LHSF calculations for n_i sectors and transmits the LHSFs and propagation matrices consecutively to machine B_i (i = 1, 2, 3, ...), where the LOG-D propagation is performed. The time for each machine to perform the entire calculation alone can then be expressed as

t_i^(a) = n t_i + n τ_i    (10)

where t_i is the LHSF and propagation-matrix evaluation time for one sector and τ_i is the LOG-D time for one sector on machine i alone. The denominator in equation (1), which gives the maximum elapsed time for the concurrent use of the M machines, can be written as

max( t_A1^(c), t_A2^(c), t_A3^(c), ..., t_B1^(c), t_B2^(c), t_B3^(c), ... ) = t_1 + Δt + n τ_i^max    (11)

where τ_i^max is the LOG-D propagation time for one sector on the slowest machine, t_1 is the time for the first LHSF and propagation-matrix evaluation, and Δt is the corresponding network transmission time. Substituting equations (10) and (11) into (1), the general speedup gain can be rewritten as

G = min_i( t_i^(a) ) / ( t_1 + Δt + n τ_i^max )    (12)

Let f(N)^LHSF and f(N)^LOG-D be, respectively, the number of floating point operations executed in the corresponding code for one sector, where N is the number of channels of the problem being considered. These two quantities reflect the nature of the two codes. The corresponding computation times for each machine can be expressed as

t_i = f(N)^LHSF / α_i^LHSF    (13)

τ_i = N_i^LOG-D f(N)^LOG-D / α_i^LOG-D    (14)

where α_i^LHSF and α_i^LOG-D are efficiency parameters for the respective codes on each machine. The quantity N_i^LOG-D is the number of propagations selected to achieve load balance between the machines when they run concurrently. The network transmission time per sector, Δt, is given by

Δt = 32N² / S_net^user    (15)

where S_net^user is the network speed available to one user (in Mbytes/sec), taking into account that the network must be shared concurrently by many users. The factor 32 arises because four matrices of dimension N × N must be transmitted, at 8N² bytes per matrix. The effect of the network latency need not be included, because the message sizes are large enough for the latency to be negligible. Replacing (13) through (15) in (12) gives the expression for G:

G = min_i { n f(N)^LHSF/α_i^LHSF + n N_i^LOG-D f(N)^LOG-D/α_i^LOG-D } / max_i { f(N)^LHSF/α_i^LHSF + n N_i^LOG-D f(N)^LOG-D/α_i^LOG-D + 32N²/S_net^user }    (16)

This equation permits us to calculate G as a function of the problem size N and of the network speed S_net^user available to one user, for fixed values of the remaining parameters. Because the diagonalization and inversion of full symmetric matrices dominate the LHSF and LOG-D calculations, the computation time increases as N³ as the problem size N increases, whereas the communication time increases only as N². It is more convenient to write (16) as

G = min_i { n g(N)^LHSF/α_i^LHSF + n N_i^LOG-D g(N)^LOG-D/α_i^LOG-D } / max_i { g(N)^LHSF/α_i^LHSF + n N_i^LOG-D g(N)^LOG-D/α_i^LOG-D + 32/(N S_net^user) }    (17)

where

g(N) = f(N) / N³    (18)

that is, (17) is obtained from (16) by dividing its numerator and denominator by N³. Expression (17) permits us to predict the gain for any configuration of machines and any network speed, and allows us to select the tuning parameters optimally.
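As a concrete illustration, the sketch below evaluates the two-machine form of the model, Eqs. (9) and (13)-(15), using the operation counts quoted in Section 4.6.1 and the Table 4.1 speeds. Interpreting the counts 188.2 and 104.0 as floating point operations per channel³ is an assumption made here for the purpose of the example.

```python
# Minimal sketch of the two-machine speedup model, Eqs. (9) and (13)-(15).
# Machine A evaluates the LHSF; machine B runs the LOG-D propagation.
# The counts 188.2 and 104.0 are the large-N values quoted in Section 4.6.1,
# interpreted here as floating point operations per channel^3 (an assumption
# about units); the alpha values are the Table 4.1 speeds in Mflops.

def predicted_gain(n_channels, s_net_user,
                   alpha_a_lhsf, alpha_a_logd,   # machine A speeds (Mflops)
                   alpha_b_lhsf, alpha_b_logd,   # machine B speeds (Mflops)
                   n_sectors=50, n_logd=8,
                   f_lhsf=188.2, f_logd=104.0):
    n3 = float(n_channels) ** 3
    # Per-sector times in seconds, Eqs. (13) and (14)
    t_a = f_lhsf * n3 / (alpha_a_lhsf * 1e6)
    tau_a = n_logd * f_logd * n3 / (alpha_a_logd * 1e6)
    t_b = f_lhsf * n3 / (alpha_b_lhsf * 1e6)
    tau_b = n_logd * f_logd * n3 / (alpha_b_logd * 1e6)
    # Network transfer time per sector, Eq. (15): 4 matrices of 8*N^2 bytes
    dt = 32.0 * n_channels ** 2 / (s_net_user * 1e6)
    # Stand-alone times, Eqs. (7)-(8), and the concurrent elapsed time, Eq. (6)
    t_a_alone = n_sectors * (t_a + tau_a)
    t_b_alone = n_sectors * (t_b + tau_b)
    t_concurrent = t_a + dt + n_sectors * tau_b
    return min(t_a_alone, t_b_alone) / t_concurrent      # Eq. (9)

# 800-channel H + D2 run over the SDSC C-90 (1 CPU) / Caltech Delta (64 nodes)
# link at 5 Mbytes/sec: yields ~3.47, the pure-N^3 prediction in Table 4.3.
print(round(predicted_gain(800, 5.0, 410.0, 430.0, 115.0, 1280.0), 2))
```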

4.6 Design of the Experiments

In order to test the validity of this analytical model, several experiments were performed over the links from Caltech to JPL, Caltech to SDSC, and Caltech to JPL and LANL. To achieve the best performance, the LHSFs were evaluated on the CRAY-type machines, while massively parallel machines such as the Delta or the Paragon performed the LOG-D propagation. Running the application required several windows, one for each machine. Since one of the CRAYs acted as a server, its code needed to be started first. All the measurements presented in this section were made with all machines placed in dedicated mode.

4.6.1 Two-Machine Configuration

In the two-machine mode, we ran several problem sizes ranging from 5 to 800 channels. The system chosen was the H + D2 reaction. The tests consisted of performing calculations for four partial waves and two energies, which gave the appropriate load balance over the links from the Delta to the CRAY Y-MP, the Delta to the C-90, and the Paragon to the C-90. The machine-independent parameters for this reaction were n = 50 and N^LOG-D = 8. The quantities f(N)^LHSF and f(N)^LOG-D were measured on the CRAY C-90 for different problem sizes; for large N, f(N)^LHSF = 188.2 Mflops/channel³ and f(N)^LOG-D = 104.0 Mflops/channel³. The machine-dependent parameters α_B^LHSF and α_B^LOG-D on the 64-processor Delta were 115 and 1280 Mflops/sec, respectively. For the 64-processor Paragon, α_B^LHSF became 147 Mflops/sec and α_B^LOG-D changed to 1600 Mflops/sec. The corresponding parameters for machine A were 410 and 430 Mflops/sec for the CRAY C-90 and 230 and 250 Mflops/sec for the CRAY Y-MP (see Table 4.1).

Table 4.3 shows the measurements of 13 complete runs of both the distributed and stand-alone calculations over the Caltech Delta to SDSC C-90 link, and Table 4.4 shows the measurements of 10 complete runs over the Caltech Paragon to SDSC C-90 link. The network transmission rate was measured from the one-way transfer time between the two machines, which required synchronization of the clocks on the two machines. Before sending the data, machine A sent a null message through the network to machine B and waited for an acknowledgment. Upon receiving this null message, machine B returned an acknowledgment and started a timer. Machine A sent the data only after it had received this acknowledgment from machine B. This synchronization procedure did not add appreciably to the total execution time of the application.

Figures 4.2 and 4.3 show the predicted and observed speedup gains as a function of the number of channels N times the network speed available to a single user. Two analytical speedup gains were calculated for the same problem sizes as those for which the speedup gains were measured.
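For illustration, the one-way timing handshake just described can be sketched as follows. This is a toy simulation in which threads and queues stand in for the two machines and the EXPRESS messages; it is not the code actually used in the measurements.

```python
# Sketch of the null-message/acknowledgment handshake used for one-way timing.
import queue, threading, time

a_to_b, b_to_a = queue.Queue(), queue.Queue()

def machine_a(payload):
    a_to_b.put("null")        # null message: "ready to send"
    b_to_a.get()              # wait for the acknowledgment from B
    a_to_b.put(payload)       # only now send the real data

def machine_b():
    a_to_b.get()              # receive the null message
    b_to_a.put("ack")         # acknowledge and start the timer
    start = time.perf_counter()
    data = a_to_b.get()       # receive the data
    elapsed = time.perf_counter() - start
    print(f"one-way transfer of {len(data)} bytes took {elapsed*1e3:.2f} ms")

b = threading.Thread(target=machine_b)
b.start()
machine_a(b"x" * 1_000_000)   # 1 Mbyte of dummy data
b.join()
```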


Table 4.3 Timings and Gains for the Caltech Delta – SDSC CRAY C-90 Link
(Times are in hours; S_net^user is in Mbytes/sec; numbers in parentheses are powers of 10.)

Run #   N     S_net^user   t_A^(a)    t_B^(a)    t_1        Δt         nτ_B       G_M^b   G_C^c   G_C^d
1       20    4.51         9.4(-04)   4.2(-04)   1.0(-06)   9.4(-07)   4.6(-04)   0.91    1.1     3.43
2       32    5.03         1.1(-03)   1.0(-03)   4.2(-06)   2.4(-06)   5.5(-04)   1.79    2.00    3.45
3       36    5.01         1.6(-03)   1.5(-03)   6.1(-06)   2.6(-06)   7.8(-04)   1.90    2.16    3.45
4       64    4.98         8.9(-03)   8.3(-03)   3.5(-05)   7.4(-06)   3.3(-03)   2.51    2.54    3.46
5       140   5.00         9.5(-02)   8.8(-02)   4.0(-04)   3.6(-05)   3.2(-02)   2.72    2.88    3.46
6       196   5.02         2.6(-01)   2.4(-01)   1.1(-03)   7.1(-05)   8.2(-02)   2.91    2.99    3.47
7       256   4.97         5.6(-01)   5.3(-01)   2.3(-03)   1.3(-04)   1.8(-01)   2.92    3.08    3.47
8       400   5.10         2.14       2.03       8.4(-03)   2.9(-04)   6.4(-01)   3.14    3.20    3.47
9       520   5.00         4.74       4.46       1.8(-02)   4.9(-04)   1.39       3.16    3.26    3.47
10      676   4.99         10.4       9.80       4.6(-02)   8.3(-04)   3.01       3.20    3.32    3.47
11      800   5.00         17.0       16.26      7.8(-02)   1.4(-03)   4.89       3.27    3.32    3.47
12      870   5.00         21.9       20.91      1.0(-01)   1.7(-03)   6.23       3.30    3.32    3.47
13      960   5.00         29.4       28.12      1.3(-01)   2.0(-03)   8.33       3.32    3.33    3.47

Table 4.4 Timings and Gains for the Caltech Paragon – SDSC CRAY C-90 Link^a

Run #   N     S_net^user   t_A^(a)    t_B^(a)    t_1         Δt         nτ_B       G_M^b   G_C^c   G_C^d
1       4     19.2         7.5(-06)   2.7(-06)   8.24(-09)   8.9(-09)   1.8(-06)   1.51    1.75    3.36
2       12    20.4         2.0(-04)   7.3(-05)   2.3(-07)    8.5(-08)   3.5(-05)   2.05    2.10    3.39
3       32    24.4         1.1(-03)   8.0(-04)   4.2(-06)    5.4(-07)   3.2(-04)   2.45    2.50    3.40
4       60    24.6         7.3(-03)   5.4(-03)   2.8(-05)    1.8(-06)   1.9(-03)   2.75    2.75    3.40
5       100   25.0         3.5(-02)   2.6(-02)   1.3(-04)    5.0(-06)   8.5(-03)   3.00    3.00    3.40
6       160   25.2         1.4(-01)   1.0(-01)   5.3(-04)    1.3(-05)   3.0(-02)   3.15    3.15    3.40
7       204   25.2         2.8(-01)   2.1(-01)   1.1(-03)    2.1(-05)   6.0(-02)   3.25    3.26    3.40
8       240   24.8         4.6(-01)   3.5(-01)   1.8(-03)    2.9(-05)   1.0(-01)   3.28    3.30    3.40
9       290   25.6         8.2(-01)   6.2(-01)   3.2(-03)    4.2(-05)   1.8(-01)   3.30    3.33    3.40
10      360   26.0         1.56       1.18       6.2(-03)    6.5(-05)   3.5(-01)   3.32    3.35    3.40

a Machine A was the SDSC CRAY C-90 (1 CPU); machine B was the Caltech Paragon (64 nodes). The former was used to perform the LHSF calculation and the latter the LOG-D propagations. The measured times are in hours and the measured network speeds are in Mbytes/sec. The numbers in parentheses are powers of 10. See the text for the meanings of the column headings.
b The measured speedup gains.
c The predicted speedup gains that use the measured f(N).
d The predicted speedup gains that assume the pure N³ relation for the LHSF and LOG-D calculations.


The solid lines (without solid circles) are the analytical curves that assume the pure N³ relation for the LHSF and LOG-D calculations, while the dashed lines are the analytical curves that use the measured g(N). The solid circles connected by straight lines represent the experimentally observed speedup gains. It can be seen from Tables 4.3 and 4.4 and Figures 4.2 and 4.3 that, for all but the smallest problem sizes considered, appreciable superlinear speedup was achieved, and that the model using the measured g(N) agrees well with the measurements.

4.6.2 Three-Machine Configuration

In order to demonstrate the feasibility of running 3D-REACT on three machines, we also conducted nine experiments for problem sizes ranging from five to 500 channels. All the measurements for the three-machine configuration were made on the same H + D2 reaction at a relative translational energy E_tr = 1.29 eV. The SDSC C-90 (8 CPUs) and the LANL CRAY Y-MP (1 CPU) were used to perform the LHSF calculations, while the Caltech Paragon (128 nodes) was used to perform the LOG-D propagations. For the 500-channel problem, about 510 Mbytes of data had to be transferred through the network, with 90% of the message sizes around 2 Mbytes; data were transferred about every 14 seconds. The machine-independent parameters for this reaction were n = 50 and N^LOG-D = 8. The quantities f(N)^LHSF and f(N)^LOG-D were measured on the CRAY C-90 for different problem sizes; for large N, f(N)^LHSF = 188 Mflops/channel³ and f(N)^LOG-D = 104 Mflops/channel³. For the Intel Paragon (128 processors), the machine-dependent parameters α_B^LHSF and α_B^LOG-D were 255 and 2,800 Mflops/sec, respectively. These parameters were 2,400 and 2,432 Mflops/sec for the CRAY C-90 (8 CPUs) and 230 and 250 Mflops/sec for the CRAY Y-MP (see Table 4.1).
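The quoted message sizes and total data volume are mutually consistent, as the following sketch shows. It assumes, per Section 4.4.2, one overlap matrix plus four propagation matrices per sector and 51 hyperradius values; these per-sector counts are carried over from the earlier description rather than restated in this section.

```python
# Consistency check of the quoted message sizes for the 500-channel run.
n_channels = 500
bytes_per_matrix = 8 * n_channels ** 2      # N x N matrices of 64-bit words
matrices_per_sector = 5                     # overlap + 4 propagation matrices
sectors = 51                                # hyperradius values (Section 4.4.1)

print(f"{bytes_per_matrix/1e6:.1f} Mbytes per message")                         # ~2.0
print(f"{matrices_per_sector * sectors * bytes_per_matrix/1e6:.0f} Mbytes total")  # ~510
```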

Figure 4.2. The predicted and measured speedup gains for the two-machine configuration, involving the Caltech Delta (64 nodes) to SDSC CRAY C-90 (1 CPU) link, as a function of the number of channels N times the network speed S_net^user available for a single user (Mbytes/sec). The solid lines, dashed lines, and solid circles are explained in the text. The details of the measurements are given in Table 4.3.


Table 4.5 shows the measurements of nine complete runs of both the distributed and stand-alone calculations over the Caltech Paragon – LANL CRAY Y-MP – SDSC C-90 link. Figure 4.4 shows the predicted and observed speedup gains as a function of the number of channels times the network speed available to a single user. Two analytical speedup gains were calculated for the same problem sizes as those for which the speedup gains were measured. The symbols have the same meanings as in Figures 4.2 and 4.3. As can be seen from Table 4.5 and Figure 4.4, the measured speedup gains were about 20% lower than the predictions of the model. The analytical model was based on the assumption that the computational loads were balanced between the machines; because of the large differences in performance between the machines, it was difficult to achieve this load balance. Nevertheless, Table 4.5 and Figure 4.4 show that superlinear gains are still feasible.

Table 4.5 Timings and Gains for the Caltech Paragon – SDSC CRAY C-90 – LANL CRAY Y-MP Link^a

Run #   N     S_net^user   t_A1^(a)   t_A2^(a)   t_B^(a)   t_1        Δt         nτ_B       G_M^b   G_C^c   G_C^d
1       12    20.8         4.2(-05)   3.5(-04)   5.4(-05)  4.1(-08)   9.0(-08)   9.3(-05)   0.45    1.20    3.43
2       45    22.2         6.0(-04)   5.4(-03)   1.8(-03)  2.3(-06)   1.3(-06)   3.5(-04)   1.70    2.45    3.45
3       70    24.3         2.2(-03)   2.1(-02)   7.2(-03)  8.3(-06)   3.1(-06)   1.0(-03)   2.15    2.82    3.46
4       125   24.0         1.3(-02)   1.2(-01)   3.8(-02)  4.8(-05)   9.9(-06)   5.1(-03)   2.50    3.10    3.46
5       164   24.4         2.7(-02)   2.7(-01)   8.2(-02)  1.1(-04)   1.8(-05)   1.0(-02)   2.60    3.20    3.46
6       228   24.6         7.1(-02)   7.0(-01)   2.2(-01)  2.9(-04)   3.3(-05)   2.6(-02)   2.70    3.23    3.46
7       380   25.0         3.4(-01)   3.36       1.03      1.3(-03)   9.1(-05)   1.2(-01)   2.8     3.38    3.46
8       500   25.0         7.6(-01)   7.68       2.31      3.0(-03)   1.6(-04)   2.7(-01)   2.81    3.45    3.46
9       512   25.1         7.9(-01)   8.10       2.48      3.2(-03)   1.6(-04)   2.7(-01)   2.85    3.45    3.46

a Machines A1 and A2 were the SDSC CRAY C-90 (8 CPUs) and the LANL CRAY Y-MP (1 CPU), respectively. Machine B was the Caltech Paragon (128 nodes). The first two were used to perform the LHSF calculation and the third the LOG-D propagations. The measured times are in hours and the measured network speeds are in Mbytes/sec. The numbers in parentheses are powers of 10. See the text for the meanings of the column headings.
b The measured speedup gains.
c The predicted speedup gains that use the measured f(N).
d The predicted speedup gains that assume the pure N³ relation for the LHSF and LOG-D calculations.


Figure 4.3. The predicted and measured speedup gains for the two-machine configuration, involving the Caltech Paragon (64 nodes) to SDSC CRAY C-90 (1 CPU) link, as a function of the number of channels N times the network speed S_net^user available for a single user. The symbols have the same meanings as in Figure 4.2. The details of the measurements are given in Table 4.4.

Figure 4.4. The predicted and measured speedup gains for the three-machine configuration, involving the Caltech Paragon (128 nodes) to SDSC CRAY C-90 (8 CPUs) and LANL CRAY Y-MP (1 CPU) links, as a function of the number of channels N times the network speed S_net^user available for a single user. The symbols have the same meaning as in Figure 4.2. The details of the measurements are given in Table 4.5.


4.7 Discussion and Lessons Learned

The agreement between the analytical model, which has no adjustable parameters, and the observed speedup gains showed in all cases that this application is well behaved and that all the important issues in this distributed application are accounted for. The assumptions that the LHSF computation time is dominated by the eigenvalue/eigenvector routine and that the LOG-D computation time is dominated by the matrix inverter are, in general, valid for large problem sizes. However, for small problem sizes the operation count of the other numerical procedures is no longer negligible; the discrepancy between the solid model curves in Figures 4.2–4.4 and the measured speedups for the small problem sizes arose from that fact. Given the simplicity of the operation counts used in the model, it is remarkable that the speedup gain estimates were close to the observed ones. An improved analytical model, which used the experimentally measured operation counts of the codes, gave speedup gains much closer to the measured ones. This model can now be used to design distributions of the application among a larger number of computers while still achieving superlinear speedup.

What is needed to bring this heterogeneous distributed application into routine computational chemistry production? For this sort of job to be run with any frequency in the future, the computing environment would need to be modified to give additional consideration to heterogeneous distributed supercomputing. In particular, to obtain maximum utilization of the machines, all the processes would have to be time-shared. Then, by judicious choice of the time slice, it may still be possible to hide much of the latency of the network transfer as well as the cost of the remote execution. In a time-shared environment, another job could be swapped in while either the network or the remote machines are busy. Therefore, although the distributed application itself would not complete faster, the throughput of the heterogeneous distributed supercomputing system would be greater. It is in this sense that heterogeneous distributed supercomputing has the greatest potential.

An alternative approach for distributing the 3D-REACT application over the network is a data decomposition scheme, in which the data domain is mapped onto the individual computers. All the computers within the network perform the same task simultaneously, each on a different subset of the data, and each must exchange boundary information for its subset through the high-speed network. With this approach, a network of heterogeneous supercomputers can again be viewed as a metacomputer, and scaling up the problem size or increasing the number of machines on the network should be relatively easy. However, high communication latency and data congestion are expected to be severe because of the nature of the communication scheme. For some numerical algorithms, node-to-node communication between different machines would be required, which is difficult to achieve with a limited number of physical HIPPI connections. In addition, the data transmission rate between each machine and the network will differ, so sophisticated software tools would be required to achieve an appropriate load balance across the network. Both software and hardware have to be improved to make data decomposition feasible.
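For illustration only, a generic skeleton of such a data-decomposition scheme with boundary exchange is sketched below. The one-dimensional partitioning and the trivial step function are hypothetical placeholders, not the scheme implemented in 3D-REACT.

```python
# Generic sketch of data decomposition with boundary exchange: each machine
# owns a contiguous block of the data domain and, after every step, exchanges
# only the boundary values with its neighbours over the network.

def step(local_block, left_boundary, right_boundary):
    # Placeholder for the local update, which would use the neighbours'
    # boundary data received over the high-speed network.
    return local_block

def run_data_decomposed(blocks, n_steps):
    for _ in range(n_steps):
        # Boundary exchange between neighbouring machines (here simulated by
        # reading the neighbouring block directly; on a real network this is
        # a pair of sends and receives per machine per step).
        lefts = [None] + [b[-1] for b in blocks[:-1]]
        rights = [b[0] for b in blocks[1:]] + [None]
        blocks = [step(b, l, r) for b, l, r in zip(blocks, lefts, rights)]
    return blocks

print(run_data_decomposed([[1, 2], [3, 4], [5, 6]], n_steps=2))
```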

4.8 New Science

One of the main objectives of this application was to generate new results that have scientific and technological significance. During this study, converged state-to-state differential cross sections were obtained for the H + D2 → HD + D reaction [4, 5]. This is one of the simplest chemical reactions in nature, involving three chemically simple atoms, and it is an important prototype for testing our understanding of the general process of breaking a chemical bond and making a new one. The importance of the geometric phase for state-to-state differential cross sections has been clearly demonstrated for this reaction [4]. The results of calculations of the state-to-state differential cross sections including the geometric phase effect for this reaction at a total energy E of 1.48 eV differ significantly from those excluding this effect, and important differences persist even after summation over all final-state vibrational and rotational quantum numbers [4, 5].


Measurements of this state-to-all differential cross section, using a new reaction product imaging technique [6], are in much better agreement with the calculations that include the effect of the geometric phase than with those that do not. In addition, the calculations including the geometric phase effect display strong resonances, with lifetimes as long as 164 fs, in the general vicinity of E = 1.48 eV, whereas the results excluding this phase effect do not. This is the first prediction of long-lived resonances in a chemical reaction involving repulsive reagents. These results provide insight into why colliding reagent molecules stick together even when there are no strong attractive forces between them. The predictions based on these calculations are being used to guide experiments aimed at verifying the existence of this phenomenon, which is of major importance for the foundations of chemical dynamics.

4.9 Scientific Results

Section 4.8 describes the geometric phase effect discovered during the CASA project. In this section, we summarize the scientific results obtained in the first three years of CASA. During the first year of CASA, we focused on developing (1) numerical methods appropriate for the distributed computing environment, (2) parallel codes that permit utilization of the eight CPUs of a CRAY Y-MP as a single supercomputer, and (3) parallel codes for the Intel Delta and for linking it to the CRAY. Section 4.9.1 reports on the progress made on these tasks during Year 1. In the second year of the project, we examined the communication bandwidth requirements for the different problem sizes that are realistic to pursue in the CASA network environment. We also implemented versions of the programs for the CRAY X-MP, CRAY Y-MP, and CRAY 2 vector machines, as well as for parallel computers with distributed memory, such as the Caltech-JPL Mark IIIfp hypercube and the Intel Touchstone Delta. This work is summarized in Section 4.9.2. Finally, the results of the calculations performed in Year 3 are given in Section 4.9.3, along with information on the status of the code at that point in the project.

4.9.1 Year 1 Results

4.9.1.1 Numerical Methods

Early efforts in this CASA application included the development of numerical methods appropriate for distributed computing environments. In preparing modifications to the algorithms for the chemical reaction dynamics, we soon discovered that the larger, more complex problems that we planned to tackle with CASA facilities required more care in computing certain quantities, and a major phase of the program needed to be modified to converge to the right answer. In Year 1, we solved these numerical problems before making progress on the new distributed and parallel version of the program. We subsequently implemented and validated the method by running medium-sized problems.

The formalism used for calculating atom-diatom cross sections, described previously [3, 4], consists of expanding the scattering wave function in the local hyperspherical surface functions (LHSF) and solving the resulting coupled ordinary second-order linear differential equations (the scattering equations). The LHSF are the eigenfunctions of a particular Hamiltonian. Previously, this method had been implemented on several CRAY machines and on the Caltech-JPL Mark IIIfp hypercube, and was used to compute small values of the total angular momentum quantum number J [7]. However, when applying this method to J > 3, the efficient but supercomplete basis set used to calculate the LHSF developed a severe linear dependence problem that required a new approach. A new trigonometric basis set was selected which, although still efficient, was no longer supercomplete [1] and therefore no longer displayed linear dependence problems. In addition, the LHSF were replaced by pseudo-surface functions. This new approach allowed the calculations to be extended to all total angular momenta (about 50) needed to obtain converged integral and differential cross sections for the H + H2 system over a relatively wide range of total energies. An example of these calculations, performed on CRAY 2 and CRAY Y-MP machines, is given in Figure 4.5. The bumps on the surface are the manifestation of dynamical resonances, which play an important role in chemical reactions. This H + H2 system displays a conical intersection between its ground and first excited electronic states. Associated with this conical intersection is a geometric phase effect whose implications for reaction cross sections had not been previously calculated [1, 2].

Figure 4.5. Differential cross sections, summed and averaged over projection quantum numbers, as a function of scattering angle and energy for the v = 0, j = 0 → v' = 1, j' = 1 transition, where the symbols v and j label an asymptotic state of the H + H2 system, v, j (v', j') being the quantum numbers of the initial (final) H2 state.


The new methodology permitted us to make such calculations and to conclude that this is an important effect, which significantly alters the cross sections. Figure 4.6 illustrates the influence of the geometric phase effect on a differential cross section at a total energy of 0.7 eV; this influence becomes even greater at higher energies. Thus, in the process of improving the computational methodology, important new science was generated.

4.9.1.2 Parallel Code for the CRAY Y-MP

The solution of the scattering equations was accomplished using a logarithmic derivative method. The parallel version of this code, developed originally for the Mark IIIfp hypercube, was successfully ported to the eight CPUs of the SDSC CRAY Y-MP under the Express operating system in Year 1. Table 4.6 gives the results of test runs for systems of 120 and 252 differential equations (i.e., channels) using one, two, four, and eight CPUs. For the 120-channel case the efficiency of the eight-CPU calculation is 53%, compared with 67% for the 252-channel case; as expected, the efficiency increases with the number of channels.

Figure 4.6. H + p-H2(0,0) → p-H2(0,2) + H differential cross sections (bohr²/steradian) at 0.7 eV as a function of the scattering angle θ (degrees). The cross sections are summed and averaged over projection quantum numbers for the v = 0, j = 0 → v' = 1, j' = 1 transition at a total energy of 0.7 eV. The squares (circles) correspond to the case in which the geometric phase is included (not included).


Table 4.6 Performance of the Logarithmic Derivative Code Implemented in Fortran with Express Calls for Parallelization

                        Speed (Megaflops)                      Efficiency (%)
Number of Processors    252-Channel     120-Channel            252-Channel     120-Channel
1                       172             165                    100             100
2                       310             272                    90              82
4                       571             440                    83              67
8                       920             704                    67              53

Note: This code was run on the eight-processor CRAY Y-MP at the San Diego Supercomputer Center.
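The efficiencies in Table 4.6 follow directly from the measured speeds, with efficiency defined as speed(p)/(p × speed(1)); a quick check for the 252-channel problem:

```python
# Parallel efficiency check for the 252-channel entries of Table 4.6.
speeds_252 = {1: 172, 2: 310, 4: 571, 8: 920}    # Megaflops
for p, s in speeds_252.items():
    print(p, f"{100 * s / (p * speeds_252[1]):.0f}%")   # 100, 90, 83, 67
```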

These results confirm that it is relatively simple to use the CRAY Y-MP as a parallel computer for this problem and that porting codes written for parallel MIMD machines with distributed memory to the CRAY Y-MP is straightforward.

4.9.2 Year 2 Results

In the second year of the CASA project, we implemented versions of the programs for the CRAY X-MP, CRAY Y-MP, and CRAY 2 vector machines, as well as for parallel computers with distributed memory, such as the Caltech-JPL Mark IIIfp hypercube and the Intel Touchstone Delta. Section 4.9.1 (above) provides the Year 1 test results of porting the Mark IIIfp hypercube logarithmic derivative code to the eight CPUs of a CRAY Y-MP/864 under the Express parallel programming environment, treating them as if they were nodes of a message-passing parallel computer, for the H + H2 reaction with 120 and 252 channels. In Year 2, we conducted several experiments with the logarithmic derivative code, using a version of the Express programming environment to perform calculations on two geographically separated CRAY Y-MPs, one at SDSC in La Jolla and the other at the Jet Propulsion Laboratory in Pasadena, over the T1 network link that was in place at the time.

Table 4.7 gives the results of runs for a system of 250 differential equations (a 250-channel problem) using one CPU at each site. The matrix elements needed for the logarithmic derivative propagation were calculated and stored on both machines. For unidirectional communication, the machine at SDSC read the matrix elements from disk and sent them to JPL in messages of different sizes through the network, while the JPL machine received the messages and performed the integrations. For bidirectional communication, both machines read in the matrices, sent them to each other, and performed the integrations. The main objective of these experiments was to test the programs and to evaluate the programming environment for network-connected supercomputers; therefore, the tests were performed under the normal multi-user conditions of those computers rather than on dedicated machines. We found that it was relatively simple to port programs written for parallel MIMD machines with distributed memory under the network Express programming environment. The results are summarized in Table 4.7.


During daytime operation, the unidirectional communication speed from SDSC to JPL was generally in the range of 18 to 23 Kbytes/sec, whereas at night it was about 23 to 24 Kbytes/sec. The total data transmitted in each direction in this test was 75 Mbytes (150 matrices, each of dimension 250 × 250). The total communication time in each direction ranged from 50 to 62 minutes, while the computation time at each end was about half a minute. Obviously, at T1 speeds this mode of operation is inappropriate. During Year 2 the typical computer speed in the CASA computational environment was around 10 Gflops, if we take an entire machine (Delta, CM-2) as one processor; the computation time would then decrease by a factor of 40 for this particular mode of operation. If large messages (1 Mword) are sent through the network, the ratio of the time to send one 64-bit word from one processor's memory to the other processor to the time each processor takes to perform one floating point operation approaches 100, which gives an overhead of approximately 10⁻⁵. Since the logarithmic derivative calculation is energy independent, we can work on several energies at each end using the same matrices, to hide the communication latency.

Also in Year 2, progress was made in developing new parallel programs to calculate the LHSF on the Intel Delta, using a slightly different algorithm from that employed in the previous Mark IIIfp codes. This was important for overcoming the linear dependence difficulties that the previous method displayed for large angular momenta. Two major components, namely the matrix inversion and the symmetric eigenvalue solver, were developed for the Intel Touchstone Delta in Year 2; Figures 4.7 and 4.8 summarize the results of test runs for several configurations. As expected, the speed and efficiency increased with the dimensions of the matrices. The matrix inverter operated at satisfactory speeds, and we initiated efforts to use BLAS routines that would allow the parallel eigensolver to operate faster.
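A quick consistency check of these Year 2 numbers, using only the values quoted above:

```python
# 150 matrices of dimension 250 x 250 (64-bit words) transferred at the
# measured T1 rates of roughly 20-25 Kbytes/sec.
n_matrices, dim = 150, 250
total_bytes = n_matrices * dim * dim * 8
print(f"{total_bytes/1e6:.0f} Mbytes per direction")           # 75

for kb_per_s in (20, 25):
    minutes = total_bytes / (kb_per_s * 1e3) / 60.0
    print(f"{kb_per_s} Kbytes/sec -> {minutes:.0f} min")        # ~62 and ~50
```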

Table 4.7 Testing of the Logarithmic Derivative Code on Two Geographically Separated CRAY Y-MPs

                                           Message Length (Kbytes)   Communication Speed (Kbytes/sec)
Bidirectional Communication, Daytime       500                       40
                                           1000                      41
                                           2000                      45
Bidirectional Communication, Nighttime     500                       48
                                           1000                      48
                                           2000                      50
Unidirectional Communication, Nighttime    500                       24
                                           1000                      24
                                           2000                      25

4.9.2.1 Issues on Size of the Problem and Network Considerations

The efficiency of an application distributed across a network is affected by network latency. When designing algorithms intended for network applications, it is important to overlap data transmission with computation so that the communication time can be masked. The optimal decomposition for the quantum chemical reaction dynamics calculation depends on the CPU power of the supercomputers linked by the high-speed network. As an example, Table 4.8 lists a wide range of computer speeds appropriate for the next five years, the problem sizes appropriate for those computers, and the associated minimum network communication speed needed for a computational strategy in which one machine calculates the surface functions and propagation matrices while the other integrates the corresponding coupled differential equations using the logarithmic derivative code.

Table 4.8 Computer Speeds and Network Speeds for Different Problem Sizes

Sustained Computer Speed (Gflops)   Problem Size (Number of Channels)   Appropriate Network Speed (Gbps)
0.1                                 800                                 0.04
0.2                                 1000                                0.07
1.0                                 1700                                0.2
10                                  3700                                0.9
100                                 6800                                2.5
1000                                19000                               10

For two computers, each having the sustained speed of a single CRAY Y-MP CPU (~0.2 Gflops), a 1000-channel problem such as the simple H + H2 reaction requires 15 minutes to calculate the LHSFs for each irreducible representation and eight minutes to propagate one energy. About 200 matrices (1000 × 1000 each) would be generated and would need to be transmitted through the network for propagation. Computer 1 calculates the LHSFs and sends one matrix at a time to computer 2, where the logarithmic derivative propagation is performed. Computer 2 is idle for about 4.5 seconds while it waits for the first matrix to arrive; after that, a pipeline of data flows from computer 1 to computer 2. If we assume a network efficiency of 70%, a network speed of 0.07 Gbps would be sufficient. However, for two computers in the 10-Gflops class (such as the Delta and the CM-2), 3,700-channel problems, such as the F + HD or O + H2 reactions, could be run, and a 0.9-Gbps network would be needed. For a machine capable of 100 Gflops, the problem size could be increased to 6,800 channels, but the network speed required would be 2.5 Gbps.

4.9.3 Year 3 Results

Since the discovery of quantum mechanics, the ability to predict the rates and cross sections of elementary chemical reactions from first principles has been a central goal of theoretical chemistry. However, the equations involved are too complicated to be solved analytically, and only in the last six years or so have theorists, aided by efficient methodologies and access to supercomputers, been able to predict numerically the cross sections of some simple chemical reactions in sufficient detail for comparison with experiments. The chemical reaction focus in Year 3 was the reaction between a deuterium (D) atom (an isotope of hydrogen) and a hydrogen molecule (H2): D + H2 → DH + H. This is one of the simplest chemical reactions in nature, involving three chemically simple atoms, and is an important prototype for testing our understanding of the process of breaking a chemical bond and making a new one.

Figure 4.7. Performance of the Matrix Inverter on the Intel Touchstone Delta

Figure 4.8. Performance of the Symmetric Eigenvalue Solver on the Intel Touchstone Delta

4.9.3.1 Geometric Phase Effect

Recent experiments have measured the rate of reaction of vibrationally excited H2 to form vibrationally excited DH in different rotational states. Converged calculations for this system, performed under this project, strained the ability of a large vector supercomputer and led to the discovery of an observable effect, the geometric phase effect (see Figure 4.9), not previously detected in chemical reactions. Some other interesting chemical reactions are those of a fluorine atom with a hydrogen molecule (F + H2 → FH + H) and of an oxygen atom with a hydrogen molecule (O + H2 → OH + H). The former reaction is relevant to a powerful chemical laser, and the latter is one of the most important processes occurring in the burning of the hydrogen-oxygen mixtures used in the propulsion system of the space shuttle. Calculations of the cross sections of these reactions over a wide range of energies are of major scientific interest but will require more than a single supercomputer, such as a combination of the Intel Touchstone Delta and the eight CPUs of a CRAY Y-MP, or of the Delta and a CM-5.

4.9.3.2 Status of the Code

In Year 3, we implemented versions of the programs for the CRAY Y-MP and CRAY 2 vector machines as well as for distributed-memory parallel computers such as the Intel Touchstone Delta and the Paragon. We were able to perform calculations over all the values of the total angular momentum J needed to achieve convergence of the integral and differential cross sections in the energy range below 2.6 eV for the H + H2 and D + H2 reactions, both with and without inclusion of the geometric phase effect.

During Year 3, we also made a complete demonstration of running the program over the low-speed Internet. This demonstration used the parallel logarithmic derivative integrator, originally developed for the Mark IIIfp hypercube and ported to the SDSC and JPL CRAY Y-MPs, together with the parallel version of the LHSF code running on the Delta under the Network Express operating system. The Intel Delta was used to calculate the surface function and propagation matrices, and the CRAY Y-MP to integrate the corresponding coupled differential equations with the logarithmic derivative code; the two machines communicated over the T1 link.

Two major components of the LHSF code, namely matrix multiplication and inversion, operate at satisfactory speeds (~30 Mflops/node). The symmetric eigenvalue QR solver runs at about 3 Mflops/node and the spectral transform Lanczos solver at about 5 Mflops/node; since only about 20% of the eigenvalues are required, the overall performance of the spectral transform Lanczos eigensolver is satisfactory.

We found that the efficiency of this application distributed across the network is affected by network latency. When designing algorithms intended for network applications, it is important to overlap data transmission with computation so that the communication time can be masked; this strategy was adopted in our codes. The optimal decomposition depends on the CPU power of the supercomputers linked by the high-speed network.
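The overlap strategy can be illustrated schematically with a double-buffered, non-blocking exchange between a matrix producer and the logarithmic derivative consumer. The sketch below is written in terms of MPI (via mpi4py) purely for illustration and is not the project's code, which used the Network Express message-passing system; compute_propagation_matrix and propagate_log_derivative are hypothetical stand-ins for the LHSF and propagation steps, and the dimensions are illustrative.

# Double-buffered pipeline: rank 0 produces propagation matrices, rank 1
# propagates, and non-blocking messages let transfers overlap with computation.
from mpi4py import MPI
import numpy as np

N = 1000        # number of channels (matrix dimension); illustrative value
NSECTORS = 200  # number of propagation matrices; illustrative value

def compute_propagation_matrix(sector):
    # Hypothetical stand-in for the LHSF / propagation-matrix calculation.
    return np.full((N, N), float(sector))

def propagate_log_derivative(matrix):
    # Hypothetical stand-in for one logarithmic derivative propagation step.
    return matrix.sum()

comm = MPI.COMM_WORLD
if comm.Get_size() < 2:
    raise SystemExit("run with at least 2 MPI ranks")

if comm.Get_rank() == 0:
    # Producer: compute the next matrix while the previous one is in flight.
    in_flight = None                     # (request, buffer) keeps the buffer alive
    for s in range(NSECTORS):
        matrix = compute_propagation_matrix(s)
        if in_flight is not None:
            in_flight[0].Wait()          # previous matrix has been sent
        in_flight = (comm.Isend([matrix, MPI.DOUBLE], dest=1, tag=s), matrix)
    in_flight[0].Wait()
elif comm.Get_rank() == 1:
    # Consumer: receive matrix s+1 while propagating through sector s.
    buf = [np.empty((N, N)), np.empty((N, N))]
    req = comm.Irecv([buf[0], MPI.DOUBLE], source=0, tag=0)
    for s in range(NSECTORS):
        req.Wait()                       # matrix for sector s has arrived
        if s + 1 < NSECTORS:
            req = comm.Irecv([buf[(s + 1) % 2], MPI.DOUBLE], source=0, tag=s + 1)
        propagate_log_derivative(buf[s % 2])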


Figure 4.9. Comparison of theoretical and experimental rates as a function of the rotational state quantum number j' for D + H2 (v = 1, j = 1) → DH (v' = 1, j') + H at an average relative translational energy of 1.0 eV (total energy E = 1.8 eV). This average includes contributions from the slow and fast D atoms arising from the photodissociation of DBr. The solid circles represent the experimental results, the open squares the corresponding theoretical results that do not include the geometric phase effect (NGP), and the open circles the theoretical results that include that effect (GP). The sum of the relative experimental or theoretical rates (left ordinate scale) over all product rotational states was normalized to unity. The ordinate scales at the right give the absolute NGP and GP rate coefficients in units of 10⁻¹¹ cm³ s⁻¹ molec⁻¹.

At the end of Year 3, we began work to extend the calculations of the differential cross section for the F + HD system (which is computationally more demanding and scientifically more interesting than F + H2) across the JPL-Caltech HIPPI link, at energies spanning the Feshbach resonance. The plan was to use the Intel Delta to calculate the surface functions and propagation matrices and the two CPUs of the JPL CRAY Y-MP to integrate the corresponding coupled differential equations with the logarithmic derivative code. The purpose of these runs would be to characterize the timing aspects of this calculation, particularly in relation to the pipelining of data transmissions. The final results of the quantum chemistry experiments performed using CASA gigabit network facilities, completed after Year 3 of this project, are described in detail in section 4.6 (above).


4.10 Summary

We described in this section the implementation and performance of a quantum chemical reaction dynamics application distributed over the CASA gigabit heterogeneous computing environment, whose machines are linked by a wide-area HIPPI connection. We first identified the issues involved in developing this application and then reported the results of several performance measurements; both analytical and simulation results were presented. We showed that the distribution of this application across disparate supercomputing platforms is feasible and that superlinear performance can be achieved. Finally, we included some scientific results obtained during the first three years of this project.

4.11 References

[1] Y. M. Wu, A. Kuppermann, and B. Lepetit, "Theoretical calculation of experimentally observable consequences of the geometric phase in chemical reaction cross sections," Chem. Phys. Lett., 186, 1991, 319.
[2] Y. M. Wu and A. Kuppermann, "Prediction of the effect of the geometric phase on product rotational state distributions and integral cross sections," Chem. Phys. Lett., 201, 1993, 178.
[3] A. Kuppermann and Y. M. Wu, "The geometric phase effect shows up in chemical reactions," Chem. Phys. Lett., 205, 1993, 577.
[4] Y. M. Wu and A. Kuppermann, "The importance of the geometric phase effect for the H + D2 → HD + D reaction," Chem. Phys. Lett., 235, 1995, 105.
[5] A. Kuppermann and Y. M. Wu, "The quantitative prediction and lifetime of a pronounced reactive scattering resonance," Chem. Phys. Lett., 241, 1995, 229.
[6] T. N. Kitsopoulos, M. A. Buntine, D. P. Baldwin, R. N. Zare, and D. W. Chandler, "Reaction-product imaging — the H + D2 reaction," Science, 260, 1993, 1605.
[7] A. Kuppermann and Y. M. Wu, "Quantum chemical reaction dynamics on a highly parallel supercomputer," Science, 260, 1993, 1605.
