FEATURE ARTICLE

www.rsc.org/materials | Journal of Materials Chemistry

Terascale materials modelling on high performance system HPCx

M. Plummer,*a J. Hein,b M. F. Guest,a K. J. D’Mellow,b I. J. Bush,a K. Refson,c G. J. Pringle,b L. Smithb and A. Trewb

Received 6th January 2006, Accepted 7th March 2006
First published as an Advance Article on the web 12th April 2006
DOI: 10.1039/b600182n
J. Mater. Chem., 2006, 16, 1885–1893

We describe the HPCx UoE Ltd national computing resource HPCx Phase 2 as used in 2004 and 2005. We describe the work of the HPCx ‘terascaling team’ and how this work, in collaboration with scientists and code developers, allows for efficient exploitation of large-scale computational resources to produce new science as described in the rest of this volume. We emphasize the need for scientists and code developers to have an understanding of the peculiarities of the national and international facilities they use to generate their data. We give some examples of successful application code optimization in materials chemistry on HPCx. We introduce HPCx Phase 2A, which entered service in November 2005.

a CCLRC Daresbury Laboratory, Daresbury, Cheshire, UK WA4 4AD. E-mail: [email protected]; Fax: +44 1925603634; Tel: +44 1925603244
b EPCC, King’s Buildings, University of Edinburgh, Edinburgh, UK EH9 3JZ
c CCLRC Rutherford Appleton Laboratory, Chilton, Didcot, Oxfordshire, UK OX11 0QX

Dr Martin Plummer is a computational scientist and theoretical atomic, molecular and optical physicist with publications concerning electron and positron scattering by light atoms and molecules and multiphoton ionization in a laser field. His current terascaling team interests include computational development of the CASTEP package and studies of the solvation properties of nanoparticles using DL_POLY.

Martyn Guest is associate director of CCLRC Computational Science and Engineering and leads the HPCx terascaling team. He is presently an honorary professor in computer science at the University of Cardiff and an honorary reader in chemistry at the University of Manchester. Dr Guest is lead author of the GAMESS-UK electronic structure program, and has written or contributed to more than 200 articles in the areas of theoretical and computational chemistry, and high performance computing.

Dr Joachim Hein received his PhD from Hamburg University (Germany) for numerical work in lattice Quantum Field Theory. He has since held post-doctoral positions at the University of Glasgow, Cornell University and the University of Edinburgh. He is presently working at EPCC and has a strong interest in parallel computing.

Dr Kenton D’Mellow hails from a cosmology background, obtaining his PhD at the University of Edinburgh’s Institute for Astronomy. Following two post-doctoral positions and a spell in industry, he joined the Edinburgh Parallel Computing Centre as an applications consultant, and works as part of the HPCx terascaling team.

1. Introduction

The effective exploitation of current high performance computing (HPC) platforms, in particular the HPCx system, relies on the ability of the present generation of application codes to make effective utilization of these platforms and their components.


HPCx is the UK’s largest national HPC resource, operated by a consortium consisting of the University of Edinburgh, Edinburgh Parallel Computing Centre (EPCC), the Council for the Central Laboratory of the Research Councils (CCLRC) Daresbury Laboratory and IBM. Evolving national and international HPC resources allow scientists to plan more ambitious calculations and simulations to achieve more realistic modelling of nature. In this article we complement the individual reports of these achievements by the Materials Chemistry Consortium with a description of the work of HPCx staff, in particular the ‘terascaling team’, in managing and improving the large parallel application codes used by the consortium.

Ian Bush studied for both his first and second degrees at Oxford University, his DPhil being with Professor Paul Madden. He moved to Daresbury Laboratory in 1992, and has since supported and developed a number of parallel codes, mostly of applicability to the materials consortium. These codes include DL_POLY, CRYSTAL, GAMESS-UK and CASTEP.

In order to transform the desires of materials scientists to perform more effective simulations of increasingly large systems into practical and efficient computations, the codes should ideally exploit the full power of HPCx through effective utilization of all available resources: CPUs, memory, communication strategies that are sympathetic to the system and, in many cases, I/O. The terascaling team are in a position to take a broad view of the application codes as a collection and to use experience gained in optimizing particular codes to suggest and implement improvements for other codes with similar requirements, in conjunction with, and with appropriate permission from, the code authors. Similarly, the terascaling team should make efforts to ensure that scientific code developers are fully aware of the kind of practical code they need to write in order to perform useful new science efficiently on large platforms. We hope this article contains useful pointers for scientists coding large parallel applications. The HPCx website1 contains information on various matters concerning systems hardware and software, plus an ongoing series of Technical Reports describing HPCx staff investigations of the system and optimization of applications.

Dr Keith Refson is a computational physicist specialising in atomistic simulation methods, particularly plane-wave DFT and molecular dynamics methods. He has published in the fields of mineral physics, condensed-matter physics and computational methods. He is a co-author of the CASTEP code and the author of the ‘moldy’ molecular dynamics simulation package.

Gavin Pringle received a PhD in CFD from Napier University, with Edinburgh University and the University of California at Berkeley as cooperating institutions. After a 4 year post-doc at Napier, researching the Lagrangian simulation of turbulent fluid flow over a sphere, he moved to EPCC and is currently a principal consultant.

Dr Arthur Trew is the director of EPCC, a European centre of expertise in developing high performance, novel computing solutions, managing advanced systems and providing HPC training. Arthur is also deputy director of the National e-Science Centre.

Dr Lorna Smith is Edinburgh Parallel Computing Centre’s (EPCC) high performance computing (HPC) research manager, and is responsible for a wide range of projects in the HPC area. She has a particular interest in terascaling applications and in investigating new languages and models for high performance computing.

2. A brief technical overview

A key characteristic of today’s parallel machines is the presence of a deep memory hierarchy. HPCx Phase 2, in operation during 2004–2005, utilized IBM p690+ frames equipped with 1.7 GHz Power 4+ processors.

These processors have an individual L1 data cache of 32 kBytes and are arranged as dual-processor units sharing a 1.5 MByte L2 cache. Four ‘duals’ made up an ‘MCM’ (multi-chip module) unit and four MCM units combined to make up the HPCx Phase 2 Logical Partition (LPAR), with a shared L3 cache of 512 MBytes and a user-available memory of ~27 GBytes. Communications within an LPAR made full use of the shared memory facility, whereas communications between LPARs utilized the IBM High Performance Switch (HPS) and associated management software (through the lifetime of the HPCx service we receive ongoing software upgrades, or Service Packs, labelled SPn). HPCx is a ‘clustered SMP system’ (SMP stands for shared memory processing or programming). The Phase 2 system had 49 LPARs available for production, and up to 1280 processors available for individual jobs. Further technical details may be found in ref. 2.

Parallel programs on HPCx may use MPI (Message Passing Interface) for general communications and OpenMP (shared memory communication) within an LPAR. HPCx Technical Report HPCxTR0501 (ref. 3) provides a useful introduction and more detailed references for these communication methods. MPI programs do not assume any shared data between processes, although the IBM implementation does take advantage of the SMP structure within an LPAR to maximize communication speed as noted above. This effectively means that mixed-mode programming (MPI between LPARs, OpenMP within an LPAR) is not generally advantageous compared to pure MPI programming unless the parallelization has been specifically designed and optimized to make use of particular features of the OpenMP method.3 Some programs, for example GAMESS-UK discussed in section 4 below, may use direct calls to the IBM proprietary one-sided communication library LAPI.4

Following the SP7 and SP9 software upgrades the HPS MPI bandwidth, or message data-passing speed, peaked at an aggregate value of around 8 GBytes per second for 32 simultaneous messages between two LPARs, with the latency (point-to-point start-up time for messages) at about 6 microseconds.4 Environment variables ensuring efficient allocation of shared memory and tasks to processors within an LPAR ensure that ‘internal’ communication remains much faster than LPAR to LPAR communication (these environment variables are recommended for all applications and are detailed on the HPCx website1). A recent development with upgrades SP12/13 and above has been the introduction of the RDMA (remote direct memory access) facility,4 which can double the bandwidth of individual point-to-point messages in the limit of large messages.

Finally in this technical overview we note that towards the end of 2005 the Power 4+ HPCx system was upgraded to a state-of-the-art Power 5 system.2 HPCx Phase 2A is based on Power 5 p5-575 dual-core compute nodes with a clock frequency of 1.5 GHz; each node (LPAR) has 16 CPUs and 32 GB of memory. The initial system installed in November consisted of 96 nodes connected through the High Performance Switch, making a total of 1536 Power 5 CPUs, which should ensure that the new system at least matches the performance of the Power 4+ based Phase 2 system. Moreover, each processor has double the memory of the Phase 2 processors.

The total memory is more than 3 TB, almost double the Phase 2 total. This enhanced memory addresses one of the most important issues raised by users and should allow significant new capability science. A particular improvement with Power 5 is that the L3 cache is now directly associated with a particular dual-core chip, as is access to the main memory, thus greatly improving memory bandwidth.2 The L2 cache is increased to 1.875 MBytes per dual-core chip and the L3 cache is now 36 MBytes per dual-core chip. The new system is also more energy efficient, a very important consideration if large parallel systems are to continue to increase in size. Selected early application code benchmark tests of this system are presented in section 6.

A further upgrade, Phase 3, will occur during 2006. This will involve adding additional Power 5 nodes onto the switch infrastructure to roughly double the performance and the total memory.
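For readers unfamiliar with the mixed-mode model mentioned in the overview above (MPI between LPARs, OpenMP threads within an LPAR), the following minimal sketch shows how the two layers coexist in one program. It is an illustration only, written in C rather than the Fortran used by most of the codes discussed here, and it is not HPCx production code; compiler wrappers and flags differ between systems.

    /* Minimal mixed-mode sketch: one MPI task per LPAR, OpenMP threads over
       the CPUs within it.  Illustration only, not an HPCx production code. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank, nranks;

        /* Ask for an MPI library that tolerates OpenMP threads (the threads
           themselves do not call MPI here, so FUNNELED is sufficient). */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        #pragma omp parallel
        {
            printf("MPI task %d of %d, OpenMP thread %d of %d\n",
                   rank, nranks, omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }

As the text notes, such a mixed-mode code only pays off when the OpenMP level has been designed and optimized deliberately; with the recommended environment settings a pure MPI code is otherwise the simpler choice.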

3. HPCx parallel programming: general points

In order for application codes to run efficiently on HPCx, they should be designed to take account of the system’s individual peculiarities as well as obeying general ground rules for parallel codes. The two basic ground rules are that communications between processors, though essential, should be minimal to avoid communications overheads, and that the large quantities of data required for the application should be sensibly distributed over the available processor cache and memory in order to avoid overloading the memory on individual processors. On HPCx the data should ideally be arranged so that the particular ‘chunks’ of data to be operated on fit into the cache structure described above. Similarly, communications systems need to take into account the ‘fast internal’ communication within an LPAR and the slower communication between LPARs, not to mention the balance between latency and bandwidth and the optimum message sizes to achieve the peak bandwidth.

It could be argued that in an ideal world this latter point should be taken care of by IBM systems software and, for example, the supplied implementations of the MPI library. The following examples indicate that this has taken place partially, and indeed the various IBM software upgrades described above, both environment improvements within LPARs and service pack upgrades managing the HPS, have been extremely important in improving performance. However, these improvements have been supplemented by terascaling team optimizations that have also made a great difference to applications performance.

When discussing communications it is useful to recall Amdahl’s law,5 which in its simplest form may be written as T(N) = T(1)((1 − b) + b/N) for execution time T, N processors and ‘parallel fraction’ of the code b (i.e. the fraction of the code which is parallelizable). However, this expression ignores the overheads introduced by the parallel communications, which reduce parallel performance and can even make the execution time increase with N beyond a large enough processor count. It also ignores memory accessibility issues. If the parallelization strategy includes data distribution (as it ideally should) then more processors implies smaller blocks of data per processor, which for cache-based systems means much faster execution if the blocks of data being worked on fit into the cache.

The equation also assumes perfect ‘load-balancing’, i.e. all processors work equally and in synchronization so that no processors are idle while other processors work. Most parallel codes introduce various book-keeping and communications routines and overheads that are not present in the serial versions of the code, and thus in Amdahl’s law the reference timing should ideally be at least T(2) rather than T(1). Note that as a further complication to Amdahl’s law as quoted above, for clustered SMP units the reference number of processors used for scaling comparisons should take account of how the basic memory is shared if we wish to tune or parameterize the equation for the specific system. Thus on HPCx the reference timing should definitely be at least T(2), as two CPUs share an L2 cache. Arguably, for massively parallel scaling comparisons the reference level could be, for example, T(32) for HPCx Phase 2.
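As a concrete illustration of the discussion above, the sketch below evaluates Amdahl’s law with a crude extra term for communication overhead. The overhead model (a latency cost that grows linearly with the processor count, as if every process exchanged a message with every other) and the numerical constants are illustrative assumptions, not measured HPCx values; the point is only the qualitative behaviour, namely that beyond some processor count the overhead outweighs the shrinking parallel work and the predicted time rises again.

    /* Amdahl's law, T(N) = T_ref*((1 - b) + b/N), plus an illustrative
       communication-overhead term latency*(N - 1).  Constants are made up
       for demonstration; they are not HPCx measurements. */
    #include <stdio.h>

    static double predicted_time(double t_ref, double b, double latency, int nproc)
    {
        double amdahl   = t_ref * ((1.0 - b) + b / nproc);  /* compute time       */
        double overhead = latency * (nproc - 1);            /* communication cost */
        return amdahl + overhead;
    }

    int main(void)
    {
        /* 99.9% parallel fraction, 1000 s reference time, 0.01 s per message */
        for (int n = 32; n <= 2048; n *= 2)
            printf("N = %4d   T = %7.2f s\n",
                   n, predicted_time(1000.0, 0.999, 0.01, n));
        return 0;
    }

With these example numbers the predicted time falls until a few hundred processors and then climbs again, which is the behaviour the text warns about.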

4. Materials chemistry application codes

Most of the main application codes used by the materials chemistry consortium tend to be either molecular dynamics codes with classical interaction potentials between atoms or self-consistent-field density functional theory (DFT) codes in which the electronic structure is determined quantum-mechanically. The latter codes may expand electronic wave functions in a set of (Gaussian) orbitals or as plane-waves. Our example cases of particularly useful code optimizations for HPCx include the orbital-based DFT codes SIESTA6 and CRYSTAL7 and the plane-wave DFT codes CASTEP8 and VASP.9 We also consider the DFT molecular structure code GAMESS-UK10 and the classical molecular dynamics codes DL_POLY11 and, briefly, NAMD.12 More detailed descriptions of some of these codes may be found elsewhere (and in this volume).7,8,11 A review of molecular dynamics code performance on HPCx and on another UK national HPC computer, the SGI Altix 3700 system newton,13 has recently appeared.14

Codes such as SIESTA and CRYSTAL require the construction and diagonalization of very large Hamiltonians made up from the orbital bases. In parallel, the Hamiltonian matrices and the diagonalization should be distributed for efficient calculation. SIESTA is designed to handle systems with up to several thousand atoms. Version 1.3 offers a choice of two algorithms for the diagonalization of the Hamiltonian matrix, an iterative solver or an algorithm diagon which employs a direct solver from the ScaLAPACK library.15 The diagonalization takes up most of the computing time. The iterative solver scales linearly with the number of atoms, whereas diagon scales with the third power of the number of atoms. Other SIESTA algorithms are designed to scale linearly with the number of atoms. It is expected that the iterative solver should become more efficient for extremely large systems. However, on the available benchmarking systems using, for example, 5328 orbitals it is out-performed by diagon. The performance of diagon was investigated by J. Hein.16 By changing the grid of processors over which ScaLAPACK distributes the large Hamiltonian from a block-cyclic 1-dimensional array to a 2-dimensional grid, a substantial increase in performance was obtained.

Fig. 1 1d and 2d grid distributions of a matrix.

The different distributions are illustrated in Fig. 1 for a simple case of four processors p0, p1, p2 and p3. The left hand side shows an n-by-n matrix divided into sub-matrices (blocks) of size n × b, where b is called the block size and can be controlled from the SIESTA input file. The sub-matrices are cyclically distributed onto the processors. As shown in the figure, depending on the values of n and b, not all processors get the same number of sub-matrices and the last sub-matrix might be smaller. If the number of sub-matrices and the number of processors are of similar size there can be load balance problems, as seen in the figure: processor p0 has twice as many elements to work on as processor p3. The right hand side of the figure illustrates a 2-dimensional distribution in which the matrix is divided into small squares of size c × c. Again c is called the block size; however, we emphasize that the optimal block sizes for the two cases may be different. Having divided the matrix, the processors are arranged on a 2-dimensional grid: a 2-by-2 processor grid in this example. The blocks of the matrix are cyclically distributed onto the grid in both directions. Typically for the 2-dimensional decomposition, the number of sub-matrices will be larger than for the 1-dimensional case. This provides more scope to even out load balance problems. Also, when operations need to be carried out on the rows, the 1-dimensional case involves all processors for each row. With large numbers of processors, row operations can involve small ‘chunks’ of data and many messages, leading to inefficiency. In the 2-dimensional case there are independent sub-groups for rows and columns, with no need for communication between sub-groups along rows or columns.

Fig. 2 shows comparison times for a test benchmark system with 5328 orbitals for the overall calculation and for the diagonalizer: please note the logarithmic scale. The grid modification has been incorporated into a forthcoming version of SIESTA.

CRYSTAL is the result of a collaboration between the University of Torino, Italy and the Computational Materials Science group at CCLRC. The terascaling version of the code has fully distributed data and a Gauss–Jacobi parallel diagonalization package17 developed by I. J. Bush for the largest systems, currently the proteins crambin and rusticyanin.7 Fig. 3 shows the scaling of SCF cycles on crambin using up to 1024 processors for various Gaussian-type-orbital (GTO) basis sets compared to ideal linear scaling (details and further examples are given in ref. 7).
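For readers unfamiliar with block-cyclic distributions, the following small sketch shows the standard owner computation for the 2-dimensional case illustrated in Fig. 1, in the ScaLAPACK style with zero-based indices and the first block on process (0,0). The block size and process grid in the example are arbitrary illustrative choices, not SIESTA defaults.

    /* Owner computation for a 2-d block-cyclic distribution: global element
       (i, j) of an n-by-n matrix, block size c, on a pr-by-pc process grid. */
    #include <stdio.h>

    static void owner_2d(int i, int j, int c, int pr, int pc, int *prow, int *pcol)
    {
        *prow = (i / c) % pr;   /* block row, wrapped cyclically over process rows    */
        *pcol = (j / c) % pc;   /* block column, wrapped cyclically over process cols */
    }

    int main(void)
    {
        int prow, pcol;
        /* 2-by-2 process grid as in Fig. 1, with an illustrative block size of 64 */
        owner_2d(1000, 130, 64, 2, 2, &prow, &pcol);
        printf("element (1000,130) lives on process (%d,%d)\n", prow, pcol);
        return 0;
    }

The same formula with pc = 1 recovers the 1-dimensional distribution, which is why the 2-dimensional case generally produces more, smaller blocks and better scope for load balancing.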

Fig. 2 SIESTA benchmark timings for 1d and 2d grids.

Fig. 3 CRYSTAL scaling performance.

The plane-wave codes CASTEP and VASP do not require extremely large diagonalizations but are characterized by many transformations of the plane-wave wave function bases between real and reciprocal space, i.e. 3-dimensional fast Fourier transforms (3d-FFTs). Please note that, due to requests by the code authors, CASTEP and VASP terascaling activities are kept separate from and private to each other (CASTEP is investigated at Daresbury and VASP at Edinburgh).

CASTEP has been carefully optimized by M. Plummer and K. Refson.18 The 3d-FFT is treated using a column distribution of wave functions on the 3d-FFT grid, resulting in two calls to the collective communication routine MPI_AllToAllV per 3d-FFT. In principle this choice allows for the best load-balancing for the whole calculation, but it also places strong demands on latency-bound collective communication, which has a high overhead. In addition to various memory-use and algorithmic improvements, Plummer and Refson have developed an LPAR-customized version of the 3d-FFT which replaces the basic MPI_AllToAllV with an optimum combination of MPI_GatherV, MPI_ScatterV and restricted MPI_AllToAllV calls, which is also responsive to message size and makes the best use of IBM buffered send and receive calls.19
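As background to the customization just described, the schematic below shows the generic communication pattern that a parallel 3d-FFT relies on: an all-to-all redistribution (transpose) of the grid data so that the direction still to be transformed becomes local to each process. It is a generic illustration of MPI_Alltoallv usage with equal block sizes (the C binding of the MPI_AllToAllV routine discussed in the text), not the CASTEP or VASP implementation.

    /* Schematic data transpose inside a parallel 3d-FFT: after transforming the
       locally held directions, every process exchanges blocks with every other
       process.  Illustration of the communication pattern only. */
    #include <mpi.h>
    #include <stdlib.h>

    void fft_transpose(double *sendbuf, double *recvbuf, int block, MPI_Comm comm)
    {
        int nproc;
        MPI_Comm_size(comm, &nproc);

        int *counts = malloc(nproc * sizeof(int));
        int *displs = malloc(nproc * sizeof(int));

        /* Here every process sends an equal block to every other process; in a
           real code the counts vary with the grid decomposition (hence the "V"
           variant of the collective). */
        for (int p = 0; p < nproc; p++) {
            counts[p] = block;
            displs[p] = p * block;
        }

        MPI_Alltoallv(sendbuf, counts, displs, MPI_DOUBLE,
                      recvbuf, counts, displs, MPI_DOUBLE, comm);

        free(counts);
        free(displs);
    }

Because each message is of size block, which shrinks as the processor count grows, this is exactly the latency-bound, small-message regime that the LPAR-aware customizations target.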

Curiously, while the basic IBM MPI_AllToAllV is not ‘LPAR-aware’, tests show that the IBM MPI_AllReduce routine (a secondary source of CASTEP communication overhead) is written to make fairly optimum use of shared-memory communication within LPARs. CASTEP also has a higher level parallelism over Brillouin zone sampling k-points which is low in communications overhead. Careful choice of the number of processors for a given calculation, as an appropriate multiple of the number of k-points, allows for scaling to high processor counts (the 3d-FFT typically scales to 64 and often 128 processors, depending on the problem size). Recent calculations of gold chains on copper surfaces have efficiently utilized more than 500 processors.20 For certain applications CASTEP has an ‘intelligent-task-farming’ ensemble level parallelism which is currently under further development (see section 5). We give examples of CASTEP optimization in section 6.

VASP does not employ k-point parallelism but has parallelism over electron bands as well as a treatment of the 3d-FFT which depends on a mixture of the collective communication routines MPI_AllToAll and MPI_AllToAllV. HPCx investigations have focused on modifying the optimum ratio of band-parallelization to 3d-FFT parallelization21 by optimizing the collectives used in these operations. The code has been migrated to build and run as a 64-bit application, including several bug-fixes that had previously prevented this on HPCx. This allows the use of an IBM-supplied 64-bit version of MPI_AllToAll which is designed to be LPAR-aware, and of a general purpose LPAR-aware MPI_AllToAllV created by A. Jackson and S. Booth22 and adapted for Fortran applications and conformance to the MPI standard by K. J. D’Mellow. This routine is distinct from the CASTEP-specific customized MPI_AllToAllV described above and has some different characteristics.19,22 Both of these optimized collectives exploit the regime of small message sizes (below a few Kb), increasing the scalability notably of the 3d-FFT, but also of other parts of the code. We give examples of VASP performance in section 6. VASP also has an ensemble level ‘intelligent-task-farming’ parallelism for certain applications, which allows significantly better scaling to higher processor numbers.

In contrast to the codes discussed above, parallelization of the electronic structure code GAMESS-UK10 has been based around the Global Array (GA)23 and PeIGS24 tools developed at Pacific Northwest National Laboratory. These allow the matrix algebra (similarity transforms Q†HQ and diagonalization) to be performed in parallel and provide a way to implement a memory-mapped I/O system based on a global shared memory model. As an example of this parallel strategy, we show the performance of the Analytic DFT 2nd derivatives module of GAMESS-UK. The transformed integrals are held in a distributed global array: in a conventional implementation a disk file or direct re-computation strategy would be used. Fig. 4 shows the performance, in arbitrary units, for HPCx Phase 1 and Phase 2, and also for SGI Altix 3700 systems4,13 (with both 1.3 and 1.5 GHz Itanium 2 processors) for comparison, relative to the HPCx Phase 1 IBM p690 running on 32 processors. We note that HPCx Phase 1, which operated from January 2003 to mid 2004, utilized 8-way LPARs, 1.3 GHz Power 4 processors and the older IBM ‘Colony’ interconnect.2

Fig. 4 Performance of the parallel DFT 2nd derivatives module of GAMESS-UK on a number of platforms.

The figure shows that speed-ups of around 1.7 are obtained when doubling the number of processors, leading to a progressive loss of parallel efficiency at higher node counts. This behaviour is typical of replicated data parallel implementations, providing useful parallelism up to perhaps ~200 processors. Although this approach will generally yield increased efficiencies on large problem cases because of the larger proportion of parallel work, the memory requirements on a given node will increase as the problem size increases. This limits the size of a system that can be treated by the replicated data parallel version of GAMESS-UK to around 4000 basis functions on a computer with 1 GByte of memory per processor, regardless of the number of processors used in the calculation. Such calculations are found to be scalable to up to around 500 processors.

Efficient use of the HPCx system required a redesign of the SCF algorithm to make more effective use of the distributed system memory. The terascaling team (M. F. Guest and I. J. Bush) have adopted an approach in which most of the matrices used in the SCF scheme are distributed,25 while a replicated data model is retained for the Fock matrix construction step. In Fig. 5 we show the performance of this implementation, which makes use of MPI-based tools such as ScaLAPACK. These tools were at first found to be significantly more efficient than Global Arrays (using LAPI communications) on HPCx. However, IBM’s SP9 upgrade largely redressed the balance between MPI and LAPI performance. The use of replicated Fock and density matrices has an obvious drawback in that some large replicated data structures are still required (a closed shell Hartree–Fock or DFT calculation requires two replicated matrices). Calculations of up to 12 000 basis functions have been performed on the HPCx Phase 2 system.26
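As a rough order-of-magnitude check on the replicated-data limit quoted above (the arithmetic is ours, for illustration, and is not taken from ref. 25 or 26), a single replicated double-precision square matrix of dimension N = 4000 occupies

\[
  M_{\text{matrix}} \;=\; 8\,\text{bytes}\times N^{2}
  \;=\; 8\times 4000^{2}\,\text{bytes}\;\approx\;128\,\text{MBytes},
\]

so a handful of such replicated arrays, together with integral buffers and the program itself, consumes a substantial fraction of a 1 GByte-per-processor memory, consistent with the limit stated in the text.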

Fig. 6 DL_POLY NaCl benchmark 3d-FFT speed-up for two grid sizes.

As noted above, the recent review by Hein et al.14 describes the performance of various classical molecular dynamics codes on HPCx. Another article in this volume11 describes the new DL_POLY code ‘DL_POLY_3’ together with applications in radiation damage research. DL_POLY_3 has fully distributed data and a locally-developed version of the 3d-FFT that maps directly onto the same domain-decomposition data distribution,27 thus avoiding the large and expensive data redistributions required for standard 3d-FFTs. These changes from the earlier DL_POLY code ‘DL_POLY_2’, which has data replicated on all processors, mean that simulations of, for example, 28 million ions on 1024 processors are now possible and efficient.11 Fig. 6 contrasts the performance of the new 3d-FFT on HPCx with an IBM library routine for two 3d-FFT grid sizes in a 216 000-ion benchmark simulation of NaCl. As the processor count increases, the new routine with minimal communication overhead becomes faster than the library routine (the ‘speed-up’ ratio becomes greater than 1). We note that the larger 3d-FFT grid is normally reserved for systems with 1.5 million or more atoms: we expect a greater speed-up for these larger systems.

All the codes described have benefited in performance from the ongoing IBM improvements to systems software as described above. Recently, tests of the IBM Fortran compiler xlf-v9.1 and the C/C++ compiler xlc7, introduced fully in autumn 2005, showed a general improvement in performance (to varying extent) for all the application codes considered. For example, SIESTA and the bio-molecular simulation code NAMD both show a general performance improvement of around 10% following the upgrade.

Finally in this section we would like to draw attention to another article in this volume28 describing the package ChemShell, which has been used to couple DFT and empirical potential methods in a collaboration including HPCx staff.
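Returning to the DL_POLY_3 domain decomposition mentioned above, the sketch below shows the kind of bookkeeping such a decomposition implies: each process owns a rectangular block of the global grid, determined from its coordinates in the process grid. The even-split function and all names are our own illustration under that assumption; the real code’s decomposition and link-cell handling are more involved (see ref. 27).

    /* Illustrative bookkeeping for a domain decomposition of an nx*ny*nz grid
       over a px*py*pz process grid.  Generic sketch, not the DL_POLY_3 source. */
    #include <stdio.h>

    typedef struct { int lo, hi; } extent;   /* inclusive global index range */

    static extent split(int n, int p, int coord)
    {
        extent e;
        int base = n / p, rem = n % p;       /* spread any remainder over the
                                                first 'rem' processes          */
        e.lo = coord * base + (coord < rem ? coord : rem);
        e.hi = e.lo + base - 1 + (coord < rem ? 1 : 0);
        return e;
    }

    int main(void)
    {
        /* 128^3 grid on a 4 x 4 x 4 process grid: each process owns a 32^3 block */
        extent ex = split(128, 4, 2), ey = split(128, 4, 0), ez = split(128, 4, 3);
        printf("this process owns x %d-%d, y %d-%d, z %d-%d\n",
               ex.lo, ex.hi, ey.lo, ey.hi, ez.lo, ez.hi);
        return 0;
    }

An FFT that works directly on blocks laid out like this avoids the global data redistribution that a conventional slab- or column-based 3d-FFT would require, which is the saving the text describes.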

5. Task farming

Fig. 5 Speed-up for a GAMESS-UK SCF calculation on a zeolite fragment (3975 basis functions) on the HPCx system.


Bearing in mind the earlier discussion of communication overheads affecting Amdahl’s law, the ideal ‘terascaling’ or ‘capability’ code, approaching the use of the full resources of the HPCx system, is one for which the communication is essential but minimal, as minimal communication implies minimal communication overhead.

On the other hand, zero communication, though allowing a parallel fraction of 1 if each processor performs a different calculation, is not a good use of the machine unless all the final data are needed simultaneously. The work could in principle be performed much less expensively on serial machines or local clusters if a job submission is actually a set of, for example, 32-processor jobs that do not communicate with each other but are submitted in the form of a single ‘capability’ job. Some individual calculations may not scale usefully beyond 32–64 processors. If, however, a calculation requires a group of ‘similar but different’ sub-calculations and operates relatively infrequently on the output data of these to set off new sub-calculations or produce global output, we have a useful capability calculation. This is the difference between ‘intelligent’ task farming, which includes feedback and makes use of the sub-calculations, and non-intelligent task farming, which disguises several few-processor calculations as one calculation. Provided the sub-tasks involve similar amounts of work, in order to maintain load-balancing, intelligent task farming calculations can scale extremely well.

The CASTEP and VASP codes contain intelligent task-farming parallelisms for certain applications.18,21 In the CASTEP case this has been applied to preliminary studies of the tautomerization of benzoic acid.29 The CASTEP code is currently under development to widen these applications. The terascaling team has also provided harnesses30 which allow MPI programs to be transformed into intelligent task farms with a master processor controlling the submission of the application code. The scripts also allow non-intelligent task farming: the user should ideally be able to justify the need for this type of use. These scripts are currently being used successfully, for example, in calculations using DL_POLY_2 and with applications such as Gaussian.31 The harness is also currently being tested with SIESTA.
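The task-farm structure described above maps naturally onto a master/worker pattern. The sketch below is a minimal MPI illustration of that pattern, written by us in C; the actual HPCx harnesses of ref. 30 are scripts and control programs that wrap whole application runs rather than code of this form. Rank 0 hands out task indices and collects results, and each worker repeats sub-calculations until told to stop.

    /* Minimal master/worker task farm.  run_task() is a placeholder for one
       sub-calculation; assumes NTASKS >= (number of workers). */
    #include <mpi.h>
    #include <stdio.h>

    #define NTASKS   100
    #define TAG_WORK 1
    #define TAG_STOP 2

    static double run_task(int task) { return (double) task; }   /* placeholder */

    int main(int argc, char **argv)
    {
        int rank, nproc;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nproc);

        if (rank == 0) {                     /* master: farm out the tasks */
            int next = 0, active = 0;
            for (int w = 1; w < nproc; w++) {        /* prime each worker  */
                MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
                next++; active++;
            }
            while (active > 0) {                     /* feed back new work */
                double result;
                MPI_Status st;
                MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD, &st);
                if (next < NTASKS) {
                    MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                             MPI_COMM_WORLD);
                    next++;
                } else {
                    MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                             MPI_COMM_WORLD);
                    active--;
                }
            }
        } else {                             /* worker: run tasks until stopped */
            while (1) {
                int task;
                MPI_Status st;
                MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
                if (st.MPI_TAG == TAG_STOP) break;
                double result = run_task(task);
                MPI_Send(&result, 1, MPI_DOUBLE, 0, TAG_WORK, MPI_COMM_WORLD);
            }
        }
        MPI_Finalize();
        return 0;
    }

The ‘intelligent’ element of the harnesses lies in what the master does with the returned results, using them to decide which sub-calculations to launch next rather than simply draining a fixed list.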

6. Selected initial benchmarks on HPCx Phase 2A

Results for Phase 2A presented in this section are preliminary and represent base levels from which further optimization and performance improvements will follow through 2006 and indeed the life of the HPCx Phase 2A/Phase 3 systems.

We first consider CASTEP in some detail as an excellent example of combined coding and system optimization. Fig. 7 and 8 show how the job time has evolved for two representative benchmark electronic self-consistent-field (SCF) calculations (DFT electronic SCF calculations form the bedrock of all CASTEP applications). Both are of slabs of Al2O3 but with different periodic boundary condition ‘super-cells’. The Fig. 7 benchmark has 120 atoms and 5 k-points whereas the Fig. 8 benchmark has 270 atoms and 2 k-points. The results shown all include the LPAR-customized MPI_AllToAllV, without which the timings requiring MPI_AllToAllV across LPARs would be much slower and the calculations impractical, particularly those of Fig. 8. The effects of this customization are described in detail elsewhere.18,19

We first note that k-point parallelism is extremely efficient: the timings for the 120-atom benchmark, set to run for 24 SCF iterations, are much faster than those for the 270-atom benchmark, which is set to run for 10 SCF iterations.

Fig. 7 CASTEP results for the 120-atom benchmark.

Fig. 8 CASTEP results for the 270-atom benchmark.

(MPI_AllToAllV is performed across independent sub-groups for each k-point: note that the sub-group size is chosen to match the HPCx LPAR structure.) The large increase in performance from Phase 1 to Phase 2 (well above the increase in processor clock-speed) in both cases is due to the introduction of 32-way LPARs and the High Performance Switch. The change between the initial and final Phase 2 values is due to a mixture of code optimization and the introduction of the xlf-v9.1 compiler. We note that the changeover to 16-way LPARs with Phase 2A does not reduce the performance: any disadvantage is easily offset by the improved L3-cache organization and memory bandwidth. This is particularly important for large CASTEP calculations as the code is not currently designed to parallelize over electron bands or sets of atoms.

The final column of each figure shows timings for the newly supplied (at the time of writing) version 3.2.1 of CASTEP following performance profiling on HPCx and subsequent optimization. The ‘initial’ Phase 2A timings are for the fully optimized CASTEP version 3.1. CASTEP 3.2.1 is an intermediate release between CASTEP 3.1 and the forthcoming CASTEP version 4. The improvement in performance in the final column is purely due to code modification and optimization achieved by the Castep Developers’ Group8 and the HPCx terascaling team working together.
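The per-k-point sub-groups mentioned in the preceding paragraph can be pictured as splitting the global communicator into one communicator per k-point, so that the 3d-FFT collectives never cross sub-group boundaries. The sketch below shows the generic MPI mechanism for this (MPI_Comm_split); the names and the assumption that the processor count divides evenly by the number of k-points are ours, and this is not the CASTEP source.

    /* Create one sub-communicator per k-point; the FFT collectives then run
       only within each sub-group.  Assumes nproc is a multiple of nkpoints. */
    #include <mpi.h>

    MPI_Comm kpoint_comm(int nkpoints)
    {
        int rank, nproc;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nproc);

        int group_size = nproc / nkpoints;    /* processors per k-point         */
        int colour     = rank / group_size;   /* which k-point this rank serves */

        MPI_Comm kcomm;
        MPI_Comm_split(MPI_COMM_WORLD, colour, rank, &kcomm);
        return kcomm;                         /* use for the FFT collectives    */
    }

Choosing group_size to match the LPAR width (32 CPUs on Phase 2, 16 on Phase 2A) is what keeps the latency-sensitive collectives inside the fast shared-memory domain.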

Fig. 11 Performance of the parallel Hartree Fock SCF 2nd derivatives module of GAMESS-UK.

Fig. 9 VASP results for the Fe benchmark.

We expect continued increases in CASTEP performance and scaling with version 4 and over the lifetimes of HPCx Phase 2A and Phase 3.

VASP has also shown considerable improvement over recent code revisions and with the migration to Phase 2A. Fig. 9 shows how the job time has evolved for a test benchmark 288-atom cell SCF calculation of bulk Fe, from a 32-bit compilation of VASP 4.6 on Phase 2 of HPCx through to an optimized 64-bit version of VASP 4.6.13.g using the xlf-v9.1 compiler on Phase 2A. VASP 4.6 is ‘standard’ VASP whereas VASP 4.6.13.g is an upgraded version originally optimized for SGI Altix machines,13,21 which, as can be seen, also performs much better than VASP 4.6 on HPCx. In all Phase 2 cases the major parallelization parameter NPAR, which controls the combined degrees of parallelization of the 3d-FFT and eigensolver routines (i.e. over plane-waves and bands respectively), has been set to its optimum value as described by G. J. Pringle.21 For Phase 2 the 3d-FFT should be parallelized over groups of 16 processors, thus having two sets of 3d-FFTs per LPAR. The changeover to 16-way LPARs in Phase 2A has altered the optimal parallelization, as shown in Fig. 10. The sensitivity to NPAR is far less than on Phase 2 once outside the regime where the 3d-FFT is parallelized across more than one LPAR. As can be seen, the optimal NPAR is as a general rule one eighth of the total number of processors, parallelizing the 3d-FFT over 8 processors and again yielding two sets of 3d-FFTs per LPAR.
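The one-eighth rule of thumb quoted above for Phase 2A can be captured in a trivial helper; this merely restates the observation in the text, and NPAR should always be checked against timings for the actual calculation.

    /* Rule-of-thumb NPAR for the Phase 2A benchmarks discussed above:
       parallelize the 3d-FFT over 8 processors, i.e. NPAR = nproc/8,
       giving two sets of 3d-FFTs per 16-way LPAR. */
    #include <stdio.h>

    static int suggested_npar_phase2a(int nproc)
    {
        return (nproc >= 8) ? nproc / 8 : 1;
    }

    int main(void)
    {
        for (int n = 16; n <= 256; n *= 2)
            printf("%4d processors -> NPAR = %d\n", n, suggested_npar_phase2a(n));
        return 0;
    }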

The combined result of the changeover to Phase 2A, xlf-v9.1 and essential debugging of the code to enable it to run as a 64-bit executable has yielded on average a 23% speed-up in VASP 4.6.13.g over the original 32-bit code on Phase 2. Current and further work on VASP’s collective communications, described in section 4, shows promising initial results but may require further tuning of VASP to take full advantage of the optimized collectives. We anticipate that this will further increase the scalability of VASP 4.6.13.g without degrading the performance at low processor counts.

For the case of GAMESS-UK, the improved memory bandwidth of the Phase 2A p5-575 nodes is seen to have a dramatic effect, for example, on the performance of the Hartree Fock SCF 2nd derivatives module of the code. Fig. 11 compares performance, in arbitrary units, for the SGI Altix 3700 (with both 1.3 and 1.5 GHz Itanium 2 processors) and HPCx Phase 2 and Phase 2A. The SGI systems outperform HPCx Phase 2 in this case; however, the Phase 2A HPCx system dominates at all processor counts and outperforms the Phase 2 system by a factor of 1.76 when using 128 processors.

Other codes of interest to materials chemistry do not yet all show significant improvements with the transition to Phase 2A but generally perform at least as well as on Phase 2, for example the SIESTA benchmark (we expect SIESTA performance on Phase 2A to improve for larger cases due to the improved memory organization). The molecular dynamics codes as a group (including DL_POLY and NAMD considered in section 4) show the same or sometimes slightly worse performance (a few per cent) on Phase 2A for initial benchmarks. We conjecture that this may be due to slightly slower memory or cache latency for collecting small amounts of data on Power 5, such as occur when looking up the positions, speeds and accelerations of individual atoms: the memory latency becomes faster on Power 5 compared to Power 4+ as the data size increases. Work is ongoing to investigate this problem, understand it and circumvent it. We would like to emphasize that molecular dynamics (and other) calculations that were impossible on HPCx Phase 2 due to physical memory limits are now possible with Phase 2A. For example, a test DL_POLY_3 simulation with ~300 million ions was successfully run on 1024 processors during the Phase 2A acceptance tests.

Fig. 10 VASP Fe benchmark Phase 2A job-time variation with parameter NPAR.

7. Conclusions


National and international HPC resources allow the possibility of ground-breaking new science.

To exploit these resources to maximum benefit and efficiency, scientists should have some understanding of how these systems differ from local PCs and small clusters. On HPCx, the terascaling team exists to further this understanding and to work in practical collaboration with scientists to achieve these aims. Codes in materials chemistry and other disciplines have benefited substantially from parallel optimization performed by team members in collaboration with the code authors. The team also provides expertise in new developments in HPC which will allow code improvements to be carried through and extended to make the best use of future resources.

Acknowledgements

HPCx UoE Ltd is funded by the UK Research Councils (the Engineering and Physical Sciences Research Council provides the largest contribution). We would like to thank I. T. Todorov, A. G. Sunderland, S. Booth, H. J. J. van Dam and P. Sherwood for useful discussions.

References

1 http://www.hpcx.ac.uk.
2 http://www.hpcx.ac.uk/services/hardware/index.html; http://www.top500.org/ORSC/2005/power4.html; http://www.top500.org/ORSC/2005/power5.html.
3 S. Booth and A. Jackson, HPCx Technical Report, 2005, http://www.hpcx.ac.uk/research/hpc/technical_reports/HPCxTR0501.pdf.
4 A. Jackson, HPCx Technical Report, 2004, http://www.hpcx.ac.uk/research/hpc/technical_reports/HPCxTR0412.pdf; M. Ashworth, I. J. Bush, M. F. Guest, M. Plummer, A. G. Sunderland and J. Hein, HPCx Technical Report, 2004, http://www.hpcx.ac.uk/research/hpc/technical_reports/HPCxTR0417.pdf; A. Gray, J. Hein and S. Booth, HPCx Technical Report, 2005, http://www.hpcx.ac.uk/research/hpc/technical_reports/HPCxTR0505.pdf.
5 G. Amdahl, AFIPS Conference Proceedings, 1967, AFIPS Press, Reston VA, USA, vol. 30, pp. 483–485. Amdahl’s law was originally devised to describe vector processors and has been adapted for parallel processors.
6 http://www.uam.es/departamentos/ciencias/fismateriac/siesta; J. M. Soler, E. Artacho, J. D. Gale, A. García, J. Junquera, P. Ordejón and D. Sánchez-Portal, J. Phys.: Condens. Matter, 2002, 14, 2745.
7 http://www.chimifm.unito.it/teorica/crystal; I. J. Bush and B. Montanari, 2006, article in preparation.
8 M. D. Segall, P. J. D. Lindan, M. J. Probert, C. J. Pickard, P. J. Hasnip, S. J. Clark and M. C. Payne, J. Phys.: Condens. Matter, 2002, 14, 2717. The authors together with K. Refson constitute the Castep Developers’ Group; K. Refson, M. Plummer and A. Wander, 2006, article in preparation.
9 http://cms.mpi.univie.ac.at/vasp.
10 M. F. Guest, J. M. H. Thomas, P. Sherwood, I. J. Bush, H. J. J. van Dam, J. van Lenthe, R. W. A. Havenith and J. Kendrick, Mol. Phys., 2005, 103, 719; http://www.cfs.dl.ac.uk/.
11 W. Smith and T. R. Forester, J. Mol. Graphics, 1996, 14, 136; W. Smith, C. Yong and M. Rodger, Mol. Simul., 2002, 28, 385; I. T. Todorov, W. Smith, K. Trachenko and M. T. Dove, J. Mater. Chem., 2006, DOI: 10.1039/b517931a, article in this volume.
12 L. Kalé, R. Skeel, M. Bhandarkar, R. Brunner, A. Gursoy, N. Krawetz, J. Phillips, A. Shinozaki, K. Varadarajan and K. Schulten, J. Comput. Phys., 1999, 151, 283.
13 http://www.csar.cfs.ac.uk/.
14 J. Hein, F. J. L. Reid, L. A. Smith, I. J. Bush, M. F. Guest and P. Sherwood, Philos. Trans. R. Soc. London, Ser. A, 2005, 363, 1987.


15 http://www.netlib.org/scalapack/.
16 J. Hein, HPCx Technical Report, 2004, http://www.hpcx.ac.uk/research/hpc/technical_reports/HPCxTR0410.pdf.
17 I. J. Bush, Daresbury Laboratory Technical Report, 2004, http://www.cse.clrc.ac.uk/arc/bfg.shtml.
18 M. Plummer and K. Refson, HPCx Technical Report, 2005, http://www.hpcx.ac.uk/research/hpc/technical_reports/HPCxTR0507.pdf.
19 M. Plummer and K. Refson, HPCx Technical Report, 2004, http://www.hpcx.ac.uk/research/hpc/technical_reports/HPCxTR0401.pdf; M. Plummer and K. Refson, Comput. Phys. Commun., 2006, to be submitted.
20 G. Kyriakou, F. J. Williams, M. S. Tikhov, A. Wander and R. M. Lambert, Phys. Rev. B: Condens. Matter, 2005, 72, 121408(R).
21 G. J. Pringle, HPCx Technical Report, 2004, http://www.hpcx.ac.uk/research/hpc/technical_reports/HPCxTR0414.pdf.
22 A. Jackson and S. Booth, HPCx Technical Report, 2004, http://www.hpcx.ac.uk/research/hpc/technical_reports/HPCxTR0409.pdf.
23 J. Nieplocha, R. J. Harrison and R. J. Littlefield, J. Supercomput., 1996, 10, 197; J. Nieplocha, R. J. Harrison and R. J. Littlefield, Global Arrays: a portable shared memory programming model for distributed memory computers, Proceedings of Supercomputing ’94, 1994, IEEE Computer Society Press, Los Alamos CA, USA, p. 340.
24 G. I. Fann and R. J. Littlefield, Parallel inverse iteration with reorthogonalisation, Proceedings of the 6th SIAM conference on parallel processing for scientific computing, 1993, SIAM Publishing, USA, p. 409.
25 Y. Alexeev, R. A. Kendall and M. S. Gordon, Comput. Phys. Commun., 2002, 143, 69; Y. Alexeev, M. Schmidt, T. Windus, M. S. Gordon and R. A. Kendall, Proceedings of Cluster 2002, IEEE international conference on cluster computing, IEEE Computer Society, 2002, p. 135.
26 H. J. J. van Dam, M. F. Guest, P. Sherwood, J. M. H. Thomas, J. H. van Lenthe, J. N. J. van Lingen, C. L. Bailey and I. J. Bush, Large scale electronic structure calculations in the study of the condensed phase, Proceedings of WATOC05 (2005), 2006, in press.
27 I. J. Bush and W. Smith, Daresbury Laboratory Technical Report, 2005, http://www.cse.clrc.ac.uk/arc/fft.shtml; I. J. Bush, W. Smith and I. T. Todorov, Comput. Phys. Commun., 2006, to be submitted.
28 J. To, P. Sherwood, A. A. Sokol, I. J. Bush, C. R. A. Catlow, H. J. J. van Dam, S. A. French and M. F. Guest, J. Mater. Chem., 2006, DOI: 10.1039/b601089j, article in this volume.
29 B. Dobrzelecki, Portability and Performance of CASTEP Code on eServer Blue Gene, MSc Dissertation, The University of Edinburgh, UK, 2005.
30 HPCx task-farming scripts and ‘control’ programs produced by J. Hein, S. Booth, A. G. Sunderland and D. S. Henty; scripts and advice are available on request from [email protected].
31 M. J. Frisch, G. W. Trucks, H. B. Schlegel, G. E. Scuseria, M. A. Robb, J. R. Cheeseman, J. A. Montgomery, Jr., T. Vreven, K. N. Kudin, J. C. Burant, J. M. Millam, S. S. Iyengar, J. Tomasi, V. Barone, B. Mennucci, M. Cossi, G. Scalmani, N. Rega, G. A. Petersson, H. Nakatsuji, M. Hada, M. Ehara, K. Toyota, R. Fukuda, J. Hasegawa, M. Ishida, T. Nakajima, Y. Honda, O. Kitao, H. Nakai, M. Klene, X. Li, J. E. Knox, H. P. Hratchian, J. B. Cross, V. Bakken, C. Adamo, J. Jaramillo, R. Gomperts, R. E. Stratmann, O. Yazyev, A. J. Austin, R. Cammi, C. Pomelli, J. W. Ochterski, P. Y. Ayala, K. Morokuma, G. A. Voth, P. Salvador, J. J. Dannenberg, V. G. Zakrzewski, S. Dapprich, A. D. Daniels, M. C. Strain, O. Farkas, D. K. Malick, A. D. Rabuck, K. Raghavachari, J. B. Foresman, J. V. Ortiz, Q. Cui, A. G. Baboul, S. Clifford, J. Cioslowski, B. B. Stefanov, G. Liu, A. Liashenko, P. Piskorz, I. Komaromi, R. L. Martin, D. J. Fox, T. Keith, M. A. Al-Laham, C. Y. Peng, A. Nanayakkara, M. Challacombe, P. M. W. Gill, B. Johnson, W. Chen, M. W. Wong, C. Gonzalez and J. A. Pople, Gaussian 03, Revision C.02, 2004, Gaussian, Inc., Wallingford CT.
