A HighPerformance, LowPower Chip Multiprocessor for Large Scale Molecular Orbital Calculation Kenta NAKAMURA∗ , Hiroaki HONDA∗‡ , Koji INOUE∗ , Hisao SATO† , Masamitsu UEHARA† , Harunobu KOMATSU‡ , Hiroaki UMEDA§ , Yuichi INADOMI§ , Kengo ARAKI¶ , Toru SASAKI¶ , Shigeru OBARAk , Umpei NAGASHIMA‡ and Kazuaki MURAKAMI∗ ∗ Department

of Informatics, Kyushu University † Seiko Epson Corp. ‡ Mizuho Information & Research Institute, Inc. § National Institute of Advanced Industrial Science and Technology ¶ A Priori Microsystems Inc. k Hokkaido University of Education Abstract— Ab initio molecular orbital (MO) calculation is very important for solving many challenging problems concerning the development of new drugs, chemicals, polymers, materials, and so on. However, large-scale MO calculations have at least two difficulties. The first is a considerable number of Electron Repulsion Integral (ERI) computations. The second is the computational complexity of single ERI. We are constructing special purpose processors, and a high performance and compact parallel computing system embedding them for a large-scale MO calculation, named EHPC (Embedded High Performance Computing) system. The numerous ERI calculations can be performed efficiently by the hierarchical parallel computing system architecture, and the special processors accelerate the computation of single ERI. For the parallel computing system, high-performance, low power consumption, and high densely integratable special purpose processor is strongly required. So we developed the ERI calculation specific chip-multiprocessor (CMP) architecture, named ERIC. This is the first implementation of an application specific processor architecture for MO calculation. The ERIC processor architecture has following unique features: heterogeneous CMP architecture optimized with ERI calculation algorithm, microengine-based chip-multiprocessor (µCMP) architecture, and special floating-point operating units, especially inverse-square-root unit, exponential function unit and error function unit for calculating Taylor expansion. We have developed the actual ERIC processor chip with TSMC 0.13µm CMOS technology. The die size is 10mm × 5mm; the operating frequency is 200MHz; the estimated power consumption is 2.1W.

I. I NTRODUCTION Ab initio molecular orbital (MO) method is very important for calculating electronic structures and properties of molecules. Hence, it is indispensable today to interpret experimentally observed phenomena and to predict various physical and chemical properties of known or unknown molecules to design functional materials by the MO method. It can also present information on inter-atomic interaction in molecular assemblies, and is a basis of all molecular simulation approaches. However, large-scale MO calculation has at least two difficulties: numerous numbers of electron repulsion integral

(ERI) calculations and computational complexity of single ERI. As mentioned in next section, O(N 4 ) numbers of electron repulsion integral (ERI) calculations are required, where N correspounds to the computatinal size of the target molecular system. Moreover, the computation of single ERI is laborious task, almost all arithmetics are consist of double precision floating-point operations, it needs to compute division, inverse square root, exponential function, error function, and a large number of double precision floating-point multiply-and-add operations. In recent years, many high performance computer systems have been developed as high speed vector computer systems and huge (PC) cluster computer systems, and large scale MO calculations have been actually possible. However, it’s difficult for almost all chemists to use such large computer systems. They are expensive, hard to maintain, require high power consumption, and inappropriate for trial calculations of some ideas. Furthermore, it is difficult to keep the target chemical compounds secret from the other researchers. Therefore, development of the personal computer system, which is high performance, low cost, and easy to use is strongly desired. We have been therefore developing EHPC (Embedded High Performance Computing) system with following special features in answer to the difficulties and the requirement: • The total system with hierarchical parallel computing architecture which enables numerous ERI computations efficiently, • ERI specific LSI which enables fast computation of the single integral, • Low power consumption and easy to use as the personal computer system. For realization of those features, high-performance, low power consumption, and high densely integratable special purpose processor is indispensable. So we investigate the ERI computation algorithm, and develop the ERI calculation specific processor architecture, named ERIC. The rest of the paper is organized as follows: section II

describes the outline of MO calculation, section III shows the overview of the EHPC system and the ERIC processor architecture, and section IV describes the ERIC processor architecture in detail. Then we show the specifications of the ERIC processor chip, and evaluate the attainable performance of the ERIC processor in section V. Section VI concludes the paper. II. C HARACTERISTICS OF THE MO C ALCULATION A. General MO Calculation Scheme and Most Time Consuming Step Ab initio MO calculation is performed by solving following Hartree-Fock equation [1] (Eq. 1):

B. Fock Matrix Generation Algorithm The integral driven calculation algorithm [2], shown in Figure 1, is widely used for the step of F (2) matrix generation in almost all SCF programs currently available. As the algorithm indicates, the number of ERI calculations is proportinal to N 4 and increases rapidly as N becomes larger.

for i = 1, N for j = 1, i for k = 1, i if (k == i) for l = 1, j, else for l = 1, k calculation of ERI (ij, kl) (2)

N ∑

(1)

(2)

{Fij + Fij }Cia =

i=1

(2)

Fij

Pij

= ≡

(1)

Sij Cia εa

(1)

i=1

N ∑ N ∑ k=1 l=1 occ ∑

2

∫ (ij, kl) ≡

N ∑

1 Pkl {(ij, kl) − (ik, jl)} 2

Cia Cja

(2) (3)

a

χi (r1 )χj (r2 )

1 χk (r2 )χl (r2 )dr1 dr2(4) , |r1 − r2 |

Fij + = Pkl ∗ (ij, kl) (2) Fkl + = Pil ∗ (ij, kl) (2) Fik − = 12 Pjl ∗ (ij, kl) (2) Fil − = 12 Pjk ∗ (ij, kl) (2) Fjk − = 12 Pil ∗ (ij, kl) (2) Fjl − = 12 Pik ∗ (ij, kl) end loop end loop end loop end loop Fig. 1.

Integral Driven Algorithm for the Fock Matrix Generation

(2)

here, Fij , (ij, kl), Fij and Pij denote a one-electron Fock matrix, a electron repulsion integral (ERI), a two-electron Fock matrix element, and a density matrix element, respectively. A set {χ} is called basis function set. N is the number of basis functions and is regarded as the computational size of the target molecular system. The total electronic wave function and energy value, and various electronic properties of target molecule are obtained from the resultant∑ set {C}, which is N regarded as the coefficients of MO (φa = i=1 Cia χi ). Since F (2) matrix depends on the answer C through P matrix, above Hartree-Fock equation (Eq. 1) is solved by the iterative procedure called as the SCF (Self-Consistent Field) method. Whole SCF computational procedure can be outlined as follows: 1. General Setup. 2. Input initial parameters, and generate initial data such as coordinates, basis function set, and so on. 3. Calculate one electron integrals (ONE-EIs). 4. Generate F (2) matrix with ERI computations. (3 ∼ 4 steps are iterative.) 5. Diagonalize the Fock matrix, Judge the convergence, and Update the density matrix. 7. Calculate various properties (when density matrix is converged). Table I shows the ab initio MO computation time for five sample molecules, and summarizes the time distribution for each step in the SCF procedure. In the SCF procedure mentioned above, the step of F (2) matrix generation consumes more than 98% of the total computation time. Furthermore, this percentage increases over 99% for larger scale calculation. If it is possible to shorten this computation time to a great extent, total computation time would be decreased drastically.

As shown from this algorithm, a calculation of a set i, j, k, and l doesn’t depend on the other set. Therefore, there are a lot of parallelism between each ERI calculation. C. Algorithm for the ERI Calculation by the Obara method The computation of single ERI is a laborious task. There are several algorithms for the ERI calculation. Among such algorithms, we selected the Obara method which is based on the general recurrence formula for contracted Gaussian function. This method have the good features of fast calculation, numerically high accuracy, and can be applied to the various integral calculations other than the ERI. Figure 2 shows this Obara algorithm for the ERI Calculation[3], [4]. The most characteristic point of the Obara algorithm is its process. An initial integrals are calculated with input data. Then this result is used to perform the recursive calculation for a certain ERI. Therefore, the Obara algorithm can be roughly divided into two segments: initial integral calculation and recursive calculation as shown in Figure 2. 1) Initial Integral Calculation: The initial integral calculation has a four-fold loop structure. In the initial integral calculation, there is a small number of operations per iteration, each with a small amount of instruction-level parallelism (ILP). To make matters worse, the initial integral calculation contains complex double precision floating-point operations such as division, inverse square root, exponential function for

TABLE I E XAMPLE OF ab initio MO C OMPUTATION T IMEa [ SEC ] Types of Peptide Molecules No. of Atoms No. of Atomic Orbitals

G 10 55

Setup Initial Orbital Set ONE-EI Calculation TWO-ERI Calculation & Fock-Matrix Generation Fock-Matrix Diagonalization Property Calculation Total CPU Time No. of Iterations

0.1 0.1 0.1 22.9 (96.6%) 0.2 0.1 23.7 12

a 4-31G

GA 20 110

GAQ GAQM 37 58 207 316 Computation Time (sec.) 0.1 0.1 0.1 0.6 4.4 18.9 0.3 1.5 5.0 269.4 1871.0 8482.1 (98.7%) (98.6%) (98.8%) 1.7 11.0 60.9 0.3 2.2 9.2 272.9 1892.7 8584.9 14 15 16

0.3 57.3 10.1 23284.3 (98.6%) 211.7 27.5 23614.5 19

basis function set, using GAMESS, on [email protected] with 512MB Main Memory

Next, F0 (T ) ∼ Fm−1 (T ) are obtained by the following relation: 1 {exp(−T ) + Fm+1 (T )} (8) Fm (T ) = (2m + 1)

(Generation of Initial Integrals) for a = 1, Mi for b = 1, Mj for c = 1, Mk for d = 1, Ml error function: inverse-square root, division, exponential function, and Taylor expansion calculations end loop end loop end loop end loop

2) Recursive Calculation: Recursive calculation computes a objective ERI after the processing of initial calculation is completed. The recursive calculation part consists of many hierarchical structured sub-functions, which consist of a large number of double precision floating-point multiply-and-add operations and can be processed in parallel.

Recurrence Calculation

(Recursive Calculation from Initial Integrals) recursive calculation to the target (ij, kl): many multiply-and-add calculations

ERI_reccal_0001_0000 function call ERI_reccal_0100_0000 function call •••

ERI_reccal_0200_0100 function call •••

Fig. 2.

GAQMY 75 427

Obara Algorithm for the Electron Repulsion Integral Calculation

ERI_reccal_0202_0201 function call

High [ 0] = ERI_DCx * Med1 [ + ERI_CDx * Med2 [ + ERI_BAx * Med3 [ + ERI_ACx * Med4 [ High [ 1] = ERI_DCy * Med1 [ + ERI_CDy * Med2 [ + ERI_BAy * Med3 [ + ERI_ACy * Med4 [ High [ 2] = ERI_DCz * Med1 [ + ERI_CDz * Med2 [ + ERI_BAz * Med3 [ + ERI_ACz * Med4 [ •••

ERI_horcal_1120_b function call ERI_horcal_1110_b function call ERI_horcal_1111_d function call •••

Targ [ 0] = High [ 0] + ERI_ABx * Low [ 0] ; Targ [ 1] = High [ 1] + ERI_ABx * Low [ 1] ; Targ [ 2] = High [ 2] + ERI_ABx * Low [ 2] ; •••

the following error function calculation: ∫ 1 Fn (T ) = t2n exp(−T t2 )dt (n = 0 ∼ m).

0] 0] 0] 0] ; 0] 0] 0] 0] ; 0] 0] 0] 0] ;

Targ [53] = High [17] + ERI_ABz * Low [17] ;

(5)

0

This error function calculation is the most time comsuming step in the innermost iteration of the ERI calculation loop structure. In the program, F0 (T ) ∼ Fm (T ) are evaluated using two formulas depending on the value T [5]. For a certain threshold value Tf , if T is greater than Tf , we can employ the asymptotic formula: √ (2n − 1)!! π . (6) Fm (T ) ' 2(2T )n T Whereas, if T is from 0 to Tf , we first employ the Taylor expansion for the highest value of m: ∑ Fm (T ) = Fm+k (T0 )(T0 − T )k /k!. (7) k

Fig. 3.

Hierarchical Structure of a Recursive Calculation

III. EHPC

PLATFORM AND ERIC A RCHITECTURE

P ROCESSOR

The goal of our project is to build a high-performance parallel computing system which implements many embedded ERIC processors, named Embedded High Performance Computing (EHPC) system. We introduce the overview of the EHPC platform and the ERIC processor architecture in this section. A. EHPC Platform Architecture 1) EHPC Platform Overview: As shown in Figure 4, the total EHPC platform architecture has three-level hierarchical

network structure: conventional PC, house keeping SH-4 processor and worker ERIC processor in order of descending levels, optimized with the Fock matrix generation algorithm by the integral driven method (Figure 1). One chassis system consists of a general purpose CompactPCI CPU board as a top layer conventional PC board and up to seven custom computation processor boards based on CompactPCI standard, and those boards are connected by CompactPCI bus each other. The ERIC processors are mounted on the board with a Hitachi SH-4 processor. The SH-4 processor on each board has a role in housekeeping and distributes segmentalized Fock matrix computation to the worker ERIC processors. The EHPC system can be used as PC cluster system by connecting chassis systems through 100Mbps Ethernet. 100Mbps Ethernet

Top Layer

PC (CompactPCI Card)

PC

PC

CompactPCI Bus

・・・

House Keeping Layer

SH4

SH4

SH4

BUS Bottom Layer (Worker LSI)

・・・ ERIC

ERIC

ERIC

ERIC

Board Chassis

Chassis

Fig. 4.

Chassis

EHPC Platform Architecture

We developed a prototype EHPC system which utilizes SH4 processors for worker LSIs instead of ERIC processors. Three worker LSIs are mounted on each prototype SH4 board. 2) Efficiency of the EHPC Platform: Three benchmark calculations by the EHPC 1, 2, and 3 chassis systems (21, 42, and 63 worker SH4 processors) were performed for (Glycine)5 amino acid [6] The result speed-up ratios compared to the 1 PC worker LSIs system are 2.00 for 2 chassis and 2.86 for 3 chassis system, respectively. This shows that speed-ups ratios are almost proportional to the number of worker SH4 LSIs. This result comes from not only the original nature of Fock generation algorithm (Figure 1) which can be easily parallelized, but also the reason why the concentration of the data communication does not occur, because the sub-totals of the partial Fock matrix elements on the memories of the bottom layer can be obtained on each layer in hierarchical EHPC architecture. Therefore, it is expected that this good parallelism will inherit for larger molecular systems and absolute calculation times will decrease by exchanging worker LSIs from the prototype SH4 processors to the newly developed ERIC ones. B. Overview of the ERIC Processor Architecture As described in the previous section (Sec. II), we especially focus on ab initio MO method and adopt following two approaches to speedup large-scale MO calculations: exploitation of inter-ERI calculation parallelism with EHPC system, and fast calculation of single integral using special purpose LSI. For that purpose we are currently developing ERI calculation specific LSI named ERIC for fast calculation of each ERI.

Since low power consumption is one of the most important requirements to the ERIC processor design for implementing the EHPC system, we decide to adopt S1C33 family which is an embedded use RISC-type CPU as a base processor. With this limitation, we investigate the ERI calculation algorithm shown in section II. The Obara algorithm consists of two segments whose characteristics are quite different from each other: an initial integral calculation (IIC) and a recursive calculation (RC), as shown in Figure 2. So we decided the design philosophy for coping with the two calculation parts, the IIC part and the RC part as follows: • The IIC part including some complex floating-point operations accelerates processing by installing special functional units to a processor. • The RC part having rich ILP accelerates by installing many functional units to a processor and by parallel processing. However, implementing the both functions with one processor is not efficient for two reasons: the IIC part does not need many functional units for the sake of its low parallelism and the RC part does not need special functional units for lack of complex floating-point operation. Therefore we would not optimize the ERIC processor with the whole part of the Obara algorithm, and decided to install a co-processor which accelerates a series of floating-point multiply-and-add operations in the ERIC processor. Namely, we adopted a heterogeneous chip-multiprocessor (CMP) architecture as the ERIC processor design, and divided the ERIC processor into two engines which work concurrently: an IIC engine as a main engine and an RC engine as the co-processor for calculating floating-point multiply-and-add operations. Figure 5 shows the overview of ERIC processor. ERIC processor is mainly composed of the following five modules: • Host Interface for communication with Host SH4 processor • SDRAM Interface for accessing External Memory • Internal Data/Program Memory for storing the calculation results and programs • IIC Engine for the initial integral calculation • RC Engine for the recursive calculation This processor only supports IEEE 754 double-precision floating-point numbers. Therefore, all the functional unit, memory and interfaces are 64-bit width, and single precision floating-point numbers are not supported. IV. ERIC: ERI- CALCULATION S PECIFIC P ROCESSOR A RCHITECTURE A. IIC Engine The initial integral calculation (IIC) consists of various operations such as arithmetic, floating-point addition, multiplication, division, inverse square root, exponential function, function call and conditional branch, and lacks in instructionlevel parallelism (ILP). Therefore, the design like a general purpose RISC processor is suitable for the architecture of the initial integral calculation engine. In the ERIC processor, the

HOST CPU

Customized processor based on S1C33 RISC CPU

(SH4)

HOST Interface

Initial Integral Calculation Engine Local BUS

Recursive Calculation Engine

(SH4 I/F)

Internal Data Memory External Memory Interface (SDRAM I/F)

μCMP architecture with four micro-engines

Fig. 5.

External Memory

Overview of Eric Processor Architecture

IIC engine is a customized processor of S1C33 family (Seiko Epson’s original 32-bit RISC-type CPU). For the IIC part of the Obara algorithm, we have added five special floating-point operating units, multiply-and-add, division, inverse square root, exponential function and error function to the base processor. 1) S1C33: 32-bit RISC Microprocessor Family: S1C33 family is Seiko Epson’s1 original 32-bit RISC-type CPU. This CPU was developed for high-performance embedded applications such as peripheral equipment for personal computers, portable equipment and other products which need high-speed data processing with low power consumption. The S1C33 instruction set contains 61 basic instructions (105 instructions in all), but no floating-point instruction. Since the ERIC processor must calculate double-precision floating-point numbers for processing MO computation, we added some floating-point operating units written in the following section to the IIC engine as additional components. 2) Floating-Point Operating Units: As mentioned above, S1C33 family does not have a floating-point operating unit, so we decided to add five double-precision floating-point operating units as follows: • Floating-point multiply-and-add unit • Floating-point division and inverse square-root unit • Floating-point exponential function unit • Floating-point error function unit which calculate Taylor expansion (Eq. 7) • 64-bit width load-store unit Execution clock cycles of the double precision floating-point arithmetic operation are listed on the Table II. These various floating-point arithmetic operations are important for the ERI calculation. For example, the execution clock cycles of the error function unit is 16, while it takes 32 instructions and 146 clock cycles by using the Taylor expansion method measured 1 Epson

Electronic Devices: http://www.epsondevice.com/

on the typical MIPS processor. The error function is performed many times at the deepest level of total Fock matrix generation with ERI calculation loop structure. Similarly, execution clock cycles by using the exponential function, the square root, and inverse-square root unit can also shorten the execution clock cycles compared with the software evaluation. Therefore it is expected that the development of ERIC processor will be very effective for ERI calculations. TABLE II E XECUTION C LOCK C YCLES FOR EACH O PERATION addition multiplication division square root exponential error function

6 6 23 31 20 16

pipelined pipelined unpipelined unpipelined unpipelined unpipelined

B. RC Engine Recurrence calculation (RC) engine is a co-processor which is specialized to the RC part of the Obara algorithm. As described in the section II, the RC part consists of a large number of function-calls, and each sub-function contains only a series of substantial floating-point multiply-and-add operations without branches. Since there is a lot of parallelism between those multiply-and-add operations, we can execute many operations simultaneously. So we decide to install plural multiply-and-add operating units to the RC engine architecture, and investigate the optimal organization for efficient parallel execution of the numerous multiply-and-add operations. 1) µCMP Architecture: For effective utilization of the multiply-and-add operating units, we analyze the characteristics of the RC part and focus attention on an inter-subfunction parallelism. As described in section II-C.2, since there is parallelism between subfunctions and each subfunction contains only floating-point multiply-and-add operation, load / store operation and integer addition / subtraction for address calculations, we proposed a microengine-based chip-multiprocessor (µCMP) architecture [7], [8]. The µCMP architecture has plural quite small specialized processors only for computing floating-point multiply-and-add operation, named MicroEngines, and each subfunction is computed in parallel by allotted to each Micro-Engines. 2) Design Considerations: We kicked the organization of the Micro-Engines around and experimented with the several types of the RC engine design to find out the optimal architecture. In this experimentation, we assumed that the maximum number of multiply-and-add units and load-store units we can implement is four, and compared the execution cycles of the RC part on each RC engine design changing the number of the Micro-Engines and functional units in each Micro-Engine. Figure 6 shows the result of the experimentation. It shows the speed-up ratio of the alternative RC engine designs over the base model RC1, 4, 4. Each model notation of the graph means the number of Micro-Engines and functional units as shown in Table III. From the result we can draw the conclusion

Performance Improvement over RC1,4,4

TABLE IV 250%

M AJOR S PECIFICATIONS OF THE ERIC P ROCESSOR

200%

Die Size 5mm×10mm Logic 2.1MGates Memory 704Kbytes Frequency 200MHz Power∗ (Max.) 2.1W (4.2W) ∗ Estimated power consumption

150%

100%

50%

0%

(ps ,ss ) (ps ,ps ) (pp ,ss ) (pp ,ps ) (pp .pp ) (ds ,ss ) (ds ,ps ) (pp ,ds ) (ds ,ds ) (dp ,ss ) (dp ,ps ) (dp ,pp ) (dp ,ds ) (dp ,dp ) (dd .ss ) (dd ,ps ) (dd ,pp ) (dd ,ds ) (dd ,dp ) (dd ,dd )

-50%

Type of Integral RC2,1,1

RC2,2,1

RC2,2,2

RC3,1,1

RC4,1,1

Fig. 6. Performance Comparison among Different Organization of RC Engine

that we can get better performance when increasing the number of Micro-Engines than when increasing the functional units in each Micro-Engine. RC Engine[0]

TABLE III E VALUATION M ODELS Model

# of MEa

RC1, 4, 4 RC2, 1, 1 RC2, 2, 1 RC2, 2, 2 RC3, 1, 1 RC4, 1, 1

1 2 2 2 3 4

a ME:

# of Units M&Ab LSUc 4 4 1 1 2 1 2 2 1 1 1 1

LSU 4 2 2 4 3 4

IIC Engine

Memory SubSystem Host (SH4) Interface SDRAM Controler

SRAM 16Kbytes x 24 = 384Kbytes

Total M&A 4 2 4 4 3 4

S R A M

RC Engine[1]

RC Engine[2]

SRAM 16Kbytes x 20 = 320Kbytes

RC Engine[3]

PLL

Fig. 7.

Chip Implementation of the ERIC Processor

Micro-Engines b M&A: Multiply-and-add units c LSU: Load-store units

B. Performance Evaluation From the result of the experiment, we finally decided to put four Micro-Engines on the RC engine and the components of each Micro-Engine as follows: a register file, a load-store unit, an arithmetic unit and a floating-point multiply-andadd unit. And each Micro-Engine has a basic instructionset containing only 17 types of instruction such as integer addition/subtraction, load/store, floating-point multiply-andadd, some branch and system control operations. V. C HIP I MPLEMENTATION A. Major Specifications We designed the ERIC processor and fabricated its actual chip with 0.13-micron processes CMOS technology by TSMC. Figure 7 shows the photograph of the ERIC processor chip, and its major specifications are listed on the Table IV with one exception that only power consumption is estimated value because we have not measured it yet. The estimated power consumption of the ERIC processor is 2.1W, and maximum power consumption is 4.2W. The power consumption is low enough to implement a large number of the processors in the EHPC (Embedded High Performance Computing) platform as we expect.

We can not show the exact performance evaluation of the entire EHPC system at the moment, because the EHPC system has not been completed and we are currently working on setting up the system which still has a lot of problems to be solved. However, we have the actual ERIC processor chips and experimental boards for testing them, and can evaluate the performance of the single ERIC processor chip. In this section we show side-by-side performance comparison between the ERIC processor and a single PC. The evaluation condition for each is as follows: ERIC: • Actual execution time of the single ERIC processor with 704Kbytes on-chip SRAM and 256Mbytes offchip SDRAM (The operating frequency is 200MHz.) • Programs written in assembly and C language, compiled by S1C33 compiler which does not support floating point optimization PC: • Actual execution time of a 3.2GHz Pentium 4 processor with 1Mbytes L2 cache and 2Gbytes main memory • All programs written in C language and compiled by gcc -O2 option

TABLE V E XECUTION T IME AND P ERFORMANCE E VALUATION OF THE IIC E NGINE

ERIC Pentium 4 Ratio

(pp, pp) Exec. Time (sec.) Perf. / Power −3 4.52 × 10 5.27 × 101 −4 1.09 × 10 8.91 × 101 41.5 0.59

1) IIC Engine: Table V shows the performance evaluation of the IIC engine compared with the Pentium 4 processor for calculations of a (pp, pp) type initial integral and the error function. This error function calculations are the most time consuming step in innermost iteration of the IIC computation 4-fold loop structure as shown in Figure 2. For the (pp, pp) calculation, the IIC engine consumes 41 times execution time compared to the Pentium 4 processor. Now, we are developing the ERIC specific compiler, so this (pp, pp) code optimizations are far from complete at the present stage. For the error function part, the assembly codes have already been fully optimized for ERIC processor. This part of calculation shows calculations of the Pentium 4 processor is 16 times faster than of IIC engine. By the code optimization and utilizing advantages of double precision floating point calculations and special arithmetic units, it is expected that (pp, pp) execution times decrease drastically as this error function level. Datasheet shows the power consumption of Pentium 4 is 103.0W (TDP: Thermal Design Power) [9], while for the ERIC it is 4.2W (Max.). From the performance-per-power 1 ( exec.time∗power ) point of view, the result shows that the IIC engine has almost 1.5 times higher efficiency than the Pentium 4. However, it is not so efficient as we expected. 2) RC Engine: Table VI shows the execution time (sec.) of the RC part for five peptide molecules, G, GA, GAQ, GAQM and GAQMY. It indicates that the RC engine requires nearly 36 times more execution time than Pentium 4 on average, regardless of the inputs. C. Improving the Performance of RC Engine These evaluations in the previous section show that the ERIC processor has no advantage in both computation performance and power efficiency over Pentium 4. It is a disappointing result. So we are now examining why our implementation does not work well, for example, we are examining on memory architecture, functional units design, utilization rate for each functional unit. In this section we show several experimental result. As a cause of the low performance of the RC engine, we focused attention on the low IPC (Instructions Per Clock cycle), investigated on the memory architecture of the ERIC processor using a cycle-accurate simulation model based on the Verilog-HDL description of the RC engine. Although we adopt a simple request/acknowledge protocol as the on-chip bus protocol of the ERIC processor for both the program memory and the data memory, this bus bandwidth becomes

Error function Exec. Time (sec.) Perf. / Power 2.77 × 10−6 8.60 × 104 −7 1.73 × 10 5.62 × 104 16.0 1.53

extremely low because of shortening design periods. Specifically, the minimum memory access cycle to on-chip SRAM in all data types (char, short, int and double) is 6. Since these buses are not pipelined, its memory bandwidth is 1 data / 6 clock cycles (30ns) for each engine. In other words, each RC engine can fetch only one instruction for every 6 cycles. SDRAM controller connected with the main memory also has the bandwidth problem. The Figure 8 shows the SDRAM access cycles of each RC engine. The graph shows that some memory accesses, 19.65% on RC[2] and 14.18% on RC[3], take more than 100 cycles, and the worst case takes 32,218 cycles to get a 64bits-width data. 100%

0.00%

1.83% 3.10%

0.41% 3.14% 19.24%

11.04%

13.43%

70%

7.54% 3.37%

60%

19.71%

90%

2.92%

80% 44.97% 40.43%

50% 44.38%

40% 30% 52.40%

49.90%

2.63%

4.74%

RC[0]

RC[1]

40.59%

1000～ ～999 ～99 ～49 ～39 ～29 10～19

20% 24.36%

10% 9.14%

0%

Fig. 8.

RC[2]

RC[3]

Breakdown of SDRAM Access for the RC Engine (in cycles)

As the result, the total IPC of four RC engines is approximately 0.28, while the average IPC of single RC engine is 0.07. This indicates that the each RC engine can issue only one instruction for every 14 clock cycles. The Figure 9 compares execution time and power efficiency of Pentium 4 and the ERIC processor chip with those of ERIC simulation models. Here, ERIC-Sim is the simulation model of the ERIC processor with the ideal memory, and each memory (including SDRAM) access can be completed in 6 cycles, and its bus protocol is the same as that of the ERIC processor chip. ERIC-Sim(p) adopts pipelining technique as the bus protocol on the ERIC-Sim for improving the memory bandwidth, and its memory access cycle are also 6. As shown in Figure 9, we can recognize that the execution time can be shortened to two fifths of the current one if the SDRAM access can be reduced from the ERIC chip. Furthermore, improvement of the bus protocol can make the execution time shortened to one third of the above, which shows ERIC processor possibly has the competitive advantage over Perntium 4 in power efficiency. These results show the

TABLE VI E XECUTION T IME OF THE R ECURSIVE C ALCULATION ( SEC .) Molecules ERIC Pentium 4 Ratio

G 1.48 × 101 4.15 × 10−1 35.76

5.0

Exec. Time 7.25E+04 Efficiency

4.58 4.5 4.0

6.0E+04 3.5 5.0E+04 3.0 2.5

4.0E+04

2.98E+04 2.0 3.0E+04 1.5

1.65 2.0E+04

1.00 1.0

1.07E+04

Efficiency (normalized to Pentium 4)

Execution Time (sec.) [Lower is Better]

8.0E+04

7.0E+04

GA 3.07 × 102 8.46 × 100 36.31

1.0E+04 0.5

0.68 2.00E+03

0.0

0.0E+00 Pentium4 3.2GHz

Fig. 9.

Eric-Chip

Eric-Sim

Eric-Sim (p)

Execution Time and Efficiency of the Recursive Calculation

current specification of the memory architecture is the big performance deterioration factor. In other words, improvement of memory architecture design is necessary for ERIC processor to gain the competitive advantage over Pentium 4. VI. C ONCLUSION Ab initio molecular orbital (MO) method is very important for calculating electronic structures and properties of molecules, and the personal type of high performance computer is still strongly desired for MO calculation. In the MO computation, electron repulsion integrals (ERIs) consume the most and tremendous calculation time. In this study, we are developing EHPC (Embedded High Performance Computing) system which has following features: 1. hierarchical parallel computing platform which can compute numerous ERIs efficiently, 2. the special purpose processor which is implemented in the system and can compute single ERI fast. To implement many processors into the system, we have developed the high-performance and low power consumption ERI calculation specific processor architecture, called ERIC. The ERIC processor adopts heterogeneous chipmultiprocessor (CMP) architecture optimized with the Obara ERI calculation algorithm, and has two calculation engines: initial integral calculation (IIC) engine and recursive calculation (RC) engine. The IIC engine has special floating-point operating units for fast calculation of division, inverse-square-root, exponential function and error function. The RC engine adopts microengine-based chip-multiprocessor (µCMP) architecture and has micro-engines. We designed the actual ERIC processor chip. It is comprised of the IIC engine which has five types of special floating-point

GAQ 4.11 × 103 1.13 × 102 36.34

GAQM 1.75 × 104 4.82 × 102 36.41

GAQMY 7.25 × 104 2.00 × 103 36.23

operating units, the RC engine which consists of four MicroEngines, 704Kbytes on-chip SRAM memory, an SDRAM interface and a host interface. The die size is 50mm2 , the operating frequency is 200MHz and the estimated power consumption is 2.1W (max. 4.2W). The ERIC processor chips were fabricated with TSMC 0.13µm CMOS technology, and we are currently working on setting up the entire EHPC system. We have estimated the performance of each engine and compare with a Pentium 4 3.2GHz processor. The result of the estimation is that the performance of IIC engine is only 6.24% of that of Pentium 4, and the performance-per-power is 1.5 times higher than the Pentium 4. On the other hand, the performance of the RC engine is about 36 times lower than the Pentium 4 on average. In this work we can satisfy the power requirement to assemble many ERIC processors into the compact EHPC system. Unfortunately, we have not achieved enough performance in our research phase yet, however, by improving the memory architecture, the ERIC processor has a potential to obtain 4.58 times higher efficiency over the Pentium Processor. ACKNOWLEDGMENT The research is granted by Japanese Ministry of Education, Culture, Sports, Science and Technology. R EFERENCES [1] A. Szabo and N. S. Ostlund, Modern Quantum Chemistry. Reading, Massachusetts: Dover Publishers, 1996. [2] J. Alml¨of, K. Faegri, Jr., and K. Korsell, “Principles for a direct SCF approach to LCAO-MO Ab Initio calculations,” Journal of Chemical Physics, vol. 3, pp. 385–399, 1982. [3] H. Honda and S. Obara, “Molecular integrals evaluated over contracted Gaussian functions,” in Proceedings of the 11th International Congress of Quantum Chemistry, 2003. [4] H. Honda, T. Yamaki, and S. Obara, “Molecular integrals evaluated over contracted Gaussian functions by using auxiliary contracted hyperGaussian functions,” Journal of Chemical Physics, vol. 117, no. 4, pp. 1457–1469, 2002. [5] S. Obara and A. Saika, “Efficient recursive computation of molecular integrals over Cartesian Gaussian functions,” Journal of Chemical Physics, vol. 84, no. 7, pp. 3963–3974, 1986. [6] H. Umeda, Y. Inadomi, H. Honda, and U. Nagashima, “Parallel Fock matrix construction on layered multi-processor system,” in Proceedings of The 229th ACS National Meeting (To be published). American Chemical Society, March 2005. [7] K. Nakamura, H. Hatae, M. Harada, Y. Kuwayama, M. Uehara, H. Sato, S. Obara, H. Honda, U. Nagashima, Y. Inadomi, and K. Murakami, “Eric: A special-purpose processor for ERI calculations in quantum chemistry applications,” in Proceedings of HPC-Asia 2002, December 2002. [8] K. Nakamura, M. Harada, Y. Kuwayama, and K. Murakami, “A microengine-based chip-multiprocessor (µCMP) architecture for compute intensive applications,” in ODES: 1st Workshop on Optimizations for DSP and Embedded Systems, March 2003. [9] “Intel Pentium 4 Processor on 90 nm Process Detasheet,” Intel Corporation.

of Informatics, Kyushu University † Seiko Epson Corp. ‡ Mizuho Information & Research Institute, Inc. § National Institute of Advanced Industrial Science and Technology ¶ A Priori Microsystems Inc. k Hokkaido University of Education Abstract— Ab initio molecular orbital (MO) calculation is very important for solving many challenging problems concerning the development of new drugs, chemicals, polymers, materials, and so on. However, large-scale MO calculations have at least two difficulties. The first is a considerable number of Electron Repulsion Integral (ERI) computations. The second is the computational complexity of single ERI. We are constructing special purpose processors, and a high performance and compact parallel computing system embedding them for a large-scale MO calculation, named EHPC (Embedded High Performance Computing) system. The numerous ERI calculations can be performed efficiently by the hierarchical parallel computing system architecture, and the special processors accelerate the computation of single ERI. For the parallel computing system, high-performance, low power consumption, and high densely integratable special purpose processor is strongly required. So we developed the ERI calculation specific chip-multiprocessor (CMP) architecture, named ERIC. This is the first implementation of an application specific processor architecture for MO calculation. The ERIC processor architecture has following unique features: heterogeneous CMP architecture optimized with ERI calculation algorithm, microengine-based chip-multiprocessor (µCMP) architecture, and special floating-point operating units, especially inverse-square-root unit, exponential function unit and error function unit for calculating Taylor expansion. We have developed the actual ERIC processor chip with TSMC 0.13µm CMOS technology. The die size is 10mm × 5mm; the operating frequency is 200MHz; the estimated power consumption is 2.1W.

I. I NTRODUCTION Ab initio molecular orbital (MO) method is very important for calculating electronic structures and properties of molecules. Hence, it is indispensable today to interpret experimentally observed phenomena and to predict various physical and chemical properties of known or unknown molecules to design functional materials by the MO method. It can also present information on inter-atomic interaction in molecular assemblies, and is a basis of all molecular simulation approaches. However, large-scale MO calculation has at least two difficulties: numerous numbers of electron repulsion integral

(ERI) calculations and computational complexity of single ERI. As mentioned in next section, O(N 4 ) numbers of electron repulsion integral (ERI) calculations are required, where N correspounds to the computatinal size of the target molecular system. Moreover, the computation of single ERI is laborious task, almost all arithmetics are consist of double precision floating-point operations, it needs to compute division, inverse square root, exponential function, error function, and a large number of double precision floating-point multiply-and-add operations. In recent years, many high performance computer systems have been developed as high speed vector computer systems and huge (PC) cluster computer systems, and large scale MO calculations have been actually possible. However, it’s difficult for almost all chemists to use such large computer systems. They are expensive, hard to maintain, require high power consumption, and inappropriate for trial calculations of some ideas. Furthermore, it is difficult to keep the target chemical compounds secret from the other researchers. Therefore, development of the personal computer system, which is high performance, low cost, and easy to use is strongly desired. We have been therefore developing EHPC (Embedded High Performance Computing) system with following special features in answer to the difficulties and the requirement: • The total system with hierarchical parallel computing architecture which enables numerous ERI computations efficiently, • ERI specific LSI which enables fast computation of the single integral, • Low power consumption and easy to use as the personal computer system. For realization of those features, high-performance, low power consumption, and high densely integratable special purpose processor is indispensable. So we investigate the ERI computation algorithm, and develop the ERI calculation specific processor architecture, named ERIC. The rest of the paper is organized as follows: section II

describes the outline of MO calculation, section III shows the overview of the EHPC system and the ERIC processor architecture, and section IV describes the ERIC processor architecture in detail. Then we show the specifications of the ERIC processor chip, and evaluate the attainable performance of the ERIC processor in section V. Section VI concludes the paper. II. C HARACTERISTICS OF THE MO C ALCULATION A. General MO Calculation Scheme and Most Time Consuming Step Ab initio MO calculation is performed by solving following Hartree-Fock equation [1] (Eq. 1):

B. Fock Matrix Generation Algorithm The integral driven calculation algorithm [2], shown in Figure 1, is widely used for the step of F (2) matrix generation in almost all SCF programs currently available. As the algorithm indicates, the number of ERI calculations is proportinal to N 4 and increases rapidly as N becomes larger.

for i = 1, N for j = 1, i for k = 1, i if (k == i) for l = 1, j, else for l = 1, k calculation of ERI (ij, kl) (2)

N ∑

(1)

(2)

{Fij + Fij }Cia =

i=1

(2)

Fij

Pij

= ≡

(1)

Sij Cia εa

(1)

i=1

N ∑ N ∑ k=1 l=1 occ ∑

2

∫ (ij, kl) ≡

N ∑

1 Pkl {(ij, kl) − (ik, jl)} 2

Cia Cja

(2) (3)

a

χi (r1 )χj (r2 )

1 χk (r2 )χl (r2 )dr1 dr2(4) , |r1 − r2 |

Fij + = Pkl ∗ (ij, kl) (2) Fkl + = Pil ∗ (ij, kl) (2) Fik − = 12 Pjl ∗ (ij, kl) (2) Fil − = 12 Pjk ∗ (ij, kl) (2) Fjk − = 12 Pil ∗ (ij, kl) (2) Fjl − = 12 Pik ∗ (ij, kl) end loop end loop end loop end loop Fig. 1.

Integral Driven Algorithm for the Fock Matrix Generation

(2)

here, Fij , (ij, kl), Fij and Pij denote a one-electron Fock matrix, a electron repulsion integral (ERI), a two-electron Fock matrix element, and a density matrix element, respectively. A set {χ} is called basis function set. N is the number of basis functions and is regarded as the computational size of the target molecular system. The total electronic wave function and energy value, and various electronic properties of target molecule are obtained from the resultant∑ set {C}, which is N regarded as the coefficients of MO (φa = i=1 Cia χi ). Since F (2) matrix depends on the answer C through P matrix, above Hartree-Fock equation (Eq. 1) is solved by the iterative procedure called as the SCF (Self-Consistent Field) method. Whole SCF computational procedure can be outlined as follows: 1. General Setup. 2. Input initial parameters, and generate initial data such as coordinates, basis function set, and so on. 3. Calculate one electron integrals (ONE-EIs). 4. Generate F (2) matrix with ERI computations. (3 ∼ 4 steps are iterative.) 5. Diagonalize the Fock matrix, Judge the convergence, and Update the density matrix. 7. Calculate various properties (when density matrix is converged). Table I shows the ab initio MO computation time for five sample molecules, and summarizes the time distribution for each step in the SCF procedure. In the SCF procedure mentioned above, the step of F (2) matrix generation consumes more than 98% of the total computation time. Furthermore, this percentage increases over 99% for larger scale calculation. If it is possible to shorten this computation time to a great extent, total computation time would be decreased drastically.

As shown from this algorithm, a calculation of a set i, j, k, and l doesn’t depend on the other set. Therefore, there are a lot of parallelism between each ERI calculation. C. Algorithm for the ERI Calculation by the Obara method The computation of single ERI is a laborious task. There are several algorithms for the ERI calculation. Among such algorithms, we selected the Obara method which is based on the general recurrence formula for contracted Gaussian function. This method have the good features of fast calculation, numerically high accuracy, and can be applied to the various integral calculations other than the ERI. Figure 2 shows this Obara algorithm for the ERI Calculation[3], [4]. The most characteristic point of the Obara algorithm is its process. An initial integrals are calculated with input data. Then this result is used to perform the recursive calculation for a certain ERI. Therefore, the Obara algorithm can be roughly divided into two segments: initial integral calculation and recursive calculation as shown in Figure 2. 1) Initial Integral Calculation: The initial integral calculation has a four-fold loop structure. In the initial integral calculation, there is a small number of operations per iteration, each with a small amount of instruction-level parallelism (ILP). To make matters worse, the initial integral calculation contains complex double precision floating-point operations such as division, inverse square root, exponential function for

TABLE I E XAMPLE OF ab initio MO C OMPUTATION T IMEa [ SEC ] Types of Peptide Molecules No. of Atoms No. of Atomic Orbitals

G 10 55

Setup Initial Orbital Set ONE-EI Calculation TWO-ERI Calculation & Fock-Matrix Generation Fock-Matrix Diagonalization Property Calculation Total CPU Time No. of Iterations

0.1 0.1 0.1 22.9 (96.6%) 0.2 0.1 23.7 12

a 4-31G

GA 20 110

GAQ GAQM 37 58 207 316 Computation Time (sec.) 0.1 0.1 0.1 0.6 4.4 18.9 0.3 1.5 5.0 269.4 1871.0 8482.1 (98.7%) (98.6%) (98.8%) 1.7 11.0 60.9 0.3 2.2 9.2 272.9 1892.7 8584.9 14 15 16

0.3 57.3 10.1 23284.3 (98.6%) 211.7 27.5 23614.5 19

basis function set, using GAMESS, on [email protected] with 512MB Main Memory

Next, F0 (T ) ∼ Fm−1 (T ) are obtained by the following relation: 1 {exp(−T ) + Fm+1 (T )} (8) Fm (T ) = (2m + 1)

(Generation of Initial Integrals) for a = 1, Mi for b = 1, Mj for c = 1, Mk for d = 1, Ml error function: inverse-square root, division, exponential function, and Taylor expansion calculations end loop end loop end loop end loop

2) Recursive Calculation: Recursive calculation computes a objective ERI after the processing of initial calculation is completed. The recursive calculation part consists of many hierarchical structured sub-functions, which consist of a large number of double precision floating-point multiply-and-add operations and can be processed in parallel.

Recurrence Calculation

(Recursive Calculation from Initial Integrals) recursive calculation to the target (ij, kl): many multiply-and-add calculations

ERI_reccal_0001_0000 function call ERI_reccal_0100_0000 function call •••

ERI_reccal_0200_0100 function call •••

Fig. 2.

GAQMY 75 427

Obara Algorithm for the Electron Repulsion Integral Calculation

ERI_reccal_0202_0201 function call

High [ 0] = ERI_DCx * Med1 [ + ERI_CDx * Med2 [ + ERI_BAx * Med3 [ + ERI_ACx * Med4 [ High [ 1] = ERI_DCy * Med1 [ + ERI_CDy * Med2 [ + ERI_BAy * Med3 [ + ERI_ACy * Med4 [ High [ 2] = ERI_DCz * Med1 [ + ERI_CDz * Med2 [ + ERI_BAz * Med3 [ + ERI_ACz * Med4 [ •••

ERI_horcal_1120_b function call ERI_horcal_1110_b function call ERI_horcal_1111_d function call •••

Targ [ 0] = High [ 0] + ERI_ABx * Low [ 0] ; Targ [ 1] = High [ 1] + ERI_ABx * Low [ 1] ; Targ [ 2] = High [ 2] + ERI_ABx * Low [ 2] ; •••

the following error function calculation: ∫ 1 Fn (T ) = t2n exp(−T t2 )dt (n = 0 ∼ m).

0] 0] 0] 0] ; 0] 0] 0] 0] ; 0] 0] 0] 0] ;

Targ [53] = High [17] + ERI_ABz * Low [17] ;

(5)

0

This error function calculation is the most time comsuming step in the innermost iteration of the ERI calculation loop structure. In the program, F0 (T ) ∼ Fm (T ) are evaluated using two formulas depending on the value T [5]. For a certain threshold value Tf , if T is greater than Tf , we can employ the asymptotic formula: √ (2n − 1)!! π . (6) Fm (T ) ' 2(2T )n T Whereas, if T is from 0 to Tf , we first employ the Taylor expansion for the highest value of m: ∑ Fm (T ) = Fm+k (T0 )(T0 − T )k /k!. (7) k

Fig. 3.

Hierarchical Structure of a Recursive Calculation

III. EHPC

PLATFORM AND ERIC A RCHITECTURE

P ROCESSOR

The goal of our project is to build a high-performance parallel computing system which implements many embedded ERIC processors, named Embedded High Performance Computing (EHPC) system. We introduce the overview of the EHPC platform and the ERIC processor architecture in this section. A. EHPC Platform Architecture 1) EHPC Platform Overview: As shown in Figure 4, the total EHPC platform architecture has three-level hierarchical

network structure: conventional PC, house keeping SH-4 processor and worker ERIC processor in order of descending levels, optimized with the Fock matrix generation algorithm by the integral driven method (Figure 1). One chassis system consists of a general purpose CompactPCI CPU board as a top layer conventional PC board and up to seven custom computation processor boards based on CompactPCI standard, and those boards are connected by CompactPCI bus each other. The ERIC processors are mounted on the board with a Hitachi SH-4 processor. The SH-4 processor on each board has a role in housekeeping and distributes segmentalized Fock matrix computation to the worker ERIC processors. The EHPC system can be used as PC cluster system by connecting chassis systems through 100Mbps Ethernet. 100Mbps Ethernet

Top Layer

PC (CompactPCI Card)

PC

PC

CompactPCI Bus

・・・

House Keeping Layer

SH4

SH4

SH4

BUS Bottom Layer (Worker LSI)

・・・ ERIC

ERIC

ERIC

ERIC

Board Chassis

Chassis

Fig. 4.

Chassis

EHPC Platform Architecture

We developed a prototype EHPC system which utilizes SH4 processors for worker LSIs instead of ERIC processors. Three worker LSIs are mounted on each prototype SH4 board. 2) Efficiency of the EHPC Platform: Three benchmark calculations by the EHPC 1, 2, and 3 chassis systems (21, 42, and 63 worker SH4 processors) were performed for (Glycine)5 amino acid [6] The result speed-up ratios compared to the 1 PC worker LSIs system are 2.00 for 2 chassis and 2.86 for 3 chassis system, respectively. This shows that speed-ups ratios are almost proportional to the number of worker SH4 LSIs. This result comes from not only the original nature of Fock generation algorithm (Figure 1) which can be easily parallelized, but also the reason why the concentration of the data communication does not occur, because the sub-totals of the partial Fock matrix elements on the memories of the bottom layer can be obtained on each layer in hierarchical EHPC architecture. Therefore, it is expected that this good parallelism will inherit for larger molecular systems and absolute calculation times will decrease by exchanging worker LSIs from the prototype SH4 processors to the newly developed ERIC ones. B. Overview of the ERIC Processor Architecture As described in the previous section (Sec. II), we especially focus on ab initio MO method and adopt following two approaches to speedup large-scale MO calculations: exploitation of inter-ERI calculation parallelism with EHPC system, and fast calculation of single integral using special purpose LSI. For that purpose we are currently developing ERI calculation specific LSI named ERIC for fast calculation of each ERI.

Since low power consumption is one of the most important requirements to the ERIC processor design for implementing the EHPC system, we decide to adopt S1C33 family which is an embedded use RISC-type CPU as a base processor. With this limitation, we investigate the ERI calculation algorithm shown in section II. The Obara algorithm consists of two segments whose characteristics are quite different from each other: an initial integral calculation (IIC) and a recursive calculation (RC), as shown in Figure 2. So we decided the design philosophy for coping with the two calculation parts, the IIC part and the RC part as follows: • The IIC part including some complex floating-point operations accelerates processing by installing special functional units to a processor. • The RC part having rich ILP accelerates by installing many functional units to a processor and by parallel processing. However, implementing the both functions with one processor is not efficient for two reasons: the IIC part does not need many functional units for the sake of its low parallelism and the RC part does not need special functional units for lack of complex floating-point operation. Therefore we would not optimize the ERIC processor with the whole part of the Obara algorithm, and decided to install a co-processor which accelerates a series of floating-point multiply-and-add operations in the ERIC processor. Namely, we adopted a heterogeneous chip-multiprocessor (CMP) architecture as the ERIC processor design, and divided the ERIC processor into two engines which work concurrently: an IIC engine as a main engine and an RC engine as the co-processor for calculating floating-point multiply-and-add operations. Figure 5 shows the overview of ERIC processor. ERIC processor is mainly composed of the following five modules: • Host Interface for communication with Host SH4 processor • SDRAM Interface for accessing External Memory • Internal Data/Program Memory for storing the calculation results and programs • IIC Engine for the initial integral calculation • RC Engine for the recursive calculation This processor only supports IEEE 754 double-precision floating-point numbers. Therefore, all the functional unit, memory and interfaces are 64-bit width, and single precision floating-point numbers are not supported. IV. ERIC: ERI- CALCULATION S PECIFIC P ROCESSOR A RCHITECTURE A. IIC Engine The initial integral calculation (IIC) consists of various operations such as arithmetic, floating-point addition, multiplication, division, inverse square root, exponential function, function call and conditional branch, and lacks in instructionlevel parallelism (ILP). Therefore, the design like a general purpose RISC processor is suitable for the architecture of the initial integral calculation engine. In the ERIC processor, the

HOST CPU

Customized processor based on S1C33 RISC CPU

(SH4)

HOST Interface

Initial Integral Calculation Engine Local BUS

Recursive Calculation Engine

(SH4 I/F)

Internal Data Memory External Memory Interface (SDRAM I/F)

μCMP architecture with four micro-engines

Fig. 5.

External Memory

Overview of Eric Processor Architecture

IIC engine is a customized processor of S1C33 family (Seiko Epson’s original 32-bit RISC-type CPU). For the IIC part of the Obara algorithm, we have added five special floating-point operating units, multiply-and-add, division, inverse square root, exponential function and error function to the base processor. 1) S1C33: 32-bit RISC Microprocessor Family: S1C33 family is Seiko Epson’s1 original 32-bit RISC-type CPU. This CPU was developed for high-performance embedded applications such as peripheral equipment for personal computers, portable equipment and other products which need high-speed data processing with low power consumption. The S1C33 instruction set contains 61 basic instructions (105 instructions in all), but no floating-point instruction. Since the ERIC processor must calculate double-precision floating-point numbers for processing MO computation, we added some floating-point operating units written in the following section to the IIC engine as additional components. 2) Floating-Point Operating Units: As mentioned above, S1C33 family does not have a floating-point operating unit, so we decided to add five double-precision floating-point operating units as follows: • Floating-point multiply-and-add unit • Floating-point division and inverse square-root unit • Floating-point exponential function unit • Floating-point error function unit which calculate Taylor expansion (Eq. 7) • 64-bit width load-store unit Execution clock cycles of the double precision floating-point arithmetic operation are listed on the Table II. These various floating-point arithmetic operations are important for the ERI calculation. For example, the execution clock cycles of the error function unit is 16, while it takes 32 instructions and 146 clock cycles by using the Taylor expansion method measured 1 Epson

Electronic Devices: http://www.epsondevice.com/

on the typical MIPS processor. The error function is performed many times at the deepest level of total Fock matrix generation with ERI calculation loop structure. Similarly, execution clock cycles by using the exponential function, the square root, and inverse-square root unit can also shorten the execution clock cycles compared with the software evaluation. Therefore it is expected that the development of ERIC processor will be very effective for ERI calculations. TABLE II E XECUTION C LOCK C YCLES FOR EACH O PERATION addition multiplication division square root exponential error function

6 6 23 31 20 16

pipelined pipelined unpipelined unpipelined unpipelined unpipelined

B. RC Engine Recurrence calculation (RC) engine is a co-processor which is specialized to the RC part of the Obara algorithm. As described in the section II, the RC part consists of a large number of function-calls, and each sub-function contains only a series of substantial floating-point multiply-and-add operations without branches. Since there is a lot of parallelism between those multiply-and-add operations, we can execute many operations simultaneously. So we decide to install plural multiply-and-add operating units to the RC engine architecture, and investigate the optimal organization for efficient parallel execution of the numerous multiply-and-add operations. 1) µCMP Architecture: For effective utilization of the multiply-and-add operating units, we analyze the characteristics of the RC part and focus attention on an inter-subfunction parallelism. As described in section II-C.2, since there is parallelism between subfunctions and each subfunction contains only floating-point multiply-and-add operation, load / store operation and integer addition / subtraction for address calculations, we proposed a microengine-based chip-multiprocessor (µCMP) architecture [7], [8]. The µCMP architecture has plural quite small specialized processors only for computing floating-point multiply-and-add operation, named MicroEngines, and each subfunction is computed in parallel by allotted to each Micro-Engines. 2) Design Considerations: We kicked the organization of the Micro-Engines around and experimented with the several types of the RC engine design to find out the optimal architecture. In this experimentation, we assumed that the maximum number of multiply-and-add units and load-store units we can implement is four, and compared the execution cycles of the RC part on each RC engine design changing the number of the Micro-Engines and functional units in each Micro-Engine. Figure 6 shows the result of the experimentation. It shows the speed-up ratio of the alternative RC engine designs over the base model RC1, 4, 4. Each model notation of the graph means the number of Micro-Engines and functional units as shown in Table III. From the result we can draw the conclusion

Performance Improvement over RC1,4,4

TABLE IV 250%

M AJOR S PECIFICATIONS OF THE ERIC P ROCESSOR

200%

Die Size 5mm×10mm Logic 2.1MGates Memory 704Kbytes Frequency 200MHz Power∗ (Max.) 2.1W (4.2W) ∗ Estimated power consumption

150%

100%

50%

0%

(ps ,ss ) (ps ,ps ) (pp ,ss ) (pp ,ps ) (pp .pp ) (ds ,ss ) (ds ,ps ) (pp ,ds ) (ds ,ds ) (dp ,ss ) (dp ,ps ) (dp ,pp ) (dp ,ds ) (dp ,dp ) (dd .ss ) (dd ,ps ) (dd ,pp ) (dd ,ds ) (dd ,dp ) (dd ,dd )

-50%

Type of Integral RC2,1,1

RC2,2,1

RC2,2,2

RC3,1,1

RC4,1,1

Fig. 6. Performance Comparison among Different Organization of RC Engine

that we can get better performance when increasing the number of Micro-Engines than when increasing the functional units in each Micro-Engine. RC Engine[0]

TABLE III E VALUATION M ODELS Model

# of MEa

RC1, 4, 4 RC2, 1, 1 RC2, 2, 1 RC2, 2, 2 RC3, 1, 1 RC4, 1, 1

1 2 2 2 3 4

a ME:

# of Units M&Ab LSUc 4 4 1 1 2 1 2 2 1 1 1 1

LSU 4 2 2 4 3 4

IIC Engine

Memory SubSystem Host (SH4) Interface SDRAM Controler

SRAM 16Kbytes x 24 = 384Kbytes

Total M&A 4 2 4 4 3 4

S R A M

RC Engine[1]

RC Engine[2]

SRAM 16Kbytes x 20 = 320Kbytes

RC Engine[3]

PLL

Fig. 7.

Chip Implementation of the ERIC Processor

Micro-Engines b M&A: Multiply-and-add units c LSU: Load-store units

B. Performance Evaluation From the result of the experiment, we finally decided to put four Micro-Engines on the RC engine and the components of each Micro-Engine as follows: a register file, a load-store unit, an arithmetic unit and a floating-point multiply-andadd unit. And each Micro-Engine has a basic instructionset containing only 17 types of instruction such as integer addition/subtraction, load/store, floating-point multiply-andadd, some branch and system control operations. V. C HIP I MPLEMENTATION A. Major Specifications We designed the ERIC processor and fabricated its actual chip with 0.13-micron processes CMOS technology by TSMC. Figure 7 shows the photograph of the ERIC processor chip, and its major specifications are listed on the Table IV with one exception that only power consumption is estimated value because we have not measured it yet. The estimated power consumption of the ERIC processor is 2.1W, and maximum power consumption is 4.2W. The power consumption is low enough to implement a large number of the processors in the EHPC (Embedded High Performance Computing) platform as we expect.

We can not show the exact performance evaluation of the entire EHPC system at the moment, because the EHPC system has not been completed and we are currently working on setting up the system which still has a lot of problems to be solved. However, we have the actual ERIC processor chips and experimental boards for testing them, and can evaluate the performance of the single ERIC processor chip. In this section we show side-by-side performance comparison between the ERIC processor and a single PC. The evaluation condition for each is as follows: ERIC: • Actual execution time of the single ERIC processor with 704Kbytes on-chip SRAM and 256Mbytes offchip SDRAM (The operating frequency is 200MHz.) • Programs written in assembly and C language, compiled by S1C33 compiler which does not support floating point optimization PC: • Actual execution time of a 3.2GHz Pentium 4 processor with 1Mbytes L2 cache and 2Gbytes main memory • All programs written in C language and compiled by gcc -O2 option

TABLE V E XECUTION T IME AND P ERFORMANCE E VALUATION OF THE IIC E NGINE

ERIC Pentium 4 Ratio

(pp, pp) Exec. Time (sec.) Perf. / Power −3 4.52 × 10 5.27 × 101 −4 1.09 × 10 8.91 × 101 41.5 0.59

1) IIC Engine: Table V shows the performance evaluation of the IIC engine compared with the Pentium 4 processor for calculations of a (pp, pp) type initial integral and the error function. This error function calculations are the most time consuming step in innermost iteration of the IIC computation 4-fold loop structure as shown in Figure 2. For the (pp, pp) calculation, the IIC engine consumes 41 times execution time compared to the Pentium 4 processor. Now, we are developing the ERIC specific compiler, so this (pp, pp) code optimizations are far from complete at the present stage. For the error function part, the assembly codes have already been fully optimized for ERIC processor. This part of calculation shows calculations of the Pentium 4 processor is 16 times faster than of IIC engine. By the code optimization and utilizing advantages of double precision floating point calculations and special arithmetic units, it is expected that (pp, pp) execution times decrease drastically as this error function level. Datasheet shows the power consumption of Pentium 4 is 103.0W (TDP: Thermal Design Power) [9], while for the ERIC it is 4.2W (Max.). From the performance-per-power 1 ( exec.time∗power ) point of view, the result shows that the IIC engine has almost 1.5 times higher efficiency than the Pentium 4. However, it is not so efficient as we expected. 2) RC Engine: Table VI shows the execution time (sec.) of the RC part for five peptide molecules, G, GA, GAQ, GAQM and GAQMY. It indicates that the RC engine requires nearly 36 times more execution time than Pentium 4 on average, regardless of the inputs. C. Improving the Performance of RC Engine These evaluations in the previous section show that the ERIC processor has no advantage in both computation performance and power efficiency over Pentium 4. It is a disappointing result. So we are now examining why our implementation does not work well, for example, we are examining on memory architecture, functional units design, utilization rate for each functional unit. In this section we show several experimental result. As a cause of the low performance of the RC engine, we focused attention on the low IPC (Instructions Per Clock cycle), investigated on the memory architecture of the ERIC processor using a cycle-accurate simulation model based on the Verilog-HDL description of the RC engine. Although we adopt a simple request/acknowledge protocol as the on-chip bus protocol of the ERIC processor for both the program memory and the data memory, this bus bandwidth becomes

Error function Exec. Time (sec.) Perf. / Power 2.77 × 10−6 8.60 × 104 −7 1.73 × 10 5.62 × 104 16.0 1.53

extremely low because of shortening design periods. Specifically, the minimum memory access cycle to on-chip SRAM in all data types (char, short, int and double) is 6. Since these buses are not pipelined, its memory bandwidth is 1 data / 6 clock cycles (30ns) for each engine. In other words, each RC engine can fetch only one instruction for every 6 cycles. SDRAM controller connected with the main memory also has the bandwidth problem. The Figure 8 shows the SDRAM access cycles of each RC engine. The graph shows that some memory accesses, 19.65% on RC[2] and 14.18% on RC[3], take more than 100 cycles, and the worst case takes 32,218 cycles to get a 64bits-width data. 100%

0.00%

1.83% 3.10%

0.41% 3.14% 19.24%

11.04%

13.43%

70%

7.54% 3.37%

60%

19.71%

90%

2.92%

80% 44.97% 40.43%

50% 44.38%

40% 30% 52.40%

49.90%

2.63%

4.74%

RC[0]

RC[1]

40.59%

1000～ ～999 ～99 ～49 ～39 ～29 10～19

20% 24.36%

10% 9.14%

0%

Fig. 8.

RC[2]

RC[3]

Breakdown of SDRAM Access for the RC Engine (in cycles)

As the result, the total IPC of four RC engines is approximately 0.28, while the average IPC of single RC engine is 0.07. This indicates that the each RC engine can issue only one instruction for every 14 clock cycles. The Figure 9 compares execution time and power efficiency of Pentium 4 and the ERIC processor chip with those of ERIC simulation models. Here, ERIC-Sim is the simulation model of the ERIC processor with the ideal memory, and each memory (including SDRAM) access can be completed in 6 cycles, and its bus protocol is the same as that of the ERIC processor chip. ERIC-Sim(p) adopts pipelining technique as the bus protocol on the ERIC-Sim for improving the memory bandwidth, and its memory access cycle are also 6. As shown in Figure 9, we can recognize that the execution time can be shortened to two fifths of the current one if the SDRAM access can be reduced from the ERIC chip. Furthermore, improvement of the bus protocol can make the execution time shortened to one third of the above, which shows ERIC processor possibly has the competitive advantage over Perntium 4 in power efficiency. These results show the

TABLE VI E XECUTION T IME OF THE R ECURSIVE C ALCULATION ( SEC .) Molecules ERIC Pentium 4 Ratio

G 1.48 × 101 4.15 × 10−1 35.76

5.0

Exec. Time 7.25E+04 Efficiency

4.58 4.5 4.0

6.0E+04 3.5 5.0E+04 3.0 2.5

4.0E+04

2.98E+04 2.0 3.0E+04 1.5

1.65 2.0E+04

1.00 1.0

1.07E+04

Efficiency (normalized to Pentium 4)

Execution Time (sec.) [Lower is Better]

8.0E+04

7.0E+04

GA 3.07 × 102 8.46 × 100 36.31

1.0E+04 0.5

0.68 2.00E+03

0.0

0.0E+00 Pentium4 3.2GHz

Fig. 9.

Eric-Chip

Eric-Sim

Eric-Sim (p)

Execution Time and Efficiency of the Recursive Calculation

current specification of the memory architecture is the big performance deterioration factor. In other words, improvement of memory architecture design is necessary for ERIC processor to gain the competitive advantage over Pentium 4. VI. C ONCLUSION Ab initio molecular orbital (MO) method is very important for calculating electronic structures and properties of molecules, and the personal type of high performance computer is still strongly desired for MO calculation. In the MO computation, electron repulsion integrals (ERIs) consume the most and tremendous calculation time. In this study, we are developing EHPC (Embedded High Performance Computing) system which has following features: 1. hierarchical parallel computing platform which can compute numerous ERIs efficiently, 2. the special purpose processor which is implemented in the system and can compute single ERI fast. To implement many processors into the system, we have developed the high-performance and low power consumption ERI calculation specific processor architecture, called ERIC. The ERIC processor adopts heterogeneous chipmultiprocessor (CMP) architecture optimized with the Obara ERI calculation algorithm, and has two calculation engines: initial integral calculation (IIC) engine and recursive calculation (RC) engine. The IIC engine has special floating-point operating units for fast calculation of division, inverse-square-root, exponential function and error function. The RC engine adopts microengine-based chip-multiprocessor (µCMP) architecture and has micro-engines. We designed the actual ERIC processor chip. It is comprised of the IIC engine which has five types of special floating-point

GAQ 4.11 × 103 1.13 × 102 36.34

GAQM 1.75 × 104 4.82 × 102 36.41

GAQMY 7.25 × 104 2.00 × 103 36.23

operating units, the RC engine which consists of four MicroEngines, 704Kbytes on-chip SRAM memory, an SDRAM interface and a host interface. The die size is 50mm2 , the operating frequency is 200MHz and the estimated power consumption is 2.1W (max. 4.2W). The ERIC processor chips were fabricated with TSMC 0.13µm CMOS technology, and we are currently working on setting up the entire EHPC system. We have estimated the performance of each engine and compare with a Pentium 4 3.2GHz processor. The result of the estimation is that the performance of IIC engine is only 6.24% of that of Pentium 4, and the performance-per-power is 1.5 times higher than the Pentium 4. On the other hand, the performance of the RC engine is about 36 times lower than the Pentium 4 on average. In this work we can satisfy the power requirement to assemble many ERIC processors into the compact EHPC system. Unfortunately, we have not achieved enough performance in our research phase yet, however, by improving the memory architecture, the ERIC processor has a potential to obtain 4.58 times higher efficiency over the Pentium Processor. ACKNOWLEDGMENT The research is granted by Japanese Ministry of Education, Culture, Sports, Science and Technology. R EFERENCES [1] A. Szabo and N. S. Ostlund, Modern Quantum Chemistry. Reading, Massachusetts: Dover Publishers, 1996. [2] J. Alml¨of, K. Faegri, Jr., and K. Korsell, “Principles for a direct SCF approach to LCAO-MO Ab Initio calculations,” Journal of Chemical Physics, vol. 3, pp. 385–399, 1982. [3] H. Honda and S. Obara, “Molecular integrals evaluated over contracted Gaussian functions,” in Proceedings of the 11th International Congress of Quantum Chemistry, 2003. [4] H. Honda, T. Yamaki, and S. Obara, “Molecular integrals evaluated over contracted Gaussian functions by using auxiliary contracted hyperGaussian functions,” Journal of Chemical Physics, vol. 117, no. 4, pp. 1457–1469, 2002. [5] S. Obara and A. Saika, “Efficient recursive computation of molecular integrals over Cartesian Gaussian functions,” Journal of Chemical Physics, vol. 84, no. 7, pp. 3963–3974, 1986. [6] H. Umeda, Y. Inadomi, H. Honda, and U. Nagashima, “Parallel Fock matrix construction on layered multi-processor system,” in Proceedings of The 229th ACS National Meeting (To be published). American Chemical Society, March 2005. [7] K. Nakamura, H. Hatae, M. Harada, Y. Kuwayama, M. Uehara, H. Sato, S. Obara, H. Honda, U. Nagashima, Y. Inadomi, and K. Murakami, “Eric: A special-purpose processor for ERI calculations in quantum chemistry applications,” in Proceedings of HPC-Asia 2002, December 2002. [8] K. Nakamura, M. Harada, Y. Kuwayama, and K. Murakami, “A microengine-based chip-multiprocessor (µCMP) architecture for compute intensive applications,” in ODES: 1st Workshop on Optimizations for DSP and Embedded Systems, March 2003. [9] “Intel Pentium 4 Processor on 90 nm Process Detasheet,” Intel Corporation.