The way toward peta-flops ISC-2011 Dr. Pierre Lagier Chief Technology Officer Fujitsu Systems Europe
Copyright 2011 FUJITSU Limited
Where things started from
DESIGN CONCEPTS
New challenges and requirements
• Optimal sustained flops per watt
• Low operating cost through efficient cooling, floor space and weight
• Mission critical ready, with high reliability and availability
• Scalability towards hundreds of thousands of processors
Petascale supercomputer design concepts
HPC centric
• Design a true HPC processor first, rather than adapting technologies around a commodity processor
• Most efficient technologies fitting together toward HPC efficiency
Environmentally efficient
• Optimal sustained flops per watt
• Flexible mixed water/air cooling capabilities
• High density integration to reduce floor space
Mission critical
• Longest mean time between interrupts (MTBI) and mean time between failures (MTBF)
• Shortest mean time to restart production
• High throughput capabilities
SPARC64™ VIIIfx architecture
Multi-core technology
• 256 floating-point registers per core
• 32KB(I)+32KB(D) 2-way L1 cache per core
• 6MB shared L2 cache
• Inter-core hardware synchronisation
• Application access to cache management
High performance per watt
• 128 GFlops
• 58 watts peak
Water cooling
• Low current leakage of the CPU
• Low power consumption and low failure rate of CPUs
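The performance-per-watt claim can be checked with simple arithmetic. The sketch below takes the slide's 128 GFlops and 58 W figures at face value and treats both as per-CPU peak values, which is an assumption; sustained figures on real applications will be lower. The function names and the 1 petaflops target are illustrative only, and the CPU count ignores memory, interconnect and cooling.

```python
import math

# Figures quoted on the slide, assumed here to be per-CPU peak values.
CPU_GFLOPS = 128.0
CPU_WATTS = 58.0


def gflops_per_watt(gflops: float = CPU_GFLOPS, watts: float = CPU_WATTS) -> float:
    """Peak GFlops delivered per watt of CPU power."""
    return gflops / watts


def cpus_for_target(target_flops: float, gflops: float = CPU_GFLOPS) -> int:
    """Smallest number of CPUs whose combined peak reaches target_flops."""
    return math.ceil(target_flops / (gflops * 1e9))


if __name__ == "__main__":
    print(f"{gflops_per_watt():.2f} GFlops/W per CPU")
    # Hypothetical target: 1 petaflops of peak performance.
    print(f"{cpus_for_target(1e15)} CPUs for 1 PFlops peak")
```

Under these assumptions a petaflops of peak performance needs several thousand CPUs, which is where the scalability requirement comes from.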
Memory hierarchy (single CPU)
• Main memory: 16 / 32 / 64 GB
• Memory throughput: 64 GB/s
• L2 cache: 6 MB
• L1 cache: 32KB(I)+32KB(D)
[Figure: SPARCfx CPU with L1 and L2 caches, main memory and ICC]
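One way to read the memory figures is as a machine balance: how many bytes of memory traffic the CPU can sustain per floating-point operation. The sketch below combines the 64 GB/s memory throughput with the 128 GFlops figure from the processor slide; pairing the two numbers this way, and the function names, are assumptions made for illustration.

```python
# Figures quoted on the slides, assumed to describe one CPU.
MEMORY_BANDWIDTH_GB_PER_S = 64.0
CPU_GFLOPS = 128.0


def bytes_per_flop(bandwidth_gb_s: float = MEMORY_BANDWIDTH_GB_PER_S,
                   gflops: float = CPU_GFLOPS) -> float:
    """Bytes of main-memory traffic available per floating-point operation."""
    return bandwidth_gb_s / gflops


def flops_per_byte_at_peak(bandwidth_gb_s: float = MEMORY_BANDWIDTH_GB_PER_S,
                           gflops: float = CPU_GFLOPS) -> float:
    """Arithmetic intensity a kernel needs before memory stops being the limit."""
    return gflops / bandwidth_gb_s


if __name__ == "__main__":
    print(f"balance: {bytes_per_flop():.2f} bytes/flop")
    print(f"needed intensity: {flops_per_byte_at_peak():.1f} flops/byte")
```

A kernel that performs fewer flops per byte than the second figure would be limited by memory throughput rather than by the floating-point units, which is why the caches and application-level cache management matter.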
On-board interconnect controller (ICC)
Rich functions
• User-level RDMA
• Hardware barrier and reduction offload engine
Low latency and high throughput
• Full crossbar router with redundant direct paths
• Dedicated data path between the CPU and the ICC
• Direct command issuing to the ICC enables low latency: the interconnect is driven from user space, with DMA hardware registers memory-mapped through a Linux kernel driver extension
System board
Compact design
• 4 compute nodes per system board
• Hybrid cooling, for low processor temperature
• Direct 6D Mesh/Torus (4 ICCs)
• 32 DIMMs for memory
• Dimensions: 526 mm x 481 mm x 46 mm
Easy maintenance
• Easy replacement
• Remote power on/off of each system board
• No need to stop a rack to replace a system board
Technology
• Very fast node-to-node communication: 5 GB/s x 2 (bidirectional)
• Low latency: less than 2 µs point-to-point hardware latency
• Global hardware barrier: less than 10 µs to synchronise all compute nodes of a petaflops-class machine
• Integrated MPI support for collective operations
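The link figures above can be combined in a simple latency-plus-bandwidth cost model. This is a rough sketch, not a measured result: it takes 5 GB/s as 5e9 bytes per second in one direction, uses the 2 µs upper bound as the latency, and ignores software overhead, routing hops and contention. The function names are illustrative.

```python
# Link figures quoted on the slide; GB is taken as 1e9 bytes (an assumption).
LINK_BANDWIDTH_BYTES_PER_S = 5e9   # one direction of the 5 GB/s x 2 link
HW_LATENCY_S = 2e-6                # upper bound: "less than 2 µs"


def transfer_time_s(message_bytes: float,
                    latency_s: float = HW_LATENCY_S,
                    bandwidth: float = LINK_BANDWIDTH_BYTES_PER_S) -> float:
    """Estimated one-way time for a single point-to-point message."""
    return latency_s + message_bytes / bandwidth


def half_bandwidth_message_bytes(latency_s: float = HW_LATENCY_S,
                                 bandwidth: float = LINK_BANDWIDTH_BYTES_PER_S) -> float:
    """Message size at which latency and transmission time are equal."""
    return latency_s * bandwidth


if __name__ == "__main__":
    for size in (8, 1_000, 10_000, 1_000_000):
        print(f"{size:>9} bytes: {transfer_time_s(size) * 1e6:8.2f} µs")
    print(f"half-bandwidth size: {half_bandwidth_message_bytes():.0f} bytes")
```

In this model, messages well below the half-bandwidth size are dominated by latency, which is the regime where the hardware barrier and offloaded collectives pay off.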
Topology
• 6D Torus / Mesh physical node addressing (x, y, z, a, b, c)
• Logical 3D Torus partitioning (x, y, z) with 3 additional communication paths (a, b, c)
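The 6D addressing can be illustrated by listing the direct neighbours of a node given as (x, y, z, a, b, c). In the sketch below the axis sizes, and the choice of which axes wrap around (torus) and which do not (mesh), are hypothetical parameters chosen for the example; the slides do not specify them.

```python
from typing import List, Sequence, Tuple

Coord = Tuple[int, ...]


def neighbours(coord: Coord, shape: Sequence[int], wraps: Sequence[bool]) -> List[Coord]:
    """Direct neighbours of a node in a mixed torus/mesh of any dimension.

    shape gives the number of nodes along each axis; wraps[i] is True when
    axis i is a torus (wrap-around) and False when it is a mesh (open ends).
    """
    result: List[Coord] = []
    for axis, (pos, size, wrap) in enumerate(zip(coord, shape, wraps)):
        for step in (-1, 1):
            nxt = pos + step
            if wrap:
                nxt %= size              # torus axis: wrap around the ring
            elif not 0 <= nxt < size:
                continue                 # mesh axis: no link past the edge
            candidate = coord[:axis] + (nxt,) + coord[axis + 1:]
            # Skip self-links (axis of size 1) and duplicates (ring of size 2).
            if candidate != coord and candidate not in result:
                result.append(candidate)
    return result


if __name__ == "__main__":
    # Hypothetical layout: x, y, z and b wrap around; a and c are open meshes.
    shape = (4, 4, 4, 2, 3, 2)
    wraps = (True, True, True, False, True, False)
    for node in neighbours((0, 0, 0, 0, 0, 0), shape, wraps):
        print(node)
```

With this hypothetical layout the origin node has ten direct neighbours: two along each wrapping axis and one along each mesh axis of size 2.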