A Congestion Avoidance Technique Using Aggressive Packet Pacing toward Exascale Interconnect
Hidetomo Shibamura, Institute of Systems, Information Technologies and Nanotechnologies (ISIT); Japan Science and Technology Agency, CREST

"PACKET PACING" MAKES YOUR APPLICATION FAST!

INTRODUCTION
► Reducing communication latency is still important in HPC.
► Some collective communications may not be usable in the near future.
► Packet pacing avoids critical network congestion: applications with heavy-traffic communication get faster.

WHAT IS PACKET PACING?
► Insertion of a non-sending period (inter-packet gap) between the packets being sent.
► Controls the packet injection rate by interleaving packets, keeping space for smooth traffic.
► Decreases stop-and-go latency and reduces network congestion; maximizes throughput and minimizes network latency.
► Inter-packet gap: the interval between sending packets. Gap = 0 means no packet pacing; Gap = N means a pause equal to the transfer time of N packets.

NSIM: OUR INTERCONNECT SIMULATOR
Design and Implementation
► An MPI program based on PDES (Parallel Discrete Event Simulation), developed by Kyushu Univ., Fujitsu, and ISIT since 2009.
Features
► On-demand communication pattern (message) generation using MPI-like C programs (MGEN programs).
► Accurate simulation with very fine network specifications and various MPI overheads (Interconnect Configuration File).
► A wide variety of simulation outputs; simulation with the behavior of OS jitter and load imbalance.
► Very fast simulation: 18 minutes for a 1 MiB random ring communication (HPCC) on a 128K-node 3D torus, using a 128-core real system.
Supported topologies
► Tofu (K computer, FX10), up to 6D mesh/torus, and fat tree.

MOD PACKET PACING: A PACING STRATEGY
► MOD (Message Overlap Degree) packet pacing gives an optimum inter-packet gap for each message.
► The gap is calculated from the degree of message overlap: Gap = (Comm. Hops) - 1, because in alltoall each link is shared by at most hop-count messages.
► Each communication has an optimum pacing point (figure: execution times for each inter-packet gap, 3D Torus, 512 nodes).
- Alltoall algorithms: PWX (pairwise exchange), RING, SSPRD (simple spread), BRUCK, BTFLY (butterfly), and A2AT (by Ishihata et al. at Tokyo Univ. of Tech.).
- Link bandwidth: 4 GB/s. Routing algorithm: DOR (dimension-ordered routing).
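The MOD gap rule above can be sketched in a few lines. This is an illustrative sketch, not NSIM code: the coordinate representation and the minimal-hop helper for a torus are assumptions.

```python
# Sketch of the MOD (Message Overlap Degree) rule: Gap = (comm. hops) - 1.
# The torus hop-count helper below is an illustrative assumption.

def torus_hops(src, dst, dims):
    """Minimal hop count between two nodes of a torus.

    src, dst: coordinate tuples; dims: size of each torus dimension.
    Per dimension, the shorter of the direct and the wrap-around path
    is taken, as dimension-ordered routing on a torus would."""
    hops = 0
    for s, d, n in zip(src, dst, dims):
        delta = abs(s - d)
        hops += min(delta, n - delta)  # wrap-around link may be shorter
    return hops

def mod_gap(src, dst, dims):
    """MOD pacing gap for one message: hop count minus one, never negative."""
    return max(torus_hops(src, dst, dims) - 1, 0)

# A 3-hop message on a 4x4x4 torus gets a gap of 2 packet-transfer times.
print(mod_gap((0, 0, 0), (1, 1, 1), (4, 4, 4)))  # -> 2
```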
EXPERIMENTAL RESULTS ON SIMULATION
► Pacing effect by MOD pacing (1): execution times with and without MOD pacing for each alltoall algorithm (PWX, RING, SSPRD, BRUCK, BTFLY) and message sizes from 16 KiB to 128 KiB on a 3D torus of 16x8x8 nodes.
► Pacing effect by MOD pacing (2): link throughputs (+X, -X, +Y, -Y) of an alltoall communication on a 2D torus of 9x9 nodes (A2AT) and on a 3D torus of 4x4x4 nodes (PWX).
► Execution time and speedup ratio on various node sizes (3D Torus, pairwise exchange; 8x8x8 up to 32x32x16 nodes): the speedup ratio increases with node size.

EXPERIMENTAL RESULTS ON REAL MACHINE
► Random ring communications (from HPCC) with packet pacing, measured on Fujitsu FX10 at Kyushu Univ., Japan, on 12, 24, 48, and 96 nodes (192 to 1,536 cores); speedup ratios measured over inter-packet gaps from 0.0 to 8.0 for message sizes from 16 KiB to 4 MiB.
► 3.2x faster (84.0 msec down to 25.8 msec), 107.5% of the ideal execution time.
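One way to see why paced communication can run close to the ideal time is simple rate arithmetic: with a gap of G packet-transfer times a sender injects at link_bw / (1 + G), so under MOD pacing (G = hops - 1) the at-most-hops messages sharing a link fill it exactly without oversubscribing it. A minimal sketch, assuming the 4 GB/s link bandwidth of the simulation setup; the hop count is a made-up example:

```python
# Injection-rate arithmetic behind MOD pacing (illustrative, not NSIM code).
LINK_BW = 4.0  # GB/s, the link bandwidth used in the simulations

def injection_rate(gap, link_bw=LINK_BW):
    """Per-message injection rate with an inter-packet gap of `gap`
    packet-transfer times: one packet every (1 + gap) packet times."""
    return link_bw / (1.0 + gap)

# In alltoall, a message crossing `hops` links shares each link with at
# most `hops` messages; with Gap = hops - 1 they saturate the link exactly.
hops = 5  # example message crossing five links
aggregate = hops * injection_rate(hops - 1)
print(aggregate)  # -> 4.0 GB/s: the link is full but never oversubscribed
```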
► Faster communication by MOD pacing: 1.62x faster.
► 24 msec: ideal execution time.
► Time-consuming communication is caused by network congestion under heavy traffic; packet pacing removes this congestion.
► The optimum pacing point and the effect of packet pacing grow as the node size increases.
► NSIM supports rapid performance evaluation and communication analysis toward the exascale interconnect.
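The scaling trend can be illustrated with a toy single-link model (an assumption for illustration, not NSIM): several senders feed one link that services one packet per time unit, starting one time unit apart. Without pacing the worst-case queue grows with the number of senders, while a MOD-style gap of (senders - 1) keeps the backlog at a single packet, so larger systems need a larger optimum gap:

```python
from collections import Counter

def max_backlog(n_senders, n_packets, gap):
    """Worst-case queue length at a shared link (toy model).

    Sender i injects packet k at time i + k * (1 + gap); the link
    services one packet per time unit."""
    arrivals = Counter()
    for i in range(n_senders):
        for k in range(n_packets):
            arrivals[i + k * (1 + gap)] += 1
    backlog = worst = 0
    for t in range(max(arrivals) + 1):
        backlog += arrivals[t]
        worst = max(worst, backlog)
        if backlog:
            backlog -= 1  # one packet leaves the link per time unit
    return worst

# Unpaced traffic congests the link more as the sender count grows,
# while a gap of (senders - 1) keeps the queue at one packet.
for n in (4, 8, 16):
    print(n, max_backlog(n, 32, 0), max_backlog(n, 32, n - 1))
```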
SUMMARY
Packet pacing controls packet injection aggressively to accelerate HPC communication.
1. Each heavy-traffic communication has its own optimum pacing point.
2. The MOD pacing strategy finds that best pacing point.
3. Its effectiveness increases with the number of nodes.

Cache for Processor, Packet Pacing for Interconnect!