A Congestion Avoidance Technique Using Aggressive Packet Pacing toward Exascale Interconnect

Hidetomo Shibamura
Institute of Systems, Information Technologies and Nanotechnologies (ISIT)
Japan Science and Technology Agency, CREST

“PACKET PACING” MAKES YOUR APPLICATION FAST!

INTRODUCTION

► Reducing communication latency is still important in HPC.
► Some collective communications may become unusable in the near future.
► Time-consuming communication is caused by network congestion under heavy traffic.
► Packet pacing avoids this critical network congestion.

WHAT IS PACKET PACING?

The insertion of a non-sending period (an inter-packet gap) between outgoing packets.
► Controls the packet injection rate by interleaving packets.
► Reduces network congestion.
► Decreases stop-and-go latency.
► Keeps space on each link for smooth traffic.
► Applications with heavy communication traffic get faster.

Inter-packet gap: the interval between sending packets (see the sketch after this list).
► Gap = 0: no packet pacing.
► Gap = N: the transfer time of N packets.
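To make the mechanism concrete, below is a minimal sketch of a paced injection loop in C. The packet size, link bandwidth, busy-wait timer, and the send_packet() primitive are illustrative assumptions, not the actual parameters of NSIM or any real NIC.

#include <stddef.h>
#include <time.h>

/* Illustrative parameters: assumptions, not real NIC values. */
#define PACKET_BYTES 2048u        /* payload per packet                 */
#define LINK_BW_BPS  4.0e9        /* 4 GB/s link, as in the simulations */

/* Hypothetical injection primitive standing in for the real NIC call. */
extern void send_packet(const void *pkt, size_t len);

/* Transfer time of one packet on the link, in nanoseconds. */
static double packet_time_ns(void) {
    return (double)PACKET_BYTES / LINK_BW_BPS * 1e9;
}

/* Spin until `ns` nanoseconds have elapsed. */
static void busy_wait_ns(double ns) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    do {
        clock_gettime(CLOCK_MONOTONIC, &t1);
    } while ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec) < ns);
}

/* Send a message as a packet stream, inserting an inter-packet gap of
 * `gap` packet-transfer times between consecutive injections
 * (gap = 0 disables pacing, matching the definition above).           */
void paced_send(const char *buf, size_t len, double gap) {
    size_t npkt = (len + PACKET_BYTES - 1) / PACKET_BYTES;
    for (size_t i = 0; i < npkt; i++) {
        size_t off = i * PACKET_BYTES;
        size_t n = len - off < PACKET_BYTES ? len - off : PACKET_BYTES;
        send_packet(buf + off, n);
        if (gap > 0.0 && i + 1 < npkt)
            busy_wait_ns(gap * packet_time_ns());  /* the pacing itself */
    }
}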

NSIM: OUR INTERCONNECT SIMULATOR

Feature
► On-demand communication pattern (message) generation using an MPI-like C program (MGEN program); a pattern sketch follows this list.
► Accurate simulation with very fine network specifications and various MPI overheads (Interconnect Configuration File).
► A wide variety of simulation outputs.
► Simulation of OS jitter and load-imbalance behavior.
► Very fast simulation: 18 minutes for a 1 MiB random ring communication (HPCC) on a 128K-node 3D torus, using a 128-core real system.

Supported topologies
► Tofu (K computer, FX10), up to 6D mesh/torus, and fat tree.

Design and Implementation
► Implemented as an MPI program based on PDES (Parallel Discrete Event Simulation).
► Developed by Kyushu Univ., Fujitsu, and ISIT since 2009.
► Supports rapid performance evaluation and communication analysis toward exascale interconnects.
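For flavor, here is the kind of communication pattern an MGEN program describes, written as a ring exchange. MGEN's actual API is not reproduced on this poster, so plain MPI calls serve as a stand-in; the message size is an example.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    enum { LEN = 1 << 20 };                        /* 1 MiB message */
    char *sbuf = malloc(LEN), *rbuf = malloc(LEN);
    int right = (rank + 1) % nprocs;
    int left  = (rank + nprocs - 1) % nprocs;

    /* Each rank sends to its right neighbor and receives from its
     * left one; the simulator replays this as a message stream.   */
    MPI_Sendrecv(sbuf, LEN, MPI_CHAR, right, 0,
                 rbuf, LEN, MPI_CHAR, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    free(sbuf); free(rbuf);
    MPI_Finalize();
    return 0;
}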

MOD PACKET PACING: A PACING STRATEGY

Message Overlap Degree (MOD) packet pacing:
► Gives an optimum inter-packet gap for each message.
► Calculates the gap from the degree of message overlap: Gap = (Comm. Hops) − 1. In an alltoall, each link is shared by at most as many messages as the hop count, so spacing a message's packets by (hops − 1) packet-transfer times lets the sharing messages interleave (a sketch of the gap computation follows this list).
► Maximizes throughput and minimizes network latency.

[Figure: Execution times for each inter-packet gap (3D Torus, 512 nodes). x-axis: inter-packet gap (0.0–12.0); y-axis: execution time (0–1,000 msec.); curves: PWX, RING, SSPRD, and BRUCK at 32 KiB–256 KiB; the optimum pacing point of each curve is marked.]

Each communication has its own optimum pacing point.
- Alltoall algorithms: PWX (pairwise exchange), RING, SSPRD (simple spread), BRUCK, BTFLY (butterfly), and A2AT (by Ishihata et al. at Tokyo Univ. of Tech.).
- Link bandwidth: 4 GB/s. Routing algorithm: DOR (Dimension-Ordered Routing).
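A minimal sketch of the MOD gap computation for one message on a 3D torus under DOR, following the Gap = hops − 1 formula above; the Coord type and the rank-to-coordinate mapping are assumptions for illustration.

#include <stdlib.h>

typedef struct { int x, y, z; } Coord;  /* node coordinates (assumed layout) */

/* Shortest hop distance along one torus ring of size n. */
static int ring_hops(int a, int b, int n) {
    int d = abs(a - b);
    return d < n - d ? d : n - d;
}

/* Hop count under dimension-ordered routing: the route resolves
 * the X dimension first, then Y, then Z, so the hops simply add. */
static int dor_hops(Coord s, Coord d, int nx, int ny, int nz) {
    return ring_hops(s.x, d.x, nx)
         + ring_hops(s.y, d.y, ny)
         + ring_hops(s.z, d.z, nz);
}

/* MOD pacing: Gap = (communication hops) - 1; a one-hop neighbor
 * message gets gap 0, i.e., no pacing.                            */
int mod_gap(Coord src, Coord dst, int nx, int ny, int nz) {
    int hops = dor_hops(src, dst, nx, ny, nz);
    return hops > 1 ? hops - 1 : 0;
}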

EXPERIMENTAL RESULTS ON SIMULATION

[Figure: Link throughputs of an alltoall communication (3D Torus 4x4x4 nodes, PWX). Link throughput (GB/sec., 0–4.0) on the +X, −X, +Y, and −Y links, with and without pacing.]

[Figure: Link throughputs of an alltoall communication (2D Torus 9x9 nodes, A2AT). Same axes as above.]

[Figure: Execution times with/without MOD pacing (3D Torus 16x8x8 nodes). Execution time (msec.) for the pwx, ring, ssprd, bruck, and btfly algorithms at 16 KiB–128 KiB, each without pacing and with MOD pacing.]

[Figure: Pacing effect by MOD pacing (1) and (2). Relative execution time (0.90–1.15) versus inter-packet gap (0.0–8.0) for message sizes from 16 KiB to 4 MiB; color scale: effect of packet pacing, fast to slow.]

[Figure: Execution time and speedup ratio on various node sizes (3D Torus, PWX). Execution time (msec., up to 20,000) and speedup ratio (up to 2.0) for 8x8x8 through 32x32x16 nodes, with and without MOD pacing; annotated speedups range from 1.45 to 1.98.]

► MOD pacing makes communication faster.
► The speedup ratio increases with node size.
► The optimum pacing point and the effect of packet pacing grow as the node size increases.

(A sketch of the PWX schedule used in the scaling study appears below.)
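Since PWX is the alltoall algorithm in the scaling study above, here is a sketch of its schedule; the XOR-partner form assumes a power-of-two number of ranks, and pacing itself is left to the injection layer.

#include <mpi.h>
#include <string.h>

/* Pairwise-exchange (PWX) alltoall: in step s, rank r exchanges one
 * block with partner r XOR s, so each step forms disjoint pairs.
 * Assumes a power-of-two number of ranks.                           */
void pwx_alltoall(const char *sbuf, char *rbuf, int len, MPI_Comm comm) {
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    /* Own block needs no network transfer. */
    memcpy(rbuf + (size_t)rank * len, sbuf + (size_t)rank * len, len);

    for (int step = 1; step < nprocs; step++) {
        int peer = rank ^ step;                  /* partner this step */
        MPI_Sendrecv(sbuf + (size_t)peer * len, len, MPI_CHAR, peer, step,
                     rbuf + (size_t)peer * len, len, MPI_CHAR, peer, step,
                     comm, MPI_STATUS_IGNORE);
    }
}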

EXPERIMENTAL RESULTS ON REAL MACHINE

► Random ring communications (from HPCC) with packet pacing (a sketch of the ring construction appears below).
► Measured on Fujitsu FX10 at Kyushu Univ., Japan: 12 nodes (192 cores), 24 nodes (384 cores), 48 nodes (768 cores), and 96 nodes (1,536 cores).

[Figure: Elapsed time (msec.) of the random ring communication with and without pacing on each FX10 node count; color scale: effect of packet pacing, fast to slow. Annotations: 84.0 msec. reduced to 25.8 msec.; ideal execution time 24 msec.]

► Up to 3.2x faster, at 107.5% of the ideal execution time (84.0 msec. → 25.8 msec. against the 24 msec. ideal; 84.0 / 25.8 ≈ 3.26 and 25.8 / 24 ≈ 1.075).
► 1.62x faster on another node count.
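For reference, a sketch of how the random-ring neighbors can be constructed, after the HPCC random ring benchmark; the shared seed and shuffle are illustrative, and HPCC's exact measurement loop and the FX10 pacing interface are not reproduced here.

#include <stdlib.h>

/* Build a random ring: shuffle the ranks into a random order that is
 * identical on every process (shared seed), then take each rank's two
 * ring neighbors.  Sketch only, after the HPCC random ring benchmark. */
void random_ring_neighbors(int nprocs, int rank, int *left, int *right) {
    int *perm = malloc((size_t)nprocs * sizeof *perm);
    int me = 0;
    for (int i = 0; i < nprocs; i++) perm[i] = i;
    srand(12345);                                /* example shared seed  */
    for (int i = nprocs - 1; i > 0; i--) {       /* Fisher-Yates shuffle */
        int j = rand() % (i + 1);
        int t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (int i = 0; i < nprocs; i++)
        if (perm[i] == rank) me = i;
    *left  = perm[(me + nprocs - 1) % nprocs];
    *right = perm[(me + 1) % nprocs];
    free(perm);
}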

SUMMARY

Packet pacing controls packet injection aggressively to accelerate HPC communication.
1. Each heavy-traffic communication has its own optimum pacing point.
2. The MOD pacing strategy finds the best pacing point.
3. The effectiveness increases with the number of nodes.

Cache for Processor, Packet Pacing for Interconnect!