A Localized Congestion Control Mechanism for PCI Express Advanced Switching Fabrics

Venkata Krishnan and David Mayhew
Stargen Inc., Marlborough, MA 01752
[krishnan, mayhew]@stargen.com
http://www.stargen.com

Abstract

Even though there is a commonality in their physical and link layers, Advanced Switching (AS) is more than merely an extension of PCI Express. PCI Express serves primarily as a chip interconnect. In contrast, AS functions as a true system fabric interconnect, encompassing domains that are vastly different from, and indeed more sophisticated than, those of PCI Express. In such a setup, congestion management is crucial for optimal utilization of the fabric bandwidth. Congestion control in network fabrics has generally been based on end-to-end and link-by-link schemes for controlling packet injection. Though seemingly adequate, these schemes may not be effective in handling "transient" congestion arising from intermittent traffic, for transient congestion can occur even in fabrics operating well below their saturation limit. In such scenarios, a link-by-link scheme does not prevent congestion from spreading, while an end-to-end scheme may result in under-utilization of the fabric bandwidth. This paper details a congestion control mechanism called Status Based Flow Control (SBFC) that has been incorporated into AS and is specifically targeted at alleviating the transient congestion problem. The SBFC mechanism exploits the source-based path routing used in AS. It enables upstream switch nodes to modify the transmission of packets based on the congestion status of links in a downstream switch. Simulation studies show that the SBFC mechanism permits optimal usage of the fabric bandwidth during periods of transient congestion and effectively complements the traditional end-to-end and link-by-link schemes for congestion control.

1 Introduction

PCI Express, a low-latency, high-bandwidth switched serial interconnect, is poised to succeed PCI as the next-generation intra-system interconnect[1]. The primary strength of PCI Express is its support for legacy PCI while addressing PCI's inadequacies. Indeed, PCI Express is fundamentally a serialization and packetization of PCI and is completely compatible with it. The additional features that PCI Express offers are in addition to, not a replacement of, the mechanisms supported by PCI. Hence, the reams of existing software that expect PCI to underpin them will continue to function untouched in a PCI Express world. However, PCI Express' greatest advantage - its strict compatibility with PCI - is also one of its greatest liabilities. Since a PCI Express fabric spans a single global address space, there is no notion of a system boundary. In other words, a PCI Express fabric is, by definition, a single system in which multiple hosts cannot share a fabric. Since all communication is under the control of a single host, PCI Express is not well suited for an important application space that includes multiprocessing and peer-to-peer communication.


Advanced Switching (AS)[2, 11] builds upon PCI Express by reusing its physical and link layers while providing capabilities that include support for multiprocessing and peer-to-peer computing. It is our belief that the evolutionary path from PCI afforded by PCI Express, together with the layer commonality between AS and PCI Express, will enable AS to become a dominant inter-system interconnect technology in the near future.

In a multi-stage interconnect such as AS, congestion avoidance and control[3] is essential for optimal usage of fabric bandwidth. Traditional congestion avoidance techniques include (a) link-by-link and (b) end-to-end approaches.

A common link-by-link scheme is the credit-based windowing approach[5]. Here, credit (i.e., storage) information is exchanged between the nodes on both sides of a link. The sender must have the requisite credits before it can begin transmitting on its output link, and the receiver periodically replenishes the credits. This guarantees that at least the specified buffer space (in terms of credits) is available on the input side of the link. The scheme is primarily used for avoiding the packet drops typically seen in TCP/IP environments; it prevents unnecessary retransmission of packets, thereby avoiding the infamous congestion collapse[6]. On the other hand, this scheme does not prevent congestion from spreading. If the incoming traffic exceeds the targeted output link bandwidth, the senders will soon lack the requisite transmission credits and be blocked from sending additional packets. This blocking not only affects flows targeting the congested link but also penalizes flows targeting other links.

In an end-to-end scheme, end nodes control the rate of traffic entering the network. A common approach is to send an explicit congestion notification (ECN)[7] to the source node whenever an intermediate switch encounters congestion. ECNs may be either forward ECNs (FECN) or backward ECNs (BECN)[8]. A BECN requires the switch to generate a source notification packet on encountering congestion. Alternatively, the switch may play a passive role and simply mark the congested packets en route to their destination; this is the FECN mode of congestion notification. It is the responsibility of the upper-level protocols on the destination node to take appropriate action on receipt of such marked packets, which includes sending a notification to the source. Finally, the source node generally responds to a congestion notification by adjusting its packet injection rate using the classic additive increase-multiplicative decrease (AIMD) algorithm or one of its variants[9].

Its complexity notwithstanding, the ECN approach enables the source node to adjust its injection rate in accordance with the offered load of the fabric. Nevertheless, ECN-based mechanisms employing the AIMD approach result in under-utilization of the fabric bandwidth[20, 21] and may not be appropriate during periods of transient congestion. Indeed, transient congestion will be a common occurrence even in systems that operate below the saturation bandwidth, because network traffic is extremely bursty and does not necessarily follow a Poisson model[10].

Since the sender uses the same link-level transmission credits for all flows regardless of the output link taken in the downstream switch, the credit scheme has the undesirable effect of penalizing even flows that do not target the congested link. Such congestion may only be transient in nature and need not invoke the ECN mechanism to throttle the source. As we show later, this is precisely the scenario under which ECN mechanisms are ineffective.

Advanced Switching (AS) takes a distinct approach towards handling transient congestion. In particular, it employs a localized congestion control mechanism called Status Based Flow Control (SBFC) for tackling this problem. The SBFC mechanism allows a downstream AS switch node to generate an output-port specific status message to the upstream switch. The upstream switch node uses the status information to modify its scheduling so that packets targeting a congested link in the downstream switch are given lower priority. Note that this mechanism defines a new layer for congestion management and complements the traditional end-to-end (ECN) and link-by-link credit-based schemes. Together, the three mechanisms define the AS congestion management strategy. Finally, though the SBFC mechanism is discussed in the context of AS, it is applicable to any architecture using source-based path routing.

The rest of the paper is organized as follows: Section 2 discusses the features of AS pertinent to the description of the SBFC mechanism; Section 3 describes the SBFC mechanism in detail; Section 4 describes the evaluation environment; Section 5 presents the experiments; and Section 6 concludes the paper.

2 Advanced Switching Features

AS shares the physical and link layers with PCI Express but uses a different transaction layer. Link-layer communication uses Data Link Layer packets (DLLPs). DLLPs allow link state information, including flow credits, to be exchanged, but DLLPs themselves do not consume flow credits; only transaction layer packets (TLPs) consume flow credits. The intent of this section is to briefly describe those features of AS relevant to the subsequent discussion of the localized congestion control mechanism. For more details on these features, please refer to [1, 11, 22].

2.1 Routing

The routing adopted for AS differs from a traditional TCP/IP approach. Unlike the destination-based routing used by TCP/IP or Infiniband[12], AS uses a source-based routing approach. The destination-based approach requires the source node to specify only the source and destination addresses; the actual forwarding decision is made at each hop as the packet traverses the network. This is an appropriate model for TCP/IP, where endpoints number in the millions and enter or exit the network very frequently. The source-based routing approach, on the other hand, requires the exact path to the destination to be specified by the source within the packet header. This is an appropriate model for AS, since its design goal focuses on fabrics with fewer than a thousand endpoints, and the frequency with which AS endpoints are expected to enter or exit a fabric is relatively low.

The source-based routing mechanism consists of a turnpool and a turnpointer. The turnpool contains the sequence of turns a packet must take as it traverses the fabric, while the turnpointer selects the particular turn from the turnpool that each switch node uses; the turnpointer is updated when a packet moves from one switch node to the next. This source-based routing approach allows intermediate switch nodes to determine the output port a packet will take without the need for large destination tables. It also goes a step further: a switch node can easily determine the output port that the packet will take in the downstream switch as well (this is simply the next turn in the turnpool).
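To make the turnpool mechanism concrete, the following sketch shows how a switch can read both its own turn and the downstream switch's turn from a packet header. It is a minimal illustration in Python: the list-based turnpool encoding and the example port numbers are assumptions made for clarity, whereas an actual AS header packs the turns into a compact bit field.

```python
# Minimal sketch of source-based path routing with a turnpool and turnpointer.
# The list-based encoding is an illustrative simplification of the AS header.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class RouteHeader:
    turn_pool: List[int]     # sequence of output ports, one turn per hop
    turn_pointer: int = 0    # index of the turn used by the current switch

    def current_turn(self) -> int:
        """Output port this switch must use for the packet."""
        return self.turn_pool[self.turn_pointer]

    def next_turn(self) -> Optional[int]:
        """Output port the downstream switch will use (None at the last hop).
        This is the information that SBFC exploits."""
        nxt = self.turn_pointer + 1
        return self.turn_pool[nxt] if nxt < len(self.turn_pool) else None

    def advance(self) -> None:
        """Update the turnpointer as the packet moves to the next switch."""
        self.turn_pointer += 1

# Example with hypothetical port numbers: a packet crossing two switches
# before exiting the fabric.
hdr = RouteHeader(turn_pool=[3, 1, 2])
print(hdr.current_turn(), hdr.next_turn())   # first switch: exit port 3, next turn 1
hdr.advance()
print(hdr.current_turn(), hdr.next_turn())   # second switch: exit port 1, next turn 2
```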

2.2 Virtual Channel and Flow Credit Control

The traffic management support in AS is similar to that of PCI Express. As in PCI Express, AS supports multiple Traffic Classes (TCs) that enable flows to be prioritized. Rather than requiring every switch implementation to provide resources for supporting all TCs, Virtual Channels (VCs) are employed that enable multiple TCs to be mapped onto a single VC. Even though all VCs share the same physical link, each VC in a component has dedicated resources, including flow credits, and each VC is guaranteed a certain amount of the link bandwidth. Thus, congestion may occur within one VC without affecting other VCs using the same physical link. In this regard, VCs resemble the Virtual Lanes (VLs) of Infiniband[12] rather than virtual channels in the traditional sense[13], where packets may move from one VC to another. Hence, wherever the term 'link' (also used interchangeably with 'port') appears in this paper, it is within the context of a particular VC.
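The per-VC credit accounting outlined above can be summarized with a small sketch. It assumes, as described in this section and in Section 1, that only TLPs consume credits, that each VC has its own credit pool, and that credits are returned (via DLLPs) once a packet drains from the receiver's buffer; the credit unit and the two-VC configuration are illustrative assumptions.

```python
# Sketch of per-VC, credit-based link-level flow control.
class CreditedLink:
    def __init__(self, credits_per_vc):
        # Remaining transmit credits per VC, in credit units.
        self.credits = dict(credits_per_vc)

    def send_tlp(self, vc, packet_units):
        """Transmit a TLP only if the VC holds enough credits."""
        if self.credits[vc] < packet_units:
            return False              # the sender stalls, but only on this VC
        self.credits[vc] -= packet_units
        return True

    def receive_credit_return(self, vc, packet_units):
        """Receiver replenishes credits; the DLLP carrying them is credit-free."""
        self.credits[vc] += packet_units

# Example: exhausting VC1's credits blocks VC1 only; VC0 keeps its own pool.
link = CreditedLink({0: 50, 1: 10})
assert link.send_tlp(1, 10) and not link.send_tlp(1, 1)
assert link.send_tlp(0, 1)
link.receive_credit_return(1, 10)
assert link.send_tlp(1, 1)
```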

3 Localized Congestion Management

3.1 Motivation

Figure 1 illustrates a typical setup for both persistent and transient congestion. The example shows three switches A, B and C that are part of a larger AS fabric. Packets belonging to flows g0, g1 and g2 exit a common link - link 2 of switch node C - before heading to their final destinations. Furthermore, packets belonging to f0 and g0 also share a common output link - link 1 of switch B. Assume that in the steady state, the combined bandwidth of g0, g1 and g2 stays well below the bandwidth of output link 2 (switch C). If an intermittent flow g3 targets link 2 such that the link bandwidth is momentarily exceeded, it causes transient congestion and reduces the bandwidth of g0, g1 and g2 sharing the link. The undesirable side effect is that it also affects flows that are not heading out of the congested link. In the above example, f0 and g0 share the same link-level credits. Since packets belonging to g0 back up due to the transient congestion, there is a temporary drop in the rate of credit returns from C on that input link. A fair scheduling policy in switch B will nevertheless schedule packets of f0 and g0 in an equitable manner (alternating between the two flows), since it has no information regarding the congested flow g0. However, the slow progress of g0's packets in switch C implies that the link credits consumed by g0's packets take longer to be returned to link 1 of switch B. This eventually results in credit starvation for f0's packets and a drop in f0's bandwidth. Packet backups caused by the victimized flow f0 create a secondary congestion point at link 1 of switch B and eventually cause a drop in bandwidth for f1. Depending on the duration of g3, a congestion tree rooted at switch C starts to spread across multiple levels. The bandwidth drop notwithstanding, a new packet entering one of the branches of the congestion tree encounters a significant increase in its latency. Finally, after g3 completes, the backed-up packets belonging to the remaining flows utilize the excess bandwidth available; credits are then returned rapidly and traffic is restored to the original non-congested state.

If g3 were a persistent congesting flow, an end-to-end approach would be needed to throttle the sources. However, an end-to-end approach is not appropriate for handling transient congestion. If an end-to-end scheme were used (in which switch C generates a BECN to the sources of g0, g1 and g2), it is likely that flow g3 would have ceased even before the sources could respond to the BECN.

Figure 1. Packets arriving intermittently on input link 1 of switch node C (flow g3) result in a momentary over-subscription of the bandwidth of output link 2. This leads to primary and secondary transient congestion points and eventually affects even flows (f0 and f1) that do not target the congested links.

We now describe how the use of source-based path routing information in the packet header can help in alleviating transient congestion.

3.2 Details

Looking back at Figure 1, it can be seen that if switch B is made aware of the transient congestion on link 2 of switch C, it can modify its link scheduling policy so that packets belonging to f0 are given priority over those of g0. Recall that the source-based path information in the packet header allows switch B to know a priori that link 2 of switch C will be the next turn (port) used by packets belonging to g0. If the next turn output buffer status were available, a switch could incorporate that information and change its scheduling policy accordingly. This is the essence of localized congestion control, known in AS as status based flow control (SBFC).

Modifying the scheduling policy based on output port buffer status has traditionally been done in switches with combined input and output buffers (CIOQ)[14]. Typically, the backpressure from a congested output buffer allows the scheduler to prioritize packets heading out to non-congested output ports. The SBFC mechanism takes this a step further: the scheduling policy is based on the next turn output buffer status of a switch located downstream.

It must be mentioned that next turn (port) status information can be utilized by an AS switch scheduler only because AS uses source-based routing. In contrast, fabrics such as ATM, Ethernet or Infiniband use a destination (table-lookup) based approach. In such fabrics, a switch would be unable to utilize next turn (or port) information to influence its scheduling policy, and incorporating information about next output ports for each table entry would greatly increase the table size (in proportion to the port count). It may be argued that rather than reporting status on a next turn (port) basis, the downstream switch could report congestion on a per-flow basis. Unfortunately, unlike next turn output ports, which are finite in number, the number of individual flows is indeterminate. To allow flow-based status notification, the downstream switch would have to inspect individual flows and return status for each flow. This also requires every AS packet to be tagged with a flow-id, and the switches would have to inspect deeper into the AS packet rather than just the packet header, thereby adding to the switch complexity. To the best of our knowledge, there are no fabrics that use destination-based routing and yet take the downstream switch's output port status into consideration when scheduling packets. Most fabrics rely solely on traditional approaches (end-to-end and link-based flow control) for congestion management.

We now describe in detail the two components that are needed for enabling the SBFC mechanism. The first component is the generation of the output buffer status notification by the downstream switch node. The second component is the response taken by the upstream switch node on receiving such a notification.

3.2.1 Notification

The following issues need to be addressed when generating a notification for enabling SBFC:
• The layer to be used for the notification
• The frequency with which the notification needs to be sent
• The choice of target (among several upstream switches) for such notifications
• The notification contents

In PCI Express and AS, only transaction layer packets (TLPs) require flow credits; data link layer packets (DLLPs), while lossy, do not consume flow credits[1]. AS always uses DLLPs for the SBFC notification. Since the communication occurs only between adjacent switch nodes, DLLPs are an appropriate choice. More importantly, the notification should not be delayed, either for lack of AS flow credits or due to scheduling delays; the scheduling policy in AS is such that DLLPs always have higher priority than TLPs. Any delay incurred by the notification may render it ineffective, as the upstream switch would persist in sending packets to the congested link during the interim period. On the other hand, DLLPs may be dropped, resulting in a lost notification. However, the relatively rare occurrence of a lost notification (when using DLLPs) is preferable to a frequent occurrence of delayed notifications (when using TLPs).

As for the frequency and target for sending such a notification, a naïve solution would broadcast the output buffer status at periodic intervals to each neighboring component. This scheme sends notifications even to those link partners that may never transmit any packet to the corresponding output ports of the switch; from an efficiency perspective it is not preferable, as it squanders link bandwidth. Instead, AS supports a reactive mechanism: an incoming packet targeting a specific output port triggers a notification for that particular output port's buffer if some threshold is exceeded. This not only reduces link bandwidth wastage but also takes the flow behavior into account (information is sent only to upstream switches actually targeting the output port). A 2-bit status in the notification packet allows up to 4 thresholds to be specified.

Depending on the threshold value, the upstream switch sets a corresponding timer for the specific next turn. The timer allows the scheduling policy to modify its priority for a specific time period. Table 1 details how the status (XOFF) field is to be interpreted.

Table 1. XOFF notification.

XOFF Value   Description (and interpretation at the receiver)
0            No congestion (clear timer; equivalent to an XON)
1            Low congestion threshold (short timer)
2            High congestion threshold (long timer)
3            Severe congestion (very long timer; equivalent to a traditional XOFF)

AS supports a traditional XON/XOFF scheme[15] in which an explicit XON (XOFF = 0) is sent once the congestion clears. This approach requires a switch to keep track of the upstream ports to which an XOFF has been sent; once the buffers fall below a certain threshold, an XON needs to be sent to those ports. Obviously, this approach adds to the switch complexity. AS therefore also provides an alternative to the traditional XON/XOFF model by supporting two intermediate XOFF states (the low and high thresholds), which permit the receiver to clear the congestion status automatically without waiting for an explicit XON notification.

AS allows some flexibility in generating the notification. A switch node need not generate notifications for all four XOFF values. It may use a traditional XOFF/XON mode and generate XOFFs with values 0 or 3 only (the very long timer at the receiver for XOFF = 3 is essentially a safety timer that accommodates lost XONs). Alternatively, a switch can generate only XOFF = 1 or 2 (the short and long timers) and avoid the complexity of supporting a true XON/XOFF model. At the extreme, a switch node may support all 4 XOFF states. For our evaluations, we assume the short- and long-timer model.

Though discussed in the context of next turn output buffer usage, the congestion status is applicable even to switches that employ only input buffers. In practice, input-buffered switches are more prevalent than output-buffered switches because of their less aggressive memory requirements; moreover, input-buffered switches can match the performance of an output-buffered switch with appropriate internal speedup[16]. Such input-buffered switches would necessarily use virtual output queues (VOQs)[17] to avoid trivial head-of-line blocking. For VOQs in AS switches, the input buffer is organized as sub-queues based on <VC, Output Port>, where VC corresponds to the AS Virtual Channel and Output Port refers to the output port within the switch to which the packet is headed. For such a VOQ-based AS switch, the next turn buffer status that needs to be reported to the upstream switch may be calculated by aggregating the packet usage across all VOQs for each output port.
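The notification path can be summarized with the following sketch, which follows the reactive model described above: the arrival of a packet for a given output port triggers an aggregation of VOQ occupancy for that port and, if a threshold is crossed, the generation of a 2-bit XOFF status. The 50% and 80% levels match Table 2; the 95% "severe" level and the dictionary-based representation of the notification are assumptions made for illustration.

```python
# Sketch of reactive, per-output-port XOFF generation in a VOQ-based switch.
def xoff_value(occupancy, capacity, low=0.50, high=0.80, severe=0.95):
    """Map the aggregated buffer occupancy of one output port to an XOFF code."""
    fill = occupancy / capacity
    if fill >= severe:
        return 3      # severe congestion: traditional XOFF (very long timer)
    if fill >= high:
        return 2      # high threshold: long timer at the upstream node
    if fill >= low:
        return 1      # low threshold: short timer
    return 0          # no congestion: equivalent to an XON

def on_packet_arrival(in_port, out_port, voq_occupancy, capacity):
    """Called when a TLP arrives on in_port targeting out_port; returns the
    SBFC DLLP to send back on in_port, or None if no threshold is exceeded."""
    # Aggregate usage for out_port across the VOQs of every input port.
    total = sum(voq_occupancy[p][out_port] for p in voq_occupancy)
    code = xoff_value(total, capacity)
    if code == 0:
        return None
    # The notification names the congested port (the upstream node's next turn).
    return {"type": "SBFC", "next_turn": out_port, "xoff": code}
```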

3.2.2 Response

A node that receives an XOFF status notification must incorporate the information into its scheduling policy. This can be achieved provided the VOQs are further sub-classified based on next turns. In other words, a VOQ that is organized based on <VC, Output Port> must be further organized based on <VC, Output Port, Next Turn>. A clarification is in order: increasing the number of queues does not imply that the size of the input buffer needs to be increased. The size of the input buffer remains constant; rather, it is the number of queue pointers that increases.

Nevertheless, the number of pointers that may need to be maintained for the worst-case scenario of a downstream switch with 256 next turns (the maximum number of ports supported in an AS switch) could overwhelm the switch design. Rather than being assigned in a static fashion as described above, the queues may therefore be assigned dynamically. In the dynamic approach, the switch continues to organize the input buffer into <VC, Output Port> queues. In addition to these queues, a finite set of auxiliary queues is associated with each output port. When an XOFF notification is received for a next turn for which no auxiliary queue has yet been allocated, a new queue with the corresponding <VC, Output Port, Next Turn> value is dynamically allocated. If, on the other hand, no auxiliary queues are available, the XOFF message may be ignored. It is expected that an output port will receive XOFF status notifications for only a small subset of the next turns of the downstream switch, and hence a small set of auxiliary queues should be sufficient. Whenever a packet at the head of the corresponding <VC, Output Port> queue has a next turn value for which an auxiliary queue exists, the packet is moved to the auxiliary queue. Unlike the static approach, sharing a fixed set of auxiliary queues among all next turns not only reduces the total number of sub-queues that need to be maintained within the switch but also provides an effective platform for enabling SBFC in switches with large port counts.
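A sketch of the upstream response is given below. It models the timer-based interpretation of Table 1 and a scheduler that deprioritizes packets whose next turn is currently masked; the in-queue scan stands in for the static or dynamic auxiliary-queue organization described above, and the "very long" timer value is an assumption (the short and long ranges follow Table 2).

```python
# Sketch of the upstream response: per-next-turn XOFF timers plus a scheduler
# that prefers packets whose next turn is not masked.
from collections import deque, namedtuple

Packet = namedtuple("Packet", ["flow", "next_turn"])

XOFF_TIMER_US = {1: 1.0, 2: 8.0, 3: 50.0}   # short, long, "very long" (assumed)

class OutputPortScheduler:
    def __init__(self):
        self.queue = deque()     # packets waiting for this output link
        self.xoff_until = {}     # next_turn -> time (in µs) until which it is masked

    def on_sbfc(self, next_turn, xoff, now_us):
        if xoff == 0:
            self.xoff_until.pop(next_turn, None)              # explicit XON
        else:
            self.xoff_until[next_turn] = now_us + XOFF_TIMER_US[xoff]

    def select(self, now_us):
        """Return the oldest packet whose next turn is not masked; if every
        queued packet targets a masked next turn, fall back to the head."""
        for pkt in list(self.queue):
            if self.xoff_until.get(pkt.next_turn, 0.0) <= now_us:
                self.queue.remove(pkt)
                return pkt
        return self.queue.popleft() if self.queue else None

# Example for switch B's output link 1 in Figure 1 (port numbers hypothetical):
# g0's next turn (port 2 of switch C) is masked, so f0 is scheduled ahead of g0.
sched = OutputPortScheduler()
sched.queue.extend([Packet("g0", 2), Packet("f0", 0)])
sched.on_sbfc(next_turn=2, xoff=2, now_us=0.0)
print(sched.select(now_us=0.5).flow)     # -> f0
```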

4 Evaluation

The SBFC mechanism described in the previous section was evaluated using a network fabric simulator. Table 2 lists the simulation parameters. Note that we use a (constant) small packet size for the simulations in order to keep the transient congestion level at a minimum and avoid overstating the benefits of the SBFC mechanism; increasing the packet size only exacerbated the congestion effect in our simulations. The link-level flow credits have been sized so that flow-through is achieved for the packet flows under consideration; the stated flow credits are more than sufficient to saturate the complete fabric bandwidth in the presence of non-congesting flows. The XOFF thresholds for the next turn output buffers are specified in relation to the input buffer (link-level) flow credits. The ECN support at the source node includes the AIMD mechanism with a 50% reduction in the window (reduction in flow credits) on a notification and a linear increase (in flow credits) every 4 µs. Since our simulations are based on small fabrics (3 hops at most), 6 µs is added to the ECN response to account for the delays that a BECN notification would experience when traversing a relatively larger fabric.

Table 2. Simulation Parameters.

Parameter                Value
Link Bandwidth           2 Gb/s
Buffer                   Input buffered with VOQs
Packet Size              16 bytes
Next turn Ports          8
Credits                  800 bytes
Credit Size              16 bytes
XOFF thresholds          Low: 50% of credits, High: 80% of credits
XOFF timers (range)      Short: 0.5-1 µs, Long: 4-8 µs
ECN support at source    AIMD
ECN delay                6 µs
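Since the source-side response is only summarized in Table 2, a brief sketch of the assumed AIMD behavior is given below: a BECN halves the injection window, and the window then grows back linearly every 4 µs. The window cap and the step size are assumptions for illustration; only the 50% decrease and the 4 µs increase interval come from the table above.

```python
# Sketch of the source-side AIMD response to BECNs used in the ECN experiments.
class AimdSource:
    def __init__(self, max_window=50, step=1):
        self.max_window = max_window    # injection window, in flow credits
        self.window = max_window
        self.step = step                # additive increase per 4 µs interval
        self.next_increase_us = 4.0

    def on_becn(self):
        """Multiplicative decrease: halve the window on a congestion notice."""
        self.window = max(1, self.window // 2)

    def tick(self, now_us):
        """Additive increase: grow the window by one step every 4 µs."""
        while now_us >= self.next_increase_us:
            self.window = min(self.max_window, self.window + self.step)
            self.next_increase_us += 4.0

src = AimdSource()
src.on_becn()           # BECN arrives (roughly 6 µs after the congestion event)
src.tick(now_us=20.0)   # five 4 µs intervals elapse: window = 25 + 5 = 30
print(src.window)
```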


Figure 2. Bandwidth drops for flows f0 and f1 are due to primary and secondary transient congestion at switch nodes B and C.

5 Experiments

Figure 1 is used as the basis for all the experiments. Before discussing the results, a clarification is in order on the size of the fabric and the traffic type used in the evaluations. For a large multi-stage switching fabric, SBFC is not intended to solve the sustained congestion problem; rather, it is intended to complement the ECN mechanism, which is ineffective in handling transient congestion (see the following section). Since the SBFC mechanism works only between adjacent switches, a 3-stage fabric best illustrates its effectiveness. This does not imply that the SBFC mechanism is not applicable to larger multi-stage switch fabrics; indeed, the 3-stage fabric shown may be considered part of a larger switch fabric. The focus of the evaluations is strictly on transient congestion, a scenario that is illustrated with (a) uniform and (b) bursty rates of packet injection at the sources. Recall from Figure 1 that all flows except g3 are uniform flows (g3 is bursty). Finally, for the sake of brevity, the ability of ECN to handle sustained congestion is not shown, as its efficacy is already well understood.

5.1 Transient Congestion

Figure 2 illustrates the transient congestion condition when an intermittent flow results in a temporary over-subscription of the link bandwidth. The onset of g3 results in transient congestion on link 2 of switch C, ultimately causing f0 to drop in bandwidth. The fair scheduling policy in switch C allocates the congested link bandwidth equally among all flows (g0, g1, g2 and g3), which works out to 500 Mb/s (link bandwidth/flows, where link bandwidth = 2000 Mb/s and flows = 4). Since the bandwidth requirements of g1 (500 Mb/s) and g2 (300 Mb/s) can be satisfied, they are unaffected. The residual link bandwidth (link bandwidth - g1 - g2, i.e., 2000 - 500 - 300 = 1200 Mb/s) is distributed evenly between g0 and g3, limiting the bandwidth of g0 and g3 to 600 Mb/s each. As a result, g0's packets start accumulating in switch C. Now, in a link-level flow credit model, flow credits consumed by a packet are returned only when the packet has exited the switch. When g0's bandwidth is reduced from 800 Mb/s to 600 Mb/s, the rate of credits returned by switch C to switch B's output link 1 is reduced correspondingly.

Because switch B is oblivious to the congestion in the downstream switch, its scheduler continues to partition output link 1's bandwidth (and correspondingly the flow credits) evenly between f0 and g0. This causes f0's bandwidth also to drop to 600 Mb/s (matching g0's bandwidth). The accumulation of f0's packets in switch B, in turn, reduces the rate of credits returned by switch B to switch A's output link 1, which reduces f1's bandwidth to 600 Mb/s. Finally, when the bursty flow g3 ceases, the congested flows recover rapidly and make use of the spare 19% of link bandwidth left unused in the steady non-congested state. Hence the spike in bandwidth for g0, f0 and f1 observed as soon as the bursty flow ceases. Shortly thereafter, the flows return to the non-congested steady state.
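The allocation just described is simply max-min fair sharing of the 2000 Mb/s link. The short sketch below reproduces the numbers: flows whose demand is below the equal share are fully served, and the residual bandwidth is split among the remaining flows. The demand assumed for the bursty flow g3 is hypothetical; it only needs to be at least its fair share.

```python
# Worked check of the fair-share arithmetic for link 2 of switch C.
def max_min_share(link_bw, demands):
    alloc, remaining, bw = {}, dict(demands), link_bw
    while remaining:
        share = bw / len(remaining)
        satisfied = {f: d for f, d in remaining.items() if d <= share}
        if not satisfied:
            # No remaining flow fits under the equal share: split evenly.
            return {**alloc, **{f: share for f in remaining}}
        for f, d in satisfied.items():
            alloc[f] = d
            bw -= d
            del remaining[f]
    return alloc

# Offered loads during the burst (Mb/s); g3's demand is a hypothetical value.
print(max_min_share(2000, {"g0": 800, "g1": 500, "g2": 300, "g3": 800}))
# -> {'g1': 500, 'g2': 300, 'g0': 600.0, 'g3': 600.0}
```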

Figure 3 illustrates the effect of using a traditional end-to-end scheme for handling transient congestion. Initially, switch C generates a BECN to the source of the primary congesting flow g0. As can be seen from the figure, even though there is a significant drop in the source injection rate of g0, it is not timely enough to rescue the non-congesting flow f0. As before, the secondary effect also penalizes flow f1. Interestingly, flow f1 is penalized during the recovery process as well. Since g0's injection rate is reduced due to the BECN, switch B's output link 1 is under-utilized. However, the same is not true for switch A's output link 1. As soon as flow f1's packets that have backed up in switch A start to utilize the complete link bandwidth (temporarily saturating it), the BECN mechanism is triggered for f1, needlessly causing its source to reduce its packet injection rate. Overall, it is clear that a traditional ECN mechanism is quite ineffective in addressing transient congestion and results in an appreciable drop in the fabric throughput.

Figure 4 illustrates how SBFC addresses the transient congestion problem. In this experiment, the BECN mechanism was turned off. It can be clearly seen that the only flow affected is g0, which targets the congested link in switch C. Unlike the earlier case, flows f0 and f1 are totally unaffected. As soon as the bursty traffic ceases, flow g0 recovers and switch B's output link 1 is completely utilized for a short period.

Finally, the interaction of the SBFC and ECN mechanisms is shown in Figure 5. Unlike the behavior in Figure 3, BECN is selectively triggered only for the truly congesting flow g0; none of the other flows are affected.

Based on Figure 4, one could argue that the injection rate for flow g0 need not be reduced either. It is likely that SBFC and BECN generation may need to be tightly integrated within a switch; this is a topic that requires further investigation.


Figure 3. Behavior when switch nodes generate BECNs to sources for controlling the packet injection rate.


Figure 4. Behavior when the Status Based Flow Control (SBFC) mechanism is used (no BECNs are generated).


Figure 5. Behavior when the Status Based Flow Control (SBFC) mechanism is used in conjunction with BECNs.


5.2 Buffer Requirements

Rather than using the SBFC mechanism, an alternative way to address transient congestion would be to increase the flow credits, i.e., the buffering provided for each input port. Figure 6 shows the additional buffering that is required to handle the bursty flow g3 used in the earlier experiments. For the duration of the burst shown, a 150% increase in internal buffering is required if flow f0, the primary victim, is to remain unaffected. For flow f1 (the secondary victim) to remain unaffected, a 75% increase in buffering is required. In other words, the SBFC mechanism can significantly reduce the extra buffering (and hence the memory) needed to handle transient congestion. Such a reduction would be helpful for single-chip switch implementations[19].

Figure 6. Increase in input buffers needed to accommodate the transient congestion experienced by flows f0 and f1.

6 Conclusion

Traditional congestion control mechanisms based on end-to-end schemes are not effective in handling transient congestion, because notifications sent to the source elicit a slow response and may result in unnecessary throttling of flows. This paper presented a localized congestion control mechanism called Status Based Flow Control (SBFC) for handling transient congestion in AS. Not only does the SBFC mechanism adequately control the transient congestion problem, but it also permits a graceful transition to the traditional end-to-end scheme. We presented detailed simulation results that highlight the efficacy of this scheme. Finally, the SBFC mechanism is applicable not just to AS but to any fabric employing source-based path routing similar to that of AS.

7 References

1. PCI Special Interest Group, "PCI Express Base Specification Revision 1.0a," Apr. 2003.
2. D. Mayhew and V. Krishnan, "PCI Express and Advanced Switching: Evolutionary Path to Building Next Generation Interconnects," Hot Interconnects 11, Aug. 2003.
3. V. Jacobson, "Congestion Avoidance and Control," Proc. SIGCOMM '88, Aug. 1988, pp. 314-329.
4. S. Floyd and V. Jacobson, "Random Early Detection Gateways for Congestion Avoidance," IEEE/ACM Transactions on Networking, Vol. 1, No. 4, pp. 397-413, Aug. 1993.
5. H.T. Kung and R. Morris, "Credit-Based Flow Control for ATM Networks," IEEE Network, Vol. 9, No. 2, pp. 40-48, Mar./Apr. 1995.
6. J. Nagle, "Congestion Control in IP/TCP Internetworks," IETF RFC 896, Jan. 1984.
7. K.K. Ramakrishnan, S. Floyd, and D. Black, "The Addition of Explicit Congestion Notification (ECN) to IP," IETF RFC 3168, Sep. 2001.
8. W. Goralski, "Frame Relay for High-Speed Networks," John Wiley & Sons, Inc., New York, Jan. 1999.
9. S. Floyd, "Congestion Control Principles," IETF RFC 2914, Sep. 2000.
10. V. Paxson and S. Floyd, "Wide Area Traffic: The Failure of Poisson Modeling," IEEE/ACM Transactions on Networking, Vol. 3, No. 3, pp. 226-244, June 1995.
11. Advanced Switching Interconnect SIG, "Advanced Switching Specification Revision 1.0," Dec. 2003.
12. Infiniband Trade Association, "Infiniband Architecture Specification, Release 1.0," Oct. 2000. http://www.infinibandta.org
13. W.J. Dally and C.L. Seitz, "Deadlock-free Message Routing in Multiprocessor Interconnection Networks," IEEE Transactions on Computers, Vol. 36, pp. 547-553, May 1987.
14. S. Chuang, A. Goel, N. McKeown, and B. Prabhakar, "Matching Output Queueing with a Combined Input Output Queued Switch," Proc. INFOCOM (3), 1999.
15. J. Walrand and P. Varaiya, "High-Performance Communications Networks," Morgan Kaufmann Publishers, 2nd edition, 2000.
16. N. McKeown, "iSLIP: A Scheduling Algorithm for Input-Queued Switches," IEEE/ACM Transactions on Networking, Vol. 7, No. 2, pp. 188-201, Apr. 1999.
17. T. Anderson, S. Owicki, J. Saxe, and C. Thacker, "High-Speed Switch Scheduling for Local Area Networks," ACM Transactions on Computer Systems, Nov. 1993.
18. Agilent Technologies, "HDMP-2840 Product Brief," Aug. 2002.
19. D. Pnevmatikatos and G. Kornaros, "ATLAS II: Optimizing a 10Gbps Single-chip ATM Switch," IEEE International ASIC/SOC Conference, 1999.
20. S. Floyd, "HighSpeed TCP for Large Congestion Windows," IETF RFC 3649, Dec. 2003.
21. L. Xu, K. Harfoush, and I. Rhee, "Binary Increase Congestion Control for Fast Long-Distance Networks," Proc. INFOCOM, Mar. 2004.
22. Advanced Switching Interconnect SIG Website, http://www.asisig.org/education