
1 Gbps) and RTTs are so low (< 0.1 ms) that packets from different servers generally reach the bottleneck link almost simultaneously [1]. Hence, even a drop-tail bottleneck queue effectively drops these packets with equal probability. Moreover, we assume a connection times out only if all packets in its entire sending window are lost; references [1], [5] show that full-window loss is the dominant kind of timeout causing TCP Incast, so other kinds of timeout are negligible in Incast modeling. Finally, we assume minRTO (200 ms by default) is large relative to the transmission time of the requested data block, so even a single timeout leads to Incast with goodput collapse.

B. Modeling TCP Incast Probability

The client suffers from Incast if any of its connections times out, so the Incast probability at time t is

$$P_{\mathrm{Incast}}(t) = 1 - \prod_{1 \le i \le n(t)} \bigl(1 - P_i(t)\bigr) \qquad (1)$$

$$P_i(t) = p(t)^{\,w_i(t)}, \quad \text{for } 1 \le i \le n(t) \qquad (2)$$

Let $W(t) = \sum_{i=1}^{n(t)} w_i(t)$ be the sum of the servers' sending windows and $V(t)$ be the sum of the background flows' window sizes. At this time, $W(t) + V(t)$ packets are injected into the network, and $C \cdot D$ packets are served by the bottleneck link. Suppose that the number of packets arriving at the link exceeds the buffer size; then $W(t) + V(t) - B - C \cdot D$ packets are dropped, each with equal probability. So the packet drop rate $p(t)$ satisfies

$$p(t) = \frac{W(t) + V(t) - B - C \cdot D}{W(t) + V(t)} \qquad (3)$$
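As a concrete illustration, the drop-rate and timeout model in (1)–(3) can be sketched in a few lines of Python; all numeric values below (buffer size, bandwidth-delay product, window sizes, background traffic) are hypothetical and chosen only for demonstration.

```python
# Sketch of the Incast probability model in (1)-(3).
# All parameter values below are illustrative assumptions, not from the paper.

def drop_rate(W, V, B, CD):
    """Eq. (3): fraction of injected packets dropped when the buffer overflows."""
    overflow = W + V - B - CD
    return max(overflow, 0) / (W + V)

def incast_probability(windows, V, B, CD):
    """Eqs. (1)-(2): probability that at least one connection loses its full window."""
    W = sum(windows)
    p = drop_rate(W, V, B, CD)
    prob_all_survive = 1.0
    for w in windows:
        P_i = p ** w              # Eq. (2): full-window loss probability
        prob_all_survive *= (1.0 - P_i)
    return 1.0 - prob_all_survive  # Eq. (1)

# Illustrative scenario: 8 servers with 10-packet windows, background traffic
# V = 40 packets, buffer B = 100 packets, bandwidth-delay product C*D = 10 packets.
windows = [10] * 8
print(incast_probability(windows, V=40, B=100, CD=10))
```

With these illustrative numbers the drop rate is small (10 dropped out of 120 injected), so the full-window loss probability per connection, and hence the Incast probability, is tiny; shrinking the buffer or adding background traffic raises it sharply.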

Here $P_i(t)$, the timeout probability of the $i$-th connection, is determined jointly by the connection's sending window size $w_i(t)$ and the packet drop rate $p(t)$; since full-window loss is the main cause of timeout, $P_i(t)$ is approximated by (2). Substituting (2) and (3) into (1), the Incast probability becomes

$$P_{\mathrm{Incast}}(t) = 1 - \prod_{1 \le i \le n(t)} \left[\, 1 - \left( \frac{W(t) + V(t) - B - C \cdot D}{W(t) + V(t)} \right)^{w_i(t)} \right] \qquad (4)$$

C. Minimizing Incast Probability from the Application Layer

Next, we analyze how the client application can minimize the Incast probability $P_{\mathrm{Incast}}(t)$ by tuning the parameters in (4). To reveal each individual parameter's impact on Incast, we keep the remaining parameters fixed while analyzing one parameter. Since we focus on minimizing $P_{\mathrm{Incast}}(t)$ at a given time $t$, we omit $t$ from the notation, e.g., writing $n$ for $n(t)$.

1) Adjusting the window sizes $w = (w_i, 1 \le i \le n)$ while fixing the other parameters ($B$, $C$, $D$, $W$, $V$, $n$): In this case, the packet loss rate $p = (W + V - B - C \cdot D)/(W + V)$ is a constant. Then, according to (4), minimizing the Incast risk $P_{\mathrm{Incast}}(w)$ amounts to solving the following optimization problem via the method of Lagrange multipliers:

$$\max_{w} \; \ln[1 - P_{\mathrm{Incast}}(w)] = \sum_{i=1}^{n} \ln\left(1 - p^{w_i}\right) \qquad (5)$$

subject to

$$g(w) = \sum_{i=1}^{n} w_i - W = 0. \qquad (6)$$

It is easy to check that the Hessian matrix of $-\ln[1 - P_{\mathrm{Incast}}(w)]$ is positive semi-definite over the region $w \ge 0$, which means that $\ln[1 - P_{\mathrm{Incast}}(w)]$ is concave, and thus globally maximized by a Lagrange multiplier $\lambda$ and a unique window allocation $w^* = (w_i^*, 1 \le i \le n)$ if and only if

$$\frac{\partial \ln[1 - P_{\mathrm{Incast}}(w^*)]}{\partial w_i} = \lambda \, \frac{\partial g(w^*)}{\partial w_i}, \;\text{for } 1 \le i \le n; \qquad g(w^*) = \sum_{i=1}^{n} w_i^* - W = 0$$

So the window allocation maximizing $\ln[1 - P_{\mathrm{Incast}}(w)]$ is

$$w_i^* = \frac{W}{n} > 0, \;\text{for } 1 \le i \le n. \qquad (7)$$
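This optimality can be checked numerically: for a fixed window sum W and drop rate p, perturbing the allocation away from the equal split only increases the Incast probability. The sketch below uses illustrative parameter values.

```python
# Numerical check of (7): among allocations with a fixed sum W,
# the equal split minimizes P_Incast(w) = 1 - prod(1 - p**w_i).
# Parameter values are illustrative assumptions.

def p_incast(windows, p):
    prob = 1.0
    for w in windows:
        prob *= (1.0 - p ** w)
    return 1.0 - prob

p = 0.1           # fixed packet drop rate
W, n = 30.0, 3    # total window and connection count

equal = [W / n] * n          # the allocation in (7): 10 packets each
skewed = [5.0, 10.0, 15.0]   # same sum W, unequal windows

# The skewed allocation's short window (5 packets) dominates the Incast risk.
assert p_incast(equal, p) < p_incast(skewed, p)
```

Intuitively, a connection with a small window loses its full window far more easily (p^5 vs. p^10 here), and that single weak connection drives up the product in (1).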

This allocation thereby minimizes the Incast probability $P_{\mathrm{Incast}}(w)$ in (4).

Remark: To minimize Incast risk, the client should allocate an identical sending window size to its concurrent connections. Such an allocation is feasible in practice: the client knows the window sum $W$ and the number of connections $n$, and can regulate each connection's sending window $w_i$ by setting the TCP advertised window (awnd) field in its ACK packets.

2) Adjusting the number of concurrent connections $n$ while fixing the other parameters ($B$, $C$, $D$, $W$, $V$): According to the Incast probability in (4) and the optimal window allocation in (7), the timeout-free probability $1 - P_{\mathrm{Incast}}(n)$ can be rewritten as

$$1 - P_{\mathrm{Incast}}(n) = \left(1 - p^{W/n}\right)^n \qquad (8)$$

where $p = (W + V - B - C \cdot D)/(W + V)$ is the packet drop rate. To maximize $\ln[1 - P_{\mathrm{Incast}}(n)]$, we take its first derivative:

$$\frac{\partial \ln[1 - P_{\mathrm{Incast}}(n)]}{\partial n} = \ln\left(1 - p^{W/n}\right) + \frac{W \ln(p)\, p^{W/n}}{n \left(1 - p^{W/n}\right)} \qquad (9)$$
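A quick numerical check of (8)–(9), with illustrative values of W and p, confirms that the timeout-free probability shrinks as n grows:

```python
# Check that (1 - p**(W/n))**n, the timeout-free probability in (8),
# decreases monotonically in n for p in (0, 1). Values are illustrative.

def timeout_free(n, W, p):
    return (1.0 - p ** (W / n)) ** n

W, p = 20.0, 0.5
values = [timeout_free(n, W, p) for n in range(1, 21)]

# Each additional connection lowers the timeout-free probability,
# i.e., raises P_Incast(n), consistent with the negative derivative in (9).
assert all(a > b for a, b in zip(values, values[1:]))
```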

This derivative is negative for any drop rate $p \in (0, 1)$. Accordingly, the Incast probability $P_{\mathrm{Incast}}(n)$ is an increasing function of the connection count $n$ and is minimized at $n = 1$.

Remark: The client can reduce the Incast probability by lowering the number of concurrent connections $n$ while keeping the total window size $W$ unaltered (by requesting data from a subset of the servers). In this way, the client mitigates Incast without affecting link utilization. Note, however, that the client should not permanently pin $n$ to a very low value such as $n = 1$: a small $n$ can waste considerable bandwidth when each connection has so little data to send (one SRU) that it finishes before fully utilizing the link [1]. How to choose a reasonable value of $n$ is studied below. We will verify the above model (4), (7), (9) in Section IV-A.

III. ADAPTIVE APPLICATION-LAYER INCAST CONTROL BASED ON A SLIDING-CONNECTION-WINDOW MECHANISM

Based on the analysis of (7) and (9), we design Adaptive Application-layer Incast Control (AAIC). With AAIC, the client dynamically tunes the number of concurrent connections using a sliding-connection-window mechanism and assigns an equal awnd to each connection. A connection terminated by timeout is quickly re-established by AAIC.

A. Equally Allocating the Advertised Window to Connections

As (7) indicates, equally allocating the sending window among the existing connections minimizes the risk of TCP Incast. Since the sending window size is upper-bounded by the advertised window size (awnd), AAIC can emulate the equal sending-window allocation by setting awnd at the client (e.g., through the setsockopt() call in BSD systems) as follows:

$$awnd_i = \frac{\Omega}{n}, \;\text{for } 1 \le i \le n \qquad (10)$$

where $awnd_i$ is the awnd of connection $i$, and $\Omega$ is the total awnd of the $n$ concurrent connections. Although this adjustment is suboptimal (awnd only upper-bounds the sending window), it curbs the non-uniformity of the sending window sizes and thus decreases the Incast probability. Next, we choose the total awnd $\Omega$. The connections can fully utilize the bottleneck link without self-induced losses if $\Omega = C \cdot D + B$. Because the bandwidth-delay product $C \cdot D$ ($\approx$ 10 KB) is usually negligible compared to the buffer size $B$ (> 100 KB) [1], AAIC simply lets $\Omega \approx B$. AAIC then distributes the total awnd $\Omega = B$ equally among the $n$ connections as

$$awnd_i = \frac{B}{n}, \;\text{for } 1 \le i \le n \qquad (11)$$

However, with (11) alone, AAIC cannot guarantee that the concurrent connections stay timeout-free when background traffic exists. To further reduce the Incast risk, AAIC must adaptively tune the number of concurrent connections $n$ and quickly recover data transmission from timed-out servers. These two demands are met by the following two mechanisms, respectively.

B. Determining the Number of Concurrent Connections with a Sliding-Connection-Window Mechanism

To utilize the bottleneck link fully without inducing Incast, AAIC uses a sliding-window-like mechanism to adjust the number of concurrent connections dynamically under varying conditions. Analogous to TCP congestion control, AAIC keeps a window holding the current connections, which we term the connection window, or con_wnd for short. Whenever the number of existing connections is less than the con_wnd size, a new connection is established and admitted into con_wnd; when a connection finishes, it is removed from the window. AAIC uses an Additive Increase Multiplicative Decrease (AIMD) policy to control the con_wnd size $n$, with initial value $n = 1$. Whenever a connection in con_wnd finishes, AAIC infers that the bottleneck link is not congested, and gradually increases $n$ as

$$n \leftarrow \min\left\{B,\; n + \frac{1}{n}\right\} \qquad (12)$$

As the con_wnd size $n$ grows, some connections inevitably time out because of background TCP traffic. Since minRTO = 200 ms is much longer than a connection's ordinary lifetime (mostly under 1 ms), AAIC deduces that a connection has been broken by a timeout once three new connections have been admitted and finished since the last time that connection transmitted any data. Upon detecting a timeout, AAIC knows the timeout probability is too high; by our analysis of (9), it can reduce that probability by cutting the number of concurrent connections while fixing the other parameters. Hence, to avoid triggering further timeouts, AAIC halves the con_wnd size $n$ as

$$n \leftarrow \max\left\{1,\; \frac{n}{2}\right\} \qquad (13)$$
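The con_wnd control loop built from (11)–(13) can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation; the names ConWnd, on_finish, and on_timeout are hypothetical.

```python
# Illustrative sketch of AAIC's sliding-connection-window control, eqs (11)-(13).
# Class and method names are hypothetical; B is the bottleneck buffer size in packets,
# which also serves as the total awnd budget and the cap on con_wnd in (12).

class ConWnd:
    def __init__(self, B):
        self.B = B        # buffer size / total awnd / con_wnd cap
        self.n = 1.0      # con_wnd size, initialized to 1 (kept as a float by AIMD)

    def awnd_per_connection(self):
        """Eq. (11): split the total awnd B equally among the current connections."""
        return self.B / int(self.n)

    def on_finish(self):
        """Eq. (12): additive increase when a connection finishes cleanly."""
        self.n = min(self.B, self.n + 1.0 / self.n)

    def on_timeout(self):
        """Eq. (13): multiplicative decrease when a timeout is detected."""
        self.n = max(1.0, self.n / 2.0)

cw = ConWnd(B=100)
for _ in range(10):
    cw.on_finish()     # link looks uncongested: grow con_wnd
cw.on_timeout()        # a connection timed out: halve con_wnd
```

Note how the increase step 1/n mirrors TCP's congestion-avoidance behavior: con_wnd grows by roughly one connection per "round" of n completed connections.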

C. Slow Withdrawal and Fast Reconnection

When timeouts occur on $l$ connections, AAIC should halve con_wnd (of size $n$) to $n/2$ according to (13). A naïve way to implement (13) is to instantly terminate $n/2$ existing connections and remove them from con_wnd. However, if more than $n/2$ live connections have not timed out, the naïve scheme must close some of these live connections, incurring unnecessary data retransmissions. To avoid this, AAIC uses a scheme called slow withdrawal, which shrinks con_wnd gradually to $n/2$ without closing live connections. First, it closes and removes the $l$ timed-out connections from con_wnd, and sets con_wnd to the greater of $n - l$ and $n/2$. Next, if con_wnd is
