IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, VOL. 59, NO. 8, OCTOBER 2010

Distributed Reinforcement Learning Frameworks for Cooperative Retransmission in Wireless Networks
Ghasem Naddafzadeh-Shirazi, Peng-Yong Kong, and Chen-Khong Tham

Abstract—We address the problem of cooperative retransmission in the media access control (MAC) layer of a distributed wireless network with spatial reuse, where there can be multiple concurrent transmissions from the source and relay nodes. We propose a novel Markov decision process (MDP) framework for adjusting the transmission powers and transmission probabilities of the source and relay nodes to achieve the highest network throughput per unit of consumed energy. We also propose distributed methods that avoid solving a centralized MDP model with a large number of states by employing model-free reinforcement learning (RL) algorithms. We show convergence to a local solution and compute a lower bound on the performance of the proposed RL algorithms. We further confirm empirically that the proposed learning schemes are robust to collisions, scale with the network size, and provide significant cooperative diversity while enjoying low complexity and fast convergence.

Index Terms—Distributed Markov decision process (MDP) for wireless networks, media access control (MAC) cooperative retransmission, reinforcement learning (RL).

I. INTRODUCTION

The problem of cooperation in wireless networks has received significant attention in recent years. Efficient cooperation among nodes can significantly improve the performance of a wireless network, because some of the data sent by a source node may be missed by the intended destination but successfully received by a neighbor node with potentially better channel quality. Using the finite-state Markov channel (FSMC) model in [2], which models the wireless channel as a Markov process, the problem of cooperation can also be cast as a Markov decision process (MDP) by defining an appropriate action set, reward function, and state transition probabilities. Some examples of distributed MDP and reinforcement learning (RL) models are discussed in [3]–[5], and distributed value functions (DVFs) are discussed in [6]. In these studies, each agent, i.e., a wireless node in our context, has a local MDP consisting of its own action, state, and reward functions. The agents then coordinate to find a near-optimal solution by exchanging limited information only with their neighbors. More details about these models and their learning variations will be given in Section III. In another similar work, Kok and Vlassis [7] compare the performance of the DVF and other cooperative learning methods in a general multiagent environment using the topology information.

Examples of using the MDP model and RL for wireless problems other than cooperative communication are discussed in [8]–[11], which seek the optimal adaptive transmission rate and power for a single source–destination (S-D) pair. In addition, Dianati et al. [12] and Naddafzadeh-Shirazi et al. [13] design distributed MDP models for the cooperation problem; however, they do not investigate distributed learning methods. In [14], the nodes collaboratively learn near-optimal routing strategies using local channel-state information. Our learning methods bring similar RL techniques to the cooperation problem in the MAC layer. In this paper, we propose MDP and RL frameworks for the cooperation problem to maximize the network throughput (i.e., the total number of useful received packets) per unit of consumed energy, while limiting the amount of collisions that occur in the network. Moreover, this paper extends our work in [1] by computing a lower bound on the performance of the proposed RL methods and showing their convergence to near-optimal cooperation strategies.

Manuscript received September 13, 2009; revised February 5, 2010 and April 27, 2010; accepted June 10, 2010. Date of publication July 19, 2010; date of current version October 20, 2010. This work was done under the Ultrawideband-Enabled Sentient Computing (UWB-SC) Architecture and Middleware with Coordinated Quality-of-Service Project, which is part of the UWB-SC Research Program funded by the Science and Engineering Research Council, Agency for Science, Technology, & Research, Singapore. This work was presented in part at the 20th IEEE International Symposium on Personal, Indoor, and Mobile Radio Communications. The review of this paper was coordinated by Dr. L. Li. The authors are with the Institute for Infocomm Research, Agency for Science, Technology & Research, Singapore 138632, and also with the Department of Electrical and Computer Engineering, National University of Singapore, Singapore 117576. Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TVT.2010.2059055

II. MARKOV DECISION PROCESS-BASED COOPERATION FRAMEWORK

Fig. 1(a) shows a simple cooperative retransmission scenario in the MAC layer. As can be seen, we consider an N-node slotted Aloha system [15]. Slotted Aloha is a simple MAC protocol and makes the implementation of cooperation protocols easier. Note that the proposed protocols in this paper can be adapted to work with more efficient MAC protocols, e.g., carrier sense multiple access. However, we do not consider these protocols here, in order to avoid further algorithmic details and to focus on illustrating the essential properties of distributed cooperative learning in wireless networks. Each node in the network, i.e., Ri, i = 1, ..., N, can be a source (S), a destination (D), and/or a relay (L). We assume that the main routes between S-D pairs are already established and known to the nodes on the routes. Hence, considering only MAC-layer cooperation, there are direct transmissions along the main routes (solid lines), as well as supportive cooperation links (dashed lines). We assume that packets are generated at the source nodes according to a Poisson arrival process with rate λi. We consider a medium with two separate channels, as can be seen in Fig. 1(b). The main (data) channel is used for transferring data among the nodes. The control channel is used for a stop-and-wait-based automatic repeat request mechanism and for the RL parameter exchange among the neighboring nodes. Exchanging RL parameters among neighbors is necessary to find a system-wide near-optimal solution in the distributed RL methods, as will be explained in Section III. The control channel is assumed error-free, i.e., the message passing among neighbor nodes can be performed perfectly.

A. Proposed MDP Model—States and Actions

An MDP is defined as a tuple M = (A, S, T, ρ), where A and S denote the action and state spaces, T(s'|s, a) indicates the transition probability from state s ∈ S to s' ∈ S after taking action a ∈ A, and ρ(s, a) is the reward obtained by taking action a in state s. The state space consists of the nodes' link qualities and buffer state information. Concerning the link qualities, we assume a slow Rayleigh fading environment with the quasi-static property. This means that the link quality is fixed during a transmission and changes according to the FSMC model only at the beginning of each time slot. Let Ql denote the channel quality of link l, and define a set of predetermined received SNR values {q0, q1, ..., qM−1, qM} such that 0 = q0 < q1 < ··· < qM; link l is said to be in FSMC state m if qm ≤ Ql < qm+1. When several links transmit concurrently toward a destination D, the packet on the strongest link l* is successfully captured at D only if
$$\eta \le \frac{Q_{l^*}}{\sum_{l' \neq l^*} Q_{l'} + \sigma_n^2}$$
where l* = arg max_l Ql corresponds to the link with the best channel quality, and l' denotes the links whose transmissions to other destinations interfere at D. Also, η is a predefined threshold based on the receiver's structure, and σ_n² is the Gaussian noise variance, which is assumed equal for all links. Therefore, collisions may happen when two or more links with good channel quality transmit concurrently and the interference is large. On the other hand, if the interference resulting from other links is small due to their poor quality, cooperation still leads to a successful packet delivery to D. Based on this model, cooperation is useful only if the S-D link is of poor quality.

To include the buffer state information in the MDP state, let Bi and Ci denote the direct and cooperative buffers of Ri, respectively. The packets that are overheard from neighbor nodes are stored in Ci, and direct packets are stored in Bi. Also, let D(Bi) and D(Ci) denote the intended destinations of the packets at the head of Bi and Ci, respectively. Note that it is necessary to include the sizes of the buffers in the state space to enable the node to decide based on the number of packets available for transmission. The sizes of the direct and cooperative buffers, i.e., the number of packets currently in the buffers, are denoted by b = |Bi| and c = |Ci|, respectively. Combining the above, the overall state space for node Ri is given by (|Bi|, |Ci|, μ_{i,D(Bi)}, μ_{i,D(Ci)}), where μ_{i,j} denotes the quantized FSMC quality of the link from Ri to node j. In other words, each node should keep track of its direct and cooperative buffer sizes, as well as the link qualities toward the intended destinations of the packets at the head of these buffers. The latter can be obtained by sensing the channel when receiving data from the corresponding node or by piggybacking such information in the positive acknowledgment/negative acknowledgment (ACK/NAK) packets when channel sensing is impossible.
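To make the state construction concrete, here is a minimal Python sketch (our own illustration, not from the paper; `Q_THRESHOLDS`, `quantize_snr`, `LocalState`, and `observe_state` are hypothetical names, and the threshold values are arbitrary) that quantizes a measured SNR into an FSMC state and assembles the local state tuple (|Bi|, |Ci|, μ_{i,D(Bi)}, μ_{i,D(Ci)}):

```python
import bisect
from dataclasses import dataclass

# Hypothetical SNR thresholds 0 = q0 < q1 < ... < qM (linear scale), chosen for illustration.
Q_THRESHOLDS = [0.0, 0.5, 1.0, 2.0, 4.0, 8.0]

def quantize_snr(snr: float) -> int:
    """Map a measured SNR to its FSMC state index m, i.e., the m with q_m <= snr < q_{m+1}."""
    return max(0, bisect.bisect_right(Q_THRESHOLDS, snr) - 1)

@dataclass(frozen=True)
class LocalState:
    direct_buffer_len: int   # |B_i|
    coop_buffer_len: int     # |C_i|
    mu_direct: int           # quantized quality of the link to D(B_i)
    mu_coop: int             # quantized quality of the link to D(C_i)

def observe_state(direct_buffer, coop_buffer, snr_to_direct_dst, snr_to_coop_dst):
    """Assemble the local MDP state (|B_i|, |C_i|, mu_{i,D(B_i)}, mu_{i,D(C_i)})."""
    return LocalState(
        direct_buffer_len=len(direct_buffer),
        coop_buffer_len=len(coop_buffer),
        mu_direct=quantize_snr(snr_to_direct_dst),
        mu_coop=quantize_snr(snr_to_coop_dst),
    )

if __name__ == "__main__":
    s = observe_state(direct_buffer=[1, 2], coop_buffer=[7],
                      snr_to_direct_dst=1.7, snr_to_coop_dst=5.3)
    print(s)  # LocalState(direct_buffer_len=2, coop_buffer_len=1, mu_direct=2, mu_coop=4)
```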

The action of the MDP is given by ai = (a_i^s, a_i^c, e_i), where a_i^s and a_i^c denote the probabilities of relaying direct and cooperative packets, respectively. Also, e_i is the quantized transmission power, e_i ∈ {jE/m_e}_{j=1}^{m_e}, where E is the maximum allowed transmission power, and m_e is a quantization factor. Node Ri transmits the packet at the head of Bi and Ci with probability a_i^s ≥ 0 and a_i^c ≥ 0, respectively, using the transmission power e_i. The transmission probability space is obtained by quantizing the interval [0, 1] into m_a equal intervals, and the probabilities are pruned such that a_i^s + a_i^c ≤ 1. Thus, the node keeps silent with probability 1 − a_i^s − a_i^c.

B. Objective and Reward Function

We use throughput (the total number of useful received packets) per unit of consumed energy (for transmission) as the performance metric of our system. Let v(t) be the number of packets that are successfully delivered to their (MAC-layer) destinations at time slot t, and let e(t) be the transmission power consumed by all transmitters at time slot t, i.e., e(t) = Σ_{i=1}^{N} e_i(t). Our objective is to maximize the throughput per transmitted energy over an infinite time horizon

$$J = \lim_{\tau \to \infty} \frac{1}{\tau}\, \mathbb{E}\left[ \sum_{t=1}^{\tau} \frac{v(t)}{e(t)} \right]. \tag{1}$$

The unit of J is then packets per millijoule. Note that, similar to [8], [13], and [14], maximizing J takes into account both the throughput and the energy consumption of the nodes. Therefore, the proposed MDP model is well suited to energy-constrained networks, such as sensor networks. To maximize J, nodes should appropriately decide on the transmission power as well as the transmission and cooperation probabilities. To this end, we use the following reward function for the nodes. Nodes that do not transmit packets receive a reward equal to 0. Otherwise, a reward is assigned to the transmitters based on their success or failure. Specifically, the reward for node Ri at time slot t is



$$\rho_i(t) = \begin{cases} 0, & \text{no transmission or failure} \\ \dfrac{1}{e_i(t)}, & \text{success} \end{cases} \tag{2}$$

where e_i(t) is the transmission power of Ri at time slot t. Therefore, a success with a lower transmission power results in a higher reward, while no reward is given for a failure. A success or failure can be locally detected when the intended destination replies in the control channel with an ACK or a NAK, respectively. The reward function in (2) is suitable for maximizing J in a distributed manner because it takes into account both successful transmissions (possibly by other transmitters) and the consumed energy. Moreover, the reward function is suitable for the distributed RL methods, as will be discussed in Section III.

III. DISTRIBUTED LEARNING ALGORITHMS

A. DVFs

In DVFs [6], each node operates based on a distributed MDP and some limited information from its neighbors. The main idea is to coordinate the entire system by exchanging only the current value functions among the neighbors. Interesting results on applying DVFs to distributed systems are demonstrated in [16]. In the DVF method, the value functions are communicated among the neighboring nodes, and each node tries to maximize the weighted sum of its own and its neighbors' value functions. Therefore, the Bellman optimality equation for DVFs takes the form

$$V_i^*(s_i) = \max_{a_i \in A_i} \left\{ \rho_i(s_i, a_i) + \gamma \sum_{s_i' \in S_i} P_i(s_i' \mid s_i, a_i) \sum_{j \in Nei(i)} w_i(j)\, V_j^*(s_j') \right\} \tag{3}$$

where Nei(i) is the set of nodes in the transmission range of Ri, and w_i(j) is the weight of Rj's value function at Ri. According to [6], the DVF Q-learning update rule can be written as

$$Q_i(s_i, a_i) \leftarrow (1 - \alpha)\, Q_i(s_i, a_i) + \alpha \left( \rho_i(s_i, a_i) + \gamma \sum_{j \in Nei(i)} w_i(j)\, V_j(s_j') \right) \tag{4}$$

$$V_i(s_i) = \max_{a \in A_i} Q_i(s_i, a)$$

where α is the learning rate, and the subscript i indicates the local MDP model in Ri. Q_i(s_i, a_i) is called the Q-value of each local state–action pair and indicates the time-averaged reward obtained from performing action a_i in the (local) state s_i.
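As an illustration of how (4) can be realized at a node, the following Python sketch (our own illustrative code, not from the paper; the class and method names, dictionary-based table, and parameter defaults are assumptions) performs one DVF-style Q-update from the locally observed transition and the neighbor values received over the control channel:

```python
from collections import defaultdict

class DVFAgent:
    """Minimal sketch of the DVF Q-update in (4) for one node R_i."""

    def __init__(self, actions, alpha=0.1, gamma=0.9):
        self.actions = actions          # local action set A_i
        self.alpha = alpha              # learning rate
        self.gamma = gamma              # discount factor
        self.Q = defaultdict(float)     # Q_i(s_i, a_i), default 0

    def value(self, state):
        """V_i(s_i) = max_a Q_i(s_i, a); broadcast to neighbors each slot."""
        return max(self.Q[(state, a)] for a in self.actions)

    def update(self, state, action, local_reward, neighbor_values, weights):
        """One update of (4): neighbor_values[j] is V_j(s_j') reported by neighbor j,
        and weights[j] is w_i(j)."""
        weighted_v = sum(weights[j] * v for j, v in neighbor_values.items())
        target = local_reward + self.gamma * weighted_v
        q = self.Q[(state, action)]
        self.Q[(state, action)] = (1 - self.alpha) * q + self.alpha * target
```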

B. GRL

In another distributed MDP approach [3], local states and actions are assumed; however, the reward function is globally shared among the nodes. In this model, the neighboring policies are needed to find the actions of the corresponding nodes. A dynamic programming method can be used locally in each node to find the optimal policy based on the following Bellman equation:

$$V_i^*(s) = \max_{a \in A} \left\{ \rho_i(s, a) + \gamma \sum_{j \in Nei(i)} w_i(j) \sum_{s' \in S} P(s' \mid s, a)\, V_j^*(s') \right\}. \tag{5}$$

Unfortunately, Chang and Fu [3] do not provide a learning framework for model-free situations. When only the local state s_i and the immediate global reward ρ are available to Ri, the optimal local policies can be found by the following Q-learning rule:

$$Q_i(s_i, a_i) \leftarrow (1 - \alpha)\, Q_i(s_i, a_i) + \alpha \left( \rho + \gamma \max_{a \in A_i} Q_i(s_i', a) \right). \tag{6}$$

By comparing (6) with (4), it can be observed that the local rewards ρ_i(s_i, a_i) in the DVF are replaced with the global reward ρ in global reward-based learning (GRL). In addition, the weighting mechanism over value functions is removed in GRL. When the global reward is available to all nodes, this learning method is shown to converge to an optimal policy that maximizes the expected global reward of the system [6]. In our model, this global reward can be seen as the immediate value of J at the current time slot t, i.e., ρ = v(t)/e(t), which is unavailable to the individual nodes. To implement GRL, node Ri therefore approximates the global reward by the average reward in its neighborhood, i.e., ρ_i ≈ (1/(|Nei(i)| + 1)) Σ_{j ∈ i ∪ Nei(i)} ρ_j(s_j, a_j), where |Nei(i)| is the number of Ri's neighbors. We refer to this learning method as GRL hereafter.
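The GRL rule (6), with the neighborhood-averaged reward standing in for the unavailable global reward, might be sketched as follows (illustrative code, not from the paper; class and method names are our own, and the parameter defaults are assumptions):

```python
from collections import defaultdict

class GRLAgent:
    """Minimal sketch of the GRL update (6) with a neighborhood-averaged global reward."""

    def __init__(self, actions, alpha=0.1, gamma=0.9):
        self.actions = actions
        self.alpha = alpha
        self.gamma = gamma
        self.Q = defaultdict(float)

    @staticmethod
    def approx_global_reward(own_reward, neighbor_rewards):
        """Estimate rho as the average of the local reward and the rewards reported by neighbors."""
        rewards = [own_reward] + list(neighbor_rewards)
        return sum(rewards) / len(rewards)

    def update(self, state, action, next_state, own_reward, neighbor_rewards):
        """One update of (6) using the approximated global reward."""
        rho = self.approx_global_reward(own_reward, neighbor_rewards)
        best_next = max(self.Q[(next_state, a)] for a in self.actions)
        q = self.Q[(state, action)]
        self.Q[(state, action)] = (1 - self.alpha) * q + self.alpha * (rho + self.gamma * best_next)
```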

C. DRV Functions

We now propose another learning method, called the distributed reward and value (DRV) function, which combines the two aforementioned methods, namely, the DVF and the GRL. Our proposed method is based on the DVF, with the difference that, in addition to the value functions, the rewards are also communicated between the neighbors. Mathematically, the nodes use the following Q-learning rule in the DRV:

$$Q_i(s_i, a_i) \leftarrow (1 - \alpha)\, Q_i(s_i, a_i) + \alpha \left( \sum_{j \in Nei(i)} w_i'(j)\, \rho_j(s_j, a_j) + \gamma \sum_{j \in Nei(i)} w_i(j)\, V_j(s_j') \right) \tag{7}$$

where w_i'(j) are the weights given to the rewards of the neighbors, analogous to the weights w_i(j) used for the value functions here and in (4). Note that in the GRL and the DVF, either rewards or value functions are communicated, whereas in the proposed DRV method, both rewards and value functions are shared among neighbors. The rationale behind communicating both is to provide a balance between the immediate and the long-term reward in the system. More specifically, since the immediate reward emphasizes the current status of the system and the value function is a long-term average of the rewards, a more complete view of the system can be obtained at each node by communicating both reward and value functions.
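A corresponding sketch of the DRV update (7) is shown below (illustrative code, not the authors'; the table layout, parameter defaults, and method names are our own):

```python
from collections import defaultdict

class DRVAgent:
    """Minimal sketch of the DRV update (7): both neighbor rewards and neighbor values are used."""

    def __init__(self, actions, alpha=0.1, gamma=0.9):
        self.actions = actions
        self.alpha = alpha
        self.gamma = gamma
        self.Q = defaultdict(float)

    def value(self, state):
        """V_i(s_i) = max_a Q_i(s_i, a); exchanged with neighbors together with the last reward."""
        return max(self.Q[(state, a)] for a in self.actions)

    def update(self, state, action, neighbor_rewards, neighbor_values, w_value, w_reward):
        """One update of (7): neighbor_rewards[j] = rho_j, neighbor_values[j] = V_j(s_j'),
        w_value[j] = w_i(j), and w_reward[j] = w_i'(j)."""
        reward_term = sum(w_reward[j] * r for j, r in neighbor_rewards.items())
        value_term = sum(w_value[j] * v for j, v in neighbor_values.items())
        q = self.Q[(state, action)]
        self.Q[(state, action)] = (1 - self.alpha) * q + self.alpha * (reward_term + self.gamma * value_term)
```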

IV. ALGORITHMIC DETAILS AND PROPERTIES OF THE PROPOSED REINFORCEMENT LEARNING MODELS

A. Implementation Details—The Distributed Cooperation Protocol

Fig. 2 shows the sequence of steps for executing the proposed RL algorithms in a distributed manner at each node. As can be seen, the following procedure is executed at node Ri in time slot t.

Fig. 2. Learning algorithm sequence in each time slot.

1) Ri determines its local state s_i = (|Bi|, |Ci|, μ_{i,D(Bi)}, μ_{i,D(Ci)}) by observing its local buffers and link qualities. As stated previously, these link qualities are obtained by sensing the channel or from data piggybacked on control packets.

2) In the DRV, the value of the current state V_i^t(s_i) and the reward obtained in the previous time slot ρ_i^{t−1} are broadcast in the control channel. In the DVF, only the values of V_i^t(s_i) are exchanged, while in the GRL, only the values of ρ_i^{t−1} are broadcast, and the global reward is estimated by averaging over the neighboring rewards. After this stage, a node has successfully received the rewards or value functions (or both) of its entire neighbor set.

3) According to the selected learning method, the Q-learning formulas in (4), (6), or (7) are used to update the Q-values in the DVF, GRL, or DRV method, respectively.

4) The best action is chosen according to the current policy, i.e., the action that results in the highest expected reward.¹ This action determines the packet to be sent in the data channel and the corresponding transmission power.

5) After the transmission, destinations send ACK or NAK messages in the control channel. Ri then uses this feedback to calculate its reward according to (2); a code sketch of this per-slot loop is given at the end of this subsection.

Note that since we assume a slow Rayleigh fading channel, the channel-state transitions are very infrequent. Therefore, a reduced-overhead protocol can be designed in such a way that the control packets are exchanged by node Ri only when a state transition occurs in S_i. Hence, the overhead would be on the order of Θ(f_t N), where f_t ≪ 1 is the probability of a state transition [2]. Note, however, that the rate of convergence would then be slower by a factor of f_t.
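The per-slot procedure might look as follows in code (a simplified sketch under our own assumptions: a DRVAgent-like learner from the earlier sketch, greedy action selection as in step 4, and hypothetical node helpers `observe_state`/`broadcast`/`collect`/`transmit`/`wait_for_ack` wrapping the radio and control channel; equal value and reward weights are used, as in the simulations later):

```python
def run_time_slot(agent, node, t):
    """One iteration of the distributed cooperation protocol (steps 1-5) at node R_i."""
    # 1) Observe the local state (buffer sizes and quantized link qualities).
    state = node.observe_state()

    # 2) Exchange learning parameters on the control channel (values and/or rewards).
    node.broadcast(value=agent.value(state), reward=node.last_reward)
    neighbor_values, neighbor_rewards = node.collect()

    # 3) Update the Q-table with the selected rule, here the DRV update (7).
    if node.last_state is not None:
        agent.update(node.last_state, node.last_action,
                     neighbor_rewards, neighbor_values,
                     w_value=node.weights, w_reward=node.weights)

    # 4) Greedy action selection (softmax or epsilon-greedy could be used instead).
    action = max(agent.actions, key=lambda a: agent.Q[(state, a)])
    node.transmit(action)  # picks the direct/cooperative packet and the transmission power

    # 5) Collect ACK/NAK feedback and compute the local reward as in (2).
    success, tx_power = node.wait_for_ack()
    node.last_reward = (1.0 / tx_power) if (success and tx_power > 0) else 0.0
    node.last_state, node.last_action = state, action
```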

B. Convergence Behavior

The same line of reasoning as for the DVF [6] and the GRL [3] can be used to prove that the proposed distributed solution, i.e., the DRV, also converges to a local optimum. Moreover, the time complexity of the proposed RL methods can be analyzed using standard Q-learning analysis from the literature. Specifically, as shown in [17] and [18], the Q-learning convergence time is a polynomial function of the state and action sizes. This is in contrast to the exponential time complexity of model-based MDP solutions. We formally state these properties in the following theorem.

Theorem 1: Suppose that the nodes in a wireless network use the cooperation protocol in Section IV-A with one of the DVF, GRL, or DRV methods. Then, the local node policies converge to a local optimal point of J in (1), i.e., a local maximum of the system throughput per consumed energy. Moreover, the worst-case convergence time in Ri is a polynomial function of |A_i||S_i|, i.e., the product of the local state and action sizes.

Proof: The convergence of the DVF and the GRL can be proved by an approach similar to that in [6]. The only difference between the DRV and the DVF is that, in the DRV, additional weights w_i'(j) are applied to the neighbor rewards. Hence, the optimal value function of the DRV can be written as

$$V_i^*(s) = \max_{a \in A_i} \left\{ \sum_{j=1}^{N} w_i''(j)\, \rho_j(s, a) + \gamma \sum_{s' \in S_i} P_i(s' \mid s, a) \sum_{j \in Nei(i)} w_i(j)\, V_j^*(s') \right\} \tag{8}$$

where

$$w_i''(j) = \begin{cases} w_i(j), & j \notin Nei(i) \\ w_i(j) + w_i'(j), & j \in Nei(i) \end{cases} \tag{9}$$

is the set of weights that intensifies the rewards of the neighboring nodes in the DVF convergence argument. Therefore, the DRV has the same convergence behavior as the DVF and the GRL. □

Theorem 1 shows that the proposed DRV behaves similarly to the DVF and GRL methods from the convergence point of view. Moreover, since the immediate reward is also exchanged among the neighbors, the effect of collisions with the neighboring nodes is emphasized more in the DRV. In fact, the value function carries information about the entire network as a long-term average, while the reward function intensifies the effect of the immediate neighboring nodes under the current channel conditions.

C. Lower Bound

An intuitive lower bound is given by the performance of a noncooperative protocol. To obtain a tighter lower bound on the performance of our learning methods, we relax the power control assumption in our system and assume that all nodes transmit with the maximum power E. In addition, we relax the assumption that nodes with direct transmissions can also act as relays. Under this new setting, a node is considered a relay only if it is not part of any main S-D route; that is, in Fig. 1(a), only the nodes labeled Li can be relays. These assumptions enable us to assert the following theorem for the lower bound on J.

Theorem 2: Denote the average packet arrival rate in the network by λ = (1/N) Σ_{i=1}^{N} λ_i. Then, a lower bound on the value of J in the proposed distributed cooperative protocols is given by

$$L_J = \frac{\lambda\,(1 - 2\epsilon^2 + \epsilon^3)}{2E} \tag{10}$$

where ε is the average failure probability of the FSMC. (The proof is given in the Appendix.) Note that, due to the unavailability of cooperative diversity, the noncooperative scheme may be unable to achieve this lower bound, even if perfect power control is applied.
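As a quick numerical check of (10) and of the success probability used in its proof (see the Appendix), the following short script (ours, with arbitrary example values for ε, λ, and E) evaluates the bound:

```python
def success_probability(eps: float) -> float:
    """p = 1 - (1 - (1 - eps)**2) * eps = 1 - 2*eps**2 + eps**3, the probability (per the
    Appendix) that a packet reaches D either directly from S or via the relay L."""
    return 1.0 - (1.0 - (1.0 - eps) ** 2) * eps

def lower_bound_J(arrival_rate: float, eps: float, max_power: float) -> float:
    """Lower bound (10): L_J = lambda * (1 - 2*eps^2 + eps^3) / (2 E)."""
    return arrival_rate * success_probability(eps) / (2.0 * max_power)

if __name__ == "__main__":
    eps, lam, E = 0.2, 0.6, 1.0           # example FSMC failure probability, load, and max power
    p = success_probability(eps)           # 1 - 2*0.04 + 0.008 = 0.928
    print(f"p = {p:.3f}, L_J = {lower_bound_J(lam, eps, E):.3f} packets per unit energy")
```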

V. PERFORMANCE EVALUATION

Here, we examine the performance of the RL methods in a cooperative wireless network by means of simulations. Table I summarizes the parameters used in the simulations. We use equal values of w_i(j) and w_i'(j) for the DVF and the DRV; in other words, each Ri sets w_i(j) = w_i'(j) = 1/(|Nei(i)| + 1), ∀j ∈ Nei(i), where |Nei(i)| denotes the number of Ri's neighbors. The effect of adjusting these weights on the system performance remains an open question for future work. Note that the transmission probabilities are selected from a quantization of the interval [0, 1], i.e., a_i^s, a_i^c ∈ {j/m_a}_{j=0}^{m_a}. Hence, since a_i^s + a_i^c ≤ 1, the total number of actions equals the number of admissible (a_i^s, a_i^c) pairs multiplied by m_e.
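For concreteness, a tiny enumeration of this quantized action space (our own illustrative helper; the values of m_a, m_e, and E are arbitrary) is:

```python
from itertools import product

def build_action_space(m_a: int, m_e: int, max_power: float):
    """Enumerate actions (a_s, a_c, e): transmission probabilities from {j/m_a} with
    a_s + a_c <= 1, and transmission power from {j*E/m_e, j = 1..m_e}."""
    probs = [j / m_a for j in range(m_a + 1)]
    powers = [j * max_power / m_e for j in range(1, m_e + 1)]
    return [(a_s, a_c, e) for (a_s, a_c), e in product(product(probs, probs), powers)
            if a_s + a_c <= 1.0]

if __name__ == "__main__":
    actions = build_action_space(m_a=4, m_e=3, max_power=1.0)
    print(len(actions))  # admissible (a_s, a_c) pairs times m_e = 15 * 3 = 45
```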

1 Alternatively, other well-known action selection mechanisms such as softmax [17] can be used at this stage.

A. Performance Comparison for Different Values of N Fig. 3 shows the value of J for DRV, DVF, GRL, and noncooperative (noncoop) methods when varying the number of nodes in the network from 5 to 20. The lower bound (LB) in Theorem 2 and the Gupta–Kumar upper bound (UB) [19] are also shown in this figure.


TABLE I. SIMULATION PARAMETERS

Fig. 3. Comparison of successful transmission per consumed energy (J, in packets per millijoule) in different learning methods and noncooperative mode (noncoop). The lower bound (coop-LB) and the Gupta–Kumar upper bound (UB) of J are also shown. N varies from 5 to 20, and λ = 0.6.

Fig. 4. Improvement of J in the DRV compared with other methods for different traffic loads (λ) and N = 20 nodes. The vertical axis shows the percentage of DRV improvement over GRL, DVF, cooperative lower bound (coop-LB), and noncooperative models.

As can be seen, all of the RL methods significantly outperform the noncooperative scheme, providing at least 50% improvement. A similar performance gain can be observed compared with the lower bound, which can also be viewed as the performance of a deterministic cooperation strategy, as explained in Theorem 2. Thus, our learning methods are able to exploit the channel dynamics, and accordingly adjust their cooperation probabilities, to achieve a higher performance gain. Another important observation from Fig. 3 is that the DRV outperforms the DVF and the GRL. This is due to the fact that communicating both reward and value functions among neighbor nodes provides more information about the current situation of all the nodes in the network. Thus, higher performance is achieved by avoiding useless retransmissions in a deep-fading channel state. This agrees with the explanation in Section III-C.

B. Performance Improvement Under Different Loads

Fig. 4 shows the percentage of improvement in J of the DRV method compared with the GRL, the DVF, the cooperative lower bound (coop-LB), and the noncooperative mode. In this simulation, the number of nodes is fixed at N = 20, and λ varies from 0.1 to 1.0. As can be seen, the improvement over all the methods is an increasing function of the system load. This is expected because, at higher arrival rates, the number of potential simultaneous transmitters, and, in turn, of collisions, increases, and having more information about the system status becomes more vital for good performance. In fact, at lower arrival rates, the learning methods perform almost identically. However, as the arrival rate increases, the DRV can better utilize the available information and outperforms the other learning methods. Note that, in any case, the improvements over the noncooperative and deterministic cooperation (coop-LB) scenarios are very significant.

Fig. 5. Convergence behavior of the distributed MDP methods. The learning algorithms converge after ∼500 time slots. λ = 0.1; N = 20.

C. Convergence

To examine the convergence behavior of the distributed learning methods, the value of J as a function of time is presented in Fig. 5. As stated in Theorem 1, all of the learning methods converge to a local optimum after sufficiently many iterations. The average convergence time observed in Fig. 5 is much smaller than the worst-case bound, i.e., around 500 time slots.


This convergence time is relatively small, and RL can converge in less than 1 s for a typical time slot duration of 1 ms.

D. Effect of the DRV on the Buffer Size

We have also examined the effect of increasing the arrival rate on the buffer size, i.e., the number of packets in the buffers of the nodes. Fig. 6 shows the average number of packets in the direct buffer Bi as a function of the arrival rate λ for the DRV and the noncooperative scenarios. The average is taken over the buffer sizes of all the nodes. The arrival rate λ varies from 0.1 to 1 packet per time slot and is assumed equal for all the nodes.

Fig. 6. Average buffer size as a function of λ; comparison between the DRV and the noncooperative method.

As can be seen, the cooperation in the learning methods results in a smaller average number of packets in the buffer. This result is expected, since the rate of successful transmission is higher in the cooperative methods, and hence, fewer packets remain in the buffers over time. This result also indicates that the overall delay experienced by the packets in the cooperative scenario is lower than that in the noncooperative scheme. In addition, buffer overflow is unlikely to happen.

VI. CONCLUSION AND FUTURE WORK

We have proposed a distributed MDP model and a reinforcement learning framework for the cooperation problem in the MAC layer of wireless networks. We have shown that, despite their implementation simplicity, the distributed RL mechanisms can be efficiently used to solve the distributed cooperation problem and to obtain significant cooperation gains in wireless networks. These methods are able to learn near-optimal transmission schedules and transmission powers that lead to a higher throughput per unit of consumed energy and fewer collisions in the network.

APPENDIX
PROOF OF THEOREM 2

We consider a simple topology in which the nodes are divided into three equal-sized subsets, so that N/3 of the nodes act as sources (S), N/3 as relays (L), and N/3 as destinations (D). Thus, on average, in a uniformly randomly deployed network, each S-D link can be assigned an individual relay. Note that this underestimates the cooperation performance of the learning methods in two ways. First, a relay node may cooperate with more than one S-D pair in a real scenario. Second, we use the maximum transmission power, whereas power control is available in the actual RL method. We choose the packet arrival rate of these N/3 source nodes to be λ. Clearly, the original data rates can be reduced to this model by shutting down the transmitters that are not in the source list of this new topology. Also, let ε be the average FSMC failure probability of a transmission with power E. Under these assumptions, the throughput of each specific (S-L-D) tuple can be approximated by pλ, where it can be shown that p = 1 − (1 − (1 − ε)²)ε = 1 − 2ε² + ε³ is the probability that a packet is successfully delivered to D (by S or L, since their channels evolve independently). Consequently, the system throughput per unit of consumed energy can be written as J = pλ(N/3)/((2N/3)E) = pλ/(2E), which completes the proof. Note that the lower bound is independent of the number of nodes N.

REFERENCES

[1] G. Naddafzadeh-Shirazi, P.-Y. Kong, and C. K. Tham, “Cooperative retransmissions using Markov decision process with reinforcement learning,” in Proc. 20th IEEE PIMRC, 2009, pp. 652–656. [2] H. S. Wang and N. Moayeri, “Finite-state Markov channel—A useful model for radio communication channels,” IEEE Trans. Veh. Technol., vol. 44, no. 1, pp. 163–171, Feb. 1995. [3] H. S. Chang and M. C. Fu, “A distributed algorithm for solving a class of multi-agent Markov decision problems,” in Proc. IEEE CDC, 2003, pp. 5341–5346. [4] J. Shen, V. Lesser, and N. Carver, “Minimizing communication cost in a distributed Bayesian network using a decentralized MDP,” in Proc. 2nd AAMAS, 2003, pp. 678–685. [5] M. Lauer and M. Riedmiller, “An algorithm for distributed reinforcement learning in cooperative multi-agent systems,” in Proc. 17th Int. Conf. Mach. Learning, 2000, pp. 535–542. [6] J. Schneider, W.-K. Wong, A. Moore, and M. Riedmiller, “Distributed value functions,” in Proc. 16th Int. Conf. Mach. Learn., 1999, pp. 371–378. [7] J. R. Kok and N. Vlassis, “Collaborative multiagent reinforcement learning by payoff propagation,” J. Mach. Learn. Res., vol. 7, pp. 1789–1828, Dec. 2006. [8] C. Pandana and K. J. R. Liu, “Near-optimal reinforcement learning framework for energy-aware sensor communications,” IEEE J. Sel. Areas Commun., vol. 23, no. 4, pp. 788–797, Apr. 2005. [9] A. T. Hoang and M. Motani, “Buffer and channel adaptive modulation for transmission over fading channels,” in Proc. IEEE ICC, 2003, vol. 4, pp. 2748–2752. [10] A. T. Hoang and M. Motani, “Buffer and channel adaptive transmission over fading channels with imperfect channel state information,” in Proc. WCNC, 2004, vol. 3, pp. 1891–1896. [11] D. Djonin and V. Krishnamurthy, “Amplify-and-forward cooperative diversity wireless networks: Model, analysis, and monotonicity properties,” in Proc. CDC-ECC, Dec. 2005, pp. 3231–3236. [12] M. Dianati, X. Ling, S. Naik, and X. Shen, “A node cooperative ARQ scheme for wireless ad hoc networks,” IEEE Trans. Veh. Technol., vol. 55, no. 3, pp. 1032–1044, May 2006. [13] G. Naddafzadeh-Shirazi, P.-Y. Kong, and C.-K. Tham, “Markov decision process frameworks for cooperative retransmission in wireless networks,” in Proc. IEEE WCNC, 2009, pp. 1–6. [14] J. Dowling, E. Curran, R. Cunningham, and V. Cahill, “Using feedback in collaborative reinforcement learning to adaptively optimize MANET routing,” IEEE Trans. Syst., Man, Cybern. A, Syst., Humans, vol. 35, no. 3, pp. 360–372, May 2005. [15] A. Tanenbaum, Computer Networks, 4th ed. Englewood Cliffs, NJ: Prentice-Hall, 2003, ch. 4.2. [16] E. D. Ferreira and P. K. Khosla, “Multi-agent collaboration using distributed value functions,” in Proc. IEEE Intell. Vehicles Symp., Oct. 2000, pp. 404–409. [17] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998. [18] S. Koenig and R. G. Simmons, “Complexity analysis of real-time reinforcement learning,” in Proc. AAAI, 1997, pp. 99–105. [19] P. Gupta and P. R. Kumar, “The capacity of wireless networks,” IEEE Trans. Inf. Theory, vol. 46, no. 2, pp. 388–404, Mar. 2000.