Stochastic Learning Solution for Distributed Discrete Power Control Game in Wireless Data Networks

Yiping Xing, Member, IEEE, and R. Chandramouli, Senior Member, IEEE

Abstract—Distributed power control is an important issue in wireless networks. Recently, noncooperative game theory has been applied to investigate interesting solutions to this problem. The majority of these studies assume that the transmitter power level can take values in a continuous domain. However, recent trends such as the GSM standard and Qualcomm's proposal to the IS-95 standard use a finite number of discretized power levels. This motivates the need to investigate solutions for distributed discrete power control, which is the primary objective of this paper. We first note that simply discretizing the previously proposed continuous power adaptation techniques will not suffice, because a simple discretization does not guarantee convergence and uniqueness. We propose two probabilistic power adaptation algorithms and analyze their theoretical properties along with their numerical behavior. The distributed discrete power control problem is formulated as an N-person, nonzero-sum game. In this game, each user evaluates a power strategy by computing a utility value. This evaluation is performed using a stochastic iterative procedure. We approximate the discrete power control iterations by an equivalent ordinary differential equation to prove that the proposed stochastic learning power control algorithm converges to a stable Nash equilibrium. Conditions under which more than one stable Nash equilibrium or even only a mixed equilibrium may exist are also studied. Experimental results are presented for several cases and compared with the continuous power level adaptation solutions.

Index Terms—Game theory, power control, stochastic learning, wireless networking.

I. INTRODUCTION

Due to the increased demand for wireless services, efficient use of available resources is important. Power control, which mitigates unnecessary interference and saves the battery life of mobile users, is a useful technique. Early works [1]–[4] addressed the power control problem as an eigenvalue problem for nonnegative matrices. As a consequence of the Perron–Frobenius theorem, the existence and uniqueness of a feasible power vector associated with the eigenvalue of the channel gain matrix is shown. These algorithms were centralized in the sense that all of the power vector components were found by inverting a matrix composed of the channel gains of all users.

Manuscript received September 21, 2004; revised April 12, 2006, and December 18, 2006; first published March 31, 2008; last published August 15, 2008 (projected); approved by IEEE/ACM TRANSACTIONS ON NETWORKING Editor N. Shroff. This work was supported in part by the National Science Foundation under Grant CAREER 0133761. Part of this work was presented at the IEEE International Conference on Communications (ICC), Paris, France, 2004. Y. Xing was with the Department of Electrical and Computer Engineering, Stevens Institute of Technology, Hoboken, NJ 07030 USA. He is now with Bear Stearns, New York, NY (e-mail: [email protected]). R. Chandramouli is with the Department of Electrical and Computer Engineering, Stevens Institute of Technology, Hoboken, NJ 07030 USA (e-mail: [email protected]). Digital Object Identifier 10.1109/TNET.2007.911424

Centralized power control requires extensive control signaling in the network and may not be applied efficiently in practice; however, it can be used to give bounds on the performance of distributed power control algorithms. Distributed versions, which depend only on local information such as the measured signal-to-interference ratio (SIR) or the channel gain of a specific user, have been developed in [5] and [6]. Further, rate control is combined with power control in [7] and [8], which jointly optimize the transmission rate and power. Due to the imperfect estimation of the interference at each power control update, a stochastic power control algorithm is developed in [9]; the power vector evolves stochastically, and convergence is defined in terms of the mean-squared error (MSE) of the power vector with respect to the optimal power vector, i.e., the solution of a feasible deterministic power control problem. Later, the convergence properties were improved by using averaging in [10]. The problem of joint power control, multiuser detection, and diversity combining is addressed in [11] and [12], where the effect of using antenna diversity is studied.

Naturally, on the one hand, mobile users prefer to transmit at a lower power for a fixed SIR and, on the other hand, for a given transmitter power, they prefer to obtain a better SIR. This observation motivates a reformulation of the power control problem using concepts from microeconomics and game theory. The framework was originally proposed in [13] for voice traffic, where a utility function is defined for each mobile user. The utility function reflects the user's preference regarding the SIR and the transmitter power. Different utility functions and formulations of noncooperative power control games are proposed in [14]–[16]. This concept is extended to power control in wireless data networks in [17] and [18], where throughput per battery life is chosen as the utility function, which seems to be a practical metric.

In most of the previous studies, the transmitter power level can assume any continuous value in a domain. However, in a digital cellular system or future PCS systems, the power level is quantized into discrete values. For instance, in GSM, the uplink and downlink transmission power may vary from 5 to 33 dBm, at values which are equally spaced by 2 dB. In Qualcomm's CDMA proposal for IS-95 [19], the power levels are equally spaced by 0.5 dB, within a dynamic range of 85 dB in the uplink and 12 dB in the downlink. Therefore, it is not clear how to apply those power control algorithms to a practical power-quantized system. Some previous work in this direction can be found in [20] and [21], where discrete power control algorithms were developed based on the conventional continuous framework. However, it is shown in [21] that, by simply "discretizing" the continuous power control algorithm, the convergence and uniqueness of the continuous power control are lost. Game/utility-theory-based


power control is a useful tool to model and analyze mobile users' satisfaction with both QoS and power consumption. However, as discussed before, discrete transmission power control may need a separate analysis of convergence and uniqueness issues.

In this paper, we focus on a noncooperative game theory framework for wireless data networks similar to the one proposed in [18] but differing in several key aspects, as described later. In [18], a noncooperative continuous-valued power control game (NPG) formulation has been shown to admit a unique Nash equilibrium. The proof is based on the fact that the utility function is continuous in the power space and quasi-concave in each user's power variable; the first-order necessary optimality condition is then used to show the uniqueness. However, for discrete transmit power levels, there will be cases where the game has multiple pure equilibria or even mixed equilibria. In addition to discussing some properties of the proposed NPG, we also propose two discrete stochastic learning power control (DSLPC) algorithms to solve it in a distributed manner. DSLPC is based on principles from stochastic learning automata [22]. Some of the key advantages of DSLPC over previous deterministic power control strategies are the following.
• Generally speaking, since DSLPC operates in the probability space to search for the best power levels, it does not normally suffer from getting stuck in a local optimum. Iterative deterministic techniques operating on the surface of an objective function may suffer from this phenomenon.
• DSLPC is capable of discovering optimal mixed power control strategies in the NPG.
• DSLPC can handle both deterministic and stochastic power control situations.
• The proposed DSLPC algorithm is found to be fairly independent of the initialization values used.

A stochastic learning technique has been successfully used in wireless packet networks for online prediction and tracking and is shown to be computationally simple and efficient in [23] and [24]. In our decentralized iterative learning process, at each iteration, the only information needed to update the power strategies of the individual terminal users is the feedback (payoff) from the base station, which avoids consuming channel bandwidth for extra communication between the mobile users and the base stations. The convergence and stability of DSLPC are theoretically studied in detail for a two-user, two-power-level case. Detailed experimental results are presented for a more general case to illustrate the convergence and optimality properties of DSLPC.

The remainder of this paper is organized as follows. In Section II, mathematical preliminaries of the stochastic learning model are introduced. In Section III, the formulation of the noncooperative power control game and the utility function we use to evaluate the satisfaction of mobile users are presented; the discrete stochastic learning power control game is also discussed in detail there, and the Nash equilibrium of this iterative game, in both pure and mixed strategies, is defined. Two stochastic learning solutions to solve the distributed power control game are presented in Sections IV and V. Numerical results are presented in Section VI. Finally, Section VII concludes this paper.

II. MATHEMATICAL PRELIMINARIES OF STOCHASTIC AUTOMATA GAMES

Here, we briefly discuss the concept behind automata games [22]. Abstractly, a learning automaton can be considered to be an object that can choose from a finite number of actions. For every action that it chooses, the random environment in which it operates evaluates that action, and a corresponding stochastic feedback is sent to the automaton, based on which the next action is chosen. As this process progresses, the automaton asymptotically learns to choose the optimal action for the unknown environment. The stochastic iterative algorithm used by the automaton to select its successive actions based on the environment's response defines the stochastic learning algorithm. An important property of the learning automaton is its ability to improve its performance with time while operating in an unknown environment. For consistency, our notation follows or parallels that of standard books on game theory (see, e.g., [25]) and of [26].

In multiple automata games, instead of one automaton (player) playing against the environment, $N$ automata take part in a game. Consider a typical automaton described by a 4-tuple. Each player $i$ has a finite set of actions or pure strategies $S_i$; let the cardinality of $S_i$ be $m_i$. The result of each play is a random payoff to each player. Let $r_i$ denote the random payoff to player $i$; it is assumed here that $r_i \in [0, 1]$. Define functions $d_i$, $i = 1, \ldots, N$, by

$$d_i(a_1, \ldots, a_N) = E\big[\, r_i \mid \text{player } j \text{ chooses action } a_j,\ j = 1, \ldots, N \,\big], \qquad a_j \in S_j. \qquad (1)$$

The function $d_i$ is called the expected payoff function or utility function of player $i$. The objective of each player is to maximize its expected payoff. Players choose their strategies based on a time-varying probability distribution. Let $q_i(k)$ denote the action choice probability distribution of the $i$th automaton at time instant $k$. Then, $q_{ij}(k)$ denotes the probability with which the $i$th automaton player chooses the $j$th pure strategy at instant $k$. Thus, $q_i(k) = (q_{i1}(k), \ldots, q_{im_i}(k))$ is the strategy probability vector employed by the $i$th player at instant $k$. $T$ denotes the stochastic learning algorithm according to which the elements of the set $\{q_i(k)\}$ are updated at each time $k$, i.e., $q_i(k+1) = T(q_i(k), a_i(k), r_i(k))$, where $a_i(k)$ and $r_i(k)$ are the actual action selected by and the payoff received by player $i$, respectively, at instant $k$, $i = 1, \ldots, N$.

The working of a learning automaton can be described as follows. Initially, at $k = 0$, one of the actions is chosen by the player at random with an arbitrarily chosen initial probability. This action is then applied to the system and the response from the environment is observed. Based on the response, the probabilities of the actions for the next period of time are updated according to the updating rule $T$. This process is repeated by all of the players until a stopping criterion is reached or the probability vectors converge. Fig. 1 pictorially depicts the basic automata game setup.
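The interaction loop just described can be summarized in a few lines of Python. This is a minimal sketch under the notation above: `environment` and the updating rule `T` are placeholders supplied by the specific game, not operators defined in the paper.

```python
import random

def run_automaton(num_actions, environment, T, num_steps=1000):
    """Generic learning-automaton loop: choose, observe feedback, update probabilities.

    environment(action) is assumed to return a stochastic payoff in [0, 1];
    T(q, action, payoff) is the probability-updating rule (e.g., the
    reward-inaction rule of Section IV).
    """
    q = [1.0 / num_actions] * num_actions           # arbitrary initial distribution
    for _ in range(num_steps):
        # choose an action according to the current probability vector
        a = random.choices(range(num_actions), weights=q)[0]
        r = environment(a)                          # stochastic feedback from the environment
        q = T(q, a, r)                              # update the action probabilities
    return q
```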


Fig. 1. Basic multiple automata game setup.

III. PROBLEM FORMULATION

We propose a distributed discrete stochastic power control system in which each mobile user, behaving as a learning automaton, adjusts its power over time based on some feedback from the base station to arrive at the optimal strategy. In this context, the feedback that a wireless terminal receives is referred to as the instantaneous utility or payoff, which represents the satisfaction of the mobile user with the chosen power level.

A. Utility Function and Noncooperative Power Control Game

One of the most important concerns here is the measure of satisfaction, which for a wireless data network is naturally related to the amount of information that a user can transfer in the lifetime of its battery. Therefore, we choose the throughput per battery life as the utility function, as this brings a practical and meaningful metric to serve as the definition of utility [17]. We note that the proposed algorithm and analysis should hold for other utility functions also.

Let the power vector $p = (p_1, \ldots, p_N)$ denote the selected power levels of all the users, where $p_j$ is the power level selected by user $j$, $A_j$ is the strategy set for user $j$, and $A = A_1 \times \cdots \times A_N$ is the set of all power vectors. The resulting utility value for the $j$th user is $u_j(p)$. To emphasize that the $j$th user has control over its own power only, we use the alternative notation $u_j(p_j, p_{-j})$, where $p_{-j}$ denotes the vector consisting of the elements of $p$ other than the $j$th element. The utility of user $j$ obtained by expending power $p_j$ can be expressed as [18]

$$u_j(p_j, p_{-j}) = \frac{L R}{M p_j}\, f(\gamma_j) \qquad (2)$$

where $L$ is the number of information bits in a frame (packet) of $M$ bits transmitted at a rate of $R$ b/s using $p_j$ W of power. The efficiency function is defined as $f(\gamma_j) = (1 - 2\,\mathrm{BER}_j)^M$, where $\mathrm{BER}_j$ is the bit error rate. For example, if a noncoherent FSK modulation scheme is used, then $\mathrm{BER}_j = \frac{1}{2} e^{-\gamma_j/2}$, and $\gamma_j$ (the SIR of user $j$) is defined as

$$\gamma_j = \frac{W}{R}\,\frac{h_j p_j}{\sum_{k \neq j} h_k p_k + \sigma^2} \qquad (3)$$

where $W$ is the available spread-spectrum bandwidth [Hz], $\sigma^2$ is the AWGN power at the receiver [W], and $\{h_j\}$ is the set of path gains from the mobiles to the base station. An example of one user's utility function, assuming that all other users' transmitting powers are fixed, is shown in Fig. 2. We can notice that the utility function is concave in the user's power and has a unique local maximum. However, this may change when only discrete power values are available, as discussed later; there may even be more than one local maximum.
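To make the quantities in (2) and (3) concrete, the sketch below evaluates one user's utility given everyone's power levels. It follows the reconstructed forms above (noncoherent-FSK BER, the $(1-2\,\mathrm{BER})^M$ efficiency); the numerical parameter values (bandwidth, rate, noise, frame sizes, path gains) are placeholders, not the values of Table I.

```python
import math

def sir(j, powers, gains, W, R, noise):
    """SIR of user j from (3): processing gain times received power over interference plus noise."""
    interference = sum(gains[k] * powers[k] for k in range(len(powers)) if k != j)
    return (W / R) * gains[j] * powers[j] / (interference + noise)

def utility(j, powers, gains, W=1e6, R=1e4, noise=5e-15, L=64, M=80):
    """Throughput-per-battery-life utility of user j from (2), with noncoherent FSK."""
    g = sir(j, powers, gains, W, R, noise)
    ber = 0.5 * math.exp(-g / 2.0)            # noncoherent FSK bit error rate
    f = (1.0 - 2.0 * ber) ** M                # frame-success (efficiency) function
    return L * R * f / (M * powers[j])        # bits delivered per joule spent

# Example: two users with fixed path gains, user 0 at 0.01 W and user 1 at 0.1 W.
print(utility(0, [0.01, 0.1], [1e-4, 1e-4]))
```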

Fig. 2. Example of a user’s utility function.

In the noncooperative power control game, each user maximizes its own utility in a distributed fashion. Formally, the NPG is expressed as

$$\max_{p_j \in A_j} u_j(p_j, p_{-j}), \qquad \text{for all } j. \qquad (4)$$

The solution of this NPG is given in the sense of the Nash equilibrium [25] as follows.

Definition 1: A power vector $p = (p_1, \ldots, p_N)$ is a Nash equilibrium of the NPG if, for every $j$, $u_j(p_j, p_{-j}) \ge u_j(p_j', p_{-j})$ for all $p_j' \in A_j$.

At a Nash equilibrium, given the power levels of the other players, no user can improve its utility level by making individual changes in its power. The existence and uniqueness of the NPG equilibrium have been shown in [18] for a continuous power space.

B. Discrete Stochastic Learning Power Control Game

In the stochastic learning game, the mobile users act as players or learning agents who participate in the power control game. The objective of each player is to maximize its payoff, which reflects the satisfaction of the player and is measured in utility [e.g., (2)]. The game is played repeatedly to learn the optimal strategies. Each individual automaton (or mobile user) may not be aware of the following:
• the number of mobile users participating in the game;
• the strategies available to the other users;
• the responses for each possible play.
The only information a player knows is its payoff after each play, based on which the player learns the optimum strategy. A strategy for player $i$ is defined to be a probability vector $q_i = (q_{i1}, \ldots, q_{im_i})$, where player $i$ chooses its $j$th action (power level) with probability $q_{ij}$. Because each mobile user can only choose a power level from a finite discrete set, $A_i$ should


be a finite discrete set with dimension $m_i$. Then, we can define the expected payoff for player $i$ as

$$d_i(q_1, \ldots, q_N) = \sum_{a_1 \in S_1} \cdots \sum_{a_N \in S_N} u_i(a_1, \ldots, a_N) \prod_{j=1}^{N} q_{j a_j}. \qquad (5)$$

Definition 2: The $N$-tuple of strategies $(q_1^*, \ldots, q_N^*)$ is said to be a Nash equilibrium if, for each $i$, we have

$$d_i(q_1^*, \ldots, q_{i-1}^*, q_i^*, q_{i+1}^*, \ldots, q_N^*) \ge d_i(q_1^*, \ldots, q_{i-1}^*, q_i, q_{i+1}^*, \ldots, q_N^*), \qquad \text{for each } q_i. \qquad (6)$$

A Nash equilibrium is said to be in pure strategies if it is a Nash equilibrium with each $q_i^*$ being a unit probability vector, while it is a nondegenerate mixed Nash equilibrium if some player puts positive weight on more than one pure strategy. In general, each $q_i^*$ above may be a mixed strategy, and we refer to an $N$-tuple satisfying (6) as a Nash equilibrium in mixed strategies. With this definition, when there is no pure equilibrium, as may be the case in the discrete power control game, the mobile users search for a mixed Nash equilibrium. It is well known that every finite strategic-form game has a mixed strategy equilibrium [25]; that is, there always exists a Nash equilibrium in our formulation of the discrete power control game. Further, following the notation of [26], we define auxiliary functions, used later in the analysis, by (7)–(9).

IV. DISCRETE LEARNING POWER CONTROL-I (DSLPC-I)

A. Discrete Learning Power Control Algorithm

The first DSLPC algorithm, used by each mobile user $i$, is given below.

DSLPC-I
1) Set the initial probability vector $q_i(0)$.
2) At every time step $k$, each user chooses a power level according to its action probability vector. Thus, the $i$th player chooses action $a_i(k)$ at instant $k$ based on the probability distribution $q_i(k)$.
3) Each player obtains a payoff based on the set of all actions; the payoff to player $i$ is $r_i(k)$, which is normalized.
4) Each player updates its action probabilities according to the rule

$$q_{ij}(k+1) = q_{ij}(k) + b\, r_i(k)\big(\mathbb{1}\{a_i(k) = j\} - q_{ij}(k)\big), \qquad j = 1, \ldots, m_i. \qquad (10)$$

5) If the stopping criterion is met, stop; else, go to step 2).

Here, $b$ is the step size of the probability updating rule, and $r_i(k)$ is normalized to lie in the interval $[0, 1]$. We normalize the payoff as

$$r_i(k) = \frac{u_i(k) - u_i^{\min}}{u_i^{\max} - u_i^{\min}} \qquad (11)$$

where $u_i(k)$ is the utility of user $i$. In our system, $u_i^{\min}$ can just be set to 0 since $u_i(k) \ge 0$ for all $k$, but it is not realistic to know $u_i^{\max}$ in advance. So, what we do for the normalization is to update the maximum value dynamically: first we initialize $u_i^{\max} = u_i(0)$; then, at instant $k$, if $u_i(k) > u_i^{\max}$, we let $u_i^{\max} = u_i(k)$; otherwise we keep $u_i^{\max}$ unchanged. We note that this normalization does not affect our theoretical results (this is not the actual maximum value of $u_i$ but an estimated value). However, this is perhaps the best we can do without advance information. Alternatively, we could choose $u_i^{\max}$ to be a large constant; however, as one can imagine, choosing $u_i^{\max}$ too large compared with the typical utility values will significantly decrease the convergence speed of the DSLPC-I algorithm, since all $r_i(k)$ will be small. Although we introduce this dynamic normalization in our algorithm, based on the above comments, we assume in our theoretical analysis that the normalization has already been carried out.

The update equation in (10) is of the linear reward-inaction type [22]: when a chosen power action results in a reward, the probability of choosing that action in the next time step is increased; if not, no updating takes place. Let $Q(k) = (q_1(k), \ldots, q_N(k))$ denote the state of the power strategies at instant $k$. Under this learning algorithm, we can see that $\{Q(k)\}$ is a Markov process.
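For concreteness, a per-user sketch of DSLPC-I is given below. It assumes the reward-inaction form and dynamic payoff normalization reconstructed in (10)–(11); the class structure, names, and parameter values are illustrative, not part of the paper.

```python
import random

class DSLPC1User:
    """One mobile user running the DSLPC-I probability update (linear reward-inaction)."""

    def __init__(self, power_levels, b=0.01):
        self.powers = power_levels                                   # finite discrete strategy set A_i
        self.q = [1.0 / len(power_levels)] * len(power_levels)       # step 1: uniform initial probabilities
        self.b = b                                                   # step size of the updating rule
        self.u_max = None                                            # running estimate of the maximum utility

    def choose(self):
        # step 2: sample a power-level index according to q_i(k)
        self.last = random.choices(range(len(self.powers)), weights=self.q)[0]
        return self.powers[self.last]

    def update(self, utility):
        # step 3: normalize the payoff with the dynamically tracked maximum (u_min taken as 0)
        self.u_max = utility if self.u_max is None else max(self.u_max, utility)
        r = utility / self.u_max if self.u_max > 0 else 0.0
        # step 4: linear reward-inaction update of the action probabilities
        for j in range(len(self.q)):
            indicator = 1.0 if j == self.last else 0.0
            self.q[j] += self.b * r * (indicator - self.q[j])
```

Note that the update keeps the probability vector on the simplex, since the increments sum to zero and each component stays in $[0, 1]$ for $b\,r \le 1$.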


B. Theoretical Analysis of DSLPC-I

DSLPC-I results in a vector stochastic process of action-choice probabilities, so we need to characterize the long-term behavior of this process. Our analysis resorts to an ordinary differential equation (ODE) whose solution approximates the asymptotic behavior of $Q(k)$ if the step size parameter $b$ used in (10) is sufficiently small. We can represent the learning algorithm given in (10) as

$$Q(k+1) = Q(k) + b\, G\big(Q(k), a(k), r(k)\big) \qquad (12)$$

where $a(k)$ and $r(k)$ collect the actions chosen and the payoffs received by all users at instant $k$. Then, we define a function $f(\cdot)$ as the following conditional expectation:

$$f(Q) = E\left[\frac{Q(k+1) - Q(k)}{b} \,\Big|\, Q(k) = Q\right]. \qquad (13)$$

With initial vector $Q(0) = Q_0$, the sequence $\{Q(k)\}$ will converge weakly, as $b \to 0$, to the solution of the ODE [26]

$$\frac{dQ}{d\tau} = f(Q), \qquad Q(0) = Q_0. \qquad (14)$$

First, we analyze a relatively simple two-player, two-action stochastic power control game. The game is defined by a pair of $2 \times 2$ game matrices $A = [a_{ij}]$ and $B = [b_{ij}]$ (15), specifying the payoffs for the row player (player 1) and the column player (player 2), respectively. If the row player chooses action $i$ and the column player chooses action $j$, the payoff to the row player is $a_{ij}$ and the payoff to the column player is $b_{ij}$. Depending on the values of the $a_{ij}$ and the $b_{ij}$, the game can be classified into three categories as follows.
• Category 1: if $(a_{11}-a_{21})(a_{12}-a_{22}) > 0$ or $(b_{11}-b_{12})(b_{21}-b_{22}) > 0$, at least one of the two players has a dominant strategy, and therefore there is just one strict equilibrium.
• Category 2: if $(a_{11}-a_{21})(a_{12}-a_{22}) < 0$, $(b_{11}-b_{12})(b_{21}-b_{22}) < 0$, and $(a_{11}-a_{21})(b_{11}-b_{12}) > 0$, there are two pure equilibria and one mixed equilibrium.
• Category 3: if $(a_{11}-a_{21})(a_{12}-a_{22}) < 0$, $(b_{11}-b_{12})(b_{21}-b_{22}) < 0$, and $(a_{11}-a_{21})(b_{11}-b_{12}) < 0$, there is just one mixed equilibrium.
Note that this stochastic game for power control falls into the class of general-sum games. Let $p_1$ denote the probability of the row player picking action 1 and let $p_2$ denote the probability of the column player picking action 1. We can then express the differential equation (14), whose solution characterizes the long-term behavior of the stochastic learning algorithm, as

$$\frac{dp_1}{d\tau} = p_1 (1 - p_1)\big[\, p_2 (a_{11} - a_{21}) + (1 - p_2)(a_{12} - a_{22}) \,\big] \qquad (16)$$

$$\frac{dp_2}{d\tau} = p_2 (1 - p_2)\big[\, p_1 (b_{11} - b_{12}) + (1 - p_1)(b_{21} - b_{22}) \,\big]. \qquad (17)$$

This differential equation system is an autonomous system. By setting the left-hand sides of (16) and (17) to zero, we can solve for the critical points of the system; the corresponding constant solutions $p_1(\tau) \equiv p_1^*$, $p_2(\tau) \equiv p_2^*$ are called equilibrium solutions. We can easily see that all pure strategies are equilibrium solutions, since, for a pure strategy, either $p_1 = 0$ or $p_1 = 1$, which makes $dp_1/d\tau = 0$; similarly, $p_2 = 0$ or $p_2 = 1$ is also an equilibrium solution from (17).

Definition 3: [30] A critical point $x^*$ of an autonomous system is stable if, given any $\epsilon > 0$, there exists a $\delta > 0$ such that every solution $x(t)$ of the system that satisfies $\|x(0) - x^*\| < \delta$ at $t = 0$ also satisfies $\|x(t) - x^*\| < \epsilon$ for all $t \ge 0$. If $x^*$ is stable and there exists an $\eta > 0$ such that every solution that satisfies $\|x(0) - x^*\| < \eta$ at $t = 0$ also satisfies $x(t) \to x^*$ as $t \to \infty$, then the critical point is asymptotically stable.

In other words, a critical point is stable if any trajectory that begins near (within $\delta$ of) the point remains near (within $\epsilon$ of) this point. If all trajectories that start near a stable critical point actually approach it as $t \to \infty$, then the critical point is asymptotically stable. Thus, if the ODE has an asymptotically stable stationary point (critical point) $x^*$, then, for all initial conditions sufficiently close to it, the action probabilities generated by the proposed algorithm converge to $x^*$.

Remark 1: We know from [26] that: 1) all stationary points (critical points) that are not Nash equilibria are unstable, and 2) all pure strategies that are strict Nash equilibria are asymptotically stable, which can be proved using Lyapunov's stability theorem [22].

Theorem 1: DSLPC-I will not discover a power strategy that is not a Nash equilibrium.

Proof: We prove this result by contradiction. Suppose that the Markov process generated by (12) converges to a non-Nash equilibrium; 1) this implies that the point of convergence is stable. The equilibrium solutions of the ODE [(16) and (17)] that characterize the long-term behavior of DSLPC-I are by definition stationary points, which implies that 2) DSLPC-I will only converge to stationary points. 1) and 2) together imply that a stationary point that is not a Nash equilibrium is stable, contradicting part 1) of Remark 1.
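The trajectories of (16)–(17), used in the phase-plane analysis below, can be traced with a simple forward-Euler integration. The sketch also encodes the three-category classification stated above; the example matrices are illustrative placeholders, not values taken from the paper.

```python
def classify(A, B):
    """Return the category (1, 2, or 3) of a 2x2 general-sum game with payoff matrices A, B."""
    da = (A[0][0] - A[1][0]) * (A[0][1] - A[1][1])   # row player's dominance indicator
    db = (B[0][0] - B[0][1]) * (B[1][0] - B[1][1])   # column player's dominance indicator
    if da > 0 or db > 0:
        return 1                                      # at least one dominant strategy
    return 2 if (A[0][0] - A[1][0]) * (B[0][0] - B[0][1]) > 0 else 3

def ode_trajectory(A, B, p1=0.5, p2=0.5, dt=0.01, steps=5000):
    """Forward-Euler integration of the ODE (16)-(17) from an initial point (p1, p2)."""
    path = [(p1, p2)]
    for _ in range(steps):
        dp1 = p1 * (1 - p1) * (p2 * (A[0][0] - A[1][0]) + (1 - p2) * (A[0][1] - A[1][1]))
        dp2 = p2 * (1 - p2) * (p1 * (B[0][0] - B[0][1]) + (1 - p1) * (B[1][0] - B[1][1]))
        p1, p2 = p1 + dt * dp1, p2 + dt * dp2
        path.append((p1, p2))
    return path

# Illustrative category-1 game: the row player's first action is dominant,
# so the trajectory flows to the single strict equilibrium (row action 1, column action 2).
A = [[0.9, 0.5], [0.4, 0.2]]
B = [[0.3, 0.8], [0.1, 0.6]]
print(classify(A, B), ode_trajectory(A, B)[-1])
```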


A sketch of the critical point set of a system, along with representative integral curves and their trajectories with arrows indicating the flow, is called a phase plane diagram. We use these diagrams to obtain qualitative information about the solutions of the system of ODEs, which is referred to as phase plane analysis [29]. Since (16) and (17) are nonlinear differential equations, it is difficult to solve for the trajectories explicitly, but we can obtain the trajectories using numerical techniques.

Theorem 2: When the game has only one pure equilibrium, the learning algorithm, for any initial condition in the interior of the strategy space, always converges to a Nash equilibrium.

Proof: To get some insight into the problem, we first prove the theorem for the special two-user, two-strategy case. When the game has only one pure equilibrium, at least one player has a dominant strategy. Without loss of generality, we assume that user one has a dominant strategy, which means that

$$(a_{11} - a_{21})(a_{12} - a_{22}) > 0. \qquad (18)$$

Further, from (16) we have

$$\frac{dp_1}{d\tau} = p_1 (1 - p_1)\big[\, p_2 (a_{11} - a_{21}) + (1 - p_2)(a_{12} - a_{22}) \,\big]. \qquad (19)$$

From (18) and (19), either $dp_1/d\tau \ge 0$ for all $p_2$ or $dp_1/d\tau \le 0$ for all $p_2$; thus $p_1$ is a monotonic function of $\tau$, which means $p_1$ is either nondecreasing or nonincreasing. Also, due to the nature of the learning algorithm given by (10), all solutions of the ODE, for initial conditions in the unit square, will be confined to the unit square. Hence, by [32], asymptotically all the trajectories will be in the set where $dp_1/d\tau = 0$. Suppose that $p_1$ converges at $\tau_0$. From (17), we have

$$\frac{dp_2}{d\tau} = p_2 (1 - p_2)\big[\, p_1 (b_{11} - b_{12}) + (1 - p_1)(b_{21} - b_{22}) \,\big]. \qquad (20)$$

Then we have two cases as follows.
1) User 2 has not converged before $\tau_0$. Then, for $\tau > \tau_0$, $p_1$ is a constant, so the bracketed term in (20) is a constant and $dp_2/d\tau$ does not change sign; hence $p_2$ is either nondecreasing or nonincreasing. Further, due to the nature of the learning algorithm given by (10), all solutions of the ODE, for initial conditions in the unit square, will be confined to the unit square. Hence, by [32], asymptotically all the trajectories will be in the set where $dp_2/d\tau = 0$, so $p_2$ also converges.
2) User 2 has converged before $\tau_0$; then the set to which it converges is again the set where $dp_2/d\tau = 0$.
Thus, from the above analysis, all solutions have to converge to some stationary points. Since, by Remark 1, all stationary points

that are not Nash equilibria are unstable, for this special case the theorem follows.

Next, we prove the general case where there are $N$ users, each with $m_i$ strategies. It can be claimed that, if there is only one pure Nash equilibrium, then at least $N - 1$ users have a dominant strategy. This statement comes from the following fact: suppose two users do not have dominant strategies. Then these two users can use a nondegenerate mixed strategy in a Nash equilibrium, which is not dominated; hence, there exists at least one nondegenerate mixed Nash equilibrium. This means that, when fewer than $N - 1$ users have a dominant strategy, there may not be only one pure Nash equilibrium, which proves our claim. Then, similar to the special case proven above, we can show that these $N - 1$ users' strategies converge first and then the $N$th user converges. In this case, write the drift of each user's action probabilities as in (21). For those users having a dominant strategy, the drift of the dominant strategy's probability is of one sign and that of the other strategies is of the opposite sign; hence, each of these probabilities is monotonic in $\tau$. Due to the nature of the learning algorithm, all solutions of the ODE, for initial conditions in the product of probability simplices, will be confined to it. Hence, by [32], asymptotically all the trajectories will be in the set where these drifts vanish. Similar to the special case, the $N$th user then converges to the corresponding set. Thus, from the above analysis, all solutions converge to some stationary points. Since, by Remark 1, all stationary points that are not Nash equilibria are unstable, the theorem follows.

It is well known that there exists at least one Nash equilibrium in mixed strategies in a finite stochastic game. Nash equilibria in mixed strategies may form attractors for DSLPC-I. However, since the Markov process generated by the updating in DSLPC-I has absorbing states, it is also possible that these mixed strategies may not be discovered. Based on our simulation results, in the general case, when there may exist multiple equilibria or even mixed equilibria in the power control game, DSLPC-I will converge to one of the absorbing states. Therefore, in Section V we investigate another discrete learning power control algorithm which is capable of converging to mixed Nash equilibria as well.

V. DISCRETE LEARNING POWER CONTROL-II (DSLPC-II)

We propose DSLPC-II, as given by (22), to discover mixed Nash power strategies. This is useful especially when a Nash power solution in pure strategies does not exist. DSLPC-II is identical to DSLPC-I except that the players update their action probabilities according to (22) instead of the rule in step 4) of DSLPC-I. See (22) at the bottom of the page, where $b$ is the reward parameter, $c$ is the penalty parameter, and $b$ essentially controls the update step size.

(22)
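As an illustration of the kind of update (22) describes, the sketch below uses a generic linear reward-penalty rule [22] with reward parameter `b` and penalty parameter `c`. This is an assumed form consistent with the surrounding description (nonabsorbing states, separate reward and penalty parameters), not necessarily the exact expression in (22).

```python
def dslpc2_update(q, chosen, r, b=0.01, c=0.001):
    """One reward-penalty style probability update for a player with len(q) actions.

    q is the current action probability vector, chosen the index of the action
    just played, and r the normalized payoff in [0, 1].  Generic L_RP-type rule,
    assumed here for illustration only.
    """
    m = len(q)
    new_q = list(q)
    for j in range(m):
        if j == chosen:
            new_q[j] += b * r * (1 - q[j]) - c * (1 - r) * q[j]
        else:
            new_q[j] += -b * r * q[j] + c * (1 - r) * (1.0 / (m - 1) - q[j])
    return new_q
```

With c > 0, every state keeps a nonzero chance of moving, which is consistent with the nonabsorbing property used in the analysis below.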


To gain insight into the theoretical performance, we again consider the simple case of two players, each with two pure power strategies. Let the arbitrary but fixed initial mixed strategies be $(p_1(0), 1 - p_1(0))$ and $(p_2(0), 1 - p_2(0))$, where $0 < p_1(0) < 1$ and $0 < p_2(0) < 1$. It can be seen that $\{(p_1(k), p_2(k))\}$ is a stationary Markov process. It can be verified by direct computation that it is distance diminishing and that all of the states are nonabsorbing.

Proposition 1: [22] A learning algorithm which is distance diminishing and has no absorbing states is ergodic.

An immediate consequence of this proposition is that, for large $k$, the probability distribution of the process $(p_1(k), p_2(k))$ is independent of the initial values $p_1(0)$ and $p_2(0)$. In the following, our aim is to characterize the asymptotic distribution of this process. Therefore, define the quantity in (23), where the game matrix is as given in (15), and let $v$ denote the (von Neumann) value of the game corresponding to that matrix. We now prove that DSLPC-II will converge to a Nash equilibrium which makes the expected payoff converge to the value of the game $v$. First, we define

the quantity in (24), with the terms as in [30]. As there are only two users and two strategies, let $p = p_1(k)$ and $q = p_2(k)$; then, from the game matrix in (15), we get (25)–(27), and, similarly, we obtain (28) and (29).

We consider two separate cases for the analysis: case A, in which only one mixed equilibrium exists, and case B, in which only one pure equilibrium exists. When there are multiple pure equilibria, DSLPC-II's behavior is similar to that of DSLPC-I, as illustrated in the numerical results section. We first prove two useful lemmas for these two cases, following the technique in [30].

A. Only One Mixed Equilibrium Case

Lemma 1: For every $\epsilon > 0$, there exists a threshold such that, whenever the algorithm parameter is below it, there exists a unique point $(p^*, q^*)$ at which both drift terms vanish, and this point is within $\epsilon$ of the optimal mixed strategy.

Proof: As discussed before, there is only one mixed equilibrium when neither player has a dominant strategy and the best responses never coincide (category 3). We further divide this case into the two subcases (30) and (31). The optimal mixed strategies can easily be shown to be those given in (32) and (33). We first deal with the subcase in (30). By defining the corresponding auxiliary variables, we see from (32) that the equilibrium value is a decreasing function of its argument. Also, we observe that, for each fixed $q$, the drift in $p$ is quadratic in $p$ and takes opposite signs at $p = 0$ and $p = 1$. Therefore, using [30, Lemma A], it follows that, for each fixed $q$, there exists a unique $p$ such that the drift vanishes. Using a similar argument, for each fixed $p$ we see that there exists a unique $q$ such that the corresponding drift vanishes. The two resulting curves, when plotted on the unit square, have a unique intersection, proving the first part of the lemma.


Now, by choosing the parameter sufficiently small and using [30, Lemma A], we see that the first curve can be made close to zero over part of the unit interval and close to unity over the rest; similarly, the second curve can be made close to unity and close to zero over the corresponding ranges. This in turn implies that, for any given $\epsilon$, there exists a threshold such that, below it, the intersection point is within $\epsilon$ of the optimal mixed strategy, proving the second part. The proof follows along the same lines for the second subcase in (31).

Lemma 2: The Jacobian evaluated at the point defined in Lemma 1 is negative definite.

Proof: We present the proof for the subcase in (30) and omit the proof for the subcase in (31), since it follows along similar lines. Suppose $J$ denotes the Jacobian of the drift, written as in (34). When there is only one mixed equilibrium, it follows from Lemma 1 that, by a proper choice of the parameter, the stationary point can be made arbitrarily close to the optimal mixed strategy. Then, $J$ can be approximated by (35), shown at the bottom of the page. If $T$ and $D$ denote the trace and determinant of this matrix, then $T < 0$ and $D > 0$. Therefore, from the Routh–Hurwitz condition [31], all eigenvalues have negative real parts, implying that the matrix is negative definite.

B. Only One Pure Equilibrium Case

Recall that there is only one pure equilibrium under the condition in (36). The proof of Lemma 1 for this case follows along the same lines as in case A, and we therefore omit the details; we provide the proof of Lemma 2 for this case next. The condition for only one pure equilibrium can be further divided into four different subcases. Without loss of generality, let us consider the subcase in which the first actions of both players are the optimal pure strategies. For this subcase, the Jacobian can again be approximated, as shown by the last equation at the bottom of the page. Therefore, as before, the trace is negative and the determinant is positive, and we conclude that this matrix is also negative definite, completing the proof of Lemma 2 for this subcase.

Proposition 2: For every $\epsilon > 0$, there exist a sufficiently small choice of the learning parameters and a time $k_0$ such that, for all $k \ge k_0$, the strategy pair $(p_1(k), p_2(k))$ is within $\epsilon$ of the Nash equilibrium with high probability.

Proof: The proof is provided in the Appendix.
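The Routh–Hurwitz check used in both subcases reduces, for a 2 × 2 Jacobian, to a sign test on the trace and determinant. A tiny helper illustrating this, with an arbitrary example matrix, is shown below.

```python
def has_stable_eigenvalues(J):
    """Routh-Hurwitz test for a 2x2 matrix: both eigenvalues have negative real parts
    iff the trace is negative and the determinant is positive."""
    trace = J[0][0] + J[1][1]
    det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
    return trace < 0 and det > 0

print(has_stable_eigenvalues([[-2.0, 0.5], [0.3, -1.0]]))   # True: asymptotically stable point
```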

(35)


TABLE I LIST OF PARAMETERS FOR THE SINGLE-CELL CDMA SYSTEM USED IN THE EXPERIMENTS

Proposition 2 means that, for the two-user two-strategy case, DSLPC-II converges to the Nash equilibrium if there is only one pure or one mixed equilibrium. The proposed DSLPC-II algorithm is based on an ergodic learning scheme and can therefore be characterized by a finite ergodic chain. Let $P$ be the transition matrix of a Markov chain. If $\lim_{n \to \infty} P^n$ exists and is independent of the initial state, then we say the chain is ergodic [28]. Hence, we believe that, when there are multiple equilibria, the algorithm should converge to those equilibria with a fixed probability distribution. For the more general situation of $N$ players, each with $m_i$ strategies, we resort to simulations and numerical analysis to investigate the performance of the two proposed algorithms, as presented in Section VI.

We note that there exists a speed–accuracy conflict with the learning algorithms. Consider the probability with which the algorithm converges to the desired equilibrium from a given initial probability vector. It was shown that this probability approaches 1 as the reward and penalty step parameters tend to zero [28]. Thus, accuracy increases as the step parameters decrease; however, this also implies that the rate of convergence decreases. Hence, the choice of the step parameters involves a tradeoff between speed and accuracy, a classic problem in stochastic control systems. Thus, we note that the choice of the step parameters $b$ and $c$ in our algorithms is application-dependent and should be made by practical experiment or training.

VI. NUMERICAL RESULTS

In our simulation-based experiments, we use a single-cell CDMA system, and the main parameters and variables are listed in Table I. Path gains are obtained using a simple path-loss model in which the gain decays with the distance from the base station, scaled by a constant. First, we consider the learning algorithm in a discrete-time environment with two possible strategies for two mobile users, which is the case considered in the theoretical analysis. Here we want to illustrate power control using DSLPC-I and show how the derived ODE [(16) and (17)] trajectories evolve. The two terminals are assumed to be located at fixed distances from the base station. Both of the users have the two pure power strategies {0.01 W, 0.1 W}. The game matrix for this case is as follows:

(37)

Fig. 3. Evolution of probability of choosing power strategies using DSLPC-I versus iteration time with two mobile users and two power levels 0.01 W and 0.1 W.

Since one of the users has a dominant strategy in this matrix, the game falls into category 1, which means that there is just one strict equilibrium: the pair with power strategies (0.01 W, 0.1 W). This is a Nash equilibrium. The probability of user 1 choosing power 0.01 W is $p_1$ and of choosing power 0.1 W is $1 - p_1$; in the same way, the probability of user 2 choosing power 0.01 W is $p_2$ and of choosing power 0.1 W is $1 - p_2$. We set the initial values of $p_1$ and $p_2$ equal to 0.5 and set the step-size parameter $b$ to a small value. The proposed DSLPC-I algorithm converges correctly to this solution, with each user selecting its equilibrium power level with probability 1, as depicted in Fig. 3. Thus, in this two-user, two-action case, a Nash equilibrium in pure strategies is discovered by DSLPC-I.

The trajectory of the solution of the ODE [(16) and (17)] and of $(p_1(k), p_2(k))$ is shown in Fig. 4. The ODE solution is obtained by numerical methods; the game matrix entries in the ODE are calculated from the utility function (2). The arrow shows the flow direction of the trajectory. This trajectory of the ODE solution characterizes the long-term behavior of the DSLPC-I process. We can notice that the curve starts at the point (0.5, 0.5) and is ultimately attracted to the equilibrium point. It can also be noticed that the actual probabilities $(p_1(k), p_2(k))$ evolve in the same way as the ODE trajectory. The initial probabilities $p_1(0)$ and $p_2(0)$ do not affect the convergence of our algorithm to the pure Nash equilibrium; to show this, we set $p_1(0) = 0.2$ and $p_2(0) = 0.7$. Fig. 5 shows that, as expected, the trajectory still converges to the same equilibrium solution.

When the power levels are discrete, the pure equilibrium may not be unique, and there may even be a mixed equilibrium. When there are multiple pure equilibria, a straightforward strategy, such as a deterministic maximization of the utility function by every user in each iteration, may not converge; it could oscillate between the two pure strategies depending on the initialization of the power vectors. However, since DSLPC evolves in probability space, it may not be attracted by a local-maximum equilibrium.


Fig. 6. Oscillations exhibited by a deterministic iterative utility maximization solution.

Fig. 4. Trajectory of the ODE and of $(p_1(k), p_2(k))$ with $p_1(0) = p_2(0) = 0.5$.

Fig. 7. Paths induced by DSLPC-I with different initial values of $p_1(0)$ and $p_2(0)$.

Fig. 5. Trajectory of the ODE with $p_1(0) = 0.2$, $p_2(0) = 0.7$.

To illustrate this, we construct an example as follows. Two users are located at fixed distances from the base station, each with the two pure power strategies {0.003 W, 0.975 W}. The probability for user 1 to choose power 0.003 W is denoted by $p_1$ and for power 0.975 W it is $1 - p_1$; similarly, the probability for user 2 to choose power 0.003 W is $p_2$ and to choose power 0.975 W is $1 - p_2$. The game matrix for this game is given as

(38)

Since neither user has a dominant strategy in this matrix and the best responses coincide at two strategy pairs, the game falls into category 2, which means there are two pure equilibria and one mixed equilibrium. As we can see, the two pure Nash equilibria have the power strategy pairs (0.003 W, 0.003 W) and (0.975 W,

0.975 W), respectively. It is obvious that the first equilibrium is better than the second one; thus, it is desirable for the power control algorithm to converge to the first equilibrium. With an initial power strategy pair close to (0.003 W, 0.003 W), a deterministic iterative maximization will converge to the first equilibrium and, with an initialization close to (0.975 W, 0.975 W), it will converge to the second equilibrium. However, for initial values near (0.975 W, 0.003 W) or (0.003 W, 0.975 W), it will not even converge but will oscillate, as shown in Fig. 6. Using DSLPC-I, except for the extreme cases, it converges to the first equilibrium for almost all other initial probabilities, as illustrated in Fig. 7. In this figure, we have plotted, for both players, the probability of choosing their first strategy (in this case 0.003 W). As starting points for the DSLPC-I algorithm we chose eight different initializations. For every initialization, we observe that DSLPC-I converges to the desired equilibrium. Every path plotted is an average of ten runs. This illustrates the robustness of DSLPC-I to initializations. For the extreme cases, when the initial probabilities are too close to the equilibrium points, DSLPC-I will converge to the corresponding equilibrium. However, we can


Fig. 8. Evolution of the probability of choosing power strategies using DSLPC-II versus iteration time with two mobile users and two power levels 0.01 W and 0.1 W.

Fig. 10. Powers at equilibrium of the NPG with continuous power space and of DSLPC-I with b = 0.01.

Fig. 11. DSLPC under fading channels.

Fig. 9. Utilities at equilibrium of the NPG with continuous power space and of DSLPC-I with b = 0.01.

always keep these situations from happening by setting all of the initial action probabilities to be equally likely.

Next, we study the DSLPC-II algorithm, also in a two-user, two-strategy case. We use the same setup as the one used for evaluating DSLPC-I. With suitable choices of the parameters $b$ and $c$, the DSLPC-II algorithm is seen to converge to the optimal strategy, as shown in Fig. 8.

Further, for a more general case, we consider a system with nine users located at different distances from the base station and with multiple power levels (more than two). The discrete power levels we used are in the range from 0.001 W to 1 W, equally spaced by 0.02 W. The step-size parameter is set to $b = 0.01$ in our simulation. The learned equilibrium utilities for the individual mobile users are depicted

in Fig. 9. For comparison, the game equilibrium assuming continuous power values is also shown there. It is clear that the continuous power value assumption produces higher utility values. From the figure, we see that DSLPC-I performs quite well and that the gap, due to the power level discretization, between DSLPC-I and the continuous upper bound is relatively small. The corresponding equilibrium powers are displayed in Fig. 10. We note that DSLPC-II also provides similar performance in this case. We also note that, when pricing is added, the utilities may improve significantly; DSLPC can easily be modified to add pricing by changing the utility function used as the payoff.

In practice, the wireless channel statistics vary with time, so it is interesting to investigate the performance of our proposed power control algorithms under fading channels. Without loss of generality, we examine the DSLPC-I algorithm under Rayleigh fading channels. The Doppler shift is modeled according to the Jakes spectrum. The slot duration is 1.25 ms, and the number of samples per slot is 5. We simulated a system with two terminal users moving at 60 km/h. As shown in Fig. 11, when a terminal is in a deep fade, the user's utility value will be


low, and, when the channel condition becomes better, the utility value increases correspondingly toward the optimal value. Hence, the learning automaton can track the evolving channel. This behavior of learning automata is also reported in [34].
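As a self-contained illustration of the first experiment in this section (two users, power levels 0.01 W and 0.1 W), the following sketch runs DSLPC-I on the utility of Section III. All system parameters (gains, bandwidth, rate, noise, frame sizes) are illustrative placeholders, not the values of Table I.

```python
import math, random

def utility(j, powers, gains, W=1e6, R=1e4, noise=5e-15, L=64, M=80):
    """Throughput-per-battery-life utility (2)-(3) with noncoherent FSK (illustrative parameters)."""
    interference = sum(gains[k] * powers[k] for k in range(len(powers)) if k != j)
    sir = (W / R) * gains[j] * powers[j] / (interference + noise)
    f = (1.0 - math.exp(-sir / 2.0)) ** M          # (1 - 2*BER)^M with BER = 0.5*exp(-sir/2)
    return L * R * f / (M * powers[j])

levels = [0.01, 0.1]                               # the two pure power strategies [W]
gains = [2e-4, 1e-4]                               # illustrative path gains for the two users
q = [[0.5, 0.5], [0.5, 0.5]]                       # initial action probabilities
u_max = [0.0, 0.0]                                 # running payoff-normalization estimates
b = 0.01                                           # step size

for _ in range(20000):
    choice = [random.choices([0, 1], weights=q[i])[0] for i in range(2)]
    powers = [levels[c] for c in choice]
    for i in range(2):
        u = utility(i, powers, gains)
        u_max[i] = max(u_max[i], u)
        r = u / u_max[i] if u_max[i] > 0 else 0.0
        for j in range(2):                         # linear reward-inaction update (10)
            q[i][j] += b * r * ((1.0 if j == choice[i] else 0.0) - q[i][j])

print(q)   # each row should concentrate on that user's equilibrium power level
```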

VII. CONCLUSION

A noncooperative discrete power control game is investigated. Conditions for the existence of pure and mixed equilibria are derived. Two stochastic iterative algorithms are presented to solve this game in a distributed manner, and convergence results for these learning algorithms are presented. It is shown that these algorithms can be used to discover pure as well as mixed equilibrium strategies. The ODE characterizing the algorithms is observed to match the empirical values quite well. Wireless fading channels are also considered, and it is observed that the proposed algorithm tracks the fading conditions while maximizing the utility function. The convergence–accuracy tradeoff issues are also discussed.

APPENDIX

The following theorem is from [30]. For each value of the step parameter, consider a stationary Markov process taking values in a compact subset of a finite-dimensional Euclidean space. The parameter is an index of the magnitude of the update step, and we are concerned with the asymptotic behavior of the process as the parameter tends to zero and the time index tends to infinity.

Assumptions:
A1)–A3) The conditional moments of the increments of the process satisfy the order-of-magnitude conditions of [30], where the orders of magnitude are uniform over the state space.
A4) The drift function has a bounded Lipschitz derivative.
A5) The conditional covariance function is Lipschitz in its argument.

The relevant functions, the Hessian, and an inner product on the state space are defined as in [30]. The following theorem summarizes the behavior of the process as the parameter tends to zero and the time index tends to infinity.

Theorem A: If, in addition to Assumptions A1–A5, 1) the state space is compact; 2) there exists a unique stationary point at which the drift vanishes; and 3) the Jacobian of the drift at that point is negative definite, then the following conclusions are true.
1) A bound on the deviation of the process holds uniformly for all time indices and all values of the parameter.
2) For any initial point, the (vector) differential equation (39) has a unique solution, given by (40).
3) The (matrix) differential equation (41) has a unique solution, given by (42); here, the normalized process converges to a normal distribution with zero mean and a covariance matrix that is obtained, asymptotically, as the unique solution of the system of linear equations (43).

Proof of Proposition 2: The proof follows closely the techniques in [33] and [30]. Combining Lemmas 1 and 2, it follows that the process satisfies all of the conditions of Theorem A above. Hence, from conclusion 2) of that theorem, we obtain (44), uniformly in the time index, where the limit trajectory satisfies (45). From the properties established in Lemma 2, it follows that the differential equation (45) is uniformly asymptotically stable, that is, (46) holds. Also, from conclusion 3) of Theorem A, we obtain that the normalized random vector converges in distribution; this, in turn, implies that (47) holds. From Lemma 1, (44), (46), and (47), it follows that, for any $\epsilon > 0$, there exists a parameter threshold such that, below it, (48) holds. This concludes the proof of Proposition 2.

REFERENCES

[1] J. M. Aein, "Power balancing in systems employing frequency reuse," COMSAT Tech. Rev., vol. 3, no. 2, pp. 277–300, 1973.
[2] R. W. Nettleton and H. Alavi, "Power control for a spread spectrum radio system," in Proc. IEEE Vehicular Technology Conf., Toronto, ON, Canada, 1983, pp. 242–246.
[3] S. A. Grandhi, R. Vijayan, D. J. Goodman, and J. Zander, "Centralized power control in cellular radio systems," IEEE Trans. Veh. Technol., vol. 42, no. 5, pp. 466–468, Nov. 1993.
[4] J. Zander, "Performance of optimum transmitter power control in cellular radio systems," IEEE Trans. Veh. Technol., vol. 41, no. 1, pp. 57–62, Feb. 1992.
[5] J. Zander, "Distributed cochannel control in cellular radio systems," IEEE Trans. Veh. Technol., vol. 41, no. 4, pp. 305–311, Aug. 1992.


[6] S. A. Grandhi, R. Vijayan, and D. J. Goodman, "Distributed power control in cellular radio systems," IEEE Trans. Commun., vol. 42, no. 2, pp. 226–228, Feb. 1994.
[7] A. Sampath, P. S. Kumar, and J. M. Holtzman, "Power control and resource management for a multimedia CDMA wireless system," in Proc. IEEE PIMRC, 1995, pp. 21–25.
[8] S.-L. Kim, Z. Rosberg, and J. Zander, "Combined power control and transmission rate selection in cellular networks," in Proc. IEEE Vehicular Technology Conf., Sep. 1999, pp. 19–22.
[9] S. Ulukus and R. Yates, "Stochastic power control for cellular radio systems," IEEE Trans. Commun., vol. 46, no. 6, pp. 784–798, Jun. 1998.
[10] M. K. Varanasi and D. Das, "Fast stochastic power control algorithms for nonlinear multiuser receivers," IEEE Trans. Commun., vol. 50, no. 11, pp. 1817–1827, Nov. 2002.
[11] A. Yener, R. D. Yates, and S. Ulukus, "Joint power control, multiuser detection and beamforming for CDMA systems," in Proc. IEEE Vehicular Technology Conf., Houston, TX, May 1999, pp. 1032–1036.
[12] J. Zhang, E. K. P. Chong, and I. Kontoyiannis, "Unified spatial diversity combining and power allocation schemes for CDMA systems in multiple time-scale fading channels," IEEE J. Sel. Areas Commun., vol. 19, no. 7, pp. 1276–1288, Jul. 2001.
[13] H. Ji and C. Y. Huang, "Non-cooperative uplink power control in cellular radio systems," Wireless Networks, vol. 4, no. 2, pp. 233–240, 1998.
[14] T. Alpcan, T. Basar, R. Srikant, and E. Altman, "CDMA uplink power control as a noncooperative game," in Proc. IEEE Conf. Decision and Control, Dec. 2001, vol. 1, pp. 197–202.
[15] C. W. Sung and W. S. Wong, "A noncooperative power control game for multirate CDMA data networks," IEEE Trans. Wireless Commun., vol. 2, no. 1, pp. 186–194, Jan. 2003.
[16] M. Xiao, N. B. Shroff, and E. K. P. Chong, "A utility-based power-control scheme in wireless cellular systems," IEEE/ACM Trans. Netw., vol. 11, no. 2, pp. 210–221, Apr. 2003.
[17] D. Famolari, N. B. Mandayam, D. J. Goodman, and V. Shah, "A new framework for power control in wireless data networks: Games, utility and pricing," in Proc. 36th Annu. Allerton Conf. Communications, Control, and Computing, Monticello, IL, 1998, pp. 289–310.
[18] C. U. Saraydar, N. B. Mandayam, and D. J. Goodman, "Efficient power control via pricing in wireless data networks," IEEE Trans. Commun., vol. 50, no. 2, pp. 291–303, Feb. 2002.
[19] "An overview of the application of code division multiple access (CDMA) to digital cellular systems and personal cellular networks," Qualcomm Inc., Doc. EX60-10010, 1992.
[20] C. W. Sung, K. K. Leung, and W. S. Wong, "A quality-based fixed-step power control algorithm with adaptive target threshold," IEEE Trans. Veh. Technol., vol. 49, no. 7, pp. 1430–1439, Jul. 2000.
[21] M. Andersin, Z. Rosberg, and J. Zander, "Distributed discrete power control in cellular PCS," Wireless Pers. Commun., vol. 6, pp. 211–231, 1998.
[22] K. S. Narendra and M. A. L. Thathachar, Learning Automata: An Introduction. Englewood Cliffs, NJ: Prentice-Hall, 1989.
[23] R. Chandramouli, "A stochastic technique for on-line prediction and tracking of wireless packet networks," in Proc. 35th Asilomar Conf. Signals, Systems, and Computers, 2001, vol. 1, pp. 672–676.
[24] S. Kiran and R. Chandramouli, "An adaptive energy efficient link layer protocol using stochastic learning control," in Proc. IEEE Int. Conf. Communications, 2003, pp. 1114–1118.
[25] D. Fudenberg and J. Tirole, Game Theory. Cambridge, MA: MIT Press, 1992.
[26] P. S. Sastry, V. V. Phansalkar, and M. A. L. Thathachar, "Decentralized learning of Nash equilibria in multi-person stochastic games with incomplete information," IEEE Trans. Syst., Man, Cybern., vol. 24, no. 5, pp. 769–777, May 1994.

[27] J. Hu and M. P. Wellman, "Multiagent reinforcement learning: Theoretical framework and an algorithm," in Proc. 15th Int. Conf. Machine Learning, 1998, pp. 242–250.
[28] K. S. Narendra and A. Annaswamy, Stable Adaptive Systems. Englewood Cliffs, NJ: Prentice-Hall, 1989.
[29] R. K. Nagle and E. B. Saff, Fundamentals of Differential Equations and Boundary Value Problems. Reading, MA: Addison-Wesley, 1996.
[30] S. Lakshmivarahan and K. S. Narendra, "Learning algorithms for two-person zero-sum stochastic games with incomplete information: A unified approach," SIAM J. Control Optim., vol. 20, no. 4, pp. 541–552, Jul. 1982.
[31] A. Blaquiere, Non-Linear System Analysis. New York: Academic, 1966.
[32] K. S. Narendra and A. Annaswamy, Stable Adaptive Systems. Englewood Cliffs, NJ: Prentice-Hall, 1989.
[33] M. F. Norman, Markov Processes and Learning Models. New York: Academic, 1973.
[34] M. Haleem and R. Chandramouli, "Adaptive downlink scheduling and rate selection: A cross-layer design," IEEE J. Sel. Areas Commun. (Special Issue on Mobile Computing and Networking), vol. 23, no. 6, pp. 1287–1297, Jun. 2005.

Yiping Xing (M’06) received the B.S. degree in electrical engineering from the University of Electronic Science and Technology of China (UESTC), Chengdu, China, in 2001, and the M.E. degree in electrical engineering and the Ph.D. degree from the Stevens Institute of Technology, Hoboken, NJ, in 2004 and 2006, respectively. His research interests include radio resource management for cellular and ad hoc networks, access control for cognitive radios, and game theory for wireless networks. He is currently with Bear Stearns, New York. Dr. Xing was the recipient of the Outstanding Research Award in 2005 and the Graduate Fellowship Award in 2006 from Stevens Institute of Technology. He was also the recipient of the IEEE CCNC 2006 Best Student Paper Award for his paper on dynamic spectrum access.

Rajarathnam Chandramouli (M'99–SM'06) received the Ph.D. degree in computer science and the M.A. degree in mathematics from the University of South Florida, Tampa, in 1999, the M.E. degree in electrical and computer engineering from the Indian Institute of Science in 1994, and the B.Sc. degree in mathematics from Loyola College, Chennai, in 1990. He is the Thomas E. Hattrick Chair Associate Professor of Information Systems in the Electrical and Computer Engineering (ECE) Department, Stevens Institute of Technology, Hoboken, NJ. His research in wireless networking and security, cognitive networks, steganography/steganalysis, and applied probability is funded by the National Science Foundation, the U.S. AFRL, the U.S. Army, the Office of Naval Research, and other agencies. Currently, he is the Founding Chair of the IEEE COMSOC Technical Sub-Committee on Cognitive Networks and a Management Board member of the IEEE SCC 41 standards committee.
