
Traffic-Aware Online Network Selection in Heterogeneous Wireless Networks

Qihui Wu, Senior Member, IEEE, Zhiyong Du, Student Member, IEEE, Panlong Yang, Member, IEEE, Yu-Dong Yao, Fellow, IEEE, and Jinlong Wang, Senior Member, IEEE

Abstract—We focus on the network selection problem in heterogeneous wireless networks. Many traditional approaches select the best network according to quality of service (QoS) related criteria, which neglects diverse user demands. We aim to select networks that maximize the quality of experience (QoE) of users. When the availability and dynamics of network state information are considered, most existing approaches cannot make effective selection decisions, since they are vulnerable to the uncertainty in network state information. To address this issue, we introduce the idea of online learning for network selection. In this paper, we formulate the network selection problem as a continuous time multi-armed bandit (CT-MAB) problem. A traffic-aware online network selection algorithm (ONES) is designed to match typical traffic types of users with their respective optimal networks in terms of QoE. Moreover, we find that the correlation among multiple traffic network selections can be exploited to improve the learning capability. This motivates us to propose two more efficient algorithms: the decoupled online network selection algorithm (D-ONES) and the virtual multiplexing ONES (VM-ONES). Simulation results demonstrate that our online network selection algorithms attain around a 10% gain in QoE reward rate over non-learning-based algorithms and learning-based algorithms without QoE consideration.

Index Terms—Online network selection, QoE, traffic type, online learning, heterogeneous wireless networks.

I. INTRODUCTION

Wireless communications in 4G and beyond are making efforts to integrate various wireless access technologies into a heterogeneous network environment. Networks such as LTE, WLAN and WiMAX provide multiple choices for network access. Moreover, roaming terminals are equipped with multiple radio interfaces for heterogeneous wireless network access. A smartphone, for example, integrates GSM, 3G, WiFi and Bluetooth and is able to access any one of them. However, this convenience also introduces challenges. One of the difficulties lies in making the right choice for users so that the heterogeneous wireless network resource is utilized efficiently.

Many efforts in the existing literature have been made in designing different network selection criteria. In most existing work, quality of service (QoS) related criteria are used, for instance,

the received signal strength [1] or the bandwidth [3]. Some other studies consider a combination of several factors as one criterion. For example, in [5], the overall load among networks and the users' battery lifetime are jointly considered. However, two limitations exist in most existing work.

First, the above-mentioned QoS-related criteria are mainly used to evaluate network-centric performance metrics, rather than to directly meet user requirements or demands. The user demand here refers to the user's personalized requirements on QoS provisioning. Indeed, nowadays users expect services to meet their demands and pay more attention to the satisfaction level of services. We believe that the quality of experience (QoE) is another important criterion to consider, since QoE is defined by the International Telecommunication Union (ITU) as "the overall acceptability of an application or service, as perceived subjectively by the end-user" [9]. QoE takes into account both technical parameters (e.g., QoS) and usage context variables [10]; thus a user may place different QoE demands on different traffic or applications. Therefore, in order to optimize the user's QoE, the optimal network selection strategy should be able to distinguish the diverse QoE demands resulting from different traffic types and make appropriate network selections accordingly. This differs significantly from traditional network selection algorithms, which are unaware of the diversity in QoE demands and choose a network to maximize a general criterion.

Second, current network selection algorithms are vulnerable to uncertainty in network state information (NSI). Due to the lack of efficient information sharing schemes among different networks and the business competition among network operators, some important NSI, such as the actual throughput and delay, may not be available to users. Monitoring the NSI of all networks is costly and impractical. Also, changing traffic load leads to dynamic NSI. In this context, most existing network selection algorithms, such as [7] [8] [11], are unable to perform effectively because they often rely on such prior information.

We tackle the NSI uncertainty (availability and dynamics) problem by introducing online learning into network selection. Without interrupting the ongoing communication, online learning can adjust and asymptotically match each traffic type to the corresponding optimal network. Designing such an online-learning-based network selection algorithm is not trivial. The first challenge originates from the consideration of the network access cost and the QoE reward. It is known that accessing either access points in WLAN or base stations in cellular networks often incurs costs such as energy consumption and transmission fees. Hence, in order to evaluate networks' overall performance, we need to jointly


consider the network access cost and the QoE reward. Second, matching one traffic type with its optimal network corresponds to one learning task. When there are several traffic types with different QoE demands, the online learning algorithm faces multiple learning tasks. This multi-task learning requirement can lead to poor overall convergence.

In this paper, we investigate the network selection problem in an online learning framework. We first introduce specific QoE functions that map experienced NSI to QoE reward for different traffic types distinguished by representative QoE demands. We also define a QoE reward rate, the ratio of the expected QoE reward to the expected network access cost, as a metric to evaluate the performance of network selection policies by jointly considering the network access cost and the QoE reward. Based on the QoE reward rate, we then formulate the network selection problem as a continuous time multi-armed bandit (CT-MAB) problem and propose a traffic-aware online network selection algorithm (ONES). Finally, in order to improve the convergence speed of ONES, two further algorithms, the decoupled online network selection algorithm (D-ONES) and the virtual multiplexing ONES (VM-ONES), are proposed. Our main contributions are summarized as follows.

1) We design an online network selection framework to maximize the user's QoE. By formulating the network selection problem in heterogeneous wireless networks as a CT-MAB problem, we propose a unified online network selection framework to match each traffic type of a user with a corresponding optimal network.

2) Three traffic-aware online network selection algorithms, ONES, D-ONES and VM-ONES, are proposed under the CT-MAB framework. ONES, derived from a learning algorithm, achieves a regret that is logarithmic in the slot number and polynomial in the number of traffic types. In order to overcome the poor convergence performance of ONES, we decouple the network selection problem into multiple independently updating sub-MABs and propose a new algorithm, D-ONES. Furthermore, by exploiting the correlation among the multiple learning tasks and updating them simultaneously, a third algorithm, VM-ONES, is proposed. D-ONES and VM-ONES also achieve logarithmic-order regrets, and their regrets are much smaller than that of ONES, resulting in significant performance improvements.

The rest of this paper is organized as follows. Related work is reviewed in Section II. The system model is presented in Section III. We formulate the problem in Section IV. The online network selection algorithms are presented in Section V. We evaluate the performance of the proposed framework in Section VI. Section VII concludes the paper.

II. RELATED WORK

In this section, related studies are presented, covering the following four aspects: network selection, vertical handoff, bandwidth aggregation and the multi-armed bandit problem.

Network selection related work mainly focuses on two aspects: network selection decision criteria and network selection decision algorithms. Regarding decision criteria, several metrics have been proposed in the literature. Received signal strength (RSS) based criteria [1] [2] effectively reflect the distance between the user and the network access point, which is important for mobile equipment to keep seamless connections. Authors in [3] and [4] adopt the available bandwidth as the main criterion for network selection. The overall load among heterogeneous wireless networks is used as a criterion in [5]. Some studies consider a combination of several factors, such as network handoff cost, delay and available bandwidth [13], to evaluate the overall network performance. Authors in [6] classify the decision criteria into four categories: network metrics, device-related criteria, traffic requirements and user preferences. As for network selection decision algorithms, the most widely used approach is multiple attribute decision making (MADM). Because of their deterministic nature and easy implementation, several MADM mechanisms have been introduced to evaluate the network performance and choose the best network when multiple performance metrics are involved in the decision [7]. In addition, considering the dynamic wireless network environment, Markov decision process based algorithms are adopted to make network selections that maximize the long-term reward [8].

We summarize several limitations of current work on network selection. On one hand, most existing work adopts QoS criteria, which neglect the user demand. On the other hand, current work commonly relies on prior NSI and assumes a static network environment. Compared with existing work, the criterion in our paper is the user's QoE. Specifically, we distinguish QoE demands for different traffic types. Although QoE-based network selection is proposed in [11], the authors focus on only one traffic type and assume that the NSI is known and static, which results in a totally different algorithm. Further, we introduce online learning to tackle the availability and dynamics of NSI. Although the dynamics of NSI are considered in [8], it is assumed there that the instantaneous NSI is observable and that the state transition probabilities are known.

Vertical handoff focuses on how to provide seamless handover for users roaming across multiple wireless networks. A large body of work has studied vertical handoff decision algorithms on the user side. An optimal vertical handoff strategy in a vehicular network setting is proposed in [12]. Fuzzy logic can also be used in vertical handoff decisions [13]. A comprehensive survey of vertical handoff algorithms can be found in [14].

When the bandwidth of a single network cannot meet the transmission performance requirement, bandwidth aggregation, or concurrent multipath transfer, is a promising solution, where a user transmits data concurrently over multiple wireless networks. Without any modification to infrastructures or 802.11 protocols, authors in [15] aggregate the available bandwidth across several WLAN APs by maintaining concurrent TCP connections. Since the original TCP protocol does not support transfer over multiple wireless interfaces, modifications and enhancements to TCP are needed [16] [17] to support concurrent multipath transfer.


Moreover, a new protocol, the stream control transmission protocol (SCTP) [18], has been developed by the Internet Engineering Task Force (IETF) to incorporate multihoming capability for users. Multihomed users can associate with multiple end-to-end paths to increase throughput.

The multi-armed bandit (MAB) problem is essentially the search for a balance between exploring the environment to find profitable actions and taking the empirically best action as often as possible [20]. The MAB problem has been applied in statistics, control, as well as communications [20]-[23]. The CT-MAB problem considered in this paper is an extension of the classical MAB problem. The unique characteristics of this continuous time version are that each play of an arm takes a random period of time and that the goal is to maximize the expected reward received per unit time, that is, to maximize the average reward rate [24].

III. SYSTEM MODEL

We consider a heterogeneous wireless network consisting of M wireless networks with the same or different technologies/standards, such as LTE, IEEE 802.11 WLAN and WiMAX. We denote the network set as M = {m_1, m_2, ..., m_M}. A multi-mode user equipment (UE) is located in the heterogeneous wireless network and can access any one of the networks.

A. System Description

The proposed network selection scheme generally operates in a slot-based manner, as shown in Fig. 1. Upon a traffic arrival, such as transferring a file, the UE starts a transmission and adjusts the accessed network during the transmission procedure, where a network a ∈ M is selected at the beginning of each slot based on a given network selection policy π. Note that the slot length here may last for several seconds and each traffic may last for multiple slots. To facilitate fast and dynamic network switching and to deal with possible out-of-order packets, we assume that transport layer protocols supporting UE multihoming, such as ECCP [17] or SCTP [18], are used.

In order to evaluate how well a network meets the user's demand, we adopt QoE as the metric, since QoE can accurately reflect the overall acceptability of an application or service. For the transmission in each slot, the UE evaluates the QoE reward ε, which will be discussed in the next subsection. Meanwhile, the transmission in each slot also incurs a network access cost τ. The network access cost can be measured in different ways, such as energy consumption or transmission fee. The QoE reward ε and the network access cost τ are important feedback for countering the NSI uncertainty and realizing the online network selection algorithms in our scheme.


Fig. 1: An illustration of the slot-based network selection scheme. Three types of traffic are considered and each traffic may last for several slots. At the beginning of each slot, the access network is updated.
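To make the slot-based operation in Fig. 1 concrete, the following minimal Python sketch outlines one possible realization of the per-slot loop; the function and variable names (select via policy, observe_qoe_and_cost, and so on) are illustrative placeholders rather than part of the proposed scheme:

    import random

    NETWORKS = ["LTE", "WLAN1", "WLAN2"]        # the network set M (illustrative)

    def observe_qoe_and_cost(network, traffic_type):
        # Placeholder: in the scheme, the QoE reward comes from the traffic
        # type's QoE function applied to the experienced NSI, and the cost is,
        # e.g., energy consumption or a transmission fee.
        return random.uniform(1.0, 5.0), random.uniform(1.0, 1.5)

    def run_transmission(policy, traffic_type, num_slots):
        """Run one traffic session; the policy picks a network in every slot."""
        history = []
        for _ in range(num_slots):
            a = policy(traffic_type, history)            # a(i) = pi(Lambda(i))
            eps, tau = observe_qoe_and_cost(a, traffic_type)
            history.append((traffic_type, a, eps, tau))  # feedback used by learning
        return history

    # Example: a trivial policy that always selects WLAN1 for five slots.
    trace = run_transmission(lambda s, h: "WLAN1", "video", num_slots=5)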

B. QoE Rewards of Traffic

In this paper, we assess the QoE reward by QoE functions. Explicitly, the QoE functions map the experienced NSI into the user's QoE reward in each slot. Considering the different types of traffic in the UE, we define a specific QoE function for each type of traffic according to its characteristics. While the traffic type classification can be diverse depending on the actual consideration, there are some widely used classifications in the existing literature. Basically, according to the characteristics of the throughput requirement, traffic is classified into stream traffic and elastic traffic in [34]. Stream traffic, which includes audio and video applications, can only partially benefit from the allocated throughput. Elastic traffic, such as web browsing, e-mail and file transfer, can benefit from all of the allocated throughput. Authors in [27] further take brittle traffic into consideration and obtain three types of traffic: brittle traffic, stream traffic and elastic traffic. Brittle traffic represents traffic that places strict requirements on bandwidth and has no adaptive properties, which may include video telephony, telemedicine, etc. On the other hand, according to the running applications, traffic is classified into video traffic, audio traffic and file transfer in [26] [32]. We do not specify a concrete traffic classification; rather, we assume that the preferred traffic type set is known and denoted as S = {s_1, s_2, ..., s_{|S|}}, where s ∈ S is one of the |S| traffic types. The stationary probability of each arriving traffic being type s ∈ S is p_s, with 0 < p_s < 1 and \sum_{s \in S} p_s = 1.

The QoE function Q(V) maps experienced NSI into QoE reward, i.e., the QoE reward is

\varepsilon = Q(V),   (1)

where V = {v_1, v_2, ..., v_{|V|}} is the related NSI and |V| is the number of parameters. The corresponding QoE function for traffic type s is denoted as Q_s(V). The term user "experienced NSI" indicates that the NSI is the feedback from an end-to-end transmission perspective rather than an estimated or probed result. Commonly, the NSI may include throughput, delay, loss rate and jitter. Actually, the effective sets of parameters for different traffic types may be subsets of V and can differ from each other, as in [26].

IV. PROBLEM FORMULATION

We formulate the online network selection problem as a CT-MAB problem in this section. For clarity, a glossary of the main variable definitions is given in Table I.

A. Goal of the UE

Based on the observed history information Λ(i) = {s_1, a(1), ε(1), τ(1), ..., s_{i-1}, a(i-1), ε(i-1), τ(i-1)}, a network selection policy π makes a decision, i.e., a(i) = π(Λ(i)), where s_i, a(i), ε(i) and τ(i) are the traffic type, the selected network, the QoE reward and the network access cost of the i-th slot, respectively. We assume that the NSI is approximately fixed during one slot. Actually, there is no need to store all the history information, since our algorithms work in an online updating manner, as shown in Section V.

Intuitively, the UE's goal is to find a network selection policy maximizing the expectation of the accumulative QoE reward, for instance, the expected total QoE reward in T successive slots, E[\sum_{i=1}^{T} Q_{s_i}(V(i))], where V(i) is the experienced NSI in the i-th slot. However, this goal is not appropriate in our setting. Indeed, the accumulative QoE reward cannot capture the influence of the network access cost. For example, a policy with the maximal accumulative QoE reward may incur an unbearable access cost, such as energy consumption or transmission fee, and is thus not preferred. Alternatively, since the reward rate, defined as the ratio of reward to cost, provides a fair tradeoff between QoE reward and network access cost, we resort to maximizing the expected QoE reward rate g_π:

g_\pi = \limsup_{T\to\infty} E\left[\frac{\sum_{i=1}^{T} Q_{s_i}(V(i))}{\sum_{i=1}^{T} \tau(i)}\right] = \limsup_{T\to\infty} E\left[\frac{\sum_{i=1}^{T} \varepsilon(i)}{\sum_{i=1}^{T} \tau(i)}\right],   (2)

where E is the expectation when policy π is adopted.

Considering ε(i) and τ(i), the optimal network selection policy is determined by the characteristics of the traffic as well as the NSI: the traffic type s_i determines the distinct QoE function, while the NSI V affects the QoE reward and the network access cost. Therefore, the traffic type together with the unknown and dynamic NSI makes the computation of g_π challenging. It is possible to perform online learning of the QoE reward rate of each policy from the ongoing communication and find the optimal policy. However, learning the optimal policy faces the exploration-exploitation dilemma. Maximizing the user's QoE requires accessing the networks estimated to be optimal more often, while converging to the optimal strategy requires sufficient access to the networks estimated to be suboptimal. Therefore, the tradeoff between exploitation and exploration of the heterogeneous wireless network resource is important. In the next subsection, we model the network selection as a continuous time MAB (CT-MAB) problem. The MAB problem is a powerful tool in control, learning theory and related applications, which can balance the exploitation-exploration tradeoff. We search for solutions to our problem within the CT-MAB framework.
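As a minimal numerical illustration of the reward-rate metric in (2), the sketch below computes the empirical QoE reward rate of a policy from a log of per-slot rewards and costs; the numbers are made up for the example:

    def empirical_reward_rate(slots):
        """slots: iterable of (qoe_reward, access_cost) pairs observed per slot.
        Returns total reward divided by total cost, the sample analogue of g_pi."""
        total_reward = sum(eps for eps, _ in slots)
        total_cost = sum(tau for _, tau in slots)
        return total_reward / total_cost

    # Hypothetical per-slot observations (epsilon(i), tau(i)):
    log = [(3.8, 1.1), (4.1, 1.2), (2.9, 1.1), (3.5, 1.2)]
    print(empirical_reward_rate(log))   # about 3.11 QoE units per unit of cost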


TABLE I: Glossary of key variable definitions

Variable | Definition | Section
M | Available networks set | Sec. III
S | Traffic type/side information set in the UE | Sec. III
V | The set of NSI | Sec. III
i | Slot index | Sec. III
π | Network selection policy | Sec. III
a(i) | Selected network in the i-th slot | Sec. III
ε(i) | QoE reward in the i-th slot | Sec. III
τ(i) | Network access cost in the i-th slot | Sec. III
p_s | Stationary probability of traffic type s | Sec. III
s_i | Traffic type of the i-th slot | Sec. III
Q_s(·) | QoE function for traffic type s | Sec. III
g_π | Expected QoE reward rate of policy π | Sec. IV
R_π(i) | Regret in the i-th slot | Sec. IV
π* | The optimal policy | Sec. IV
ε̄_{m,s}(i) | Average QoE reward of traffic type s in network m for the first i slots | Sec. V
τ̄_{m,s}(i) | Average network access cost of traffic type s in network m for the first i slots | Sec. V
E_{m,s} | Expected reward when network m is selected for traffic type s | Sec. V
Γ_{m,s} | Expected network access cost for traffic type s when network m is selected | Sec. V
κ_{m,s} | Relative value for traffic type s when network m is selected | Sec. V
T_{m,s}(i) | Number of slots in which network m was selected for traffic type s in the first i slots | Sec. V
T_s(i) | Number of slots in which the traffic type is s in the first i slots | Sec. V
T_π(i) | Number of slots in which a network compatible with policy π was selected | Sec. V
ḡ_π(i) | Sample mean of the expected QoE reward rate of policy π in the first i slots | Sec. V
ĉ_{i,T_{m,s}(i)} | Confidence interval for the estimate of κ_{m,s} | Sec. V
S̃_i | Virtual traffic type set in the i-th slot | Sec. V
ε̃_s(i) | Virtual QoE reward in the i-th slot | Sec. V
τ̃_s(i) | Virtual network access cost in the i-th slot | Sec. V

B. Continuous Time Multi-armed Bandit Problem Formulation

Before formally formulating the problem, we briefly introduce basic MAB problems. In the classical MAB problem, a decision maker plays with K gambling machines, i.e., K "arms" of a bandit. Each time the player pulls an arm a, he gets a reward ε. The rewards of playing each arm are i.i.d. random variables following some unknown distribution. Denote the expected reward of playing arm k by θ_k and let θ* = max_k θ_k. The player's goal is to find an arm selection policy π maximizing the accumulative expected reward, or equivalently, minimizing the regret at the i-th play,

R_\pi(i) = i\theta^* - E\left[\sum_{t=1}^{i} \varepsilon(t)\right],   (3)

which is defined as the accumulative expected reward gap between always playing the arm with the largest expected reward and policy π. The well-known index-based policy selects arms based on an upper estimate of θ_k, θ̂_k = θ̄_k + c [20], which is the sample mean θ̄_k corrected by a one-sided confidence interval c. The expected reward falls within the confidence interval of the average reward with high probability.

We outline two key features differentiating the CT-MAB problem [24] from the classical MAB: side information and cost. Specifically, playing an arm also incurs a cost, which is a random variable related to the side information and the selected arm. The side information is a piece of "task type" information indicating a particular utility for each play. Therefore, the optimal options for different side information may be different. These differences lead to the goal of the CT-MAB problem, maximizing the reward rate, while in the classical bandit problem the goal is maximizing the accumulative expected reward.

To model the problem, the available wireless networks are treated as the arms. Selecting a network corresponds to playing an arm, where the QoE reward ε and the network access cost τ in each slot are the reward and the cost, respectively. Different traffic types have diverse QoE functions, thus the traffic type s ∈ S of each slot is exactly the side information. In the rest of this paper, we use the terms traffic type and side information interchangeably. If the rewards and costs in different slots are independent, maximizing the expectation in (2) is equivalent to maximizing the ratio of the expectations of the numerator and the denominator [25], that is,

g_\pi = \limsup_{T\to\infty} \frac{E_\pi\left[\sum_{i=1}^{T} \varepsilon(i)\right]}{E_\pi\left[\sum_{i=1}^{T} \tau(i)\right]}.   (4)

For the above CT-MAB problem, the optimal network selection policy π* is the one maximizing the expected QoE reward rate,

(P1)   \pi^* = \arg\sup_{\pi} g_\pi.   (5)

Algorithm 1: Online Network Selection Algorithm (ONES)
Initialize: i = 0, T_s(0) = 0, ε̄_{m,s}(0) = 0, τ̄_{m,s}(0) = 0 for m ∈ M, s ∈ S.
Loop
  i = i + 1. Upon the arrival of the i-th traffic:
  If T_{s_i}(i-1) ≤ |M|
    Select the {T_{s_i}(i-1)+1}-th network in M.
  Else
    Select network
    a(i) = \arg\max_{m\in M} \left\{ \bar\varepsilon_{m,s_i}(i-1) - \bar\tau_{m,s_i}(i-1)\,\bar g^*(i-1) + \hat c_{i-1,T_{m,s_i}(i-1)} \right\}
    to access and communicate.
  End if
  Update ε̄_{m,s_i}(i), τ̄_{m,s_i}(i), ḡ*(i) and ĉ_{i,T_{m,s_i}(i)} according to (12)-(18).
End loop
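The following Python sketch mirrors the structure of Algorithm 1 under simplifying assumptions: the lower estimate ḡ*(i) of (15) and the confidence term of (18) are assumed to be supplied by the caller, and all names are illustrative rather than a reference implementation:

    from collections import defaultdict

    class ONESSketch:
        """Simplified single-user sketch of the ONES selection and update steps."""

        def __init__(self, networks):
            self.networks = list(networks)
            self.eps_bar = defaultdict(float)      # average QoE reward, keyed by (m, s)
            self.tau_bar = defaultdict(float)      # average access cost, keyed by (m, s)
            self.count = defaultdict(int)          # T_{m,s}(i)
            self.slots_of_type = defaultdict(int)  # T_s(i)

        def select(self, s, g_star, c_hat):
            # Round-robin initialization: try every network once per traffic type.
            if self.slots_of_type[s] < len(self.networks):
                return self.networks[self.slots_of_type[s]]
            # Otherwise pick the network maximizing the upper estimate of
            # kappa_{m,s}, i.e., eps_bar - tau_bar * g_star + c_hat, as in (17).
            def index(m):
                return (self.eps_bar[(m, s)]
                        - self.tau_bar[(m, s)] * g_star
                        + c_hat(m, s))
            return max(self.networks, key=index)

        def update(self, s, m, eps, tau):
            # Incremental form of the sample means in (12)-(13).
            self.slots_of_type[s] += 1
            self.count[(m, s)] += 1
            n = self.count[(m, s)]
            self.eps_bar[(m, s)] += (eps - self.eps_bar[(m, s)]) / n
            self.tau_bar[(m, s)] += (tau - self.tau_bar[(m, s)]) / n

    # Usage sketch; g_star and c_hat would be computed from (15) and (18).
    ones = ONESSketch(["LTE", "WLAN1", "WLAN2"])
    m = ones.select("video", g_star=2.5, c_hat=lambda m, s: 0.5)
    ones.update("video", m, eps=3.7, tau=1.2)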

V. TRAFFIC-AWARE ONLINE NETWORK SELECTION ALGORITHMS

A. Property of the Optimal Network Selection Policy

In classical MAB problems, the player's goal is to maximize the accumulative reward. However, for the CT-MAB problem in this paper, the goal is to maximize the average reward rate as in (P1). Due to this difference, classical UCB1 [20] and similar learning algorithms cannot be used directly for the CT-MAB problem. However, learning algorithms for the CT-MAB problem can be obtained through some transformations of the goal.

We denote ε_{m,s}(i) and τ_{m,s}(i) as the reward and cost when the traffic type is s and network m is selected in the i-th slot, respectively. We assume that the rewards and costs are bounded, ε_{m,s}(i) ∈ [ε_min, ε_max] and τ_{m,s}(i) ∈ [τ_min, τ_max]. The expected reward and cost when network m is selected with traffic type s are E_{m,s} = E[ε_{m,s}(i)] and Γ_{m,s} = E[τ_{m,s}(i)]. Denote the relative value κ_{m,s} as the expected reward that can be collected minus the expected reward the optimal policy can collect when network m is selected with side information s,

\kappa_{m,s} = E_{m,s} - \Gamma_{m,s}\, g^*,   (6)

where g* = \sup_{\pi} g_\pi. Based on the theory of semi-Markov decision problems [24], the following proposition holds.

Proposition 1. A deterministic stationary policy π*: S → M is optimal for the CT-MAB problem if and only if it satisfies the condition

\kappa_{\pi^*(s),s} = \max_{m\in M} \kappa_{m,s}, \quad \forall s \in S,   (7)

where π(s) denotes the selected network for traffic type s under policy π. Denote Π as the set of stationary deterministic policies; thus |Π| = |M|^{|S|}. This proposition indicates that the optimal network selection policy is the π* ∈ Π that chooses a network maximizing the relative value κ_{m,s} for each traffic type s ∈ S simultaneously. Therefore, (P1) can be changed to (P2):

(P2)   \pi^* = \left\{ \pi(s) \,\middle|\, \pi(s) = \arg\max_{m\in M} \kappa_{m,s},\ \forall s \in S \right\}.   (8)

In the rest of this section, we first propose an online network selection algorithm, ONES. After analyzing its performance, we propose two more efficient algorithms, D-ONES and VM-ONES, which achieve better convergence performance.

B. Online Network Selection Algorithm: ONES

Motivated by Proposition 1, we propose an online network selection algorithm (ONES, see Algorithm 1) derived from a learning algorithm in [24]. The main idea of this algorithm is to develop upper estimates of κ_{m,s} with particular confidence bounds. The algorithm works as follows: in the first |M| slots of each traffic type s ∈ S, each network m ∈ M is selected once. After that, the network maximizing the upper estimate of κ_{m,s_i} is selected for each arriving traffic type s_i. The parameters in the algorithm are defined as

T_{m,s}(i) = \sum_{t=1}^{i} I(a(t)=m,\ s_t=s),   (9)

T_s(i) = \sum_{t=1}^{i} I(s_t=s),   (10)

T_\pi(i) = \sum_{t=1}^{i} I(a(t)=\pi(s_t)),   (11)

where I(·) is the indicator function, T_{m,s}(i) is the number of slots in which network m is selected for traffic type s in the first i slots, T_s(i) is the number of slots in which the traffic type is s in the first i slots, and T_π(i) is the number of slots in which the selected network is compatible with policy π, that is, the same as the decision of policy π.


In addition,

\bar\varepsilon_{m,s}(i) = \frac{1}{T_{m,s}(i)} \sum_{t=1}^{i} I(a(t)=m,\ s_t=s)\,\varepsilon(t),   (12)

\bar\tau_{m,s}(i) = \frac{1}{T_{m,s}(i)} \sum_{t=1}^{i} I(a(t)=m,\ s_t=s)\,\tau(t),   (13)

\bar g_\pi(i) = \frac{\sum_{t=1}^{i} I(a(t)=\pi(s_t))\,\varepsilon(t)}{\sum_{t=1}^{i} I(a(t)=\pi(s_t))\,\tau(t)},   (14)

\bar g^*(i) = \max_{\pi\in\Pi} \left[ \bar g_\pi(i) - c_{i,T_\pi(i)} \right],   (15)

where ε̄_{m,s}(i) and τ̄_{m,s}(i) are the average reward and the average cost in the first i slots when network m is selected for traffic type s, respectively, ḡ_π(i) is the estimate of the average reward rate of network selection policy π in the i-th slot, and ḡ*(i) is a lower estimate of g* in the i-th slot, where

c_{i,T_\pi(i)} = \sqrt{2 c_1 \log\!\left(i\sqrt{|\Pi|+1}\right) \big/ T_\pi(i)},   (16)

c_1 = 2\max\left\{ \frac{(\varepsilon_{\max}-\varepsilon_{\min})^2}{\tau_{\min}^2},\ \frac{\varepsilon_{\max}^2(\tau_{\max}-\tau_{\min})^2}{\tau_{\min}^4} \right\}.

In ONES,

\bar\varepsilon_{m,s_i}(i-1) - \bar\tau_{m,s_i}(i-1)\,\bar g^*(i-1) + \hat c_{i-1,T_{m,s_i}(i-1)}   (17)

is the upper estimate of κ_{m,s_i}, and ĉ_{i,T_{m,s}(i)} is the confidence interval defined as

\hat c_{i,T_{m,s}(i)} = \sqrt{(d_0+d_1)\log\!\left(i\sqrt{|\Pi|+1}\right) \big/ T_{m,s}(i)},   (18)

d_0 = \sqrt{2\tau_{\max}^2 c_1}, \qquad d_1 = \sqrt{8\max\left\{(\varepsilon_{\max}-\varepsilon_{\min})^2,\ \varepsilon_{\max}^2(\tau_{\max}-\tau_{\min})^2/\tau_{\min}^2\right\}}.

The regret is a well-known performance measure for online learning algorithms. Different from the classical regret in (3), the regret in our problem can be defined as

R(i) = g^* \sum_{t=1}^{i} \tau(t) - \sum_{t=1}^{i} \varepsilon(t).   (19)

Note that maximizing κ_{m,s} in (P2) is equivalent to minimizing the expected regret. Generally, the optimal growth rate of the regret is demonstrated to be logarithmic in the number of plays. The optimal logarithmic regret of ONES is guaranteed by the following theorem.

Theorem 1 (Theorem 1 in [24]). The expected regret R(i) of ONES in i slots is upper bounded as

E[R(i)] \le G\left[\left(2+\frac{2\pi^2}{3(|\Pi|+1)^2}\right)|M||S| + 2|M||S|\log(i) + \sum_{s\in S}\sum_{m:\,\Delta_m(s)>0} \frac{(d_0+d_1)\log\!\left(i\sqrt{|\Pi|+1}\right)}{\Delta_m(s)^2}\right],   (20)

where G = τ_max g* − ε_min and Δ_m(s) = κ_{π*(s),s} − κ_{m,s}.

C. Decoupled Online Network Selection Algorithm: D-ONES

From another perspective, we can treat (P2) as a compound MAB problem. We define the compound MAB problem as a MAB problem consisting of |S| sub-MAB problems. The sub-MAB problems are distinguished by the traffic type s ∈ S. The available networks M are the common |M| arms of all sub-MAB problems. Each sub-MAB problem with traffic type s aims at finding the optimal network maximizing κ_{m,s}. The aim of the compound MAB problem is to find the network selection policy maximizing κ_{m,s} for all s ∈ S simultaneously. Consequently, the compound MAB problem requires multi-task online learning.

The updating of the sub-MABs in ONES is correlated. Note that the i in ĉ_{i,T_{m,s}(i)} is the total number of slots of all traffic types, rather than the number of slots of traffic type s, T_s(i). This results in a relatively larger confidence interval ĉ_{i,T_{m,s}(i)} than in conventional MABs, e.g., in [20]. The relatively large confidence intervals lead to conservative updating of the sub-MABs, in which the sampling of suboptimal networks is more emphasized. Hence, the correlated updating of the sub-MABs directly induces performance loss.

In response to this limitation of ONES, we seek a new algorithm with a finer upper estimate of κ_{m,s}. Fortunately, Proposition 1 indicates that this multi-task online learning can be decomposed via the relative values, which provides the possibility. To this end, we propose a new algorithm, the decoupled online network selection algorithm (D-ONES, see Algorithm 2), which decouples the updating of the sub-MABs. Denoting

\phi_{t,t'} = \sqrt{\frac{2\log t}{t'}},   (21)

we get an upper confidence estimate of E_{m,s} as ε̄_{m,s}(i) + φ_{T_s(i),T_{m,s}(i)} and a lower confidence estimate of Γ_{m,s} as τ̄_{m,s}(i) − τ_max φ_{T_s(i),T_{m,s}(i)}. Since ḡ_i* is a lower estimate of g*, we obtain a new upper estimate of κ_{m,s} in the i-th slot as

\bar\varepsilon_{m,s}(i) + \phi_{T_s(i),T_{m,s}(i)} - \left[\bar\tau_{m,s}(i) - \tau_{\max}\phi_{T_s(i),T_{m,s}(i)}\right]\bar g_i^*.   (22)

Different from ONES, the upper estimate of κ_{m,s} in D-ONES depends on T_s(i) rather than on i. Note that, given the same T_{m,s}(i), the confidence interval in (22) is smaller than that in (17), which means that the decoupled updating yields a finer upper estimate in D-ONES. Thus, we can expect better performance from D-ONES. Also, the following theorem establishes the logarithmic-order regret of D-ONES under a mild condition.

Theorem 2. The expected regret R(i) of D-ONES in i slots is upper bounded as

E[R(i)] \le \sum_{s\in S} p_s L^*(s) \sum_{m\ne\pi^*(s)} E\!\left[\sum_{t=1}^{i} I\{\pi_t(s)=m\}\right] \le \sum_{s\in S} p_s L^*(s) \sum_{m\ne\pi^*(s)} \left(8\zeta_{m,s}\log i + 1 + \frac{\pi^2}{3}\right),   (23)

with

\zeta_{m,s} = \frac{\left(1+\bar g^*_{i-1}\tau_{\max}\right)^2}{\left[E_{\pi^*(s),s}-E_{m,s}+\bar g^*_{i-1}\left(\Gamma_{m,s}-\Gamma_{\pi^*(s),s}\right)\right]^2},

when the following condition holds for all m ≠ π*(s), s ∈ S:

E_{\pi^*(s),s}-E_{m,s}+\bar g^*_{i-1}\left(\Gamma_{m,s}-\Gamma_{\pi^*(s),s}\right) \ge 0,   (24)

where p_s is the stationary probability of each arriving traffic being type s ∈ S and L*(s) = \max_{m\in M}\{\Gamma_{m,s} g^* - E_{m,s}\} is the largest loss for traffic type s. The proof of Theorem 2 is given in the appendix. Note that Δ_m(s) = κ_{π*(s),s} − κ_{m,s} = (E_{π*(s),s} − E_{m,s}) + g*(Γ_{m,s} − Γ_{π*(s),s}) > 0. When i is sufficiently large, ḡ*_{i-1} → g* with high probability, which means that (24) holds. Therefore, the above regret bound is attained with high probability.
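As a small numerical companion to (21)-(22), the following sketch evaluates the decoupled D-ONES index for a single (network, traffic type) pair from assumed summary statistics; it is illustrative only and does not implement the full Algorithm 2:

    import math

    def phi(t, t_prime):
        """Confidence radius phi_{t,t'} from (21)."""
        return math.sqrt(2.0 * math.log(t) / t_prime)

    def dones_index(eps_bar, tau_bar, T_s, T_ms, g_star_lower, tau_max):
        """Upper estimate of kappa_{m,s} from (22).
        eps_bar, tau_bar : sample means from (12)-(13)
        T_s, T_ms        : counts T_s(i) and T_{m,s}(i)
        g_star_lower     : lower estimate of g* from (15)
        tau_max          : known upper bound on the access cost"""
        c = phi(T_s, T_ms)
        return (eps_bar + c) - (tau_bar - tau_max * c) * g_star_lower

    # Example with made-up statistics for one network and one traffic type:
    print(dones_index(eps_bar=3.6, tau_bar=1.2, T_s=200, T_ms=40,
                      g_star_lower=2.8, tau_max=1.5))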


Algorithm 2: Decoupled Online Network Selection Algorithm (D-ONES)
Initialize: i = 0, T_s(0) = 0, ε̄_{m,s}(0) = 0, τ̄_{m,s}(0) = 0 for m ∈ M, s ∈ S.
Loop
  i = i + 1. Upon the arrival of the i-th traffic:
  If T_{s_i}(i-1) ≤ |M|
    Select the {T_{s_i}(i-1)+1}-th network in M.
  Else
    Select network
    a(i) = \arg\max_{m\in M} \left\{ \bar\varepsilon_{m,s_i}(i-1) + \phi_{T_{s_i}(i-1),T_{m,s_i}(i-1)} - \left[\bar\tau_{m,s_i}(i-1) - \tau_{\max}\phi_{T_{s_i}(i-1),T_{m,s_i}(i-1)}\right]\bar g^*(i-1) \right\}
    to access and communicate.
  End if
  Update ε̄_{m,s_i}(i), τ̄_{m,s_i}(i) and ḡ*(i) according to (12)-(15); update φ_{T_{s_i}(i-1),T_{m,s_i}(i-1)} according to (21).
End loop

Algorithm 3: Virtual Multiplexing Online Network Selection Algorithm (VM-ONES)
Initialize: i = 0, T_s(0) = 0, ε̄_{m,s}(0) = 0, τ̄_{m,s}(0) = 0 for m ∈ M, s ∈ S.
Loop
  i = i + 1. Upon the arrival of the i-th traffic:
  If T_{s_i}(i-1) ≤ |M|
    Select the {T_{s_i}(i-1)+1}-th network in M.
  Else
    Select network
    a(i) = \arg\max_{m\in M} \left\{ \bar\varepsilon_{m,s_i}(i-1) - \bar\tau_{m,s_i}(i-1)\,\bar g^*(i-1) + \hat c_{i-1,T_{m,s_i}(i-1)} \right\}
    to access and communicate.
  End if
  Loop s ∈ S̃_i
    Create virtual samples {s, ã_s(i), ε̃_s(i), τ̃_s(i)} for s.
  End loop
  For each s ∈ S, update ε̄_{m,s}(i) and τ̄_{m,s}(i) according to (26)-(27); update ḡ*(i) and ĉ_{i,T_{m,s}(i)} according to (15) and (18) with the virtual samples.
End loop

D. Virtual Multiplexing Online Network Selection Algorithm: VM-ONES

As mentioned above, in each slot only the sub-MAB problem whose side information equals the current traffic type is updated in both ONES and D-ONES. For this reason, the bounds on the expected regrets of ONES and D-ONES are approximately the sum of the average regrets of the individual sub-MABs. This can be seen in the bound on E[R(i)] in (20), where (2 + 2π²/(3(|Π|+1)²))|M||S| and 2|M||S| log(i) grow linearly with |S|, i.e., the number of sub-MABs. A similar result can be found in the expected regret bound of D-ONES. Although the expected regrets of ONES and D-ONES grow in the optimal logarithmic order, their actual regrets are still considerably large.

Along a different avenue from ONES and D-ONES, we consider a new approach that exploits the correlation among the sub-MABs to speed up the updating, rather than decoupling the updating of the sub-MABs. Actually, exploitable correlation among the sub-MABs does exist. We notice that, for a selected network a(i) = m' in the i-th slot with traffic type s_i = s', the average QoE reward ε̄_{m',s}(i) and the average network access cost τ̄_{m',s}(i) can be updated not only for the sub-MAB with side information s = s' but also for the sub-MABs with s ≠ s'. This is possible if, for a fixed network m' and experienced NSI V(i), the only difference among ε_s(i) and τ_s(i) for s ∈ S lies in the different QoE functions of the traffic. This motivates us to utilize the correlation in such cases with an appropriate formulation.

We update all the sub-MABs in each slot by constructing virtual samples {s, ã_s(i), ε̃_s(i), τ̃_s(i)} for s ≠ s_i. Denoting the virtual side information set in the i-th slot as S̃_i = {s | s ≠ s_i, s ∈ S}, we assume that virtual traffic of type s ∈ S̃_i arrives together with the current traffic of type s_i and that they can be virtually multiplexed in the selected network, that is, ã_s(i) = m'. The virtual network access cost is τ̃_s(i) = τ(i) and the virtual QoE reward is

\tilde\varepsilon_s(i) = Q_s(V(i)), \quad \forall s \in \tilde S_i.   (25)

The sample means of the rewards and the access costs are updated as follows:

\bar\varepsilon_{m,s}(i) = \frac{\sum_{t=1}^{i} I(a(t)=m,\ s_t=s)\,\varepsilon(t) + \sum_{t=1}^{i} I(\tilde a_s(t)=m,\ s\in\tilde S_t)\,\tilde\varepsilon_s(t)}{T_{m,s}(i)+\tilde T_{m,s}(i)},   (26)

\bar\tau_{m,s}(i) = \frac{\sum_{t=1}^{i} I(a(t)=m,\ s_t=s)\,\tau(t) + \sum_{t=1}^{i} I(\tilde a_s(t)=m,\ s\in\tilde S_t)\,\tilde\tau_s(t)}{T_{m,s}(i)+\tilde T_{m,s}(i)},   (27)

where \tilde T_{m,s}(i) = \sum_{t=1}^{i} I(\tilde a_s(t)=m,\ s\in\tilde S_t) is the number of virtual samples with traffic type s and selected network m observed up to the i-th slot.
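A minimal sketch of the virtual-sample bookkeeping in (25)-(27) follows: after each real transmission, every traffic type receives a real or virtual sample built from the same experienced NSI and cost. The QoE functions and the NSI representation below are placeholders.

    from collections import defaultdict

    class VirtualSampleStats:
        """Combined real + virtual sample means per (network, traffic type)."""

        def __init__(self):
            self.sum_eps = defaultdict(float)   # numerator of (26)
            self.sum_tau = defaultdict(float)   # numerator of (27)
            self.n = defaultdict(int)           # T_{m,s}(i) plus virtual count

        def update_slot(self, network, nsi, tau, qoe_functions):
            # For every traffic type s, apply its QoE function Q_s to the same
            # experienced NSI, as in (25); the access cost tau is reused unchanged.
            for s, q_s in qoe_functions.items():
                key = (network, s)
                self.sum_eps[key] += q_s(nsi)
                self.sum_tau[key] += tau
                self.n[key] += 1

        def means(self, network, s):
            key = (network, s)
            return self.sum_eps[key] / self.n[key], self.sum_tau[key] / self.n[key]

    # Usage with made-up QoE functions of a single NSI value (throughput in kbps):
    qoe = {"video": lambda v: 3.0,
           "audio": lambda v: 4.0,
           "elastic": lambda v: min(5.0, 1.0 + v / 500.0)}
    stats = VirtualSampleStats()
    stats.update_slot("WLAN1", nsi=1200.0, tau=1.2, qoe_functions=qoe)
    print(stats.means("WLAN1", "video"))   # mean updated from a virtual sample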


The sample means of the QoE reward and the access cost thus incorporate the virtual samples, and all the sub-MABs can be updated in parallel. Similarly, the virtual samples should also be incorporated in computing ḡ_π(i). Based on the virtual samples, we propose the virtual multiplexing online network selection algorithm (VM-ONES, see Algorithm 3). Interestingly, we find that the expected regret of VM-ONES has the same bound as that of ONES; the proof is similar to the proof of Theorem 1. Nevertheless, the simulations and the following analysis indicate a significant reduction in the actual regret of VM-ONES compared with ONES.

We approximately compare the convergence performance of VM-ONES with that of ONES in a special case where each traffic lasts for only one slot. With virtual samples, there are one real traffic type (s_i) and |S|−1 virtual traffic types in each slot, which means that all |S| sub-MABs can be updated in parallel in each slot. Suppose that there is only one type of traffic s in the network selection problem; then, in each slot, the sub-MAB with s is updated in ONES. Denote W(p*) as the state in which the probability of selecting the best network is no lower than p*. We use n_s, the average number of slots for a single sub-MAB to converge to state W(p*), to reflect the convergence speed. For the compound MAB of the network selection problem, ONES and VM-ONES converge to W(p*) only when all the sub-MABs converge to W(p*). In ONES, the probability that the sub-MAB with s is updated is p_s, and the total number of slots for the sub-MAB with s to converge is approximately n_s/p_s. Therefore, the average number of slots for ONES to converge to W(p*) is

N_{ONES} = \max_{s\in S} \frac{n_s}{p_s}.

On the other hand, all |S| sub-MABs can be updated in parallel in VM-ONES. Denoting n_{s_0} = \max_{s\in S} n_s, the average number of slots for VM-ONES to converge to W(p*) is

N_{VM\text{-}ONES} = \max_{s\in S} n_s = n_{s_0}.

Since \max_{s\in S} n_s/p_s \ge n_{s_0}/p_{s_0}, the convergence performance gain satisfies

\frac{N_{ONES}}{N_{VM\text{-}ONES}} \ge \frac{1}{p_{s_0}}.
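For instance, if the traffic type requiring the largest number of slots n_{s_0} to reach W(p*) is one that arrives with probability p_{s_0} = 0.2 (the arrival probability of video traffic in the simulation setup of Section VI), the bound says that ONES needs at least 1/0.2 = 5 times as many slots as VM-ONES to converge; this is only an illustrative reading of the bound under the single-slot-traffic assumption above.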

Accordingly, the convergence performance advantage of VM-ONES can effectively decrease the regret.

In order to clearly understand ONES, D-ONES and VM-ONES, we briefly explain the relationship of the three algorithms. As illustrated in Fig. 2, ONES is derived from a learning algorithm which, however, suffers from poor convergence performance, resulting in performance loss. By decoupling the updating of the sub-MABs in ONES, D-ONES obtains a new upper estimate of κ_{m,s} with a smaller confidence interval. VM-ONES takes a further step and updates the sub-MABs in parallel by exploiting the correlation among the sub-MABs in some cases. By virtual multiplexing, VM-ONES updates all the sub-MABs in each slot. Therefore, the dependency among the sub-MABs in D-ONES is weaker than that of ONES, while in VM-ONES the dependency is stronger than that of ONES.

Fig. 2: Relationship of the three algorithms (from ONES to D-ONES: new upper estimate, decoupled updating; from ONES to VM-ONES: construct virtual samples, update sub-MABs simultaneously).

VI. PERFORMANCE EVALUATION

A. Simulation Setup

1) QoE Function Definition: First, we differentiate traffic into three types, similar to [26]: s_v (video traffic), s_a (audio traffic) and s_e (elastic traffic). The corresponding traffic type set is denoted as S = {s_v, s_a, s_e}. We emphasize that more complicated user demands could also be adopted. Some widely used QoE models are specified for each type of traffic. The QoE reward in our simulation is the mean opinion score (MOS) [23], which is used as a subjective measure of the network quality. The MOS takes five values from 1 to 5, indicating the user satisfaction levels "Bad", "Poor", "Fair", "Good" and "Excellent", respectively. In the following, we briefly present the QoE functions of the three traffic types, which map the user-experienced NSI into MOS outputs.

For video traffic, the MOS mainly depends on the loss of a single slice of a frame from the video stream [26]. With some transformation, the MOS is simplified as a function of the peak signal-to-noise ratio (PSNR) [28]; the QoE function Q_video is

Q_{video}(P_{snr}) = 4.5 - \frac{3.5}{1+\exp\left(b_1(P_{snr}-b_2)\right)},   (28)

where b_1 and b_2 are the parameters determining the shape of the function and P_snr is the experienced PSNR. In the simulation, b_1 = 1 and b_2 = 5.

For audio traffic, the QoE function Q_audio is defined by a nonlinear mapping of the R-factor [29]:

Q_{audio}(R_f) = 1 + 0.035 R_f + 7\cdot 10^{-6} R_f (R_f-60)(100-R_f),   (29)

where R_f is the R-factor defined by the ITU to reflect the audio quality impairment from different aspects. Generally, delay and packet loss rate are the two main concerns. Thus, R_f can be computed by [29]

R_f = 94.2 - I_e - I_d,

where I_e is the impairment caused by the packet loss rate and I_d is the impairment caused by the delay. Given a packet loss rate e, the impairment I_e is defined as [29] [30]

I_e = \gamma_1 + \gamma_2 \ln(1+\gamma_3 e),

where γ_1, γ_2 and γ_3 are constant parameters dependent on the codec. It is recommended that when G.729a is used, the parameters are γ_1 = 11, γ_2 = 40 and γ_3 = 10.


Fig. 3: QoE functions for video traffic, audio traffic and elastic traffic (MOS versus PSNR in dB for video; MOS versus throughput in kbps for elastic traffic).

Fig. 4: A sample run of ONES (ratio of selecting LTE, WLAN1 and WLAN2 versus slot index, shown separately for video, audio and elastic traffic).

Fig. 5: A sample run of D-ONES.

Fig. 6: A sample run of VM-ONES.

The packet loss rate e consists of the loss probability in the network, e_network, and the loss probability caused by playout loss, e_playout, i.e., e = e_network + (1 − e_network) e_playout. In the simulation, e_playout = 0.005. On the other hand, the delay impairment I_d reflects the effect caused by the delay d in ms,

I_d = 0.024 d + 0.11 (d - 177.3)\, I(d - 177.3),

where I(·) is the indicator function. Note that 177.3 ms is believed to be a delay threshold for audio traffic [30]. The delay d consists of the codec delay d_codec, the playout delay d_playout and the network delay d_network, i.e., d = d_codec + d_playout + d_network, where d_codec = 25 ms and d_playout = 60 ms are adopted in the simulation.

For non-real-time traffic such as file transfer and web browsing, which we call elastic traffic, the corresponding QoE is defined as an increasing function of the throughput θ [31]:

Q_{elastic}(\theta) = b_3 \log(b_4 \theta),   (30)

where b_3 and b_4 can be determined by the required maximal and minimal throughput. We assume the required minimal and maximal throughput are 100 kbps and 2000 kbps, and the resulting parameters are set as b_3 = 2.6949 and b_4 = 0.0235. The QoE functions of the three types of traffic are shown in Fig. 3.

2) Simulation Scenario: We consider a heterogeneous wireless network consisting of two WLAN networks (WLAN1 and WLAN2) and an LTE network. A multi-mode user equipment (UE) in the heterogeneous wireless network can access any of the three networks. Due to the complexity in NSI dynamics, we adopt a discrete model similar to [8] to model the packet loss rate, delay and throughput of the networks. The packet loss probability of a network, e_network, in a slot can be any of the following N_e states:

e_{n_e} = e_{\min} + e_{unit}\, n_e, \quad n_e = 1, ..., N_e,

where e_min is the minimal packet loss rate, e_unit is a packet loss rate unit and N_e is the number of states. For example, when e_min = 0.001, e_unit = 0.0005 and N_e = 3, the packet loss rate is approximated by the three states 0.0015, 0.002 and 0.0025. Similarly, the network delay d_network is characterized by d_min, d_unit and N_d, and the throughput θ is characterized by θ_min, θ_unit and N_θ. Accordingly, the instantaneous NSI can be represented by the joint state (e, d, θ).

In order to fully present the behavior of our algorithms, we follow the idea of a synthetic network scenario in [35]. Specifically, we specify the parameter dynamic range of each network as shown in Table II. The parameters are set mainly with reference to [36] and to our trace data collected with the Android application "speedtest". By design, the three networks correspond to the optimal networks for the three types of traffic, respectively. The PSNR is assumed fixed for each network.

TABLE II: Synthetic network scenario in the simulation

Network | e_min | e_unit | N_e | d_min | d_unit | N_d | θ_min | θ_unit | N_θ | Psnr
LTE | 0.002 | 0.002 | 3 | 20 ms | 10 ms | 7 | 250 kbps | 50 kbps | 11 | 4 dB
WLAN1 | 0.002 | 0.002 | 4 | 50 ms | 10 ms | 6 | 400 kbps | 50 kbps | 18 | 5 dB
WLAN2 | 0.002 | 0.002 | 5 | 60 ms | 10 ms | 5 | 250 kbps | 50 kbps | 11 | 7 dB

The network access cost τ here refers to a nominal transmission fee. We assume that the WLANs operate in a closed access manner and that their transmission fees are slightly higher than that of LTE. Specifically, the costs are set to 1.1, 1.2 and 1.2 for LTE, WLAN1 and WLAN2, respectively. The transmission power or energy [39] could also be set as the cost.
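For concreteness, the three QoE functions (28)-(30) with the parameter values quoted above can be transcribed into Python as follows; this is an illustrative transcription rather than the simulation code used in the paper, and the base-10 logarithm in (30) is inferred from the stated values of b_3 and b_4 (which give MOS 1 at 100 kbps and about 4.5 at 2000 kbps):

    import math

    def mos_video(psnr_db, b1=1.0, b2=5.0):
        """Video MOS as a function of PSNR, per (28)."""
        return 4.5 - 3.5 / (1.0 + math.exp(b1 * (psnr_db - b2)))

    def r_factor(loss_rate, delay_ms, gamma1=11.0, gamma2=40.0, gamma3=10.0):
        """R-factor R_f = 94.2 - I_e - I_d with G.729a loss parameters."""
        i_e = gamma1 + gamma2 * math.log(1.0 + gamma3 * loss_rate)
        # (delay_ms > 177.3) plays the role of the indicator function I(d - 177.3).
        i_d = 0.024 * delay_ms + 0.11 * (delay_ms - 177.3) * (delay_ms > 177.3)
        return 94.2 - i_e - i_d

    def mos_audio(loss_rate, delay_ms):
        """Audio MOS from the R-factor, per (29)."""
        rf = r_factor(loss_rate, delay_ms)
        return 1.0 + 0.035 * rf + 7e-6 * rf * (rf - 60.0) * (100.0 - rf)

    def mos_elastic(throughput_kbps, b3=2.6949, b4=0.0235):
        """Elastic-traffic MOS from throughput, per (30), using log base 10."""
        return b3 * math.log10(b4 * throughput_kbps)

    # Example MOS values for one hypothetical NSI sample:
    print(mos_video(7.0), mos_audio(0.01, 110.0), mos_elastic(1200.0))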


Fig. 7: Expected regrets of ONES, D-ONES and VM-ONES in two scenarios ((a) Scenario 1, (b) Scenario 2; expected regret versus slot index).

As for the UE, the traffic arrivals follow a Poisson process and the traffic types are independent of each other. For the traffic type in each slot, the stationary probabilities of being video traffic, audio traffic and elastic traffic are P_1 = 0.2, P_2 = 0.4 and P_3 = 0.4, respectively. Due to the diversity and dynamics of traffic type distributions [39], we also evaluate the algorithms under different traffic type distributions.

B. Results

We present the simulation results from the following five aspects: (1) convergence validation of the proposed algorithms, (2) performance comparison with learning-based network selection algorithms without QoE consideration and traffic awareness, (3) performance comparison with non-learning-based network selection algorithms, (4) performance evaluation in a dynamic environment and (5) performance evaluation in multiuser scenarios.

1) Convergence Validation of ONES, D-ONES and VM-ONES: We first give sample runs of ONES, D-ONES and VM-ONES in scenario 1, where the joint NSI (e, d, θ) can be in any of its possible states in Table II with uniform probability in each network. In order to clearly see the network selection trend, each traffic lasts for only one slot. In Fig. 4, Fig. 5 and Fig. 6, the x-axis of each figure represents the slot number and the y-axis represents the ratio of selecting a network for a specific traffic type. We can observe in Fig. 4 that, as the slot number increases, the deterministic network selection policy achieved by ONES is π_ONES: {video → WLAN2, audio → LTE, elastic → WLAN1}. Fig. 5 and Fig. 6 illustrate that D-ONES and VM-ONES converge to the same network selection policy as ONES, i.e., π_D-ONES = π_VM-ONES = π_ONES.

To check whether the achieved deterministic stationary network selection policy is optimal, we simulated 10000 slots, averaged over 100 runs, for each of the |M|^|S| = 27 deterministic stationary network selection policies; the corresponding expected QoE reward rates are presented in Table III. In the table, we denote LTE, WLAN1 and WLAN2 as networks 1, 2 and 3 for simplicity. A deterministic stationary network selection policy is represented as (a-b-c), which indicates that the networks selected for the video traffic, the audio traffic and the elastic traffic are networks "a", "b" and "c", respectively.
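The synthetic scenario and traffic model above can be reproduced in spirit with a small generator such as the following sketch; the Table II values and the type probabilities are taken from the paper, while the random-number handling and the names are illustrative choices:

    import random

    TRAFFIC_TYPES = ["video", "audio", "elastic"]
    TYPE_PROBS = [0.2, 0.4, 0.4]                 # P1, P2, P3 from the setup above

    # (e_min, e_unit, N_e, d_min, d_unit, N_d, th_min, th_unit, N_th) per Table II
    NETWORK_PARAMS = {
        "LTE":   (0.002, 0.002, 3, 20, 10, 7, 250, 50, 11),
        "WLAN1": (0.002, 0.002, 4, 50, 10, 6, 400, 50, 18),
        "WLAN2": (0.002, 0.002, 5, 60, 10, 5, 250, 50, 11),
    }

    def sample_traffic_type(rng):
        return rng.choices(TRAFFIC_TYPES, weights=TYPE_PROBS, k=1)[0]

    def sample_nsi(network, rng):
        """Draw a joint NSI state (loss rate, delay in ms, throughput in kbps)
        uniformly from the discrete states of Table II, as in scenario 1."""
        e0, eu, ne, d0, du, nd, t0, tu, nt = NETWORK_PARAMS[network]
        return (e0 + eu * rng.randint(1, ne),
                d0 + du * rng.randint(1, nd),
                t0 + tu * rng.randint(1, nt))

    rng = random.Random(0)
    print(sample_traffic_type(rng), sample_nsi("WLAN1", rng))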


Fig. 8: Expected QoE reward rates of UCB1(delay), UCB1(throughput) and the proposed algorithms in two scenarios.

It can be easily found that the maximum expected QoE reward rate is 3.2177 and that the corresponding policy is (3-1-2), which is the same as the policy achieved above. We thus conclude that the proposed three algorithms converge to the optimal network selection policy. According to Proposition 1, the algorithms should finally converge to an optimal deterministic stationary policy; that is, each type of traffic has a corresponding preferred network. Comparing Fig. 4 and Fig. 5 with Fig. 6, it can be observed that VM-ONES converges much faster than the other two algorithms, and that D-ONES converges faster than ONES in this sample run.

To better understand the algorithms' performance, we compare the regrets of the three algorithms in two scenarios in Fig. 7. The first scenario is the same as scenario 1 above, where the optimal policy is {video → WLAN2, audio → LTE, elastic → WLAN1}. We change the throughput distributions of the networks to obtain scenario 2, where the optimal policy is {video → WLAN2, audio → LTE, elastic → LTE}. In the following simulations, each traffic lasts for a random number of slots in the range [1, 10]. The regrets are obtained by averaging over 100 runs, and each run is simulated for 10000 slots. We can see that the expected regrets of the three algorithms grow sub-linearly with the slot number in both scenarios. In addition, both D-ONES and VM-ONES have much smaller regrets than ONES, and VM-ONES achieves the smallest regret. This verifies the performance improvement of D-ONES and VM-ONES.

We also consider a non-independent changing NSI scenario (scenario 3). Note that in the above scenarios, the NSI in different slots is independent. We now consider a more general scenario where the joint NSI states are correlated in two successive slots. Specifically, the NSI of each network changes between two successive slots following a Markovian model, in which the NSI maintains its current state in the next slot with a relatively higher probability. We found that in scenario 3 the proposed algorithms are still effective and the results are similar to Fig. 7. We omit the results due to limited space.

0018-9545 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


TABLE III: Expected QoE reward rates of all deterministic stationary network selection policies

Policy     Rate       Policy      Rate       Policy      Rate
(1-1-1)    2.7741     (1-1-2)     2.9121     (1-1-3)     2.5267
(1-2-1)    2.6695     (1-2-2)     2.7729     (1-2-3)     2.4379
(1-3-1)    2.6563     (1-3-2)     2.7738     (1-3-3)     2.4201
(2-1-1)    2.8958     (2-1-2)     2.9818     (2-1-3)     2.6131
(2-2-1)    2.7676     (2-2-2)     2.8719     (2-2-3)     2.5241
(2-3-1)    2.7535     (2-3-2)     2.8647     (2-3-3)     2.5115
(3-1-1)    3.1185     (3-1-2)*    3.2177     (3-1-3)     2.8579
(3-2-1)    3.0008     (3-2-2)     3.0929     (3-2-3)     2.7542
(3-3-1)    2.9876     (3-3-2)     3.0798     (3-3-3)     2.7290

(* marks the optimal policy.)
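For reference, the optimal deterministic stationary policy can be read off Table III mechanically; the small Python check below does exactly that with the table's values. It is only a convenience for reading the table, not part of the proposed algorithms.

    from itertools import product

    # Expected QoE reward rates from Table III, keyed by the policy (a-b-c):
    # network "a" for video, "b" for audio, "c" for elastic
    # (1: LTE, 2: WLAN1, 3: WLAN2).
    RATES = {
        (1, 1, 1): 2.7741, (1, 2, 1): 2.6695, (1, 3, 1): 2.6563,
        (2, 1, 1): 2.8958, (2, 2, 1): 2.7676, (2, 3, 1): 2.7535,
        (3, 1, 1): 3.1185, (3, 2, 1): 3.0008, (3, 3, 1): 2.9876,
        (1, 1, 2): 2.9121, (1, 2, 2): 2.7729, (1, 3, 2): 2.7738,
        (2, 1, 2): 2.9818, (2, 2, 2): 2.8719, (2, 3, 2): 2.8647,
        (3, 1, 2): 3.2177, (3, 2, 2): 3.0929, (3, 3, 2): 3.0798,
        (1, 1, 3): 2.5267, (1, 2, 3): 2.4379, (1, 3, 3): 2.4201,
        (2, 1, 3): 2.6131, (2, 2, 3): 2.5241, (2, 3, 3): 2.5115,
        (3, 1, 3): 2.8579, (3, 2, 3): 2.7542, (3, 3, 3): 2.7290,
    }

    assert set(RATES) == set(product((1, 2, 3), repeat=3))  # all 27 policies
    best = max(RATES, key=RATES.get)
    print(best, RATES[best])  # (3, 1, 2) 3.2177, matching the policy learned by ONES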

2) Performance Comparison with Learning Based Network Selection Algorithms without QoE Consideration and Traffic Awareness: In this section, two learning based network selection algorithms without QoE consideration and traffic awareness are considered. The first one seeks the network offering the largest average throughput, where the UCB1 algorithm is adopted to learn the throughput-optimal network; we denote this algorithm as UCB1(throughput). The second algorithm adopts UCB1 to learn the network with the smallest average delay, denoted as UCB1(delay). We omit the details of these two algorithms, since they can be easily constructed with minor modifications to the original UCB1 algorithm in [20] (a sketch is given at the end of this subsection).

The expected QoE reward rates (called "reward rate" for simplicity in the simulations), averaged over 100 runs, of all algorithms in scenario 1 and scenario 2 are presented in Fig. 8. As can be seen, the reward rates of UCB1(delay) and UCB1(throughput) remain almost unchanged and are significantly smaller than those of D-ONES and VM-ONES due to the lack of QoE consideration and traffic awareness. On the other hand, although the reward rates of the three proposed algorithms generally increase with the slot number, their growth speeds are diverse. The reward rate of VM-ONES grows very fast and reaches a high level (up to a 10% gain over UCB1(delay)), while the reward rates of ONES and D-ONES are smaller. In particular, the reward rate of ONES is much smaller than those of D-ONES and VM-ONES, and even smaller than that of UCB1(throughput) within the first 1000 slots. We infer that the main reason is the slower convergence of ONES, which yields a very limited performance gain in its early stage, even worse than UCB1(throughput); however, ONES gradually improves its performance and finally outperforms UCB1(throughput). We point out that even though it takes about 100 slots for D-ONES and VM-ONES to achieve an outstanding reward rate, this does not hamper their application. On the one hand, the proposed algorithms perform selection online rather than selecting before any communication takes place: the key insight of this work is to leverage the data transmission process itself for learning, so that performance improvements are obtained throughout the process. On the other hand, the algorithms do not incur any additional cost except for a lightweight computation cost.
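For completeness, a minimal sketch of such a UCB1-style baseline is given below. It follows the standard UCB1 index from [20] applied to per-network throughput samples, i.e., UCB1(throughput); UCB1(delay) would use the negative of the delay samples instead. The exploration constant and the reward normalization are assumptions, not the exact settings of the simulations.

    import math

    class UCB1Selector:
        """Standard UCB1 over networks, rewarding e.g. normalized throughput."""

        def __init__(self, num_networks):
            self.counts = [0] * num_networks   # times each network was selected
            self.means = [0.0] * num_networks  # empirical mean reward per network

        def select(self):
            # Play each network once first, then maximize the UCB1 index.
            for m, c in enumerate(self.counts):
                if c == 0:
                    return m
            t = sum(self.counts)
            return max(
                range(len(self.counts)),
                key=lambda m: self.means[m]
                + math.sqrt(2.0 * math.log(t) / self.counts[m]),
            )

        def update(self, network, reward):
            # Incremental update of the empirical mean reward of the chosen network.
            self.counts[network] += 1
            n = self.counts[network]
            self.means[network] += (reward - self.means[network]) / n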

Fig. 9: Performance comparison with non learning-based algorithms in two scenarios.

3) Performance Comparison with Non Learning-based Network Selection Algorithms: We further compare the performance of the proposed algorithms with three non learning-based network selection algorithms (sketched below). The first algorithm always chooses a fixed network without distinguishing traffic types; depending on the selected network, there are three cases: LTE, WLAN1 and WLAN2. Note that such fixed network selection represents most existing non-dynamic network selection algorithms, i.e., selecting one network satisfying some predefined criterion without considering the NSI dynamics. The second algorithm randomly selects one network with equal probability in each slot, again without distinguishing traffic types. The third algorithm is assumed to know the instantaneous achievable throughput of all networks and selects the network with the maximal throughput in each slot; we denote it as max-throughput. VM-ONES is used for comparison. We carried out 50 runs for each algorithm and compare the average reward rates at the 1000th slot in the above two scenarios. Furthermore, four cases of traffic type combinations are considered in each scenario: video+audio+elastic (P1 = P2 = 0.3, P3 = 0.4), video+audio (P1 = P2 = 0.5), audio+elastic (P2 = P3 = 0.5) and video+elastic (P1 = P3 = 0.5). The corresponding results are shown in Fig. 9.

We can see that VM-ONES achieves better performance than the other algorithms except for the case of audio+elastic (P2 = P3 = 0.5) in both scenarios. In addition, the performance gains over the other algorithms vary across the traffic type combinations. This is because, due to the network diversity, the preferred network for each traffic type can be different; as a result, the achieved reward rates and the performance gains over other algorithms depend on the traffic type distribution. The proposed algorithms can effectively exploit the diversity of multiple networks. However, note that since LTE is fairly acceptable for both audio traffic and elastic traffic, the network diversity gain in the audio+elastic case is limited; hence, the fixed selection algorithm can perform better than VM-ONES in this case.
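A minimal sketch of the three non learning-based baselines follows; the network identifiers and the per-slot throughput lookup are hypothetical placeholders, since the baselines differ only in how a network is picked in each slot.

    import random

    NETWORKS = ["LTE", "WLAN1", "WLAN2"]

    def fixed_selection(network, slot_info):
        """Always choose the same network, regardless of traffic type or NSI."""
        return network

    def random_selection(slot_info, rng=random):
        """Pick one network uniformly at random in each slot."""
        return rng.choice(NETWORKS)

    def max_throughput_selection(slot_info):
        """Assumes the instantaneous achievable throughput of every network is
        known and greedily picks the largest one."""
        return max(NETWORKS, key=lambda n: slot_info["throughput"][n])

    # Example slot with hypothetical instantaneous throughputs (Mbit/s).
    slot = {"throughput": {"LTE": 6.0, "WLAN1": 11.0, "WLAN2": 4.0}}
    print(fixed_selection("LTE", slot), random_selection(slot), max_throughput_selection(slot))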

Fig. 10: The algorithms in the mobility scenario.

Fig. 11: The algorithms in the non-stationary NSI scenario.

Fig. 12: The convergence results of the algorithms in a multi-user scenario.

4) Performance Evaluation in Dynamic Environment: We further evaluate the proposed algorithms in two dynamic scenarios: a mobility scenario and a non-stationary scenario. In the mobility scenario (scenario 4), we assume that the user travels through different coverage areas of the networks. To take the effect of mobility into account, the coverage area of LTE and of each WLAN is divided into two concentric circles [34], where the inner zone and the outer zone differ in their (e, d, θ) distributions. Due to the radio transmission loss, the user has larger probabilities of being in relatively high QoS states in the inner zone than in the outer zone. We assume that the two partially overlapping WLANs (WLAN1 and WLAN2) are located in the coverage area of LTE, and that the user moves from the overlapping area of WLAN1 (inner zone) and LTE (inner zone) to the overlapping area of the three networks (outer zone of WLAN1, outer zone of WLAN2 and inner zone of LTE) at the 1000th slot, and finally to the overlapping area of WLAN2 (inner zone) and LTE (inner zone) at the 3000th slot. The result in Fig. 10 shows that the increase in the number of available networks at the 1000th slot incurs significant perturbations in the algorithms, while the effect of the decrease in the number of available networks at the 3000th slot is negligible. We infer that the adoption of new choices introduces additional uncertainty and results in more exploration by the algorithms. Compared with Fig. 8, we can observe that ONES and D-ONES adapt fairly well to the mobility effect, while VM-ONES is less robust (its performance even degrades after the 4000th slot).

In the non-stationary scenario (scenario 5), the NSI distribution changes due to network load dynamics. In this scenario, we assume that the NSI (e, d, θ) distribution is the same as in scenario 1 before the 1000th slot; at the 1000th slot, the NSI distribution changes to that of scenario 2. To better adapt to network load dynamics, we also consider heuristic versions of our algorithms, in which new samples of rewards and costs are given larger, fixed weights when computing ε̄_{m,s}(i) and τ̄_{m,s}(i) (a sketch of such a fixed-weight update is given below). As shown in Fig. 11, ONES and D-ONES can adapt to the new stable states, and their heuristic versions bring slight improvements over the original algorithms, while the heuristic VM-ONES shows a performance loss compared with the original algorithm. These results indicate that, due to the adoption of virtual samples, VM-ONES exhibits reduced stability compared with the other two algorithms in dynamic scenarios.
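The heuristic variants replace the sample-average update of the reward and cost estimates with a fixed-weight (exponential-forgetting) update. The sketch below shows the idea; the weight alpha is a hypothetical parameter, not the value used in the simulations.

    def sample_average_update(estimate, sample, count):
        """Original estimator: running average of all past samples."""
        return estimate + (sample - estimate) / count

    def fixed_weight_update(estimate, sample, alpha=0.1):
        """Heuristic estimator: each new sample keeps a fixed weight alpha, so
        old samples are forgotten geometrically, which helps when the NSI
        distribution changes over time (non-stationary scenario)."""
        return estimate + alpha * (sample - estimate)

    # Example: the same sample stream tracked by both estimators.
    stream = [1.0, 1.0, 1.0, 3.0, 3.0, 3.0]  # distribution shift after 3 samples
    avg, ewa = 0.0, 0.0
    for i, x in enumerate(stream, start=1):
        avg = sample_average_update(avg, x, i)
        ewa = fixed_weight_update(ewa, x, alpha=0.5)
    print(round(avg, 3), round(ewa, 3))  # the fixed-weight estimate tracks the shift faster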

5) Performance Evaluation in Multi-user Scenario: Finally, we evaluate the algorithms in a multi-user scenario, where the users' access behaviors can affect each other. Commonly, due to user congestion, the QoS performance achieved by a user degrades as the network load increases. To incorporate the effect of user congestion, the network load is classified into three states, low, medium and high, corresponding to the number of users in the network being less than a threshold L1, between L1 and L2 (L1 < L2), and larger than L2, respectively. The three load states correspond to three different NSI (e, d, θ) probability distributions for each user in each network. Generally, we assume that the probabilities of the NSI being in relatively better QoS states (e.g., larger throughput, smaller delay and smaller loss) decrease as the user congestion becomes heavier (a sketch of this congestion model is given below).

We test the convergence of the proposed algorithms in a scenario with 10 users, P2 = 0.5, P3 = 0.5, L1 = 3 and L2 = 10. The QoE reward rate per user, averaged over 50 runs, in Fig. 12 indicates that all users under the three algorithms converge to stable states and achieve improved performance. We then compare the algorithms with two existing multi-user network selection algorithms. The first one is the reinforcement learning based algorithm in [37]. In this algorithm, each user maintains a Q-value for each available network, which represents the learned knowledge about that network; all users simultaneously make network selection decisions and update their Q-values based on the received instantaneous rewards following certain rules (see [37]). The algorithm is able to converge to a state where all users achieve the same expected reward. The other algorithm is the RAT selection algorithm in [38], which is similar to best response: the user always selects the network with the maximal expected throughput (stricter conditions are required in [38]). Note that only one user is allowed to access one network at a time. In the simulations, the user is assumed to know the throughput statistics of all networks in each slot and, in the RAT selection algorithm, selects the network with the maximal expected throughput. We also vary the number of users from 5 to 10, 15 and 20, and consider three traffic distributions: P2 = 0.3, P3 = 0.7; P2 = P3 = 0.5; and P2 = 0.7, P3 = 0.3. The QoE reward rate per user, averaged over 50 runs, for the different user numbers and traffic distributions is shown in Fig. 13. It is seen that the per-user performance of all algorithms degrades as the number of users increases. Except for the case with 10 users and P2 = 0.3, P3 = 0.7, VM-ONES shows a performance gain over the other two algorithms. We argue that the performance improvement of VM-ONES derives from its QoE awareness and its learning ability. The results indicate that the proposed algorithm can also achieve good performance in the multi-user scenario; moreover, the achieved performance again depends on the traffic type distribution.
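The congestion model described above can be sketched as follows; the thresholds match the values used in the experiment (L1 = 3, L2 = 10), while the mapping from a load state to an NSI distribution is a hypothetical placeholder.

    def load_state(num_users_in_network, l1=3, l2=10):
        """Classify the network load from the number of users attached to it."""
        if num_users_in_network < l1:
            return "low"
        if num_users_in_network <= l2:
            return "medium"
        return "high"

    # Hypothetical per-load-state probabilities of observing a "good" joint NSI
    # state (e, d, theta); better QoS becomes less likely as congestion grows.
    P_GOOD_NSI = {"low": 0.7, "medium": 0.4, "high": 0.2}

    for n in (2, 7, 15):
        s = load_state(n)
        print(n, s, P_GOOD_NSI[s])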

Fig. 13: Performance comparison in the multi-user scenario (QoE reward rate per user versus the number of users, for P2 = 0.3, P3 = 0.7; P2 = P3 = 0.5; and P2 = 0.7, P3 = 0.3).

VII. CONCLUSIONS

In this paper, we studied a traffic-aware online network selection framework for heterogeneous wireless networks. From the perspective of user demand, we aim to maximize the user's QoE reward. To address the availability issue of network state information in a practical network environment, we formulate the network selection problem as a continuous time multi-armed bandit problem. Three traffic-aware online network selection algorithms, ONES, D-ONES and VM-ONES, are proposed. By differentiating the traffic into multiple types with distinct QoE demands, the three algorithms attempt to match each type of traffic with its respective best network. The algorithms all achieve the optimal logarithmic-order regret. In particular, D-ONES and VM-ONES obtain better online learning performance than ONES. Compared with other algorithms, e.g., non learning-based algorithms and learning based algorithms without QoE consideration and traffic awareness, the proposed online network selection algorithms show considerable performance gains. In our future work, we expect to conduct simulations in a more detailed simulation environment. In addition, algorithms designed specifically for multi-user scenarios are needed.

ACKNOWLEDGMENTS

This work was supported by the NSF of China under Grant No. 61232018, 61272487, 61401508, and in part by the Jiangsu Province NSF of China under Grant No. BK2011116.

APPENDIX
PROOF OF THEOREM 2



Proof: We first give the following lemma.

Lemma 1 (Proposition 2 in [25]): The following bound holds for the expected regret of any arbitrary policy π:
\[
E[R(i)] \le \sum_{s \in S} L^{*}(s)\, E\Big[\sum_{t=1}^{i} I\{\pi_t(s) \ne \pi^{*}(s),\ s_t = s\}\Big]
= \sum_{s \in S} p_s\, L^{*}(s)\, E\Big[\sum_{t=1}^{i} I\{\pi_t(s) \ne \pi^{*}(s)\}\Big],
\]
where $L^{*}(s) = \max_{j} \{\Gamma_{j,s}\, g^{*} - E_{j,s}\}$ is the largest loss under side information $s$, and $p_s$ is the stationary probability of side information $s$.
















According to Lemma 1, the expected regret of D-ONES satisfies
\[
E[R(i)] \le \sum_{s \in S} p_s\, L^{*}(s)\, E\Big[\sum_{t=1}^{i} I\{\pi_t(s) \ne \pi^{*}(s)\}\Big]
= \sum_{s \in S} p_s\, L^{*}(s) \sum_{m \ne \pi^{*}(s)} E\Big[\sum_{t=1}^{i} I\{\pi_t(s) = m\}\Big].
\]
The above formula implies that the regret consists of the losses of sub-optimal network selections for each traffic type $s \in S$. Denote
\[
Q_{m,s}(i) = \sum_{t=1}^{i} I\{\pi_t(s) = m\}
\]

for any $s \in S$ and $m \notin \pi^{*}(s)$, where $\pi^{*}$ is the optimal network selection policy. Then,
\[
Q_{m,s}(i) = 1 + \sum_{t=N+1}^{i} I\{a(t) = m\}
\le l + \sum_{t=N+1}^{i} I\{a(t) = m,\ Q_{m,s}(t-1) \ge l\}
\]
\[
\le l + \sum_{t=N+1}^{i} I\Big\{ \bar{\varepsilon}_{\pi^{*}(s),s}(t-1) + \phi_{t-1,Q^{*}_{s}(t-1)}
- \big[\bar{\tau}_{Q^{*}_{s}(t-1),s}(t-1) - \tau_{\max}\,\phi_{t-1,Q^{*}_{s}(t-1)}\big]\bar{g}^{*}_{t-1}
\le \bar{\varepsilon}_{m,s}(t-1) + \phi_{t-1,Q_{m,s}(t-1)}
- \big[\bar{\tau}_{m,s}(t-1) - \tau_{\max}\,\phi_{t-1,Q_{m,s}(t-1)}\big]\bar{g}^{*}_{t-1},\ Q_{m,s}(t-1) \ge l \Big\}
\]

\[
= l + \sum_{t=N+1}^{i} I\Big\{ \bar{\varepsilon}_{\pi^{*}(s),s}(t-1) - \bar{\tau}_{Q^{*}_{s}(t-1),s}(t-1)\,\bar{g}^{*}_{t-1}
+ \big(1 + \bar{g}^{*}_{t-1}\tau_{\max}\big)\phi_{t-1,Q^{*}_{s}(t-1)}
\le \bar{\varepsilon}_{m,s}(t-1) - \bar{\tau}_{m,s}(t-1)\,\bar{g}^{*}_{t-1}
+ \big(1 + \bar{g}^{*}_{t-1}\tau_{\max}\big)\phi_{t-1,Q_{m,s}(t-1)},\ Q_{m,s}(t-1) \ge l \Big\}
\]
\[
\le l + \sum_{t=N+1}^{i} I\Big\{ \min_{0<\,\cdots}\ \bar{\varepsilon}_{\pi^{*}(s),s}(t-1) - \bar{\tau}_{Q^{*}_{s}(t-1),s}(t-1)\,\bar{g}^{*}_{t-1}\ \cdots
\]

According to the Hoeffding inequality [34], for (33),
\[
P\Big\{ \bar{X}_{m,s}(t-1) \ge E_{m,s} - \Gamma_{m,s}\, g^{*} + \big(1 + \bar{g}^{*}_{t-1}\tau_{\max}\big)\phi_{t,t_2} \Big\}
\le \exp\Big\{ \frac{-2\, t_2^{2}\, \big(1 + \bar{g}^{*}_{t-1}\tau_{\max}\big)^{2}\, \phi_{t,t_2}^{2}}{t_2\, \big(1 + \bar{g}^{*}_{t-1}\tau_{\max}\big)^{2}} \Big\}
= \exp\{-4 \log t\}.
\]
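The last equality can be checked directly; the specific form of the padding function $\phi_{t,t_2}$ is not restated in the text above, so the value $\sqrt{2\log t / t_2}$ used below is an inference from the displayed bound and should be read as an assumption:
\[
\text{With } \phi_{t,t_2} = \sqrt{\tfrac{2\log t}{t_2}}, \qquad
\frac{-2\, t_2^{2}\, \big(1+\bar{g}^{*}_{t-1}\tau_{\max}\big)^{2}\, \phi_{t,t_2}^{2}}{t_2\, \big(1+\bar{g}^{*}_{t-1}\tau_{\max}\big)^{2}}
= -2\, t_2\, \phi_{t,t_2}^{2} = -4\log t,
\]
so the deviation probability is $\exp\{-4\log t\} = t^{-4}$, which is summable over $t$.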