2015 IEEE International Conference on Web Services

A Novel QoS Monitoring Approach Sensitive to Environmental Factors

Pengcheng Zhang1, Yuan Zhuang1, Hareton Leung2, Wei Song3, Yu Zhou4
1 College of Computer and Information, Hohai University, Nanjing, China, 211110
2 Department of Computing, Hong Kong Polytechnic University, Hong Kong, China
3 School of Computer Science & Technology, Nanjing University of Science and Technology
4 College of Computer Science & Technology, Nanjing University of Aeronautics and Astronautics
Email: [email protected]; [email protected]; [email protected]

Abstract—The quality of a service-oriented system relies heavily on third-party services. Such reliance results in many uncertainties, given the complex and changeable network environment. Hence, service-oriented systems require effective runtime monitoring techniques. Several monitoring approaches have been proposed. However, none of these approaches considers the influence of environmental factors such as the positions of servers and users, and the load at runtime. Ignoring these influences, which exist throughout the monitoring process, may cause wrong monitoring results. In order to solve this problem, this paper proposes a novel QoS monitoring approach sensitive to environmental factors called wBSRM (weighted Bayesian Runtime Monitoring), based on weighted naïve Bayesian classification and TF-IDF (Term Frequency-Inverse Document Frequency). The proposed approach measures the influence of environmental factors with the TF-IDF algorithm and then constructs a weighted naïve Bayesian classifier by learning from part of the samples to classify monitoring results. Experiments are conducted on both a public network data set and a randomly generated data set. The experimental results demonstrate that our approach outperforms previous approaches.

Index Terms—Quality of Service; TF-IDF algorithm; weighted naïve Bayesian classifiers; monitor

I. Introduction

In recent years, SOA (Service Oriented Architecture) has been increasingly widely adopted within and across enterprises [6]. The success of SOA increasingly depends on the QoS (Quality of Service) of third-party services. Because of their dynamically evolving nature, runtime quality assurance is required for these third-party services, since many actions, such as changing user requirements and version updates, can change a service at runtime. Runtime monitoring is a technique that focuses on verifying the correctness or reliability of software systems at runtime, and on detecting the shortcomings or anomalies that may trigger adaptation, self-optimization, and reconfiguration [5]. Recently there has been increasing research on monitoring QoS properties of SOA, such as performance, reliability, and availability [8]. Most QoS properties can be expressed as probabilistic quality properties [8]. For example, service reliability can be described as "The average running time without failures of this service in one year is 95%", and response time can be described as "The probability that the response time is within 8s after sending an invoking operation to this service is 90%". Consequently, existing QoS monitoring approaches heavily reuse monitoring approaches for probabilistic quality properties [10], [9], [21]. Among them, the ProMon (Probabilistic Monitor) approach based on the SPRT (Sequential Probability Ratio Test) is widely used [10], [19]. The idea is based on Wald's hypothesis testing theory [18]: it first makes hypotheses (H0 and H1) about the overall property, then judges whether a hypothesis should be accepted or rejected according to statistical inference over the samples. Although classical hypothesis testing is widely used in QoS monitoring, it has some clear drawbacks. Firstly, we need a significance level α to calculate the rejection region of the hypothesis; depending on the significance level, we can sometimes reach opposite conclusions. Secondly, it is sometimes invalid to check against the rejection region given by α: when the sample is large enough, the posterior probability of H0 is near 1, which is known as the "Lindley paradox" [3]. In order to deal with these problems, Zhu et al. [21] propose a new approach based on Bayesian statistics. It directly calculates the posterior probabilities of H0 and H1, and checks results by the ratio of the posterior probabilities. However, existing Bayesian algorithms cannot overcome a drawback of the Bayesian method itself, i.e., its independence assumption does not suit all situations. In QoS monitoring, fluctuations of the running environment and contexts can greatly change the running environment of a service [16]. Each sample has its own "identification", i.e., the sample's time, the client's position, the server's properties, and so on. All of these factors affect the impact of a sample on the overall determination [15]. Hence, failures introduced by the Bayesian algorithm may be fatal and eventually lead to wrong monitoring results. The influence of environmental factors has been considered in QoS prediction [16], dynamic web service composition [15], and monitoring [11], [17]. However, existing QoS monitoring approaches do not quantify environmental factors. In order to solve this problem, this paper applies the TF-IDF algorithm [1] to calculate the impact of environmental factors on QoS monitoring and proposes a novel QoS monitoring approach called wBSRM (weighted Bayesian Runtime Monitoring) based on weighted naïve Bayesian classification [2].


The approach consists of a training stage and a monitoring stage. During training, we set samples satisfying the QoS property as sort c0, and those that do not satisfy it as sort c1; a naïve Bayesian classifier is obtained by processing part of the samples with the TF-IDF algorithm. During monitoring, the naïve Bayesian classifier is invoked for each sample, and the ratio of the posterior probability of sort c0 (the QoS property is satisfied) to that of sort c1 (the property is not satisfied) is obtained. By analyzing this ratio, we can determine whether the sample satisfies the QoS property, violates it, or cannot yet be decided. The contributions of this paper are summarized as follows:
• A novel QoS monitoring approach sensitive to environmental factors, called wBSRM, is proposed. By calculating weights with the TF-IDF algorithm, the approach takes the impact of environmental factors into account; the weight of each sample is considered by the naïve Bayesian classifier when producing the final monitoring results.
• wBSRM and its corresponding algorithms are validated by a set of dedicated experiments. The experimental data are based on both an open source network data set and a randomly generated data set. Experimental results show that wBSRM is more efficient than current QoS monitoring approaches.

The rest of this paper is organized as follows. Section 2 reviews related work and discusses its limitations. Section 3 provides some background material. In Section 4, the wBSRM approach is proposed. The experimental validation of wBSRM is presented in Section 5. Finally, Section 6 offers some conclusions and suggestions for future work.

II. Related work

Several monitoring approaches for probabilistic quality properties have been proposed recently. Chan et al. [4] provide a platform based on .NET applications to monitor PCTL properties. It computes the ratio of successful samples to total samples as the probability estimate and then compares it with the pre-defined probability requirement. The conclusion of this approach lacks statistical backing, and there may be a relatively large error between the obtained result and the actual result. Lee et al. [14] propose a monitoring approach based on the MaC framework, which extends MEDL with probabilities [13]. Grunske and Zhang [10] propose ProMo, which introduces the probabilistic logic CSLmon into runtime monitoring; CSLmon is used to define probabilistic properties. ProMo uses hypothesis testing with significance levels α and 1−β. However, these hypothesis testing approaches cannot support continuous monitoring. Grunske [9] improves SPRT (iSPRT) by adopting a backtracking method and reusing previous monitoring information to realize dynamic monitoring. When the real probability approximately equals the required probability, a large number of monitoring results fall into the neutral zone. Zhu et al. [21] propose a probabilistic monitoring approach called BaProMon, which is based on two algorithms: BSRM (Bayesian Statistics Runtime Monitoring) and improved BSRM (iBSRM). BaProMon calculates Bayesian factors and implements hypothesis testing. As BaProMon is influenced by the prior distribution, choosing a proper prior probability is a significant problem. Furthermore, none of the approaches described above considers the influence of environmental factors. Neglecting environmental factors may produce wrong monitoring results or require more time to draw conclusions. In order to solve this problem, a new QoS monitoring approach that is sensitive to environmental factors is proposed.

III. Preliminaries

A. Naïve Bayesian classifier

The ideological foundation of naïve Bayes can be summarized as follows [2]: for a sample to be classified, the probabilities that it belongs to the different sorts are calculated, and the sort with the largest probability is the sort to which the sample should be assigned. Let C = {c0, c1} be the pre-defined collection of sorts and X = {x1, ..., xn} a sample vector. According to the Bayesian formula:

P(c_i|X) = \frac{P(c_i)\,P(X|c_i)}{P(X)}    (1)

where P(X|c_i) can hardly be estimated directly, because X is an n-dimensional vector and n may range from thousands to tens of thousands. To simplify the estimation of P(X|c_i), naïve Bayes assumes that, given sort c_i, the components x_k of X are mutually independent. Thus, the conditional probability given sort c_i can be expressed as:

P(X|c_i) = \prod_{k=1}^{n} P(x_k|c_i)    (2)

According to (2), P(c_i|X) can be calculated as follows:

P(c_i|X) = \frac{P(c_i)\,\prod_{k=1}^{n} P(x_k|c_i)}{P(X)}    (3)

In fact, the value of P(X) is the same for all c_i [12]. Consequently, the sort that maximizes the numerator is the sort to which X should be classified. Because the classification process relies on the naïve Bayesian hypothesis, i.e., the attributes are mutually independent, it is called the naïve Bayesian classification approach. The process is as follows.
Training stage: the sample data set is trained, and both the conditional probabilities P(x_k|c_i) and the class probabilities P(c_i) are obtained.
Classifying stage: the posterior probability of each sort is calculated, and the sort that maximizes the posterior probability is returned:

C(X) = \arg\max_{c_i \in C} \{P(c_i)\,P(X|c_i)\}    (4)

This is the model of naïve Bayesian classification.
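To make the training and classifying stages concrete, the following Python sketch implements a plain categorical naïve Bayesian classifier along the lines of Eqs. (1)-(4). It is only an illustration of this preliminary model, not the authors' wBSRM implementation; the environmental-factor values and the add-one smoothing are our assumptions.

```python
from collections import Counter, defaultdict

class NaiveBayesClassifier:
    """Plain categorical naive Bayes following Eqs. (1)-(4)."""

    def fit(self, samples, labels):
        self.classes = sorted(set(labels))
        self.class_size = Counter(labels)
        # class probability P(c_i)
        self.prior = {c: self.class_size[c] / len(samples) for c in self.classes}
        # raw counts for the conditional probabilities P(x_k | c_i)
        self.counts = {c: defaultdict(Counter) for c in self.classes}
        for x, c in zip(samples, labels):
            for k, value in enumerate(x):
                self.counts[c][k][value] += 1
        return self

    def _cond(self, c, k, value):
        # P(x_k = value | c) with add-one smoothing so unseen values keep non-zero mass
        vocab = set()
        for cc in self.classes:
            vocab.update(self.counts[cc][k])
        return (self.counts[c][k][value] + 1) / (self.class_size[c] + max(len(vocab), 1))

    def classify(self, x):
        # Eq. (4): arg max over c of P(c) * prod_k P(x_k | c); P(X) cancels out
        def score(c):
            s = self.prior[c]
            for k, value in enumerate(x):
                s *= self._cond(c, k, value)
            return s
        return max(self.classes, key=score)

# Toy usage: each sample is a tuple of environmental factors (values are made up);
# c0 = QoS property satisfied, c1 = QoS property violated.
train = [("EU", "low-load"), ("EU", "low-load"), ("US", "high-load"), ("US", "high-load")]
labels = ["c0", "c0", "c1", "c1"]
clf = NaiveBayesClassifier().fit(train, labels)
print(clf.classify(("US", "high-load")))  # -> c1
```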

B. Parameter estimation for the Binomial distribution

In a success-or-failure experiment, many parameter estimation problems can be reduced to estimating the Bernoulli parameter. Let the occurrence probability of event A be θ (0 < θ < 1). In n independent experiments, the probability that event A occurs x times is

P(x|\theta) = \binom{n}{x}\theta^{x}(1-\theta)^{n-x}    (5)

The Bayesian approach regards θ as a random variable and sets c(θ) as its prior distribution, applying the Bayesian formula to estimate θ from the observed samples. The approach in this paper adopts empirical Bayesian (EB) estimation, which combines the classical approach and the Bayesian approach to estimate θ [7], with the prior

c(\theta) = \begin{cases} \frac{1}{\lambda}, & 0 < \theta < \lambda \\ 0, & \text{otherwise} \end{cases}    (6)

The marginal distribution P_G(x) of x is defined as:

P_G(x) = \int c(\theta)P(x|\theta)\,d\theta = \frac{1}{\lambda}\int_{0}^{\lambda}\binom{n}{x}\theta^{x}(1-\theta)^{n-x}\,d\theta    (7)

E(x) = \int x\,P_G(x)\,dx = \int_{0}^{\lambda}\frac{n\theta}{\lambda}\,d\theta = \frac{n\lambda}{2}    (8)

If there exist samples x_1, ..., x_m, let E(x) = \frac{1}{m}\sum_{i=1}^{m} x_i = \bar{x}; then \frac{n}{2}\lambda = \bar{x}. Because 0 < θ < 1, λ must satisfy λ ≤ 1, so we set

\lambda = \min\left\{1, \frac{2\bar{x}}{n}\right\}    (9)

Once λ is determined, we can estimate θ, i.e., obtain \hat{\theta}, based on the squared loss. From (5) and (6), we can get the posterior probability of θ:

h(\theta|x) = \frac{P(x|\theta)c(\theta)}{\int c(\theta)P(x|\theta)\,d\theta}    (10)

Since the kernel of the posterior is \theta^{x}(1-\theta)^{n-x},

\hat{\theta} = E(\theta|x) = \int \theta\,h(\theta|x)\,d\theta = \frac{\int_{0}^{\lambda}\theta^{x+1}(1-\theta)^{n-x}\,d\theta}{\int_{0}^{\lambda}\theta^{x}(1-\theta)^{n-x}\,d\theta}    (11)

Because x is an integer, the integrals in (11) can be calculated easily. It can then be shown that \hat{\theta} is an increasing function of λ. This means that, for the same observed samples, the larger λ is, the larger \hat{\theta} is; since λ is directly proportional to the average of the historical samples, the larger that average is, the larger \hat{\theta} is. In other words, in EB estimation \hat{\theta} depends on the observed samples as well as on the prior information. When λ = 1, (6) becomes the Bayesian hypothesis without prior information, and \hat{\theta} determined by (11) is

\hat{\theta} = \frac{\Gamma(x+2)\Gamma(n+2)}{\Gamma(n+3)\Gamma(x+1)} = \frac{x+1}{n+2}
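As a numerical illustration of Eqs. (9) and (11), the following sketch estimates λ and θ̂ by direct numerical integration. It is not taken from the paper; the window size and success counts are hypothetical, and the λ = 1 case is checked against the closed form (x + 1)/(n + 2) derived above.

```python
def lambda_hat(success_counts, n):
    """Eq. (9): lambda = min(1, 2 * mean(x) / n) for windows of n invocations each."""
    x_bar = sum(success_counts) / len(success_counts)
    return min(1.0, 2.0 * x_bar / n)

def theta_hat(x, n, lam, steps=10_000):
    """Eq. (11): posterior mean of theta under the uniform prior on (0, lambda)."""
    def integrate(f):
        # midpoint rule over (0, lam); adequate for this smooth integrand
        h = lam / steps
        return sum(f((i + 0.5) * h) for i in range(steps)) * h
    numerator = integrate(lambda t: t ** (x + 1) * (1 - t) ** (n - x))
    denominator = integrate(lambda t: t ** x * (1 - t) ** (n - x))
    return numerator / denominator

# Sanity check: with lam = 1 (no prior information) Eq. (11) reduces to (x + 1) / (n + 2).
x, n = 9, 10
print(round(theta_hat(x, n, 1.0), 4), round((x + 1) / (n + 2), 4))  # 0.8333 0.8333

# Hypothetical history: successes observed in four windows of n = 10 invocations each.
history = [9, 8, 10, 9]
lam = lambda_hat(history, n)  # min(1, 2 * 9 / 10) = 1.0
print(lam, round(theta_hat(x, n, lam), 4))
```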

       

C. TF-IDF (term frequency-inverse document frequency)

TF-IDF [1] is a widely used weighting technique in information retrieval and a key technique for measuring the relevance between a web page and a query. TF represents the occurrence frequency of the query term in a single page: the higher this frequency, the higher the relevance between the query and that page. IDF is based on the occurrence frequency of the query term across all pages: the rarer the term, the higher its IDF value. Some researchers have also shown that IDF corresponds to a cross entropy of probability distributions under specific conditions. In general, the stronger a word's capacity to predict a theme, the higher its weight.
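The following sketch shows the standard TF-IDF computation described above. Treating each monitored sample's combination of environmental factors as a "document" whose factor values are the "terms" is our illustrative reading; the paper's exact weighting formula for wBSRM is not reproduced here.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Standard TF-IDF: `documents` is a list of token lists.
    Returns, for each document, a dict mapping term -> weight."""
    n_docs = len(documents)
    # document frequency: number of documents in which each term occurs
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

# Illustrative use: one "document" per monitored sample, whose "terms" are the
# values of its environmental factors (client region, server node, load level, ...).
samples = [
    ["EU", "server-1", "high-load"],
    ["EU", "server-2", "low-load"],
    ["US", "server-1", "high-load"],
]
for w in tf_idf(samples):
    print(w)
```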

IV. Weighted Bayesian Runtime Monitoring approach

In this section, we first give an overview of wBSRM, then a theoretical description of wBSRM, and finally the detailed algorithms of wBSRM.

A. Overview of wBSRM

The architecture of wBSRM is shown in Figure 1. The main modules of wBSRM are described as follows.

Fig. 1. wBSRM architecture overview

Construct the Bayesian classifier. We first eliminate those historical samples lacking information, which reduces erroneous data and yields relatively accurate prior information. The prior distribution (the Bernoulli-based formula) is obtained from the empirical estimation of the binomial distribution and is used to construct the naïve Bayesian classifier.

Compute the weights of factors via TF-IDF. The TF-IDF algorithm is used to construct the weighted naïve Bayesian classifier. We first monitor the web service with the naïve Bayesian classifier; then the TF-IDF algorithm is applied to compute the factor weights. In wBSRM, we treat environmental factors such as the user's location, the server's location, and network performance as the key factors that affect QoS, as well as the information that quantifies a sample's weight. We define the collection of these factors as a combination of impact factors. In fact, it is hard to measure the effect of the weight of each individual factor; consequently, we only consider the overall influence of the combination of all impact factors on the classification of the sample collection. After training a set of samples, we can get the influence of each sample on the classification of the whole sample set. According to the basic
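To illustrate how a weighted naïve Bayesian classifier can drive the three-way monitoring decision described in the introduction, the sketch below computes the ratio of the posterior probabilities of sorts c0 and c1 with per-factor weights. The exponent-style weighting and the decision thresholds are assumptions made for illustration, not the exact wBSRM formulas.

```python
import math

def weighted_posterior_ratio(x, weights, prior, cond):
    """Ratio P(c0 | X) / P(c1 | X) for a weighted naive Bayes model.
    The weights w_k enter as exponents on P(x_k | c), one common form of
    weighted naive Bayes; the paper's TF-IDF-based weighting may differ in detail.
    prior: {"c0": p0, "c1": p1}; cond[c][k]: dict mapping value -> P(x_k = value | c)."""
    log_ratio = math.log(prior["c0"]) - math.log(prior["c1"])
    for k, value in enumerate(x):
        p0 = cond["c0"][k].get(value, 1e-6)  # small floor for values unseen in training
        p1 = cond["c1"][k].get(value, 1e-6)
        log_ratio += weights[k] * (math.log(p0) - math.log(p1))
    return math.exp(log_ratio)

def monitor(x, weights, prior, cond, upper=3.0, lower=1 / 3.0):
    """Three-way decision for one monitored sample; the thresholds are illustrative."""
    ratio = weighted_posterior_ratio(x, weights, prior, cond)
    if ratio >= upper:
        return "satisfied"      # evidence favours c0
    if ratio <= lower:
        return "violated"       # evidence favours c1
    return "undetermined"       # keep collecting samples
```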