
Journal of Network and Systems Management, Vol. 13, No. 4, December 2005 (© 2005) DOI: 10.1007/s10922-005-9003-8

Backward Inference in Bayesian Networks for Distributed Systems Management

Jianguo Ding,1,5 Bernd Krämer,2 Yingcai Bai,3 and Hansheng Chen4

Published online: 10 December 2005

The growing complexity of distributed systems in terms of hardware components, operating systems, communication and application software, together with the huge number of dependencies among them, has increased the demand for distributed management systems. An efficient distributed management system needs to work effectively even in the face of incomplete management information, uncertain situations, and dynamic changes. In this paper, Bayesian networks are proposed to model dependencies between managed objects in distributed systems. The strongest dependency route (SDR) algorithm is developed for backward inference in Bayesian networks. The SDR algorithm can track the strongest causes and trace the strongest routes between particular effects and their causes; it also yields a ranking of the causes by strength of dependency. Backward inference thus provides an efficient mechanism for fault location and is beneficial for performance management.

KEY WORDS: Distributed systems management; uncertainty; Bayesian networks; backward inference.

1 Software Engineering Institute, East China Normal University, Shanghai 200062, P. R. China.
2 Department of Electrical Engineering and Information Engineering, FernUniversität Hagen, Germany.
3 Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, P. R. China.
4 East-China Institute of Computer Technology, Shanghai, P. R. China.
5 To whom correspondence should be addressed at Software Engineering Institute, East China Normal University, Shanghai 200062, P. R. China. E-mail: [email protected].

© 2005 Springer Science+Business Media, Inc. 1064-7570/05/1200-0409/0

1. INTRODUCTION

With the growth of distributed systems in size, heterogeneity, pervasiveness, and complexity of applications and network services, effective management of distributed systems becomes both more important and more difficult. Managers have to live with unstable, uncertain and incomplete management information. Individual hardware defects or software errors, or combinations of such defects and
errors in different system components may cause the degradation of services of other (remote) components in the network, or even their complete failure, due to functional dependencies between managed objects. Hence an effective distributed fault detection mechanism is needed to support rapid decision making in distributed systems management and to allow for partial automation of fault correction.

In the past decade, a great deal of research effort has focused on improving fault detection and diagnosis in management systems. Rule-based methods were proposed for fault detection [1, 2]. Finite state machines were used to model fault propagation behavior and duration [3, 4]. Coding-based methods [5, 6] and case-based methods [7, 8] were used for fault identification and isolation. A dependency graph model was used for fault isolation in integrated management systems [9]. However, most of these solutions are based on deterministic mechanisms and are very sensitive to "noise" (such as loss, delay, or corruption of messages). They are therefore unable to deal effectively with incomplete and imprecise management information. Probabilistic reasoning is another effective approach for fault detection in distributed systems management [10–13].

On the practical side, most current commercial management software, such as IBM Tivoli, HP OpenView, or Cisco's series of network management software, supports the integration of different management domains, collects information, performs remote monitoring, raises fault alarms, and computes some statistics on management information. But these products still lack facilities for exact fault localization or for the automatic execution of appropriate fault recovery actions. In practice, a typical figure for online fault identification is 95% fault location accuracy; the remaining 5% of faults cannot be located and recovered in due time [14].
Hence, for large distributed systems including thousands of managed components, locating the root causes of a failure by exhaustive search may be time-consuming and difficult to complete within a short time, and it may interrupt important services in the system. In this paper, Bayesian networks are applied to model dependencies among managed objects and to provide efficient methods to locate the root causes of failure situations in the presence of imprecise management information. The ultimate goal is to automate part of the daily management business.

Bayesian networks are an effective means of modeling probabilistic knowledge by representing cause-and-effect relationships among key entities of a managed system. They can be used to generate useful predictions about future faults and decisions even in the presence of uncertain or incomplete information. Bayesian networks have been applied to problems in medical diagnosis [15, 16], map learning [17] and language understanding [18].

In distributed systems management, the general operation is to infer particular causes from the observation of effects, particularly in fault diagnosis and malfunction recovery. To improve the performance of distributed systems management, the key routes between managed objects need to be traced and the key causes of observed errors or failures need to be identified. In this paper a
strongest dependency route (SDR) algorithm for backward inference in Bayesian networks is presented. The SDR algorithm allows users to trace the strongest dependency route from undesired effects to their causes, so that the most probable causes are investigated first. The algorithm also provides a dependency ranking of the causes of particular effects.

The application of Bayesian networks to distributed systems management is discussed in Section 2. Section 3 presents the strongest dependency route (SDR) algorithm for backward inference in Bayesian networks. Section 4 concludes and identifies directions for further research.

2. BAYESIAN NETWORKS MODEL FOR DISTRIBUTED SYSTEMS MANAGEMENT

2.1. Introduction of Bayesian Networks

Bayesian networks (BNs), also known as Bayesian belief networks, belief networks, causal networks or probabilistic networks, are an important knowledge representation in Artificial Intelligence [19–22]. Bayesian networks use directed acyclic graphs (DAGs) with probability labels to represent probabilistic knowledge. A Bayesian network can be defined as a triplet (V, L, P), where V is a set of variables (nodes of the DAG), L is the set of causal links among the variables (the directed arcs between nodes of the DAG), and P is a set of probability distributions defined by P = {p(v|π(v)) | v ∈ V}, where π(v) denotes the parents of node v. The DAG is commonly referred to as the dependence structure of a Bayesian network.

In Bayesian networks, the information included in one node depends on the information of its predecessor nodes. The former denotes an effect node; the latter represent its causes. This dependency relationship is quantified by a probability in the interval [0, 1]. Note that an effect node can itself be a cause of other nodes, in which case it plays the role of a cause node.
An important advantage of BNs is that they avoid building huge joint probability distribution tables that include permutations of all the nodes in the network. Rather, for an effect node, only the states of its immediate predecessors need to be considered. Because of this dense knowledge representation, BNs can represent large amounts of interconnected and causally linked data as they occur in distributed systems. Generally speaking:

(1) BNs can represent knowledge in depth by modeling the functionality of the transmission network in terms of cause-and-effect relationships among network components and network services.
(2) They can provide guidance in diagnosis. Calculations over a BN can determine both the precedence of detected effects and the network part that needs further investigation in order to provide a finer grained diagnosis.
(3) They have the capability of handling uncertain and incomplete information due to their grounding in probability theory.
(4) They provide a compact and well-defined problem space since they use an exact solution method for any combination of evidence or set of faults.

BNs are appropriate for automated reasoning because of their deep representations and precise calculations. Of course, BNs have some limitations in knowledge representation, such as those related to time series.
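To make the triplet (V, L, P) and the space saving concrete, the following is a minimal sketch, assuming binary variables; the class name `BayesNet`, the helper functions and the three-node chain are hypothetical illustrations and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class BayesNet:
    """A Bayesian network as the triplet (V, L, P) over binary variables."""
    parents: dict  # V and L together: node -> tuple of parent nodes pi(v)
    cpt: dict      # P: node -> {tuple of parent states: p(node = 1 | parents)}


def cpt_entries(net):
    """Number of probabilities stored by the factored representation."""
    return sum(2 ** len(ps) for ps in net.parents.values())


def joint_entries(net):
    """Independent entries of a full joint table over the same variables."""
    return 2 ** len(net.parents) - 1


# Hypothetical chain A -> B -> C with made-up probabilities.
net = BayesNet(
    parents={"A": (), "B": ("A",), "C": ("B",)},
    cpt={
        "A": {(): 0.1},
        "B": {(0,): 0.05, (1,): 0.9},
        "C": {(0,): 0.02, (1,): 0.8},
    },
)

print(cpt_entries(net), joint_entries(net))
```

For a chain the factored form grows linearly with the number of nodes, while the full joint table grows exponentially, which is the saving the text refers to.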

2.2. Mapping Distributed Systems to Bayesian Networks

We represent uncertainty in the dependencies among the entities of a distributed system by assigning probabilities to the links in the dependency or causality graph [23, 24]. Some commonly accepted assumptions in this context are: (a) given fault A, the occurrences of faults B and C that may be caused by A are independent; (b) given the occurrence of faults A and B that may cause event C, whether A actually causes C is independent of whether B causes C (the OR relationship among alternative causes of the same event); and (c) root faults are independent of one another.

This dependency graph can be transformed into a belief network, which is a DAG with certain special properties [25]. To the best of our knowledge, no approximation has been proposed that fits all types of networks. We focus on a class of belief networks representing a simplified model of conditional probabilities called noisy-OR gates [19]. The simplified model contains binary-valued random variables. The noisy-OR model associates an inhibitory factor with every cause of a single effect and assumes that the inhibitors are all independent. The effect is absent only if all inhibitors corresponding to the present causes are activated.

When a distributed system is modeled as a BN, two important tasks need to be resolved:

2.2.1. Ascertain the Dependency Relationship Between Managed Entities

Dependencies represent consumer and provider relationships between the various cooperating entities in a distributed system. When one entity requires a service performed by another entity in order to execute its function, this relationship between the two entities is called a dependency. When a managed entity A (such as a service, or an application component in software or hardware) depends on a managed entity B, we say that A is the dependent and B is the antecedent. The notion of dependency can be applied at various levels of granularity.
The dependencies that occur between different system components must be defined carefully. For example, the maintenance of an Email server obviously affects the service "Email" and thus all the users whose user agents have a client/server relationship with this specific server; however, other services (News, WWW, FTP) are still usable because they do not depend on a functioning Email
service. So the inter-system dependencies are always confined to the components of the same service. Two models are useful for obtaining the dependencies between cooperating entities in distributed systems.

– The functional model defines generic service dependencies and establishes the principal constraints to which the other models are bound. A functional dependency is an association between two entities, typically captured first at design time, which says that one component requires some services from another.
– The structural model contains the detailed descriptions of the software and hardware components that realize the service. A structural dependency contains detailed information and is typically captured first at deployment or installation time.

2.2.2. Obtain the Measurement of the Dependency

When BNs are used to model distributed systems, they represent causes and effects between observable symptoms and the unobserved problems, so that when a set of evidence is observed, the most likely causes can be determined by inference techniques.

Single-cause (fault) and multi-cause (fault) are two general assumptions about the dependencies between managed entities in distributed systems management. In BNs, a non-root node may have one or several parents (causal nodes). Single-cause means that any one of the causes alone leads to the effect, whereas multi-cause means that an effect is generated only when more than one cause occurs simultaneously. The measurement of the dependencies therefore varies with the particular problem domain.

Management information statistics are the main source for obtaining the dependencies between the managed objects in distributed systems. The empirical knowledge of experts and experiments are also useful for determining the dependencies. Some researchers have performed useful work on discovering dependencies from the application view in distributed systems [26–28].
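The single-cause (OR) assumption is what the noisy-OR gate of Section 2.2 encodes. As a minimal sketch, assuming binary causes and one independent inhibitor probability per cause, a conditional probability table could be generated as follows; the function names and the inhibitor values for E and F are hypothetical and are not taken from Fig. 2.

```python
from itertools import product


def noisy_or(inhibit, present):
    """p(effect = 1 | present causes) under a noisy-OR gate.

    inhibit[c] is the probability that the inhibitor of cause c is active.
    The effect is absent only if every present cause is inhibited.
    """
    p_absent = 1.0
    for c in present:
        p_absent *= inhibit[c]
    return 1.0 - p_absent


def noisy_or_cpt(inhibit):
    """Full table p(effect = 1 | states) for every combination of cause states."""
    causes = sorted(inhibit)
    table = {}
    for states in product((0, 1), repeat=len(causes)):
        present = [c for c, s in zip(causes, states) if s]
        table[states] = noisy_or(inhibit, present)
    return causes, table


# Hypothetical inhibitor probabilities for two causes E and F.
inhibit = {"E": 0.1, "F": 0.3}
causes, table = noisy_or_cpt(inhibit)
for states, p in table.items():
    print(dict(zip(causes, states)), round(p, 3))
```

With n causes the gate needs only n inhibitor probabilities instead of 2^n table entries, which is why it is attractive when the dependencies have to be estimated from management statistics.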
Figure 1 shows a particular detail of the campus network of the FernUniversität in Hagen. When only the connection service for end users is considered, Fig. 2 illustrates the associated Bayesian network. Here the probability of dependency for the connection service is derived from the records of connection failures between objects; thus adaptations of load-balancing mechanisms within a router do not interfere with this model.

Fig. 1. Example of distributed system.

Fig. 2. Example of Bayesian network for Figure 1.

The arrows in the BN denote the dependency from causes to effects. The weights of the links denote the probability of dependency between the objects. When one node has several parents (causes), the dependency between the parents and their child can be denoted by a joint probability distribution. In this example, component F and component E are the causes for component D. The annotation $p(\bar{D}|\bar{E}F) = 100\%$ denotes that the probability of the non-availability of component D is 100% when component F is in order but component E is not. Other annotations can be read similarly. In this example, some evidence, such as the status of component D, is easy to detect through regular monitoring, but the causes of a failure of component D are not always obvious. One important task in management is to infer hidden factors from the available evidence.


The probability distribution describes the general characteristics of the network behavior based on observation over a long time. Real-time states change so quickly in a distributed system that they may not serve as a foundation for future management decisions. The persistent properties of the system are therefore what matters for systems management.

3. PROBABILISTIC INFERENCE IN BAYESIAN NETWORKS FOR DISTRIBUTED SYSTEMS MANAGEMENT

3.1. Basic Model of Inference in Bayesian Networks

The most common approach towards reasoning with uncertain information about dependencies in distributed systems is probabilistic inference, which traces the causes from effects. The task of backward inference amounts to finding the most probable instances of some hypothesis variables, given the observed evidence. We define E as the set of effects (evidence) which we can observe, and C as the set of causes. The inference from effects to causes is denoted by p(cj|Ei), Ei ⊂ E, cj ∈ C.

Before discussing the complex backward inference in BNs, a simplified model will be examined. In BNs, one node may have one or several parents (if it is not a root node), and we denote the dependency between parents and their child by a joint probability distribution (JPD).

Figure 3 shows the basic model for backward inference in BNs. Let X = (x1, x2, ..., xn) be the set of causes and Y be the effect of X. According to the definition of BNs, the following quantities are known: p(x1), p(x2), ..., p(xn), and p(Y|x1, x2, ..., xn) = p(Y|X). Here x1, x2, ..., xn are mutually independent, so

$$p(X) = p(x_1, x_2, \ldots, x_n) = p(x_1)\,p(x_2) \cdots p(x_n) = \prod_{i=1}^{n} p(x_i) \tag{3.1}$$

$$p(Y) = \sum_{X} p(X, Y) = \sum_{X} p(Y|X)\,p(X) = \sum_{X} \Bigl[\, p(Y|X) \prod_{i=1}^{n} p(x_i) \Bigr] \tag{3.2}$$

By Bayes' theorem,

$$p(X|Y) = \frac{p(Y|X)\,p(X)}{p(Y)} = \frac{p(Y|X) \prod_{i=1}^{n} p(x_i)}{\sum_{X} \Bigl[\, p(Y|X) \prod_{i=1}^{n} p(x_i) \Bigr]} \tag{3.3}$$

which gives

$$p(x_i|Y) = \sum_{X \setminus x_i} p(X|Y) \tag{3.4}$$


Fig. 3. Basic model for backward inference in Bayesian networks.

In Eq. (3.4), X \ xi = X − {xi}. According to Eqs. (3.1)–(3.4), the individual conditional probability p(xi|Y) can be obtained from the JPD p(Y|X), X = (x1, x2, ..., xn). The backward dependency from effects to causes can thus be obtained from Eq. (3.4). The dashed arrows in Fig. 3 denote the backward inference p(xi|Y) from effect Y to an individual cause xi, i ∈ {1, 2, ..., n}.

In Fig. 2, when a fault in component D (connection to the Internet provider) is detected, the state of D is out of order, say $p(\bar{D}) = 1$. Then, based on Eqs. (3.1)–(3.4), we obtain $p(\bar{F}|\bar{D}) = 67.8\%$ and $p(\bar{E}|\bar{D}) = 32.5\%$. This can be interpreted as follows: when component D is not available, the probability of a fault in component F is 67.8% and the probability of a fault in component E is 32.5%. Here only faults related to the connection service are considered.
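Equations (3.1)–(3.4) can be evaluated by direct enumeration over all cause states. The sketch below assumes binary causes; the function name, the priors and the conditional table are hypothetical placeholders, since the actual values of Fig. 2 are not reproduced in the text, so the numbers it produces are not the 67.8% and 32.5% quoted above.

```python
from itertools import product


def backward_inference(prior, p_y_given_x):
    """Return p(x_i = 1 | Y = 1) for each cause x_i by enumeration.

    prior[c]            : p(c = 1) for each independent cause c
    p_y_given_x[states] : p(Y = 1 | X = states), states ordered by sorted(prior)
    """
    causes = sorted(prior)
    joint = {}
    for states in product((0, 1), repeat=len(causes)):
        p_x = 1.0
        for c, s in zip(causes, states):
            p_x *= prior[c] if s else 1.0 - prior[c]    # Eq. (3.1)
        joint[states] = p_y_given_x[states] * p_x       # p(Y|X) p(X)
    p_y = sum(joint.values())                           # Eq. (3.2)
    posterior = {}
    for i, c in enumerate(causes):
        # Eq. (3.3) gives p(X|Y) = joint / p_y; Eq. (3.4) sums out the others.
        posterior[c] = sum(p for s, p in joint.items() if s[i]) / p_y
    return posterior


# Hypothetical fault priors for causes E and F and a made-up table for effect D.
prior = {"E": 0.1, "F": 0.2}
p_d_fault = {(0, 0): 0.0, (1, 0): 1.0, (0, 1): 0.9, (1, 1): 1.0}
posterior = backward_inference(prior, p_d_fault)
print(posterior)
```

Because each marginal in Eq. (3.4) also counts the states in which several causes are faulty at once, the individual posteriors need not sum to one.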

3.2. Strongest Dependency Route (SDR) Algorithm for Backward Inference

In distributed systems management, when faults are reported, the most urgent task is normally to locate the causes, to bring faulty states back to normal, and possibly to improve the performance of the system. The key factors related to the defect in the system should be identified. The strongest dependency route (SDR) algorithm is proposed to resolve these tasks based on probabilistic inference. Before we describe the SDR algorithm, the strongest cause and the strongest dependency route are defined as follows:

Definition 3.2.1. In a Bayesian network, let C be the set of causes and E be the set of effects. For ei ∈ E, let Ci be the set of causes of effect ei. If and only if p(ck|ei) = max[p(cj|ei), cj ∈ Ci], then ck is the strongest cause for effect ei.

Definition 3.2.2. In a Bayesian network, let C be the set of causes and E be the set of effects, and let R = (R1, R2, ..., Rm) be the set of routes from effect ei ∈ E to its cause cj ∈ C. Let Mk be the set of transition nodes between ei and cj in route Rk ∈ R. If and only if p(cj|Mk, ei) = max[p(cj|Mt, ei), t = 1, 2, ..., m], then Rk is the strongest route between ei and cj.

The SDR algorithm is described in detail as follows.

3.2.1. Pruning of the BNs

When a concrete problem domain is considered, a common strategy is to omit the variables that are not related to the problem domain. To achieve this, a prune operation is defined as follows:

Algorithm 3.2.1
Prune (BN = (V, L, P), Ek ⊂ E, Ek = {e1, e2, ..., ek})
{
    new BN′ = (V′, L′, P′);
    V′ = {e1, e2, ..., ek};            // add Ek to V′
    L′ = ∅;                            // ∅ denotes the empty set
    for ei ∈ Ek (i = 1, ..., k)
    {
        vi = ei;
        while vi ≠ NIL do
        {
            V′ = V′ ∪ {π(vi)};         // add vertex π(vi) to V′
            L′ = L′ + <π(vi), vi>;     // add edge <π(vi), vi> to L′
            vi ← π(vi);
        }
    }
    return BN′;
}

Generally speaking, multiple effects (symptoms) may be observed at the same moment, so Ek = {e1, e2, ..., ek} is defined as the set of original effects. In the pruning operation, every step just integrates the current nodes' parents into BN′ and omits their sibling nodes, because the sibling nodes are independent of each other. The pruned graph is composed of the effect nodes Ek and all of their ancestors. The end nodes of the pruned graph form the set of causes for the effect node ei.

3.2.2. Strongest Dependency Route (SDR) Trace Algorithm

After the pruning algorithm has been applied to a BN, a simplified sub-BN is obtained. Between every cause and effect, there may be more than one dependency route. The questions now are: which route is the strongest dependency route and, among all causes, which is the strongest cause? The SDR algorithm is used to trace the strongest route, to locate the causes and to generate the dependency sequence among the causes based on particular effects in Bayesian networks. The SDR algorithm uses product calculation to measure the serial strongest dependencies
between effect nodes and causal nodes. In the SDR algorithm, multiple effects are considered. Suppose Ek ⊂ E, Ek = {e1, e2, ..., ek}. If k = 1, the graph degenerates to a single-effect model. Here the backward dependency calculation is based on Eqs. (3.1)–(3.4) in Section 3.1. When several effects (symptoms) Ek are observed at the same time, then p(ei) = 1 or $p(\bar{e}_i) = 1$, i = 1, 2, ..., k. Here only the state p(ei) = 1 is considered. For simplification, a virtual effect node, written ε here, is added as the common parent of Ek. ε is a certain event, p(ε) = 1, so p(ei|ε) = 1, ei ∈ Ek. In fact, Ek and ε can be merged into one node, because all of them are certain events. Thus introducing the virtual node ε does not affect the result of the algorithm.

Algorithm 3.2.2 (SDR):

Input:
    V: the set of nodes (variables) in the Bayesian network
    L: the set of links in the Bayesian network
    P: the dependency probability distribution for every node in the Bayesian network
    Ek = {e1, e2, ..., ek}: the set of initial effect nodes in the BN
Output:
    T: a spanning tree of the Bayesian network, rooted at vertex ε, whose path from ei to each vertex v is a strongest dependency route, and a vertex labelling giving the dependency probability from ei to each vertex.
Variables:
    depend[v]: the strongest probability dependency between v and all its descendants;
    ψ(l): the conditional probability p(v|u), where v is the parent of u and l is the link they share; ψ(l) can be calculated from the JPD p(u|π(u)) based on Eqs. (3.1)–(3.4);
    ϕ(l): a temporary variable which records the strongest dependency between nodes;
{
    Initialize the SDR tree T as vertex ε;
    Initialize the set of frontier edges for tree T as empty;
    depend[ε] := 1;
    Write label 1 on vertex ε;
    While SDR tree T does not yet span the BN
    {
        For each frontier edge l for T
        {
            Let u be the labelled endpoint of edge l;
            Let v be the unlabelled endpoint of edge l (v is one parent of u);
            Let ψ(l) = p(v|u);
            Set ϕ(l) = depend[u] * ψ(l) = depend[u] * p(v|u);
        }
        Let l be a frontier edge for T that has the maximum ϕ-value;
        Add edge l (and vertex v) to tree T;
        depend[v] = ϕ(l);
        Write label depend[v] on vertex v;
    }
    Return SDR tree T and its vertex labels;
}
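The pruning step and the SDR trace can be sketched together in a few lines. This is an illustrative reading of Algorithms 3.2.1 and 3.2.2 under stated assumptions, not the authors' implementation: the backward dependencies p(v|u) are assumed to be given as a dictionary `back_prob` (in the paper they would be derived from Eqs. (3.1)–(3.4)), a priority queue replaces the repeated scan over frontier edges, the virtual root is left implicit by starting every observed effect at dependency 1, and the small network and all of its numbers are hypothetical.

```python
import heapq
from itertools import count


def prune(parents, effects):
    """Keep only the observed effects and all of their ancestors."""
    keep, stack = set(effects), list(effects)
    while stack:
        u = stack.pop()
        for v in parents.get(u, ()):
            if v not in keep:
                keep.add(v)
                stack.append(v)
    return {u: tuple(parents.get(u, ())) for u in keep}


def sdr(parents, back_prob, effects):
    """Label every node of the pruned BN with its strongest dependency.

    back_prob[(v, u)] is the backward dependency p(v | u) of parent v on child u.
    Returns (depend, tree); tree[v] is the child through which v was reached.
    """
    sub = prune(parents, effects)
    depend, tree = {}, {}
    tie = count()  # tie-breaker so the heap never has to compare node names
    frontier = [(-1.0, next(tie), e, None) for e in effects]
    heapq.heapify(frontier)
    while frontier:
        neg_phi, _, v, u = heapq.heappop(frontier)  # edge with the maximum phi
        if v in depend:
            continue  # already labelled through a stronger route
        depend[v], tree[v] = -neg_phi, u
        for w in sub[v]:
            if w not in depend:
                phi = depend[v] * back_prob[(w, v)]
                heapq.heappush(frontier, (-phi, next(tie), w, v))
    return depend, tree


def route(tree, cause):
    """Strongest route from the observed effect back to the given cause."""
    path = [cause]
    while tree[path[-1]] is not None:
        path.append(tree[path[-1]])
    return path[::-1]


# Hypothetical network: G causes E and F, which both cause D; B causes C.
parents = {"D": ("E", "F"), "E": ("G",), "F": ("G",), "G": (), "C": ("B",), "B": ()}
back_prob = {("E", "D"): 0.3, ("F", "D"): 0.7, ("G", "E"): 0.9, ("G", "F"): 0.5}
depend, tree = sdr(parents, back_prob, ["D"])
print(depend, route(tree, "G"))
```

In this made-up example the cause G is reached through F rather than E, because 0.7 × 0.5 exceeds 0.3 × 0.9, and the nodes B and C are dropped by the pruning step because they are not ancestors of D.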

The result of the SDR algorithm is a spanning tree T. Every cause node cj ∈ C is labeled with depend[cj] = p(cj|Mk, ei), ei ∈ Ek, where Mk is the set of transition nodes between ei and cj in route Rk ∈ R. According to the values of the labels on the cause nodes, a cause sequence can be obtained. This sequence is important for primary fault diagnosis and related maintenance operations. Meanwhile, using depth-first search on the spanning tree, the strongest route between effect nodes and cause nodes can also be obtained.

Suppose one reasoning chain of a series of variables (from cause to effect) is x1 → x2 → ··· → xn; then the JPD p(xn|xn−1, ..., x1) is considered as the backward reasoning value based on the backward serial variables (from effect to cause) xn → xn−1 → ··· → x1.

3.2.3. Proof of the SDR Algorithm

We now prove the correctness of the SDR algorithm. Algorithm 3.2.2 gives a way to identify the strongest route from effect ei (ei ∈ Ek) to cj (cj ∈ C). Suppose the route ei, u1, u2, ..., un, cj is the strongest dependency route and ei, δ1, δ2, ..., δm, cj is any route from ei to cj. Then

$$p(u_1|e_i) \times p(u_2|u_1) \times \cdots \times p(c_j|u_n) \ge p(\delta_1|e_i) \times p(\delta_2|\delta_1) \times \cdots \times p(c_j|\delta_m) \tag{3.5}$$

Define weight(u, π(u)) = −lg(p(π(u)|u)). Eq. (3.5) is then transformed into

$$\mathrm{weight}(e_i, u_1) + \mathrm{weight}(u_1, u_2) + \cdots + \mathrm{weight}(u_n, c_j) \le \mathrm{weight}(e_i, \delta_1) + \mathrm{weight}(\delta_1, \delta_2) + \cdots + \mathrm{weight}(\delta_m, c_j) \tag{3.6}$$
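The step from Eq. (3.5) to Eq. (3.6) rests on two facts: the logarithm turns a product into a sum, and −lg is monotonically decreasing, so the inequality changes direction. As a sketch, with a small numeric check whose probabilities are hypothetical and with lg read as the base-10 logarithm (the base does not affect the argument):

```latex
\begin{align*}
-\lg\bigl(p(u_1|e_i)\,p(u_2|u_1)\cdots p(c_j|u_n)\bigr)
  &= -\lg p(u_1|e_i) - \lg p(u_2|u_1) - \cdots - \lg p(c_j|u_n) \\
  &= \mathrm{weight}(e_i,u_1) + \mathrm{weight}(u_1,u_2) + \cdots + \mathrm{weight}(u_n,c_j).
\end{align*}
% Hypothetical example with two routes of two links each:
%   route 1: 0.7 \times 0.5 = 0.35, \quad route 2: 0.3 \times 0.9 = 0.27
\begin{align*}
-\lg 0.7 - \lg 0.5 &\approx 0.1549 + 0.3010 = 0.4559 = -\lg 0.35, \\
-\lg 0.3 - \lg 0.9 &\approx 0.5229 + 0.0458 = 0.5686 \approx -\lg 0.27 .
\end{align*}
```

The route with the larger probability product (0.35 > 0.27) has the smaller total weight (0.4559 < 0.5686), so maximizing the product of dependencies is equivalent to finding a shortest path under these weights.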

Lemma: When a vertex u is added to the spanning tree T, d[u] = weight(ei, u) = −lg(depend[u]). Since 0 < depend[δj] ≤ 1, d[δj] ≥ 0. (Note that depend[δj] ≠ 0; otherwise there would be an empty link between δj and its children.)

Proof: Suppose to the contrary that at some point the SDR algorithm first attempts to add a vertex u to T for which d[u] ≠ weight(ei, u). Consider the situation just prior to the insertion of u (see Fig. 4). Consider the true strongest dependency route from ei to u. Because ei ∈ T and u ∈ V \ T, at some point this route must first take a jump out of T. Let (x, y) be the edge taken by the path, where x ∈ T and y ∈ V \ T (it may be that x = ei, y = u).

Fig. 4. Proof of SDR algorithm.

We now prove that d[y] = weight(ei, y). Since x has already been processed,

$$d[y] \le d[x] + \mathrm{weight}[x, y]. \tag{3.7}$$

Since x was added to T earlier, by hypothesis,

$$d[x] = \mathrm{weight}[e_i, x]. \tag{3.8}$$

Since ei, ..., x, y is a sub-path of a strongest dependency route, by Eq. (3.8),

$$\mathrm{weight}[e_i, y] = \mathrm{weight}[e_i, x] + \mathrm{weight}[x, y] = d[x] + \mathrm{weight}[x, y] \tag{3.9}$$

By Eqs. (3.7) and (3.9), we get d[y] ≤ weight[ei, y]. Since d[y] can never be smaller than the weight of the true strongest route, d[y] = weight[ei, y]. Now note that since y appears midway on the route from ei to u, and all subsequent edge weights are non-negative, we have weight[ei, y] ≤ weight[ei, u], and thus d[y] = weight[ei, y] ≤ weight[ei, u] < d[u], where the last inequality is strict because d[u] ≠ weight[ei, u] by assumption. Thus, y would have been added to T before u, in contradiction to our assumption that u is the next vertex to be added to T. Since the calculation is correct for every effect node, it is also true for multiple effect nodes in tracing the strongest dependency route. At the end of the algorithm, all vertices are in T, thus all dependency (weight) estimates are correct.

3.2.4. Complexity Analysis of the SDR Algorithm

To determine the complexity of the SDR algorithm, we observe that every link (edge) in the BN is calculated only once, so the complexity is proportional to the number of links in the BN. In a complete directed acyclic graph the number of edges is n(n − 1)/2 = (n² − n)/2, where n is the number of nodes in the pruned BN. Normally a BN is an incomplete directed graph, so the calculation time of SDR is less than (n² − n)/2. Hence the complexity of SDR is O(n²). In Fig. 2, if the effect nodes D and C are detected, the calculation complexity of cause detection after SDR algorithm is 3 ( depend[D] > depend[G] > depend[F] > depend[B] > depend[A]. The cause sequence is important for primary fault diagnosis and related maintenance operations.

Fig. 5. Simulation of inference in Bayesian network.

Fig. 6. The JPD of the BN in Fig. 5 (here 0 denotes normal state, 1 denotes abnormal state).

Fig. 7. The spanning tree from the static BN in Fig. 5.

In the simulation experiment, the detection rate represents the percentage of the faults that occurred in the network in a given experiment that were detected by an algorithm. As stated in Section 1, when an application system meets the 5% of non-located faults, the manager generally has to detect them randomly or exhaustively.

The objective of any fault management system is to minimize the time to localize a fault. The time to localize a fault is the sum of the time to propose possible fault hypotheses (fault identification) and the time to perform testing in order to verify these hypotheses. The time required for testing is affected by the number of managed objects that must be tested. Thus, if the network management system is able to identify the source of a fault, it is desirable that the minimum number of tests be performed. Hence, two main aspects of any fault localization process are subject to optimization: the accuracy of the hypotheses it provides and the time complexity of the fault identification algorithm it uses. In order to optimize the time to localize a fault, we should maximize the accuracy of the proposed hypotheses and minimize the time complexity of the fault identification process. The simulation results over 1000 tests are presented in Fig. 8.

Fig. 8. Comparison of the detection rate between random detection and SDR detection.
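The benefit of testing candidate causes in ranked order rather than randomly can be illustrated with a small calculation. The sketch below is not the simulation behind Fig. 8; it assumes that exactly one candidate is faulty, that the labels can be normalized into probabilities of being that candidate, and that every test costs the same. The function names and the values are hypothetical.

```python
def expected_tests_ranked(posterior):
    """Expected number of tests when candidates are tested in descending
    order of probability and exactly one candidate is the faulty one."""
    ranked = sorted(posterior.values(), reverse=True)
    total = sum(ranked)  # normalize in case the labels do not sum to one
    return sum(k * p / total for k, p in enumerate(ranked, start=1))


def expected_tests_random(n):
    """Expected number of tests when the single faulty candidate is searched
    for among n candidates in a uniformly random order."""
    return (n + 1) / 2


# Hypothetical, already normalized labels for four candidate causes.
labels = {"G": 0.50, "F": 0.25, "B": 0.15, "A": 0.10}
ranked_cost = expected_tests_ranked(labels)
random_cost = expected_tests_random(len(labels))
print(ranked_cost, random_cost)
```

Under these assumptions the ranked order never needs more tests on average than the random order, and the gap widens as the distribution over causes becomes more skewed.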

3.4. Related Algorithms for Probabilistic Inference

There exist various types of inference algorithms for Bayesian networks. They can be classified into two types: exact inference [21, 31, 32] and approximate inference [33]. Each class offers different properties and works better on different classes of problems, but it is very unlikely that a single algorithm can solve all possible problem instances effectively. Every solution is based on a particular requirement. This holds for almost all computational problems, and in particular for probabilistic inference in general Bayesian networks, which Cooper has shown to be NP-hard [34].

Pearl's algorithm, the most popular inference algorithm in BNs, cannot easily be extended to acyclic multiply connected digraphs in general. A straightforward application of Pearl's algorithm to an acyclic digraph comprising one or more loops invariably leads to insuperable problems [33, 35].


Another popular exact BN inference algorithm is the clique-tree algorithm [31]. It first transforms a multiply connected network into a clique tree by clustering the triangulated moral graph of the underlying undirected graph, and then performs message propagation over the clique tree. But with this approach it is difficult to record the internal nodes and the dependency routes between particular effect nodes and causes. In distributed systems management, the states of internal nodes and the key routes that connect effects and causes are important for management decisions. Moreover, the order in which potential faults should be localized is a very useful reference for systems managers. For system performance management, the identification of related key factors is also important. Few algorithms give a satisfactory solution for this case.

Compared with other algorithms, the SDR algorithm belongs to the class of exact inference algorithms. It provides an efficient method to trace the strongest dependency routes from effects to causes and to track the dependency sequences of the causes. It is useful in fault location and beneficial for performance management. Moreover, it can treat multiply connected networks modeled as DAGs.

4. CONCLUSIONS AND FUTURE WORK

In distributed systems of realistic size and complexity, managers have to live with unstable, uncertain and incomplete information. In this paper, Bayesian networks are proposed to represent the knowledge about managed objects and their dependencies in an uncertain environment, and probabilistic reasoning is applied to determine the causes of failures or errors. Bayesian inference is a popular mechanism underlying probabilistic reasoning systems. The SDR algorithm introduced in this paper provides an efficient method of backward inference to trace the causes from effects.
Running the SDR algorithm yields the strongest dependency routes from effects to causes as well as the strongest dependency sequence of causes. Hence SDR inference provides an efficient mechanism for fault location and is beneficial for performance management.

Most distributed systems, however, dynamically update their structures, topologies, and the dependency relationships between managed objects. We need to accommodate ongoing changes and maintain a healthy management system based on learning strategies that allow us to modify the cause-effect structure and the dependencies between the nodes of a Bayesian network accordingly. However, the Bayesian paradigm does not provide a direct mechanism for modeling temporal dependency in dynamic systems [36, 37]. Thus, we apply dynamic Bayesian networks (DBNs) to introduce a temporal factor, to model the dynamic changes among managed objects and their mutual dependencies, and further to investigate prediction issues based on inference techniques in fault management in the presence of imprecise and dynamic
management information. Nonlinear regression theory [38] is used to capture the trend of changes and to give reasonable predictions for individual components and for the trends in the dependencies between managed components in a distributed system.

ACKNOWLEDGMENTS

This research is part of the International Quality Networks (IQN) project and is supported by DAAD (the German Academic Exchange Service). The authors would like to thank Carsten Schippang for providing a whole year of sample data from the campus network of FernUniversität Hagen. Many thanks are also due to the anonymous reviewers for their valuable comments.

REFERENCES

1. A. Osmani and F. Krief, Model-Based Diagnosis for Fault Management in ATM Networks, Proceedings of International Conference on ATM ICATM 99, pp. 91–99, 1999.
2. J. Zupan and D. Medhi, An Alarm Management Approach in the Management of Multi-Layered Networks, 3rd IEEE International Workshop on IP Operations & Management (IPOM 2003), pp. 77–84, 2003.
3. R. E. Miller and K. A. Arisha, Fault Management Using Passive Testing for Mobile IPv6 Networks, Proceedings of 2001 IEEE Global Telecommunications Conference, Vol. 3, pp. 1923–1927, 2001.
4. I. Rouvellou and G. W. Hart, Automatic Alarm Correlation for Fault Identification, Proceedings of IEEE INFOCOM’95, pp. 553–561, 1995.
5. S. A. Yemini, S. Kliger, E. Mozes, Y. Yemini, and D. Ohsie, High speed and robust event correlation, IEEE Communications Magazine, Vol. 34, No. 5, pp. 82–90, 1996.
6. C. Lo, S. H. Chen, and B. Lin, Coding-based schemes for fault identification in communication networks, Journal of Network and Systems Management, Vol. 10, No. 3, pp. 157–164, 2000.
7. L. Lewis, A case-based reasoning approach to the resolution of faults in communication networks, in Integrated Network Management, III, Elsevier Science Publishers B.V., Amsterdam, pp. 671–682, 1993.
8. G. Pemido, J. Nogueira, and C. Machado, An Automatic Fault Diagnosis and Correction System for Telecommunications Management, Proceedings of 6th IFIP/IEEE International Symposium on Integrated Network Management, pp. 777–791, 1999.
9. S. Kätker and K. Geihs, A generic model for fault isolation in integrated management systems, Journal of Network and Systems Management, Vol. 5, No. 2, pp. 109–130, 1997.
10. R. H. Deng, A. A. Lazar, and W. Wang, A probabilistic approach to fault diagnosis in linear lightwave networks, IEEE Journal on Selected Areas in Communications, Vol. 11, No. 9, pp. 1438–1448, 1993.
11. C. S. Hood and C. Ji, Proactive network-fault detection, IEEE Transactions on Reliability, Vol. 46, No. 3, pp. 333–341, 1997.
12. R. Sterritt and D. W. Bustard, Fusing hard and soft computing for fault management in telecommunications systems, IEEE Transactions on Systems, Man, and Cybernetics, Part C, Vol. 32, No. 2, pp. 92–98, 2002.
13. C. S. Chao, D. L. Yang, and A. C. Liu, An automated fault diagnosis system using hierarchical reasoning and alarm correlation, Journal of Network and Systems Management, Vol. 9, No. 2, pp. 183–202, 2001.


14. C. Hill, High-availability systems boost network uptime: Part 1, http://www.eetasia.com/ ARTICLES/2001JUL/2001JUL01 NTEK ST QA TA.PDF. Motorola Telecom Business Unit, 2001.
15. D. Nikovski, Constructing Bayesian networks for medical diagnosis from incomplete and partially correct statistics, IEEE Transactions on Knowledge and Data Engineering, Vol. 12, No. 4, pp. 509–516, 2000.
16. W. Wiegerinck, H. J. Kappen, E. W. M. T. ter Braak, W. J. P. P. ter Burg, M. J. Nijman, Y. L. O, and J. P. Neijt, Approximate inference for medical diagnosis, Pattern Recognition Letters, Vol. 20, pp. 1231–1239, 1999.
17. K. Basye, T. Dean, and J. Scott Vitter, Coping with Uncertainty in Map Learning, Machine Learning, Vol. 29, No. 1, pp. 65–88, 1997.
18. E. Charniak and R. P. Goldman, A Semantics for Probabilistic Quantifier-Free First-Order Languages, with Particular Application to Story Understanding, Proceedings of the IJCAI-89, pp. 1074–1079, Morgan-Kaufmann.
19. J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, San Mateo, CA, 1988.
20. R. G. Cowell, A. P. Dawid, S. L. Lauritzen, and D. J. Spiegelhalter, Probabilistic Networks and Expert Systems, Springer-Verlag, New York, 1999.
21. J. Pearl, Causality: Models, Reasoning, and Inference, Cambridge, England, Cambridge University Press, New York, NY, 2000.
22. Y. Xiang, Probabilistic Reasoning in Multiagent Systems: A graphical models approach, Cambridge University Press, Cambridge, ISBN 0-521-81308-5, 2002.
23. I. Katzela and M. Schwarz, Schemes for fault identification in communication networks, IEEE Transactions on Networking, Vol. 3, No. 6, pp. 733–764, 1995.
24. S. Kliger, S. Yemini, Y. Yemini, D. Ohsie, and S. Stolfo, A Coding Approach to Event Correlation, Proceedings of the fourth international symposium on Integrated network management IV, pp. 266–277, 1995.
25. D. Heckerman and M. P. Wellman, Bayesian networks, Communications of the ACM, Vol. 38, No. 3, pp. 27–30, 1995.
26. M. Gupta, A. Neogi, M. K. Agarwal, and G. Kar, Discovering Dynamic Dependencies in Enterprise Environments for Problem Determination, 14th IEEE/IFIP International Workshop on Distributed Systems Operations and Management, Heidelberg, Germany, 2003.
27. A. Keller, U. Blumenthal, and G. Kar, Classification and Computation of Dependencies for Distributed Management, Proceedings of 5th IEEE Symposium on Computers and Communications, Antibes-Juan-les-Pins, France, 2000.
28. J. Gao, G. Kar, and P. Kermani, Approaches to Building Self Healing Systems using Dependency Analysis, Proceedings of the IEEE/IFIP Network Operations and Management Symposium, April, 2004.
29. M. Matsumoto and Y. Kurita, Twisted GFSR generators, ACM Transactions on Modeling and Computer Simulation, Vol. 2, pp. 179–194, 1992.
30. M. Matsumoto and Y. Kurita, Twisted GFSR generators II, ACM Transactions on Modeling and Computer Simulation, Vol. 4, pp. 254–266, 1994.
31. S. L. Lauritzen and D. J. Spiegelhalter, Local computations with probabilities on graphical structures and their application to expert systems, Journal of the Royal Statistical Society, Series B, Vol. 50, pp. 157–224, 1988.
32. J. Pearl, A Constraint-Propagation Approach to Probabilistic Reasoning, Uncertainty in Artificial Intelligence, North-Holland, Amsterdam, pp. 357–369, 1986.
33. R. M. Neal, Probabilistic inference using Markov chain Monte Carlo methods, Technical Report CRG-TR93-1, University of Toronto, Department of Computer Science, 1993.


34. G. Cooper, Computational complexity of probabilistic inference using Bayesian belief networks, Artificial Intelligence, Vol. 42, pp. 393–405, 1990.
35. F. L. Koch and C. B. Westphall, Decentralized network management using distributed artificial intelligence, Journal of Network and Systems Management, Vol. 9, No. 4, December 2001.
36. C. F. Aliferis and G. F. Cooper, A Structurally and Temporally Extended Bayesian Belief Network Model: Definitions, Properties, and Modeling Techniques, Proceedings of the 12th Conference on Uncertainty in Artificial Intelligence, pp. 28–39, 1996.
37. J. D. Young and E. Santos, Introduction to Temporal Bayesian Networks, Presented at the 7th Midwest AI and Cognitive Science Conference, 1996.
38. A. S. Weigend and N. A. Gershenfeld, Time Series Prediction: Forecasting the Future and Understanding the Past, Addison-Wesley, ISBN: 0-201-62602-0, 1994.
39. H. J. Suermondt and G. F. Cooper, Probabilistic inference in multiply connected belief network using loop cutsets, International Journal of Approximate Reasoning, Vol. 4, pp. 283–306, 1990.

Jianguo Ding received his M.Sc. in computer science from Hefei University of Technology, P. R. China, in 1999. He obtained a joint Ph.D. in computer science between Shanghai Jiao Tong University in P. R. China and FernUniversität Hagen in Germany in 2005. He was supported by a DAAD (German Academic Exchange Service) scholarship. He is a member of the IEEE. His current research interests include distributed systems management, intelligent technology, and probabilistic reasoning.

Bernd Krämer is a full professor at FernUniversität in Hagen, Germany. He obtained his diploma and doctorate in computer science from the Technical University of Berlin. He is the president of the international Society for Process and Design Sciences. His research interests include distributed software engineering, e-learning technology, distributed systems management, and dependable software.

Yingcai Bai graduated from Tsinghua University, P. R. China. He is a full professor at Shanghai Jiao Tong University, P. R. China. He is also the president of Shanghai Engineering Center of GOLDEN Network and the president of Shanghai Computer Open System Association. His research interests include network architecture, network security, and distributed systems management.

Hansheng Chen graduated in mathematics from Fudan University, P. R. China. He is a professor at East-China Institute of Computer Technology. He is a visiting professor at FernUniversität in Hagen, Germany. His research focuses on software engineering and distributed systems.