Submitted to: IJAR

Universiteit van Amsterdam

IAS technical report IAS-UVA-06-03

Fault Localization in Bayesian Networks

Jan Nunnink and Gregor Pavlin
Intelligent Systems Laboratory Amsterdam, University of Amsterdam
The Netherlands

This paper considers the accuracy of classification using Bayesian networks (BNs). It presents a method to localize network parts that (i) are responsible for a potential misclassification in a given (rare) case, or (ii) contain modeling errors that consistently cause misclassifications, even in common cases. We analyze how inaccuracies introduced by such network parts are propagated through a network and derive a method to localize the source of the inaccuracy. The method is based on monitoring the BN's 'behavior' at runtime, specifically the correlation among a set of observations. Finally, when bad network parts are found, they can be repaired or their effects mitigated.

Keywords: Bayesian networks, fault localization, classification.


Contents

1 Introduction
2 Bayesian networks and classification
   2.1 Factorization
3 Classification Accuracy
4 Fault Probability
   4.1 Fault Causes
      4.1.1 Cause 1: Rare Cases
      4.1.2 Cause 2: Modeling Inaccuracies
      4.1.3 Cause 3: Erroneous Evidence
5 Reinforcement Propagation
   5.1 Reinforcement Accuracy
6 Fault Monitoring
   6.1 Factor Consistency
   6.2 Consistency Measure
   6.3 Estimation of the Summary Accuracy
7 Fault Localization Algorithm
   7.1 Determining the Cause Type
8 Experiments
   8.1 Synthetic Networks
   8.2 Real World Experiment
9 Applications
   9.1 Localizing Faulty Model Components
   9.2 Deactivating Inadequate Model Components
10 Discussion
   10.1 Non-Tree Networks
   10.2 Related Work
   10.3 Conclusion

Intelligent Autonomous Systems
Informatics Institute, Faculty of Science
University of Amsterdam
Kruislaan 403, 1098 SJ Amsterdam
The Netherlands
Tel (fax): +31 20 525 7461 (7490)
http://www.science.uva.nl/research/ias/

Corresponding author: Jan Nunnink
tel: +31 20 525 7517
[email protected]
http://www.science.uva.nl/~jnunnink/

Copyright IAS, 2006

1 Introduction

Discrete Bayesian networks (BNs) are rigorous and powerful probabilistic models for reasoning about domains which contain a significant amount of uncertainty. They are often used as classifiers for decision making or for state estimation in a filtering context. BNs are especially suited for modeling causal relationships [15]. Using such a causal model, we propagate the values of observed variables to probability distributions over unobserved variables [14, 7]. These posterior distributions are often the basis for a classification process. Moreover, in many applications the classification result is mission critical (e.g. situation assessment in crisis circumstances).

In this context, we emphasize the difference between generalization accuracy and classification accuracy. In general, classification is based on models that are generalizations over many different situations. Causal BNs capture such generalizations through conditional probability distributions over related events. However, accurate generalizations do not guarantee accurate classification in a particular case. In a rare situation, a set of observations could result in an erroneous classification, even if the model precisely described the true distributions. Namely, in such a situation a certain portion of the conditional distributions does not 'support' accurate inference, since it does not model the rare case. By using such inadequate relations for inference, the posterior probability of the true (unobserved) state is reduced. In addition, the probability of encountering situations for which a modeled relation is inadequate increases with the divergence between the true and the modeled probability distributions.

Inadequate relations influence the inference process and the subsequent classification. This is reflected in the way the model 'behaves' under different circumstances. We can monitor a BN and draw conclusions from its behavior about the existence of inadequate parts.
This is based on the following principle: a given classification node splits the network into several independent fragments. These fragments can be seen as different experts giving independent 'votes' about the state of that node. The degree to which they 'agree' on a state is a measure of the accuracy of the classification. The data conflict approach by Jensen et al. [8] (as well as [10] and [9]) uses a roughly similar principle. It is based on the assumption that, given an adequate model, all evidence should be correlated, and hence the joint probability of all evidence should be greater than the product of the individual evidence probabilities. More discussion of related approaches can be found in Section 10.2.

Using our measure, we present a method for localizing possible inaccuracies in general, and show how to use this method to determine the type of cause. The advantage of this method over previous work is that we can estimate a lower bound on its effectiveness, and that this lower bound has asymptotic properties given the network topology. This allows one to determine for which networks localization works best. Furthermore, the localization procedure is more straightforward than the existing methods. Possible applications for the proposed method are presented in Section 9. Among its uses are:

• Localized model errors can be manually corrected or relearned.
• Localized inaccurate information sources can be deactivated at run-time to improve the classification.

2 Bayesian networks and classification

A Bayesian network BN is defined as a tuple $\langle D, \hat{p} \rangle$, where $D = \langle \mathbf{V}, \mathbf{E} \rangle$ is a directed acyclic graph (DAG) consisting of a set of nodes $\mathbf{V} = \{V_1, \ldots, V_n\}$ and a set of directed edges $\langle V_i, V_j \rangle \in \mathbf{E}$ between pairs of nodes. Each node corresponds to a variable, and $\hat{p}$ is the joint probability distribution over all variables, defined as
$$\hat{p}(\mathbf{V}) = \prod_{V_i \in \mathbf{V}} \hat{p}(V_i \mid \pi(V_i)),$$
where $\hat{p}(V_i \mid \pi(V_i))$ is the conditional probability table (CPT) for node $V_i$ given its parents $\pi(V_i)$ in the graph. In this paper we use $\hat{p}$ to refer specifically to estimated values, such as modeling parameters (CPTs) and posterior probabilities, while we use $p$ (without hat) for the true probabilities in the modeled world. We assume that every probability we can estimate has a corresponding true value in the real world. Each variable has a finite number of discrete states (or values), denoted by lower-case letters. The DAG represents the causal structure of the domain, and the conditional probabilities encode the causal strength. Furthermore, we denote a set of observations or evidence about the state of variables by $\mathbf{E}$, the main classification or hypothesis node by $H$, its states by $h_i$, and the (hidden) true state of $H$ by $h^*$. A variable $H$ is classified as the state $h_i$ for which
$$h_i = \arg\max_{h_j} \hat{p}(H = h_j, \mathbf{E}). \tag{1}$$

$\hat{p}(H = h_j, \mathbf{E})$ is obtained from the joint probability distribution by marginalizing over all variables except $H$ and those in $\mathbf{E}$, and factorizing the joint probability distribution, as follows:
$$\hat{p}(H = h_j, \mathbf{E}) = \sum_{\mathbf{V} \setminus H} \prod_{V_i \in \mathbf{V}} \hat{p}(V_i \mid \pi(V_i)) \cdot \prod_{e \in \mathbf{E}} e. \tag{2}$$

[Figure 1: An example Bayesian network. Nodes {A, B, C, E} represent fragment $F_1^H$, which is rooted in node H. H has a branching factor of 3; A has a branching factor of 2.]

We distinguish between evidence variables and non-evidence variables. Where a non-evidence variable (except $H$) appears in the equation, we marginalize it out by summing over its states. Where an evidence variable appears, it is replaced by its observed state. This is done through multiplication with the vector $e$, which contains 1s at the entries corresponding to the observed states and 0s elsewhere. Note that this involves the multiplication of potentials; see also [7], Section 1.4.6.
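As a concrete illustration of classification by (1) and (2), here is a toy model of our own (not the network of Figure 1): a hypothesis $H$ with two observed children $F$ and $G$, so the joint for each state $h$ reduces to $\hat{p}(h)\,\hat{p}(f \mid h)\,\hat{p}(g \mid h)$ with nothing left to marginalize. All numbers are invented for the example.

```python
# Toy sketch of classification by (1)-(2): H with two observed children F, G.
# Only H is hidden, so the "marginalization" in (2) is trivial here.
import numpy as np

p_H = np.array([0.5, 0.5])
p_F_given_H = np.array([[0.8, 0.3],   # rows: states of F, columns: states of H
                        [0.2, 0.7]])
p_G_given_H = np.array([[0.6, 0.2],   # rows: states of G, columns: states of H
                        [0.4, 0.8]])

f_obs, g_obs = 0, 1                   # observed evidence (0-indexed states)
joint = p_H * p_F_given_H[f_obs] * p_G_given_H[g_obs]
print(np.argmax(joint))               # classify H as arg max of p(H, E); here state 0
```

Here `joint` is `[0.16, 0.12]`, so $H$ is classified as its first state.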

2.1 Factorization

For the analysis in the next sections we require a notion of dependence between different parts of a network given the classification variable. We can use d-separation [14] to identify fragments of the DAG which are conditionally independent given the classification node.

Definition 1 Given a DAG and classification node $H$, we identify a set of fragments $F_i^H$ ($i = 1, \ldots, k$) which are all pairwise d-separated given $H$. A fragment is defined as a set of nodes that includes $H$.

Hence, the nodes in a fragment are conditionally independent of the nodes in the other fragments given $H$. Node $H$ is called the root of all fragments $F_i^H$. The number $k$ of fragments rooted in $H$ is called the branching factor.
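For tree-shaped DAGs, the fragments of Definition 1 can be found mechanically. The sketch below is our own construction (the edge-list format and all names are assumptions, not from the paper): removing $H$ splits the skeleton into components; the components attached to $H$ through its parents merge into a single fragment, because conditioning on the collider $H$ makes the parents dependent, while each child subtree forms its own fragment.

```python
# Sketch of fragment identification (Definition 1) for a tree-shaped BN.
from collections import defaultdict

def fragments(edges, H):
    """edges: directed (parent, child) pairs; returns fragments as sets incl. H."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b); adj[b].add(a)
    parents = {a for a, b in edges if b == H}

    def component(start):
        # connected component of the skeleton with H removed
        comp, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v in comp or v == H:
                continue
            comp.add(v)
            stack.extend(adj[v])
        return comp

    comps = [component(n) for n in adj[H]]
    # components reached via a parent of H merge into one fragment
    parent_side = set().union(*(c for c in comps if c & parents)) if parents else set()
    frags = [c | {H} for c in comps if not (c & parents)]
    if parent_side:
        frags.append(parent_side | {H})
    return frags

# Figure 1's graph: A -> C, C -> E, A -> H, B -> H, H -> F, H -> G
edges = [('A','C'), ('C','E'), ('A','H'), ('B','H'), ('H','F'), ('H','G')]
print(sorted(sorted(f) for f in fragments(edges, 'H')))
# [['A', 'B', 'C', 'E', 'H'], ['F', 'H'], ['G', 'H']]
```

On Figure 1's graph this reproduces the three fragments listed in the text below.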


See Figure 1 for an example (we will use this example throughout the paper). Given node $H$, three fragments can be seen, namely (i) nodes $\{A, B, C, E, H\}$, (ii) nodes $\{F, H\}$, and (iii) nodes $\{G, H\}$. If $A$ were the classification variable, we could identify two fragments, namely (i) nodes $\{A, C, E\}$, and (ii) the rest of the nodes plus $A$. This fragmentation has the useful property that the classification equation (2) can be factorized such that each factor corresponds one-to-one with a fragment. By splitting the sum and product and regrouping them per fragment, we rewrite (2) as
$$\hat{p}(H = h_i, \mathbf{E}) = \sum_{\mathbf{V} \setminus H} \prod_{V_i \in \mathbf{V}} \hat{p}(V_i \mid \pi(V_i)) \cdot \prod_{e \in \mathbf{E}} e
= \underbrace{\hat{p}(h_i \mid \pi(H)) \sum_{\mathbf{V}_1 \setminus H} \prod_{V_i \in \mathbf{V}_1 \setminus H} \hat{p}(V_i \mid \pi(V_i)) \cdot \prod_{e \in \mathbf{E}_1} e}_{\phi_1(h_i)} \;\cdots\; \underbrace{\sum_{\mathbf{V}_k \setminus H} \prod_{V_i \in \mathbf{V}_k \setminus H} \hat{p}(V_i \mid \pi(V_i)) \cdot \prod_{e \in \mathbf{E}_k} e}_{\phi_k(h_i)}, \tag{3}$$
where we partition the complete set of variables $\mathbf{V}$ into $k$ subsets $\mathbf{V}_i$, each of which consists of the nodes in the corresponding fragment $F_i^H$. $\mathbf{E}_i$ denotes the subset of evidence in fragment $F_i^H$. We can identify factors $\phi_j(h_i)$ ($j = 1, \ldots, k$) whose product is the joint probability for each $h_i$. $\Phi_H$ denotes the set of all factors associated with node $H$, and $\phi_j(h_i)$ denotes the value of factor $\phi_j$ for $h_i$. The d-separation between the fragments directly implies the following:

Proposition 1 The factors $\phi_j$ in (3) are mutually independent, given classification variable $H$.

A factor is independent in the sense that a change in its value does not change the value of another factor. For example, consider the BN shown in Figure 1. $H$ is the classification variable and the evidence is $\mathbf{E} = \{e_1, f_1, g_2\}$. The fragments are $F_1^H = \{A, B, C, E, H\}$, $F_2^H = \{F, H\}$ and $F_3^H = \{G, H\}$, and the evidence sets for the fragments are $\mathbf{E}_1 = \{e_1\}$, $\mathbf{E}_2 = \{f_1\}$ and $\mathbf{E}_3 = \{g_2\}$. The factorization becomes
$$\hat{p}(H = h_i, \mathbf{E}) = \underbrace{\sum_A \hat{p}(A) \sum_C \hat{p}(C \mid A)\, \hat{p}(e_1 \mid C) \sum_B \hat{p}(B)\, \hat{p}(h_i \mid A, B)}_{\phi_1(h_i)} \; \underbrace{\hat{p}(f_1 \mid h_i)}_{\phi_2(h_i)} \; \underbrace{\hat{p}(g_2 \mid h_i)}_{\phi_3(h_i)}. \tag{4}$$

3 Classification Accuracy

Recall that $h^*$ denotes the true (hidden) state of variable $H$ and that we classify $H$ according to (1). We define the accuracy of a classification as follows:

Definition 2 A classification of variable $H$, given evidence $\mathbf{E}$, is accurate iff
$$h^* = \arg\max_{h_i} \hat{p}(H = h_i, \mathbf{E}). \tag{5}$$

In other words, the true state should have the greatest estimated probability. It is difficult to analyze the (in)accuracy and its causes directly from this definition. Therefore, we plug the factorization from the previous section into (5). Recall that $\phi_i(h_j)$ is the value of factor $\phi_i$ for state $h_j$ of the classification node $H$, given evidence $\mathbf{E}_i$. Since factors $\phi_1(h_j), \ldots, \phi_k(h_j)$ are mutually conditionally independent, we can define the following accuracy condition for factors:


Definition 3 Factor $\phi_i$ supports an accurate classification iff
$$h^* = \arg\max_{h_j} \phi_i(h_j). \tag{6}$$

We say that $\phi_i$ reinforces state $\arg\max_{h_j} \phi_i(h_j)$ of $H$. In other words, we call a factor accurate if it gives the true state a greater value than all other states of $H$. The intuition behind this is that a factor is accurate if it contributes to an accurate classification, which is made clear through the following observation. Consider a BN and factorized probability distribution $\hat{p}(H, \mathbf{E})$. Suppose we augment its DAG by adding a new fragment $F_i^H$, which corresponds to a new factor $\phi_i$ in the factorization. If $\phi_i$ satisfies Definition 3, then it contributes towards an accurate classification by increasing the probability of the true state $h^*$ relative to all other states $h_k \neq h^*$. Let $\mathbf{E}'$ denote the union of the original evidence $\mathbf{E}$ and the evidence from the new fragment. The relative probability for $H$ can be expressed as
$$\forall h_k \neq h^*: \quad \frac{\hat{p}(h^*, \mathbf{E}')}{\hat{p}(h_k, \mathbf{E}')} = \frac{\phi_i(h^*)}{\phi_i(h_k)} \cdot \frac{\hat{p}(h^*, \mathbf{E})}{\hat{p}(h_k, \mathbf{E})} > \frac{\hat{p}(h^*, \mathbf{E})}{\hat{p}(h_k, \mathbf{E})}. \tag{7}$$

This inequality holds since $\phi_i(h^*)/\phi_i(h_k) > 1$ for all $h_k \neq h^*$ if $\phi_i$ satisfies Definition 3. Summarizing, a classification is accurate as long as the joint probability of the true state is the greatest. This happens if a sufficient number of factors contribute to an accurate classification by satisfying Definition 3.

4 Fault Probability

In Definition 3 we said that a factor does not support accurate classification if, given the evidence, it does not satisfy Condition (6). In that case, we call the factor, or the corresponding fragment, inadequate for the current classification task. We also use the term fault to denote the same violation of (6). A factor $\phi_i(H)$ is obtained by combining parameters from one or more CPTs from the corresponding network fragment $F_i^H$. This combination depends on the evidence, which in turn depends on the true distributions over the modeled events. Thus, with a certain probability we encounter a situation in which factor $\phi_i(H)$ is adequate, i.e. it satisfies Condition (6). We can show that this probability depends on the true distributions and on simple relations between the true distributions and the CPT parameters.

We can facilitate further analysis by using the concept of factor reinforcements to characterize the influence of a single CPT. For the sake of clarity, we focus on diagnostic inference only. For example, consider a CPT $\hat{p}(E \mid H)$ relating variables $H$ and $E$. If one of the two variables is instantiated, we can compute a reinforcement at the other, related variable. For the instantiation $E = e^*$, one of the factors associated with $H$ can be expressed as $\phi_i(H) = \hat{p}(e^* \mid H)$. In this situation the factors are identical to CPT parameters, and the adequacy of CPTs can be defined. If all CPTs from $F_i^H$ were adequate in a given situation, then $\phi_i$ would be adequate as well. This is often not the case, however: whether a factor $\phi_i$ is adequate depends on which CPTs from the corresponding fragment are inadequate. Obviously, the higher the probability that any CPT from a fragment $F_i^H$ is adequate, the higher the probability that $F_i^H$ is adequate as well. We need to assume a lower bound $p_{re}$ on the probability that, in a certain case, a single CPT is adequate and supports an accurate reinforcement. In this section we argue that $p_{re} > 0.5$ is a plausible assumption.
       h1    h2
  b1   0.7   0.4
  b2   0.2   0.3
  b3   0.1   0.3

Table 1: Example $\hat{p}(B \mid H)$.

Consider a fragment $F_i^H$ consisting of only two adjacent nodes $H$ and $E$, whose relation is defined by a single CPT. Definition 3 implies that $F_i^H$ is adequate if the corresponding factor $\phi_i(h_j)$ is greatest for the true state $h^*$. Thus, the CPT is adequate if the state $h^*$ causes evidence $e_i$ such that, after instantiation of the corresponding variable in $F_i^H$, factor $\phi_i(h^*)$ is greatest. For each state $h_i$ of $H$ we first define the set of states of $E$ for which the CPT parameters satisfy Condition (6):
$$B_{h_i} = \{e_k \mid \forall h_j \neq h_i: \hat{p}(e_k \mid h_i) > \hat{p}(e_k \mid h_j)\}.$$
In addition, for each possible state $h_i$ we can express the probability $p_{h_i}$ that a state from $B_{h_i}$ will take place:
$$p_{h_i} = \sum_{e_j \in B_{h_i}} p(e_j \mid h_i), \tag{8}$$

where $p(e_j \mid h_i)$ describes the true distributions. In other words, $p_{h_i}$ is the probability that cause $h_i$ will result in an effect for which the CPT parameters satisfy (6). $p_{re}$ is defined as the lower bound on the probability that a CPT $\hat{p}(E \mid H)$ will be adequate: $p_{re} = \min_i(p_{h_i})$.

For example, assume a simple model consisting of two nodes $B$ and $H$ related through the CPT shown in Table 1, which is identical to the true probabilities in the domain. Suppose that $h_2$ is the (hidden) true state of $H$, so $h^* = h_2$. For the corresponding factor $\phi(H) = \hat{p}(B \mid H)$ to be adequate, $h_2$ should cause evidence $b_k$ such that the factor reinforces $h_2$ (see Definition 3). We can see that the evidence sets for which $h_2 = \arg\max \phi(h_i)$ are $\{b_2\}$ and $\{b_3\}$ (in this case $\arg\max \phi(h_i)$ selects the maximum value in a row of the CPT). The probability that either of these states is caused by $h_2$ is $\hat{p}(b_2 \vee b_3 \mid h_2) = 0.6$. Similarly, if $h_1$ were the true state of $H$, we would get $\hat{p}(b_1 \mid h_1) = 0.7$. Thus, whatever the true state of $H$, for this example CPT the probability that factor $\phi(H)$ is adequate, $p_{re}$, is at least 0.6.

A consequence of Definition 3 is that $p_{re}$ does not change even if the values in the CPT change to $\hat{p}(b_k \mid h_j) \neq p(b_k \mid h_j)$, as long as simple inequality relations between the CPT values and the true distributions are satisfied:
$$\forall b_k: \quad \arg\max_{h_j} \hat{p}(b_k \mid h_j) = \arg\max_{h_j} p(b_k \mid h_j), \tag{9}$$
where $p$ denotes the true distribution in the problem domain. Note that this relation is very coarse, and we can assume that it can easily be identified by model builders or learning algorithms. For a more thorough discussion see [13].
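The computation of $p_{re}$ from (8) can be sketched as follows, using the CPT of Table 1. The function and variable names are our own, not from the paper; ties in the arg max are broken by the lowest state index.

```python
# Minimal sketch of the p_re lower bound from Section 4, applied to Table 1.
import numpy as np

def p_re_lower_bound(cpt_hat, cpt_true):
    """Lower bound p_re that a CPT p(E|H) supports accurate reinforcement.

    cpt_hat, cpt_true: arrays of shape (n_states_E, n_states_H); columns sum to 1.
    For each state h_i, sum the true probabilities of the evidence states in
    B_{h_i} (those whose estimated CPT row is maximized at h_i, as in (8)),
    then take the minimum over states h_i.
    """
    n_e, n_h = cpt_hat.shape
    p_h = np.zeros(n_h)
    for i in range(n_h):
        in_B = [k for k in range(n_e) if np.argmax(cpt_hat[k]) == i]  # B_{h_i}
        p_h[i] = sum(cpt_true[k, i] for k in in_B)
    return p_h.min()

# Table 1: rows b1..b3, columns h1, h2; model identical to the true distribution.
cpt = np.array([[0.7, 0.4],
                [0.2, 0.3],
                [0.1, 0.3]])
print(p_re_lower_bound(cpt, cpt))  # b1 reinforces h1 (0.7); b2, b3 reinforce h2 (0.6)
```

This reproduces the paper's value $p_{re} = 0.6$ for Table 1.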

4.1 Fault Causes

A CPT does not support accurate classification in a given situation, i.e. it is inadequate, if it does not satisfy Condition (6). We identify three types of faults that cause inadequacies.

4.1.1 Cause 1: Rare Cases

Suppose a CPT is correct in the sense that the CPT parameters are sufficiently close to the true probabilities in order to satisfy (9). In a rare case, however, the true state is not the most

6

Fault Localization in Bayesian Networks

likely state given the effect b∗ that materialized: h∗ 6= arg maxhj p(b∗ |hj ). Then, the case can get misclassified, since the model reinforces the most likely state. As an example, consider a simple domain where the distribution over binary variables F (fire) and S (smoke) is given by p(s|f ) = 0.7, p(s|f ) = 0.7 and p(f ) = 0.5. If the world would be in the rare case {f, s} where we observe S = s, inference would decrease the probability of the true state f , violating Condition (6). 4.1.2

Cause 2: Modeling Inaccuracies

Alternatively, CPT parameters might not satisfy (9). Then, if a case is common, the true state of H is not reinforced. By considering rationale from the previous section, we assume that this fault type is not frequent. In other words, consider a fragment containing evidence bi . If the true probabilities satisfy p(bi |h∗ ) > p(bi |hi ) for all i, but the model parameters satisfy pˆ(bi |hi ) > pˆ(bi |h∗ ) for some i, the CPT does not support accurate classification. We call this a model inaccuracy. 4.1.3

Cause 3: Erroneous Evidence

The evidence inserted into a BN is typically provided by other systems, such as sensors, databases or humans. Observation and interpretation of signals from the world can, however, be influenced by noise or system failures, possibly leading to wrong classifications.
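The rare-case example of Section 4.1.1 can be checked numerically. The sketch below reads the example's two 0.7 parameters as $p(s \mid f)$ and $p(\neg s \mid \neg f)$, an assumption on our part, and shows that observing no smoke pushes the posterior of the true state $f$ below its prior:

```python
# Numeric check of the rare-case example (Cause 1): fire without smoke.
p_f = 0.5
p_ns_given_f = 0.3    # p(no smoke | fire)    = 1 - p(s|f)
p_ns_given_nf = 0.7   # p(no smoke | no fire) = p(not s | not f)

# Bayes' rule for the true state f after observing S = not s
posterior_f = p_ns_given_f * p_f / (p_ns_given_f * p_f + p_ns_given_nf * (1 - p_f))
print(round(posterior_f, 2))  # 0.3: below the prior 0.5, so f is not reinforced
```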

5 Reinforcement Propagation

We introduce a coarse inference algorithm which propagates factor reinforcements through a tree-structured DAG. It only propagates reinforcements from leaves to roots, i.e. it only 'collects' evidence for diagnostic inference. As we show later, this algorithm lets us monitor a model's runtime 'behavior', which can give clues about the adequacy of CPTs. The algorithm is based on the concept of a factor reinforcement, which was already mentioned in Definition 3 but is made more formal here.

Definition 4 (Factor Reinforcement) Given a classification variable $H$, a fragment $F_i^H$ and some instantiation of the evidence variables in $F_i^H$, we define the corresponding factor reinforcement $R_H(\phi_i)$:
$$R_H(\phi_i) = \arg\max_{h_j} \phi_i(h_j). \tag{10}$$

In other words, the reinforcement $R_H(\phi_i)$ is a function that returns the state $h_j$ of variable $H$ whose probability is increased the most (i.e. is reinforced) by instantiating the nodes of fragment $F_i^H$. For example, given factorization (4), we obtain three reinforcements for $H$. If a factor $\phi_i$ is accurate (see Definition 3), then $R_H(\phi_i) = h^*$. Moreover, for any node $H$ we can count how many of its fragments reinforce each of its states. Let $n_i$ be the number of factors reinforcing state $h_i$. We call $N = \{n_1, \ldots, n_m\}$ the set of reinforcement counters, where $m$ is the number of states of $H$. $n_i$ is defined as
$$n_i = \| \{\phi_j \in \Phi_H \mid h_i = R_H(\phi_j)\} \|, \tag{11}$$
where $\| \cdot \|$ denotes the size of a set. Suppose that in our running example the reinforcements were $h_1$, $h_2$ and $h_1$. If $H$ has three states, then $N = \{2, 1, 0\}$. Next, classification chooses the state $h_i$ that was reinforced by the most factors, i.e. that has the greatest reinforcement counter:


Definition 5 (Reinforcement Summary) The reinforcement summary $S_H$ of a node $H$ is defined as
$$S_H = h_i \ \text{s.t.}\ \forall j \neq i: n_i \geq n_j. \tag{12}$$
If $H$ is an evidence node, then $S_H$ is defined as the observed state of $H$.

In our example, where $N = \{2, 1, 0\}$, we get $S_H = h_1$. For BNs with tree-structured DAGs we can summarize the definitions presented above into a coarse inference process. We assume that the evidence nodes are the tree's leaves and that the classification node is the tree's root. Consider a set $\mathcal{V}$ consisting of all leaf nodes. We define a set $\mathcal{P} = \{N_i \notin \mathcal{V} \mid \mathrm{children}(N_i) \subseteq \mathcal{V}\}$ consisting of all nodes not in $\mathcal{V}$ whose children are all elements of $\mathcal{V}$. For each parent $Y \in \mathcal{P}$ we determine the reinforcement summary $S_Y$ resulting from the propagation from its children. Every parent node $Y$ is then instantiated as if the reinforcement summary state returned by $S_Y$ were observed. We then set $\mathcal{V} \leftarrow \mathcal{V} \cup \mathcal{P}$, and the procedure is repeated until the reinforcement summary is determined at the root node $H$. This implies that at all times all nodes in the set $\mathcal{V}$ are leaves and/or instantiated nodes, so the reinforcement summaries can be computed. The procedure is summarized in Algorithm 1.

Algorithm 1: Reinforcement Propagation Algorithm
1  Collect all leaf nodes in the set $\mathcal{V}$;
2  Find $\mathcal{P} = \{N_i \notin \mathcal{V} \mid \mathrm{children}(N_i) \subseteq \mathcal{V}\}$, the set of nodes not in $\mathcal{V}$ whose children are all in $\mathcal{V}$;
3  if $\mathcal{P} \neq \emptyset$ then
       for each node $Y \in \mathcal{P}$ do
           Find the set $\sigma(Y)$ of all instantiated children of $Y$;
           for each node $X_i \in \sigma(Y)$ do
               Compute the reinforcement $R_Y(\phi_i)$ at node $Y$ caused by the instantiation of $X_i$;
           end
           Compute the reinforcement summary $S_Y$ at node $Y$;
           Instantiate node $Y$ as if $S_Y$ were observed (hard evidence);
       end
       Make the parent nodes $\mathcal{P}$ elements of $\mathcal{V}$: $\mathcal{V} \leftarrow \mathcal{V} \cup \mathcal{P}$;
   else
       Stop;
   end
4  Go to step 2;

With this algorithm, we obtain $S_X$ for all unobserved variables $X$ by recursively applying Definitions 4 and 5. In BNs with tree-like DAGs and binary nodes, Algorithm 1 corresponds to a system of hierarchical decoders that implements the repetition coding technique known from information theory (see for example [11], Chapter 1). This implies asymptotic properties: given $p_{re} > 0.5$, as the branching factors increase, the probability that $S_H = h^*$ at any node approaches 1. This property can be explained through binomial distributions, as we show in the next section.
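The leaf-to-root propagation of Algorithm 1 can be sketched as follows for a tree whose edges point from parent to child and whose CPTs are $p(\text{child} \mid \text{parent})$. All names and the data format are our own assumptions; ties in the arg max are broken by the lowest state index. The recursion instantiates children before their parent, matching the bottom-up sweep of the algorithm.

```python
# Minimal sketch of Algorithm 1 (reinforcement propagation) on a tree-shaped BN.
import numpy as np

def reinforcement_propagation(children, cpts, observed, n_states, root):
    """children: dict node -> list of child nodes (tree rooted at `root`);
    cpts: dict child -> array (child_states, parent_states) for p(child|parent);
    observed: dict leaf -> observed state index.
    Returns dict node -> reinforcement summary state S_X (Definition 5)."""
    state = dict(observed)                            # instantiated nodes so far

    def summary(y):
        if y in state:
            return state[y]
        counters = np.zeros(n_states[y], dtype=int)   # reinforcement counters N
        for x in children[y]:
            x_state = summary(x)                      # instantiate the child first
            # Reinforcement R_Y(phi_x): state of Y maximizing p(x_state | Y)
            counters[np.argmax(cpts[x][x_state])] += 1
        state[y] = int(np.argmax(counters))           # summary, instantiated as hard evidence
        return state[y]

    summary(root)
    return state

# Tiny example: binary H with three observed children E, F, G.
children = {'H': ['E', 'F', 'G'], 'E': [], 'F': [], 'G': []}
cpt = np.array([[0.8, 0.3], [0.2, 0.7]])   # p(child | H), rows = child state
cpts = {'E': cpt, 'F': cpt, 'G': cpt}
observed = {'E': 0, 'F': 0, 'G': 1}        # two 'votes' for state 0, one for state 1
print(reinforcement_propagation(children, cpts, observed, {'H': 2}, 'H')['H'])  # 0
```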

5.1 Reinforcement Accuracy

While $p_{re} > 0.5$ is a lower bound on the probability that a particular CPT provides an accurate reinforcement, $p_f$ denotes the probability that a factor reinforcement resulting from Algorithm 1 is accurate. This lower bound will be necessary for the analysis in the upcoming sections. First we need the following assumptions:


Assumption 1 The BN graph contains a high number of nodes with many conditionally independent fragments. Hence, a high number of independent factors can be identified.

Assumption 2 The probability $p_{re} > 0.5$ that any CPT supports an accurate classification in a given situation (Section 4 provides a rationale for this assumption).

Proposition 2 Given Assumptions 1 and 2 and sufficiently high branching factors, the probability $p_f$ that the true state of a classification node will be reinforced by a factor is greater than 0.5.

Proof (Sketch) The factor reinforcement is calculated recursively using Definitions 4 and 5, beginning at the leaf nodes and ending at the classification node. We show that Proposition 2 holds in each recursion step. Let $H$ be a classification node with $k$ factors, let $\phi_i$ be one of the factors associated with $H$, and let $G$ be the child of $H$ from the fragment corresponding to $\phi_i$ (see for example Figure 1). We can write $p_f$ for factor $\phi_i$ as
$$p_f = p_{re}\, p_{sum} + \alpha (1 - p_{re})(1 - p_{sum}), \tag{13}$$
where $p_{sum}$ is the probability that the reinforcement summary at node $G$ equals the true state of $G$. If the reinforcement summary is accurate and the fragment between $H$ and $G$ is adequate, then the reinforcement at $H$ is accurate. The second term represents the situation where the reinforcement summary at $G$ is inaccurate and the fragment between $G$ and $H$ contains a fault; these two errors can cancel each other out, which can result in an accurate reinforcement at $H$. The scalar $0 < \alpha < 1$ represents the probability that such a situation occurs; note that for binary variables $\alpha = 1$.

Next, let $p_f$ be the minimum $p_f$ over all factors associated with $H$. From Definition 5 we can give a lower bound on $p_{sum}$ for node $H$:
$$p_{sum} \geq \sum_{m = \lceil k/2 \rceil}^{k} \binom{k}{m} p_f^m (1 - p_f)^{k-m}. \tag{14}$$
This is a lower bound because the reinforcement summary is defined as the state with the maximum reinforcement counter, which is less restrictive than the absolute majority ($\lceil k/2 \rceil$) used in (14). Assumption 2 states that $p_{re} > 0.5$, and therefore (13) implies that there exists a sufficiently high $p_{sum}$ for which $p_f > 0.5$. In turn, (14) implies that a sufficiently high $p_{sum}$ can be obtained if $p_f > 0.5$ and $k$ is sufficiently high. The recursion starts with the leaf nodes, for which $p_{sum} = 1$ since they are instantiated. Thus, if a network contains enough fragments (Assumption 1) and $p_{re} > 0.5$ (Assumption 2), then $p_f > 0.5$ for all classification nodes. □

For the complete proof see [13]. Additionally, from the above analysis we can observe the following property:

Corollary 3 $p_f$ increases and approaches 1 as the branching factors increase.

6 Fault Monitoring

We want to estimate the adequacy of a particular model fragment for a particular case. It is clear that we cannot directly apply Definition 3, because we do not know the true state of hidden variables and thus cannot evaluate Condition (6). We will show in this section, however, that given certain (in)accuracies a model 'behaves' in a certain way. We call this behavior the model response, describe it in terms of the reinforcements from the previous section, and show that it can give clues to the existence of inaccuracies.

6.1 Factor Consistency

Since the true state of a hidden variable is unknown, it is impossible to determine directly whether or not $R_H(\phi_i) = h^*$ holds. We can, however, use the following definition, whose condition is directly observable and which describes the relationship between multiple factors:

Definition 6 (Factor Consistency) Given any node $H$, a set of factors $\Phi_H$ is consistent iff
$$\forall \phi_i, \phi_j \in \Phi_H: \quad R_H(\phi_i) = R_H(\phi_j).$$

The factors are thus consistent if they all reinforce the same state of $H$. Given that there can be only one true state $h^*$ at a given moment, we observe that if each element of a set of factors $\Phi_H$ satisfies the condition in Definition 3, then that set must be consistent. Conversely, if a set of factors is not consistent, then some elements of that set do not satisfy the condition in Definition 3. Through various faults, we will in most situations observe inconsistent factor sets. In that case we should determine which of the factors in an inconsistent set violate Definition 3. We next show how this can be achieved, using the result from Proposition 2 and by introducing a consistency measure.

6.2 Consistency Measure

We define a measure for the degree of consistency of any factor $\phi_i$ with respect to the observed reinforcements of all factors in a set $\Phi_H$.

Definition 7 (Consistency Measure) Given a node $H$, a set of factors $\Phi_H$, and a reinforcement counter $n_i$ for each state $h_i$ (see Section 5), the consistency measure for a factor $\phi_i \in \Phi_H$ is defined as
$$C_H(\phi_i) = n_j - \max_{k \neq j} n_k,$$
where $h_j$ is the state of $H$ that was reinforced by factor $\phi_i$.

In other words, the consistency measure for a factor $\phi_i$ is equal to the number of factors 'agreeing' with $\phi_i$ (including $\phi_i$ itself), minus the maximum number of reinforcements received by any other state of $H$. In the running example, where the reinforcements were $R_H(\phi_1) = R_H(\phi_3) = h_1$ and $R_H(\phi_2) = h_2$, we get $C_H(\phi_1) = 1$ and $C_H(\phi_2) = -1$. Using this definition we can describe certain relations between the value of the consistency measure and the estimated factor accuracy.
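Definition 7 can be sketched directly from a list of factor reinforcements; the function name is our own.

```python
# Small sketch of Definition 7: one consistency measure per factor.
from collections import Counter

def consistency_measures(reinforcements):
    """For each factor's reinforced state h_j, return C_H = n_j - max_{k != j} n_k."""
    n = Counter(reinforcements)
    out = []
    for state in reinforcements:
        others = [c for s, c in n.items() if s != state]
        out.append(n[state] - (max(others) if others else 0))
    return out

print(consistency_measures(['h1', 'h2', 'h1']))  # [1, -1, 1]
```

For the running example this reproduces $C_H(\phi_1) = C_H(\phi_3) = 1$ and $C_H(\phi_2) = -1$.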

6.3 Estimation of the Summary Accuracy

We use $p = p_f > 0.5$ as the a priori probability that a reinforcement equals the true state, $R_H(\phi_i) = h^*$ (recall Proposition 2). Consider a node $H = \{h_1, \ldots, h_m\}$ and the associated reinforcement counters $N = \{n_1, \ldots, n_m\}$; in a slight abuse of notation, let $N$ also denote the sum over the counters, i.e. the total number of factors. The conditional probability that a particular state $h_i$ equals the true state $h^*$, given that we observed the reinforcement set $N$ and assuming uniform priors over $h^*$, can be expressed as
$$p(h_i = h^* \mid N) = \frac{p^{n_i} (1-p)^{N - n_i}}{\sum_j p^{n_j} (1-p)^{N - n_j}}.$$


The numerator consists of the probability of a correct reinforcement to the power of the number of reinforcements supporting hi , times the probability that a reinforcement is inaccurate to the power of the number of reinforcements not supporting hi . The denominator normalizes the distribution. We want to determine exactly for which degree of consistency CH this conditional probability is greater than 0.5. This is the case if X pni (1 − p)N −ni > pnj (1 − p)N −nj . j6=i

Because of the sum term in the equation this is difficult to express in terms of CH . We take an upper bound of the right hand side of the inequality, and if this new inequality is satisfied P the original is satisfied as well. The upper bound we use is: n max x ≥ x. If we define c = p/(1 − p), then this becomes: cni > (m − 1)cmaxj6=i nj , which is equivalent to: ni − max nj > j6=i

log(m − 1) . log c

(15)

The left hand side is now equal to the consistency measure CH . We also want to determine exactly when p(hi = h∗ |N ) is smaller than 0.5. This is true if X pni (1 − p)N −ni < pnj (1 − p)N −nj . j6=i

We now take a lower bound of the right-hand side of the inequality, so that if this new inequality is satisfied, the original is satisfied as well. For the lower bound we use max x ≤ Σ x, giving c^{n_i} < c^{max_{j ≠ i} n_j}, which is equivalent to

    n_i − max_{j ≠ i} n_j < 0,    (16)

and we derive the following implications:

    C_H < 0  ⇒  p(h_i = h* | N) < 0.5    (17)
    C_H > log(m − 1) / log c  ⇒  p(h_i = h* | N) > 0.5    (18)

C_H here denotes C_H(φ), h_i = R_H(φ), and m is the number of states of H. These implications give the probability of an accurate factor reinforcement, given its consistency measure. This allows us to use an observable quantity (the reinforcement counters) to derive the probability that a particular fragment is adequate. Thus, if a factor has a negative consistency measure, the corresponding fragment probably introduces a fault.

Implication (18) is not trivial to interpret, since the condition depends on the unknown factor c = p/(1 − p). It turns out, however, that without knowing the exact value of c we can often specify an adequate C_H that makes implication (18) valid. It is important to note that the consistency measure can only take integer values. For example, any value of log(m − 1)/log c < 1 requires C_H to be at least 1 in order to satisfy (18). The condition log(m − 1)/log c < 1 is satisfied if p ∈ (p_min, 1]. Table 2 shows the lower bound of the interval (p_min, 1] for which different values of C_H are adequate. These bounds also depend on m. If m = 2, then C_H = 1 is adequate for any p ∈ (0.5, 1]. Recall that we already assumed p to be greater than 0.5.


    m    C_H = 1    C_H = 2
    2      0.50       0.50
    3      0.66       0.58
    4      0.75       0.63
    5      0.80       0.66

Table 2: Minimum value of p that is sufficient to satisfy (18) given a certain value of C_H and m.
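The bounds in Table 2 follow directly from (18): C_H > log(m − 1)/log c is equivalent to c = p/(1 − p) > (m − 1)^{1/C_H}, which can be solved for p. A small sketch (the helper names are mine; the table truncates to two decimals, so the printout does too):

```python
import math

def p_min(m, C):
    """Smallest p for which consistency C satisfies implication (18):
    C > log(m-1)/log(p/(1-p))  <=>  p/(1-p) > (m-1)**(1/C)."""
    if m == 2:
        return 0.5                 # log(m-1) = 0, so any p > 0.5 works
    x = (m - 1) ** (1.0 / C)       # required lower bound on c = p/(1-p)
    return x / (1 + x)

def trunc2(v):
    """Truncate to two decimals, matching Table 2's presentation."""
    return math.floor(v * 100) / 100

for m in (2, 3, 4, 5):
    print(m, trunc2(p_min(m, 1)), trunc2(p_min(m, 2)))
```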

Figure 2: (a) Network section. (b) Comparison at node Y. (c) Comparison at node X.

7 Fault Localization Algorithm

Depending on which CPTs of a network fragment F_i^H are inadequate in a given situation, the resulting factor φ_i might be inaccurate as well. We can often localize inadequate CPTs by using (17) and (18).

Consider a network section consisting of two adjacent nodes, X and Y (see Figure 2). First we consider one particular fragment F_i^Y rooted in Y. At run-time, the consistency measure C_Y(φ_i) can be obtained at node Y for the factor corresponding to F_i^Y (see Figure 2b). This measure, combined with (17) and (18), indicates whether F_i^Y up to node Y is adequate.

Let F′ be fragment F_i^Y plus the edge ⟨X, Y⟩. F′ would be a fragment of X if we removed all fragments of Y except F_i^Y from the graph (see Figure 2c). Let φ′_i be its corresponding factor. We can observe the consistency measure C_X(φ′_i) at node X for fragment F′. To compute this consistency, we need to know the reinforcement R_X(φ′_i). This can be obtained using the reinforcement propagation algorithm by ignoring the reinforcements from all fragments rooted in Y, except for F_i^Y. We then compare the reinforcement of φ′_i on node X with the reinforcements of all other factors of X, and obtain the consistency (see Figure 2c). Again, this gives an indication of the adequacy of the fragment F_i^Y, this time including edge ⟨X, Y⟩. These two consistency measures combined indicate the adequacy of the CPT parameters p̂(Y|X) corresponding to the edge ⟨X, Y⟩. We use the following rule:

Rule 1 Let θ_t and θ_f be thresholds on the consistency measure. If, for any node X, we observe C_X(φ) > θ_t, then we assume x* = R_X(φ). If we observe C_X(φ) < θ_f, then we assume x* ≠ R_X(φ).

Given this rule, we can determine the adequacy of the CPT parameters p̂(Y|X) based on the following intuition: if a fragment is adequate up to Y, but the extended fragment is inadequate up to X, then the fault lies with edge ⟨X, Y⟩. All such localization rules are shown in Table 3.
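Rule 1 and the resulting edge classification can be sketched as follows. This is a hypothetical illustration, assuming both consistency measures are already computed; the function names, and the explicit "unknown" outcome for consistencies between the two thresholds, are mine:

```python
def rule1(C, theta_t, theta_f):
    """Rule 1: True if the reinforcement is assumed to equal the true
    state, False if assumed different, None when undecided."""
    if C > theta_t:
        return True
    if C < theta_f:
        return False
    return None

def classify_edge(C_X, C_Y, theta_t, theta_f):
    """Table 3 logic: the edge <X,Y> is 'ok' when both verdicts agree,
    'inadequate' when they disagree, 'unknown' if Rule 1 is undecided
    at either node."""
    x_ok = rule1(C_X, theta_t, theta_f)
    y_ok = rule1(C_Y, theta_t, theta_f)
    if x_ok is None or y_ok is None:
        return "unknown"
    return "ok" if x_ok == y_ok else "inadequate"

print(classify_edge(2, 2, 1, 0))    # ok: both consistent
print(classify_edge(2, -1, 1, 0))   # inadequate: verdicts disagree
```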
In other words, we compare the consistency at two adjacent nodes and classify the edge between the nodes as adequate or inadequate. We can show that the use of Rule 1 in conjunction with appropriate thresholds guarantees that in most cases the (in)adequacy of the CPT corresponding to ⟨X, Y⟩ is correctly determined.

    x* = R_X(φ′_i)    y* = R_Y(φ_i)        edge ⟨X, Y⟩
    true              true             ⇒   ok
    true              false            ⇒   inadequate
    false             true             ⇒   inadequate
    false             false            ⇒   ok

Table 3: Localization rules. The values in the first two columns correspond to the truth of the equality in the column header.

Proposition 4 (Fault Localization) Given a network with binary nodes containing a sufficient number of fragments and p_f > 0.5 (see Section 5.1), fault localization based on Rule 1 and Table 3, with thresholds θ_t = log(m − 1)/log c and θ_f = 0, will correctly determine whether a particular CPT p̂(Y|X) is adequate or inadequate with more than 50% chance.

Proof If for any node A and factor φ we observe C_A(φ) > θ_t = log(m − 1)/log c, then Rule 1 tells us to assume that a* = R_A(φ). (18) implies that, given p_f > 0.5, the probability p*_A that this assumption is correct, namely that a* truly equals R_A(φ), is p*_A = p(R_A(φ) = a* | N) > 0.5. Analogously, if for any node A and factor φ we observe C_A(φ) < 0, then Rule 1 tells us to assume that a* ≠ R_A(φ). (17) implies that, given p_f > 0.5, the probability p*_A that this assumption is correct, namely that a* truly does not equal R_A(φ), is p*_A = 1 − p(R_A(φ) = a* | N) > 0.5.

This holds for nodes X and Y from Table 3, and thus the probability that we choose the right state in the first two columns of Table 3, and thereby draw the right conclusion about edge ⟨X, Y⟩, is p*_X · p*_Y, where both p*_X > 0.5 and p*_Y > 0.5 as shown above. If we wrongly choose the state of both columns, we draw the same conclusion. The total probability of drawing the correct conclusion is therefore p_correct = p*_X · p*_Y + (1 − p*_X)(1 − p*_Y). It is easy to see that if p*_X > 0.5 and p*_Y > 0.5 then p_correct > 0.5, and that therefore fragment classification using Rule 1, Table 3 and the appropriate thresholds is correct with more than 50% chance. □

We can apply such analysis to all non-terminal nodes by running Algorithm 2.

Algorithm 2: Localization Algorithm

 1: Execute Algorithm 1, and store all factor reinforcements;
 2: for each node X do
 3:     for each fragment F_i^X of X do
 4:         let Y be the child of X within F_i^X;
 5:         for each fragment F_j^Y of Y do
 6:             compute C_X(φ′) and C_Y(φ) for F_j^Y;
 7:             using thresholds θ_t and θ_f, and Table 3, classify CPT p̂(Y|X);
 8:         end
 9:         use majority voting on all classifications of CPT p̂(Y|X) based on the different F_j^Y;
10:     end
11: end

We observe the following property of Algorithm 2:

Corollary 5 The majority voting at the end of Algorithm 2 improves with higher branching factors. Higher branching factors imply more votes about the state of a fragment and therefore a higher expected localization accuracy. This accuracy converges asymptotically to 1 as the branching factors increase.

Note that while the proof is given for networks with binary nodes, the algorithm is likely to be effective for multi-state nodes as well. In that case, the implications on the second and fourth line of Table 3 are not necessarily valid. For example, there are rare circumstances where x* = R_X(φ′_i) and y* ≠ R_Y(φ_i), but where the CPT is nonetheless adequate. This is possible because multiple states of Y (including those not equal to y*) could all reinforce the same true state x*. See for example Table 1, where both b2 and b3 reinforce h2. If h* = h2 and b* = b2, but b3 were instantiated, then the CPT would be deemed inadequate while it was in fact not introducing a fault. If these circumstances do not occur often, the majority voting in Algorithm 2 will mitigate their effects, and the localization will work correctly, especially if the branching factors are high. The experiments in Section 8.1 support this.
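The majority-voting step of Algorithm 2 can be sketched as a small helper. This is a hypothetical illustration, assuming each child fragment yields one per-edge verdict; the tie-breaking toward "ok" and the treatment of undecided votes are my assumptions, not specified by the paper:

```python
from collections import Counter

def vote(classifications):
    """Majority vote over per-fragment verdicts ('ok' / 'inadequate'),
    ignoring undecided votes; ties default to 'ok'."""
    counts = Counter(c for c in classifications if c != "unknown")
    if counts["inadequate"] > counts["ok"]:
        return "inadequate"
    return "ok"

print(vote(["ok", "inadequate", "ok", "unknown"]))   # ok
print(vote(["inadequate", "inadequate", "ok"]))      # inadequate
```

With higher branching factors more verdicts enter the vote, which is exactly why Corollary 5 predicts improving accuracy.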

7.1 Determining the Cause Type

We can distinguish between rare cases (type 1) and model errors (type 2) by their frequency of occurrence. For this we need to perform fault localization on a BN for a set of cases. If certain fragments are diagnosed as inadequate in a large number of cases, this is an indication that the fragment might contain erroneous parameters.

Alternatively, it might be possible to find model errors by localizing faults on a case from the domain which is known not to be rare, in other words, a case for which we know that the true state of every node is the most likely state given the evidence (see Section 4.1, cause 1). This excludes the possibility of faults due to a rare case; any faults found are then probably caused by model errors.
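The frequency-based distinction can be sketched as follows. This is a hypothetical helper (the function name, data layout, and 0.5 flagging threshold are mine): CPTs flagged in a large fraction of cases are suspected model errors, while occasional flags are attributed to rare cases:

```python
from collections import Counter

def suspected_model_errors(flagged_per_case, threshold=0.5):
    """flagged_per_case: one set of flagged CPT identifiers per case.
    Returns the CPTs flagged in more than `threshold` of the cases."""
    n = len(flagged_per_case)
    freq = Counter(cpt for case in flagged_per_case for cpt in case)
    return {cpt for cpt, k in freq.items() if k / n > threshold}

cases = [{"p(Y|X)"}, {"p(Y|X)", "p(Z|X)"}, {"p(Y|X)"}, set()]
print(suspected_model_errors(cases))   # {'p(Y|X)'}: flagged in 3 of 4 cases
```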

8 Experiments

To verify our claims and illustrate some of the properties of Algorithm 2, we applied it to several synthetic networks in which we artificially introduced faults. We also applied it to a real network, which we adapted such that it represents an oversimplification of the problem domain, thus introducing faults.

8.1 Synthetic Networks

We generated BNs with random CPTs, using a simple tree DAG with fixed branching factor k and 4 levels. We initialized all CPTs such that the probability p_re of a CPT being adequate (see Section 4) could be controlled. We let p_re take the values 1, 0.95, ..., 0.4. Then we generated 1000 data samples for each particular network, applied Algorithm 2 to each sample case, and observed its output. We used 0 as the positive threshold θ_t, which meant that the consistency measure had to be at least 1 for a CPT to be assumed adequate. Even though the algorithm does not know the value of c in (18), it turned out that Algorithm 2 is quite insensitive to the precise positive threshold value, which confirms the rationale at the end of Section 6.2.

The algorithm's output was compared with the ground truth, i.e., which fragments really were inadequate for the given data case. This ground truth can be obtained from the complete case, which was known to us. Given the inadequate CPTs present in a given case, we recorded the percentage of CPTs that the algorithm could detect and the percentage of detected inadequacies that turned out to be false positives.

We applied the algorithm to networks with varying branching factors (but the same general structure). The percentages are plotted in Figure 3. For Figure 4 we varied the number of states per network variable. Figure 3 confirms the analysis that for any value of p_re > 0.5, higher branching factors increase the algorithm's effectiveness.

Figure 3: The effect of branching factors on a network with 4-state nodes, for different values of p_re: 0.9 (dash-dotted), 0.7 (solid), 0.5 (dashed). Top curves show percentage found, bottom curves show percentage of false positives.

Figure 4: The effect of the number of node states on a network with branching factor 5, for different values of p_re (horizontal axis). Number of states: 2 (dashed), 3 (solid), 4 (dash-dotted). Top curves show percentage found, bottom curves show percentage of false positives. The dotted line shows the worst case scenario for 3 and 4 states.

Figure 4 also shows that the algorithm performs better on networks with more node states, which can be explained by the fact that in such cases inadequate sample values are spread over more states. For example, suppose that in a certain situation a node is in state 1, but an inadequate fragment has caused a higher belief in a different state. If a node has more states, inaccurate classifications will be spread among more alternatives. Thus, on average, the difference between the counter of the correct state and the other counters increases, making the correct state stand out. For example, given N = {3, 2, 0}, state 1 would have a consistency measure of 1, while for N = {3, 1, 1} it would be 2. Note that the degree of this spread also influences the localization quality, as can be seen from the dotted line in Figure 4. This line shows the effectiveness if we enforce only one alternative state (i.e., if a fragment is inadequate, it always causes the same inaccurate state), which on average decreases the consistency measure. This worst case scenario is equivalent to localization in binary BNs. We expect real networks to lie somewhere between this worst and best case.

Figure 5: Network structure used for the experiment with the real network.

8.2 Real World Experiment

Next, we tested the algorithm on a real network, namely a subtree of the Munin medical diagnosis network [1] (see Figure 5 for the subtree structure). This tree BN is a significant simplification of the problem domain. It was constructed by first manually setting the (simple) network structure and then using the EM algorithm [6] to learn the parameters from a data set sampled from the complete Munin network. Obviously, when we attempt to classify cases using this simple BN, misclassifications occur. The question is whether our algorithm can detect these misclassifications and localize their causes.

We applied the algorithm to the tree BN for a set of sample cases generated by the complete network. Since the state of all (hidden) variables was known in every case, we knew which CPTs were inadequate. On the tree network, the algorithm found 75.7% of all inadequate CPTs, while producing 20.9% false positives, which confirms that the algorithm can be effective in a real-world setting.
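The two evaluation scores used above (detection rate and false-positive rate) can be computed from the ground-truth and detected CPT sets. A minimal sketch; the function and variable names are hypothetical, and both sets are assumed non-empty:

```python
def scores(truly_inadequate, detected):
    """Fraction of truly inadequate CPTs that were detected, and the
    fraction of detections that are false positives."""
    found = len(truly_inadequate & detected) / len(truly_inadequate)
    false_pos = len(detected - truly_inadequate) / len(detected)
    return found, false_pos

truth = {"c1", "c2", "c3", "c4"}        # ground truth for one case
det = {"c1", "c2", "c3", "c9"}          # algorithm output
print(scores(truth, det))               # (0.75, 0.25)
```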

9 Applications

In Section 4.1 we identified three types of causes for classification faults. The presented approach to localization can be used to detect inadequate CPTs and mitigate their impact.

9.1 Localizing Faulty Model Components

The localization algorithm can discover faults of Type 2, where a CPT does not accurately capture the general tendencies in the modeled domain (see (9)). By applying the localization algorithm to many different samples obtained in different situations, we can localize CPTs that are found to be inadequate in the majority of the samples. Such CPTs represent modeling errors, which cannot be avoided if the model is used in changing domains and the learning examples or expertise used for the generation of the model do not capture the characteristics of the new domain. Fault localization can be especially useful in domains which change sufficiently slowly, allowing us to discover local inadequacies and adapt the model gradually to the new domain.


9.2 Deactivating Inadequate Model Components

In the case of faults of Type 1, we can use Algorithm 2 to localize CPTs that are inadequate in a particular situation corresponding to a certain set of observations. A CPT considered inadequate can be set to a uniform distribution, which effectively renders the fragment connected to the rest of the network via this CPT inactive. Since a fragment related to the rest of the network via an inadequate CPT does not support accurate classification in the given situation, its deactivation at runtime can improve the overall inference accuracy.

In principle, by deactivating an inadequate CPT the divergence between the estimated distribution over the hypothesis variable and the true point-mass distribution can be reduced. This is useful if the classification uses decision thresholds greater than 0.5. If for a given observation set the estimated distribution does not approach the true point-mass distribution sufficiently closely, the case cannot be classified. By deactivating a fragment, the percentage of such cases can be reduced without any loss of performance. Since the fault localization algorithm can fail, occasionally adequate CPTs may be considered inadequate, which can reduce the classification quality. However, by considering the properties of the localization algorithm, we can show that it is more likely to encounter cases (i.e., sets of observations) for which the classification quality improves. This is especially the case if the fragments rooted in the hypothesis node have identical topologies and CPTs, which corresponds to models of conditionally independent processes of the same type running in parallel.

Models that support improved classification through fragment deactivation are relevant for a significant class of applications where states of hidden variables are inferred through interpretation (i.e., fusion) of information obtained from large numbers of different sources, such as sensors.
As was shown in [13], such fusion can be based on BNs where each sensor is associated with a conditionally independent fragment given the monitored phenomenon. The improvement of the estimation through deactivation of fragments is illustrated with an experiment. We used a BN with a tree topology, branching factor 5 and 4 levels of nodes, corresponding to 125 leaf nodes. The CPTs at every level were identical, such that p_re = 0.75. This network was used for data generation through sampling. The sampled data sets, consisting of 5000 cases, were fed to two classifiers, both based on a BN identical to the generative model. For one classifier we used fault localization and deactivated inadequate fragments. Compared to the classifier using the unaltered BN, the average posterior probability of the true hypothesis was significantly higher (0.81 instead of 0.76). Furthermore, the divergence between the estimated and the true distribution over the hypothesis variable was reduced in 67% of the data cases. Finally, of the cases that were misclassified by the unaltered BN, 11% were correctly classified after deactivation of inadequate CPTs. In contrast, a correct classification became a misclassification after deactivation in only 2% of the cases.

Furthermore, we assume that a sensor failure is a rare event. Consequently, if a sensor is broken, the CPT relating the monitored phenomenon to the fragment corresponding to the sensor is inadequate. If a few of the existing sensors are broken, we can localize the corresponding CPTs and mitigate the impact of the broken sensors by deactivating the corresponding network fragments.
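The effect of deactivation can be illustrated on a toy fusion model with conditionally independent evidence sources, where the posterior is proportional to the prior times the product of the per-source likelihoods. Replacing a suspect conditional with a uniform distribution makes its factor vanish from the posterior. The numbers below are hypothetical, not the paper's experiment:

```python
def posterior(prior, likelihoods):
    """Posterior over hypothesis states; likelihoods holds one list
    p(e_k | h) per evidence source (naive-Bayes-style fusion)."""
    post = list(prior)
    for lik in likelihoods:
        post = [p * l for p, l in zip(post, lik)]
    Z = sum(post)
    return [p / Z for p in post]

prior = [0.5, 0.5]
good = [0.9, 0.1]        # sensor supporting state h0
bad = [0.2, 0.8]         # suspect (e.g. broken) sensor
uniform = [0.5, 0.5]     # deactivated CPT: contributes nothing

with_bad = posterior(prior, [good, bad])
deactivated = posterior(prior, [good, uniform])
print(with_bad[0] < deactivated[0])   # True: deactivation raises p(h0)
```

Note that the deactivated run is identical to simply leaving the suspect source out, which is exactly the intended 'inactive fragment' behavior.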

10 Discussion

10.1 Non-Tree Networks

The analyses in the sections above are based on tree-structured DAGs. For example, p_re denotes the probability that a single CPT, corresponding to a single edge in the graph, is accurate. Obviously, real domains are often not represented by pure trees. It is possible, however, to convert an arbitrary DAG to a tree structure by compounding multiple nodes into hyper nodes and marginalizing out certain nodes. The states of the hyper nodes are the Cartesian products of the states of the original nodes. Note that this will increase the size of the CPTs, and thus the assumption of p_re > 0.5 becomes more difficult to justify.

We give a simple example to illustrate this claim. Suppose the structure of our model is the DAG shown in Figure 6(a), where all leaf nodes are the evidence nodes. Given node A, nodes C and D are not independent and therefore must be part of the same fragment, together with E. Since we want fragments to consist of only one node, C and D either have to be compounded or marginalized out. To avoid unnecessarily large CPTs, we choose marginalization. Now all the fragments to the left of A consist of only one node. On the right side of the DAG, given H, F is an independent fragment. The fragment consisting of G, J and L cannot be split into multiple fragments given H, and it is not tree structured. It can be converted to a tree by compounding J and L and marginalizing out G. The resulting DAG is shown in Figure 6(b). Note that the structure of this DAG is equal to that of the running example (see Figure 1).

Figure 6: Example network structures: (a) original DAG, (b) tree DAG.
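The conversion for the right-hand side of Figure 6 can be sketched numerically: J and L are compounded into a hyper node JL whose states are the Cartesian product of the original states, and the intermediate node G is marginalized out, p(jl | h) = Σ_g p(g | h) p(j | g) p(l | g). The distributions below are hypothetical placeholders:

```python
from itertools import product

p_g_given_h = {"h": {"g1": 0.6, "g2": 0.4}}
p_j_given_g = {"g1": {"j1": 0.9, "j2": 0.1}, "g2": {"j1": 0.2, "j2": 0.8}}
p_l_given_g = {"g1": {"l1": 0.7, "l2": 0.3}, "g2": {"l1": 0.5, "l2": 0.5}}

def hyper_cpt(h):
    """p(JL | H = h) over Cartesian-product states (j, l),
    with G marginalized out."""
    return {
        (j, l): sum(p_g_given_h[h][g] * p_j_given_g[g][j] * p_l_given_g[g][l]
                    for g in p_g_given_h[h])
        for j, l in product(["j1", "j2"], ["l1", "l2"])
    }

cpt = hyper_cpt("h")
print(len(cpt))                              # 4 hyper states (2 x 2)
print(abs(sum(cpt.values()) - 1.0) < 1e-9)   # True: still a distribution
```

The growth in CPT size is visible directly: each hyper state multiplies the row count, which is why the p_re > 0.5 assumption becomes harder to justify after compounding.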

10.2 Related Work

Several authors have addressed the problems of reliable inference and modeling robustness. Sensitivity-based approaches focus on determining modeling components that have a significant impact on the inference process [3, 2]. We must take special care of such components, since eventual modeling faults will have a great impact as well. Sensitivity analysis is carried out prior to operation and deals with the accuracy of the generalizations.

Another class of approaches, including ours, focuses on determining the model quality or performance in a given situation at runtime. The central idea of our approach is observation of the consistency of the model's runtime reinforcements, which is different from common approaches to runtime analysis of BNs such as data conflict [8] and straw models [10, 9]. The data conflict approach is based on the assumption that, given an adequate model, all observations should be correlated and p̂(e_1, ..., e_n) > p̂(e_1) ··· p̂(e_n). If this inequality is not satisfied, this is an indication that the model does not 'fit' the current set of observations [7]. A generalization of this method [9] is based on the use of straw models. Simpler (straw) models are constructed through partial marginalization and should, in a coherent situation, be less probable than the original model. Situations in which the evidence observations are very unlikely under the original model and more probable under the straw model indicate a data conflict. While these approaches can handle more general BNs than our method, their disadvantage is that the conflict scores are difficult to interpret: at which score should an action be undertaken, and what is the probability that a positive score indicates an error?

Another approach, proposed in [5], is the surprise index of an evidence set.
This index is defined as the joint probability of the evidence set plus the sum of the probabilities of all possible evidence sets that are less probable. If the index has a value below a certain threshold ([5] proposes a threshold of 0.1 or lower), the evidence set is deemed surprising, indicating a possibly erroneous model. Clearly, this approach requires computing the probabilities of an exponentially large number of possible evidence sets, making it intractable for most models.

In addition, most of the common approaches to model checking focus on the net performance of models and do not directly support detection of inaccurate parts of a model [16]. An exception is the approach of [4], based on logarithmic penalty scoring rules. However, in this case the scores can be determined only for the nodes corresponding to observable events, while we reason about the nodes modeling hidden events.
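For concreteness, the data-conflict test of [8] discussed above compares the joint probability of the evidence against the product of its marginals. A minimal sketch, with hypothetical probabilities; the log-ratio form of the score is a common presentation, not a claim about HUGIN's exact implementation:

```python
import math

def conflict_score(joint, marginals):
    """log( prod_i p(e_i) / p(e_1, ..., e_n) ): a positive score means
    the joint is lower than the independence product, signaling that
    the model does not 'fit' the observations."""
    return math.log(math.prod(marginals) / joint)

# coherent evidence: joint exceeds the independence product
print(conflict_score(0.02, [0.1, 0.1]) < 0)    # True: no conflict
# conflicting evidence: joint below the independence product
print(conflict_score(0.005, [0.1, 0.1]) > 0)   # True: conflict
```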

10.3 Conclusion

We have presented an approach to fault detection in BNs that are used for classification. This was done through the following steps:

1. We identified a partitioning of a BN such that each fragment has an independent influence on a classification node.
2. We identified three different fault causes which can be present in a CPT, and argued that 0.5 is a plausible lower bound on the probability of encountering such a fault cause.
3. We presented a coarse view of the inference process and showed how faults can be propagated through different network fragments.
4. We introduced a measure to monitor the consistency among the influences of multiple network fragments on a node, and showed that we can find thresholds on this measure from which we can deduce the probability of a fault existing in a fragment.
5. We presented an algorithm that combines the consistency measures at different nodes in the network in order to determine whether the fragment between the nodes contains a fault.

One might question the assumption about large branching factors. However, there exist applications where this assumption holds, such as the Distributed Perception Networks [12], which deal with hundreds or thousands of observations, each corresponding to a fragment in a BN.

We showed that the results from fault localization can be used in several ways, such as localization of erroneous modeling parameters, faulty information sources, and modeling components that do not support accurate inference in a particular situation due to rare cases. Furthermore, we established a lower bound on the algorithm's effectiveness, which we showed to converge asymptotically to 1 for network topologies with increasing branching factors.

References

[1] S. Andreassen, F. V. Jensen, S. K. Andersen, B. Falck, U. Kjærulff, M. Woldbye, A. R. Sørensen, A. Rosenfalck, and F. Jensen. MUNIN — an expert EMG assistant. In Computer-Aided Electromyography and Expert Systems, chapter 21. Elsevier Science Publishers, Amsterdam, 1989.

[2] E. Castillo, J. M. Gutiérrez, and A. S. Hadi. Sensitivity analysis in discrete Bayesian networks. IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, 27:412–423, 1997.

[3] V. M. H. Coupé and L. C. van der Gaag. Practicable sensitivity analysis of Bayesian belief networks. In Joint Session of the 6th Prague Symposium of Asymptotic Statistics and the 13th Prague Conference on Information Theory, Statistical Decision Functions and Random Processes, pages 81–86, Prague, 1998.


[4] R. G. Cowell, A. P. Dawid, and D. J. Spiegelhalter. Sequential model criticism in probabilistic expert systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(3):209–219, 1993.

[5] J. D. F. Habbema. Models for diagnosis and detection of combinations of diseases. In F. de Dombal et al., editors, Proc. IFIP Conf. on Decision Making and Medical Care, pages 399–411, 1976.

[6] D. Heckerman. A tutorial on learning with Bayesian networks. In M. Jordan, editor, Learning in Graphical Models. MIT Press, Cambridge, MA, 1999.

[7] F. V. Jensen. Bayesian Networks and Decision Graphs. Springer-Verlag, New York, 2001.

[8] F. V. Jensen, B. Chamberlain, T. Nordahl, and F. Jensen. Analysis in HUGIN of data conflict. In Proc. Sixth International Conference on Uncertainty in Artificial Intelligence, pages 519–528, 1990.

[9] Y.-G. Kim and M. Valtorta. On the detection of conflicts in diagnostic Bayesian networks using abstraction. In Proc. Eleventh International Conference on Uncertainty in Artificial Intelligence, pages 362–367, 1995.

[10] K. Laskey. Conflict and surprise: Heuristics for model revision. In Proc. Seventh International Conference on Uncertainty in Artificial Intelligence, pages 197–204, 1991.

[11] D. J. C. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003. Available from http://www.inference.phy.cam.ac.uk/mackay/itila/.

[12] G. Pavlin, M. Maris, and J. Nunnink. An agent-based approach to distributed data and information fusion. In Proc. IEEE/WIC/ACM Joint Conference on Intelligent Agent Technology, pages 466–470, 2004.

[13] G. Pavlin and J. Nunnink. Inference meta models: A new perspective on inference with Bayesian networks. Technical Report IAS-UVA-06-01, Informatics Institute, University of Amsterdam, The Netherlands, 2006.

[14] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.

[15] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2000.

[16] L. C. van der Gaag and S. Renooij. Evaluation scores for probabilistic networks. In Proceedings of the 13th Belgium-Netherlands Conference on Artificial Intelligence, pages 109–116, 2001.


Acknowledgements

IAS reports

This report is in the series of IAS technical reports. The series editor is Bas Terwijn ([email protected]). Within this series the following titles appeared:

F. Oliehoek and N. Vlassis. Dec-POMDPs and extensive form games: equivalence of models and algorithms. Technical Report IAS-UVA-06-02, Informatics Institute, University of Amsterdam, The Netherlands, April 2006.

G. Pavlin, J. Nunnink and F. Groen. Inference meta models: A new perspective on belief propagation with Bayesian networks. Technical Report IAS-UVA-06-01, Informatics Institute, University of Amsterdam, The Netherlands, March 2006.

Z. Zivkovic and O. Booij. How did we built our hyperbolic mirror omni-directional camera - practical issues and basic geometry. Technical Report IAS-UVA-05-04, Informatics Institute, University of Amsterdam, The Netherlands, December 2005.

All IAS technical reports are available for download at the IAS website, http://www.science.uva.nl/research/ias/publications/reports/.


Intelligent Autonomous Systems Informatics Institute, Faculty of Science University of Amsterdam Kruislaan 403, 1098 SJ Amsterdam The Netherlands Tel (fax): +31 20 525 7461 (7490) http://www.science.uva.nl/research/ias/

Corresponding author: Jan Nunnink tel: +31 20 525 7517 [email protected] http://www.science.uva.nl/~jnunnink/

Copyright IAS, 2006

1 Introduction

Discrete Bayesian networks (BNs) are rigorous and powerful probabilistic models for reasoning about domains which contain a significant amount of uncertainty. They are often used as classifiers for decision making or state estimation in a filtering context. BNs are especially suited for modeling causal relationships [15]. Using such a causal model, we propagate the values of observed variables to probability distributions over unobserved variables [14, 7]. These posterior distributions are often the basis for a classification process. Moreover, in many applications the classification result is mission critical (e.g. situation assessment in crisis circumstances).

In this context, we emphasize the difference between generalization accuracy and classification accuracy. In general, classification is based on models that are generalizations over many different situations. Causal BNs capture such generalizations through conditional probability distributions over related events. However, accurate generalizations do not guarantee accurate classification in a particular case. In a rare situation, a set of observations can result in an erroneous classification even if the model precisely describes the true distributions: in that particular situation a certain portion of the conditional distributions does not 'support' accurate inference, since it does not model the rare case. By using such inadequate relations for inference, the posterior probability of the true (unobserved) state is reduced. In addition, the probability of encountering situations for which a modeled relation is inadequate increases with the divergence between the true and the modeled probability distributions.

Inadequate relations influence the inference process and the subsequent classification. This is reflected in the way the model 'behaves' under different circumstances. We can monitor a BN and, from its behavior, draw conclusions about the existence of inadequate parts.
This is based on the following principle: a given classification node splits the network into several independent fragments. These fragments can be seen as different experts giving independent 'votes' about the state of that node. The degree to which they 'agree' on a state is a measure of the accuracy of the classification. The data conflict approach by Jensen et al. [8] (as well as [10] and [9]) uses a roughly similar principle. It is based on the assumption that, given an adequate model, all evidence should be correlated, and hence the joint probability of all evidence should be greater than the product of the individual evidence probabilities. More discussion of related approaches can be found in Section 10.2.

Using our measure, we present a method for localizing possible inaccuracies in general, and show how to use this method to determine the type of cause. The advantage of this method over previous work is that we can estimate a lower bound on its effectiveness, and that this lower bound has asymptotic properties given the network topology. This allows one to determine for which networks localization works best. Furthermore, the localization procedure is more straightforward than the existing methods. Possible applications for the proposed method are presented in Section 9. Among its uses are:

• Localized model errors can be manually corrected or relearned.
• Localized inaccurate information sources can be deactivated at run-time to improve the classification.

2 Bayesian networks and classification

A Bayesian network BN is defined as a tuple ⟨D, p̂⟩, where D = ⟨V, E⟩ is a directed acyclic graph (DAG) consisting of a set of nodes V = {V₁, ..., Vₙ} and a set of directed edges ⟨Vi, Vj⟩ ∈ E between pairs of nodes. Each node corresponds to a variable and p̂ is the joint probability


Figure 1: An example Bayesian network. Nodes {A, B, C, E} represent fragment $F_1^H$, which is rooted in node H. H has a branching factor of 3. A has branching factor 2.

distribution over all variables, defined as

$$\hat{p}(\mathbf{V}) = \prod_{V_i \in \mathbf{V}} \hat{p}(V_i \mid \pi(V_i)),$$

where p̂(Vi | π(Vi)) is the conditional probability table (CPT) for node Vi given its parents π(Vi) in the graph. In this paper we use p̂ to refer specifically to estimated values, such as modeling parameters (CPTs) and posterior probabilities, while p (without hat) denotes the true probabilities in the modeled world. We assume that all probabilities that we can estimate have a corresponding true value in the real world. Each variable has a finite number of discrete states (or values), denoted by lower case letters. The DAG represents the causal structure of the domain, and the conditional probabilities encode the causal strength. Furthermore, we denote a set of observations or evidence about the state of variables by E, the main classification or hypothesis node by H, and the states of that node by h_i. The (hidden) true state of H is denoted by h^*. A variable H is classified as the state h_i for which

$$h_i = \arg\max_{h_j} \hat{p}(H = h_j, \mathbf{E}). \tag{1}$$

$\hat{p}(H = h_j, \mathbf{E})$ is obtained from the joint probability distribution by marginalizing over all variables except H and those in E, and factorizing the joint probability distribution, as follows:

$$\hat{p}(H = h_j, \mathbf{E}) = \sum_{\mathbf{V} \setminus H} \prod_{V_i \in \mathbf{V}} \hat{p}(V_i \mid \pi(V_i)) \cdot \prod_{e \in \mathbf{E}} e. \tag{2}$$

We distinguish between evidence variables and non-evidence variables. Where a non-evidence variable (other than H) appears in the equation, we marginalize it out by summing over its states. Where an evidence variable appears, it is replaced by its observed state. This is done through multiplication with the vector e, which contains 1s at the entries corresponding to the observed states and 0s elsewhere. Note that this involves the multiplication of potentials; see also [7], Section 1.4.6.

2.1 Factorization

For the analysis in the next sections we require a notion of dependence between different parts of a network given the classification variable. We can use d-separation [14] to identify fragments of the DAG which are conditionally independent given the classification node.

Definition 1 Given a DAG and classification node H, we identify a set of fragments $F_i^H$ (i = 1, ..., k) which are all pairwise d-separated given H. A fragment is defined as a set of nodes that includes H.

Hence, the nodes in a fragment are conditionally independent of the nodes in the other fragments given H. Node H is called the root of all fragments $F_i^H$. The number k of fragments rooted in H is called the branching factor.


See Figure 1 for an example (we will use this example throughout the paper). Given node H, three fragments can be identified, namely (i) nodes {A, B, C, E, H}, (ii) nodes {F, H}, and (iii) nodes {G, H}. If A were the classification variable, we could identify two fragments, namely (i) nodes {A, C, E}, and (ii) the rest of the nodes plus A.

This fragmentation has the useful property that the classification equation (2) can be factorized such that each factor corresponds one-to-one with a fragment. By splitting the sums and products and regrouping them per fragment, we rewrite (2) as

$$
\hat{p}(H = h_i, \mathbf{E}) = \sum_{\mathbf{V} \setminus H} \prod_{V_j \in \mathbf{V}} \hat{p}(V_j \mid \pi(V_j)) \cdot \prod_{e \in \mathbf{E}} e
$$
$$
= \underbrace{\hat{p}(h_i \mid \pi(H)) \sum_{\mathbf{V}_1 \setminus H} \prod_{V_j \in \mathbf{V}_1 \setminus H} \hat{p}(V_j \mid \pi(V_j)) \cdot \prod_{e \in \mathbf{E}_1} e}_{\phi_1(h_i)} \;\cdots\; \underbrace{\sum_{\mathbf{V}_k \setminus H} \prod_{V_j \in \mathbf{V}_k \setminus H} \hat{p}(V_j \mid \pi(V_j)) \cdot \prod_{e \in \mathbf{E}_k} e}_{\phi_k(h_i)}, \tag{3}
$$

where we partition the complete set of variables V into k subsets V_i, each consisting of the nodes in the corresponding fragment $F_i^H$. E_i denotes the subset of evidence in fragment $F_i^H$. We can identify factors φ_j(h_i) (j = 1, ..., k) whose product is the joint probability for each h_i. Φ_H denotes the set of all factors associated with node H, and φ_j(h_i) denotes the value of factor φ_j for h_i. The d-separation between the fragments directly implies the following:

Proposition 1 The factors φ_j in (3) are mutually independent, given classification variable H.

A factor is independent in the sense that a change in its value does not change the value of another factor. For example, consider the BN shown in Figure 1. H is the classification variable and the evidence is E = {e_1, f_1, g_2}. The fragments are $F_1^H$ = {A, B, C, E, H}, $F_2^H$ = {F, H} and $F_3^H$ = {G, H}, and the evidence sets for the fragments are E_1 = {e_1}, E_2 = {f_1} and E_3 = {g_2}. The factorization becomes

$$
\hat{p}(H = h_i, \mathbf{E}) = \underbrace{\sum_{A} \hat{p}(A) \sum_{C} \hat{p}(C \mid A)\,\hat{p}(e_1 \mid C) \sum_{B} \hat{p}(B)\,\hat{p}(h_i \mid A, B)}_{\phi_1(h_i)} \cdot \underbrace{\hat{p}(f_1 \mid h_i)}_{\phi_2(h_i)} \cdot \underbrace{\hat{p}(g_2 \mid h_i)}_{\phi_3(h_i)}. \tag{4}
$$
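The factorization (4) can be checked numerically: each factor is computed from its own fragment's CPTs and evidence, and the product of the factors recovers the joint obtained by brute-force enumeration. The sketch below uses the structure of Figure 1 with made-up binary CPTs (all numbers are illustrative, not from the paper):

```python
# Binary states indexed 0/1; all CPT values below are invented for illustration.
pA = [0.7, 0.3]                                   # p(A)
pB = [0.5, 0.5]                                   # p(B)
pC_A = [[0.8, 0.2], [0.4, 0.6]]                   # pC_A[a][c] = p(C=c | A=a)
pE_C = [[0.9, 0.1], [0.2, 0.8]]                   # pE_C[c][e] = p(E=e | C=c)
pH_AB = {(0, 0): [0.9, 0.1], (0, 1): [0.6, 0.4],
         (1, 0): [0.3, 0.7], (1, 1): [0.1, 0.9]}  # p(H | A, B)
pF_H = [[0.75, 0.25], [0.35, 0.65]]               # pF_H[h][f] = p(F=f | H=h)
pG_H = [[0.6, 0.4], [0.1, 0.9]]                   # pG_H[h][g] = p(G=g | H=h)

e1, f1, g2 = 0, 0, 1                              # observed states of E, F, G

def phi1(h):
    """Factor of fragment {A, B, C, E, H}: sum out A, B, C with E = e1."""
    return sum(pA[a]
               * sum(pC_A[a][c] * pE_C[c][e1] for c in (0, 1))
               * sum(pB[b] * pH_AB[(a, b)][h] for b in (0, 1))
               for a in (0, 1))

def phi2(h): return pF_H[h][f1]                   # fragment {F, H}
def phi3(h): return pG_H[h][g2]                   # fragment {G, H}

def brute(h):
    """p(H=h, E) by enumerating the full joint distribution directly."""
    return sum(pA[a] * pB[b] * pC_A[a][c] * pH_AB[(a, b)][h]
               * pE_C[c][e1] * pF_H[h][f1] * pG_H[h][g2]
               for a in (0, 1) for b in (0, 1) for c in (0, 1))

for h in (0, 1):
    assert abs(phi1(h) * phi2(h) * phi3(h) - brute(h)) < 1e-12
```

The assertion passing for both states of H confirms that the per-fragment factors multiply back to the joint, which is exactly the independence claimed in Proposition 1.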

3 Classification Accuracy

Recall that h^* denotes the true (hidden) state of variable H and that we classify H according to (1). We define the accuracy of a classification as follows:

Definition 2 A classification of variable H, given evidence E, is accurate iff

$$h^* = \arg\max_{h_i} \hat{p}(H = h_i, \mathbf{E}). \tag{5}$$

In other words, the true state should have the greatest estimated probability. It is difficult to analyze the (in)accuracy and its causes directly from this definition. Therefore, we plug the factorization from the previous section into (5). Recall that φ_i(h_j) is the value of factor φ_i for state h_j of the classification node H, given evidence E_i. Since the factors φ_1(h_j), ..., φ_k(h_j) are mutually conditionally independent, we can define the following accuracy condition for factors:


Definition 3 Factor φ_i supports an accurate classification iff

$$h^* = \arg\max_{h_j} \phi_i(h_j). \tag{6}$$

We say that φ_i reinforces the state $\arg\max_{h_j} \phi_i(h_j)$ of H. In other words, we call a factor accurate if it gives the true state a greater value than all other states of H. The intuition behind this is that an accurate factor contributes to an accurate classification, which is made clear through the following observation. Consider a BN and the factorized probability distribution p̂(H, E). Suppose we augment its DAG by adding a new fragment $F_i^H$, which corresponds to a new factor φ_i in the factorization. If φ_i satisfies Definition 3, then it contributes towards an accurate classification by increasing the probability of the true state h^* relative to every other state h_k ≠ h^*. Let E′ denote the union of the original evidence E and the evidence from the new fragment. The relative probability for H can be expressed as, for all h_k ≠ h^*,

$$\frac{\hat{p}(h^*, \mathbf{E}')}{\hat{p}(h_k, \mathbf{E}')} = \frac{\phi_i(h^*)}{\phi_i(h_k)} \cdot \frac{\hat{p}(h^*, \mathbf{E})}{\hat{p}(h_k, \mathbf{E})} > \frac{\hat{p}(h^*, \mathbf{E})}{\hat{p}(h_k, \mathbf{E})}. \tag{7}$$

This inequality is immediate, since φ_i(h^*)/φ_i(h_k) > 1 for all h_k ≠ h^* if φ_i satisfies Definition 3. Summarizing, a classification is accurate as long as the joint probability of the true state is the greatest. This happens if a sufficient number of factors contribute to an accurate classification by satisfying Definition 3.

4 Fault Probability

In Definition 3, we say that a factor does not support accurate classification if, given the evidence, it does not satisfy Condition (6). In that case, we call the factor, or the corresponding fragment, inadequate for the current classification task. We also use the term fault to denote the same violation of (6).

A factor φ_i(H) is obtained by combining parameters from one or more CPTs from the corresponding network fragment $F_i^H$. This combination depends on the evidence, which in turn depends on the true distributions over the modeled events. Thus, with a certain probability we encounter a situation in which factor φ_i(H) is adequate, i.e. satisfies Condition (6). We can show that this probability depends on the true distributions and on simple relations between the true distributions and the CPT parameters.

We can facilitate further analysis by using the concept of factor reinforcements to characterize the influence of a single CPT. For the sake of clarity, we focus on diagnostic inference only. For example, consider a CPT p̂(E|H) relating variables H and E. If one of the two variables is instantiated, we can compute a reinforcement at the other, related variable. For the instantiation E = e^*, one of the factors associated with H can be expressed as φ_i(H) = p̂(e^*|H). In this situation factors are identical to CPT parameters, and the adequacy of CPTs can be defined. If all CPTs from $F_i^H$ were adequate in a given situation, then φ_i would be adequate as well. This is often not the case, however: whether a factor φ_i is adequate depends on which CPTs from the corresponding fragment are inadequate. Obviously, the higher the probability that the CPTs from a fragment $F_i^H$ are adequate, the higher the probability that the corresponding factor is adequate as well. We need to assume a lower bound p_re on the probability that, in a given case, a single CPT is adequate and supports accurate reinforcement. In this section we argue that p_re > 0.5 is a plausible assumption.
Consider a fragment $F_i^H$ consisting of only two adjacent nodes H and


        h1     h2
  b1    0.7    0.4
  b2    0.2    0.3
  b3    0.1    0.3

Table 1: Example p̂(B|H).

E, whose relation is defined by a single CPT. Definition 3 implies that $F_i^H$ is adequate if the corresponding factor φ_i(h_j) is greatest for the true state h^*. Thus, the CPT is adequate if the state h^* causes such evidence e_i that, after instantiation of the corresponding variable in $F_i^H$, factor φ_i(h^*) is greatest. For each state h_i of H we first define the set of states of E for which the CPT parameters satisfy Condition (6):

$$B_{h_i} = \{e_k \mid \forall h_j \neq h_i : \hat{p}(e_k \mid h_i) > \hat{p}(e_k \mid h_j)\}.$$

In addition, for each possible state h_i we can express the probability $p_{h_i}$ that a state from $B_{h_i}$ will take place:

$$p_{h_i} = \sum_{e_j \in B_{h_i}} p(e_j \mid h_i), \tag{8}$$

where the p(e_j | h_i) describe the true distributions. In other words, $p_{h_i}$ is the probability that cause h_i will result in an effect for which the CPT parameters satisfy (6). p_re is defined as the lower bound on the probability that a CPT p̂(E|H) will be adequate: $p_{re} = \min_i(p_{h_i})$.

For example, let us assume a simple model consisting of two nodes B and H related through the CPT shown in Table 1, which is identical to the true probabilities in the domain. Suppose that h_2 is the (hidden) true state of H, so h^* = h_2. For the corresponding factor φ(H) = p̂(B|H) to be adequate, h_2 should cause evidence b_k such that the factor reinforces h_2 (see Definition 3). We can see that the evidence states for which h_2 = arg max φ(h_i) are b_2 and b_3 (in this case arg max φ(h_i) returns the maximum value in a row of the CPT). The probability that either of these states is caused by h_2 is p̂([b_2 ∨ b_3] | h_2) = 0.6. Similarly, if h_1 were the true state of H, we would get p̂(b_1 | h_1) = 0.7. Thus, whichever the true state of H, for this example CPT the probability p_re that factor φ(H) is adequate is at least 0.6.

A consequence of Definition 3 is that p_re does not change even if the values in the CPT change to p̂(b_k | h_j) ≠ p(b_k | h_j), as long as simple inequality relations between the CPT values and the true distributions are satisfied:

$$\forall b_k : \arg\max_{h_j} \hat{p}(b_k \mid h_j) = \arg\max_{h_j} p(b_k \mid h_j), \tag{9}$$

where p denotes the true distribution in the problem domain. Note that this relation is very coarse, and we can assume that it can easily be identified by model builders or learning algorithms. For a more thorough discussion see [13].
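The quantities $B_{h_i}$, $p_{h_i}$ and the bound p_re can be computed mechanically from a CPT and the true conditional distributions. The sketch below reproduces the Table 1 example, where the CPT is identical to the true distribution:

```python
# cpt[h][b] = p(B=b | H=h); in this example the model equals the true distribution.
cpt = {"h1": {"b1": 0.7, "b2": 0.2, "b3": 0.1},
       "h2": {"b1": 0.4, "b2": 0.3, "b3": 0.3}}
true_p = cpt

def p_re(cpt, true_p):
    """Lower bound on the probability that the CPT supports accurate reinforcement."""
    bounds = []
    for h in cpt:
        # B_h: evidence states on which this state's parameter strictly dominates
        B_h = [b for b in cpt[h]
               if all(cpt[h][b] > cpt[g][b] for g in cpt if g != h)]
        bounds.append(sum(true_p[h][b] for b in B_h))   # eq. (8)
    return min(bounds)                                   # p_re = min_i p_{h_i}

print(p_re(cpt, true_p))  # about 0.6, as in the Table 1 example
```

For h_1 the dominating set is {b_1} with mass 0.7; for h_2 it is {b_2, b_3} with mass 0.6, so the minimum, and hence p_re, is 0.6.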

4.1 Fault Causes

A CPT does not support accurate classification in a given situation, i.e. it is inadequate, if it does not satisfy Condition (6). We identify three types of faults that cause inadequacies.

4.1.1 Cause 1: Rare Cases

Suppose a CPT is correct in the sense that its parameters are sufficiently close to the true probabilities to satisfy (9). In a rare case, however, the true state is not the most likely state given the effect b^* that materialized: h^* ≠ arg max_{h_j} p(b^* | h_j). Then the case can get misclassified, since the model reinforces the most likely state. As an example, consider a simple domain where the distribution over binary variables F (fire) and S (smoke) is given by p(s|f) = 0.7, p(s̄|f̄) = 0.7 and p(f) = 0.5. If the world were in the rare case {f, s̄}, where we observe S = s̄, inference would decrease the probability of the true state f, violating Condition (6).

4.1.2 Cause 2: Modeling Inaccuracies

Alternatively, the CPT parameters might not satisfy (9). Then, even in a common case, the true state of H is not reinforced. Given the rationale from the previous section, we assume that this fault type is not frequent. In other words, consider a fragment containing evidence b_i. If the true probabilities satisfy p(b_i | h^*) > p(b_i | h_j) for all h_j ≠ h^*, but the model parameters satisfy p̂(b_i | h_j) > p̂(b_i | h^*) for some h_j ≠ h^*, the CPT does not support accurate classification. We call this a model inaccuracy.

4.1.3 Cause 3: Erroneous Evidence

The evidence inserted into a BN is typically provided by other systems, such as sensors, databases or humans. Observation and interpretation of signals from the world can, however, be influenced by noise or system failures, possibly leading to wrong classifications.

5 Reinforcement Propagation

We introduce a coarse inference algorithm which propagates factor reinforcements through a tree-structured DAG. It only propagates reinforcements from the leaves to the root, i.e. it only 'collects' evidence for diagnostic inference. As we show later, this algorithm allows us to monitor a model's runtime 'behavior', which can give clues about the adequacy of CPTs. The algorithm is based on the concept of a factor reinforcement, which was already mentioned in Definition 3 but is made more formal here.

Definition 4 (Factor Reinforcement) Given a classification variable H, a fragment $F_i^H$ and some instantiation of the evidence variables in $F_i^H$, we define the corresponding factor reinforcement $R_H(\phi_i)$:

$$R_H(\phi_i) = \arg\max_{h_j} \phi_i(h_j). \tag{10}$$

In other words, the reinforcement $R_H(\phi_i)$ is a function that returns the state h_j of variable H whose probability is increased the most (i.e. is reinforced) by instantiating the nodes of fragment $F_i^H$. For example, given factorization (4), we obtain three reinforcements for H. If a factor φ_i is accurate (see Definition 3), then $R_H(\phi_i) = h^*$. Moreover, for any node H we can count how many of its fragments reinforced each of its states. Let n_i be the number of factors reinforcing state h_i. We call N = {n_1, ..., n_m} the set of reinforcement counters, where m is the number of states of H. n_i is defined as

$$n_i = \| \{\phi_j \in \Phi_H \mid h_i = R_H(\phi_j)\} \|, \tag{11}$$

where ‖·‖ denotes the size of a set. Suppose that in our running example the reinforcements were h_1, h_2 and h_1. If H had three states, then N = {2, 1, 0}. Next, classification chooses the state h_i which got reinforced by the most factors, i.e. which has the greatest reinforcement counter:


Definition 5 (Reinforcement Summary) The reinforcement summary $S_H$ of a node H is defined as

$$S_H = h_i \quad \text{s.t.} \quad \forall j \neq i : n_i \geq n_j. \tag{12}$$

If H is an evidence node, then $S_H$ is defined as the observed state of H. In our example, where N = {2, 1, 0}, we get $S_H = h_1$.

For BNs with tree-structured DAGs we can summarize the definitions presented above into a coarse inference process. We assume that the evidence nodes are the tree's leaves and the classification node is the tree's root. Consider a set V consisting of all leaf nodes. We define a set P = {N_i ∉ V | children(N_i) ⊆ V} consisting of all nodes not in V whose children are all elements of V. For each parent Y ∈ P we determine the reinforcement summary $S_Y$ resulting from the propagation from its children. Every parent node Y is then instantiated as if the reinforcement summary state $S_Y$ were observed. We then set V ← V ∪ P, and the procedure is repeated until the reinforcement summary is determined at the root node H. This implies that at all times all nodes in the set V are leaves and/or instantiated nodes, so the reinforcement summaries can be computed. The procedure is summarized in Algorithm 1.

Algorithm 1: Reinforcement Propagation Algorithm
1. Collect all leaf nodes in the set V;
2. Find P = {N_i ∉ V | children(N_i) ⊆ V}, the set of nodes whose children are all in V;
3. if P ≠ ∅ then
     for each node Y ∈ P do
       Find the set σ(Y) of all instantiated children of Y;
       for each node X_i ∈ σ(Y) do
         Compute the reinforcement R_Y(φ_i) at node Y caused by the instantiation of X_i;
       end
       Compute the reinforcement summary S_Y at node Y;
       Instantiate node Y as if S_Y were observed (hard evidence);
     end
     Make the parent nodes P elements of V: V ← V ∪ P;
   else
     Stop;
   end
4. Go to step 2;
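Algorithm 1 can be sketched compactly as a recursive bottom-up pass over a tree (a recursive formulation is equivalent to the iterative leaf-to-root scheme above). The tree, CPTs and observations below are made up for illustration, with one single-CPT fragment per edge:

```python
# Each edge (parent, child) carries a CPT: cpt[(p, c)][p_state][c_state] = p(child | parent).
# Structure and numbers are illustrative only.
children = {"H": ["C", "F", "G"], "C": ["E1", "E2"]}
cpt = {("H", "C"):  [[0.8, 0.2], [0.3, 0.7]],
       ("H", "F"):  [[0.9, 0.1], [0.4, 0.6]],
       ("H", "G"):  [[0.7, 0.3], [0.2, 0.8]],
       ("C", "E1"): [[0.8, 0.2], [0.1, 0.9]],
       ("C", "E2"): [[0.6, 0.4], [0.3, 0.7]]}
observed = {"E1": 0, "E2": 0, "F": 0, "G": 1}     # instantiated leaves

def summary(node):
    """Reinforcement summary S at `node` (Definitions 4 and 5, applied bottom-up)."""
    if node in observed:
        return observed[node]                      # evidence node: the observed state
    counters = {}
    for child in children[node]:
        s = summary(child)                         # child's summary, treated as hard evidence
        table = cpt[(node, child)]
        # Definition 4: the parent state with the largest parameter for the child's state
        r = max(range(len(table)), key=lambda y: table[y][s])
        counters[r] = counters.get(r, 0) + 1       # reinforcement counters, eq. (11)
    return max(counters, key=counters.get)         # Definition 5: majority state

print(summary("H"))  # two fragments reinforce state 0, one reinforces state 1
```

With these numbers, both evidence leaves under C reinforce state 0, so C is instantiated to 0; at H the fragments through C and F reinforce state 0 while G reinforces state 1, and the majority wins.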

With this algorithm, we obtain $S_X$ for all unobserved variables X by recursively applying Definitions 4 and 5. In BNs with tree-like DAGs and binary nodes, Algorithm 1 corresponds to a system of hierarchical decoders that implement the repetition coding technique known from information theory (see for example [11], Chapter 1). This implies asymptotic properties: given p_re > 0.5, as the branching factors increase, the probability that $S_H = h^*$ at any node approaches 1. This property can be explained through binomial distributions, as we show in the next section.

5.1 Reinforcement Accuracy

While p_re > 0.5 is a lower bound on the probability that a particular CPT provides an accurate reinforcement, p_f denotes the probability that a factor reinforcement resulting from Algorithm 1 is accurate. A lower bound on p_f will be necessary for the analysis in the upcoming sections. First we make the following assumptions:


Assumption 1 The BN graph contains a high number of nodes with many conditionally independent fragments. Hence, a high number of independent factors can be identified.

Assumption 2 The probability p_re that any CPT supports an accurate classification in a given situation satisfies p_re > 0.5 (Section 4 provides a rationale for this assumption).

Proposition 2 Given Assumptions 1 and 2 and sufficiently high branching factors, the probability p_f that the true state of a classification node will be reinforced by a factor is greater than 0.5.

Proof (Sketch) The factor reinforcement is calculated recursively using Definitions 4 and 5, beginning at the leaf nodes and ending at the classification node. We show that Proposition 2 holds in each recursion step. Let H be a classification node with k factors, let φ_i be one of the factors associated with H, and let G be the child of H in the fragment corresponding to φ_i (see for example Figure 1). We can write p_f for factor φ_i as

$$p_f = p_{re}\, p_{sum} + \alpha (1 - p_{re})(1 - p_{sum}), \tag{13}$$

where p_sum is the probability that the reinforcement summary at node G equals the true state of G. If the reinforcement summary is accurate and the fragment between H and G is adequate, then the reinforcement at H is accurate. The second term represents the situation where the reinforcement summary at G is inaccurate and the fragment between G and H contains a fault; these can cancel each other out, which can result in an accurate reinforcement at H. 0 < α < 1 is a scalar that represents the probability that such a situation occurs. Note that for binary variables α = 1.

Next, let p_f be the minimum of p_f over all factors associated with H. From Definition 5 we can give a lower bound on p_sum for node H:

$$p_{sum} \geq \sum_{m=\lceil k/2 \rceil}^{k} \binom{k}{m} p_f^m (1 - p_f)^{k-m}. \tag{14}$$

This is a lower bound because the reinforcement summary is defined as the state with the maximum reinforcement counter, which is less restrictive than the absolute majority ⌈k/2⌉ used in (14). Assumption 2 states that p_re > 0.5, and therefore (13) implies that there exists a sufficiently high p_sum for which p_f > 0.5. (14) implies that a sufficiently high p_sum can be obtained if p_f > 0.5 and k is sufficiently high. The recursion starts with the leaf nodes, for which p_sum = 1 since they are instantiated. Thus, if a network contains enough fragments (Assumption 1) and p_re > 0.5 (Assumption 2), then p_f > 0.5 for all classification nodes. □

For the complete proof see [13]. Additionally, from the above analysis we can observe the following property:

Corollary 3 p_f increases and approaches 1 as the branching factors increase.
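The bound (14) is easy to evaluate numerically, which also illustrates Corollary 3: for a fixed p_f > 0.5, the bound on p_sum grows towards 1 with the branching factor k (odd k is used here to avoid majority ties):

```python
from math import comb, ceil

def psum_lower_bound(p_f, k):
    """Eq. (14): probability of an absolute majority of accurate reinforcements."""
    return sum(comb(k, m) * p_f**m * (1 - p_f)**(k - m)
               for m in range(ceil(k / 2), k + 1))

# With p_f = 0.6 the bound climbs monotonically towards 1 as k grows.
for k in (1, 3, 5, 11, 21):
    print(k, round(psum_lower_bound(0.6, k), 4))
```

This is exactly the repetition-code behavior mentioned above: each extra independent fragment acts as another noisy repetition of the true state, and majority decoding becomes increasingly reliable.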

6 Fault Monitoring

We want to estimate the adequacy of a particular model fragment for a particular case. It is clear that we cannot directly apply Definition 3, because we do not know the true state of hidden


variables and thus cannot evaluate Condition (6). We will show in this section, however, that given certain (in)accuracies a model 'behaves' in a certain way. We call this behavior the model response, describe it in terms of the reinforcements from the previous section, and show that it can give clues to the existence of inaccuracies.

6.1 Factor Consistency

Since the true state of a hidden variable is unknown, it is impossible to directly determine whether or not $R_H(\phi_i) = h^*$ holds. We can, however, use the following definition, whose condition is directly observable and which describes the relationship between multiple factors:

Definition 6 (Factor Consistency) Given any node H, a set of factors $\Phi_H$ is consistent iff

$$\forall \phi_i, \phi_j \in \Phi_H : R_H(\phi_i) = R_H(\phi_j).$$

The factors are thus consistent if they all reinforce the same state of H. Given that there can be only one true state h^* at any given moment, we observe that if each element of a set of factors $\Phi_H$ satisfies the condition in Definition 3, then that set must be consistent. Conversely, if a set of factors is not consistent, then some elements of that set do not satisfy the condition in Definition 3. Obviously, through various faults we will observe inconsistent factor sets in most situations. In that case we should determine which of the factors in an inconsistent set violate Definition 3. We next show how this can be achieved, using the result from Proposition 2 and by introducing a consistency measure.

6.2 Consistency Measure

We define a measure for the degree of consistency of any factor φ_i with respect to the observed reinforcements of all factors in a set $\Phi_H$.

Definition 7 (Consistency Measure) Given a node H, a set of factors $\Phi_H$, and a reinforcement counter n_i for each state h_i (see Section 5), the consistency measure for a factor $\phi_i \in \Phi_H$ is defined as

$$C_H(\phi_i) = n_j - \max_{k \neq j} n_k,$$

where h_j is the state of H that was reinforced by factor φ_i.

In other words, the consistency measure of a factor φ_i equals the number of factors 'agreeing' with φ_i (including φ_i itself), minus the maximum number of reinforcements received by any other state of H. For the running example, where the reinforcements were $R_H(\phi_1) = R_H(\phi_3) = h_1$ and $R_H(\phi_2) = h_2$, we get $C_H(\phi_1) = 1$ and $C_H(\phi_2) = -1$. Using this definition we can describe certain relations between the value of the consistency measure and the estimated factor accuracy.
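Definition 7 reduces to simple arithmetic on the reinforcement counters; a minimal sketch using the running example's reinforcements:

```python
# Running example: phi1 and phi3 reinforce h1, phi2 reinforces h2.
reinforcements = {"phi1": "h1", "phi2": "h2", "phi3": "h1"}

def consistency(factor):
    """C_H(phi) = n_j - max_{k != j} n_k, with h_j the state reinforced by phi."""
    counters = {}
    for state in reinforcements.values():
        counters[state] = counters.get(state, 0) + 1
    h_j = reinforcements[factor]
    others = [n for state, n in counters.items() if state != h_j]
    return counters[h_j] - (max(others) if others else 0)

print(consistency("phi1"), consistency("phi2"))  # -> 1 -1
```

A positive value means the factor sides with the majority vote, a negative value that it is outvoted by some other state.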

6.3 Estimation of the Summary Accuracy

We use p = p_f > 0.5 as the a priori probability that a reinforcement equals the true state, $R_H(\phi_i) = h^*$ (recall Proposition 2). Consider a node H with states {h_1, ..., h_m} and associated reinforcement counters N = {n_1, ..., n_m}. Let N denote the sum over N, i.e. the total number of factors. The conditional probability that a particular state h_i equals the true state h^*, given that we observed the reinforcement counters N and assuming uniform priors over h^*, can be expressed as

$$p(h_i = h^* \mid N) = \frac{p^{n_i}(1-p)^{N-n_i}}{\sum_j p^{n_j}(1-p)^{N-n_j}}.$$


The numerator is the probability of a correct reinforcement raised to the number of reinforcements supporting h_i, times the probability of an inaccurate reinforcement raised to the number of reinforcements not supporting h_i. The denominator normalizes the distribution. We want to determine exactly for which degree of consistency $C_H$ this conditional probability is greater than 0.5. This is the case if

$$p^{n_i}(1-p)^{N-n_i} > \sum_{j \neq i} p^{n_j}(1-p)^{N-n_j}.$$

Because of the sum on the right-hand side, this is difficult to express in terms of $C_H$. We therefore replace the right-hand side by an upper bound; if the resulting inequality is satisfied, the original is satisfied as well. The upper bound we use is $n \cdot \max_j x_j \geq \sum_j x_j$. If we define c = p/(1−p), the inequality becomes

$$c^{n_i} > (m-1)\, c^{\max_{j \neq i} n_j},$$

which is equivalent to

$$n_i - \max_{j \neq i} n_j > \frac{\log(m-1)}{\log c}. \tag{15}$$

The left-hand side is now equal to the consistency measure $C_H$. We also want to determine exactly when $p(h_i = h^* \mid N)$ is smaller than 0.5. This is true if

$$p^{n_i}(1-p)^{N-n_i} < \sum_{j \neq i} p^{n_j}(1-p)^{N-n_j}.$$

We now replace the right-hand side by a lower bound, so that if the new inequality is satisfied, the original is satisfied as well. Using the lower bound $\max_j x_j \leq \sum_j x_j$, we obtain $c^{n_i} < c^{\max_{j \neq i} n_j}$, which is equivalent to

$$n_i - \max_{j \neq i} n_j < 0, \tag{16}$$

and we derive the following implications:

$$C_H < 0 \;\Rightarrow\; p(h_i = h^* \mid N) < 0.5 \tag{17}$$

$$C_H > \frac{\log(m-1)}{\log c} \;\Rightarrow\; p(h_i = h^* \mid N) > 0.5 \tag{18}$$

Here $C_H$ denotes $C_H(\phi)$, $h_i = R_H(\phi)$, and m is the number of states of H. These implications give the probability of an accurate factor reinforcement given its consistency measure. This allows us to use an observable quantity (the reinforcement counters) to derive the probability that a particular fragment is adequate. Thus, if a factor has a negative consistency measure, the corresponding fragment probably introduces a fault.

Implication (18) is not trivial to interpret, since its condition depends on the unknown quantity c = p/(1−p). It turns out, however, that without knowing the exact value of c we can often specify an adequate $C_H$ that makes implication (18) valid. It is important to note that the consistency measure can only take integer values. For example, any value of log(m−1)/log c < 1 requires $C_H$ to be at least 1 in order to satisfy (18). The condition log(m−1)/log c < 1 is satisfied if p ∈ (p_min, 1]. Table 2 shows the lower bound of the interval (p_min, 1] for which different values of $C_H$ are adequate. These bounds also depend on m. If m = 2, then $C_H = 1$ is adequate for any p ∈ (0.5, 1]; recall that we already assumed p to be greater than 0.5.

          C_H = 1   C_H = 2
  m = 2     0.50      0.50
  m = 3     0.66      0.58
  m = 4     0.75      0.63
  m = 5     0.80      0.66

Table 2: Minimum value of p that is sufficient to satisfy (18), given a certain value of C_H and m.
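The entries of Table 2 follow from solving log(m−1)/log c < C_H for p, with c = p/(1−p); the sketch below recomputes them (matching Table 2 up to rounding):

```python
def p_min(m, C_H):
    """Smallest p for which implication (18) can hold with consistency C_H:
    log(m-1)/log(p/(1-p)) < C_H  <=>  p/(1-p) > (m-1)**(1/C_H)."""
    c = (m - 1) ** (1.0 / C_H)
    return c / (1.0 + c)

for m in (2, 3, 4, 5):
    print(m, round(p_min(m, 1), 2), round(p_min(m, 2), 2))
```

For instance, with m = 5 and C_H = 1 we need p/(1−p) > 4, i.e. p > 0.8, while requiring C_H = 2 relaxes this to p/(1−p) > 2, i.e. p > 2/3.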

Figure 2: (a) Network section. (b) Comparison at node Y. (c) Comparison at node X.

7 Fault Localization Algorithm

Depending on which CPTs from a network fragment $F_i^H$ are inadequate in a given situation, the resulting factor φ_i might be inaccurate as well. We can often localize inadequate CPTs by using (17) and (18). Consider a network section consisting of two adjacent nodes, X and Y (see Figure 2). First we consider one particular fragment $F_i^Y$ rooted in Y. At run-time, the consistency measure $C_Y(\phi_i)$ can be obtained at node Y for the factor corresponding to $F_i^Y$ (see Figure 2b). This measure, combined with (17) and (18), indicates whether $F_i^Y$ up to node Y is adequate.

Let F′ be fragment $F_i^Y$ plus the edge ⟨X, Y⟩. F′ would be a fragment of X if we removed all fragments of Y except $F_i^Y$ from the graph (see Figure 2c). Let φ′_i be its corresponding factor. We can observe the consistency measure $C_X(\phi'_i)$ at node X for fragment F′. To compute this consistency, we need to know the reinforcement $R_X(\phi'_i)$. This can be obtained using the reinforcement propagation algorithm by ignoring the reinforcements from all fragments rooted in Y, except for $F_i^Y$. We then compare the reinforcement of φ′_i at node X with the reinforcements of all other factors of X, and obtain the consistency (see Figure 2c). Again, this indicates the adequacy of the fragment $F_i^Y$, this time including the edge ⟨X, Y⟩. These two consistency measures combined indicate the adequacy of the CPT parameters p̂(Y|X) corresponding to the edge ⟨X, Y⟩. We use the following rule:

Rule 1 Let θ_t and θ_f be thresholds on the consistency measure. If, for any node X, we observe $C_X(\phi) > \theta_t$, then we assume $x^* = R_X(\phi)$. If we observe $C_X(\phi) < \theta_f$, then we assume $x^* \neq R_X(\phi)$.

Given this rule, we can determine the adequacy of the CPT parameters p̂(Y|X) based on the following intuition: if a fragment is adequate up to Y, but the extended fragment is inadequate up to X, then the fault lies with the edge ⟨X, Y⟩. All such localization rules are shown in Table 3.
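Rule 1 together with the localization table amounts to two thresholded decisions followed by an agreement test. A minimal sketch follows; the `None` return for the inconclusive middle band between the thresholds is our own addition, not from the paper:

```python
from math import log

def assume_state_matches(C, m, p):
    """Rule 1 at one node: True if C crosses theta_t, False if below theta_f = 0."""
    theta_t = log(m - 1) / log(p / (1 - p))   # threshold from implication (18)
    if C > theta_t:
        return True                            # assume the true state was reinforced
    if C < 0:                                  # theta_f = 0, implication (17)
        return False
    return None                                # inconclusive: no threshold crossed

def classify_edge(match_X, match_Y):
    """Localization rules: the edge <X, Y> is adequate iff both decisions agree."""
    if match_X is None or match_Y is None:
        return "unknown"
    return "ok" if match_X == match_Y else "inadequate"

print(classify_edge(assume_state_matches(2, 2, 0.7),
                    assume_state_matches(-1, 2, 0.7)))  # -> inadequate
```

The agreement test mirrors the intuition above: one confident 'correct' verdict and one confident 'incorrect' verdict on the two ends of an edge point at the CPT on that edge.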
In other words, we compare the consistency at two adjacent nodes and classify the edge between them as adequate or inadequate. We can show that the use of Rule 1 in conjunction with appropriate thresholds guarantees that in most cases the (in)adequacy of the CPT corresponding to ⟨X, Y⟩ is correctly determined.

Proposition 4 (Fault Localization) Given a network with binary nodes containing a sufficient number of fragments and p_f > 0.5 (see Section 5.1), fault localization based on Rule 1 and


x* = R_X(φ'_i)   y* = R_Y(φ_i)       edge ⟨X, Y⟩
true             true            ⇒   ok
true             false           ⇒   inadequate
false            true            ⇒   inadequate
false            false           ⇒   ok

Table 3: Localization rules. The values in the first two columns correspond to the truth of the

equality in the column header.

Table 3 with thresholds θ_t = log(m−1)/log c and θ_f = 0 will correctly determine whether a particular CPT p̂(Y|X) is adequate or inadequate with more than 50% chance.

Proof If for any node A and factor φ we observe C_A(φ) > θ_t = log(m−1)/log c, then Rule 1 tells us to assume that a* = R_A(φ). Equation (18) implies that, given p_f > 0.5, the probability p*_A that this assumption is correct, namely that a* truly equals R_A(φ), is p*_A = p(R_A(φ) = a* | N) > 0.5. Analogously, if for any node A and factor φ we observe C_A(φ) < 0, then Rule 1 tells us to assume that a* ≠ R_A(φ). Equation (17) implies that, given p_f > 0.5, the probability p*_A that this assumption is correct, namely that a* truly does not equal R_A(φ), is p*_A = 1 − p(R_A(φ) = a* | N) > 0.5. This holds for nodes X and Y from Table 3, and thus the probability that we choose the right states in the first two columns of Table 3, and thereby draw the right conclusion about edge ⟨X, Y⟩, is p*_X · p*_Y, where both p*_X > 0.5 and p*_Y > 0.5 as shown above. If we wrongly choose the states of both columns, we draw the same conclusion. The total probability of drawing the correct conclusion is therefore p_correct = p*_X · p*_Y + (1 − p*_X)(1 − p*_Y). It is easy to see that if p*_X > 0.5 and p*_Y > 0.5 then p_correct > 0.5: rewriting gives p_correct − 1/2 = (2p*_X − 1)(2p*_Y − 1)/2 > 0. Therefore fragment classification using Rule 1, Table 3, and the appropriate thresholds is correct with more than 50% chance. □

We can apply such an analysis to all non-terminal nodes by running Algorithm 2.

Algorithm 2: Localization Algorithm

Execute Algorithm 1, and store all factor reinforcements;
for each node X do
    for each fragment F_i^X of X do
        Let Y be the child of X within F_i^X;
        for each fragment F_j^Y of Y do
            Compute C_X(φ') and C_Y(φ) for F_j^Y;
            Using thresholds θ_t and θ_f, and Table 3, classify CPT p̂(Y|X);
        end
        Use majority voting on all classifications of CPT p̂(Y|X) based on different F_j^Y;
    end
end

We observe the following property of Algorithm 2:
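The nested loops and majority voting of Algorithm 2 can be sketched in Python as follows. This is an illustrative rendering under simplifying assumptions: the tree is given as a plain child-list dictionary, and `consistency` is a hypothetical callable standing in for the run-time measurements C_X(φ') and C_Y(φ); all names are our own.

```python
from collections import Counter

# Table 3: (x* == R_X(phi'), y* == R_Y(phi)) -> verdict for edge <X, Y>
TABLE_3 = {(True, True): "ok", (True, False): "inadequate",
           (False, True): "inadequate", (False, False): "ok"}

def localize(tree, consistency, theta_t, theta_f):
    """Classify every CPT p(Y|X) by majority voting over fragments of Y.

    tree        -- dict mapping each non-terminal node to its children
    consistency -- hypothetical callable consistency(node, fragment_root)
                   standing in for the run-time measures C_X and C_Y
    """
    def vote(c):                        # Rule 1: only confident readings vote
        return True if c > theta_t else (False if c < theta_f else None)

    verdicts = {}
    for x, children in tree.items():
        for y in children:              # fragment of X rooted in child Y
            votes = Counter()
            for z in tree.get(y, [y]):  # fragments F_j^Y (a leaf votes once)
                vx, vy = vote(consistency(x, z)), vote(consistency(y, z))
                if vx is not None and vy is not None:
                    votes[TABLE_3[(vx, vy)]] += 1
            if votes:                   # majority vote over all F_j^Y
                verdicts[(x, y)] = votes.most_common(1)[0][0]
    return verdicts

# Toy tree: hypothesis H with two subtrees; the stubbed measurements below
# make edge <H, B> look inadequate while everything else looks adequate.
tree = {"H": ["A", "B"], "A": ["a1", "a2", "a3"], "B": ["b1", "b2", "b3"]}

def stub(node, frag_root):              # hypothetical measurements
    return -1.0 if node == "H" and frag_root.startswith("b") else 2.0

verdicts = localize(tree, stub, theta_t=1.0, theta_f=0.0)
```

With these stubbed measurements, the verdict for edge ⟨H, B⟩ comes out "inadequate" and all other edges come out "ok".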

Corollary 5 The majority voting at the end of Algorithm 2 improves with higher branching factors. Higher branching factors imply more votes about the state of a fragment and therefore higher expected localization accuracy. This accuracy converges asymptotically to 1 as the branching factor increases.

Note that while the proof is given for networks with binary nodes, the algorithm is likely to be effective for multi-state nodes as well. In that case, the implications on the second and fourth lines of Table 3 are not necessarily valid. For example, there are rare circumstances where x* = R_X(φ'_i) and y* ≠ R_Y(φ_i), but where the CPT is nonetheless adequate. This is possible because multiple states of Y (including those not equal to y*) could all reinforce the same true state x*. See for example Table 1, where both b2 and b3 reinforce h2. If h* = h2 and b* = b2, but b3 were instantiated, then the CPT would be deemed inadequate although it was in fact not introducing a fault. If these circumstances do not occur often, then the majority voting in Algorithm 2 will mitigate their effects, and the localization will work correctly, especially if the branching factors are high. The experiments in Section 8.1 support this.

7.1 Determining the Cause Type

We can distinguish between rare cases (type 1) and model errors (type 2) by their frequency of occurrence. To do so, we perform fault localization on a BN for a set of cases. If certain fragments are diagnosed as inadequate for a large number of cases, this is an indication that the fragment might contain erroneous parameters. Alternatively, it might be possible to find model errors by localizing faults on a case from the domain which one knows is not rare. In other words, a case for which we know that the true state of every node is the most likely state given the evidence (see Section 4.1, cause 1). This excludes the possibility of faults due to a rare case; any faults found are then probably caused by model errors.
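A minimal sketch of this frequency-based distinction, assuming a hypothetical `localize_case` routine that returns the set of CPTs diagnosed as inadequate for one case (the 0.5 cut-off is our illustrative choice, not prescribed by the text):

```python
from collections import Counter

def classify_causes(cases, localize_case, model_error_rate=0.5):
    """Separate rare-case faults (type 1) from model errors (type 2).

    localize_case is a hypothetical routine returning the set of CPTs
    diagnosed as inadequate for one case; a CPT flagged in more than
    model_error_rate of all cases is treated as a model error, while an
    occasionally flagged CPT points to a rare case.
    """
    counts = Counter()
    for case in cases:
        counts.update(localize_case(case))
    cutoff = model_error_rate * len(cases)
    return {cpt: ("model error" if n > cutoff else "rare case")
            for cpt, n in counts.items()}

# One CPT flagged in every case, another flagged only once:
flags = classify_causes(
    cases=range(10),
    localize_case=lambda c: {"p(Y|X)"} | ({"p(Z|Y)"} if c == 0 else set()))
```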

8 Experiments

To verify our claims and illustrate some of the properties of Algorithm 2, we applied it to several synthetic networks in which we artificially introduced faults. We also applied it to a real network, which we adapted so that it represents an oversimplification of the problem domain, thus introducing faults.

8.1 Synthetic Networks

We generated BNs with random CPTs, using a simple tree DAG with fixed branching factor k and 4 levels. We initialized all CPTs such that the probability pre of a CPT being adequate (see Section 4) could be controlled. We let pre take the values 1, 0.95, . . . , 0.4. Then we generated 1000 data samples for each particular network, applied Algorithm 2 to each sample case, and observed its output. We used 0 as the positive threshold θ_t, which meant that the consistency measure had to be at least 1 for assuming that a CPT is adequate. Even though the algorithm does not know the value of c in (18), it turned out that Algorithm 2 is quite insensitive to the precise positive threshold value, which confirms the rationale at the end of Section 6.2. The algorithm's output was compared with the ground truth, i.e. which fragments really were inadequate for the given data case. This ground truth can be obtained from the complete case, which was known to us. Given the inadequate CPTs present for a given case, we recorded the percentage of CPTs that the algorithm detected and the percentage of detected inadequacies that turned out to be false positives. We applied the algorithm to networks with varying branching factors (but the same general structure). The percentages are plotted in Figure 3. For Figure 4 we varied the number of states per network variable. Figure 3 confirms the analysis that for any value of pre > 0.5, higher branching factors increase the algorithm's effectiveness. Figure 4 also shows that the algorithm performs better on networks with more node states, which can be explained by the fact that in such cases inadequate sample values are spread over more states. For example, suppose that in a certain situation a node is in state 1, but


Figure 3: The effect of branching factors on a network with 4-state nodes, for different values of pre: 0.9 (dash-dotted), 0.7 (solid), 0.5 (dashed). Top curves show percentage found, bottom curves show percentage of false positives.

Figure 4: The effect of the number of node states on a network with branching factor 5, for different values of pre (horizontal axis). Number of states: 2 (dashed), 3 (solid), 4 (dash-dotted). Top curves show percentage found, bottom curves show percentage of false positives. The dotted line shows the worst-case scenario for 3 and 4 states.


Figure 5: Network structure used for the experiment with the real network.

an inadequate fragment has caused a higher belief in a different state. If a node has more states, inaccurate classifications will be spread among more alternatives. Thus, on average, the difference between the counter of the correct state and the other counters increases, making the correct state stand out more clearly. For example, given N = {3, 2, 0}, state 1 would have a consistency measure of 1, while for N = {3, 1, 1} it would be 2. Note that the degree of this spread also influences the quality, as can be seen from the dotted line in Figure 4. This line shows the effectiveness if we enforce only one alternative state (i.e. if a fragment is inadequate it will always cause the same inaccurate state), which on average decreases the consistency measure. This worst-case scenario is equivalent to localization in binary BNs. We expect real networks to fall somewhere between this worst and best case.
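Assuming the consistency measure is the gap between the largest reinforcement counter and its runner-up, which reproduces the worked values above (the exact definition is in Section 6.2), the effect of spreading can be checked directly:

```python
def consistency(counters):
    """Gap between the leading reinforcement counter and its runner-up.

    This is only a sketch that reproduces the worked values in the text;
    spreading inaccurate reinforcements over more alternative states
    makes the correct state stand out more.
    """
    top, runner_up = sorted(counters, reverse=True)[:2]
    return top - runner_up

# Inaccurate reinforcements concentrated on one alternative state:
assert consistency([3, 2, 0]) == 1
# The same number of inaccurate reinforcements spread over two states:
assert consistency([3, 1, 1]) == 2
```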

8.2 Real-World Experiment

Next, we tested the algorithm on a real network, namely a subtree of the Munin medical diagnosis network [1] (see Figure 5 for the subtree structure). This tree BN is a significant simplification of the problem domain. It was constructed by first manually setting the (simple) network structure and then using the EM algorithm [6] to learn the parameters from a data set sampled from the complete Munin network. Obviously, when we attempt to classify cases using this simple BN, misclassifications will occur. The question is whether our algorithm can detect these misclassifications and localize their causes. We applied the algorithm to the tree BN for a set of sample cases generated by the complete network. Since the state of all (hidden) variables in all cases was known, we knew which CPTs were inadequate. On the tree network, the algorithm found 75.7% of all inadequate CPTs, while producing 20.9% false positives, which confirms that the algorithm can be effective in a real-world setting.

9 Applications

In Section 4.1 we identified three types of causes for classification faults. The presented approach to localization can be used to detect inadequate CPTs and mitigate their impact.

9.1 Localizing Faulty Model Components

The localization algorithm can discover faults of Type 2, where a CPT does not accurately capture the general tendencies in the modeled domain (see (9)). By applying the localization algorithm to many different samples obtained in different situations, we can localize CPTs that are found to be inadequate in the majority of the samples. Such CPTs represent modeling errors, which cannot be avoided if the model is used in changing domains and the learning examples or expertise used for generating the model do not capture the characteristics of the new domain. Fault localization can be especially useful in domains that change sufficiently slowly, allowing us to discover local inadequacies and adapt the model gradually to the new domain.

9.2 Deactivating Inadequate Model Components

In the case of faults of Type 1, we can use localization Algorithm 2 to localize CPTs that are inadequate in a particular situation corresponding to a certain set of observations. A CPT considered inadequate can be set to a uniform distribution, which effectively renders the fragment connected to the rest of the network via this CPT inactive. Since a fragment related to the rest of the network via an inadequate CPT does not support accurate classification in a given situation, its deactivation at runtime can improve the overall inference accuracy. In principle, by deactivating an inadequate CPT the divergence between the estimated distribution over the hypothesis variable and the true point-mass distribution can be reduced. This is useful if the classification considers decision thresholds greater than 0.5. If for a given observation set the estimated distribution does not approach the true point-mass distribution sufficiently closely, then the case cannot be classified. By deactivating a fragment, the percentage of such cases can be reduced without any loss of performance. Since the fault localization algorithm can fail, occasionally adequate CPTs could be considered inadequate, which can reduce the classification quality. However, by considering the properties of the localization algorithm, we can show that it is more likely to encounter cases (i.e. sets of observations) for which the classification quality improves. This is especially the case if fragments rooted in the hypothesis node have identical topologies and CPTs, which corresponds to models of conditionally independent processes of the same type running in parallel. Models that support improved classification through fragment deactivation are relevant for a significant class of applications in which states of hidden variables are inferred through interpretation (i.e. fusion) of information obtained from large numbers of different sources, such as sensors.
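Setting a CPT to a uniform distribution is straightforward; the sketch below uses our own minimal representation of a CPT (one row per parent state, each row a distribution over the child's states) to illustrate the deactivation step:

```python
def deactivate(cpt):
    """Set a CPT to uniform rows, deactivating its fragment.

    A uniform CPT passes the same factor to every hypothesis state, so
    the fragment behind it stops influencing the classification.
    """
    m = len(cpt[0])                       # number of child states
    return [[1.0 / m] * m for _ in cpt]

# A CPT diagnosed as inadequate for the current observation set:
faulty = [[0.9, 0.1], [0.2, 0.8]]
assert deactivate(faulty) == [[0.5, 0.5], [0.5, 0.5]]
```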
As shown in [13], such fusion can be based on BNs in which each sensor is associated with a conditionally independent fragment given the monitored phenomenon. The improvement of the estimation through deactivation of fragments is illustrated with the help of an experiment. We used a BN with a tree topology, branching factor 5, and 4 levels of nodes, corresponding to 125 leaf nodes. The CPTs at every level were identical, such that pre = 0.75. This network was used for data generation through sampling. The sampled data sets, consisting of 5000 cases, were fed to two classifiers. Both classifiers were based on a BN identical to the generative model. For one classifier we used fault localization and deactivated inadequate fragments. Compared to the classifier using the unaltered BN, the average posterior probability of the true hypothesis was significantly higher (0.81 instead of 0.76). Furthermore, the divergence between the estimated and the true distribution over the hypothesis variable was reduced for 67% of the data cases. Finally, of the cases that were misclassified by the unaltered BN, 11% were correctly classified after deactivation of inadequate CPTs. In contrast, a correct classification became a misclassification after deactivation in only 2% of the cases.

We further assume that a sensor failure is a rare event. Consequently, if a sensor is broken, the CPT relating the monitored phenomenon and the fragment corresponding to the sensor is inadequate. If a few of the existing sensors are broken, we can localize the corresponding CPTs and mitigate the impact of the broken sensors by deactivating the corresponding network fragments.

10 Discussion

10.1 Non-Tree Networks

The analyses in the sections above are based on tree-structured DAGs. For example, pre denotes the probability that a single CPT, corresponding to a single edge in the graph, is accurate. Obviously, real domains are often not represented by pure trees. It is possible, however, to convert an arbitrary DAG to a tree structure by compounding multiple nodes into hyper nodes

Figure 6: Example network structures: (a) original DAG, (b) tree DAG.

and marginalizing out certain nodes. The states of the hyper nodes are the Cartesian products of the states of the original nodes. Note that this increases the size of the CPTs, and thus the assumption pre > 0.5 becomes more difficult to justify. We give a simple example to illustrate this claim. Suppose the structure of our model is the DAG shown in Figure 6(a), where all leaf nodes are evidence nodes. Given node A, nodes C and D are not independent and therefore must be part of the same fragment, together with E. Since we want fragments to consist of only one node, C and D either have to be compounded or marginalized out. To avoid unnecessarily large CPTs, we choose marginalization. Now all the fragments to the left of A consist of only one node. On the right side of the DAG, given H, F is an independent fragment. The fragment consisting of G, J, and L cannot be split into multiple fragments given H, and it is not tree structured. It can be converted to a tree by compounding J and L and marginalizing out G. The resulting DAG is shown in Figure 6(b). Note that the structure of this DAG is equal to that of the running example (see Figure 1).
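The compounding step can be sketched as follows, under the simplifying assumption that the two compounded nodes are conditionally independent given their common parent (as J and L are given G before marginalization); the function and data layout are our own illustration:

```python
from itertools import product

def compound(states_j, states_l, cpt_j, cpt_l):
    """Compound nodes J and L into one hyper node with product states.

    cpt_j[h][j] = p(J=j | parent=h) and cpt_l[h][l] = p(L=l | parent=h);
    since J and L are assumed conditionally independent given the parent,
    each hyper-node row is the product of the corresponding entries.
    Note how the CPT width grows multiplicatively with the state spaces.
    """
    hyper_states = list(product(states_j, states_l))
    hyper_cpt = [[pj * pl for pj in row_j for pl in row_l]
                 for row_j, row_l in zip(cpt_j, cpt_l)]
    return hyper_states, hyper_cpt

# Binary J and L with a single parent state, purely for illustration:
hyper_states, hyper_cpt = compound(["j1", "j2"], ["l1", "l2"],
                                   [[0.9, 0.1]], [[0.6, 0.4]])
```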

10.2 Related Work

Several authors have addressed the problems of reliable inference and modeling robustness. Sensitivity-based approaches focus on the determination of modeling components that have a significant impact on the inference process [3, 2]. We must take special care of such components, since eventual modeling faults will have a great impact as well. Sensitivity analysis is carried out prior to operation and deals with the accuracy of the generalizations. Another class of approaches, including ours, focuses on determining the model quality or performance in a given situation at runtime. The central idea of our approach is observation of the consistency of the model's runtime reinforcements, which differs from common approaches to runtime analysis of BNs such as data conflict [8] and straw models [10, 9]. The data conflict approach is based on the assumption that, given an adequate model, all observations should be correlated and p̂(e_1, . . . , e_n) > p̂(e_1) · · · p̂(e_n). If this inequality is not satisfied, this is an indication that the model does not 'fit' the current set of observations [7]. A generalization of this method, [9], is based on the use of straw models. Simpler (straw) models are constructed through partial marginalization and should, in a coherent situation, be less probable than the original model. Situations in which the evidence observations are very unlikely under the original model and more probable under the straw model indicate a data conflict. While these approaches can handle more general BNs than our method, their disadvantage is that the conflict scores are difficult to interpret: at which score should an action be undertaken, and what is the probability that a positive score indicates an error? Another approach, proposed in [5], is the surprise index of an evidence set.
This index is defined as the joint probability of the evidence set plus the sum of the probabilities of all possible evidence sets that are less probable. If the index has a value below a certain threshold ([5] proposes a threshold of 0.1 or lower), the evidence set is deemed surprising, indicating a possibly erroneous model. Clearly, this approach requires computing the probabilities of an


exponentially large number of possible evidence sets, making it intractable for most models. In addition, most common approaches to model checking focus on the net performance of models and do not directly support detection of inaccurate parts of a model [16]. An exception is the approach of [4], based on logarithmic penalty scoring rules. However, in this case the scores can be determined only for nodes corresponding to observable events, whereas we reason about nodes modeling hidden events.
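For comparison, the data conflict idea of [8] discussed above admits a compact sketch: under an adequate model the joint probability p̂(e_1, . . . , e_n) should exceed the product of the marginals, so a positive log-ratio signals a possible conflict. The function below is our own illustrative rendering, not the exact HUGIN formulation:

```python
import math

def conflict(p_joint, p_marginals):
    """Conflict score: log(prod_i p(e_i) / p(e_1, ..., e_n)).

    Under an adequate model the observations are positively correlated,
    so the joint exceeds the product of marginals and the score is
    negative; a positive score flags a possible conflict.
    """
    return math.log(math.prod(p_marginals) / p_joint)

# Joint less likely than under independence -> positive score, conflict:
assert conflict(0.2, [0.5, 0.5]) > 0
# Joint more likely than under independence -> negative score, no conflict:
assert conflict(0.4, [0.5, 0.5]) < 0
```

As the text notes, however, interpreting the magnitude of such a score (when should an action be undertaken?) remains the difficult part.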

10.3 Conclusion

We have presented an approach to fault detection in BNs that are used for classification. This was done through the following steps:

1. We identified a partitioning of a BN such that each fragment has an independent influence on a classification node.

2. We identified three different fault causes that can be present in a CPT, and argued that 0.5 is a plausible lower bound on the probability of encountering such a fault cause.

3. We presented a coarse view of the inference process and showed how faults can be propagated through different network fragments.

4. We introduced a measure to monitor the consistency among the influences of multiple network fragments on a node, and showed that we can find thresholds on this measure such that we can deduce the probability of a fault existing in a fragment.

5. We presented an algorithm that combines the consistency measures at different nodes in the network in order to determine whether the fragment between the nodes contains a fault.

One might question the assumption of large branching factors. However, there exist applications where this is the case, for example the Distributed Perception Networks [12], which deal with hundreds or thousands of observations, each corresponding to a fragment in a BN. We showed that the results of fault localization can be used in several ways, such as localization of erroneous modeling parameters, faulty information sources, and modeling components that do not support accurate inference in a particular situation due to rare cases. Furthermore, we established a lower bound on the algorithm's effectiveness, which we showed to converge asymptotically to 1 for network topologies with increasing branching factors.

References

[1] S. Andreassen, F. V. Jensen, S. K. Andersen, B. Falck, U. Kjærulff, M. Woldbye, A. R. Sørensen, A. Rosenfalck, and F. Jensen. MUNIN — an expert EMG assistant. In Computer-Aided Electromyography and Expert Systems, chapter 21. Elsevier Science Publishers, Amsterdam, 1989.

[2] E. Castillo, J. M. Gutiérrez, and A. S. Hadi. Sensitivity analysis in discrete Bayesian networks. IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, 27:412–423, 1997.

[3] V. M. H. Coupé and L. C. van der Gaag. Practicable sensitivity analysis of Bayesian belief networks. In Joint Session of the 6th Prague Symposium of Asymptotic Statistics and the 13th Prague Conference on Information Theory, Statistical Decision Functions and Random Processes, pages 81–86, Prague, 1998.


[4] R. G. Cowell, A. P. Dawid, and D. J. Spiegelhalter. Sequential model criticism in probabilistic expert systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(3):209–219, 1993.

[5] J. D. F. Habbema. Models for diagnosis and detection of combinations of diseases. In F. de Dombal et al., editors, Proc. IFIP Conf. on Decision Making and Medical Care, pages 399–411, 1976.

[6] D. Heckerman. A tutorial on learning with Bayesian networks. In M. Jordan, editor, Learning in Graphical Models. MIT Press, Cambridge, MA, 1999.

[7] F. V. Jensen. Bayesian Networks and Decision Graphs. Springer-Verlag, New York, 2001.

[8] F. V. Jensen, B. Chamberlain, T. Nordahl, and F. Jensen. Analysis in HUGIN of data conflict. In Proc. Sixth International Conference on Uncertainty in Artificial Intelligence, pages 519–528, 1990.

[9] Y.-G. Kim and M. Valtorta. On the detection of conflicts in diagnostic Bayesian networks using abstraction. In Proc. Eleventh International Conference on Uncertainty in Artificial Intelligence, pages 362–367, 1995.

[10] K. Laskey. Conflict and surprise: Heuristics for model revision. In Proc. Seventh International Conference on Uncertainty in Artificial Intelligence, pages 197–204, 1991.

[11] D. J. C. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003. Available from http://www.inference.phy.cam.ac.uk/mackay/itila/.

[12] G. Pavlin, M. Maris, and J. Nunnink. An agent-based approach to distributed data and information fusion. In Proc. IEEE/WIC/ACM Joint Conference on Intelligent Agent Technology, pages 466–470, 2004.

[13] G. Pavlin and J. Nunnink. Inference meta models: A new perspective on inference with Bayesian networks. Technical Report IAS-UVA-06-01, Informatics Institute, University of Amsterdam, The Netherlands, 2006.

[14] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.

[15] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2000.

[16] L. C. van der Gaag and S. Renooij. Evaluation scores for probabilistic networks. In Proceedings of the 13th Belgium-Netherlands Conference on Artificial Intelligence, pages 109–116, 2001.
