Adaptive Bayesian Networks

S P Luttrell

Defence Research Agency, St Andrews Rd, Malvern, WORCS, WR14 3PS

Electronic address: luttrell%[email protected]

This appeared in Proceedings of the SPIE Conference on Adaptive and Learning Systems, Orlando, 140-151, 1992.

The theory of adaptive Bayesian networks is summarised. A detailed discussion of the Adaptive Cluster Expansion (ACE) network is presented. ACE is a scalable Bayesian network designed specifically for high-dimensional applications, such as image processing.

I. INTRODUCTION

In the first half of this paper the theory of adaptive Bayesian networks is presented. This type of network performs Bayesian inference [1, 2] using a probability density function (PDF) model that it learns during a training programme. Because a large number of degrees of freedom (or hidden variables) need to be integrated out, Bayesian networks can be computationally expensive to simulate [3]. In the second half of this paper the Adaptive Cluster Expansion (ACE) is offered as a computationally cheaper alternative.

II. GENERAL THEORETICAL FRAMEWORK

In this section the principles of adaptive Bayesian networks are discussed. The emphasis is on clarifying the underlying principles and exposing hidden approximations. The notation used in this paper can be found in the appendix. It is important to realise that there are two different types of PDF used: Bayesian (denoted as Q), and frequentist (denoted as P). Q is used to denote a Bayesian model PDF, whereas P is used to denote a frequency derived from a training set (used only where a full Bayesian analysis is impractical).

III. MAKING PREDICTIONS FROM A MODEL

The fundamental problem is this: given a training set, and a model PDF, use Bayesian methods to make predictions about test data. Bayesian inference leaves no freedom of action once the problem is thus specified, and it yields the predictive model

Q(x+|x−) = ∫ ds Q(x+|s) Q(s|x−),   exact
Q(x+|x−) ≈ Q(x+|s(x−)),   approximate
(3.1)

The exact result is an application of a hidden variables model, in which the model parameters are integrated out thus

Q(x+, x−) = ∫ ds Q(x+, x−, s)
(3.2)

The quality of the exact result is limited only by the validity of the model. If there are several contending models, perhaps none of which is actually correct, then a single winner could be selected by, for instance, choosing the model that yielded the greatest probability of generating the data.

On the other hand, the approximate result avoids the computationally expensive integral over parameters. There is a variety of such methods, including

1. maximum likelihood:   s(x−) = arg max_s Q(x−|s)
2. maximum posterior probability:   s(x−) = arg max_s Q(s|x−)
3. maximum discriminatory likelihood:   s(x−) = arg max_s Q(x−_out|x−_in, s)
(3.3)

The choice of approximation scheme can be used to influence the capabilities of the predictive model. If the posterior probability over parameters is highly localised, then schemes 1 and 2 are good approximations. On the other hand, the data might be partitioned into separate input and output spaces, and the model might need to be optimised for computing the conditional probability of the output given the input (i.e. to discriminate between alternative outputs). This would be a job for scheme 3.

In all schemes, whether exact or approximate, there is the problem of missing data. The exact method is unaffected by this, because missing data can be treated as hidden variables, and eliminated by integration. Whether or not an approximate method survives the problem of missing data depends on its structure. For instance, if the output data were omitted, then scheme 3 would be meaningless. There are many ways in which an approximate method could fail due to violations of its underlying assumptions. The moral is to construct the approximate method to suit the data, rather than the other way around.
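As a minimal illustration of the difference between the exact predictive model of Eq. (3.1) and the plug-in schemes of Eq. (3.3), the sketch below uses a Beta-Bernoulli model (an assumption made purely for illustration; the paper does not prescribe this model), where the parameter s is the probability of observing a 1.

```python
import numpy as np

# Minimal sketch (not from the paper): compare the exact predictive model of
# Eq. (3.1) with the plug-in approximations of Eq. (3.3) for a Beta-Bernoulli
# model, where the "parameter" s is the probability of observing a 1.
x_minus = np.array([1, 1, 0, 1, 1, 0, 1, 1])    # training data (past data x-)
n1 = int(x_minus.sum())
n0 = len(x_minus) - n1
a0, b0 = 2.0, 2.0                               # Beta prior on s (assumed here)

# Exact: Q(x+=1|x-) = int ds Q(x+=1|s) Q(s|x-), i.e. the posterior mean of s.
exact = (n1 + a0) / (n1 + a0 + n0 + b0)

# Scheme 1 (maximum likelihood): s(x-) = arg max_s Q(x-|s).
s_ml = n1 / (n1 + n0)

# Scheme 2 (maximum posterior probability): mode of the Beta(n1+a0, n0+b0) posterior.
s_map = (n1 + a0 - 1) / (n1 + a0 + n0 + b0 - 2)

print(f"exact Q(x+=1|x-) = {exact:.3f}")
print(f"scheme 1 (ML)    = {s_ml:.3f}")
print(f"scheme 2 (MAP)   = {s_map:.3f}")
```

With a large training set the posterior over s becomes sharply localised and the three numbers coincide, which is the situation in which schemes 1 and 2 are good approximations.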

IV. VISIBLE AND HIDDEN VARIABLES

The data space may be augmented by appending an additional unobserved vector to the original data vector. In this case, the data vector comprises visible variables, whereas the unobserved vector comprises hidden variables. The model PDF then becomes

Q(x|s) = ∫ dh Q(x, h|s) = ∫ dh Q(x|h, s) Q(h|s)
(4.1)

Hidden variables can always be appended to a model, at the cost of additional theoretical and computational effort. The prediction equation then becomes

Q(x+|x−) = ∫ ds dh+ dh− Q(x+, h+|s) Q(s, h−|x−),   exact
Q(x+|x−) ≈ ∫ dh+ Q(x+, h+|s(x−)),   approximate
(4.2)

where the notation used is self-explanatory.
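When the integral over hidden variables in Eq. (4.1) has no closed form, it can be approximated by sampling. The sketch below is an illustration only; the Gaussian choices for Q(h|s) and Q(x|h,s) are assumptions, not the paper's model.

```python
import numpy as np

# Minimal sketch (illustration only): approximate the marginal of Eq. (4.1),
# Q(x|s) = int dh Q(x|h,s) Q(h|s), by Monte Carlo when the integral has no
# closed form.  The Gaussian choices below are assumptions, not the paper's model.
rng = np.random.default_rng(0)

def q_h_given_s(n_samples):
    return rng.normal(0.0, 1.0, size=n_samples)           # Q(h|s): standard normal prior

def q_x_given_h_s(x, h, s=0.5):
    return np.exp(-0.5 * (x - s * h) ** 2) / np.sqrt(2 * np.pi)   # Q(x|h,s)

x = 0.7
h = q_h_given_s(100_000)
q_x = q_x_given_h_s(x, h).mean()     # Q(x|s) ~= (1/K) sum_k Q(x|h_k, s)
print(f"Monte Carlo estimate of Q(x|s): {q_x:.4f}")

# Exact check: with s = 0.5 the marginal is x ~ N(0, 1 + s^2) = N(0, 1.25).
exact = np.exp(-0.5 * x ** 2 / 1.25) / np.sqrt(2 * np.pi * 1.25)
print(f"closed-form marginal          : {exact:.4f}")
```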

V. LOGARITHMIC LIKELIHOOD AND RELATIVE ENTROPY

If N data samples are drawn independently from the training set, then the logarithmic likelihood may be written as

L(s) ≡ log( ∏_{k=1}^{N} Q(x^k|s) ) ≈ N ∫ dx P(x) log(Q(x|s))
(5.1)

The integration measure P is a frequentist (i.e. non-Bayesian) PDF; it is used as a convenient shorthand notation, and will not enter into any Bayesian manipulations. On the other hand, the relative entropy (between P and Q) is defined as

G(s) ≡ −∫ dx P(x) log( P(x) / Q(x|s) ) ≤ 0
(5.2)

Relative entropy is zero if, and only if, P and Q are identical, so it may be used as a fitness criterion for comparing P and Q. Relative entropy is the logarithm of the probability (per sample) that samples taken from Q have the frequency of occurrence specified by P. These results can be combined to yield

L(s) ≈ N G(s) + N ∫ dx P(x) log(P(x))
(5.3)

The second term on the right hand side does not depend on the model parameters, so relative entropy is (up to an additive constant) approximately equal to logarithmic likelihood. In the limit of a large number of training samples, relative entropy maximisation is equivalent to maximum likelihood estimation (i.e. it corresponds to approximate scheme 1).
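The relation in Eq. (5.3) is easy to check numerically. The sketch below uses a small discrete data space (a toy example, not from the paper) and shows that the average log-likelihood per sample matches G(s) plus the parameter-independent entropy term.

```python
import numpy as np

# Minimal sketch (toy example, not from the paper): for a discrete data space,
# check Eq. (5.3): L(s) ~= N * G(s) + N * sum_x P(x) log P(x), where G(s) is the
# (non-positive) relative entropy of Eq. (5.2).
rng = np.random.default_rng(1)
P = np.array([0.5, 0.3, 0.2])        # frequentist "real world" distribution P(x)
Q = np.array([0.4, 0.4, 0.2])        # Bayesian model Q(x|s) for some fixed s

N = 100_000
samples = rng.choice(len(P), size=N, p=P)
L = np.log(Q[samples]).sum()                      # logarithmic likelihood, Eq. (5.1)

G = -(P * np.log(P / Q)).sum()                    # relative entropy, Eq. (5.2), G <= 0
const = (P * np.log(P)).sum()                     # parameter-independent term
print(f"L(s)/N                = {L / N:+.4f}")
print(f"G(s) + sum P log P    = {G + const:+.4f}")   # agree for large N
```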

VI. SOME ADAPTIVE NETWORKS REVISITED

In this section a brief review of some familiar adaptive Bayesian networks is presented.

VII. MIXTURE DISTRIBUTION

A mixture distribution is a hidden variables model with a PDF of the form

Q(x|s) = Σ_{c=1}^{M} Q(x|c, s) Q(c|s)
(7.1)

where c is a class label, which is a discrete-valued hidden variable. If the dependence on parameters is modified as follows

Q(x|c, s) Q(c|s) → Q(x|c, s) Q(c)
Q(x|s) → Q(x|s, Q(c))
(7.2)

where the prior probability factor itself comprises M−1 of the parameters (i.e. it is non-parametric), then the iterative re-estimation prescription [4, 5] for maximising relative entropy becomes

ŝ = arg max_{s'} ∫ dx P(x) Σ_{c=1}^{M} Q(c|x, s) log(Q(x|c, s'))
Q̂(c) = ∫ dx P(x) Q(c|x, s)
(7.3)

The structure of a mixture distribution network is shown in Figure 1.

Figure 1: Diagram showing a network suitable for computing a mixture distribution. The input layer receives the data vector, and the output layer produces the corresponding class conditional probabilities. The network parameters are located as indicated.
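The following sketch iterates the re-estimation prescription of Eq. (7.3) for a mixture with M classes. The one-dimensional Gaussian components and their weighted-average updates are standard EM choices assumed here for illustration, and the training-set average stands in for the integral over P(x); none of these particulars are prescribed by the paper.

```python
import numpy as np

# Minimal sketch (assumptions: 1-D Gaussian components; the training-set average
# stands in for the integral over P(x)).  This iterates the re-estimation
# prescription of Eq. (7.3) for a mixture distribution with M classes.
rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])  # training data
M = 2
mu, sigma = np.array([-1.0, 1.0]), np.array([1.0, 1.0])   # parameters s of Q(x|c,s)
prior = np.full(M, 1.0 / M)                                # non-parametric prior Q(c)

def q_x_given_c(x, mu, sigma):
    return np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

for _ in range(50):
    # Posterior class probabilities Q(c|x,s), as in the summand of Eq. (7.3).
    joint = q_x_given_c(x, mu, sigma) * prior
    post = joint / joint.sum(axis=1, keepdims=True)

    # Q_hat(c) = average of Q(c|x,s) over the training set.
    prior = post.mean(axis=0)

    # arg max over s' of the weighted log-likelihood -> weighted Gaussian updates.
    w = post / post.sum(axis=0)
    mu = (w * x[:, None]).sum(axis=0)
    sigma = np.sqrt((w * (x[:, None] - mu) ** 2).sum(axis=0))

print("Q(c):", np.round(prior, 3), " mu:", np.round(mu, 2), " sigma:", np.round(sigma, 2))
```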

VIII. HIDDEN MARKOV MODEL

A hidden Markov model is a mixture distribution whose hidden variables have a memory of their previous state in time. Thus the mixture distribution PDF is modified to become

Q(x^t|s, Q(c^t|c^{t−1})) = Σ_{c^t=1}^{M} Q(x^t|c^t, s) Q(c^t|c^{t−1})
(8.1)

where a discrete time index t now appears, and the prior probability is replaced by a transition matrix. The full joint PDF for a time series of data is then given by

Q(x|s, Q(c^τ|c^{τ−1})) = Q(c^0) ∏_{t=1}^{T} Σ_{c^t=1}^{M} Q(x^t|c^t, s) Q(c^t|c^{t−1})
(8.2)

where a prior probability which specifies the initial distribution of the hidden variable c is included. The re-estimation prescription may be used to optimise the model parameters. The structure of a hidden Markov model network is shown in Figure 2.

Figure 2: Diagram showing a network suitable for computing a hidden Markov model. As in the mixture distribution network, the input layer receives the data vector, and the output layer produces the corresponding class conditional probabilities. The network parameters are located as indicated. The output layer is influenced by its own previous output values, which endows the network with discrete time dynamics.
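The paper does not spell out how the time-series likelihood is evaluated; the sketch below uses the standard forward recursion for a hidden Markov model of this kind, with discrete observation symbols as an assumed simplification.

```python
import numpy as np

# Minimal sketch (standard forward recursion, not taken verbatim from the paper):
# evaluate the likelihood of an observation sequence under a hidden Markov model
# with M hidden classes and discrete observation symbols (an assumed simplification).
M = 2
init = np.array([0.6, 0.4])                    # initial class distribution
trans = np.array([[0.9, 0.1],                  # Q(c^t | c^{t-1}): transition matrix
                  [0.2, 0.8]])
emit = np.array([[0.7, 0.3],                   # Q(x^t | c^t, s): emission probabilities
                 [0.1, 0.9]])                  # rows: class c, columns: symbol x

obs = [0, 0, 1, 1, 1, 0]                       # observed time series x^1 ... x^T

alpha = init * emit[:, obs[0]]                 # alpha_c(1) = Q(c^1) Q(x^1|c^1)
for x_t in obs[1:]:
    alpha = (alpha @ trans) * emit[:, x_t]     # alpha(t) = [alpha(t-1) A] .* emission

print(f"likelihood of the sequence: {alpha.sum():.6f}")
```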

IX. BOLTZMANN MACHINE

Gibbs distributions [6] (or, equivalently, Markov random fields) are a maximum entropy family of model PDF's with the form

Q(x|s) = (1/Z(s)) exp(−s·U(x)),   where Z(s) = ∫ dx exp(−s·U(x))
(9.1)

which depend exponentially on a sum of potentials. Hidden variables may readily be introduced into such models.

Because Gibbs distributions do not generally have the simple structure of mixture distributions or hidden Markov models, it is not usually possible to use the re-estimation prescription to optimise them. A relative entropy gradient ascent algorithm could be used instead, based on the result [7]

∂G(s)/∂s = ∫ dx dh Q(x, h|s) U(x, h) − ∫ dx dh P(x) Q(h|x, s) U(x, h)
(9.2)

This leads to a generalised form of the Boltzmann machine training algorithm [3], which optimises the model in a way that depends on the difference between the free average and the clamped average of the potentials.
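A minimal sketch of this gradient ascent follows, under assumptions made purely for illustration: the model is fully visible (no hidden variables), the state space is tiny enough to enumerate exactly instead of using Monte Carlo averages, and the potentials are simple pairwise products of binary units.

```python
import numpy as np
from itertools import product

# Minimal sketch (assumption: fully visible model, tiny state space enumerated
# exactly).  Gradient ascent on the relative entropy using Eq. (9.2): the update
# is the free average of the potentials minus the clamped (data) average.
rng = np.random.default_rng(3)
states = np.array(list(product([0, 1], repeat=3)), dtype=float)   # all x in {0,1}^3

def potentials(x):
    # U(x): one potential per pair of units (a simple Boltzmann-machine choice).
    return np.array([x[..., 0] * x[..., 1], x[..., 1] * x[..., 2], x[..., 0] * x[..., 2]]).T

U = potentials(states)                       # shape (8, 3)
P = rng.dirichlet(np.ones(len(states)))      # frequentist data distribution P(x)

s = np.zeros(3)                              # potential weights (model parameters)
for _ in range(2000):
    Q = np.exp(-U @ s); Q /= Q.sum()         # Q(x|s) = exp(-s.U(x)) / Z(s)
    free = Q @ U                             # free average of the potentials
    clamped = P @ U                          # clamped average of the potentials
    s += 0.5 * (free - clamped)              # ascend G(s), Eq. (9.2)

print("learned s:", np.round(s, 3))
print("max |Q - P|:", np.abs(Q - P).max())   # small only to the extent the model can fit P
```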

X. ADAPTIVE CLUSTER EXPANSION (ACE)

In this section an up-to-date discussion of the Adaptive Cluster Expansion (ACE) method [7] is presented, followed by a detailed theoretical analysis of a simple ACE network based on mixture distributions. A good example of the application of this approach is the anomaly detector [8, 9]. Other uses (i.e. not PDF models) of the ACE approach have also been published [10, 11].

XI. ACE PHILOSOPHY

One of the major problems with Bayesian networks with hidden variables is their tendency to consume large amounts of computing resources when evaluating the integration over hidden variables (including model parameters) in the exact prediction equation

Q(x+|x−) = ∫ ds dh+ dh− Q(x+, h+|s) Q(s, h−|x−)
(11.1)

These numerical problems can be especially acute if Monte Carlo simulations are used. The ACE approach factorises this equation into the following form

Q(x+|x−) = Q1(x1+|x1−) Q2(x2+|x2−) J(y1(x1+), y2(x2+)|x1−, x2−)
(11.2)

where the first two factors model the PDF in two independent subspaces (or clusters), and the third factor models the joint PDF of transformed versions of the clusters. In the J-factor, the choice of cluster transformations is arbitrary, though the choice affects the performance of the prediction equation. To ensure that the overall normalisation is correct, it is necessary to write the J-factor in the form

J(y1+, y2+|x1−, x2−) = Q12(y1+, y2+|y1−, y2−) / ( Q1(y1+|x1−) Q2(y2+|x2−) )
(11.3)

where the numerator models the PDF of the transformed clusters, and the denominator is the product of the transformed PDF's of the original clusters, defined by

Q1(y1+|x1−) = ∫ dx1+ Q1(x1+|x1−) δ(y1+ − y1(x1+))
Q2(y2+|x2−) = similar
(11.4)

The above factorisation cannot be derived from the original prediction equation; it has to be assumed, and justified as follows. Suppose there are four Bayesians, each of whom has partial access to the training data. Bayesian 1 has access to cluster 1, Bayesian 2 has access to cluster 2, and Bayesian 3 has access to the transformed cluster pairs. Bayesians 1-3 can each independently form an exact prediction equation for his own (sub)space. Bayesian 4 can then use these three results (and no further information) to form a uniquely consistent combination, which is the factorised prediction equation stated earlier.

Finally, each of the factors can be modelled by any of the methods discussed thus far: exact or approximate, and with or without additional hidden variables.
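The normalisation property secured by Eqs. (11.3)-(11.4) can be checked numerically. The sketch below is a toy check, not from the paper: the clusters are discrete, and the cluster transformation y(x) is an arbitrary coarse-graining chosen for illustration.

```python
import numpy as np

# Minimal sketch (toy check, not from the paper): with the J-factor written as in
# Eqs. (11.3)-(11.4), the factorised prediction of Eq. (11.2) sums to one over the
# test data x+.  Clusters are discrete; the cluster transformation y(x) is an
# arbitrary coarse-graining (an assumed choice).
rng = np.random.default_rng(4)

K = 4                                   # values each cluster variable x1+, x2+ can take
y_of_x = np.array([0, 0, 1, 1])         # cluster transformation y(x): 4 values -> 2 values

q1 = rng.dirichlet(np.ones(K))          # Q1(x1+|x1-): predictive PDF of cluster 1
q2 = rng.dirichlet(np.ones(K))          # Q2(x2+|x2-): predictive PDF of cluster 2
q12 = rng.dirichlet(np.ones(4)).reshape(2, 2)   # Q12(y1+, y2+|y1-, y2-): joint model

# Eq. (11.4): transformed PDFs of the original clusters.
q1_y = np.array([q1[y_of_x == v].sum() for v in range(2)])
q2_y = np.array([q2[y_of_x == v].sum() for v in range(2)])

# Eq. (11.2) with the J-factor of Eq. (11.3).
total = 0.0
for x1 in range(K):
    for x2 in range(K):
        y1, y2 = y_of_x[x1], y_of_x[x2]
        J = q12[y1, y2] / (q1_y[y1] * q2_y[y2])
        total += q1[x1] * q2[x2] * J

print(f"sum over x+ of the factorised prediction: {total:.6f}")   # ~= 1.0
```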

XII. APPROXIMATE ACE: REMOVE THE INTEGRALS OVER PARAMETERS

The above ACE expression for two clusters may be approximated as

Q(x+|x−) ≈ Q1(x1+|s1(x1−, x2−)) Q2(x2+|s2(x1−, x2−)) × Q12(y1(x1+), y2(x2+)|s12(x1−, x2−)) / ( Q1(y1(x1+)|s1(x1−, x2−)) Q2(y2(x2+)|s2(x1−, x2−)) )
(12.1)

where the parameter values have been set to their maximum likelihood values (i.e. approximate scheme 1)

( s1(x1−, x2−), s2(x1−, x2−), s12(x1−, x2−) ) = arg max_{s1, s2, s12} [ Q1(x1−|s1) Q2(x2−|s2) Q12(y1(x1−), y2(x2−)|s12) / ( Q1(y1(x1−)|s1) Q2(y2(x2−)|s2) ) ]
(12.2)

Note that the parameters need to be jointly optimised because the term arising from the J-factor is not independent of the first two factors.

When the amount of training data is large, this result is equivalent to relative entropy maximisation. Thus define

G1(s1) = −∫ dx1 P(x1) log( P(x1) / Q1(x1|s1) )
G2(s2) = similar
I12(s1, s2, s12) = ∫ dy1 dy2 P(y1, y2|s1, s2) log( Q12(y1, y2|s12) / (Q1(y1|s1) Q2(y2|s2)) )
I0 = ∫ dx1 dx2 P(x1, x2) log( P12(x1, x2) / (P1(x1) P2(x2)) )
(12.3)

to obtain [12]

G(s1, s2, s12) = G1(s1) + G2(s2) + I12(s1, s2, s12) − I0
(12.4)

The first two terms are the relative entropies evaluated independently for each cluster, the third term is (a model of) the mutual information between the transformed clusters, and the fourth term is the (constant) mutual information between the clusters themselves.
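The decomposition in Eq. (12.4) can be verified numerically for a discrete toy model. In the sketch below the distributions and the coarse-graining transformation are arbitrary illustrative choices, and the transformation is held fixed (in the paper it depends on the first-stage parameters).

```python
import numpy as np

# Minimal sketch (discrete toy, not from the paper): numerically verify the
# decomposition of Eq. (12.4), G = G1 + G2 + I12 - I0, for an ACE model whose
# factors are simple discrete tables.  The distributions and the fixed
# coarse-graining transformation y(x) are arbitrary illustrative choices.
rng = np.random.default_rng(5)
K = 4
y_of_x = np.array([0, 0, 1, 1])                       # cluster transformation y(x)

P12 = rng.dirichlet(np.ones(K * K)).reshape(K, K)     # data distribution P(x1, x2)
P1, P2 = P12.sum(axis=1), P12.sum(axis=0)             # marginals P1(x1), P2(x2)

Q1 = rng.dirichlet(np.ones(K))                        # model Q1(x1|s1)
Q2 = rng.dirichlet(np.ones(K))                        # model Q2(x2|s2)
Q12 = rng.dirichlet(np.ones(4)).reshape(2, 2)         # model Q12(y1, y2|s12)

def coarse(p):
    return np.array([p[y_of_x == v].sum() for v in range(2)])

Q1y, Q2y = coarse(Q1), coarse(Q2)                     # transformed model PDFs
Py = np.zeros((2, 2))                                 # data distribution of (y1, y2)
for x1 in range(K):
    for x2 in range(K):
        Py[y_of_x[x1], y_of_x[x2]] += P12[x1, x2]

G1 = -(P1 * np.log(P1 / Q1)).sum()                    # per-cluster relative entropies
G2 = -(P2 * np.log(P2 / Q2)).sum()
I12 = (Py * np.log(Q12 / np.outer(Q1y, Q2y))).sum()   # model of the mutual information
I0 = (P12 * np.log(P12 / np.outer(P1, P2))).sum()     # mutual information of the clusters

# Full relative entropy of the factorised ACE model, Eqs. (11.2)-(11.3).
Qfull = np.zeros((K, K))
for x1 in range(K):
    for x2 in range(K):
        y1, y2 = y_of_x[x1], y_of_x[x2]
        Qfull[x1, x2] = Q1[x1] * Q2[x2] * Q12[y1, y2] / (Q1y[y1] * Q2y[y2])
G = -(P12 * np.log(P12 / Qfull)).sum()

print(f"G1 + G2 + I12 - I0 = {G1 + G2 + I12 - I0:+.6f}")
print(f"G (direct)         = {G:+.6f}")
```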

XIII. MIXTURE DISTRIBUTION ACE

Although the integrals over parameters were removed above, it is still possible to introduce additional hidden variables. Thus model each PDF in the ACE expression as a mixture distribution

Q1(x1|s1) = Σ_{c=1}^{M} Q1(x1|c, s1) Q1(c|s1)
Q2(x2|s2) = similar
Q12(y1, y2|s12) = Σ_{c=1}^{M} Q12(y1, y2|c, s12) Q12(c|s12)
(13.1)

Furthermore, it is possible to define the cluster transformations using quantities that are computed by the mixture distribution model. For instance, one possible choice is the vector of posterior class probabilities. Omitting the cluster suffixes, this may be written as

y(x|s) = ( Q(x|c=1, s) Q(c=1|s), Q(x|c=2, s) Q(c=2|s), ··· ) / Σ_{c=1}^{M} Q(x|c, s) Q(c|s)
(13.2)

This choice is non-Bayesian, because a probability is being used as if it were a state vector, which is not permitted for Bayesian probabilities. However, because it is being used only as a cluster transformation, and not in any further Bayesian manipulations, this procedure is allowed.

The structure of an ACE network suitable for computing the above results is shown in Figure 3.

Figure 3: Diagram showing an ACE network. The second stage receives as input two vectors of posterior class probabilities computed in the first stage.

It is also possible to replace the mixture distributions used above by hidden Markov models. This is a natural way of introducing temporal behaviour into ACE.
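Returning to the cluster transformation of Eq. (13.2), the sketch below computes the vector of posterior class probabilities for a one-dimensional Gaussian mixture (the Gaussian components are an assumed choice); this is the vector a first-stage cluster would hand on to the second stage of the ACE network (cf. Figure 3).

```python
import numpy as np

# Minimal sketch (assumed 1-D Gaussian components): the cluster transformation of
# Eq. (13.2) maps a data vector x to its vector of posterior class probabilities
# under the cluster's mixture model.
mu = np.array([-2.0, 0.0, 3.0])          # component means (parameters s)
sigma = np.array([1.0, 0.5, 1.0])        # component standard deviations
prior = np.array([0.3, 0.4, 0.3])        # Q(c|s)

def y_transform(x):
    q_xc = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
    num = q_xc * prior                   # Q(x|c,s) Q(c|s) for each class c
    return num / num.sum()               # Eq. (13.2): normalise by Q(x|s)

for x in (-2.5, 0.2, 2.0):
    print(f"x = {x:+.1f}  ->  y(x|s) = {np.round(y_transform(x), 3)}")
```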

(14.2)

L+1

is ob-

tained by iterating the back-propagation equations r (xr ) br = ∂J∂s + r bN +1 = 0

∂yr ∂sr .br+1 ;

r = N, N − 1, · · · , 3, 2, 1 (14.3)

The ow of information needed to compute the above results is shown in Figure 4. In Appendix B, expressions for the derivatives of the

It is also possible to replace the mixture distributions

relative entropy are presented, together with a brief dis-

used above, by hidden Markov models. This is a natural

cussion on self-supervision.

way of introducing temporal behaviour into ACE.
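To make the recursion of Eqs. (14.2)-(14.3) concrete, the sketch below applies it to a toy scalar network with assumed functional forms (quadratic per-layer contributions and a tanh transformation, neither of which comes from the paper) and checks the result against finite differences.

```python
import numpy as np

# Minimal sketch (toy scalar network, assumed functional forms): check the
# layer-wise gradient recursion of Eqs. (14.2)-(14.3) against finite differences.
# Each layer L contributes J_L(x_L) = (x_L - s_L)^2 to G(x) and transforms its
# input as y_L = tanh(s_L * x_L), which becomes the next layer's input.
N = 3

def forward(x1, s):
    xs = [x1]
    for L in range(N):
        xs.append(np.tanh(s[L] * xs[L]))       # x_{L+1} = y_L(x_L)
    return xs                                  # xs[L] is the input to layer L+1 (0-based)

def G_of_x(x1, s):
    xs = forward(x1, s)
    return sum((xs[L] - s[L]) ** 2 for L in range(N))

def grad_recursion(x1, s):
    xs = forward(x1, s)
    b = 0.0                                    # b_{N+1} = 0
    grads = np.zeros(N)
    for L in reversed(range(N)):               # r = N, N-1, ..., 1
        dJ_dx = 2 * (xs[L] - s[L])
        dJ_ds = -2 * (xs[L] - s[L])
        dy_dx = s[L] * (1 - xs[L + 1] ** 2)    # d tanh(s x)/dx
        dy_ds = xs[L] * (1 - xs[L + 1] ** 2)   # d tanh(s x)/ds
        grads[L] = dJ_ds + dy_ds * b           # Eq. (14.2)
        b = dJ_dx + dy_dx * b                  # Eq. (14.3)
    return grads

s = np.array([0.7, -1.2, 0.4])
x1 = 0.9
eps = 1e-6
fd = np.array([(G_of_x(x1, s + eps * np.eye(N)[L]) - G_of_x(x1, s - eps * np.eye(N)[L]))
               / (2 * eps) for L in range(N)])
print("recursion        :", np.round(grad_recursion(x1, s), 6))
print("finite difference:", np.round(fd, 6))
```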

XV. CONCLUSIONS

The theory of adaptive Bayesian networks is very powerful, and contains many familiar methods as special cases. The main problem with the Bayesian method, when applied rigorously, is the need to perform integrations over hidden variables, which can involve lengthy Monte Carlo simulations in some cases. The Adaptive Cluster Expansion (ACE) method reviewed in this paper provides one possible escape route from the need for such integrations. ACE provides a scalable network architecture that can readily be applied to large datasets, such as images.

Appendix A: BASIC NOTATION

x^k = k-th vector in data space
x− = (x^1, x^2, ···, x^N) = past data (training data)
x+ = (x^{N+1}, x^{N+2}, ···) = future data (test data)
y(x) = transformation of data vector
y(x±) = y± = transformation of all future/past data
s, h = hidden variables vectors
P(x) = PDF of data (frequentist "real world")
Q(x|s) = conditional PDF of data given a parameter vector (Bayesian model)
Q(x+|x−) = conditional PDF of future data given past data (Bayesian model)
L(s) = logarithmic likelihood
G(s) = relative entropy

Appendix B: GRADIENT OF RELATIVE ENTROPY IN MIXTURE DISTRIBUTION ACE

The gradients of the relative entropy are

∂G1(s1)/∂s1 = ∫ dx1 P(x1) ∂ log Q1(x1|s1)/∂s1
∂G2(s2)/∂s2 = similar
∂I12(s1, s2, s12)/∂s1 = ∫ dx1 dx2 P(x1, x2) (∂y1/∂s1)·( ∂ log Q12(y1, y2|s12)/∂y1 )|_{y1 = y1(x1|s1), y2 = y2(x2|s2)} − ∫ dy1 P(y1|s1) ∂ log Q1(y1|s1)/∂s1
∂I12(s1, s2, s12)/∂s2 = similar
∂I12(s1, s2, s12)/∂s12 = ∫ dy1 dy2 P(y1, y2|s1, s2) ∂ log Q12(y1, y2|s12)/∂s12
(B1)

The derivative of (the model of) the mutual information with respect to the parameters in the first stage of the network causes the training of the left and right cluster transformations to become mutually coupled [13, 14]. This effect, which is called self-supervision, is important because it co-ordinates the training of different layers in a multilayer unsupervised network (of which the ACE network is but a particular example).

[1] R. T. Cox (1946). Probability, frequency and reasonable expectation. Am. J. Phys., 14(1), 1-13.
[2] H. Jeffreys (1939). Theory of Probability (Clarendon Press, Oxford).
[3] D. H. Ackley, G. E. Hinton and T. J. Sejnowski (1985). A learning algorithm for Boltzmann machines. Cognitive Sci., 9(2), 147-169.
[4] L. E. Baum (1972). An inequality and associated maximisation technique in statistical estimation for probabilistic functions of Markov processes. Inequalities, 3(1), 1-8.
[5] A. P. Dempster, N. M. Laird and D. B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. B, 39(1), 1-38.
[6] R. Kindermann and J. L. Snell (1980). Contemporary Mathematics vol. 1: Markov Random Fields and their Applications (American Mathematical Society, Providence).
[7] S. P. Luttrell (1989). The use of Bayesian and entropic methods in neural network theory. In Maximum Entropy and Bayesian Methods, edited by G. Erickson, J. T. Rychert and C. R. Smith (Kluwer, Dordrecht), 363-370.
[8] S. P. Luttrell (1990). A trainable texture anomaly detector using the adaptive cluster expansion (ACE) method. RSRE, Malvern. Technical Report 4437.
[9] S. P. Luttrell (1991). Adaptive Cluster Expansion (ACE): a multilayer network for estimating probability density functions.
[10] S. P. Luttrell (1989). Hierarchical vector quantisation. Proc. Inst. Electr. Eng. I, 136(6), 405-413.
[11] S. P. Luttrell (1989). Image compression using a multilayer neural network. Patt. Recogn. Lett., 10(1), 1-7.
[12] S. P. Luttrell (1991). A hierarchical network for clutter and texture modelling. In Proc. SPIE Conf. on Adaptive Signal Processing (SPIE, San Diego), 518-528.
[13] S. P. Luttrell (1991). Self-supervised training of hierarchical vector quantisers. In Proc. 2nd IEE Conf. on Artificial Neural Networks (IEE, Bournemouth), 5-9.
[14] S. P. Luttrell (1992). Self-supervised adaptive networks. Proc. IEE F, 139(6), 371-377.
