Bayesian Networks for Dynamic Classification

Shengtong Zhong a, Helge Langseth a, Thomas D. Nielsen b a Department

of Computer and Information Science, The Norwegian University of Science and Technology, Trondheim, Norway

b Department

of Computer Science, Aalborg University, Aalborg, Denmark

Abstract

Monitoring a complex process often involves keeping an eye on hundreds or thousands of sensors to determine whether or not the process is stable. We have been working with dynamic data from an oil production facility in the North Sea, where unstable situations should be identified as soon as possible. Motivated by this problem setting, we propose a generative model for dynamic classification in continuous domains. At each time point the model can be seen as combining a naïve Bayes model with a mixture of factor analyzers. The latent variables of the factor analyzers are used to capture the state-specific dynamics of the process as well as modeling dependencies between attributes. Exact inference in the proposed model is intractable and computationally expensive, so in this paper we focus on an approximate inference scheme.

1

Introduction

When doing classification, the task is to assign a class label to an object. Objects are characterized by a collection Y = {Y1, Y2, . . . , Yn} of random variables, and a specific object is defined by a value assignment y = {y1, y2, . . . , yn} to these variables. The set of possible labels (or classes) for an object is represented by a class variable C, whose state space is denoted sp(C). We will focus on real-valued attributes in this paper, meaning that y ∈ R^n. In a probabilistic framework, it is well known that the optimal classifier will

Email addresses: [email protected] (Shengtong Zhong), [email protected] (Helge Langseth), [email protected] (Thomas D. Nielsen).

Preprint submitted to Reliability Engineering & System Safety

28 May 2012

label an object y by the class label c∗, where

c∗ = arg min_{c ∈ sp(C)} Σ_{c′ ∈ sp(C)} L(c, c′) P(c′ | y),

and L(c, c′) is the loss function encoding the cost of misclassification. Learning a classifier therefore amounts to estimating the probability distribution P(C = c | y).

In order to reduce complexity, certain independence assumptions are often introduced. For example, in the naïve Bayes (NB) classifier all attributes are assumed to be independent given the class. Although the NB model obtains competitive results in some application domains, it can also perform poorly in many other domains. This has motivated the development of a variety of improvements to overcome the often overly simplistic independence assumption. Among these approaches, the tree-augmented naive Bayes (TAN) [7] and the aggregating one-dependence estimator (AODE) [24] relax the independence assumptions by allowing a more general correlation structure in the model. Alternatively, the latent classification model (LCM) [16] encodes conditional dependencies among the attributes using latent variables.

Common to all these classification techniques is the assumption that the different objects are independently and identically distributed. This assumption is violated in dynamic domains, where a temporal relation in the data can introduce correlations between the different observations. To exemplify, let us consider a system that monitors a complex process. Overseeing such processes will often mean keeping an eye on hundreds or thousands of sensors, each of them updating their readings several times per second. Real-life processes have their own natural dynamics when everything is running according to plan; “outliers” may be seen as indications that the process is leaving its stable state, thereby becoming more dangerous.
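The decision rule at the start of this section can be sketched numerically. The posterior values and the loss matrix below are hypothetical, chosen only to show how an asymmetric loss changes the decision; in a monitoring setting, missing an unstable state would typically be priced much higher than a false alarm.

```python
import numpy as np

# Hypothetical loss matrix L[c, c']: cost of predicting c when the truth is c'.
# Missing "unstable" (class 1) is assumed 10x costlier than a false alarm.
L = np.array([[0.0, 10.0],
              [1.0, 0.0]])

def bayes_optimal_label(posterior, loss):
    """Return argmin_c sum_{c'} L(c, c') P(c' | y)."""
    expected_loss = loss @ posterior  # expected loss of predicting each class c
    return int(np.argmin(expected_loss))

posterior = np.array([0.8, 0.2])  # P(c' | y) for (stable, unstable)
print(bayes_optimal_label(posterior, L))  # prints 1: the loss favors flagging "unstable"
```

Even with posterior mass 0.8 on the stable class, the expected losses are (2.0, 0.8), so the asymmetric loss selects the unstable label.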
We want the monitoring system to determine whether or not the process is under control at all times (that is, to detect unstable behavior) in order to be able to shut down an unsafe process or, even better, to detect that the system is about to become unstable (that is, to predict future problems). The latter would give the system operators the chance to implement countermeasures before the system has exposed anyone to an increased risk. Classifiers that fail to take the dynamic aspect of the process into account will not be able to make accurate predictions, and will therefore not be able to recognize a problem under development.

The framework of dynamic Bayesian networks [11] supports the specification of dynamic processes. A simple instance of this framework is the hidden Markov model (HMM), which has previously been considered for classification purposes [13,22]; to this end the “hidden” node in the HMM is used as the classification node, and the attributes at time t are assumed to be independent of those at time t + 1 given the class label at either of the two points in time. Further simplification can be obtained by assuming that all attributes at one

time step are conditionally independent given the class label at that time step; the resulting model by [20] is known as a dynamic naïve Bayes (dNB) classifier. The dNB models can be effectively learned from data due to the relatively small number of parameters required to specify them. However, the dNB has limitations when it comes to expressive power that are similar to those of its static counterpart, which can lead to reduced classification accuracy when the independence assumptions are violated.

To the best of our knowledge, there has been no systematic investigation into the properties of probabilistic classifiers and their applicability to real-life dynamic data. In this paper we take a step in that direction by examining the underlying assumptions of some well-known probabilistic classifiers and their natural extensions to dynamic domains. We do so by carefully linking our analysis back to a real-life dataset, and the result of this analysis is a classification model which can be used, e.g., to help prevent unwanted events by automatically analyzing a data stream and raising an alert if the process is entering an unstable state.

This paper is organized as follows: In Section 2 we describe the target domain in some detail. This is followed by Section 3, where we give an overview of the dynamic classification scheme in general and propose a specific classification model denoted the dynamic latent classification model (dLCM). Next, we look at inference and learning in dLCMs (Section 4), before reporting on their classification accuracy in Section 5. Finally, in Section 6 we conclude and give directions for future research.

2

The domain and the dataset

The dataset we consider is from an oil production installation in the North Sea. Data, consisting of some sixty variables, is captured every five seconds. The data is monitored in real time by experienced engineers, who have a number of tasks to perform, ranging from understanding the situation on the platform (activity recognition), via avoiding a number of either dangerous or costly situations (event detection), to optimization of the drilling operation. The variables that are collected cover both topside measurements (like flow rates) and down-hole measurements (like, for instance, gamma rate).

The goal of this research is to build a dynamic classification system that can help the drilling engineers by automatically analyzing the data stream and classifying the situation accordingly. To do so, we will first investigate a number of different dynamic classification frameworks, and thereafter propose a new model for dynamic classification in general. However, for the discussions to be concrete, we will tie the development to the task of activity recognition.

The drilling of a well is a complex process, which consists of activities that are performed iteratively as the length of the well increases, and knowing which activity is performed at any time is important in several contexts: Firstly, some undesired events can only happen during specific activities, and knowing the current activity is therefore of high importance to the drilling engineer from a safety perspective; secondly, operators are consistently looking for more cost-efficient ways of drilling, and the sequencing of activities during an operation is important for hunting down potential time-sinks. Activity recognition is a task that also finds applications as diverse as healthcare [23] and video analysis [17].

In the Wellsite Information Transfer Specification (WITS), a total of 34 different activities with associated activity codes are defined. Each activity has its separate purpose and consists of a set of actions. Out of the 34 different drilling activities in total, only a handful are really important to recognize. The important activities, which roughly correspond to those that constitute most of the total well drilling time, are described next:

2 – Drilling: The activity when the well is gaining depth by crushing rock at the bottom of the hole and removing the crushed pieces (cuttings) out of the well-bore. Thus, the drill string is rotating during this activity, and mud is circulated at low speed to transport out the cuttings. The activity is interrupted by other activities, but continues until the well reaches the reservoir and production of oil may commence.

3 – Connection: This activity involves changing the length of the drill-string, by either adding or removing pieces of drill-pipe.

4 – Reaming: Reaming means widening a section of the well hole after it has been drilled the first time.

7 – Circulation: Circulation resembles drilling to the extent that the mud is flowing and the drill string is rotating at a low speed. However, new cuttings are not produced. As for reaming, the point of this activity is to clean recently drilled sections of the well, but the radius of the well hole is not widened as much as during reaming, and circulation can therefore be seen as a less severe cleaning activity.

8 – Tripping in: This is the act of running the drill string into the well hole.

9 – Tripping out: Tripping out means pulling the drill string out of the well bore.

Additionally, the activity 0 – Other is used to describe events that do not fit any of the other activity definitions, making sure that exactly one activity is performed at all times.

Out of the approximately 60 attributes that are collected, domain experts have selected the following as most important for activity detection: Depth Bit Measured, Depth Hole Measured, Block Position, Hookload, Weight On Bit, Revolutions Per Minute, Torque, Mud Flow In, and Standpipe Pressure. These variables are also the ones we will use in the following.

3

From Static to Dynamic Bayesian Classifiers

In this section we develop a general framework for performing dynamic classification. The framework will be specified incrementally by examining its expressivity relative to the oil production data.

3.1 Static classifiers

Standard (static) classifiers like NB [5] or TAN [7] assume that the class variable and attributes at different time points are independent given the model M, i.e., {C^{t1}, Y^{t1}} ⊥⊥ {C^{t2}, Y^{t2}} | M for all t1 ≠ t2. This independence assumption is clearly violated in many domains and, in particular, in domains that specify a process evolving over time. To validate the independence assumptions in practice, we can for instance compare the marginal distribution of the class variable with the conditional distribution of the class variable given its value at the previous time step. The results using the oil production data are shown in Table 1, where we see a considerable correlation between the class variables of consecutive time slices. In particular, if the system was in the drilling activity at time t − 1, the probability of being in the drilling activity also at time t changes from 0.325 (static classifier) to 0.997 (dynamic classifier). The reason for this dramatic difference is that the system tends to remain in the drilling activity once drilling has been initiated, an effect that the static classifier is unable to represent.

P(C^t = drilling)                          3.25 · 10^{-1}
P(C^t = drilling | C^{t-1} = ¬drilling)    2.90 · 10^{-3}
P(C^t = drilling | C^{t-1} = drilling)     9.97 · 10^{-1}

Table 1. Comparison between P(c^t) and P(c^t | c^{t-1}). The class “¬drilling” covers the activities that do not belong to drilling.
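Conditional probabilities like those in Table 1 can be estimated by simple frequency counts over a labeled sequence. A minimal sketch, using a short synthetic label sequence (the labels are illustrative, not taken from the dataset):

```python
from collections import Counter

def transition_probs(labels):
    """Estimate P(c_t | c_{t-1}) from a label sequence by frequency counts."""
    pair_counts = Counter(zip(labels, labels[1:]))   # counts of (c_{t-1}, c_t)
    prev_counts = Counter(labels[:-1])               # counts of c_{t-1}
    return {(prev, cur): n / prev_counts[prev]
            for (prev, cur), n in pair_counts.items()}

labels = ["drill", "drill", "drill", "conn", "drill", "drill"]
P = transition_probs(labels)
print(P[("drill", "drill")])  # 0.75: three of four transitions out of "drill" stay in "drill"
```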

One way to capture this dependence is to explicitly take the dynamics of the process into account, i.e., to look at the class of dynamic classifiers.

3.2 A simple dynamic classifier

The temporal dynamics of the class variable can be described using, e.g., a first-order Markov chain, where P(c^t | c^{1:t−1}) = P(c^t | c^{t−1}) for all t. By combining this temporal model with the class-conditional observation model for the attributes we have the well-known hidden Markov model. This model is

Fig. 1. Attributes are assumed to be conditionally independent given the class variable (equivalent structure for an HMM), with n = 4.

described by a prior distribution over the class variable P(c^0), a conditional observation distribution P(y^t | c^t), and transition probabilities for the class variable P(c^t | c^{t−1}); we assume that the model is stationary, i.e., P(y^t | c^t) = P(y^s | c^s) and P(c^t | c^{t−1}) = P(c^s | c^{s−1}) for all s, t ≥ 1.

With a continuous observation vector, the conditional distribution may be specified by a class-conditional multivariate Gaussian distribution with mean µ_c and covariance matrix Σ_c, i.e., Y | {C = c} ∼ N(µ_c, Σ_c) [9]. Unfortunately, learning a full covariance matrix involves estimating a number of parameters that is quadratic in the number of attributes, which may result in over-fitting when data is scarce compared with the number of free parameters. One approach to alleviate this problem is to introduce additional independence assumptions about the domain being modeled. Specifically, by assuming that all variables are independent given the class variable, we will at each time step have an NB model defined by a diagonal covariance matrix, thus requiring only O(n) parameters to be learned, where n = |Y| is the number of attributes in the model. This structure corresponds to the dNB model for discrete domains. A graphical representation of the resulting independence assumptions can be seen in Fig. 1 in the form of a two time-slice DBN [19].

As for the (static) NB model, the independence assumptions encoded in the dNB model are often violated in real-world settings. For example, if we consider the measured flow of drilling fluid going into the well (Mud Flow In) and the observed pressure in the well (Standpipe Pressure), and plot their values conditioned on the class variable (activities tripping in and tripping out), it is evident that there is a conditional correlation between the two attributes given the class, see Fig. 2; similar results are also obtained when considering other pairs of attributes.
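The parameter-count argument above can be made concrete. A small sketch, plugging in the dimensions of the oil-drilling data (n = 9 attributes, 5 activities) to compare a full class-conditional covariance with the diagonal (dNB-style) one:

```python
n = 9          # number of attributes in the oil-drilling data
n_classes = 5  # number of activities

# Full covariance: n mean entries + n(n+1)/2 free covariance entries per class.
full_params = n_classes * (n + n * (n + 1) // 2)
# Diagonal (dNB-style): n mean entries + n variances per class.
diag_params = n_classes * (n + n)

print(full_params, diag_params)  # 270 90
```

The quadratic growth of the full covariance is what makes over-fitting a concern when data within a class is scarce.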

3.3 Modeling dependence between attributes

There are several approaches to model attribute dependence. For example, Friedman et al. [8] propose an extension of the TAN model [7] to facilitate continuous domains. In the TAN framework, each attribute is allowed to have at most one parent besides the class variable. As an alternative, [16] present

Fig. 2. Scatter plot of the Mud Flow In (x-axis) and the Stand Pipe Pressure (y-axis) for the two classes in the oil production data (black “+” is the tripping in class and grey “o” is the tripping out class). The conditional correlation between the two attributes is evident.

the latent classification model (LCM), which can be seen as combining the NB model with a factor analysis model [6]. An LCM offers a natural extension of the NB model by introducing latent variables Z = (Z1, . . . , Zk) as children of the class variable C and parents of all the attributes Y = (Y1, . . . , Yn). The latent variables and the attributes work as a factor analyzer focusing on modeling the correlation structure among the attributes.

Following the approach by [16], we introduce latent variables to encode conditional dependencies among the attributes. Specifically, for each time step t we have the vector Z^t = (Z_1^t, . . . , Z_k^t) of latent variables that appear as children of the class variable and parents of all the attributes (see Fig. 3). The latent variable Z^t is assigned a multivariate Gaussian distribution conditioned on the class variable, and the attributes Y^t are likewise assumed to follow a multivariate Gaussian distribution conditioned on the latent variables:

Z^t | {C^t = c^t} ∼ N(µ_{c^t}, Σ_{c^t}),
Y^t | {Z^t = z^t} ∼ N(L z^t, Θ),

where Σ_{c^t} and Θ are diagonal and L is the transition matrix; note the stationarity assumption encoded in the model. In this model, the latent variables capture the dependencies between the attributes. They are conditionally independent given the class but marginally dependent. Furthermore, the same mapping, L, from the latent space to the

attribute space is used for all classes, and hence, the relation between the class and the attributes is conveyed by the latent variables only.

Fig. 3. In each time step, the conditional dependencies between the attributes are encoded by the latent variables (Z_1^t, . . . , Z_k^t), with k = 2, n = 4.
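A quick way to see how the latent variables induce attribute dependence: integrating out Z^t from the two Gaussians above gives a class-conditional marginal Y^t | {C^t = c} ∼ N(L µ_c, L Σ_c L′ + Θ), whose covariance has off-diagonal entries even though Σ_c and Θ are diagonal. A sketch with the Fig. 3 dimensions (k = 2, n = 4); the matrix values are illustrative, not learned parameters:

```python
import numpy as np

k, n = 2, 4
rng = np.random.default_rng(1)
L = rng.normal(size=(n, k))                      # loading matrix, latent -> attributes
Sigma_c = np.diag([1.0, 2.0])                    # diagonal class-conditional latent covariance
Theta = np.diag(rng.uniform(0.1, 0.5, size=n))   # diagonal observation noise

# Marginal covariance of Y^t given class c, with Z^t integrated out:
cov_y = L @ Sigma_c @ L.T + Theta

off_diag = cov_y - np.diag(np.diag(cov_y))
print(np.abs(off_diag).max() > 0)  # True: attributes are correlated within the class
```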

The model in Fig. 3 assumes that the attributes in different time slices are independent given the class variable. This assumption implies that the temporal dynamics are captured at the class level only. When the state specification of the class variable is coarse, this assumption will rarely hold [1]. For the oil production data, this assumption does not hold, as we can see in Fig. 4, which shows the conditional correlation of the Standpipe Pressure attribute in successive time slices in both the tripping in and tripping out activities.

Fig. 4. Scatter plot of the Stand Pipe Pressure (x-axis is the value of Stand Pipe Pressure at time t and y-axis is the value of Stand Pipe Pressure at time t + 1) for the two classes in the oil production data (black “+” is the tripping in class and gray “o” is the tripping out class). The conditional dynamic correlation of this attribute over time is evident.
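The within-class temporal dependence visible in Fig. 4 can be quantified with the lag-1 correlation of an attribute inside one class. A sketch on a synthetic AR(1) series standing in for Standpipe Pressure within one activity (the AR coefficient 0.9 is an assumption for illustration, not a property of the dataset):

```python
import numpy as np

def lag1_corr(x):
    """Pearson correlation between x[t] and x[t+1]."""
    return float(np.corrcoef(x[:-1], x[1:])[0, 1])

rng = np.random.default_rng(0)
# Synthetic AR(1) series: x[t] = 0.9 * x[t-1] + noise.
x = np.zeros(2000)
for t in range(1, len(x)):
    x[t] = 0.9 * x[t - 1] + rng.normal()

print(round(lag1_corr(x), 2))  # close to 0.9 for an AR(1) with coefficient 0.9
```

A lag-1 correlation near zero would support the independence assumption of Fig. 3; a value far from zero, as in Fig. 4, argues for modeling the dynamics explicitly.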

We propose to address this apparent shortcoming by modeling the dynamics of the system at the level of the latent variables, which semantically can be seen as a compact representation of the “true” state of the system. Specifically, we

[1] Obviously, the finer the granularity of the state specification of the class variable, the more accurate this assumption will be.

encode the state-specific dynamics by assuming that the latent variable vector Z^t follows a linear multivariate Gaussian distribution conditioned on Z^{t−1}:

Z^t | {Z^{t−1} = z^{t−1}, C^t = c^t} ∼ N(A_{c^t} z^{t−1}, Σ_{c^t}),

where A_{c^t} is the transition dynamics for the latent variables. A graphical representation of the model is given in Fig. 5; it will be referred to as a dynamic latent classification model (dLCM).

Fig. 5. The state specific dynamics are encoded at the level of the latent variables, with k = 2, n = 4.
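The generative process above can be sketched by sampling a latent chain Z^t ∼ N(A_c Z^{t−1}, Σ_c) and emissions Y^t ∼ N(L Z^t, Θ) for one fixed class c. All matrix values below are illustrative placeholders, not learned parameters:

```python
import numpy as np

rng = np.random.default_rng(2)
k, n, T = 2, 4, 100

A_c = np.array([[0.95, 0.0],
                [0.10, 0.8]])   # latent transition dynamics for class c (illustrative)
Sigma_c = 0.1 * np.eye(k)       # diagonal latent process noise
L = rng.normal(size=(n, k))     # loading matrix, latent -> attributes
Theta = 0.05 * np.eye(n)        # diagonal observation noise

z = np.zeros(k)
ys = []
for _ in range(T):
    z = rng.multivariate_normal(A_c @ z, Sigma_c)      # Z^t | Z^{t-1}, C^t = c
    ys.append(rng.multivariate_normal(L @ z, Theta))   # Y^t | Z^t

Y = np.vstack(ys)
print(Y.shape)  # (100, 4)
```

With the class fixed, this is exactly a linear dynamical system, which is the observation that motivates the next subsection.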

3.4 Modeling non-linear systems

One of the main assumptions in the model above is that there is a linear mapping from the latent variables to the attribute space, as well as a linear mapping from the latent variables in a given time slice to the latent variables in the succeeding time slice (i.e., that the state-specific dynamics are linear). When the class variable is fixed at a single value, the model is equivalent to a stochastic linear dynamical system (LDS) [1], also known as a state-space model. Given a latent continuous space of sufficient dimension, an LDS can represent any complex real-world process, but the computational cost makes this infeasible in practice. In order to reduce the computational cost while maintaining the representational power, we introduce a discrete mixture variable M for each time slice (Fig. 6), as done by [16] for static domains. A similar construction for dynamic domains can be found in [2,12], where a probabilistic model called the switching state-space model is proposed by combining discrete and continuous dynamics; the representational power and computational efficiency are well demonstrated in their work. When the value of the class variable is fixed, the model proposed in this paper differs from their work, because there are no dynamics on the latent mixture node while the discrete class variable carries the dynamics over time. Switching state-space models focus on modeling a non-linear real-world process with one single system state (class), whereas our model tries to capture the non-linearity of multiple system states (classes) at the same time.

Fig. 6. A mixture variable M is introduced at each time slice to extend the model, with k = 2, n = 4.

We call the resulting model the dynamic latent classification model (dLCM); each time slice can now be seen as combining a naïve Bayes model with a mixture of factor analyzers. In this case, the mixture variable follows a multinomial distribution conditioned on the class variable, and the attributes Y^t follow a multivariate Gaussian distribution conditioned on the latent variables and the discrete mixture variable, i.e.:

M^t | {C^t = c^t} ∼ P(M^t | C^t = c^t),
Y^t | {Z^t = z^t, M^t = m^t} ∼ N(L_{m^t} z^t, Θ_{m^t}),

where 1 ≤ m^t ≤ |sp(M)|, P(M^t | C^t = c^t) ≥ 0, and Σ_{m^t=1}^{|sp(M)|} P(M^t = m^t | C^t = c^t) = 1 for all 1 ≤ c^t ≤ |sp(C)|.

4

Learning and inference

In what follows we discuss algorithms for performing inference and learning in the proposed models.

4.1 Learning

Learning the dLCM model involves estimating the parameters of the model, the number of latent variables, and the number of mixture components.

With the number of latent variables and the number of mixture components specified, parameter learning becomes simple: since the class variables are always observed during learning, the resulting model is similar to a linear state-space model, for which existing EM algorithms [4] can be applied. Following the approach from [16], we determine the number of latent variables and the number of mixture components by performing a systematic search over a selected subset of candidate structures (note that a model structure is completely specified by its number of latent variables and number of mixture components). For each candidate structure we learn the model parameters by applying the EM algorithm, and the quality of the resulting model is then evaluated using the wrapper approach [15]. Finally, we select the model with the highest score. In the following, we present the parameter learning algorithm for dLCM models in the form of an EM algorithm, which is a straightforward modification of the original EM framework.

4.1.1 The E-step

The E-step of the EM algorithm computes the expected log likelihood of the completed observations, given the data:

Q = E[ log Π_{t=1}^T f(·^t) | D^{1:T} ] = Σ_{t=1}^T E[ log f(·^t) | D^{1:T} ]
  = Σ_{t=1}^T E[ log P(c^t | c^{t−1}) | D^{1:T} ] + Σ_{t=1}^T E[ log P(M^t | c^t) | D^{1:T} ]
  + Σ_{t=1}^T E[ log p(Z^t | Z^{t−1}, c^t) | D^{1:T} ] + Σ_{t=1}^T E[ log p(y^t | Z^t, M^t) | D^{1:T} ].

Therefore, the expected log likelihood depends on these four expectations:

• Σ_{t=1}^T E[ log P(c^t | c^{t−1}) | D^{1:T} ],
• Σ_{t=1}^T E[ log P(M^t | c^t) | D^{1:T} ],
• Σ_{t=1}^T E[ log p(Z^t | Z^{t−1}, c^t) | D^{1:T} ], and
• Σ_{t=1}^T E[ log p(y^t | Z^t, M^t) | D^{1:T} ].

These expectations can be directly calculated (see Appendix A for more details).

4.1.2 The M-step

The parameters of the dLCM model are A, L, Σ, Θ, Φ, J, K. At each iteration, these parameters can be obtained by taking the corresponding partial derivative of the expected log likelihood. The results are as follows (refer to Appendix A for more details):

L, the linear mapping from the latent space to the attribute space: L_{i,m} denotes the i'th row of this matrix when the mixture node is m.

L̂_{i,m} = { Σ_{t=1}^T P(M^t = m | D^{1:T}) E[Z^t (Z^t)′ | M^t = m, D^{1:T}] }^{-1}
          · { Σ_{t=1}^T (y_i^t − Φ_{i,m}) P(M^t = m | D^{1:T}) E[Z^t | M^t = m, D^{1:T}] }

Φ, the offset from the latent space to the attribute space: Φ_{i,m} denotes the i'th row of this matrix when the mixture node is m.

Φ̂_{i,m} = ( 1 / Σ_{t=1}^T P(M^t = m | D^{1:T}) )
          · { Σ_{t=1}^T P(M^t = m | D^{1:T}) y_i^t − Σ_{t=1}^T P(M^t = m | D^{1:T}) L′_{i,m} E[Z^t | M^t = m, D^{1:T}] }

Θ, the covariance matrices of the attribute space: Θ_{i,m} denotes the i'th element on the diagonal of the matrix when the mixture node is m.

Θ̂_{i,m} = ( 1 / Σ_{t=1}^T P(M^t = m | D^{1:T}) )
          · Σ_{t=1}^T { (y_i^t − Φ_{i,m})² P(M^t = m | D^{1:T})
            − 2 (y_i^t − Φ_{i,m}) P(M^t = m | D^{1:T}) L′_{i,m} E[Z^t | M^t = m, D^{1:T}]
            + P(M^t = m | D^{1:T}) L′_{i,m} E[Z^t (Z^t)′ | M^t = m, D^{1:T}] L_{i,m} }

Σ, the covariance matrices of the latent space: Σ_c denotes the diagonal matrix when the class variable is c.

Σ̂_c = (1/α_c) Σ_{t: c^t = c} { E[Z^t (Z^t)′ | D^{1:T}] − 2 A_c E[Z^{t−1} (Z^t)′ | D^{1:T}]
       + A_c E[Z^{t−1} (Z^{t−1})′ | D^{1:T}] A′_c }

A, the linear dynamics within the latent space from one time slice to the next: A_c denotes the matrix when the class variable is c.

Â_c = { Σ_{t: c^t = c} E[Z^t (Z^{t−1})′ | D^{1:T}] } · { Σ_{t: c^t = c} E[Z^{t−1} (Z^{t−1})′ | D^{1:T}] }^{-1}

J, the transition matrix of the class variable, is directly obtained by using frequency estimation on P(c^t | c^{t−1}).

K, the transition matrix of the mixture node, is computed based on P(m^t | c^t = c, D^{1:T}):

P(m^t | c^t = c, D^{1:T}) = Σ_{t: c^t = c} P(M^t = m^t | D^{1:T}) / Σ_{t=1}^T P(M^t = m^t | D^{1:T})
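The Â_c update above is a least-squares regression of Z^t on Z^{t−1} computed from expected sufficient statistics. A minimal sketch: `update_A` implements the formula, and the check feeds it statistics from a noiseless synthetic chain z^t = A z^{t−1} (so the expectations are just outer products), which the update should recover exactly:

```python
import numpy as np

def update_A(cross_stats, prev_stats):
    """A_c = (sum_t E[Z^t (Z^{t-1})']) (sum_t E[Z^{t-1} (Z^{t-1})'])^{-1},
    summing over the time steps with c^t = c."""
    S_cross = sum(cross_stats)  # sum of k x k matrices E[Z^t (Z^{t-1})']
    S_prev = sum(prev_stats)    # sum of k x k matrices E[Z^{t-1} (Z^{t-1})']
    return S_cross @ np.linalg.inv(S_prev)

# Illustrative check on a deterministic chain (A_true is an arbitrary example).
A_true = np.array([[0.9, 0.1],
                   [0.0, 0.8]])
zs = [np.array([1.0, 2.0])]
for _ in range(20):
    zs.append(A_true @ zs[-1])

cross = [np.outer(zs[t], zs[t - 1]) for t in range(1, len(zs))]
prev = [np.outer(zs[t - 1], zs[t - 1]) for t in range(1, len(zs))]
print(np.allclose(update_A(cross, prev), A_true))  # True
```

In the actual EM iteration, the expectations would of course come from the smoothed posteriors of Section 4.2 rather than from observed latent states.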

4.2 Inference

As seen from the dLCM in Fig. 6, an equivalent probabilistic model is

p(y^{1:T}, z^{1:T}, m^{1:T}, c^{1:T}) = p(y^1 | z^1, m^1) p(z^1 | c^1) p(m^1 | c^1) p(c^1)
    · Π_{t=2}^T p(y^t | z^t, m^t) p(z^t | z^{t−1}, c^t) p(m^t | c^t) p(c^t | c^{t−1}).

Following the reasoning of [2], exact filtered and smoothed inference for the dLCM can easily be shown to be intractable (scaling exponentially with T [18]), as neither the class variables nor the mixture variables are observed: At time step 1, p(z^1 | y^1) is a mixture of |sp(C)| · |sp(M)| Gaussians. At time step 2, due to the summation over the classes and mixture variables, p(z^2 | y^{1:2}) will be a mixture of |sp(C)|² · |sp(M)|² Gaussians; at time step 3 it will be a mixture of |sp(C)|³ · |sp(M)|³ Gaussians, and so on, until the generation of a mixture of |sp(C)|^T · |sp(M)|^T Gaussians at time point T. To control this explosion in

computational complexity, we will resort to approximate inference techniques for both classification and learning (i.e., when performing the E-step).

4.2.1 Approximate inference

As the structure of the proposed dLCM is similar to that of the linear dynamical system (LDS) [1], the standard Rauch-Tung-Striebel (RTS) smoother [21] and the expectation correction smoother [2] for the LDS provide the basic ideas for the approximate inference in the dLCM. As in the RTS smoother, the filtered estimate of the dLCM, p(z^t, m^t, c^t | y^{1:t}), is obtained by a forward recursion, followed by a backward recursion to calculate the smoothed estimate p(z^t, m^t, c^t | y^{1:T}). Inference in the dLCM is thus achieved by a single forward recursion and a single backward recursion. A Gaussian collapse is incorporated into both the forward recursion and the backward recursion to form the approximate inference. The Gaussian collapse in the forward recursion is equivalent to assumed density filtering [3], and the Gaussian collapse in the backward recursion mirrors the smoothed posterior collapse from [2]. The details of the forward recursion and the backward recursion are introduced in the next subsections.
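The Gaussian collapse used in both recursions is moment matching: a mixture Σ_i w_i N(µ_i, Σ_i) is replaced by the single Gaussian with the mixture's overall mean and covariance. A minimal sketch (the two-component mixture at the end is a made-up example):

```python
import numpy as np

def collapse(weights, means, covs):
    """Moment-match a Gaussian mixture to a single Gaussian:
    mean = sum_i w_i mu_i
    cov  = sum_i w_i (Sigma_i + (mu_i - mean)(mu_i - mean)')"""
    w = np.asarray(weights, dtype=float) / np.sum(weights)
    mu = sum(wi * mi for wi, mi in zip(w, means))
    cov = sum(wi * (Si + np.outer(mi - mu, mi - mu))
              for wi, Si, mi in zip(w, covs, means))
    return mu, cov

mu, cov = collapse(
    weights=[0.5, 0.5],
    means=[np.array([-1.0, 0.0]), np.array([1.0, 0.0])],
    covs=[np.eye(2), np.eye(2)],
)
print(mu, cov[0, 0])  # mean [0, 0]; first variance 1 + 1 = 2
```

The between-component spread is absorbed into the covariance, which is exactly what keeps the number of components fixed from one recursion step to the next.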

4.2.2 Forward recursion: filtering

During the forward recursion of the dLCM, the filtered posterior p(z^t, m^t, c^t | y^{1:t}) is computed in a recursive form. By a simple decomposition,

p(z^t, m^t, c^t | y^{1:t}) = p(z^t, m^t, c^t, y^t | y^{1:t−1}) / p(y^t | y^{1:t−1}) ∝ p(z^t, m^t, c^t, y^t | y^{1:t−1}).

Dropping the normalization constant p(y^t | y^{1:t−1}), p(z^t, m^t, c^t | y^{1:t}) is proportional to the joint probability p(z^t, m^t, c^t, y^t | y^{1:t−1}), where

p(z^t, m^t, c^t, y^t | y^{1:t−1}) = p(y^t, z^t | m^t, c^t, y^{1:t−1}) p(m^t | c^t, y^{1:t−1}) p(c^t | y^{1:t−1}).   (1)

Closing the forward recursion with a Gaussian collapse. Since all factors in Equation (1) can be obtained from the previously filtered results (see Appendix B for more details), the forward recursion is achieved. When building the forward recursion, the term p(z^{t−1}, m^{t−1}, c^{t−1} | y^{1:t−1}) is required. Although p(z^{t−1} | y^{1:t−1}) is the filtered probability of the previous time slice and can be directly obtained, its number of mixture components increases exponentially over time. The same situation also applies to p(z^{t−1} | m^{t−1}, c^{t−1}, y^{1:t−1}). To deal with this, in this

paper, we collapse p(z^{t−1} | m^{t−1}, c^{t−1}, y^{1:t−1}) to a single Gaussian, parameterized with mean ν_{m^{t−1}, c^{t−1}} and covariance Γ_{m^{t−1}, c^{t−1}}, and then propagate this collapsed Gaussian to the next time slice. The approximated Gaussian is denoted p̃(z^{t−1} | m^{t−1}, c^{t−1}, y^{1:t−1}). For notational convenience we drop the tildes for all approximated quantities in the remainder of this article. After the propagation and updating at time t, the approximated p(z^t | m^t, c^t, y^{1:t}) will be a Gaussian mixture with a fixed number of |sp(C)| · |sp(M)| components. We then collapse this new approximated Gaussian mixture to a single Gaussian and propagate it to the next time slice. This guarantees that at any time slice, the number of mixture components of the approximated Gaussian mixture is fixed. At the last step of the forward recursion, the filtered result is p(z^T, m^T, c^T | y^{1:T}) at time T.

4.2.3 Backward recursion: smoothing

Similar to the forward pass, the backward pass also relies on a recursive computation of the smoothed posterior p(z^t, m^t, c^t | y^{1:T}). In detail, we compute p(z^t, m^t, c^t | y^{1:T}) from the smoothed result of the previous step, p(z^{t+1}, m^{t+1}, c^{t+1} | y^{1:T}), together with some other quantities obtained from the forward pass. The first smoothed posterior is p(z^T, m^T, c^T | y^{1:T}), which can be directly obtained as it is also the last filtered posterior from the forward pass. To compute p(z^t, m^t, c^t | y^{1:T}), we factorize it as p(z^t, m^t, c^t | y^{1:T}) =

Σ_{m^{t+1}, c^{t+1}} p(z^t, m^t, c^t, m^{t+1}, c^{t+1} | y^{1:T})
  = Σ_{m^{t+1}, c^{t+1}} p(z^t | m^t, c^t, m^{t+1}, c^{t+1}, y^{1:T})
      · p(m^t, c^t | m^{t+1}, c^{t+1}, y^{1:T}) p(m^{t+1}, c^{t+1} | y^{1:T}).   (2)

With this factorized form we can achieve the recursion for the backward pass, since all the quantities can be computed either from previous smoothed results or from the forward recursion. There are two main approximations in the backward recursion. Following [2], we note that although {m^t, c^t} ⊥̸⊥ z^{t+1} | y^{1:T}, the influence of {m^t, c^t} on z^{t+1} through z^t is “weak”, as z^t will be mostly influenced by y^{1:t}. We shall therefore approximate p(z^{t+1} | m^t, c^t, m^{t+1}, c^{t+1}, y^{1:T}) by p(z^{t+1} | m^{t+1}, c^{t+1}, y^{1:T}). The benefit of this simple assumption is that p(z^{t+1} | m^{t+1}, c^{t+1}, y^{1:T}) can be directly obtained from the previous backward recursion. Note that p(z^{t+1} | m^{t+1}, c^{t+1}, y^{1:T}) is a Gaussian mixture whose number of components increases exponentially in T − t. The second approximation is therefore that we collapse p(z^{t+1} | m^{t+1}, c^{t+1}, y^{1:T}) to a single Gaussian and pass this collapsed Gaussian to the next step. The approximated p(z^{t+1} | m^{t+1}, c^{t+1}, y^{1:T}) at the next time step will then be a Gaussian mixture with a fixed number of |sp(C)| · |sp(M)| components. We then collapse this new approximated Gaussian mixture to a single

Gaussian and propagate it to next time step. This process will also guarantee that at any time step, the mixture components of approximated Gaussian mixture is fixed. (see Apendix.B for more details)

Closing the backward recursion with Gaussian collapse   All the factors in Equation (2) are now computed in a recursive form based on the smoothed results from the previous backward recursion, which also completes the approximate inference for the dLCM.

5  Experimental Results

The dynamic latent classification model should prove its computational efficiency and representational power when modeling non-linear real-world time series. To illustrate this point, we examine time-series data from an oil drilling platform in the North Sea; this is a natural choice, since the dLCM was initially motivated by the same data. The data come from on-rig sensors describing the physical attributes of the drilling system during drilling, and the sensor readings are recorded every 5 seconds. The raw sensor data contain more than 60 physical attributes, but only some of these attributes are considered important. In the end, 9 attributes were chosen for the classification of oil drilling activities; they are shown in Fig. 7. These attributes are highly non-linear and were chosen based on the analysis of experienced field experts.

Fig. 7. The selected on-rig sensor data during oil drilling; e.g., HKL (hookload) measures the load on the hook in tons.


In total, there are 5 oil drilling activities in the dataset used for the classification task: drilling, connection, tripping in, tripping out, and other, referring to different actions during the drilling process. For the classification of these activities, we use a two-step hierarchical classification process. In the first hierarchical step, the drilling and connection activities are combined into one activity called drilling-connection, and tripping in and tripping out are similarly combined into tripping in/out. The dataset after this combination is called the modified dataset. The reason for combining is that the combined activities are physically close and have quite similar dynamics. A dLCM is then trained to classify these two combined activities. In the second hierarchical step, starting from the classification results of the first step, a dLCM is trained for each of the two combined activities: one dLCM aims to classify the activities drilling and connection, and another is trained to classify the tripping in and tripping out activities. All of these dLCMs are trained separately, but their training process is identical and is described as follows. The learning of a dLCM consists of two parts: the scoring scheme for evaluating the quality of a model, and the search strategy for exploring the space of dLCMs. First, the wrapper approach [15] is adopted to score a model based on its classification accuracy: a model is trained on a training dataset, and its score is the classification accuracy obtained by applying the model to a separate validation dataset. Following [16], the search strategy is characterized as: (i) a systematic approach for selecting values for |sp(M)|, |sp(Z)|, and |sp(Y)| and, given such a set of values, (ii) learning the edge-set and the parameters of the model. The model with the highest score on the validation dataset is chosen as the learnt dLCM.
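The wrapper-based selection loop described above can be sketched in a few lines. In this sketch, `train_model` and `accuracy` are hypothetical stand-ins for the dLCM training routine and the validation-accuracy scorer, and the configurations would enumerate candidate values of |sp(M)| and |sp(Z)|:

```python
# A minimal sketch of wrapper-based model selection: each candidate
# configuration is trained on the training set and scored by its
# classification accuracy on a separate validation set.
def wrapper_select(configs, train_data, valid_data, train_model, accuracy):
    best_model, best_score = None, float("-inf")
    for cfg in configs:
        model = train_model(cfg, train_data)   # learn edge-set and parameters
        score = accuracy(model, valid_data)    # wrapper score: validation accuracy
        if score > best_score:
            best_model, best_score = model, score
    return best_model, best_score

# Toy illustration of the mechanics only: the "model" is the configuration
# itself, and the score peaks at configuration 2.
best, score = wrapper_select(
    [1, 2, 3], None, None,
    train_model=lambda cfg, data: cfg,
    accuracy=lambda model, data: 1.0 - abs(model - 2) / 2.0,
)
```

The toy call at the end only illustrates the control flow; in the paper's setting the score would come from applying the trained dLCM to the validation dataset.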
Based on this learnt dLCM, the classification task is conducted on a separate test dataset. In our experiment, the training, validation, and test datasets used for the train-validate-test process contain 90000, 80000, and 50000 time slices, respectively. The termination condition for learning is either meeting a convergence criterion, measured by the change in likelihood over consecutive EM steps, or reaching 50 iterations. To illustrate the classification performance of the dLCM, we compare against naive Bayes (NB), an HMM, the dLCM without mixture components (equivalent to |sp(M)| = 1), a decision tree (J48), and local-global learning (LGL) [25]. The first three of these are the intermediate models used to motivate the dLCM. The classification results of the first hierarchical step for the drilling-connection and tripping in/out activities are shown in Table 2, and Fig. 8 shows the details of the classification results using the dLCM. After obtaining the classification results for drilling-connection and tripping

Table 2
Experimental results on classification accuracy from the first hierarchical step

    Model                   Classification accuracy
    NB                      84.77
    HMM                     84.80
    J48                     96.83
    LGL                     -
    dLCM (|sp(M)| = 1)      97.14
    dLCM                    98.72

[Figure: two panels plotting Activity against Time step; top: the combined activities of the test dataset; bottom: the classified results using dLCM for the combined activities.]

Fig. 8. The details of the classification results for the drilling-connection activity (1) and the tripping in/out activity (2) using dLCM.

in/out activities, we separate the original unmodified training dataset into subsets based on the activity-change points (between drilling-connection and tripping in/out) identified in the first-step results. We then have one set of training data for the drilling-connection activity and another set of training data for the tripping in/out activity. In the training data for the drilling-connection activity, the exact activities include drilling, connection, and other; in the training data for the tripping in/out activity, the exact activities include both tripping in and tripping out. The same separation also applies to the validation and test datasets. In the second hierarchical step of the classification, we trained one dLCM for the set containing the drilling, connection, and other activities, and another dLCM for

the tripping in/out activity. The training processes and the parameter settings are exactly the same as in the first hierarchical step. The classification results for the two subsets in the second step are shown in Table 3:

Table 3
Experimental results on classification accuracy from the second hierarchical step

    Model                   tripping in/out    drilling-connection
    NB                      60.72              67.64
    HMM                     59.96              76.23
    J48                     57.88              69.56
    LGL                     -                  -
    dLCM (|sp(M)| = 1)      75.57              76.66
    dLCM                    75.57              85.47

The overall result on the full dataset, obtained by combining the two classification steps, is shown in Table 4:

Table 4
The overall experimental results on classification accuracy

    Model                   Full dataset
    NB                      58.53
    HMM                     61.69
    J48                     66.75
    LGL                     -
    dLCM (|sp(M)| = 1)      76.29
    dLCM                    82.09

The experimental results show that the dLCM achieves significantly better classification accuracy than static classifiers such as NB, J48, and LGL. The results also show that the dLCM is more effective than all three intermediate models used on the way to motivating the dLCM: the HMM is more effective than NB in the classification results of Table 2 (drilling-connection classification), and the dLCM with |sp(M)| = 1 performs better than the HMM in Table 1, Table 2 (tripping in/out classification), and Table 3. In general, these results show that the dLCM is very effective for activity detection in our oil drilling data. The details of the full-dataset classification using the dLCM are shown in Fig. 9.

[Figure: two panels plotting Activity against Time step; top: the original activities of the test dataset; bottom: the classified results using dLCM for the original activities.]

Fig. 9. The details of the classification results on the full dataset using the dLCM.

6  Conclusions

In this paper we have described a new family of classification models specifically designed for the analysis of streaming data. Dynamic classification systems are of great importance for reliability engineers, with obvious applications in, e.g., monitoring systems. We exemplified the use of the classifier by looking at online activity recognition, and showed that our classification model significantly outperforms the standard (non-dynamic) classifiers. Our dynamic latent classification model (dLCM) is a generalization of the latent classification model [16] to dynamic domains, but at the same time it is closely related to switching state-space models [12]. This is a natural choice when each class has multiple approximately linear sub-regimes of dynamics, but it forced us to consider an approximate inference technique inspired by [2]. The approximations are shown to be effective and efficient, which makes the learning of dLCMs tractable. We want to investigate the appropriateness of the approximate inference scheme further by thoroughly comparing our approach to traditional sampling techniques (e.g., [10]). Finally, we plan to use the classification model for event detection, i.e., to foresee – and therefore potentially help the operators prevent – undesired situations. This is a challenging task, because traditional probabilistic classification based on Equation (1) requires the specification of a meaningful loss function. In the dynamic setting, one would want to encode statements like

“Detecting an event before a minute has passed is worth $1, but detection after 90 seconds is useless”, which requires a richer representation than a (static) loss function. Furthermore, event detection is an extremely unbalanced classification problem [14], where the marginal probability of a given event can be of the order of 10^{-5} or even less. The appropriateness of maximum-likelihood based learning can thus be questioned, and we plan to devise a semi-discriminative strategy in the spirit of [25] to this end.

Acknowledgments

Ana helped us out in the beginning. An early version of this paper was presented at FLAIRS. The people at Verdande have also helped with preparing and understanding the data.

References

[1] Yaakov Bar-Shalom and Xiao-Rong Li. Estimation and Tracking: Principles, Techniques, and Software. Artech House Publishers, 1993.

[2] David Barber. Expectation correction for smoothed inference in switching linear dynamical systems. Journal of Machine Learning Research, 7:2515–2540, 2006.

[3] Xavier Boyen and Daphne Koller. Approximate learning of dynamic models. In Advances in Neural Information Processing Systems 12, pages 396–402, 1999.

[4] Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.

[5] Richard O. Duda and Peter E. Hart. Pattern Classification and Scene Analysis. John Wiley & Sons, New York, 1973.

[6] Brian S. Everitt. An Introduction to Latent Variable Models. Chapman & Hall, London, 1984.

[7] Nir Friedman, Dan Geiger, and Moises Goldszmidt. Bayesian network classifiers. Machine Learning, 29(2–3):131–163, 1997.

[8] Nir Friedman, Moises Goldszmidt, and Thomas J. Lee. Bayesian network classification with continuous attributes: Getting the best of both discretization and parametric fitting. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 179–187, San Francisco, CA, 1998. Morgan Kaufmann Publishers.


[9] João Gama. A linear-Bayes classifier. In Proceedings of the 7th Ibero-American Conference on AI: Advances in Artificial Intelligence, volume 1952 of Lecture Notes in Computer Science, pages 269–279. Springer-Verlag, Berlin, Germany, 2000.

[10] Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721–741, 1984.

[11] Zoubin Ghahramani. Learning dynamic Bayesian networks. In Adaptive Processing of Sequences and Data Structures, International Summer School on Neural Networks, "E.R. Caianiello"-Tutorial Lectures, pages 168–197, London, UK, 1998. Springer-Verlag.

[12] Zoubin Ghahramani and Geoffrey E. Hinton. Variational learning for switching state-space models. Neural Computation, 12:963–996, 1998.

[13] Y. He and A. Kundu. 2-D shape classification using hidden Markov model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(11):1172–1184, 1991.

[14] Nathalie Japkowicz and Shaju Stephen. The class imbalance problem: A systematic study. Intelligent Data Analysis, 6(5):429–449, 2002.

[15] Ron Kohavi and George H. John. Wrappers for feature subset selection. Artificial Intelligence, 97(1–2):273–324, 1997.

[16] Helge Langseth and Thomas D. Nielsen. Latent classification models. Machine Learning, 59(3):237–265, 2005.

[17] Gal Lavee, Ehud Rivlin, and Michael Rudzsky. Understanding video events: a survey of methods for automatic interpretation of semantic occurrences in video. IEEE Transactions on Systems, Man, and Cybernetics, Part C, 39(5):489–504, 2009.

[18] Uri Lerner. Hybrid Bayesian networks for reasoning about complex systems. PhD thesis, Department of Computer Science, Stanford University, Stanford, 2002.

[19] Uri Lerner and Ronald Parr. Inference in hybrid networks: Theoretical limits and practical algorithms. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence (UAI), pages 310–318, 2001.
[20] Miriam Martínez and Luis Enrique Sucar. Learning dynamic naive Bayesian classifiers. In Proceedings of the Twenty-First International Florida Artificial Intelligence Research Symposium Conference, pages 655–659, 2008.

[21] H.E. Rauch, F. Tung, and C.T. Striebel. Maximum likelihood estimates of linear dynamic systems. AIAA Journal, 3:1445–1450, 1965.

[22] Padhraic Smyth. Hidden Markov models for fault detection in dynamic systems. Pattern Recognition, 27(1):149–164, 1994.


[23] Ryoji Suzuki, Mitsuhiro Ogawa, Sakuto Otake, Takeshi Izutsu, Yoshiko Tobimatsu, Shin-Ichi Izumi, and Tsutomu Iwaya. Analysis of activities of daily living in elderly people living alone: single-subject feasibility study. Telemedicine Journal and E-Health, 10(2):260–276, 2004.

[24] G. I. Webb, J. R. Boughton, and Z. Wang. Not so naïve Bayes: Aggregating one-dependence estimators. Machine Learning, 58(1):5–24, 2005.

[25] Shengtong Zhong and Helge Langseth. Local-global-learning of naive Bayesian classifier. In Proceedings of the 2009 Fourth International Conference on Innovative Computing, Information and Control, pages 278–281, Washington, DC, USA, 2009. IEEE Computer Society.

Appendix A. Learning

A.1  The E-step

The E-step of the EM algorithm computes the expected log-likelihood of the complete data, given the observations:

    Q = E[ log ∏_{t=1}^{T} f(·^t) | D^{1:T} ] = ∑_{t=1}^{T} E[ log f(·^t) | D^{1:T} ]
      = ∑_{t=1}^{T} E[ log P(c^t | c^{t-1}) | D^{1:T} ] + ∑_{t=1}^{T} E[ log P(M^t | c^t) | D^{1:T} ]
        + ∑_{t=1}^{T} E[ log p(Z^t | Z^{t-1}, c^t) | D^{1:T} ] + ∑_{t=1}^{T} E[ log p(y^t | Z^t, M^t) | D^{1:T} ].

Therefore, the expected log-likelihood depends on these four expectations:

• ∑_{t=1}^{T} E[ log P(c^t | c^{t-1}) | D^{1:T} ],
• ∑_{t=1}^{T} E[ log P(M^t | c^t) | D^{1:T} ],
• ∑_{t=1}^{T} E[ log p(Z^t | Z^{t-1}, c^t) | D^{1:T} ], and
• ∑_{t=1}^{T} E[ log p(y^t | Z^t, M^t) | D^{1:T} ].

The respective probability quantities take the following forms:

    P(c^t | c^{t-1}) ∼ J
    P(m^t | c^t) ∼ K
    p(z^t | z^{t-1}, c^t) ∼ N(A_{c^t} z^{t-1}, Σ_{c^t})
      = (2π)^{-k/2} |Σ_{c^t}|^{-1/2} exp( −(1/2) (z^t − A_{c^t} z^{t-1})' Σ^{-1}_{c^t} (z^t − A_{c^t} z^{t-1}) )
    p(y^t | z^t, m^t) ∼ N(L_{m^t} z^t + Φ_{m^t}, Θ_{m^t})
      = (2π)^{-n/2} |Θ_{m^t}|^{-1/2} exp( −(1/2) (y^t − L_{m^t} z^t − Φ_{m^t})' Θ^{-1}_{m^t} (y^t − L_{m^t} z^t − Φ_{m^t}) )
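As a concrete illustration, with the diagonal Θ used here, the observation density p(y^t | z^t, m^t) reduces to a sum of univariate Gaussian log-densities. The sketch below assumes the mean vector L_{m^t} z^t + Φ_{m^t} has already been computed and is passed in directly:

```python
import math

# Log-density of a Gaussian with diagonal covariance, as used for
# p(y^t | z^t, m^t); `mean` stands for the pre-computed vector
# L_{m^t} z^t + Phi_{m^t}, and `variances` for the diagonal of Theta_{m^t}.
def log_gauss_diag(y, mean, variances):
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (yi - mi) ** 2 / v)
        for yi, mi, v in zip(y, mean, variances)
    )

val = log_gauss_diag([0.0], [0.0], [1.0])
# a standard normal evaluated at its mean: -0.5 * log(2*pi)
```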

A.2  The M-step

The parameters of the dLCM are A, L, Σ, Θ, Φ, J, and K. At each iteration, these parameters are obtained by setting the corresponding partial derivative of the expected log-likelihood to zero. The results are as follows.

L, the linear mapping from the latent space to the attribute space, where L_{i,m^t} denotes the i'th row of the matrix when the mixture node is m^t:

    ∂Q/∂L_{i,m^t} = Θ^{-1}_{i,m^t} ∑_{t=1}^{T} { P(M^t = m^t | D^{1:T}) E[Z^t (Z^t)' | M^t = m^t, D^{1:T}] L_{i,m^t}
                      − (y_i^t − Φ_{i,m^t}) P(M^t = m^t | D^{1:T}) E[Z^t | M^t = m^t, D^{1:T}] }

    L̂_{i,m^t} = { ∑_{t=1}^{T} P(M^t = m^t | D^{1:T}) E[Z^t (Z^t)' | M^t = m^t, D^{1:T}] }^{-1}
                  · ∑_{t=1}^{T} (y_i^t − Φ_{i,m^t}) P(M^t = m^t | D^{1:T}) E[Z^t | M^t = m^t, D^{1:T}]

Φ, the offset from the latent space to the attribute space, where Φ_{i,m^t} denotes the i'th element when the mixture node is m^t:

    ∂Q/∂Φ_{i,m^t} = ∑_{t=1}^{T} { 2 P(M^t = m^t | D^{1:T}) (y_i^t − Φ_{i,m^t})
                      − 2 P(M^t = m^t | D^{1:T}) L'_{i,m^t} E[Z^t | M^t = m^t, D^{1:T}] }

    Φ̂_{i,m^t} = ( ∑_{t=1}^{T} P(M^t = m^t | D^{1:T}) y_i^t
                   − ∑_{t=1}^{T} P(M^t = m^t | D^{1:T}) L'_{i,m^t} E[Z^t | M^t = m^t, D^{1:T}] )
                 / ∑_{t=1}^{T} P(M^t = m^t | D^{1:T})

Θ, the covariance matrix of the attribute space, where Θ_{i,m^t} denotes the i'th diagonal element of the matrix when the mixture node is m^t:

    ∂Q/∂Θ_{i,m^t} = −(1/2) Θ^{-1}_{i,m^t} ∑_{t=1}^{T} P(M^t = m^t | D^{1:T})
                     + (1/2) Θ^{-2}_{i,m^t} ∑_{t=1}^{T} { (y_i^t − Φ_{i,m^t})² P(M^t = m^t | D^{1:T})
                       − 2 (y_i^t − Φ_{i,m^t}) P(M^t = m^t | D^{1:T}) L'_{i,m^t} E[Z^t | M^t = m^t, D^{1:T}]
                       + P(M^t = m^t | D^{1:T}) L'_{i,m^t} E[Z^t (Z^t)' | M^t = m^t, D^{1:T}] L_{i,m^t} }

    Θ̂_{i,m^t} = ( 1 / ∑_{t=1}^{T} P(M^t = m^t | D^{1:T}) ) ∑_{t=1}^{T} { (y_i^t − Φ_{i,m^t})² P(M^t = m^t | D^{1:T})
                  − 2 (y_i^t − Φ_{i,m^t}) P(M^t = m^t | D^{1:T}) L'_{i,m^t} E[Z^t | M^t = m^t, D^{1:T}]
                  + P(M^t = m^t | D^{1:T}) L'_{i,m^t} E[Z^t (Z^t)' | M^t = m^t, D^{1:T}] L_{i,m^t} }

Σ, the covariance matrix of the latent space, where Σ_c denotes the diagonal matrix when the class variable is c and α_c = |{t : c^t = c}|:

    ∂Q/∂Σ_c = −(α_c/2) Σ^{-1}_c + (1/2) Σ^{-1}_c ∑_{t: c^t = c} { E[Z^t (Z^t)' | D^{1:T}]
                − 2 A_c E[Z^{t-1} (Z^t)' | D^{1:T}] + A_c E[Z^{t-1} (Z^{t-1})' | D^{1:T}] A'_c } Σ^{-1}_c

    Σ̂_c = (1/α_c) ∑_{t: c^t = c} { E[Z^t (Z^t)' | D^{1:T}] − 2 A_c E[Z^{t-1} (Z^t)' | D^{1:T}]
            + A_c E[Z^{t-1} (Z^{t-1})' | D^{1:T}] A'_c }

A, the linear dynamics within the latent space from one time slice to the next, where A_c denotes the matrix when the class variable is c:

    ∂Q/∂A_c = −2 ∑_{t: c^t = c} Σ^{-1}_c E[Z^t (Z^{t-1})' | D^{1:T}]
               + 2 ∑_{t: c^t = c} Σ^{-1}_c A_c E[Z^{t-1} (Z^{t-1})' | D^{1:T}]

    Â_c = { ∑_{t: c^t = c} E[Z^t (Z^{t-1})' | D^{1:T}] } { ∑_{t: c^t = c} E[Z^{t-1} (Z^{t-1})' | D^{1:T}] }^{-1}

J, the transition matrix of the class variable, is obtained directly by frequency estimation on P(c^t | c^{t-1}).

K, the transition matrix of the mixture node, is computed based on P(m^t | c^t = c, D^{1:T}):

    P(m^t | c^t = c, D^{1:T}) = ∑_{t: c^t = c} P(M^t = m^t | D^{1:T}) / ∑_{t: c^t = c} ∑_{m} P(M^t = m | D^{1:T}).
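With a one-dimensional latent space, the update for A_c above reduces to a ratio of summed sufficient statistics. The sketch below uses hypothetical toy values for the smoothed expectations E[Z^t Z^{t-1} | D^{1:T}] and E[Z^{t-1} Z^{t-1} | D^{1:T}]:

```python
# Scalar-latent sketch of the M-step update for A_c: the matrix product
# with an inverse collapses to a division of two sums of statistics.
def update_A(cross_stats, prev_stats):
    """A_c = (sum_t E[Z^t Z^{t-1}]) / (sum_t E[Z^{t-1} Z^{t-1}])."""
    return sum(cross_stats) / sum(prev_stats)

A_hat = update_A([1.8, 2.2, 2.0], [1.0, 1.0, 1.0])
```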

Appendix B. Inference

As seen from the dLCM in Fig. 6, an equivalent probabilistic model is

    p(y^{1:T}, z^{1:T}, m^{1:T}, c^{1:T}) = p(y^1 | z^1, m^1) p(z^1 | c^1) p(m^1 | c^1) p(c^1)
        · ∏_{t=2}^{T} p(y^t | z^t, m^t) p(z^t | z^{t-1}, c^t) p(m^t | c^t) p(c^t | c^{t-1}).        (B.1)

Following the reasoning of [2], exact filtered and smoothed inference of dLCM can easily be shown to be intractable (scaling exponentially with T [18]) as neither the class variables nor the mixture variables are observed: At time step 1, p(z 1 |y 1 ) is a mixture of |sp (C)||sp (M)| Gaussians. At time-step 2, due to the summation over the classes and mixture variables, p(z 2 |y 1:2 ) will be a mixture of |sp (C)|2 |sp (M)|2 Gaussians; at time-step 3 it will be a mixture of |sp (C)|3 |sp (M)|3 Gaussians and so on until the generation of a mixture of |sp (C)|T |sp (M)|T Gaussians at time-point T . To control this explosion in computational complexity, we will resort to approximate inference techniques for both classification and learning (i.e., when performing the E-step).
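The explosion described above is easy to make concrete. With the assumed sizes |sp(C)| = 2 and |sp(M)| = 3 (illustrative values only), the exact filtered posterior already has 6^10 mixture components after ten time steps, while the collapsed approximation keeps a constant 6:

```python
# Number of mixture components in the exact filtered posterior vs. the
# collapsed approximation, for assumed sizes |sp(C)| = 2 and |sp(M)| = 3.
C, M, T = 2, 3, 10
exact = [(C * M) ** t for t in range(1, T + 1)]   # |sp(C)|^t * |sp(M)|^t
collapsed = [C * M] * T                           # fixed after each collapse
# exact[-1] == 6**10 == 60466176, while collapsed[-1] == 6
```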

B.1 Approximate inference

As the structure of the proposed dLCM is similar to that of the linear dynamical system (LDS) [1], the standard Rauch-Tung-Striebel (RTS) smoother [21] and the expectation correction smoother [2] for the LDS provide the basic ideas for the approximate inference in the dLCM. As in the RTS smoother, the filtered estimate p(z^t, m^t, c^t | y^{1:t}) of the dLCM is obtained by a forward recursion, followed by a backward recursion to calculate the smoothed estimate p(z^t, m^t, c^t | y^{1:T}). Inference in the dLCM is thus achieved by a single forward recursion and a single backward recursion. Gaussian collapse is incorporated into both recursions to form the approximate inference: the Gaussian collapse in the forward recursion is equivalent to assumed density filtering [3], and the Gaussian collapse in the backward recursion mirrors the smoothed posterior collapse from [2]. The details of the forward and backward recursions are introduced next.

B.2 Forward recursion: filtering

During the forward recursion of the dLCM, the filtered posterior p(z^t, m^t, c^t | y^{1:t}) is computed in a recursive form. By a simple decomposition, p(z^t, m^t, c^t | y^{1:t}) = p(z^t, m^t, c^t, y^t | y^{1:t-1}) / p(y^t | y^{1:t-1}) ∝ p(z^t, m^t, c^t, y^t | y^{1:t-1}). Dropping the normalization constant p(y^t | y^{1:t-1}), p(z^t, m^t, c^t | y^{1:t}) is proportional to the joint probability

    p(z^t, m^t, c^t, y^t | y^{1:t-1}) = p(y^t, z^t | m^t, c^t, y^{1:t-1}) p(m^t | c^t, y^{1:t-1}) p(c^t | y^{1:t-1}).        (B.2)

We are going to find a recursive form for each of the factors in Equation B.2, given the filtered results of the previous time step.

Evaluating p(c^t | y^{1:t-1}):  p(c^t | y^{1:t-1}) can be factorized as ∑_{c^{t-1} ∈ sp(C)} p(c^t | c^{t-1}) p(c^{t-1} | y^{1:t-1}), where p(c^t | c^{t-1}) is the known transition probability of the class variable and p(c^{t-1} | y^{1:t-1}) can be obtained by marginalizing the previously filtered results,

    p(c^{t-1} | y^{1:t-1}) = ∑_{m^{t-1}} p(m^{t-1}, c^{t-1} | y^{1:t-1})
      = ∑_{m^{t-1}} ∫_{z^{t-1}} p(z^{t-1}, m^{t-1}, c^{t-1} | y^{1:t-1}) dz^{t-1}.

p(z^{t-1}, m^{t-1}, c^{t-1} | y^{1:t-1}) can be obtained directly, since it is the filtered probability of the previous time slice. However, we have already pointed out that the number of mixture components of p(z^{t-1} | y^{1:t-1}) increases exponentially over time, and the same holds for p(z^{t-1} | m^{t-1}, c^{t-1}, y^{1:t-1}). To deal with this, we collapse p(z^{t-1} | m^{t-1}, c^{t-1}, y^{1:t-1}) to a single Gaussian, parameterized with mean ν_{m^{t-1},c^{t-1}} and covariance Γ_{m^{t-1},c^{t-1}}, and then propagate this collapsed Gaussian to the next time slice. The approximating Gaussian is denoted p̃(z^{t-1} | m^{t-1}, c^{t-1}, y^{1:t-1}); for notational convenience, we drop the tildes on all approximated quantities in the remainder of this article. After the propagation and updating at time t, the approximated p(z^t | m^t, c^t, y^{1:t}) is a Gaussian mixture with a fixed number of |sp(C)| · |sp(M)| components, which we again collapse to a single Gaussian before propagating it to the next time slice. This guarantees that at every time slice the number of components in the approximating Gaussian mixture stays fixed. At the last step of the forward recursion, the filtered result at time T is p(z^T, m^T, c^T | y^{1:T}).
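The collapse step can be illustrated by moment matching in one dimension (the paper's collapse is multivariate, where the same idea applies with outer products in the covariance term; this 1-D version is only a sketch):

```python
# Moment-matching collapse of a 1-D Gaussian mixture to a single Gaussian:
# the collapsed mean is the weighted mean of the component means, and the
# collapsed variance is E[Z^2] - (E[Z])^2, where
# E[Z^2] = sum_i w_i * (var_i + mean_i**2).
def collapse(weights, means, variances):
    total = sum(weights)
    w = [wi / total for wi in weights]              # normalize mixture weights
    mu = sum(wi * mi for wi, mi in zip(w, means))
    var = sum(wi * (vi + mi * mi) for wi, vi, mi in zip(w, variances, means)) - mu * mu
    return mu, var

mu, var = collapse([0.5, 0.5], [-1.0, 1.0], [0.25, 0.25])
# two well-separated components: mean 0.0, variance 0.25 + 1.0 = 1.25
```

Note how the separation of the component means inflates the collapsed variance beyond the within-component variances, which is exactly what moment matching is meant to preserve.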

Evaluating p(m^t | c^t, y^{1:t-1}):  As m^t ⊥⊥ y^{1:t-1} | c^t, p(m^t | c^t, y^{1:t-1}) is equal to p(m^t | c^t), which is a known model parameter.

Evaluating p(y^t, z^t | m^t, c^t, y^{1:t-1}):

    p(y^t, z^t | m^t, c^t, y^{1:t-1}) = ∑_{m^{t-1}, c^{t-1}} p(y^t, z^t | m^t, c^t, m^{t-1}, c^{t-1}, y^{1:t-1}) p(m^{t-1}, c^{t-1} | m^t, c^t, y^{1:t-1}).

The joint (y^t, z^t | m^t, c^t, m^{t-1}, c^{t-1}, y^{1:t-1}) is a Gaussian whose mean and covariance elements can be obtained by integrating the forward dynamics over z^{t-1} using Appendix C:

    μ_{z^t} = A_{c^t} ν_{m^{t-1},c^{t-1}},        μ_{y^t} = L_{m^t} μ_{z^t},
    Σ_{z^t z^t} = A_{c^t} Γ_{m^{t-1},c^{t-1}} A'_{c^t} + Σ_{c^t},
    Σ_{y^t y^t} = L_{m^t} Σ_{z^t z^t} L'_{m^t} + Θ_{m^t},
    Σ_{y^t z^t} = L_{m^t} Σ_{z^t z^t}.

Up to a trivial normalization constant, the remaining factor p(m^{t-1}, c^{t-1} | m^t, c^t, y^{1:t-1}) can be further decomposed as

    p(m^{t-1}, c^{t-1} | m^t, c^t, y^{1:t-1}) ∝ p(m^t, c^t | m^{t-1}, c^{t-1}, y^{1:t-1}) p(m^{t-1}, c^{t-1} | y^{1:t-1}).        (B.3)

In Equation B.3, the last factor p(m^{t-1}, c^{t-1} | y^{1:t-1}) is given from the previous time step, and the factor p(m^t, c^t | m^{t-1}, c^{t-1}, y^{1:t-1}) is found from

    p(m^t, c^t | m^{t-1}, c^{t-1}, y^{1:t-1})
      = ∫_{z^{t-1}} p(m^t, c^t | z^{t-1}, m^{t-1}, c^{t-1}, y^{1:t-1}) p(z^{t-1} | m^{t-1}, c^{t-1}, y^{1:t-1}) dz^{t-1}.        (B.4)

As both p(m^t, c^t | z^{t-1}, m^{t-1}, c^{t-1}, y^{1:t-1}) and p(z^{t-1} | m^{t-1}, c^{t-1}, y^{1:t-1}) can be computed directly from the previously filtered results, any numerical integration method can be adopted to solve Equation B.4, e.g., a mean-field approximation or Monte Carlo integration.
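As one option for the numerical integration in Equation B.4, a simple Monte Carlo estimate draws samples from the collapsed Gaussian and averages the integrand. The sketch below is one-dimensional and uses a generic integrand f in place of the paper's p(m^t, c^t | z^{t-1}, m^{t-1}, c^{t-1}, y^{1:t-1}):

```python
import random

# Monte Carlo estimate of E[f(Z)] with Z ~ N(mu, sigma^2); in Equation B.4,
# f would play the role of the conditional probability of (m^t, c^t) given
# the sampled latent value.
def mc_expectation(f, mu, sigma, n=100_000, seed=0):
    rng = random.Random(seed)
    return sum(f(rng.gauss(mu, sigma)) for _ in range(n)) / n

est = mc_expectation(lambda z: z * z, 0.0, 1.0)  # E[Z^2] under N(0, 1) is 1
```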

Closing the forward recursion   The forward recursion is complete, since all the factors in Equation B.2 are obtained from the previously filtered results. At the last step of the forward recursion, the filtered result at time T is p(z^T | m^T, c^T, y^{1:T}).

B.3 Backward recursion: smoothing

Similar to the forward pass, the backward pass relies on a recursive computation of the smoothed posterior p(z^t, m^t, c^t | y^{1:T}). In detail, we compute p(z^t, m^t, c^t | y^{1:T}) from the smoothed result of the previous step, p(z^{t+1}, m^{t+1}, c^{t+1} | y^{1:T}), together with some other quantities obtained from the forward pass. The first smoothed posterior, p(z^T, m^T, c^T | y^{1:T}), is directly available since it is also the last filtered posterior from the forward pass. To compute p(z^t, m^t, c^t | y^{1:T}), we factorize it as

    p(z^t, m^t, c^t | y^{1:T}) = ∑_{m^{t+1}, c^{t+1}} p(z^t, m^t, c^t, m^{t+1}, c^{t+1} | y^{1:T})        (B.5)
      = ∑_{m^{t+1}, c^{t+1}} p(z^t | m^t, c^t, m^{t+1}, c^{t+1}, y^{1:T}) ·

        p(m^t, c^t | m^{t+1}, c^{t+1}, y^{1:T}) p(m^{t+1}, c^{t+1} | y^{1:T}).

With the factorized form, we can achieve the recursion for the backward pass.

Evaluating p(z^t | m^t, c^t, m^{t+1}, c^{t+1}, y^{1:T}):  Due to the fact that z^t ⊥⊥ {y^{t+1:T}, m^{t+1}, c^{t+1}} | {z^{t+1}, m^t, c^t}, the term p(z^t | m^t, c^t, m^{t+1}, c^{t+1}, y^{1:T}) can be found from

    p(z^t | m^t, c^t, m^{t+1}, c^{t+1}, y^{1:T})
      = ∫_{z^{t+1}} p(z^t | z^{t+1}, m^t, c^t, m^{t+1}, c^{t+1}, y^{1:T}) p(z^{t+1} | m^t, c^t, m^{t+1}, c^{t+1}, y^{1:T}) dz^{t+1}
      = ∫_{z^{t+1}} p(z^t | z^{t+1}, m^t, c^t, y^{1:t}) p(z^{t+1} | m^t, c^t, m^{t+1}, c^{t+1}, y^{1:T}) dz^{t+1}.

The above equation enables a recursive computation of p(z^t | m^t, c^t, m^{t+1}, c^{t+1}, y^{1:T}). First we compute the term p(z^{t+1} | m^t, c^t, m^{t+1}, c^{t+1}, y^{1:T}). We follow [2] and note that although {m^t, c^t} is not independent of z^{t+1} given y^{1:T}, the influence of {m^t, c^t} on z^{t+1} through z^t is "weak", as z^t is mostly influenced by y^{1:t}. We shall therefore approximate p(z^{t+1} | m^t, c^t, m^{t+1}, c^{t+1}, y^{1:T}) by p(z^{t+1} | m^{t+1}, c^{t+1}, y^{1:T}). The benefit of this simple assumption is that p(z^{t+1} | m^{t+1}, c^{t+1}, y^{1:T}) can be obtained directly from the previous backward recursion. Note that p(z^{t+1} | m^{t+1}, c^{t+1}, y^{1:T}) is a Gaussian mixture whose number of components increases exponentially in T − t. Similar to what we did in the forward recursion, we therefore collapse p(z^{t+1} | m^{t+1}, c^{t+1}, y^{1:T}) to a single Gaussian and pass this collapsed Gaussian to the next step. The approximated p(z^{t+1} | m^{t+1}, c^{t+1}, y^{1:T}) at the next time step is then a Gaussian mixture with a fixed number of |sp(C)| · |sp(M)| components, which we again collapse to a single Gaussian before propagating it further. This guarantees that at every time step the number of components in the approximating Gaussian mixture stays fixed.

The factor p(z^t | z^{t+1}, m^t, c^t, y^{1:t}) can be evaluated by conditioning the joint p(z^t, z^{t+1} | m^t, c^t, y^{1:t}) on z^{t+1} using Appendix D. The joint can be computed as

    p(z^t, z^{t+1} | m^t, c^t, y^{1:t}) = p(z^{t+1} | z^t, m^t, c^t, c^{t+1}, y^{1:t}) p(z^t | m^t, c^t, y^{1:t}).        (B.6)

p(z^t | m^t, c^t, y^{1:t}) can be obtained from the forward recursion. The remaining

necessary statistics of the joint are the mean and covariance of z^{t+1}, and the covariance between z^{t+1} and z^t:

    μ_{z^{t+1}} = A_{c^{t+1}} ν_{m^t,c^t},
    Σ_{z^{t+1} z^{t+1}} = A_{c^{t+1}} Γ_{m^t,c^t} A'_{c^{t+1}} + Σ_{c^{t+1}},
    Σ_{z^{t+1} z^t} = A_{c^{t+1}} Γ_{m^t,c^t}.

Evaluating p(m^t, c^t | m^{t+1}, c^{t+1}, y^{1:T}):  The term p(m^t, c^t | m^{t+1}, c^{t+1}, y^{1:T}) can be found from

    p(m^t, c^t | m^{t+1}, c^{t+1}, y^{1:T})
      = ∫_{z^{t+1}} p(m^t, c^t | z^{t+1}, m^{t+1}, c^{t+1}, y^{1:T}) p(z^{t+1} | m^t, c^t, m^{t+1}, c^{t+1}, y^{1:T}) dz^{t+1}.        (B.7)

As {m^t, c^t} ⊥⊥ y^{t+1:T} | {z^{t+1}, m^{t+1}, c^{t+1}}, we have p(m^t, c^t | z^{t+1}, m^{t+1}, c^{t+1}, y^{1:T}) = p(m^t, c^t | z^{t+1}, m^{t+1}, c^{t+1}, y^{1:t}), which we can compute as

    p(m^t, c^t | z^{t+1}, m^{t+1}, c^{t+1}, y^{1:t}) ∝ p(m^t, c^t, z^{t+1}, m^{t+1}, c^{t+1}, y^{1:t})
      = p(z^{t+1} | m^t, c^t, m^{t+1}, c^{t+1}, y^{1:t}) p(m^t, c^t, m^{t+1}, c^{t+1} | y^{1:t}).        (B.8)

The first factor in Equation B.8 can be obtained by marginalizing out z^t from p(z^t, z^{t+1} | m^t, c^t, m^{t+1}, c^{t+1}, y^{1:t}), which is computed in Equation B.6. The last factor, p(m^t, c^t, m^{t+1}, c^{t+1} | y^{1:t}), is found from

    p(m^t, c^t, m^{t+1}, c^{t+1} | y^{1:t}) ∝ p(m^{t+1}, c^{t+1} | m^t, c^t, y^{1:t}) p(m^t, c^t | y^{1:t}),

where all the quantities can be computed from the forward recursion. Using p(z^{t+1} | m^{t+1}, c^{t+1}, y^{1:T}) from the previous backward recursion, we are now able to compute Equation B.7. As in the situation we encountered for the forward recursion in Equation B.4, any numerical integration method can be adopted to perform the integral in Equation B.7.

Evaluating p(m^{t+1}, c^{t+1} | y^{1:T}):  This term can be obtained by marginalizing out z^{t+1} from p(z^{t+1}, m^{t+1}, c^{t+1} | y^{1:T}), which is obtained during the previous step of the backward recursion.

Closing the backward recursion   All the factors in Equation B.5 are now computed in a recursive form based on the smoothed results from the previous backward recursion, which also completes the approximate inference for the dLCM.

Appendix C. Gaussian propagation

Let y be linearly related to x through y = Ax + ψ, where ψ ∼ N(ν, Ψ) and x ∼ N(μ, Σ). Then p(y) = ∫_x p(y | x) p(x) dx is Gaussian with mean Aμ + ν and covariance AΣA' + Ψ [2].
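A scalar instance of this propagation rule is easy to check: for y = a·x + ψ with x ∼ N(μ, s²) and ψ ∼ N(ν, p²), p(y) is N(a·μ + ν, a·s²·a + p²):

```python
# 1-D Gaussian propagation: mean a*mu + nu, variance a*s2*a + p2,
# matching the rule of Appendix C with scalar A, Sigma, and Psi.
def propagate(a, mu, s2, nu, p2):
    return a * mu + nu, a * s2 * a + p2

m, v = propagate(2.0, 1.0, 0.5, 0.25, 0.1)
# m = 2*1 + 0.25 = 2.25, v = 4*0.5 + 0.1 = 2.1
```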

Appendix D. Gaussian conditioning

For a joint Gaussian over x and y with means μ_x, μ_y and covariance elements Σ_{xx}, Σ_{xy}, Σ_{yy}, the conditional p(x | y) is Gaussian with mean μ_x + Σ_{xy} Σ^{-1}_{yy} (y − μ_y) and covariance Σ_{xx} − Σ_{xy} Σ^{-1}_{yy} Σ_{yx} [2].
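The scalar version of this conditioning rule reads: for jointly Gaussian (x, y), p(x | y) has mean μ_x + (Σ_{xy}/Σ_{yy})(y − μ_y) and variance Σ_{xx} − Σ_{xy}²/Σ_{yy}:

```python
# 1-D Gaussian conditioning, matching the rule of Appendix D with scalars.
def condition(mu_x, mu_y, Sxx, Sxy, Syy, y):
    mean = mu_x + Sxy / Syy * (y - mu_y)
    var = Sxx - Sxy * Sxy / Syy
    return mean, var

m, v = condition(0.0, 0.0, 1.0, 0.5, 1.0, 2.0)
# observing y = 2 with cross-covariance 0.5: mean 1.0, variance 0.75
```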
