Maximum Likelihood Estimation of State Space Models from Frequency Domain Data

Adrian Wills, Brett Ninness, Stuart Gibson

Abstract— This paper addresses the problem of estimating linear time invariant models from observed frequency domain data. Here an emphasis is placed on deriving numerically robust and efficient methods that can reliably deal with high order models over wide bandwidths. This involves a novel application of the Expectation-Maximisation (EM) algorithm in order to find Maximum Likelihood estimates of state space structures. An empirical study using both simulated and real measurement data is presented to illustrate the efficacy of the solutions derived here.

I. INTRODUCTION

A widely used approach to the problem of linear dynamic system identification is to first obtain measurements of a system's frequency response, and then take as the estimate the model of given order whose response most closely matches these measurements [16], [18]. This technique may be employed because the system observations naturally present themselves in the frequency domain, such as in the area of vibration analysis [11]. Alternatively, it may be used in order to obtain some of the natural advantages that flow from a frequency domain approach, such as the ease of combining separate data sets, the ability to compress large (time domain) data records, and the ease of accommodating continuous time models [21].

However, the approach is not without difficulties. In most cases finding the estimate involves solving a non-convex, nonlinearly parametrized optimisation problem. This necessitates some sort of iterative search, such as a gradient based technique [5]. Furthermore, when the observed data is in the frequency domain, it can have a very wide dynamic range spanning several orders of magnitude, with features at small magnitudes being as significant from a modelling viewpoint as features at large magnitudes. Finally, it may be required that the estimated model fit the observed data over several decades of frequency. The combined effect of these issues is that solving a frequency domain estimation problem often implies solving a complex optimisation problem subject to significant numerical conditioning difficulties. A wide variety of techniques have been developed to cope with these difficulties, a commonly used approach being a Gauss–Newton type search with respect to models which are specifically parametrized to ameliorate numerical conditioning issues [21].

This paper explores a new approach of tackling the model fitting problem via the Expectation-Maximisation (EM) algorithm. As opposed to gradient-based search, which applies to the general class of smooth cost functions, the EM method was specifically developed to tackle (possibly non-smooth) Maximum-Likelihood (ML) criteria. It was originally developed in the statistics literature [4], and enjoys a reputation in that field of providing a particularly robust means for iteratively computing ML estimates. However, it has not (to the authors' knowledge) previously been applied to the frequency domain estimation problems studied here, although the current authors have studied the method for both linear and non-linear estimation from time domain data in [6], [7]. In that context, it was found to provide a method that was particularly robust against capture in local minima. It was also observed to enjoy favourable numerical conditioning and a reasonable computational load which scaled only modestly with increasing model complexity. These advantages in the time domain case motivated the frequency domain study here and, as will be illustrated, the same benefits apply in this new setting.

Like gradient based search, the algorithm involved is still iterative in nature and consists of two key steps. In the first, a smoothed state estimate based on the current iterate of model parameters is computed. In the second, this state estimate is used to update the model parameters by solving two maximisation problems. Unlike the time domain cases studied in [6], [7], this maximisation step cannot be performed in closed form, and a gradient based search is required. However, the problems involved are unimodal (i.e. quasi-convex; see [1]), and hence simply solved. The efficacy of the method is illustrated here empirically on a range of simulation studies of varying system dimension, and on two real world problems involving a cantilevered beam and a vibrating mechanical system. For readers wishing to perform their own evaluation, MATLAB-based software implementing the methods derived and studied here is freely available at sigpromu.org/idtoolbox.

The paper is organised as follows. In Section II we define the model structure and noise assumptions used in the remainder of the paper. Section III reviews the Maximum-Likelihood method for state-space models in the frequency domain. This leads to the description of the Expectation-Maximisation algorithm in Section IV. Section V, detailing the expectation step, and Sections VI–VII, detailing the maximisation step together with computational considerations, may be skipped by readers interested only in the resulting algorithm, which is summarised in Section VIII. Its convergence properties are analysed in Section IX. The EM method is profiled against other competitive methods for MIMO state-space models in Section X, and Section XII concludes the paper.

This work was supported by the Australian Research Council. A. Wills is with the School of Electrical and Computer Science, University of Newcastle, Australia (email: [email protected]). B. Ninness is with the School of Electrical and Computer Engineering, University of Newcastle, Australia (email: [email protected]). S. Gibson is with Lehman Brothers, London, UK (email: [email protected]).

II. MODEL STRUCTURE AND AVAILABLE DATA

This paper assumes the availability of a set of measurements {Y(ω_k) ≜ Y_k}_{k=1}^N, Y_k ∈ C^{p×m}, of a system's frequency response at a set {ω_k}_{k=1}^N of frequencies that need not be equally spaced. Measurements {U(ω_k) ≜ U_k}_{k=1}^N, U_k ∈ C^{m×m}, of the input spectrum at these same frequencies are also assumed available. By allowing p > 1, m > 1 the cases of multiple output, multiple input, or both are all catered for. More specifically, the (i, ℓ)'th element of the matrix Y_k is the response at the i'th output on the ℓ'th excitation experiment, and the (i, ℓ)'th element of U_k is the excitation on the i'th input during the ℓ'th experiment [21, Section 2.7].

The topic of this paper is to address the problem of finding matrices A, B, C, D in a state space model such that the associated frequency response

G(γ_k) = C(γ_k I − A)^{-1} B + D;   γ_k = e^{jω_k} or jω_k   (1)

matches that of an underlying true linear system producing the measurements {Y_k} in response to {U_k} as closely as possible. Via the choice of γ_k in (1), both discrete time and continuous time models are accommodated. Furthermore, the common situation of Frequency-Response-Function (FRF) data is addressed by setting U_k = I_m, where I_m is the m × m identity matrix. To address this estimation problem, the following state space model structure is employed

[γ_k X_k; Y_k] = [A, B; C, D] [X_k; U_k] + [W_k; E_k],   γ_k = e^{jω_k} or jω_k.   (2)

Here, A ∈ R^{n×n}, B ∈ R^{n×m}, C ∈ R^{p×n} and D ∈ R^{p×m} are the system matrices to be estimated. The element X_k ∈ C^{n×m} is the frequency response of the system state to the excitation U_k = U(ω_k). The terms W_k ∈ C^{n×m} and E_k ∈ C^{p×m} model corruptions in the frequency response measurements. These are assumed to be zero mean i.i.d. random variables that are circular complex Normally distributed according to

[vec{W_k}; vec{E_k}] ∼ N_c( [0; 0], [I_m ⊗ Q, 0; 0, I_m ⊗ R] ).   (3)

Here vec{X} ∈ C^{nm} is the vector obtained by stacking the left to right columns of the matrix X ∈ C^{n×m} on top of one another, and ⊗ is the Kronecker product. The circular complex Normal distribution is defined such that if Z ∼ N_c(µ, P) for a vector Z ∈ C^n, then its real and imaginary parts are distributed as

[Re{Z}; Im{Z}] ∼ N( [Re{µ}; Im{µ}], (1/2) [Re{P}, −Im{P}; Im{P}, Re{P}] )   (4)

where N denotes a standard Gaussian density, and Z has probability density function

p(Z) = (1/(π^n |P|)) exp{ −(Z − µ)^* P^{-1} (Z − µ) },   (5)

with (·)^* denoting the complex conjugate transpose of (·). This assumption of an i.i.d. circular complex Normal distribution is widely employed in the frequency domain estimation literature [21], [16], [17]. It is motivated by several considerations, the most significant being that the discrete Fourier transform (DFT) of a white Gaussian time domain measurement corruption has an i.i.d. circular complex Normal distribution, and that the DFT of a stationary bounded variance time domain measurement corruption converges in distribution, as the time domain measurement length grows, to a sequence of i.i.d. circular complex Normal random variables [2]. Coupling these stochastic assumptions with state space models has also been previously considered in [16], [17], where the corresponding 'innovations form' of (2) is employed. Finally, it is important to emphasise that the model (2) does not explicitly account for any non-zero initial conditions affecting the frequency domain measurements {Y_k}, {U_k}. These typically decrease in size with increasing time domain data length [16], and it is assumed that this length is sufficient for them to be negligible.

III. MAXIMUM LIKELIHOOD ESTIMATION

Given the model structure (2), (3), the work here addresses the problem of using the data {U_k}_{k=1}^N and {Y_k}_{k=1}^N to form an estimate θ̂ of the system parametrization

θ ≜ [β^T, η^T, vec{Q}^T, vec{R}^T]^T   (6)

where

β ≜ vec{∆},   ∆ ≜ [A, B],   (7)
η ≜ vec{Γ},   Γ ≜ [C, D].   (8)

It is worth pointing out that β and ∆ are isomorphic, and likewise η and Γ are isomorphic; we exploit this one-to-one correspondence throughout the remainder of the paper. The parameter vector θ is constrained to the parameter set Θ defined via

Θ = Θ_1 × Θ_2 × Θ_3,   [β, η] ∈ Θ_1,   Q ∈ Θ_2,   R ∈ Θ_3   (9)

where Θ_1 is a closed hypercube in R^ℓ, ℓ = n² + nm + np + p², and Θ_2 is a closed and bounded subset of R^{n×n} for which all Q ∈ Θ_2 are symmetric and positive definite. Similarly, Θ_3 is a closed and bounded subset of R^{p×p} for which all R ∈ Θ_3 are symmetric and positive definite.

For estimation purposes, this paper employs the well-known maximum likelihood criterion

θ̂ ≜ arg max_{θ∈Θ} p_θ(Y_1, ..., Y_N)

which has well-known properties of statistical optimality [21]. This approach requires computation of the likelihood p_θ(Y_1, ..., Y_N). To achieve this, note that via (2)

Y_k = G_k U_k + T_k W_k + E_k

where W_k ≜ W(ω_k), E_k ≜ E(ω_k) and

T_k ≜ C(γ_k I − A)^{-1},   G_k ≜ T_k B + D   (10)

so that

vec{Y_k} ∼ N_c( vec{G_k U_k}, I_m ⊗ P_k )   (11)

where

P_k ≜ C A_k^{-1} Q A_k^{-*} C^T + R,   A_k ≜ γ_k I − A.   (12)

Note that this requires that A have no eigenvalues on the stability boundary, which is assumed true in the remainder of the paper. To continue, by the independence of W_k and W_ℓ for k ≠ ℓ, and similarly of E_k and E_ℓ,

p_θ(Y_1, ..., Y_N) = ∏_{k=1}^N p_θ(Y_k) = ∏_{k=1}^N (1/(π^{mp} |P_k|^m)) exp{ −Tr[ (Y_k − G_k U_k)^* P_k^{-1} (Y_k − G_k U_k) ] }.   (13)

Since log x is monotonic in x, the maximum likelihood estimate can be equivalently defined as

θ̂ = arg max_{θ∈Θ} L(θ)   (14)

where L(θ) is the log-likelihood function

L(θ) = log p_θ(Y_1, ..., Y_N).   (15)

In this frequency domain case, equation (13) makes it clear that (after dropping terms independent of θ)

L(θ) = −m Σ_{k=1}^N log|P_k| − Σ_{k=1}^N Tr{ (Y_k − G_k U_k)^* P_k^{-1} (Y_k − G_k U_k) }.   (16)

The essential difficulty now is that, via the formulation of G_k, T_k and P_k, L(θ) is parametrized in quite a complicated way by θ. As such, the computation of θ̂ via maximisation of L(θ) is not straightforward, and certainly requires some sort of iterative search procedure. Previous work has addressed this via the employment of a gradient-based search method such as Gauss–Newton iteration or a variant thereof [15], [21]. However, this previous work does not specifically consider MIMO state-space model structures with state noise. This is possibly due to the inherent difficulty of calculating the gradient and Hessian of the maximum-likelihood cost function (16). The work here avoids this difficulty by employing the so-called 'Expectation-Maximisation' (EM) algorithm.
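As a concrete illustration of (16), the following hedged sketch evaluates the negative of L(θ), i.e. the quantity a search routine would minimise. It assumes numpy and lists of per-frequency arrays; all names are illustrative.

```python
import numpy as np

def neg_log_likelihood(A, B, C, D, Q, R, gks, U, Y):
    """Sketch of -L(theta) from (16), theta-independent terms dropped.
    gks holds gamma_k; U, Y hold U_k in C^{m x m} and Y_k in C^{p x m}."""
    n = A.shape[0]
    m = U[0].shape[1]
    In = np.eye(n)
    L = 0.0
    for gk, Uk, Yk in zip(gks, U, Y):
        Aki = np.linalg.inv(gk * In - A)           # A_k^{-1}, A_k = gamma_k I - A
        Gk = C @ Aki @ B + D                       # G_k of (10)
        Pk = C @ Aki @ Q @ Aki.conj().T @ C.T + R  # P_k of (12)
        Ek = Yk - Gk @ Uk
        L += m * np.linalg.slogdet(Pk)[1]          # m log|P_k|
        L += np.trace(Ek.conj().T @ np.linalg.solve(Pk, Ek)).real
    return L
```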

IV. EM ALGORITHM

To explain and motivate the EM Algorithm it will be convenient to define the full output and full state information sequences as

Y ≜ {Y_1, ..., Y_N},   X ≜ {X_1, ..., X_N}.   (17)

An essential ingredient in the understanding of the EM Algorithm is the recognition that, via Bayes' rule, p(Y) = p(X, Y)/p(X | Y), and hence the log likelihood L(θ) in (15) can be decomposed as

L(θ) = log p_θ(Y) = log p_θ(X, Y) − log p_θ(X | Y).   (18)

Note that, in general, the component X is termed the 'missing data' and its choice is the essential design variable in the deployment of the EM method [4], [6], [7]. Its selection here as the states in an underlying state-space model is part of the adaptation of the EM approach to this specific frequency domain estimation setting.

A second essential point is that taking the expectation of L(θ) given the output measurements Y, and with respect to a probability density function implied by parameters θ′, leaves L(θ) unchanged. That is,

E_{θ′}{ L(θ) | Y } = ∫ log p_θ(y) p_{θ′}(y | y = Y) dy = log p_θ(Y),

since clearly p_{θ′}(y | y = Y) = δ(y − Y), a Dirac delta. Therefore, via (18),

L(θ) = Q(θ, θ′) − V(θ, θ′)   (19)

where

Q(θ, θ′) ≜ E_{θ′}{ log p_θ(X, Y) | Y },   (20)
V(θ, θ′) ≜ E_{θ′}{ log p_θ(X | Y) | Y }.   (21)

The final key component is that, via (19),

L(θ) − L(θ′) = Q(θ, θ′) − Q(θ′, θ′) + V(θ′, θ′) − V(θ, θ′),   (22)

where the final difference V(θ′, θ′) − V(θ, θ′) is non-negative, since via the definition (21) it is easily established to be the Kullback–Leibler divergence between p_θ(X | Y) and p_{θ′}(X | Y) [3]. It follows immediately that if Q(θ, θ′) > Q(θ′, θ′) then L(θ) > L(θ′). This naturally suggests what is known as the EM algorithm, which takes an approximation θ̂_i of the Maximum Likelihood estimate θ̂ given by (14) and updates it to an expected better one θ̂_{i+1} according to:

1) E Step. Compute

   Q(θ, θ̂_i);   (23)

2) M Step. Compute

   θ̂_{i+1} = arg max_{θ∈Θ} Q(θ, θ̂_i).   (24)

Like gradient based search, a single iteration is unlikely to find the optimiser θ̂, and hence the above steps are iterated to produce a sequence {θ̂_0, θ̂_1, θ̂_2, ...} of increasingly good approximations to θ̂. The iterations are usually terminated using a standard criterion such as the relative increase of L(θ) falling below a pre-defined threshold [5].
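In outline, the resulting iteration can be sketched as below; e_step, m_step and loglik are hypothetical callables standing in for the computations derived in Sections V–VII, and the termination rule is one reasonable instance of the criterion just mentioned.

```python
def em_estimate(theta0, e_step, m_step, loglik, tol=1e-6, max_iter=200):
    """Sketch of the EM iteration (23)-(24); callables are assumptions."""
    theta, L_old = theta0, loglik(theta0)
    for _ in range(max_iter):
        stats = e_step(theta)           # E step: quantities defining Q(., theta_i)
        theta = m_step(stats, theta)    # M step: arg max of Q(theta, theta_i)
        L_new = loglik(theta)
        if abs(L_new - L_old) <= tol * max(1.0, abs(L_old)):
            break                       # relative increase below threshold
        L_old = L_new
    return theta
```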

V. THE E-STEP

To implement the EM Algorithm for frequency domain estimation using the missing data choice X given by (17), it is clearly necessary to derive a method for computing Q(θ, θ̂_i). This may be simply achieved via the results of the following two lemmas.

Lemma 1: The function Q(θ, θ̂_i) defined via (20) may be expressed as (recall the definitions in (7) and (8))

Q(θ, θ̂_i) = −Q_Γ(η, R, θ̂_i) − Q_∆(β, Q, θ̂_i)   (25)

where

Q_Γ(η, R, θ̂_i) ≜ mN log|R| + Tr{ R^{-1} [ Φ − ΓΨ^* − ΨΓ^T + ΓΣΓ^T ] },   (26)

Q_∆(β, Q, θ̂_i) ≜ mN log|Q| − 2m Σ_{k=1}^N Re{log|A_k|} + Tr{ Q^{-1} [ Λ − ∆Ω^* − Ω∆^T + ∆Σ∆^T ] },   (27)

where A_k has been defined in (12) and

Φ ≜ Σ_{k=1}^N Y_k Y_k^*,   Ω ≜ Σ_{k=1}^N E_{θ̂_i}{ γ_k X_k Z_k^* | Y_k },   (28)

Ψ ≜ Σ_{k=1}^N E_{θ̂_i}{ Y_k Z_k^* | Y_k },   Σ ≜ Σ_{k=1}^N E_{θ̂_i}{ Z_k Z_k^* | Y_k },   (29)

Λ ≜ Σ_{k=1}^N E_{θ̂_i}{ γ_k^* γ_k X_k X_k^* | Y_k },   Z_k ≜ [X_k; U_k].   (30)

Proof: See Appendix I.

This reduces the problem of computing Q(θ, θ̂_i) to one of computing a smoothed state estimate and its covariance, which itself may be achieved via the results of the following lemma.

Lemma 2: Assume that the observed data Y_k obeys the model (2), which is parametrized by a given vector θ according to (6). Then

X̂_k ≜ E_θ{ X_k | Y_k } = A_k^{-1} B U_k + S_k R_k^{-1} ( Y_k − C A_k^{-1} B U_k − D U_k ),

and

E_θ{ X_k X_k^* | Y_k } = X̂_k X̂_k^* + Q_k − S_k R_k^{-1} S_k^*

where A_k has been defined in (12) and

Q_k ≜ A_k^{-1} Q A_k^{-*},   S_k ≜ Q_k C^T,   R_k ≜ C Q_k C^T + R.

Proof: According to the model (2),

[X_k; Y_k] ∼ N_c( [A_k^{-1} B U_k; C A_k^{-1} B U_k + D U_k], [Q_k, S_k; S_k^*, R_k] ).   (31)

Therefore, by Lemma 11 in Appendix V,

(X_k | Y_k) ∼ N_c( µ_k, Π_k )   (32)

where

µ_k = A_k^{-1} B U_k + S_k R_k^{-1} ( Y_k − C A_k^{-1} B U_k − D U_k ),
Π_k = Q_k − S_k R_k^{-1} S_k^*.
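In code, Lemma 2 and the accumulation of (28)–(30) amount to the following hedged sketch (numpy assumed; all names and the decomposition into two helpers are illustrative choices, not toolbox API):

```python
import numpy as np

def e_step_moments(A, B, C, D, Q, R, gk, Uk, Yk):
    """Sketch of Lemma 2 for one frequency: smoothed state moments."""
    n = A.shape[0]
    Aki = np.linalg.inv(gk * np.eye(n) - A)       # A_k^{-1}
    Qk = Aki @ Q @ Aki.conj().T                   # A_k^{-1} Q A_k^{-*}
    Sk = Qk @ C.T
    Rk = C @ Qk @ C.T + R
    pred = Aki @ B @ Uk                           # prior mean A_k^{-1} B U_k
    Xh = pred + Sk @ np.linalg.solve(Rk, Yk - C @ pred - D @ Uk)
    XX = Xh @ Xh.conj().T + Qk - Sk @ np.linalg.solve(Rk, Sk.conj().T)
    return Xh, XX

def accumulate_stats(A, B, C, D, Q, R, gks, U, Y):
    """Sketch of accumulating Phi, Psi, Sigma, Lambda, Omega of (28)-(30)."""
    n, m, p = A.shape[0], U[0].shape[1], C.shape[0]
    Phi = np.zeros((p, p), complex)
    Psi = np.zeros((p, n + m), complex)
    Sig = np.zeros((n + m, n + m), complex)
    Lam = np.zeros((n, n), complex)
    Om = np.zeros((n, n + m), complex)
    for gk, Uk, Yk in zip(gks, U, Y):
        Xh, XX = e_step_moments(A, B, C, D, Q, R, gk, Uk, Yk)
        XZ = np.hstack([XX, Xh @ Uk.conj().T])    # E{X_k Z_k^* | Y_k}
        Phi += Yk @ Yk.conj().T
        Psi += np.hstack([Yk @ Xh.conj().T, Yk @ Uk.conj().T])
        Sig += np.block([[XX, Xh @ Uk.conj().T],
                         [Uk @ Xh.conj().T, Uk @ Uk.conj().T]])
        Lam += abs(gk) ** 2 * XX
        Om += gk * XZ
    return Phi, Psi, Sig, Lam, Om
```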

VI. THE M-STEP

Together, Lemmas 1 and 2 provide a simple means for executing the 'expectation step' (23) of computing Q(θ, θ̂_i). This leads to consideration of the 'maximisation step' (24). Unfortunately, this is not so straightforward, since it can only partially be achieved by closed form expressions.

Recall that via (25), Q(θ, θ̂_i) can be decomposed into two components Q_∆(β, Q, θ̂_i) and Q_Γ(η, R, θ̂_i) which depend in a non-coupled fashion on particular elements of the state space description A, B, Q and C, D, R, respectively, as defined via the parametrization in (7), (8). This allows the optimisation problem (24) to be decomposed into smaller ones. The first of these involves maximisation of Q(θ, θ̂_i) (minimisation of Q_Γ(η, R, θ̂_i) and Q_∆(β, Q, θ̂_i)) with respect to the noise parametrizations R and Q. This subproblem does have a closed form solution.

Lemma 3: For any η = vec{[C, D]},

R̂(η) ≜ (1/(mN)) [ Φ − ΓΨ^* − ΨΓ^T + ΓΣΓ^T ]   (33)

is a stationary point of Q_Γ(η, R) with respect to R, and

(1/(mN)) ∂²Q_Γ(η, R)/∂vec{R}∂vec{R}^T |_{R=R̂(η)} = R̂(η)^{-1} ⊗ R̂(η)^{-1}.   (34)

Likewise, for any β = vec{[A, B]},

Q̂(β) ≜ (1/(mN)) [ Λ − ∆Ω^* − Ω∆^T + ∆Σ∆^T ]   (35)

is a stationary point of Q_∆(β, Q) with respect to Q, and

(1/(mN)) ∂²Q_∆(β, Q)/∂vec{Q}∂vec{Q}^T |_{Q=Q̂(β)} = Q̂(β)^{-1} ⊗ Q̂(β)^{-1}.   (36)

Proof: By Lemma 13,

∂Q_Γ(η, R)/∂R = mN { R^{-1} − R^{-1} [ Φ − ΓΨ^* − ΨΓ^T + ΓΣΓ^T ] R^{-1} }   (37)

which is clearly equal to zero for R = R̂(η) given by (33). A similar argument holds for Q̂(β). Furthermore,

∂Q_Γ(η, R)/∂vec{R} = mN vec{ R^{-1} − R^{-1} R̂(η) R^{-1} }.   (38)

Lemma 13 and the product rule then yield

(1/(mN)) ∂²Q_Γ(η, R)/∂vec{R}∂vec{R}^T
 = ∂vec{R^{-1}}/∂vec{R}^T − ( [R̂(η)R^{-1}]^T ⊗ I ) ∂vec{R^{-1}}/∂vec{R}^T − ( I ⊗ R^{-1}R̂(η) ) ∂vec{R^{-1}}/∂vec{R}^T
 = ( [R̂(η)R^{-1}]^T ⊗ I )( R^{-1} ⊗ R^{-1} ) + ( I ⊗ R^{-1}R̂(η) )( R^{-1} ⊗ R^{-1} ) − R^{-1} ⊗ R^{-1}.

Evaluating this at R = R̂(η) then establishes (34), with (36) being obtained in an identical manner.
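A hedged sketch of these closed-form updates, with names matching (33) and (35) and the E-step statistics from the previous section; taking the real part is an assumption reflecting the fact that the exact stationary points are real symmetric.

```python
import numpy as np

def noise_updates(Gamma, Delta, Phi, Psi, Sig, Lam, Om, m, N):
    """Sketch of (33) and (35): closed-form stationary points for R and Q,
    given Gamma = [C, D] and Delta = [A, B]."""
    Rh = (Phi - Gamma @ Psi.conj().T - Psi @ Gamma.T
          + Gamma @ Sig @ Gamma.T) / (m * N)
    Qh = (Lam - Delta @ Om.conj().T - Om @ Delta.T
          + Delta @ Sig @ Delta.T) / (m * N)
    return Rh.real, Qh.real   # exact values are real symmetric
```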

Therefore, provided the stationary points R̂(η), Q̂(β) given by (33), (35) are positive definite, they provide closed form expressions for local maximisers of Q(θ, θ̂_i) with respect to R and Q. Substituting them for the R and Q terms of Q_Γ(η, R, θ̂_i) and Q_∆(β, Q, θ̂_i) then leads to 'concentrated' versions that depend only on the system parameters β = vec{∆} = vec{[A, B]} and η = vec{Γ} = vec{[C, D]} according to

Q_Γ(η, R̂(η)) = mN log| (1/(mN)) [ Φ − ΓΨ^* − ΨΓ^T + ΓΣΓ^T ] | + mpN,   (39)

Q_∆(β, Q̂(β)) = mN log| (1/(mN)) [ Λ − ∆Ω^* − Ω∆^T + ∆Σ∆^T ] | + mnN − 2m Σ_{k=1}^N Re{log|A_k|}.   (40)

Considering (39), since

Φ − ΓΨ^* − ΨΓ^T + ΓΣΓ^T = (Γ − ΨΣ^{-1}) Σ (Γ − ΨΣ^{-1})^* + Φ − ΨΣ^{-1}Ψ^*,   (41)

it is tempting to conclude via Minkowski's inequality for determinants of positive definite matrices [9] that

η̂ = vec{Γ̂},   Γ̂ = ΨΣ^{-1}   (42)

globally minimises (39). Unfortunately, this overlooks the constraint that η̂ must be real valued, and it is not clear how this constraint can be simply accommodated while retaining a closed form solution. As well, (40) has an additional complicating factor due to the presence of the Σ_{k=1}^N Re{log|A_k|} term. To address this, a damped Newton search of the form

η̂_{i+1} = η̂_i − α_i H_Γ^{-1}(η̂_i) g_Γ(η̂_i)   (43)

is proposed, where g_Γ(η) and H_Γ(η) are (respectively) the gradient and Hessian of Q_Γ(η, R̂(η)), and α_i is a damping coefficient used to ensure global convergence [5]. The rationale is that since Q_Γ(η, R̂(η)) is real valued, and hence its gradient and Hessian are real valued, the search (43) automatically respects the realness constraint on η̂.

Of course, the reliability and speed of employing the iterative search (43) is an obvious concern. To address this, recall that a function f : R^n → R is termed 'quasiconvex' if for any α ∈ R the sublevel set {x : f(x) ≤ α} is convex [8]. Intuitively, these functions are thus 'cone shaped', and are often termed 'unimodal' in recognition of their lack of multiple local minima [1]. Furthermore, a sufficient condition for f to be quasiconvex is that

f(y) ≤ f(x)  ⟹  ∇f(x)^T (y − x) ≤ 0,   (44)

which implies that gradient based search, and in particular damped-Newton iterations such as (43), converge quickly and reliably [8], [5].

Lemma 4: The function Q_Γ(η, R̂(η)) defined in (39) is quasiconvex as a function of η.
Proof: See Appendix IV.

Turning to Q_∆(β, Q̂(β), θ̂_i), an identical argument and approach applies to deriving a real valued minimiser β̂ via the Newton search update

β̂_{i+1} = β̂_i − δ_i H_∆^{-1}(β̂_i) g_∆(β̂_i)   (45)

where g_∆(β) and H_∆(β) are (respectively) the gradient and Hessian of Q_∆(β, Q̂(β), θ̂_i), and δ_i is again a damping coefficient chosen to ensure global convergence. Unfortunately, because of the Σ_k Re{log|A_k|} term, Q_∆(β, Q̂(β), θ̂_i) is not, in general, quasiconvex in β. It is nevertheless the experience of the authors (to be illustrated in a section to follow) that (45) provides a quick and reliable means for minimising Q_∆(β, Q̂(β), θ̂_i).

In order to employ the damped-Newton searches (43), (45), we require the computation of the gradient and Hessian, which is detailed in the following lemmas.

Lemma 5: The gradients for Q_Γ(η, R̂(η)) and Q_∆(β, Q̂(β)) are given by

g_Γ(η) ≜ ∂Q_Γ(η, R̂(η))/∂η = 2Re{ vec{ R̂(η)^{-1} [ΓΣ − Ψ] } },   (46)

g_∆(β) ≜ ∂Q_∆(β, Q̂(β))/∂β = 2Re{ vec{ Q̂(β)^{-1} [∆Σ − Ω] } } + 2m Re{ Σ_{k=1}^N [ vec{A_k^{-T}} ; ∅_{nm} ] },   (47)

respectively.
Proof: See Appendix II.
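One plausible realisation of a single damped-Newton update of the form (43) or (45) is sketched below, with the damping coefficient chosen by backtracking; the backtracking rule is an assumption about how α_i (or δ_i) could be selected, not the paper's prescription.

```python
import numpy as np

def damped_newton_step(x, grad, hess, cost):
    """Sketch of one update of type (43)/(45): step along -H^{-1} g,
    halving the step length until the concentrated cost decreases."""
    step = np.linalg.solve(hess(x), grad(x))
    alpha, f0 = 1.0, cost(x)
    while cost(x - alpha * step) > f0 and alpha > 1e-8:
        alpha /= 2.0                     # backtrack to ensure descent
    return x - alpha * step
```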

Lemma 6: The Hessian matrices for Q_Γ(η, R̂(η)) and Q_∆(β, Q̂(β)) are given by

H_Γ(η) ≜ ∂²Q_Γ(η, R̂(η))/∂η∂η^T = 2Re{ Σ ⊗ R̂(η)^{-1} − mN J_Γ^T(η) [ R̂(η)^{-1} ⊗ R̂(η)^{-T} ] J_Γ^C(η) },   (48)

H_∆(β) ≜ ∂²Q_∆(β, Q̂(β))/∂β∂β^T = 2Re{ Σ ⊗ Q̂(β)^{-1} − mN J_∆^T(β) [ Q̂(β)^{-1} ⊗ Q̂(β)^{-T} ] J_∆^C(β) }
  + [ 2m Re{ Σ_{k=1}^N A_k^{-1} ⊗ A_k^{-T} } K , ∅_{n²×nm} ; ∅_{nm×n²} , ∅_{nm×nm} ]   (49)

where

J_Γ(η) ≜ ∂vec{R̂(η)}/∂η^T = (1/(mN)) [ ( ΓΣ^T − Ψ^C ) ⊗ I_p + ( I_p ⊗ [ΓΣ − Ψ] ) K ],

J_∆(β) ≜ ∂vec{Q̂(β)}/∂β^T = (1/(mN)) [ ( ∆Σ^T − Ω^C ) ⊗ I_n + ( I_n ⊗ [∆Σ − Ω] ) K ],   (50)

(·)^C denotes elementwise complex conjugation, and K is a commutation matrix (see p. 46 of [13]), which satisfies

K vec{·} = vec{(·)^T},   K^T vec{(·)^T} = vec{·}.

Proof: See Appendix III.

VII. COMPUTATIONAL CONSIDERATIONS

A. Computing the gradient and Hessian

In the previous section we provided expressions for the gradients and Hessian matrices of Q_Γ(η, R̂(η)) and Q_∆(β, Q̂(β)). The computational load for the M-step is dominated by an O(N m³n³) matrix factorisation and solve step for the damped-Newton method in order to obtain the search direction for estimating A and B. As the order of the system increases to large values, this will quickly become prohibitive, and iterative methods such as conjugate gradients [19] may need to be employed.

The computational load for the E-step is dominated by computing A_k^{-1} = (γ_k I_n − A)^{-1} multiple times, which is an O(N n³) operation for N measurements. However, this burden may be significantly decreased by performing an eigenvalue decomposition of A:

A V = V J,   (51)

where the columns of V hold the eigenvectors of A, and J is a diagonal matrix whose diagonal entries are the corresponding eigenvalues of A. The utility of this decomposition is that

A_k = V (γ_k I_n − J) V^{-1}   (52)

so that computing each A_k is a matter of updating the diagonal matrix (γ_k I_n − J), but possibly more importantly,

A_k^{-1} = V (γ_k I_n − J)^{-1} V^{-1}.   (53)

Therefore, once (51) is solved once, each computation of A_k^{-1} requires only a diagonal (as opposed to full) matrix inversion. Furthermore, using the same ideas, after defining J_k as

J_k ≜ γ_k I_n − J,   (54)

then when computing the gradient (47) the following formulation is efficient

Σ_{k=1}^N vec{ A_k^{-T} } = vec{ V^{-T} [ Σ_{k=1}^N J_k^{-1} ] V^T },   (55)

as is the formulation

Σ_{k=1}^N ( A_k^{-1} ⊗ A_k^{-T} ) K = [ ( V ⊗ V^{-T} ) ( Σ_{k=1}^N J_k^{-1} ⊗ J_k^{-1} ) ( V^{-1} ⊗ V^T ) ] K   (56)

for computing the Hessian (49).

B. Continuous Time Modelling

The developments to this point have been independent of the choice of the frequency domain variable γ_k introduced in (1). However, the terms in (28), (29) and (30) involve components scaled by γ_k and |γ_k|². When fitting a continuous time model, whereby γ_k = jω_k, this implies scalings proportional to ω_k and ω_k², which may lead to numerical problems in cases where ω_k varies over a large dynamic range.

This type of problem encountered with frequency domain estimation of continuous time models is widely acknowledged in the literature [20], [17]. It is routinely addressed by first scaling the measurement frequencies ω_k to reduce their dynamic range, fitting a model, and then scaling the model to remove the effect of 'pre-warping' the ω_k points. This paper proposes the same well accepted approach.

Perhaps the simplest example of this technique involves the scaling

ω_k ↦ ω_k / ω_med   (57)

where ω_med is the median value of all the points ω_1, ..., ω_N. The system matrices A, B estimated with respect to this warped frequency axis are then themselves scaled to deliver the final estimates

A ↦ ω_med A,   B ↦ ω_med B.   (58)
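A hedged sketch of this scale/fit/unscale procedure; fit_model is a hypothetical estimator (for instance, the EM iteration of this paper) returning the continuous-time matrices fitted on the warped axis.

```python
import numpy as np

def estimate_with_scaling(wk, fit_model):
    """Sketch of (57)-(58): pre-warp the frequency axis by its median,
    fit, then undo the scaling on A and B."""
    w_med = np.median(wk)
    A, B, C, D = fit_model(wk / w_med)   # fit on the warped axis (57)
    return w_med * A, w_med * B, C, D    # unscale per (58)
```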

While this is an effective strategy, in the authors' experience an alternative approach developed in [17] provides slightly better performance. This involves linking the desired continuous-time estimate G(s, θ) to a discrete-time one G(q, θ) via the bilinear transform B(q) ≜ λ(q − 1)/(q + 1), so that G(q, θ) = G(B(q), θ) for a user defined constant λ. A continuous-time estimate G(s, θ) is then obtained by first finding a discrete time estimate G(q, θ) at frequency points ω̄_k scaled via

ω_k ↦ ω̄_k ≜ (2/T) tan^{-1}( ω_k / λ )   (59)

where T is the underlying sampling period. Since

B( e^{j ω̄_k T} ) = jλ tan( ω̄_k T / 2 ) = jω_k,   (60)

the discrete time model estimate G(e^{j ω̄_k T}, θ) will have an identical frequency response to the associated continuous-time estimate G(jω_k, θ) if the latter is formed as

G(s, θ) = G(B^{-1}(q), θ).   (61)

This inverse mapping is very simply achieved by substituting q = B^{-1}(s) = (λ + s)/(λ − s) to derive a mapping from the discrete-time estimates Ā, B̄, C̄, D̄ to continuous-time model estimates A, B, C, D as [17]

A ≜ λ(Ā + I)^{-1}(Ā − I),   B ≜ √(2λ) (Ā + I)^{-1} B̄,
C ≜ √(2λ) C̄ (Ā + I)^{-1},   D ≜ D̄ − C̄ (Ā + I)^{-1} B̄.
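The mapping just displayed translates directly into the following sketch (function name and numpy usage are assumptions; the formulas themselves are those of [17] quoted above).

```python
import numpy as np

def bilinear_to_ct(Ab, Bb, Cb, Db, lam):
    """Sketch of the discrete-to-continuous mapping below (61), given the
    discrete-time estimate (Ab, Bb, Cb, Db) fitted on the warped axis (59)."""
    n = Ab.shape[0]
    M = np.linalg.inv(Ab + np.eye(n))   # (Abar + I)^{-1}
    A = lam * M @ (Ab - np.eye(n))
    B = np.sqrt(2 * lam) * M @ Bb
    C = np.sqrt(2 * lam) * Cb @ M
    D = Db - Cb @ M @ Bb
    return A, B, C, D
```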

VIII. ALGORITHM

The preceding arguments and derivations are summarised here via a formal specification of the EM-based algorithm proposed in this paper. Initialise i = 0 and θ̂_0, and then perform the following steps.

1) E-STEP
   a) Compute E_{θ̂_i}{X_k | Y_k} and E_{θ̂_i}{X_k X_k^* | Y_k} for k = 1, ..., N via Lemma 2;
   b) Compute Φ, Ψ, Σ, Λ and Ω via Lemma 1.
2) M-STEP
   a) Determine a value η = η̂_{i+1} that reduces Q_Γ(η, R̂(η), θ̂_i) defined in (39), using the gradient and Hessian information from Lemmas 5 and 6 to implement the damped-Newton search (43);
   b) Determine a value β = β̂_{i+1} that reduces Q_∆(β, Q̂(β), θ̂_i) defined in (40), using the gradient and Hessian information from Lemmas 5 and 6 to implement the damped-Newton search (45);
   c) Compute R̂(η̂_{i+1}) and Q̂(β̂_{i+1}) from (33) and (35), respectively;
   d) Set θ̂_{i+1} = { β̂_{i+1}, η̂_{i+1}, vec{Q̂(β̂_{i+1})}, vec{R̂(η̂_{i+1})} }.
3) If converged, as measured by the criterion of choice, then stop. Otherwise set i = i + 1 and repeat.

IX. CONVERGENCE ANALYSIS

The convergence properties of the EM algorithm derive from the fact that Q(θ, θ̂_i) approximates the likelihood L(θ) locally around θ̂_i. For example, via (22), any value of θ that increases Q(θ, θ̂_i) must also increase L(θ). Furthermore, since

∂Q(θ, θ̂_i)/∂θ = (∂/∂θ) ∫ log p_θ(X, Y) p_{θ̂_i}(X | Y) dX
 = ∫ [ p′_θ(X, Y) / p_θ(X, Y) ] p_{θ̂_i}(X | Y) dX
 = ∫ (∂/∂θ)[ p_θ(Y) p_θ(X | Y) ] [ p_{θ̂_i}(X | Y) / p_θ(X, Y) ] dX
 = ∫ p′_θ(Y) p_θ(X | Y) [ p_{θ̂_i}(X | Y) / p_θ(X, Y) ] dX + ∫ p_θ(Y) p′_θ(X | Y) [ p_{θ̂_i}(X | Y) / p_θ(X, Y) ] dX
 = [ p′_θ(Y) / p_θ(Y) ] ∫ p_{θ̂_i}(X | Y) dX + ∫ [ p′_θ(X | Y) / p_θ(X | Y) ] p_{θ̂_i}(X | Y) dX
 = (∂/∂θ) log p_θ(Y) + ∫ p′_θ(X | Y) [ p_{θ̂_i}(X | Y) / p_θ(X | Y) ] dX,

and, on evaluating at θ = θ̂_i, the final term becomes ∫ p′_{θ̂_i}(X | Y) dX = (∂/∂θ) ∫ p_θ(X | Y) dX |_{θ=θ̂_i} = 0, then

∂L(θ)/∂θ |_{θ=θ̂_i} = ∂Q(θ, θ̂_i)/∂θ |_{θ=θ̂_i}   (62)

and hence if the EM algorithm terminates at a point θ̂_i because it is a stationary point of Q(θ, θ̂_i), then it is also a stationary point of the log likelihood L(θ). More formally:

Lemma 7: Let θ̂_{i+1} be generated from θ̂_i by an iteration of the EM Algorithm (23), (24). Then

L(θ̂_{i+1}) ≥ L(θ̂_i)   ∀ i = 0, 1, 2, ...   (63)

with equality if and only if both

Q(θ̂_{i+1}, θ̂_i) = Q(θ̂_i, θ̂_i)   (64)

and

p_{θ̂_{i+1}}(X | Y) = p_{θ̂_i}(X | Y)   (65)

for almost all (with respect to Lebesgue measure) X.
Proof: The proof is identical to that of Theorem 5.1 in [6].

Therefore, since by the assumptions on Θ the log likelihood L(θ) is bounded, by establishing in Lemma 7 that the likelihood values {L(θ̂_i)} are a monotonic sequence, the lemma also establishes that they are a convergent sequence. Furthermore, limit points of the sequence {θ̂_i} are stationary points of the log likelihood, as established via the following lemma.

Lemma 8: Let {θ̂_i} ⊂ Θ be a sequence of estimates generated by the EM Algorithm (23), (24). Then a limit point θ̄ of {θ̂_i} is a stationary point of L(θ), and the sequence {L(θ̂_i)} converges monotonically to L(θ̄).
Proof: See Theorem 5.2 of [6].

A disadvantage of EM-based algorithms is that, in general, it is difficult to theoretically establish that a limit point θ̄ exists without imposing further (strong) conditions, such as the 'classical' one imposed below.

Lemma 9: Let {θ̂_i} ⊂ Θ be a sequence of estimates generated by the EM Algorithm (23), (24), and assume that there exists a τ > 0 such that

Q(θ̂_{i+1}, θ̂_i) − Q(θ̂_i, θ̂_i) ≥ τ ||θ̂_{i+1} − θ̂_i||².   (66)

Then

lim_{i→∞} ||θ̂_i − θ̄||² = 0   (67)

for some θ̄ ∈ Θ.

Proof: The proof is taken from [4], but is reproduced here, since it is so short, in order to have a self contained presentation. Since {L(θ̂_i)} is a Cauchy sequence, for any ε > 0 there exists an N such that for i ≥ N and all ℓ ≥ 1

L(θ̂_{i+ℓ}) − L(θ̂_i) = Σ_{j=1}^ℓ [ L(θ̂_{i+j}) − L(θ̂_{i+j−1}) ] ≤ ε   (68)

and hence by (22) and the assumption (66)

τ Σ_{j=1}^ℓ ||θ̂_{i+j} − θ̂_{i+j−1}||² ≤ Σ_{j=1}^ℓ [ Q(θ̂_{i+j}, θ̂_{i+j−1}) − Q(θ̂_{i+j−1}, θ̂_{i+j−1}) ] ≤ Σ_{j=1}^ℓ [ L(θ̂_{i+j}) − L(θ̂_{i+j−1}) ] ≤ ε   (69)

which establishes (67).

TABLE I
Combinations of system order and input-output dimensions used for experiments on a fixed system (s1) or suites of randomly chosen systems (s2–s5).

Label | System Order | # Inputs | # Outputs | Type
s1    | n = 6        | m = 1    | p = 1     | As per (70)
s2    | n = 2        | m = 1    | p = 1     | Random
s3    | n = 3        | m = 1    | p = 1     | Random
s4    | n = 8        | m = 2    | p = 2     | Random
s5    | n = 18       | m = 3    | p = 3     | Random

X. EXPERIMENTAL STUDY

In this section we demonstrate the potential utility of the EM algorithm by profiling it against state-of-the-art and widely accepted gradient-based methods. To achieve this we consider a number of Monte Carlo simulations with systems ranging from sixth-order, single-input single-output to 18th-order, three-input three-output. More precisely, three algorithms were considered, namely:
• The Expectation-Maximisation algorithm from Section IV; this is denoted EM throughout.
• A standard implementation of the Levenberg–Marquardt algorithm for non-linear least squares, where the objective is to minimise the negative log-likelihood. This is denoted LM throughout.
• Matlab's System Identification Toolbox [12] implementation of the Levenberg–Marquardt algorithm, where the objective is also to minimise the negative log-likelihood. This is denoted SIT throughout.

Note that the EM algorithm discussed in this paper is designed to produce maximum-likelihood estimates for the case of both state and measurement noise. However, the gradient-based methods mentioned above are designed to produce maximum-likelihood estimates only for the case of measurement noise. Therefore, to obtain parity across the different algorithms, all simulations were conducted without state noise, so that the assumed model structures and algorithms are, at least in theory, able to explain the measured data.

The first example we consider is a single-input single-output (SISO) system, whose transfer function is given by

G(s) = 2.021 × 10^{-3} (s + 0.1)(s + 0.2) × 1/[(s + 0.01 + 0.1j)(s + 0.01 − 0.1j)] × 1/[(s + 0.02 + j)(s + 0.02 − j)].   (70)

This system was discretised using a zero-order hold with a one-second sample time, resulting in the system Ḡ(z). Output measurements {Y_1, ..., Y_N}, where N = 500, were obtained via

Y_k = Ḡ(e^{jω_k}) + V_k,   V_k ∼ N_c(0, 0.01),   (71)

with ω_k logarithmically spaced in the interval [10^{-3}, π]. We performed 100 simulations, where the noise and the initial guess at the parameter values were regenerated at the beginning of each run. In particular, the initial guess was randomly generated, but then ensured to correspond to a stable system.

To measure the success of each algorithm, the modelling error of the final estimate θ̂, denoted e(θ̂) and defined via

e(θ̂) ≜ det( (1/N) Σ_{k=1}^N E_k(θ̂) E_k^*(θ̂) ),   E_k(θ̂) ≜ Y_k − Ĝ(e^{jω_k}),   Ĝ(e^{jω_k}) ≜ Ĉ(e^{jω_k} I − Â)^{-1} B̂ + D̂,   (72)

was computed and compared with the determinant of the sample covariance

v ≜ det( (1/N) Σ_{k=1}^N V_k V_k^* ).   (73)

From (71) it follows that if θ̂ is a good estimate then e(θ̂) ≈ v is expected. With this in mind, a failure was flagged if

e(θ̂) > 1.3 v.   (74)

Using this criterion, the number of failures for the first simulation is recorded in Table II under the heading s1. Note that both the EM and LM approaches proved robustly successful, despite all runs starting from initialisations which were 'failures' according to the criterion (74).

In consideration of these initial encouraging results, further Monte Carlo simulations (labelled s2–s5) of randomly chosen systems of increased system dimension were conducted, as described in Table I. Each of these cases was studied in Monte Carlo fashion by generating 100 randomly chosen systems. Different noise realisations and randomly chosen initial guesses for the parameters were also selected for each of these 100 cases. The number of estimation failures (as defined by (74)) is recorded in Table II under the columns labelled s2–s5. For the lower order systems we see that EM performs slightly better than LM, and both perform substantially better than SIT. As the system order is increased, the gap (in terms of the number of failures) begins to grow between EM and LM, indicating a clear performance advantage offered by the EM approach developed here.
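In code, the failure test (72)–(74) is a direct comparison of two determinants; the sketch below assumes numpy and lists of per-frequency arrays, with all names illustrative.

```python
import numpy as np

def failed(Y, Ghat, V, thresh=1.3):
    """Sketch of (72)-(74): Y[k] holds Y_k, Ghat[k] the estimated response
    at w_k, V[k] the noise realisation; flag failure when e > 1.3 v."""
    N = len(Y)
    e = np.linalg.det(sum(
        (Yk - Gk) @ (Yk - Gk).conj().T for Yk, Gk in zip(Y, Ghat)) / N)
    v = np.linalg.det(sum(Vk @ Vk.conj().T for Vk in V) / N)
    return e.real > thresh * v.real
```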

TABLE II
Number of failures for different algorithms (rows) and for different simulations (columns) as defined in Table I.

     | s1 | s2 | s3 | s4 | s5
EM   |  0 |  2 |  4 |  8 | 13
LM   |  0 |  5 |  7 | 12 | 28
SIT  | 27 | 23 | 42 | 57 | 65

Fig. 1. Cantilevered beam experimental apparatus.

Fig. 2. Measured frequency response together with that of the estimated models.


While the above simulations indicate the potential utility of the EM algorithm, it is of course interesting to assess performance on a real physical system. For this purpose, a 2 × 2 MIMO case arising from an aluminium cantilevered beam with piezoelectric actuators and sensors, as shown in Figure 1, is considered. Frequency domain measurements of the response of this apparatus at N = 6337 non-uniformly spaced frequency points between 5 Hz and 500 Hz were obtained by DFT analysis of experiments involving a periodic chirp signal excitation. For this experiment, the most natural model structure is continuous rather than discrete time.

Figure 2 shows the measured frequency response for one of the transfer functions from this two-input, two-output system (the others are omitted for brevity). Figure 2 also shows the model fit for each algorithm, and Figure 3 shows the associated error plot. In each case the system order was selected as n = 12, with m = 2 and p = 2. Furthermore, each algorithm was initialised with the same parameter values, as determined using a frequency-domain subspace algorithm [17]. Note that the EM algorithm clearly outperforms the gradient-based algorithms in this case, in that according to Figure 3 the modelling error is between 10 and 30 dB smaller when employing the EM approach. In particular, note from Figure 2 how the EM method is able to accurately model two lightly damped zeros between 40 and 80 rad/s that are completely missed by both gradient search based implementations considered here. Importantly, not capturing these modelling aspects can have significant implications for the success of subsequent feedback control design [24].

As a final test, again using data from a real physical system, the 'vibrating mechanical system' profiled in [22] is considered, using the measurement data and associated report available at elecftp.vub.ac.be in the subdirectory /Pub/IdentificationData/Report1998nr3. This data consists of 102400 time domain measurements. These were processed by a standard Blackman–Tukey method with a full width Hamming window to provide N = 641 frequency response measurements in the range from 95 Hz to 6200 Hz. Figure 4 shows the associated nonparametric frequency response measurements (solid line) together with (dashed line) a 23rd order model obtained via the EM algorithm described in Section VIII. The EM method was initialised with a subspace-based estimate [17], shown as a dash-dot line. Note that this subspace estimate was deliberately chosen to be quite 'poor' in order to show that the EM algorithm converges from such initial estimates. With some tuning, the subspace algorithm would perform much better than this.

Again, despite the high model order, the wide bandwidth over which a fit is required, and the wide dynamic range of the measurements, a very close fit between measurements and state space model response is achieved. Indeed, the fitting error, shown as a dotted line in Figure 4, is as small as, and in some frequency regions smaller than, that illustrated in [22].


Fig. 3. Modelling error associated with the estimates shown in Figure 2.

Fig. 4. Measured frequency response together with that of the estimated model and the modelling error for the 'vibrating mechanical system'.

XI. EXTENSIONS

As detailed in Section II, the developments in this paper have concentrated on the situation where measurements of the frequency response function are available. This involves the (i, ℓ)'th element of Y(ω_k) ∈ C^{p×m} being the magnitude and phase response at frequency ω_k (in phasor form) of the i'th output to the ℓ'th input and, in turn, the (i, ℓ)'th element of U(ω_k) ∈ C^{m×m} being the measured magnitude and phase at frequency ω_k (again in phasor form) of the i'th input on the ℓ'th experiment. This format has been assumed because it so commonly occurs in practice, principally by the design of commercial equipment specifically targeted at obtaining frequency response function measurements.

An important alternative scenario is one whereby the available measurements are the discrete Fourier transforms (DFTs) of the observed time domain output y_t ∈ R^p and input u_t ∈ R^m. In the SISO case of p = m = 1, since this paper does not constrain U(ω_k) to be the identity, and since only one experiment is involved, this situation also fits within the assumed measurement framework of this paper. Furthermore, with the definition of the N point DFT Y(ω) of the time domain signal y_t as

Y(ω) = (1/√N) Σ_{t=1}^N y_t e^{−jωt},   (75)

then with the definition I_x(ω) = N^{−1/2} ( x_{N+1} e^{−jωN} − x_1 ), the first row of the model structure (2) becomes

e^{jω_k} X_k = A X_k + B U_k + I_x(ω_k) + W_k   (76)
            = A X_k + B̃ Ũ_k + W_k   (77)

where B̃ = [B, I_x(ω_k)] and Ũ_k = [U_k^T, 1]^T. With the further (commonly occurring) specialisation that the frequency points satisfy ω_k = 2πk/N, I_x is real valued and frequency independent. This implies that, by using the augmented B̃ and Ũ_k in (77), any distorting effects of non-periodic inputs can be avoided by estimating the term I_x accounting for them. This will involve minor but straightforward modifications to the maximisation step profiled in Section VI.

In the more general MIMO case, this same extension allowing for estimation of edge effects can still be applied. However, more extensive (but still straightforward) modifications will be required. For example, the model (3) will need to be altered by removing the Kronecker product terms, and the flow-on effects of this in the E-step and M-step will need to be accounted for. We thank an anonymous reviewer for bringing this useful extension to our attention.

XII. CONCLUSIONS

This paper has profiled a new method for estimation from frequency domain data that is based on the EM algorithm. As illustrated empirically, the new technique developed here has the advantages of robustness to capture in local minima, and the ability to more accurately estimate higher dimensional systems (possibly over wide bandwidths) than conventional gradient based search approaches. Planned future work by the authors will be directed at analysing and understanding the convergence properties of the method and the factors that influence its convergence rate. Readers interested in evaluating the methods presented here are invited to download the MATLAB-based system identification toolbox available at sigpromu.org/idtoolbox, which implements them.

Acknowledgments: The authors would like to thank Prof. Reza Moheimani and Dr. Andrew Fleming for providing the cantilevered beam experimental data and for their insights into aspects of its modelling.

APPENDIX I
PROOF OF LEMMA 1

Proof: Using the fact that Y_k is independent of Y_ℓ for k ≠ ℓ, and similarly for X_k and X_ℓ, then according to the definition of conditional probability

p_θ(Y_1, ..., Y_N, X_1, ..., X_N) = ∏_{k=1}^N p_θ(Y_k, X_k)   (78)
 = ∏_{k=1}^N p_θ(Y_k | X_k) p_θ(X_k).   (79)

The above densities are given by

p_{θ̂_i}(Y_k | X_k) = (1/( π^{mp} det(I_m ⊗ R) )) exp{ −|| (I_m ⊗ R)^{−1/2} vec{Y_k − ΓZ_k} ||² },   (80)

and

p_{θ̂_i}(X_k) = (1/( π^{mn} det(I_m ⊗ A_k^{-1} Q A_k^{-*}) )) exp{ −|| ( I_m ⊗ A_k^{-1} Q A_k^{-*} )^{−1/2} vec{ X_k − A_k^{-1} B U_k } ||² }.   (81)

Using the fact that (I_m ⊗ R) is block diagonal, det(I_m ⊗ R) = det(R)^m. Furthermore, with V_k ≜ Y_k − ΓZ_k, Ṽ_k ≜ vec{V_k}, and V_k^i being the i'th column of V_k,

|| (I_m ⊗ R)^{−1/2} Ṽ_k ||² = Ṽ_k^* (I_m ⊗ R)^{-1} Ṽ_k   (82)
 = Σ_{i=1}^m (V_k^i)^* R^{-1} V_k^i   (83)
 = Tr{ R^{-1} Σ_{i=1}^m V_k^i (V_k^i)^* }   (84)
 = Tr{ R^{-1} V_k V_k^* }.   (85)

Therefore

−Σ_{k=1}^N log p_θ(Y_k | X_k) = mN log|R|   (86)
 + Σ_{k=1}^N Tr{ R^{-1} (Y_k − ΓZ_k)(Y_k − ΓZ_k)^* }.   (87)

Applying the conditional expectation operator E_{θ̂_i}{ · | Y_k } to both sides then provides the expression (26) for Q_Γ(η, R, θ̂_i), with the expression (27) for Q_∆(β, Q, θ̂_i) being obtained in an identical manner.

APPENDIX II
PROOF OF LEMMA 5

Proof: Substituting (33) into (26) provides

Q_Γ(η, R̂(η)) = mN log| R̂(η) | + c,   (88)

where c is a constant. Direct application of Lemma 14 in Appendix V results in the expression (46) for g_Γ(η). In a similar manner, substituting (35) into Q_∆(β, Q) provides

Q_∆(β, Q̂(β)) = f_1(β) + f_2(β),   (89)
f_1(β) ≜ mN log| Q̂(β) | + c,   (90)
f_2(β) ≜ −m Σ_{k=1}^N ( log|A_k| + log|A_k^*| ),   (91)

so that, again using Lemma 14 in Appendix V,

∂f_1(β)/∂β = 2Re{ vec{ Q̂(β)^{-1} [∆Σ − Ω] } }.   (92)

Furthermore, by Lemma 13,

∂f_2(β)/∂β = −m Σ_{k=1}^N [ ∂vec{A_k}/∂β^T ]^T vec{A_k^{-T}} − m Σ_{k=1}^N [ ∂vec{A_k^*}/∂β^T ]^T vec{A_k^{-C}}   (93)

so that

∂f_2(β)/∂β = m Σ_{k=1}^N [ vec{A_k^{-T}} ; ∅_{nm} ] + m Σ_{k=1}^N [ K^T vec{A_k^{-C}} ; ∅_{nm} ] = 2m Re{ Σ_{k=1}^N [ vec{A_k^{-T}} ; ∅_{nm} ] }.   (94)

APPENDIX III
PROOF OF LEMMA 6

Proof: The expression (48) for the Hessian with respect to η is established by direct application of Lemma 14 in Appendix V. To establish (49), recall the definitions of f_1 and f_2 made in (90), (91) respectively, which allow the Hessian with respect to β to be written as

∂²Q_∆(β, Q̂(β))/∂β∂β^T = ∂²f_1(β)/∂β∂β^T + ∂²f_2(β)/∂β∂β^T ≜ H_1(β) + H_2(β).   (95)

Use of Lemma 14 then provides

H_1(β) = 2Re{ Σ ⊗ Q̂(β)^{-1} − mN J_∆^T(β) [ Q̂(β)^{-1} ⊗ Q̂(β)^{-T} ] J_∆^C(β) }.   (96)

Furthermore,

H_2(β) = ∂/∂β [ ∂f_2(β)/∂β ]^T   (97)
 = m ∂/∂β Σ_{k=1}^N ( [ vec{A_k^{-T}}^T , ∅_{1×nm} ] + [ vec{A_k^{-*}}^T , ∅_{1×nm} ] )   (98)
 = −m Σ_{k=1}^N [ Z_k , ∅_{n²×nm} ; ∅_{nm×n²} , ∅_{nm×nm} ]   (99), (100)

where

Z_k ≜ ( A_k^{-1} ⊗ A_k^{-T} ) ∂vec{A_k}/∂β^T + ( A_k^{-C} ⊗ A_k^{-*} ) ∂vec{A_k^*}/∂β^T,

with

∂vec{A_k}/∂β^T = ∂vec{ e^{jω_k} I − A }/∂β^T = −[ I , ∅_{n²×nm} ],
∂vec{A_k^*}/∂β^T = ∂vec{ e^{−jω_k} I − A^T }/∂β^T = −[ K , ∅_{n²×nm} ],

(K being the commutation matrix defined in (134)), so that

H_2(β) = m Σ_{k=1}^N [ ( A_k^{-1} ⊗ A_k^{-T} )K + ( A_k^{-C} ⊗ A_k^{-*} )K , ∅_{n²×nm} ; ∅_{nm×n²} , ∅_{nm×nm} ]
 = [ 2m Re{ Σ_{k=1}^N A_k^{-1} ⊗ A_k^{-T} } K , ∅_{n²×nm} ; ∅_{nm×n²} , ∅_{nm×nm} ].   (101)

APPENDIX IV
PROOF OF LEMMA 4

Proof: Define, for a matrix X of suitable size, x ≜ vec{X} and

f(x) ≜ Q_Γ(x, R̂(x))   (102)

with the latter defined in (39), so that

f(x) = mN log det{ (1/(mN)) [ Φ − XΨ^* − ΨX^T + XΣX^T ] } + mpN.

Let y ≜ vec{Y}. Then f(y) ≤ f(x) is equivalent to (since log(·) is monotonic)

det{ (1/(mN)) [ Φ − YΨ^* − ΨY^T + YΣY^T ] } ≤ det{ (1/(mN)) [ Φ − XΨ^* − ΨX^T + XΣX^T ] }.   (103)

However, since the arguments of det(·) are positive definite and Hermitian, according to [9, Theorem 7.2.5] a necessary and sufficient condition for (103) is

Φ − YΨ^* − ΨY^T + YΣY^T ⪯ Φ − XΨ^* − ΨX^T + XΣX^T   (104)

where ⪯ denotes the usual partial ordering on positive semidefinite matrices [9]. In turn, (104) is equivalent to

YΣY^T − XΣX^T ⪯ YΨ^* + ΨY^T − XΨ^* − ΨX^T.   (105)

Considering now the right hand side of (44), Lemma 5 provides an expression for the gradient of f as

∇f(x) = 2Re{ vec{ R̂^{-1}(x) [XΣ − Ψ] } },   (106)

so that

∇f(x)^T (y − x) = 2Re{ vec{ R̂^{-1}(x) [XΣ − Ψ] } }^T (y − x)
 = vec{ R̂^{-1}(x) [XΣ − Ψ] }^T (y − x) + vec{ R̂^{-T}(x) [XΣ^T − Ψ^C] }^T (y − x)
 = Tr{ [XΣ − Ψ]^* R̂^{-1}(x)(Y − X) } + Tr{ (Y − X)^T R̂^{-1}(x) [XΣ − Ψ] }
 = Tr{ R̂^{-1}(x)(Y − X)[XΣ − Ψ]^* } + Tr{ R̂^{-1}(x)[XΣ − Ψ](Y − X)^T }
 = Tr{ R̂^{-1}(x) ( YΣX^T + XΣY^T − 2XΣX^T − YΨ^* − ΨY^T + XΨ^* + ΨX^T ) }.

Since R̂^{-1}(x) is positive definite, ∇f(x)^T (y − x) ≤ 0 if and only if

YΣX^T + XΣY^T − 2XΣX^T − YΨ^* − ΨY^T + XΨ^* + ΨX^T ⪯ 0.   (107)

However,

0 ⪯ (Y − X) Σ (Y − X)^T = YΣY^T + XΣX^T − YΣX^T − XΣY^T   (108)

which can be rewritten as

YΣX^T + XΣY^T − 2XΣX^T ⪯ YΣY^T − XΣX^T,   (109)

so that a sufficient condition for (107) to hold is that

YΣY^T − XΣX^T − YΨ^* − ΨY^T + XΨ^* + ΨX^T ⪯ 0.   (110)

However, as just established via (105), this is a consequence of the assumption that f(y) ≤ f(x), which completes the proof by using the property (44).

APPENDIX V
AUXILIARY RESULTS

Lemma 10: Suppose

[X; Y] ∼ N( [µ_X; µ_Y], [P_X, P_XY; P_YX, P_Y] ).   (111)

Then

(X | Y) ∼ N( µ_{X|Y}, P_{X|Y} )   (112)

where

µ_{X|Y} ≜ µ_X + P_XY P_Y^{-1} (Y − µ_Y),   P_{X|Y} ≜ P_X − P_XY P_Y^{-1} P_YX.   (113)

Proof: See Theorem 2.13 of [10].

Lemma 11: Suppose

[X; Y] ∼ N_c( [µ_X; µ_Y], [P_X, P_XY; P_YX, P_Y] ).   (114)

Then

(X | Y) ∼ N_c( µ_{X|Y}, P_{X|Y} )   (115)

where

µ_{X|Y} ≜ µ_X + P_XY P_Y^{-1} (Y − µ_Y),   P_{X|Y} ≜ P_X − P_XY P_Y^{-1} P_YX.   (116)

Proof: While this result is expected given Lemma 10, it is by no means trivially obvious given the covariance restrictions (4) necessary for a complex Gaussian distribution. Furthermore, we have been unable to find a statement (let alone a proof) of this result in the literature, and so provide it here, since it is central to the development. We thank an anonymous reviewer for providing the proof below, which is much more streamlined than our first drafted version.

According to the complex normal specification (4),

[ ⟦X⟧; ⟦Y⟧ ] ∼ N( [ ⟦µ_X⟧; ⟦µ_Y⟧ ], Σ̄ )

where ⟦·⟧ is the operator defined in Lemma 12 below and

Σ̄ ≜ (1/2) [ ⟦P_X⟧, ⟦P_XY⟧; ⟦P_YX⟧, ⟦P_Y⟧ ].   (117)

Applying Lemma 10 then yields

( ⟦X⟧ | ⟦Y⟧ ) ∼ N( µ, P )   (118)

where

µ ≜ ⟦µ_X⟧ + ⟦P_XY⟧ ⟦P_Y⟧^{-1} ( ⟦Y⟧ − ⟦µ_Y⟧ )   (119)

and

P ≜ (1/2) ⟦P_X⟧ − (1/2) ⟦P_XY⟧ ⟦P_Y⟧^{-1} ⟦P_YX⟧.   (120)

Applying Lemma 12 repeatedly to (119) then delivers

µ = ⟦µ_X⟧ + ⟦P_XY⟧ ⟦P_Y^{-1}⟧ ( ⟦Y⟧ − ⟦µ_Y⟧ )   (121)
 = ⟦µ_X⟧ + ⟦P_XY P_Y^{-1}⟧ ( ⟦Y⟧ − ⟦µ_Y⟧ )   (122)
 = ⟦ µ_X + P_XY P_Y^{-1} (Y − µ_Y) ⟧.   (123)

A similar repeated application of Lemma 12 to (120) provides

P = (1/2) ⟦ P_X − P_XY P_Y^{-1} P_YX ⟧.   (124)

Therefore, with the formulations (115) and (116) for the mean and covariance, the real and imaginary parts of X | Y satisfy the distributional conditions (4) necessary for the complex valued X | Y to obey a complex normal density.

Lemma 12: Let n > 0, X ∈ C^{n×1} and A, B ∈ C^{n×n} be arbitrary save that A is invertible. Define the operator ⟦·⟧ acting on vectors and matrices such as these via

⟦X⟧ ≜ [ Re{X}; Im{X} ],   ⟦A⟧ ≜ [ Re{A}, −Im{A}; Im{A}, Re{A} ].   (125)

Then

⟦A + B⟧ = ⟦A⟧ + ⟦B⟧,   ⟦A⟧⟦B⟧ = ⟦AB⟧,   (126)
⟦A⟧⟦X⟧ = ⟦AX⟧,   ⟦A^{-1}⟧ = ⟦A⟧^{-1}.   (127)

Proof: See [21, Lemma 13.3].

Lemma 13: Suppose M, N ∈ R^{n×n} and M is invertible. Then

∂log|M|/∂M = M^{-T},   ∂Tr{M^{-1}N}/∂M = −M^{-T} N^T M^{-T},

and

dvec{M^{-1}}/dvec{M}^T = −M^{-T} ⊗ M^{-1}.

Proof: See [14].

Lemma 14: Suppose X ∈ R^{p×q}, x ≜ vec{X}, A = A^* ∈ C^{p×p}, B ∈ C^{p×q}, C = C^* ∈ C^{q×q}, α > 0, and a function f is defined as

f(x) ≜ α log|D(x)|,   (128)
D(x) ≜ (1/α) [ A − XB^* − BX^T + XCX^T ].   (129)

Then

∂f(x)/∂x = 2Re{ vec{ D^{-1}(x)(XC − B) } },   (130)

and

∂²f(x)/∂x∂x^T = 2Re{ C ⊗ D^{-1}(x) − α E^T(x) [ D^{-1}(x) ⊗ D^{-T}(x) ] E^C(x) },   (131), (132)

where

E(x) ≜ ∂vec{D(x)}/∂x^T = (1/α) [ ( XC^T − B^C ) ⊗ I_p + ( I_p ⊗ [XC − B] ) K ]   (133)

and K is a commutation matrix, which satisfies

K vec{X} = vec{X^T},   K^T vec{X^T} = vec{X}.   (134)

Proof: By Lemma 13,

∂f(x)/∂x = α [ ∂vec{D(x)}/∂x^T ]^T vec{ D(x)^{-T} }   (135)

where, using the identity vec{WVZ} = (Z^T ⊗ W) vec{V},

∂vec{D(x)}/∂x^T = (1/α) ∂/∂x^T [ vec{A} − vec{XB^*} − vec{BX^T} + vec{XCX^T} ]   (136)
 = −(1/α) [ (B^C ⊗ I) ∂vec{X}/∂x^T + (I ⊗ B) ∂vec{X^T}/∂x^T ] + (1/α) [ (XC^T ⊗ I) ∂vec{X}/∂x^T + (I ⊗ XC) ∂vec{X^T}/∂x^T ]   (137), (138)
 = −(1/α) [ (B^C ⊗ I) + (I ⊗ B)K − (XC^T ⊗ I) − (I ⊗ XC)K ]   (139)
 = (1/α) [ ( XC^T − B^C ) ⊗ I + ( I ⊗ [XC − B] ) K ].   (140)

Substitution of (140) into (135) then delivers

∂f(x)/∂x = [ ( [XC^T − B^C]^T ⊗ I ) + K^T ( I ⊗ [C^T X^T − B^T] ) ] vec{ D(x)^{-T} }   (141)
 = vec{ D(x)^{-T} [XC^T − B^C] } + K^T vec{ [C^T X^T − B^T] D(x)^{-T} }   (142)
 = vec{ D(x)^{-T} [XC^T − B^C] + D(x)^{-1} [XC − B] }   (143)
 = 2Re{ vec{ D(x)^{-1} [XC − B] } },   (144)

where the facts that both D(x) and C are Hermitian were used in progressing to the last line. Using the expression (143),

∂²f(x)/∂x∂x^T = ∂/∂x^T vec{ D(x)^{-T} [XC^T − B^C] + D(x)^{-1} [XC − B] }
 = ( C^T ⊗ D(x)^{-1} ) + ( C ⊗ D(x)^{-T} ) − α [ ∂vec{D(x)}/∂x^T ]^T [ D(x)^{-1} ⊗ D(x)^{-T} ] [ ∂vec{D(x)}/∂x^T ]^C.

Substitution of the definition (133) for E(x) into this expression then completes the proof.
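The complex-to-real embedding ⟦·⟧ of Lemma 12 is easy to verify numerically; the following sketch (numpy assumed, names illustrative) spot-checks the homomorphism properties (126)–(127).

```python
import numpy as np

def embed_vec(x):
    """[Re{x}; Im{x}] per (125)."""
    return np.concatenate([x.real, x.imag])

def embed_mat(A):
    """[[Re{A}, -Im{A}], [Im{A}, Re{A}]] per (125)."""
    return np.block([[A.real, -A.imag], [A.imag, A.real]])

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))
x = rng.standard_normal(3) + 1j * rng.standard_normal(3)
assert np.allclose(embed_mat(A) @ embed_vec(x), embed_vec(A @ x))
assert np.allclose(embed_mat(A) @ embed_mat(A), embed_mat(A @ A))
```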

REFERENCES

[1] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[2] David R. Brillinger. Time Series: Data Analysis and Theory. Holden-Day, 1981.
[3] Thomas Cover and Joy Thomas. Elements of Information Theory. Wiley, 1991.
[4] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1–38, 1977.
[5] J.E. Dennis and Robert B. Schnabel. Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Prentice Hall, 1983.
[6] Stuart Gibson and Brett Ninness. Robust maximum-likelihood estimation of multivariable dynamic systems. Automatica, 41(10):1667–1682, 2005.
[7] Stuart Gibson, Adrian Wills, and Brett Ninness. Maximum-likelihood parameter estimation of bilinear systems. IEEE Transactions on Automatic Control, 50(10):1581–1596, 2005.
[8] Haitham Hindi. A tutorial on convex optimization. In Proceedings of the 2004 American Control Conference, pages 3252–3262, 2004.
[9] R. Horn and C. Johnson. Matrix Analysis. Cambridge University Press, Cambridge, 1985.
[10] Andrew H. Jazwinski. Stochastic Processes and Filtering Theory. Academic Press, 1970.
[11] J.N. Juang. Applied System Identification. Prentice Hall, 1994.
[12] Lennart Ljung. MATLAB System Identification Toolbox Users Guide, Version 6. The MathWorks, 2004.
[13] J.R. Magnus and H. Neudecker. Matrix Differential Calculus with Applications in Statistics and Econometrics. Wiley, 1988.
[14] Jan R. Magnus and Heinz Neudecker. Matrix Differential Calculus with Applications in Statistics and Econometrics. Wiley, Chichester, UK, 1988.
[15] Tomas McKelvey. Frequency domain identification. In Proceedings of SYSID 2000, Santa Barbara, California, 2000.
[16] Tomas McKelvey. Frequency domain identification methods. Circuits, Systems and Signal Processing, 21(1):39–55, 2002. Special tutorial issue on system identification.
[17] Tomas McKelvey, Hüseyin Akçay, and Lennart Ljung. Subspace-based multivariable system identification from frequency response data. IEEE Transactions on Automatic Control, 41:960–979, 1996.
[18] R. Pintelon, P. Guillaume, Y. Rolain, J. Schoukens, and H. Van hamme. Parametric identification of transfer functions in the frequency domain – a survey. IEEE Transactions on Automatic Control, 39(11):2245–2260, 1994.
[19] J. Nocedal and S. Wright. Numerical Optimization. Springer-Verlag, New York, 1999.
[20] R. Pintelon and I. Kollar. On the frequency scaling in continuous-time modeling. IEEE Transactions on Instrumentation and Measurement, 54(1):318–321, 2005.
[21] R. Pintelon and J. Schoukens. System Identification: A Frequency Domain Approach. IEEE Press, 2001.
[22] J. Schoukens, G. Vandersteen, R. Pintelon, and P. Guillaume. Frequency domain identification of linear systems using arbitrary excitations and a nonparametric noise model. IEEE Transactions on Automatic Control, 44(2):343–347, February 1999.
[23] Alexander Weinmann. Uncertain Models and Robust Control. Springer-Verlag, New York, 1991.
[24] A.G. Wills, D. Bates, A. Fleming, B. Ninness, and R. Moheimani. Application of MPC to an active structure using sampling rates up to 25 kHz. In Proceedings of CDC/ECC'05, December 2005.