Inferring the eigenvalues of covariance matrices from limited, noisy data

Richard Everson and Stephen Roberts
Department of Electrical and Electronic Engineering, Imperial College of Science, Technology & Medicine, London, UK.
Email: [email protected]

October 5, 1998

Abstract

The eigenvalue spectrum of covariance matrices is of central importance to a number of data analysis techniques. Usually the sample covariance matrix is constructed from a limited number of noisy samples. We describe a method of inferring the true eigenvalue spectrum from the sample spectrum. Results of Silverstein, which characterise the eigenvalue spectrum of the noise covariance matrix, and inequalities between the eigenvalues of Hermitian matrices are used to infer probability densities for the eigenvalues of the noise-free covariance matrix, using Bayesian inference. Posterior densities for each eigenvalue are obtained, which yield error estimates. The evidence framework gives estimates of the noise variance and permits model order selection by estimating the rank of the covariance matrix. The method is illustrated with numerical examples.

Keywords: sample covariance, eigenvalue spectrum, Bayesian evidence, model order selection.
EDICS: SP 4.1.4
Corresponding author: Richard Everson, Department of Electrical and Electronic Engineering, Imperial College of Science, Technology and Medicine, Exhibition Road, London SW7 2BT, United Kingdom. Phone: +44 171 594 6220. Fax: +44 171 823 8125. Email: [email protected]


1 Introduction

The covariance matrix and its spectrum of eigenvalues are of great interest in the analysis and modelling of experimental data. Principal Components Analysis [5, 6], the Karhunen-Loeve decomposition [7, 10] and related techniques such as Independent Components Analysis [2] model data vectors $x(t) \in \mathbb{R}^N$ as an admixture of decorrelated (or, in the case of ICA, statistically independent) sources $s(t) \in \mathbb{R}^M$ ($M \le N$), which are linearly mixed by $A \in \mathbb{R}^{N \times M}$; thus
$$x(t) = A s(t). \qquad (1)$$

Before the data are examined the number of sources, $M$, is usually unknown, and determination of $M$ is a model order selection problem [13, 12]. The number may, in principle at least, be deduced from the rank of the covariance matrix
$$C_x = \left\langle x(t) x^{T}(t) \right\rangle \;\approx\; \hat{C}_x = \frac{1}{T} \sum_{t=1}^{T} x(t) x^{T}(t), \qquad (2)$$
where $\langle \cdot \rangle$ denotes the expectation and $\hat{C}_x$ is the sample covariance matrix for an ensemble of $T$ data vectors. In the absence of noise

$$C_x = A \left\langle s s^{T} \right\rangle A^{T} = A A^{T}, \qquad (3)$$
which clearly has rank equal to the number of sources. In this case the eigenvalue spectrum of $\hat{C}_x$ comprises $M$ positive eigenvalues and $N - M$ zeros: $\omega_1 \ge \omega_2 \ge \dots \ge \omega_M > \omega_{M+1} = \dots = \omega_N = 0$. Methods such as ICA, which assume that the sources are statistically independent, use higher order (i.e., greater than second order) statistics to estimate $A^{-1}$ [2, 9, 3]. Note however that the number of sources is still determined by the rank of $AA^{T}$, because statistical independence implies linear decorrelation.

Inevitably the data are contaminated by noise, and equation (1) might be replaced by
$$x(t) = A s(t) + \sigma \nu(t), \qquad (4)$$
where $\nu(t) \in \mathbb{R}^N$ denotes noise vectors whose elements are independently and identically distributed (i.i.d.), with mean zero and unit variance. Consequently

$$C_x = AA^{T} + 2\sigma A \left\langle s \nu^{T} \right\rangle + \sigma^2 C_\nu, \qquad (5)$$
where $C_\nu$ is the covariance matrix of the noise. Since the noise and the signal are assumed to be uncorrelated, the measured covariance is the sum of $AA^{T}$ and the noise covariance:
$$C_x = AA^{T} + \sigma^2 C_\nu. \qquad (6)$$
When the data are plentiful, so that covariance matrices are well approximated by averages over the observed ensemble, we have
$$C_x = AA^{T} + \sigma^2 I. \qquad (7)$$

The effect of the noise is therefore merely to raise all the eigenvalues of $C_x$ by $\sigma^2$, so that $C_x$ now has full rank and $N - M$ eigenvalues equal to $\sigma^2$. The noise variance is readily found from the smallest $\omega_n$ and so the rank of $AA^{T}$ is easily determined. The subject of this paper is the determination of the eigenvalue spectrum $\{\lambda_n\}$ of $AA^{T}$ when the number of observations $T$ is limited, in which case the noise covariance matrix is not diagonal. Briefly, the results of Silverstein [15] characterising the eigenvalue spectrum of the noise covariance matrix, together with inequalities between the eigenvalues of Hermitian matrices, are used to infer probability densities for the eigenvalues of $AA^{T}$ by Bayesian inference. Section 2 summarises Silverstein's work on the eigenvalues of sample covariance matrices; these results are incorporated in a Bayesian model in section 3. The use of the evidence to infer the number of non-zero eigenvalues and the noise variance is discussed in section 4.
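As a concrete illustration of the problem (this snippet is an editorial sketch, not part of the original analysis; the sizes, seed and names are arbitrary), one can draw data from the model (4) and compare the sample spectrum with the naive plentiful-data prediction $\lambda_n + \sigma^2$ of equation (7):

```python
# Toy illustration of eq. (4): with limited T the sample eigenvalues omega_n
# are not simply the eigenvalues of A A^T raised by sigma^2 as in eq. (7).
import numpy as np

rng = np.random.default_rng(1)
N, M, T, sigma = 10, 6, 100, 1.0
A = rng.standard_normal((N, M))
S = rng.standard_normal((M, T))                         # decorrelated, unit-variance sources
X = A @ S + sigma * rng.standard_normal((N, T))         # observations x(t)
omega = np.sort(np.linalg.eigvalsh(X @ X.T / T))[::-1]  # sample spectrum, eq. (2)
lam = np.sort(np.linalg.eigvalsh(A @ A.T))[::-1]        # true eigenvalues of A A^T
print("omega_n:            ", np.round(omega, 2))
print("lambda_n + sigma^2: ", np.round(lam + sigma**2, 2))
```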


2 Eigenvalues of sample covariance matrices

Silverstein [15] has proved a remarkable result characterising the eigenvalue spectrum of a sample covariance matrix:
$$\hat{C}_\nu = \frac{1}{T} \sum_{t=1}^{T} \nu(t) \nu^{T}(t). \qquad (8)$$
Denote by $F_\nu(\xi)$ the distribution function of the eigenvalues of $\hat{C}_\nu$, so that for every $\xi$, $F_\nu(\xi) = \frac{1}{N} \times$ (number of eigenvalues of $\hat{C}_\nu \le \xi$). If there exists a $\delta > 0$ such that $\mathbb{E}|\nu|^{2+\delta} < \infty$, then Silverstein's result says that $F_\nu(\xi)$ converges to a nonrandom limit as $N \to \infty$. Let $y = N/T$ and $b_\pm = (1 \pm \sqrt{y})^2$; then when $0 < y \le 1$ the limiting eigenvalue density is
$$f_\nu(\xi) = \begin{cases} \dfrac{1}{2\pi y \xi} \sqrt{(\xi - b_-)(b_+ - \xi)} & \text{if } b_- < \xi < b_+ \\ 0 & \text{otherwise,} \end{cases} \qquad (9)$$
and for $1 < y \le \infty$,
$$F_\nu(\xi) = \left(1 - \frac{1}{y}\right) I_{[0,\infty)}(\xi) + \frac{1}{y} \int_{b_-}^{\xi} f_\nu(t)\, dt, \qquad (10)$$
where $I_S(\xi) = 1$ if $\xi \in S$ and zero otherwise. The first term on the right hand side of (10) represents the $N - T$ zero eigenvalues which must occur when there are fewer samples than the dimension of the sample vectors. This is commonly the case in the analysis of ensembles of images, each of which has a great many pixels [17]. When the number of samples is very large, so that $y \ll 1$, equation (9) reproduces the usual approximation that the eigenvalues of $\hat{C}_\nu$ are all unity. However, as $y$ approaches 1 the smallest eigenvalue decreases towards zero (being equal to $b_-$) and the largest eigenvalue (equal to $b_+$) increases. It is worthy of note that in this regime both the mean and the mode of $f_\nu$ are greater than 1, so that in addition to spreading the range of the eigenvalues, limited sampling inflates the effect of noise. For $y = 1$ there is a single zero eigenvalue, and as $y$ becomes large all of the $T$ non-zero eigenvalues approach $y$. Here again, a limited number of samples magnifies the apparent noise.

Although Silverstein's theorem is true in the limit $N \to \infty$, numerical experiments show that it is a good approximation even for $N$ as small as 10. As an illustration, we display in figure 1a the eigenvalues of a single sample covariance matrix constructed from $T = 100$ random vectors of dimension $N = 10$, so that $y = 0.1$. The elements of the random vectors are Gaussian distributed with zero mean and unit variance. Also shown is the line $\xi$ versus $n = N(1 - F_\nu(\xi))$, which gives the expected number $n$ of eigenvalues greater than $\xi$. As the graph shows, there is fairly good agreement between Silverstein's asymptotic result and the eigenvalues of this single, low-dimensional realisation. The figure also illustrates that the largest eigenvalue is substantially larger than the noise variance ($\sigma^2 = 1$). Figure 1b shows the 10 non-zero eigenvalues from a covariance matrix with $y = 10$, constructed from $T = 10$ random vectors ($\sigma^2 = 1$) each of dimension 100. Again there is reasonable agreement with the asymptotic result, and the eigenvalues are located around $y$, in this case 10 times the noise variance.
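The comparison in figure 1a is easy to reproduce. The sketch below is an editorial illustration (not the authors' Matlab code; grid, seed and names are arbitrary): it evaluates the limiting density (9) and sets the support $[b_-, b_+]$ against the eigenvalues of a single random realisation.

```python
# Limiting density f_nu of eq. (9) for y = N/T <= 1, compared with one
# realisation of the sample covariance of unit-variance Gaussian noise.
# (For y > 1, eq. (10) adds a point mass 1 - 1/y at zero.)
import numpy as np

def silverstein_density(xi, y):
    b_minus, b_plus = (1 - np.sqrt(y))**2, (1 + np.sqrt(y))**2
    xi = np.asarray(xi, dtype=float)
    f = np.zeros_like(xi)
    inside = (xi > b_minus) & (xi < b_plus)
    f[inside] = np.sqrt((xi[inside] - b_minus) * (b_plus - xi[inside])) \
                / (2 * np.pi * y * xi[inside])
    return f

N, T = 10, 100                                          # y = 0.1, as in figure 1a
rng = np.random.default_rng(0)
noise = rng.standard_normal((N, T))
eigs = np.sort(np.linalg.eigvalsh(noise @ noise.T / T))[::-1]
print("sample eigenvalues:", np.round(eigs, 3))
print("support [b-, b+]  :", (1 - np.sqrt(N/T))**2, (1 + np.sqrt(N/T))**2)
```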

3 Eigenvalues of $AA^{T}$

Although the covariance matrices add,

$$\hat{C}_x = AA^{T} + \sigma^2 \hat{C}_\nu, \qquad (11)$$
the eigenvalues $\omega_n$ of $\hat{C}_x$ are nonlinear functions of $\lambda_n$ and $\nu_n$, the eigenvalues of $AA^{T}$ and $\hat{C}_\nu$ respectively. Since all the matrices involved are symmetric, bounds on the $\omega_n$ are given by inequalities attributed to Weyl (see, for example, [4, 11]):
$$\lambda_{n+N-j} + \sigma^2 \nu_j \;\le\; \omega_n \;\le\; \lambda_{n-k+1} + \sigma^2 \nu_k, \qquad (12)$$
where $1 \le k \le n \le j \le N$. When $N \le T$, so that $\hat{C}_\nu$ is of full rank, Silverstein's result shows that the $\nu_i$ are bounded by
$$\sigma^2 b_- \;\le\; \sigma^2 \nu_i \;\le\; \sigma^2 b_+. \qquad (13)$$
Combining (12) and (13) gives bounds on $\omega_n$:
$$\lambda_n + \sigma^2 b_- \;\le\; \omega_n \;\le\; \lambda_n + \sigma^2 b_+. \qquad (14)$$

If there are fewer than $N$ samples, $\hat{C}_\nu$ has $N - T$ zero eigenvalues, and the upper and lower bounds in (14) each become a pair of inequalities, either of which may provide the tightest bound. Although equation (14) provides rigorous bounds on $\omega_n$ (and therefore on $\lambda_n$ given $\omega_n$), better, probabilistic, estimates may be obtained by considering the probability densities of the $\nu_i$.
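The interval (14) is straightforward to evaluate; the following minimal sketch (editorial, for the full-rank case $N \le T$ only) returns the rigorous lower and upper bounds on each $\omega_n$ for a candidate spectrum $\lambda$ and noise variance $\sigma^2$.

```python
# Rigorous bounds (14) on the observed eigenvalues, full-rank case N <= T;
# when N > T each bound splits into a pair of inequalities, as noted above.
import numpy as np

def omega_bounds(lam, sigma2, N, T):
    y = N / T
    b_minus, b_plus = (1 - np.sqrt(y))**2, (1 + np.sqrt(y))**2
    lam = np.asarray(lam, dtype=float)
    return lam + sigma2 * b_minus, lam + sigma2 * b_plus
```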

3.1 Bayesian Inference

In order to estimate probability density functions for the $\lambda_n$ we adopt a Bayesian point of view, using Bayes' rule in the form

$$p(\lambda \mid \omega, \theta) = \frac{p(\omega \mid \lambda, \theta)\, \pi(\lambda \mid \theta)}{p(\omega \mid \theta)}, \qquad (15)$$

where $\lambda, \omega \in \mathbb{R}^N$ denote the vectors formed from the $\lambda_n$ and $\omega_n$. Prior belief about the density of eigenvalues of $AA^{T}$, given a vector of parameters $\theta$, is embodied in the prior density $\pi(\lambda \mid \theta)$. The parameters in this problem are $\theta = (\sigma^2, M)^{T}$: the (usually unknown) noise variance and the rank of $AA^{T}$. Having observed the data, namely the eigenvalues $\omega$ of $\hat{C}_x$, the posterior density $p(\lambda \mid \omega, \theta)$ for $\lambda$ may be calculated using the likelihood $p(\omega \mid \lambda, \theta)$. The form of the likelihood is determined by the model (4) and is calculated below. The denominator $p(\omega \mid \theta)$, which may be determined by the requirement that the posterior density integrates to 1, is known as the evidence and is useful in determining the noise variance and the rank of $AA^{T}$; this is the subject of section 4, and for now we omit the dependence of these densities on $\theta$ for notational simplicity.

3.2 Likelihood

We assume that the $\omega_n$ are conditionally independent, so that the likelihood may be expressed as the product
$$p(\omega \mid \lambda) = \prod_{n=1}^{N} p(\omega_n \mid \lambda). \qquad (16)$$

The likelihood $p(\omega_n \mid \lambda)$ is determined by the model equation (4), and may be estimated from (12) as
$$p(\omega_n \mid \lambda) = P\bigl(\omega_n \ge \lambda_{n+N-j} + \sigma^2 \nu_j \ \text{and}\ \omega_n \le \lambda_{n-k+1} + \sigma^2 \nu_k, \ \ 1 \le k \le n \le j \le N\bigr) \qquad (17)$$
$$\phantom{p(\omega_n \mid \lambda)} = \prod_{j=n}^{N} P\!\left( \frac{\omega_n - \lambda_{n+N-j}}{\sigma^2} \ge \nu_j \right) \prod_{k=1}^{n} P\!\left( \frac{\omega_n - \lambda_{n-k+1}}{\sigma^2} \le \nu_k \right). \qquad (18)$$
The probabilities appearing in (18) are no more than the cumulative densities for the eigenvalues $\nu_n$, which may be calculated, using elementary methods from order statistics, in the following way. The non-zero eigenvalues of $\hat{C}_\nu$ may be regarded as $R = \min(T, N)$ realisations of the random variable $\nu$, whose density function is $f_\nu(\xi)$, given by equations (9) and (10). The cumulative density function $F_n(\xi)$ of the $n$th largest eigenvalue $\nu_n$ is the probability that at least $R - n + 1$ of the eigenvalues are less than or equal to $\xi$. Thus
$$F_n(\xi) = P(\nu_n \le \xi) = \sum_{i=R-n+1}^{R} \binom{R}{i} \left[ F_\nu(\xi) \right]^{i} \left[ 1 - F_\nu(\xi) \right]^{R-i}, \qquad (19)$$
which is readily calculated from $f_\nu(\xi)$. Combining (18) and (19) gives
$$p(\omega_n \mid \lambda) = \prod_{j=n}^{N} F_j\!\left( \frac{\omega_n - \lambda_{n+N-j}}{\sigma^2} \right) \prod_{k=1}^{n} \left[ 1 - F_k\!\left( \frac{\omega_n - \lambda_{n-k+1}}{\sigma^2} \right) \right] \qquad (20)$$
$$\phantom{p(\omega_n \mid \lambda)} = \prod_{j=n}^{N} F_j(\tilde{\omega}_n - \tilde{\lambda}_{n+N-j}) \prod_{k=1}^{n} \bar{F}_k(\tilde{\omega}_n - \tilde{\lambda}_{n-k+1}), \qquad (21)$$
where for notational brevity we have written $\tilde{\omega}_n = \omega_n / \sigma^2$, $\tilde{\lambda}_n = \lambda_n / \sigma^2$ and $\bar{F}(\xi) = 1 - F(\xi)$. The likelihood is thus seen to be the product of $N - n + 1$ factors estimating lower bounds on $\omega_n$ and $n$ factors estimating upper bounds. The lower bound factors, $F_j$, are cumulative density functions and so increase monotonically from zero at small (possibly negative) $\omega_n$ to unity when $\omega_n$ is sufficiently large. Conversely, the upper bound factors, $\bar{F}_k$, decrease from unity to zero. Since there is always at least one $F_j$ and one $\bar{F}_k$, the likelihood $p(\omega_n \mid \lambda)$ is zero for sufficiently small $\omega_n$, rises monotonically with increasing $\omega_n$ to a maximum, and then decreases monotonically to zero at large $\omega_n$.
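These quantities are simple to compute. The sketch below is an editorial illustration with assumed names: F_nu is taken to be a vectorised callable returning the limiting CDF of a single unit-variance noise eigenvalue (zero below $b_-$, one above $b_+$, e.g. obtained by integrating $f_\nu$ numerically); the functions then implement the order-statistic CDFs of (19) and the single-eigenvalue likelihood of (21).

```python
# Order-statistic CDFs F_n of eq. (19) and the likelihood p(omega_n | lambda)
# of eq. (21). F_nu is an assumed helper giving the CDF of one noise eigenvalue.
import numpy as np
from scipy.stats import binom

def order_stat_cdf(xi, n, R, F_nu):
    """P(nu_n <= xi): at least R - n + 1 of R eigenvalues lie at or below xi."""
    return binom.sf(R - n, R, F_nu(xi))        # binomial tail, eq. (19)

def likelihood_omega_n(n, omega_n, lam, sigma2, R, F_nu):
    """p(omega_n | lambda), eq. (21); lam is the full vector lambda_1..lambda_N."""
    N = len(lam)
    w = omega_n / sigma2
    lt = np.asarray(lam) / sigma2
    p = 1.0
    for j in range(n, N + 1):                  # lower-bound factors F_j
        p *= order_stat_cdf(w - lt[n + N - j - 1], j, R, F_nu)
    for k in range(1, n + 1):                  # upper-bound factors 1 - F_k
        p *= 1.0 - order_stat_cdf(w - lt[n - k], k, R, F_nu)
    return p
```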


The full likelihood (16) is therefore
$$p(\omega \mid \lambda) = \prod_{n=1}^{N} \prod_{j=n}^{N} F_{n+N-j}(\tilde{\omega}_n - \tilde{\lambda}_j) \prod_{k=1}^{n} \bar{F}_{n-k+1}(\tilde{\omega}_n - \tilde{\lambda}_k), \qquad (22)$$

and identifying terms that depend upon $\lambda_n$ gives the likelihood of $\omega$ conditioned on $\lambda_n$:
$$p(\omega \mid \lambda_n) = \prod_{j=1}^{n} F_{N-n+j}(\tilde{\omega}_j - \tilde{\lambda}_n) \prod_{k=n}^{N} \bar{F}_{k-n+1}(\tilde{\omega}_k - \tilde{\lambda}_n). \qquad (23)$$
Provided that the prior for $\lambda$ factorises as $p(\lambda) = \prod_{n=1}^{N} p(\lambda_n)$, the posterior densities for each $\lambda_n$ may be calculated separately using (23):
$$p(\lambda_n \mid \omega) = \frac{p(\omega \mid \lambda_n)\, p(\lambda_n)}{p(\omega)} \qquad (24)$$
$$\phantom{p(\lambda_n \mid \omega)} = \prod_{j=1}^{n} F_{N-n+j}(\tilde{\omega}_j - \tilde{\lambda}_n) \prod_{k=n}^{N} \bar{F}_{k-n+1}(\tilde{\omega}_k - \tilde{\lambda}_n)\; \frac{p(\lambda_n)}{p(\omega)}. \qquad (25)$$
This factorisation of the posterior density, $p(\lambda \mid \omega) = \prod_{n=1}^{N} p(\lambda_n \mid \omega)$, is a consequence of the form of the estimates for $p(\omega_n \mid \lambda)$ and the factorisation of the $\lambda$ prior. Note that, unlike the likelihood (21), it is the factors $\bar{F}_{k-n+1}$ that shape the lower bounds of the posterior density, while the cumulative densities $F_{N-n+j}$ determine the posterior probabilities for large $\lambda_n$. Although there are $N + 1$ of these factors, in practice only a few of them are relevant, because many of them are rendered impotent by the fact that another factor is zero everywhere that they are non-zero. Utilisation of this fact greatly speeds up computation of the likelihood. Since $b_- \le \nu_n \le b_+$ for all $n$, it is clear that $F_n(\xi) = 0$ for $\xi < b_-$ and $F_n(\xi) = 1$ for $\xi > b_+$. Consider the factors associated with the lower bounds (i.e., the $\bar{F}_{k-n+1}$ in (25)) and in particular the factor $\bar{F}_1(\tilde{\omega}_n - \tilde{\lambda}_n)$. This term is zero for $\tilde{\lambda}_n < \Lambda = \tilde{\omega}_n - b_+$. Any other term which is 1 for $\tilde{\lambda}_n = \Lambda$ (and therefore for any $\tilde{\lambda}_n > \Lambda$) cannot play a role in shaping the likelihood, because $\bar{F}_1(\tilde{\omega}_n - \tilde{\lambda}_n)$, and therefore $p(\omega \mid \lambda_n)$, is zero for $\tilde{\lambda}_n < \Lambda$, and when $\tilde{\lambda}_n > \Lambda$ the contribution of unity to the product (25) is irrelevant. The only potent contributions are therefore those $\bar{F}_{k-n+1}$ for which $\tilde{\omega}_k - b_- \ge \Lambda$. Similar considerations show that the only potent upper bound factors are those for which $\tilde{\omega}_j - b_+ \le \tilde{\omega}_n - b_-$. If $\lambda_n$ is separated from its neighbours by at least $(b_+ - b_-)\sigma^2 = 4\sqrt{N/T}\,\sigma^2$, then only $\lambda_n$ plays a role in determining $\omega_n$.

The most time-consuming part of the likelihood calculation is the numerical integration of $f_\nu(\xi)$ to obtain $F_\nu(\xi)$. Each likelihood estimate requires $F_\nu(\xi)$ for different values of $\xi$, but great economies may be made by tabulating $F_\nu$ (once) on a relatively fine mesh and then interpolating to the required $\xi$.¹

¹ Matlab scripts implementing these calculations are available from http://www.ee.ic.ac.uk/research/neural/everson
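A sketch of that tabulate-once-and-interpolate economy (editorial; names and grid size are arbitrary, and only the case $0 < y \le 1$ is handled):

```python
# Tabulate F_nu on a fine mesh once and wrap it in an interpolating callable.
import numpy as np

def make_F_nu(y, n_grid=2000):
    b_minus, b_plus = (1 - np.sqrt(y))**2, (1 + np.sqrt(y))**2
    xi = np.linspace(b_minus, b_plus, n_grid)
    f = np.sqrt(np.maximum((xi - b_minus) * (b_plus - xi), 0.0)) \
        / (2 * np.pi * y * np.maximum(xi, 1e-12))
    F = np.concatenate(([0.0], np.cumsum(0.5 * (f[1:] + f[:-1]) * np.diff(xi))))
    F /= F[-1]                                 # guard against quadrature error
    def F_nu(x):
        return np.interp(x, xi, F, left=0.0, right=1.0)
    return F_nu
```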

3.3 Prior

In order to complete the Bayesian scheme a prior density $\pi(\lambda \mid \theta)$ must be chosen to reflect belief and prior knowledge about the eigenvalues of $AA^{T}$. Since $AA^{T}$ is positive semi-definite, $\lambda_n \ge 0$. Non-negativity of the eigenvalues is not enforced by the inequalities (12), which are applicable to the wider class of Hermitian matrices. The prior should therefore encode this knowledge about $\lambda_n$; thus

$$\pi(\lambda_n \mid \theta) = 0 \quad \forall\, \lambda_n < 0. \qquad (26)$$

In some instances, when $N$ is large, the $\lambda_n$ may be regarded as scale variables, and may be expected to decay like $\lambda_n = a^{-kn}$ for some constants $k$ and $a$. In this case a gamma distribution centred around $a^{-kn}$ is a reasonable model for $\pi(\lambda_n \mid \theta)$. Note however that it may be important to allow for the possibility of zero eigenvalues by adopting a prior which is non-zero for $\lambda_n = 0$. When $N$ is relatively small and the data are thought to have been generated by a small number of roughly equally powered sources, the $\lambda_n$ are all expected to be about the same size. In the absence of further information a uniform prior between 0 and some outer scale is most uninformative. Note that we have found a Jeffreys prior (which gives equal weight to $\lambda_n$ on a logarithmic scale; see e.g. [8]) to be unsuitable for this situation because (1) it places too much weight at small scales, but (2) does not permit the possibility of an exactly zero eigenvalue. Since $\omega_n > \lambda_n$, a suitable outer scale for $\lambda_n$ is $\omega_n$, and we therefore choose
$$\pi(\lambda_n) = \begin{cases} \dfrac{1}{\omega_n} & \text{if } 0 \le \lambda_n < \omega_n \\ 0 & \text{otherwise.} \end{cases} \qquad (27)$$

In this context the $\omega_n$ are hyper-parameters and formally their values may be found through a hierarchical Bayesian approach. In any case, since the likelihood is compact ($p(\omega_n \mid \lambda, \theta) = 0$ for $\omega_n$ outside the interval $[\lambda_n + \sigma^2 b_-, \lambda_n + \sigma^2 b_+]$), the precise form of the prior is not crucial, provided that it is sufficiently broad that it does not unwarrantedly prejudice the posterior.

Finally, we shall wish to examine the hypothesis that the number of sources, $M$, is less than $N$. To do this we choose $\pi(\lambda_n \mid \theta, M) = \delta(\lambda_n)$ for $n > M$. In summary we have
$$\pi(\lambda \mid \theta, M) = \prod_{n=1}^{M} \frac{1}{\omega_n} I_{[0,\omega_n]}(\lambda_n) \prod_{n=M+1}^{N} \delta(\lambda_n). \qquad (28)$$

3.4 Example

As an illustration we apply the method to covariance matrices corresponding to $M = 6$ sources and $T = 100$ observations of an $N = 10$-dimensional vector. The noise covariance matrix was constructed from unit-variance Gaussian noise vectors. The noise power is given by $\mathrm{Tr}\,\hat{C}_\nu = 10.5$, and the signal power by $\mathrm{Tr}\,AA^{T} = 19.0$. The eigenvalues of $\hat{C}_\nu$ (which would usually be unknown) are those shown in figure 1a. Figure 2 shows the posterior densities for some of the eigenvalues. This calculation used the known noise variance and the flat prior (27). The modes of the posteriors for the non-zero $\lambda_n$ are close to the real values, while the posterior densities for $\lambda_7$ and $\lambda_8$ correctly indicate that those eigenvalues are zero or very close to zero. Figure 3 shows the modes and standard deviations of the posteriors for all the eigenvalues. In all cases the mode of the posterior is close to the true eigenvalue, and an advantage of the Bayesian method is that error estimates are placed on the eigenvalues.
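The calculation behind figures 2 and 3 can be organised as in the following editorial sketch (flat prior (27), known $\sigma^2$; order_stat_cdf and make_F_nu are the assumed helpers from the earlier sketches): the conditional likelihood (23) is evaluated on a grid of $\lambda_n$ values and, with the flat prior, the normalised values give the posterior (25), from which the mode and standard deviation follow.

```python
# Conditional likelihood (23) and per-eigenvalue posterior (25) on a grid.
import numpy as np

def likelihood_given_lambda_n(n, lam_n, omega, sigma2, R, F_nu):
    """p(omega | lambda_n, sigma), eq. (23); n is 1-based."""
    N = len(omega)
    w = np.asarray(omega) / sigma2
    lt = lam_n / sigma2
    p = 1.0
    for j in range(1, n + 1):                 # factors F_{N-n+j}(w_j - lambda~_n)
        p *= order_stat_cdf(w[j - 1] - lt, N - n + j, R, F_nu)
    for k in range(n, N + 1):                 # factors 1 - F_{k-n+1}(w_k - lambda~_n)
        p *= 1.0 - order_stat_cdf(w[k - 1] - lt, k - n + 1, R, F_nu)
    return p

def posterior_lambda_n(n, omega, sigma2, R, F_nu, n_grid=400):
    grid = np.linspace(0.0, omega[n - 1], n_grid)       # flat prior support, eq. (27)
    post = np.array([likelihood_given_lambda_n(n, l, omega, sigma2, R, F_nu)
                     for l in grid])
    dx = grid[1] - grid[0]
    post /= post.sum() * dx + 1e-300                    # normalise on the grid
    mode = grid[np.argmax(post)]
    mean = (grid * post).sum() * dx
    std = np.sqrt(((grid - mean)**2 * post).sum() * dx)
    return grid, post, mode, std
```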

4 Evidence and model order selection

The Bayesian formalism permits an estimate of the unknown parameters $\theta = (\sigma, M)^{T}$. This follows from the fact that the posterior probability of $\theta$ given $\omega$ may be expressed as
$$p(\sigma, M \mid \omega) = \frac{p(\omega \mid \sigma, M)\, \pi(\sigma, M)}{p(\omega)}. \qquad (29)$$

Since $p(\omega)$ is constant, the most probable $\sigma$ and $M$ are those which maximise the numerator of (29). If the prior $\pi(\sigma, M)$ is uninformative this is equivalent to maximising $p(\omega \mid \sigma, M)$, which is therefore known as the evidence. While $\sigma$ is a continuous variable, $M$ may only assume integer values, and choosing the $M$ for which there is most evidence thus constitutes selecting a model order, i.e., the rank of the mixing matrix $A$. For predictive purposes the Bayesian approach is to integrate (marginalise) over all $\sigma$, and one might also integrate over the model order $M$. In many cases $p(\sigma, M \mid \omega)$ is sharply peaked in both $\sigma$ and $M$, so that choosing the modal values forms a good approximation to the full integration. The minimum message length (MML) criterion [19] and the minimum description length (MDL) criterion [13] each seek to select the model order by determining the shortest string that describes the data in terms of the model and the data given the model. This balances model complexity (measured as the length $-\log \pi(M)$ of a string describing the model) against the length $-\log p(\omega \mid M)$ of an additional string required to describe the data once the model is known. Since the length of a message describing the model is proportional to the model order, the MML criterion may be viewed as maximising $p(M \mid \omega)$ with the prior $\pi(M) = e^{-M}$. Since $\sigma$ may be regarded as a scale variable, a Jeffreys prior for $\pi(\sigma, M)$ may be appropriate in some circumstances. For now we choose $\pi(\sigma, M) = e^{-M}$. Using (28), we have

$$p(\sigma, M \mid \omega) \;\propto\; \prod_{n=1}^{M} \frac{1}{\omega_n} \int_{0}^{\omega_n} p(\omega \mid \lambda_n, \sigma)\, d\lambda_n \;\; \prod_{n=M+1}^{N} p(\omega \mid \lambda_n, \sigma)\big|_{\lambda_n = 0}. \qquad (30)$$
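The evidence surface of figure 4a can be tabulated as in the following editorial sketch (make_F_nu and likelihood_given_lambda_n are the assumed helpers from earlier sketches; the integral over $[0, \omega_n]$ is approximated by a mean over a uniform grid):

```python
# Log evidence of eq. (30) on a grid of noise deviations sigma and orders M.
import numpy as np

def log_evidence_grid(omega, sigmas, N, T, n_grid=200):
    R = min(N, T)
    F_nu = make_F_nu(N / T)                    # unit-variance noise-eigenvalue CDF
    log_ev = np.full((len(sigmas), N + 1), -np.inf)
    for i, sigma in enumerate(sigmas):
        s2 = sigma**2
        free = np.zeros(N)   # (1/omega_n) * integral of p(omega | lambda_n) over [0, omega_n]
        zero = np.zeros(N)   # p(omega | lambda_n = 0)
        for n in range(1, N + 1):
            grid = np.linspace(0.0, omega[n - 1], n_grid)
            like = np.array([likelihood_given_lambda_n(n, l, omega, s2, R, F_nu)
                             for l in grid])
            free[n - 1] = like.mean()
            zero[n - 1] = like[0]
        for M in range(N + 1):
            log_ev[i, M] = np.log(free[:M] + 1e-300).sum() + np.log(zero[M:] + 1e-300).sum()
    return log_ev
```

The most probable $(\sigma, M)$ is then simply the grid point maximising the returned array.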

As figure 4 shows for the example discussed in §3.4, there is most evidence for $M = 6$ and $\sigma = 0.94$. The rank of $AA^{T}$ has been correctly identified and the noise variance is close to the true value of unity. Figure 4 also shows the eigenvalue spectrum at $\sigma = 0.94$.

Particularly when the rank of $A$ is small there is, however, a tendency for the evidence calculation to underestimate $\sigma$ and overestimate the rank $M$. The reason for this can be seen by examining equation (30). The terms in the second product are the likelihoods evaluated at $\lambda_n = 0$, each of which may be interpreted as the evidence that $\lambda_n$ is zero. As argued in section 3.2, the support of $p(\omega \mid \lambda_n)$ is no larger than $[\omega_n - \sigma^2 b_+,\ \omega_n - \sigma^2 b_-]$ and the likelihood attains a maximum somewhere in this interval. When $\sigma^2 < \omega_n / b_+$ there is zero evidence for $\lambda_n$ being zero because $p(\omega \mid \lambda_n, \sigma)\big|_{\lambda_n=0} = 0$. As $\sigma^2$ becomes larger than $\omega_n / b_+$ the likelihood (and therefore the evidence for $\lambda_n = 0$) increases, achieves a maximum for $\sigma \approx \sqrt{\omega_n}$, and then decreases. When $\sigma^2 > \omega_n / b_-$ the support of $p(\omega \mid \lambda_n)$ lies entirely on the negative half axis and there is zero evidence for $\lambda_n = 0$. Note that this gives an immediate estimate of the maximum noise variance, namely $\sigma^2 < \omega_N / b_-$; if the noise were any larger, $\omega_N$ would be negative, which contradicts the positive definiteness of $\hat{C}_x$. This is, however, an overestimate of $\sigma$ because the non-zero eigenvalues of $AA^{T}$ increase $\omega_N$ away from $\sigma^2 \nu_N$. The use of the Hermitian properties of the covariance matrices (without exploiting their positive definiteness) leads to the prediction of negative $\lambda_n$, which was eliminated above by a prior which truncated the likelihood at zero. Here it leads to decreasing evidence for $\lambda_n = 0$ as $\sigma^2$ exceeds approximately $\omega_n$. The rate of this decrease may be too rapid because the model fails to account for the increasing probability that $\lambda_n$ is exactly zero, but treats $\lambda_n = 0$ in exactly the same way as, for example, $\lambda_n = 0.1$.

This point is illustrated in a second example, which was chosen to be difficult to the point of pathological. Two fragments ($T = 200$ samples, less than 1/50th of a second) of music were linearly mixed and Gaussian observational noise ($\sigma = 1$) added to form an observation sequence $x(t)$ in $N = 20$ dimensions. The noise power is $\mathrm{Tr}\,\hat{C}_\nu = 19.75$ and the power of the noiseless signal was $\mathrm{Tr}\bigl(T^{-1} \sum_{t=1}^{T} A s(t) s^{T}(t) A^{T}\bigr) = 7.76$. Also, the vast majority of the signal power is represented by $\lambda_1 = 7.22$, while the other non-zero eigenvalue is well below the noise power, $\lambda_2 = 0.54$.

The eigenvalues $\lambda_n$ and $\omega_n$ are shown in figure 5a. Apart from the first eigenvalue, the spectrum is dominated by the noise. The abrupt drop in the true eigenvalues between $\lambda_2$ and zero is obscured in the $\omega_n$, and it is difficult to tell by eye that the underlying rank is 2. Naive application of the MDL criterion, based on a linear model with Gaussian distributed errors, suggests that the model order is 1. Rajan and Rayner's scheme [12] suggests $M = 13$. The evidence (figure 5b) is maximum when $\sigma = 0.87$ and $M = 7$. The reason for this overestimate of $M$ and concomitant underestimate of $\sigma$ is apparent from figures 5c and 5d, which show $\omega_n^{-1} \int_0^{\omega_n} p(\omega \mid \lambda_n, \sigma)\, d\lambda_n$ and $p(\omega \mid \lambda_n, \sigma)\big|_{\lambda_n=0}$ as functions of $\sigma$ and $n$. The evidence for any $\sigma$ and $M$ is obtained (29) by multiplying together the values down the column corresponding to $\sigma$ in figure 5c as far as $M$, and then continuing down the corresponding column in figure 5d. As figure 5d illustrates, the evidence for $\lambda_{20} = 0$ is large when $\sigma \approx 0.7$, but is very small ($\approx 10^{-15}$) when $\sigma = 1$. In fact when $\sigma = 0.9$, for example, the evidence that $\lambda_{20} = 0$ is less than the evidence that $\lambda_{19} = 0$, even though we know that $\lambda_{19} \ge \lambda_{20}$.

Denoting the evidence that $\lambda_n = 0$ by $\epsilon_n(\sigma)$ ($n > M$), we might expect that $\epsilon_n(\sigma) \ge \epsilon_{n-1}(\sigma)$ when $\sigma \le \sqrt{\omega_n / b_-}$ and $\epsilon_n(\sigma) = 0$ if $\sigma > \sqrt{\omega_n / b_-}$. That is, at any noise variance we expect there to be at least as much evidence that $\lambda_n = 0$ as evidence that $\lambda_{n-1} = 0$, unless $\sigma^2 > \omega_n / b_-$, which is an infeasible value for $\sigma$. We might therefore model the evidence that $\lambda_n = 0$ as
$$\epsilon_n(\sigma) = \begin{cases} \max_{n' \ge n} p(\lambda_{n'} \mid \omega, \sigma)\big|_{\lambda_{n'} = 0} & \text{if } \sigma \le \sqrt{\omega_n / b_-} \\ 0 & \text{if } \sigma > \sqrt{\omega_n / b_-}. \end{cases} \qquad (31)$$
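In implementation terms the modification is a running maximum with a feasibility cut-off, as in this editorial sketch (the input is the vector of zero-$\lambda_n$ evidence factors already computed for (30), so the result agrees with (31) up to the prior normalisation):

```python
# Modified zero-eigenvalue evidence of eq. (31) for a single value of sigma.
import numpy as np

def modified_zero_evidence(zero, omega, sigma, N, T):
    """zero[n-1] is the evidence that lambda_n = 0; omega sorted decreasing."""
    b_minus = (1 - np.sqrt(N / T))**2
    eps = np.maximum.accumulate(np.asarray(zero)[::-1])[::-1]  # max over n' >= n
    eps = np.where(sigma**2 > np.asarray(omega) / b_minus, 0.0, eps)
    return eps
```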

Figure 5e shows $\epsilon_n(\sigma)$, and figure 5f shows the overall evidence, which is maximum at $\sigma = 1.07$ and $M = 1$, closer to the correct model order and variance. Here the fact that $\epsilon_n(\sigma)$ does not decrease with increasing $\sigma$ until $\sigma = \sqrt{\omega_n / b_-}$, when $\epsilon_n$ drops abruptly to zero, means that $M$ is selected at the largest feasible value of $\sigma$ for $\omega_N$, namely $\sigma = \sqrt{\omega_N / b_-}$. Since this is an upper bound for $\sigma$, it is generally an overestimate of the actual variance and consequently often leads to an underestimate of the rank $M$.

A better estimate of $\sigma$ is obtained from a least squares fit of the tail of the observed spectrum to the Silverstein noise spectrum, $\nu_n$, assuming that the effect of $\lambda_n$ on the tail of the spectrum is negligible. In this example a least-squares fit yields $\sigma = 1.0061$. The evidence for different model orders at this $\sigma$ is plotted in figure 6 and correctly suggests that the rank is 2.
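One way to realise such a fit is sketched below (editorial; the half-integer quantile rule for the noise spectrum $\nu_n$ is an assumption, and make_F_nu is the helper from an earlier sketch): the tail of the observed spectrum is regressed onto the Silverstein quantiles, giving the least-squares scale $\sigma^2$.

```python
# Least-squares estimate of sigma from the tail of the observed spectrum.
import numpy as np

def fit_sigma_from_tail(omega, tail_start, N, T, n_grid=4000):
    y = N / T
    b_minus, b_plus = (1 - np.sqrt(y))**2, (1 + np.sqrt(y))**2
    xi = np.linspace(b_minus, b_plus, n_grid)
    F = make_F_nu(y)(xi)
    n_tail = np.arange(tail_start, N + 1)
    nu = np.interp(1.0 - (n_tail - 0.5) / N, F, xi)     # xi versus n = N (1 - F(xi))
    w_tail = np.asarray(omega)[n_tail - 1]
    sigma2 = (w_tail * nu).sum() / (nu * nu).sum()      # least-squares scale factor
    return np.sqrt(sigma2)
```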

5 Discussion

We have presented a method for estimating the eigenvalues, and hence the rank, of a covariance matrix when the observed covariance matrix is heavily contaminated with noise and the number of data samples is limited. The Bayesian approach yields error bounds for the eigenvalues, and the model order and noise variance may be estimated using the evidence. Silverstein's expression for the eigenvalue density of sample covariance matrices is valid for all i.i.d. noise; consequently this method is applicable to all sorts of zero-mean noise, not just Gaussian noise.

We have concentrated on the regime $y = N/T < 1$. Apart from the $N - T$ zero eigenvalues in the covariance matrix, the theory is unchanged when $N > T$, and we point out that the $\omega_n$ may be efficiently calculated from the $T \times T$ matrix $K_{pq} = \frac{1}{T} x^{T}(t_p) x(t_q)$, which has the same non-zero eigenvalues as $\hat{C}_x$ [16]. This method is also applicable to singular value decomposition (SVD), because the singular values of the data matrix are just the square roots of the $\omega_n$. Many data analysis methods (PCA, SVD, ICA) do not explicitly model the noise as in (4) but implicitly use the noiseless model (1). Exciting exceptions are Tipping and Bishop's "probabilistic PCA" [18] and EM formulations of ICA [14, 1].

Model order estimation and the blind estimation of noise variance are notoriously difficult problems and many methods have been developed to attack them. A novel feature of our approach is the explicit incorporation in the model of the number of data samples and the expected statistics of the noise. Methods which fail to model the number of data samples (such as naive MDL based on a linear generative model with Gaussian latent variables) perform poorly because the eigenvalues of the sample covariance matrix are not merely raised by $\sigma^2$ except in the limit of infinite data. Rajan and Rayner [12] also give a Bayesian scheme for SVD model order determination. They do not explicitly model the noise but assume that the projections onto the singular vectors with small singular values are dominated by noise. Zarowski's approach [20] is similar to ours in that he has modelled the singular values of noisy data by assuming that the singular values of the noise-free data are perturbed by noise drawn from ad hoc distributions. He then uses the MDL criterion to decide the rank of the noiseless data.

We have discussed a particular example in which this scheme has difficulty in correctly assessing the model order and noise variance. In this context model order estimation is equivalent to determining the number of eigenvalues that are exactly zero, and small changes in the estimated variance can lead to large changes in the model order. The principal obstacle here is the inadequacy of modelling the likelihood of a zero eigenvalue. Since the covariance matrices are positive semi-definite, zero is a distinguished value, but it is not treated specially by the inequality relations between eigenvalues of general Hermitian matrices. Note that brute-force sampling approaches are computationally completely infeasible even for small problems. We have presented two methods (a modification of the evidence for $\lambda_n = 0$ and estimation of $\sigma$ by least squares fitting) that improve the estimates. More robust results are obtained when the signal to noise ratio is larger or prior information exists about the variance or eigenvalue spectrum.

Finally, it is important to recognise that we have assumed that the noise and signal are uncorrelated. This assumption was made so that the term $2\sigma A \langle s \nu^{T} \rangle$ could be dropped in passing from equation (5) to (7). While this is certainly true with many data samples, spurious correlations with few samples may affect the $\omega_n$. Indeed, since $2\sigma A \langle s \nu^{T} \rangle$ is indefinite, the $\omega_n$ may be decreased. Also, if the sources are not perfectly decorrelated, $A \langle s s^{T} \rangle A^{T}$ may have larger rank than the actual number of sources.
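As a concrete footnote to the snapshot remark above (editorial sketch; the function name is illustrative), the non-zero $\omega_n$ can be obtained from the small $T \times T$ matrix when $N > T$:

```python
# Snapshot computation: K_pq = (1/T) x(t_p)^T x(t_q) shares its non-zero
# eigenvalues with the N x N sample covariance matrix.
import numpy as np

def sample_spectrum_via_snapshots(X):
    """X has shape (N, T); returns the T eigenvalues of (1/T) X^T X, largest first."""
    T = X.shape[1]
    K = X.T @ X / T
    return np.sort(np.linalg.eigvalsh(K))[::-1]
```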

Acknowledgements

We are grateful for discussions with Dirk Husmeier, Will Penny, Iead Rezek and Amos Storkey. Part of this work was supported by funding from British Aerospace, to whom the authors are most grateful.

References

[1] H. Attias. EM algorithms for independent components analysis. In A. Constantinides, S.-Y. Kung, M. Niranjan, and E. Wilson, editors, Neural Networks for Signal Processing VIII, pages 132-141, New York, NY, USA, 1998. IEEE.
[2] A.J. Bell and T.J. Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129-1159, 1995.
[3] R.M. Everson and S.J. Roberts. ICA: A flexible non-linearity and decorrelating manifold approach. Neural Computation, 1998. (Submitted.) Available from http://www.ee.ic.ac.uk/research/neural/everson.
[4] R.A. Horn and C.R. Johnson. Matrix Analysis. Cambridge University Press, 1985.
[5] H. Hotelling. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24:417-441, 498-520, 1933.
[6] I.T. Jolliffe. Principal Component Analysis. Springer-Verlag, 1986.
[7] K. Karhunen. Zur Spektraltheorie stochastischer Prozesse. Ann. Acad. Sci. Fennicae, 37, 1946.
[8] P.M. Lee. Bayesian Statistics: An Introduction. Oxford University Press, 1989.
[9] T-W. Lee, M. Girolami, A.J. Bell, and T.J. Sejnowski. A unifying information-theoretic framework for independent component analysis. International Journal on Mathematical and Computer Modeling, 1998. (In press.) Available from http://www.cnl.salk.edu/~tewon/Public/mcm.ps.gz.
[10] M.M. Loeve. Probability Theory. Van Nostrand, Princeton, NJ, 1955.
[11] B.N. Parlett. The Symmetric Eigenvalue Problem. Prentice-Hall, Englewood Cliffs, NJ, 1980.
[12] J.J. Rajan and P.J.W. Rayner. Model order selection for the singular value decomposition and the discrete Karhunen-Loeve transform using a Bayesian approach. IEE Proc. Vis. Image Signal Process., 144(2):116-123, 1997.
[13] J. Rissanen. Modeling by shortest data description. Automatica, 14:465-471, 1978.
[14] S. Roweis and Z. Ghahramani. A unifying review of linear Gaussian models. Neural Computation, 1998. (In press.) Available from ftp://ftp.cs.toronto.edu/pub/zoubin/linear-gaussian.ps.gz.
[15] J.W. Silverstein. Eigenvalues and eigenvectors of large dimensional sample covariance matrices. Contemporary Mathematics, 50:153-159, 1986.
[16] L. Sirovich. Turbulence and the dynamics of coherent structures, Pt. I: Coherent structures, Pt. II: Symmetries and transformations, Pt. III: Dynamics and scaling. Quarterly of Applied Mathematics, XLV(3):561-590, 1987.
[17] L. Sirovich and R.M. Everson. Analysis and management of large scientific databases. International Journal of Supercomputing Applications, 6(1):50-68, 1992.
[18] M.E. Tipping and C.M. Bishop. Probabilistic principal component analysis. Technical Report NCRG/97/010, Neural Computing Group, Aston University, 1997. Available from http://neural-server.aston.ac.uk.
[19] C.S. Wallace and D.M. Boulton. An information measure for classification. Computer Journal, 11(2):195-209, 1968.
[20] C.J. Zarowski. The MDL criterion for rank determination via effective singular values. IEEE Trans. Sig. Proc., 46(6):1741-1744, 1998.

List of Figures

1. Eigenvalues of sample covariance matrices. a: Eigenvalues $\nu_n$ of a sample covariance matrix constructed from $T = 100$ random vectors of dimension $N = 10$; the dashed line is $\xi$ plotted versus $n = N(1 - F_\nu(\xi))$. b: The 10 non-zero eigenvalues of a sample covariance matrix constructed from $T = 10$ Gaussian-distributed random vectors each of dimension $N = 100$; the dashed line is $\xi$ versus $n = T(1 - F_\nu(\xi))$.
2. Posterior densities for eigenvalues 3, 4, 6, 7 and 8.
3. Modes and standard deviations of the posterior densities $p(\lambda_n \mid \omega)$, with the true eigenvalues $\lambda_n$ and the noise-corrupted eigenvalues $\omega_n$.
4. Evidence for noise variance and rank. a: $\log_{10} p(\sigma, M \mid \omega)$, with maximum at $\sigma = 0.94$, $M = 6$. b: Modes of the posterior densities for $\lambda_n$ at the $\sigma$ of maximum evidence.
5. Evidence for a mixture of two fragments of music mixed into 20 dimensions: eigenvalue spectra, evidence surfaces, zero-eigenvalue evidence, modified evidence $\epsilon_n(\sigma)$, and overall evidence.
6. Evidence for model order $p(M, \sigma \mid \omega)$ evaluated at $\sigma = 1.006$.

Figure 1: Eigenvalues of sample covariance matrices. a: Eigenvalues $\nu_n$ of a sample covariance matrix constructed from $T = 100$ random vectors of dimension $N = 10$. The dashed line is $\xi$ plotted versus $n = N(1 - F_\nu(\xi))$, the expected number of eigenvalues greater than $\xi$. b: The 10 non-zero eigenvalues of a sample covariance matrix constructed from $T = 10$ Gaussian-distributed random vectors each of dimension $N = 100$. Here the dashed line is $\xi$ versus $n = T(1 - F_\nu(\xi))$.

Figure 2: Posterior densities for eigenvalues 3, 4, 6, 7 and 8.

Figure 3: Modes of the posterior densities. The solid line joins the modes of $p(\lambda_n \mid \omega)$ and error bars indicate the standard deviation of $p(\lambda_n \mid \omega)$. Circles show the true eigenvalues $\lambda_n$, and squares mark the eigenvalues $\omega_n$ of the noise-corrupted covariance matrix.

Figure 4: Evidence for noise variance and rank. a: Gray scale shows $\log_{10} p(\sigma, M \mid \omega)$; white indicates that $p(\sigma, M \mid \omega) = 0$. The maximum at $\sigma = 0.94$, $M = 6$ is indicated by the white circle. b: Modes of posterior densities for $\lambda_n$ calculated with the $\sigma$ for which there is maximum evidence. Circles show the true eigenvalues $\lambda_n$, and squares mark the eigenvalues $\omega_n$ of the noise-corrupted covariance matrix.

Figure 5: Evidence for mixture of two fragments of music mixed into 20 dimensions. a: True eigenvalues (circles), noise-corrupted eigenvalues (squares) and modes of posterior densities for $\sigma = 1$. The first eigenvalues ($\lambda_1 = 13.73$, $\omega_1 = 14.67$) are not plotted. b: Evidence $\log_{10} p(\sigma, M \mid \omega)$. The maximum at $\sigma = 0.87$, $M = 8$ is indicated by a white circle. c: Evidence conditioned on $\lambda_n$. d: Evidence, $p(\omega \mid \lambda_n, \sigma)\big|_{\lambda_n = 0}$, that $\lambda_n = 0$. Crosses mark the locus $\omega_n - \sigma^2 b_- = 0$. e: Modified evidence, $\epsilon_n(\sigma)$, that $\lambda_n = 0$. f: Overall evidence, $\log_{10} p(\sigma, M \mid \omega)$. The white circle indicates the maximum at $\sigma = 1.07$, $M = 1$.

Figure 6: Evidence for model order $p(M, \sigma \mid \omega)$ evaluated at $\sigma = 1.006$.