Penalized Covariance Matrix Estimation using a Matrix-Logarithm Transformation

Xinwei Deng* and Kam-Wah Tsui

Department of Statistics, University of Wisconsin-Madison

(November 15, 2010)

Abstract

For statistical inferences that involve covariance matrices, it is desirable to obtain an accurate covariance matrix estimate with a well-structured eigen-system. We propose to estimate the covariance matrix through its matrix logarithm based on an approximate log-likelihood function. We develop a generalization of the Leonard and Hsu (1992) log-likelihood approximation that no longer requires a nonsingular sample covariance matrix. The matrix log-transformation provides the ability to impose a convex penalty on the transformed likelihood such that the largest and smallest eigenvalues of the covariance matrix estimate can be regularized simultaneously. The proposed method transforms the problem of estimating the covariance matrix into the problem of estimating a symmetric matrix, which can be solved efficiently by an iterative quadratic programming algorithm. The merits of the proposed method are illustrated by a simulation study and two real applications in classification and portfolio optimization.

1 Introduction

Covariance matrix estimation is of fundamental importance in multivariate analysis and many statistical applications. For example, principal component analysis (PCA) applies the eigen-decomposition of the covariance matrix for dimension reduction. For the classification problem, linear discriminant analysis (LDA) and other procedures need the inverse of a covariance matrix to compute the classification rule. In finance, portfolio optimization often utilizes the inverse of the covariance matrix to minimize the portfolio risk.* Although one can estimate the covariance matrix and then obtain its inverse, the inversion can be computationally intensive and unstable as the dimension increases. It is therefore desirable to obtain an accurate covariance matrix estimate with a well-structured eigen-system.

*Address for correspondence: Xinwei Deng, Department of Statistics, University of Wisconsin-Madison, 1300 University Ave., Madison, WI 53706 (E-mail: [email protected]).

Suppose x_1, ..., x_n are independently and identically distributed p-dimensional random vectors which follow a multivariate normal distribution N(µ, Σ) with mean µ and covariance matrix Σ. Without loss of generality, we assume that µ = 0. Let S = \sum_{i=1}^n x_i x_i' / n be the sample covariance matrix. The negative log-likelihood function of Σ given the sample x_1, ..., x_n is, up to an additive constant, proportional to

L_n(Σ) = -log|Σ^{-1}| + tr[Σ^{-1} S].    (1.1)

When p < n, S is the maximum likelihood estimate of Σ. It is well known that S is not a stable estimate of Σ when p is large or close to the sample size n. As the dimension p increases, the largest eigenvalues of S tend to be systematically distorted, which can result in an ill-conditioned estimate of Σ (Stein, 1975; Johnstone, 2001). When p > n, S is singular and its smallest eigenvalue is zero, so it is not appropriate to use S to obtain an estimate of Σ^{-1}.

We report here a brief simulation study contrasting the performance of the sample covariance matrix with our newly developed covariance matrix estimate. We call the proposed estimation method Log-ME, which stands for "Logarithm-transformed Matrix Estimate" and is detailed in Section 3. Specifically, we simulate the x_i's from a p-dimensional multivariate normal distribution N(0, I), where I is the identity matrix. For each p varying from 5 to 100, a sample of size n = 50 was drawn and the largest and smallest eigenvalues of the sample covariance matrix were obtained. For each p, we repeated the simulation 100 times and computed the average of the largest eigenvalues and the average of the smallest eigenvalues. We also obtained the corresponding averages of the eigenvalues of the Log-ME covariance matrix estimate. The true eigenvalues in this case are all equal to 1. However, Figure 1 shows that the averages of the eigenvalues of the sample covariance matrix deviate significantly from the true eigenvalues as the dimension p increases.

In contrast, the averages of the eigenvalues of the Log-ME covariance matrix estimate are very close to the true eigenvalues.
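To make the setup concrete, the following Python sketch (ours, not from the paper) reproduces the sample-covariance part of this experiment; the Log-ME curves require the estimator developed in Section 3 and are therefore not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_reps = 50, 100

for p in range(5, 101, 5):
    largest, smallest = [], []
    for _ in range(n_reps):
        # Draw n observations from N(0, I_p) and form S = sum_i x_i x_i' / n.
        x = rng.standard_normal((n, p))
        s = x.T @ x / n
        eigvals = np.linalg.eigvalsh(s)          # ascending order
        smallest.append(eigvals[0])
        largest.append(eigvals[-1])
    # The true eigenvalues are all 1; the average largest eigenvalue drifts
    # upward and the smallest toward 0 as p approaches and exceeds n.
    print(f"p = {p:3d}: avg largest = {np.mean(largest):.2f}, "
          f"avg smallest = {np.mean(smallest):.2f}")
```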

Figure 1: The averages of the largest and smallest eigenvalues of covariance matrix estimates over the dimension p. The sample size is 50 and the true eigenvalues are all equal to 1.

Many alternative estimates of Σ or Σ^{-1} have been proposed from different directions. One direction is to seek sparsity in Σ^{-1} to improve the estimation accuracy and to explore the structure of the Gaussian graphical model (Cox and Wermuth, 1996). Research in this direction includes Meinshausen and Buhlmann (2006), Banerjee et al. (2006), Huang et al. (2006), Yuan and Lin (2007), Friedman et al. (2008), Rajaratnam et al. (2008), Levina et al. (2008), and Peng et al. (2009). Bickel and Levina (2008a) and Rothman et al. (2009) proposed thresholding methods for estimating Σ with attractive theoretical properties. Fan et al. (2008) developed a factor model method for estimating both Σ and its inverse. Another direction is to construct shrinkage estimates of the covariance matrix. Early work was developed to shrink the eigenvalues of the sample covariance matrix (Dey and Srinivasan, 1985; Haff, 1991). Ledoit and Wolf (2004) considered an estimate of Σ that is a linear combination of the sample covariance matrix and a pre-chosen matrix. Recently, Won et al. (2009) proposed what they call a well-conditioned estimate of Σ by encouraging its condition number to be bounded.

Recall that the condition number of a covariance matrix is its largest eigenvalue divided by its smallest eigenvalue. Note that these methods retain the use of the eigenvectors of S in estimating Σ or Σ^{-1}. It is known that the eigenvectors of S are not consistent as p increases (Johnstone and Lu, 2009). Thus, shrinking only the eigenvalues of S may not be sufficient to obtain a reasonable and accurate estimate of Σ.

A useful and accurate covariance matrix estimate must be positive definite, and this mathematical restriction makes the covariance matrix estimation problem challenging. Note that the covariance matrix Σ can be expressed as the matrix exponential of a real symmetric matrix A; that is, A is the matrix logarithm of the covariance matrix. More properties of the matrix logarithm can be found in Chiu et al. (1996). Expressing the likelihood function in terms of A removes the positive-definiteness restriction: an estimate of the matrix logarithm automatically guarantees that the resulting covariance matrix is positive definite. Leonard and Hsu (1992) used this transformation to estimate the covariance matrix via a Bayesian approach. Their method, however, relies on the availability of a nonsingular sample covariance matrix, and hence is not applicable when the dimension of the sample vectors is larger than the sample size, where the sample covariance matrix becomes singular. We propose a significant generalization of the Leonard and Hsu result such that the sample covariance matrix no longer needs to be nonsingular. Using an appropriate penalty function together with this generalization, we show that one can simultaneously regularize the largest and smallest eigenvalues of the covariance matrix estimate, resulting in a well-conditioned covariance matrix estimate. After transforming the problem of estimating a covariance matrix into the problem of estimating a real symmetric matrix A, we develop an efficient iterative quadratic programming algorithm to solve the transformed problem.

The remainder of this article is organized as follows. Section 2 describes how the matrix logarithm transformation is used for the log-likelihood function. Section 3 details the development of our proposed Log-ME method. Section 4 presents a simulation study that brings out the merits of the Log-ME method. In Section 5, we present two real applications of the Log-ME covariance matrix estimate: one in classification and one in portfolio optimization. We conclude this work with some discussion in Section 6.

2 Log-likelihood using Matrix Log-Transformation

Consider the spectral decomposition of the covariance matrix Σ = T D T', where D = diag(d_1, ..., d_p) is the diagonal matrix of the eigenvalues of Σ, and T is an orthonormal matrix consisting of the eigenvectors of Σ. Assume that d_1 ≥ d_2 ≥ ... ≥ d_p > 0. Let A = (a_{ij})_{p×p} = log(Σ) be the matrix logarithm of Σ; that is, Σ = \sum_{k=0}^∞ A^k / k! ≡ exp(A), where exp(A) is called the matrix exponential of A. Then

A = T M T',    (2.1)

where M is the diagonal matrix diag(log(d_1), ..., log(d_p)). In terms of A, the negative log-likelihood function in (1.1) becomes

L_n(A) = tr(A) + tr[exp(-A) S].    (2.2)
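As a small illustration (ours, not part of the paper), the sketch below constructs A = log(Σ) from the spectral decomposition in (2.1) and checks numerically that the negative log-likelihoods (1.1) and (2.2) coincide.

```python
import numpy as np
from scipy.linalg import expm, logm

rng = np.random.default_rng(1)
p, n = 4, 200

# A positive definite "true" covariance and a sample covariance S.
q = rng.standard_normal((p, p))
sigma = q @ q.T + p * np.eye(p)
x = rng.multivariate_normal(np.zeros(p), sigma, size=n)
s = x.T @ x / n

# Matrix logarithm via the spectral decomposition (2.1): A = T M T'.
d, t = np.linalg.eigh(sigma)              # Sigma = T diag(d) T'
a = t @ np.diag(np.log(d)) @ t.T          # A = T diag(log d) T'
print(np.max(np.abs(expm(a) - sigma)))    # exp(A) recovers Sigma
print(np.max(np.abs(a - logm(sigma))))    # agrees with scipy's matrix logarithm

# Negative log-likelihood (1.1) in terms of Sigma equals (2.2) in terms of A.
sigma_inv = np.linalg.inv(sigma)
loss_sigma = -np.log(np.linalg.det(sigma_inv)) + np.trace(sigma_inv @ s)
loss_a = np.trace(a) + np.trace(expm(-a) @ s)
print(loss_sigma, loss_a)                 # identical up to rounding error
```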

A major advantage of using the matrix logarithm transformation is that it converts the problem of estimating a positive definite matrix Σ into a problem of estimating a real symmetric matrix A. The estimates of Σ and Σ^{-1} can then be obtained through the matrix exponential of A. Because of the matrix exponential term in (2.2), estimating A by directly minimizing L_n(A) in (2.2) is nontrivial. To circumvent this challenge, we consider an approximation to the second term in (2.2) by using the Volterra integral equation (Bellman, 1970, page 175),

exp(At) = exp(A_0 t) + ∫_0^t exp(A_0 (t - s)) (A - A_0) exp(As) ds,    0 < t < ∞.    (2.3)

As in Leonard and Hsu (1992), we repeatedly apply (2.3) to obtain

exp(At) = exp(A_0 t) + ∫_0^t exp(A_0 (t - s)) (A - A_0) exp(A_0 s) ds
        + ∫_0^t ∫_0^s exp(A_0 (t - s)) (A - A_0) exp(A_0 (s - u)) (A - A_0) exp(A_0 u) du ds
        + cubic and higher order terms,    (2.4)

where A_0 = log(Σ_0) and Σ_0 is an initial estimate of Σ. When p is close to or larger than n, the sample covariance matrix S is ill-conditioned and its matrix logarithm is not well defined, so S cannot be used as a proper initial estimate Σ_0. Our equation (2.4) thus differs from that of Leonard and Hsu (1992), who used Σ_0 = S as the initial estimate of Σ. Taking t = 1 in (2.4) and substituting -A and -A_0 for A and A_0, we have

exp(-A) = Σ_0^{-1} - ∫_0^1 Σ_0^{s-1} (A - A_0) Σ_0^{-s} ds
        + ∫_0^1 ∫_0^s Σ_0^{s-1} (A - A_0) Σ_0^{u-s} (A - A_0) Σ_0^{-u} du ds
        + cubic and higher order terms.    (2.5)
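The expansion (2.5) can be checked numerically. The sketch below (ours, for illustration only) approximates the single and double integrals by midpoint Riemann sums and compares the result with exp(-A) computed directly, for an A close to the initial value A_0; the discrepancy is of the order of the omitted cubic terms.

```python
import numpy as np
from scipy.linalg import expm, logm

rng = np.random.default_rng(2)
p, m = 3, 200                                  # dimension, grid points per axis

q = rng.standard_normal((p, p))
sigma0 = q @ q.T + p * np.eye(p)               # initial estimate Sigma_0
a0 = logm(sigma0)                              # A_0 = log(Sigma_0)
z = rng.standard_normal((p, p))
delta = 0.05 * (z + z.T)                       # A - A_0, a small symmetric perturbation
a = a0 + delta

d, t = np.linalg.eigh(sigma0)
def sig_pow(power):
    """Sigma_0 raised to a real power, via its eigendecomposition."""
    return t @ np.diag(d ** power) @ t.T

grid = (np.arange(m) + 0.5) / m                # midpoint rule on (0, 1)
single = np.zeros((p, p))
double = np.zeros((p, p))
for s in grid:
    single += sig_pow(s - 1) @ delta @ sig_pow(-s) / m
    for u in grid[grid < s]:                   # inner integral over 0 < u < s
        double += (sig_pow(s - 1) @ delta @ sig_pow(u - s)
                   @ delta @ sig_pow(-u)) / m**2

approx = np.linalg.inv(sigma0) - single + double    # right-hand side of (2.5)
print(np.max(np.abs(approx - expm(-a))))            # small for A near A_0
```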

Therefore,

tr[exp(-A) S] = tr(S Σ_0^{-1}) - ∫_0^1 tr[(A - A_0) Σ_0^{-s} S Σ_0^{s-1}] ds
              + ∫_0^1 ∫_0^s tr[(A - A_0) Σ_0^{u-s} (A - A_0) Σ_0^{-u} S Σ_0^{s-1}] du ds
              + cubic and higher order terms.    (2.6)

By leaving out the cubic and higher order terms in (2.6), we approximate L_n(A) in (2.2) by l_n(A), given by

l_n(A) = tr(S Σ_0^{-1}) - ( ∫_0^1 tr[(A - A_0) Σ_0^{-s} S Σ_0^{s-1}] ds - tr(A) )
       + ∫_0^1 ∫_0^s tr[(A - A_0) Σ_0^{u-s} (A - A_0) Σ_0^{-u} S Σ_0^{s-1}] du ds.    (2.7)

We will show later in Section 3 that our Log-ME method using (2.7) provides an estimate of A that gives the same value for l_n(A) in (2.7) and L_n(A) in (2.2).

The integrations in (2.7) can be solved analytically through the spectral decomposition of Σ_0 = T_0 D_0 T_0'. Here D_0 = diag(d_1^{(0)}, ..., d_p^{(0)}), with the d_i^{(0)}'s the eigenvalues of Σ_0, and T_0 = (t_1^{(0)}, ..., t_p^{(0)}), with t_i^{(0)} the eigenvector corresponding to d_i^{(0)}. Define B = T_0' (A - A_0) T_0 = (b_{ij})_{p×p} and S̃ = T_0' S T_0 = (s̃_{ij})_{p×p}. First, we obtain

∫_0^1 tr[(A - A_0) Σ_0^{-s} S Σ_0^{s-1}] ds - tr(A) = ∫_0^1 tr[B D_0^{-s} S̃ D_0^{s-1}] ds - tr(A)
                                                    = Const + \sum_{i=1}^p β_{ii} b_{ii} + 2 \sum_{i<j} β_{ij} b_{ij}.    (2.8)
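As an aside (ours, not from the paper), the change of basis behind (2.8) can be verified numerically: for any s, tr[(A - A_0) Σ_0^{-s} S Σ_0^{s-1}] = tr[B D_0^{-s} S̃ D_0^{s-1}]. The closed-form coefficients β_ij are not computed here, since their expressions appear later in the paper.

```python
import numpy as np
from scipy.linalg import logm

rng = np.random.default_rng(3)
p, n = 4, 60

# An initial estimate Sigma_0, its eigen-system, and a sample covariance S.
q = rng.standard_normal((p, p))
sigma0 = q @ q.T + p * np.eye(p)
d0, t0 = np.linalg.eigh(sigma0)               # Sigma_0 = T_0 diag(d0) T_0'
a0 = logm(sigma0)
x = rng.standard_normal((n, p))
s = x.T @ x / n

z = rng.standard_normal((p, p))
a = a0 + 0.1 * (z + z.T)                      # some symmetric candidate A

b = t0.T @ (a - a0) @ t0                      # B = T_0'(A - A_0)T_0
s_tilde = t0.T @ s @ t0                       # S-tilde = T_0' S T_0

def sig_pow(power):
    return t0 @ np.diag(d0 ** power) @ t0.T

for s_val in (0.2, 0.5, 0.9):
    lhs = np.trace((a - a0) @ sig_pow(-s_val) @ s @ sig_pow(s_val - 1))
    rhs = np.trace(b @ np.diag(d0 ** -s_val) @ s_tilde @ np.diag(d0 ** (s_val - 1)))
    print(f"s = {s_val}: {lhs:.6f} vs {rhs:.6f}")   # equal up to round-off
```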

Appendix

The classification rule assigns Y = 1 if a^T x - b > 0, and Y = -1 otherwise. Then the conditional misclassification error is

E(I(Y ≠ g(x))) = P(Y ≠ g(x))
               = P(g(x) = 1 | Y = -1) P(Y = -1) + P(g(x) = -1 | Y = 1) P(Y = 1)
               = P(a^T x - b > 0 | Y = -1) π_2 + P(a^T x - b ≤ 0 | Y = 1) π_1,

where π_1 = P(Y = 1) and π_2 = P(Y = -1).

Since x | Y = 1 ∼ N(µ_1, Σ) and x | Y = -1 ∼ N(µ_2, Σ), it follows that a^T x | Y = 1 ∼ N(a^T µ_1, a^T Σ a) and a^T x | Y = -1 ∼ N(a^T µ_2, a^T Σ a). Therefore,

P(a^T x - b > 0 | Y = -1) = Φ( (a^T µ_2 - b) / \sqrt{a^T Σ a} ),
P(a^T x - b ≤ 0 | Y = 1) = Φ( -(a^T µ_1 - b) / \sqrt{a^T Σ a} ),

where Φ(·) is the cumulative distribution function of the standard normal distribution. Then,

E(I(Y ≠ g(x))) = π_2 Φ( (a^T µ_2 - b) / \sqrt{a^T Σ a} ) + π_1 Φ( -(a^T µ_1 - b) / \sqrt{a^T Σ a} ).

Without loss of generality, assume π_1 = π_2 = 1/2. With a and b estimated through Σ̂, µ̂_1, and µ̂_2, the conditional misclassification error γ(Σ̂, µ̂_1, µ̂_2) is

γ(Σ̂, µ̂_1, µ̂_2) = E(I(Y ≠ g(x)))
  = (1/2) Φ( [ (µ̂_1 - µ̂_2)^T Σ̂^{-1} µ_2 - (1/2)(µ̂_1 + µ̂_2)^T Σ̂^{-1} (µ̂_1 - µ̂_2) ] / \sqrt{ (µ̂_1 - µ̂_2)^T Σ̂^{-1} Σ Σ̂^{-1} (µ̂_1 - µ̂_2) } )
  + (1/2) Φ( -[ (µ̂_1 - µ̂_2)^T Σ̂^{-1} µ_1 - (1/2)(µ̂_1 + µ̂_2)^T Σ̂^{-1} (µ̂_1 - µ̂_2) ] / \sqrt{ (µ̂_1 - µ̂_2)^T Σ̂^{-1} Σ Σ̂^{-1} (µ̂_1 - µ̂_2) } ).    (6.1)
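A small sketch (ours) of how (6.1) can be evaluated for given estimates, using scipy's standard normal CDF for Φ; the function name is ours.

```python
import numpy as np
from scipy.stats import norm

def lda_misclassification(sigma_hat, mu1_hat, mu2_hat, sigma, mu1, mu2):
    """Conditional misclassification error (6.1) of the linear rule built from
    the estimates, evaluated under the true N(mu_k, Sigma) class distributions
    with equal priors pi_1 = pi_2 = 1/2."""
    prec = np.linalg.inv(sigma_hat)
    a = prec @ (mu1_hat - mu2_hat)                 # a = Sigma-hat^{-1}(mu1-hat - mu2-hat)
    b = 0.5 * (mu1_hat + mu2_hat) @ prec @ (mu1_hat - mu2_hat)
    denom = np.sqrt(a @ sigma @ a)                 # sqrt(a' Sigma a)
    return 0.5 * norm.cdf((a @ mu2 - b) / denom) + 0.5 * norm.cdf(-(a @ mu1 - b) / denom)

# With perfect estimates the error reduces to Phi(-Delta/2), where Delta is the
# Mahalanobis distance between the two class means.
p = 5
sigma = np.eye(p)
mu1, mu2 = 0.5 * np.ones(p), -0.5 * np.ones(p)
print(lda_misclassification(sigma, mu1, mu2, sigma, mu1, mu2))
print(norm.cdf(-0.5 * np.sqrt((mu1 - mu2) @ np.linalg.inv(sigma) @ (mu1 - mu2))))
```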

References

Banerjee, O., d'Aspremont, A., and Natsoulis, G. (2006). Convex optimization techniques for fitting sparse Gaussian graphical models. Proceedings of the 23rd International Conference on Machine Learning, 89–96.

Bellman, R. (1970). Introduction to Matrix Analysis. New York: McGraw-Hill.

Bickel, P. J. and Levina, E. (2008a). Covariance regularization by thresholding. Annals of Statistics, 36, 2577–2604.

Bickel, P. J. and Levina, E. (2008b). Regularized estimation of large covariance matrices. Annals of Statistics, 36, 199–227.

Blake, C. L., Newman, D. J., Hettich, S., and Merz, C. J. (1998). UCI repository of machine learning databases.

Chiu, T. Y. M., Leonard, T., and Tsui, K. W. (1996). The matrix-logarithmic covariance model. Journal of the American Statistical Association, 91, 198–210.

Cox, D. R. and Wermuth, N. (1996). Multivariate Dependencies: Models, Analysis and Interpretation. London: Chapman & Hall.

Dey, D. K. and Srinivasan, C. (1985). Estimation of a covariance matrix under Stein's loss. Annals of Statistics, 13, 1581–1591.

Fan, J., Fan, Y., and Lv, J. (2008). High dimensional covariance matrix estimation using a factor model. Journal of Econometrics, 147, 186–197.

Friedman, J., Hastie, T., and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9, 432–441.

Fu, W. (1998). Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics, 7, 397–416.

Haff, L. R. (1991). The variational form of certain Bayes estimators. Annals of Statistics, 19, 1163–1190.

Huang, J. Z., Liu, N., Pourahmadi, M., and Liu, L. (2006). Covariance matrix selection and estimation via penalised normal likelihood. Biometrika, 93, 85–98.

Johnstone, I. M. (2001). On the distribution of the largest eigenvalue in principal components analysis. Annals of Statistics, 29, 295–327.

Johnstone, I. M. and Lu, A. Y. (2009). On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association, 104, 682–693.

Ledoit, O. and Wolf, M. (2004). A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88, 365–411.

Leonard, T. and Hsu, J. S. J. (1992). Bayesian inference for a covariance matrix. Annals of Statistics, 20, 1669–1696.

Levina, E., Rothman, A. J., and Zhu, J. (2008). Sparse estimation of large covariance matrices via a nested lasso penalty. Annals of Applied Statistics, 2, 245–263.

Markowitz, H. (1952). Portfolio selection. Journal of Finance, 7, 77–91.

Meinshausen, N. and Buhlmann, P. (2006). High dimensional graphs and variable selection with the lasso. Annals of Statistics, 34, 1436–1462.

Peng, J., Wang, P., Zhou, N., and Zhu, J. (2009). Partial correlation estimation by joint sparse regression models. Journal of the American Statistical Association, 104, 735–746.

Rajaratnam, B., Massam, H., and Carvalho, C. (2008). Flexible covariance estimation in graphical Gaussian models. Annals of Statistics, 36, 2818–2849.

Rothman, A. J., Levina, E., and Zhu, J. (2009). Generalized thresholding of large covariance matrices. Journal of the American Statistical Association, 104, 177–186.

Sigillito, V. G., Wing, S. P., Hutton, L. V., and Baker, K. B. (1989). Classification of radar returns from the ionosphere using neural networks. Johns Hopkins APL Technical Digest, 262–266.

Stein, C. (1975). Estimation of a covariance matrix. Rietz Lecture, IMS-ASA Annual Meeting.

Won, J. H., Lim, J., Kim, S. J., and Rajaratnam, B. (2009). Maximum likelihood covariance estimation with a condition number constraint. Technical Report, Department of Statistics, Stanford University.

Yuan, M. and Lin, Y. (2007). Model selection and estimation in the Gaussian graphical model. Biometrika, 94, 19–35.
