JOURNAL OF CHEMOMETRICS J. Chemometrics 2001; 15: 135–148

Generalized PLS regression

Qing-Song Xu¹, Yi-Zeng Liang²* and Hai-Lin Shen²

¹ Institute of Chemometrics and Chemical Sensing Technology, Department of Applied Mathematics, Hunan University, Changsha 410082, People's Republic of China
² Institute of Chemometrics and Chemical Sensing Technology, College of Chemistry and Chemical Engineering, Hunan University, Changsha 410082, People's Republic of China

SUMMARY
The present paper develops a class of generalized partial least squares (GPLS) regression methods. GPLS can be regarded as a kind of weighted partial least squares regression method. Two special cases of them, ridge partial least squares (RPLS) and generalized ridge partial least squares (GRPLS) regression methods, are discussed in detail. RPLS and GRPLS combine partial least squares (PLS) with ridge regression (RR) and generalized ridge regression (GRR) respectively. It is shown that the estimated coefficient vectors by RPLS and GRPLS are shrunken PLS estimators and their prediction power is not so sensitive to the components included in the model compared to PLS. The four methods RR, PLS, RPLS and GRPLS are compared on the basis of three data sets under the criteria of prediction residual error sum of squares (PRESS) and mean squared error of prediction (MSEP). Copyright © 2001 John Wiley & Sons, Ltd.

KEY WORDS: generalized partial least squares; ridge partial least squares regression; generalized ridge partial least squares regression; prediction residual error sum of squares; mean squared error of prediction

1. INTRODUCTION

Many types of chemical data analysis are fulfilled with the help of regression methods. These analyses usually consist of two steps. The first step is to establish a model or a relationship between two groups of variables, dependent y and independent x, i.e. y = f(x). The data sets used for this step are training sets. The second step is to predict values for the dependent variables. The data sets used for this step are test sets. The most popular model is the multilinear regression model, which may be stated as

$$y = Xb + e, \qquad E(e) = 0, \qquad \mathrm{Cov}(e) = \sigma^2 I \qquad (1)$$

where b is the regression coefficient vector, X is the observation matrix, e is the random error vector, y is the response vector and I is the identity matrix. Suppose that X is an n × m matrix, y and e are n × 1 vectors and b is an m × 1 vector. The rank of X is q. E(⋅) and Cov(⋅) denote the expectation and covariance respectively. The estimator of the regression coefficient vector is obtained by least squares (LS):

* Correspondence to: Y.-Z. Liang, Institute of Chemometrics and Chemical Sensing Technology, College of Chemistry and Chemical Engineering, Hunan University, Changsha 410082, People’s Republic of China. E-mail: [email protected] Contract/grant sponsor: National Natural Science Foundation of the People’s Republic of China.


Received 15 December 1998; Accepted 9 March 2000


$$\hat{b}_L = X^{+} y \qquad (2)$$

where $X^{+}$ denotes the Moore–Penrose inverse of matrix X. Typical chemical data tend to be characterized by many independent variables on relatively fewer observations (m > n), or more generally, rank(X) < min(n, m). There is high collinearity among the independent variables. It is well known that in this situation the estimator of the regression coefficient vector (Equation (2)) by LS is unstable, leading to poor prediction accuracy [1]. Statisticians have developed many methods to deal with collinearity in data sets. Ridge regression (RR) [2], generalized ridge regression (GRR) [3], principal component regression (PCR) [4] and latent root regression [5] are some of them. Partial least squares (PLS) is another alternative method. The PLS regression method was pioneered by Wold [6] in the field of econometrics. The use of the PLS method for chemical applications was first proposed by Kowalski et al. [7]. Since then, a large amount of literature on the PLS regression method and its applications in chemistry has emerged. The PLS regression method is now one of the most popular methods used in chemometrics.

The principal goal of PLS, PCR and RR is to shrink the solution coefficient vector from the LS solution [8] in order to control the variance of the estimators of b. PCR is similar to PLS in constructing new variables (components); each new component is a linear combination of x1, x2, …, xm, where xi is the ith column of data matrix X. Both of them shrink the estimated coefficient vector by leaving some of the components out of the model. The major difference between PCR and PLS is that PCR's components are determined only by X, whereas PLS's components are determined by both X and y. RR and GRR shrink the estimated coefficient vector away from the LS solution by introducing one ridge parameter and more ridge parameters into the matrix $X^T X$ respectively.

There has been some recent progress with respect to PLS. Continuum regression (CR) [9] involves a single algorithm that can yield LS, PLS, PCR and other intermediate estimators. References [10,11] found a close relationship between CR and RR. The newly developed cyclic subspace regression (CSR) [12–14] and continuum cyclic subspace regression (CCSR) [15], a slight modification of CSR, also involve a single algorithm that can yield LS, PLS, PCR and other intermediate estimators. The differences and similarities between PCR, PLS, RR, GRR, CSR and CCSR estimated coefficient vectors have been explored according to the weights of the eigenvectors of matrix $X^T X$. Other recent developments of PLS regression methods can be found in References [16–18].

The purpose of the present paper is to develop a generalized PLS (GPLS) regression method. The estimator of the coefficient vector in this method can be regarded as a weighted estimator of all PLS components. The contribution of every PLS component to GPLS is clearly exposed in concise form. Thus the statistical properties of this method are easily shown. Two special cases of this method are discussed in detail. These two methods combine PLS with RR and GRR methods respectively. A similar idea has been used for PCR [19,20]. It is shown that the estimated coefficient vectors by RPLS and GRPLS are shrunken PLS estimators and are more stable compared to PLS. The four methods RR, PLS, RPLS and GRPLS are compared on the basis of three data sets. The prediction residual error sum of squares (PRESS) [21,22] is used as the criterion to determine the optimal parameters.
The mean squared error of prediction (MSEP) is used as the criterion for evaluation.

2. THEORY AND METHODS

2.1. PLS algebra

The linear model (1) is considered. Taking the notation of Reference [18],

$$X = \sum_{i=1}^{q} t_i p_i^T = T P^T \qquad (3)$$


where $T = [t_1, t_2, \ldots, t_q]$ and $P = [p_1, p_2, \ldots, p_q]$. The vectors $w_i$ and $t_i$ and matrices $X_i$ (i = 1, 2, …, q) satisfy

$$t_i = X_i w_i, \qquad X_{i+1} = X_i - t_i p_i^T$$

Consider the relation between $t_i$ and X:

$$t_1 = X_1 w_1 = X h_1$$
$$t_i = X_i w_i = X_{i-1}(I - w_{i-1} p_{i-1}^T) w_i = \cdots = X_1 (I - w_1 p_1^T)(I - w_2 p_2^T) \cdots (I - w_{i-1} p_{i-1}^T) w_i = X h_i$$

where $h_1 = w_1$ and $h_i = (I - w_1 p_1^T)(I - w_2 p_2^T) \cdots (I - w_{i-1} p_{i-1}^T) w_i$ for $i \ge 2$. Letting $H = (h_1, h_2, \ldots, h_q)$, then

$$T = XH \qquad (4)$$

Inserting Equation (3) into Equation (1),

$$y = T P^T b + e = T \alpha + e \qquad (5)$$

where $\alpha = P^T b$. The least squares solution of (5) is

$$\hat{\alpha} = (T^T T)^{-1} T^T y$$

The fitted value of y is

$$\hat{y} = T \hat{\alpha} = X H (T^T T)^{-1} H^T X^T y \qquad (6)$$

Therefore, if all the PLS components are introduced into the model, the PLS estimator of b is

$$\hat{b}_p = H (T^T T)^{-1} H^T X^T y \qquad (7)$$

Then $\hat{b}_p$ is equivalent to the LS estimator $\hat{b}_L$ (see Equation (20)). The mean squared error (MSE) of $\hat{b}_p$ is

$$\mathrm{MSE}(\hat{b}_p) = E(\hat{b}_p - b)^T(\hat{b}_p - b) = E\,\mathrm{Tr}\big((\hat{b}_p - E(\hat{b}_p))(\hat{b}_p - E(\hat{b}_p))^T\big) + (E(\hat{b}_p) - b)^T(E(\hat{b}_p) - b) = \sigma^2 \sum_{i=1}^{q} h_i^T h_i / t_i^T t_i + \|E(\hat{b}_p) - b\|^2 \qquad (8)$$

where Tr(⋅) denotes the trace of a matrix and ‖⋅‖ denotes the norm of a vector. For a matrix A, ‖A‖ denotes the spectral norm, i.e. ‖A‖ = $\sigma_{\max}(A)$, where $\sigma_{\max}(A)$ is the largest singular value of A. The first term of Equation (8) is the variance of $\hat{b}_p$ and the second is the squared bias of $\hat{b}_p$.
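To make the algebra above concrete, the following minimal NumPy sketch (ours, not the authors'; all function and variable names are illustrative) builds the PLS1 quantities $w_i$, $t_i$, $p_i$ and a matrix H with T = XH, and evaluates the estimator of Equation (7). The standard rotation form H = W(PᵀW)⁻¹ is used in place of the recursive product form of the $h_i$; both satisfy T = XH.

```python
import numpy as np

def pls_decompose(X, y, n_comp):
    """Minimal NIPALS-style PLS1 decomposition (illustrative sketch).
    Returns scores T, loadings P, weights W and H such that T = X @ H."""
    n, m = X.shape
    Xi = X.astype(float).copy()
    W = np.zeros((m, n_comp))
    P = np.zeros((m, n_comp))
    T = np.zeros((n, n_comp))
    for i in range(n_comp):
        w = Xi.T @ y                      # PLS1 weight direction for component i
        w = w / np.linalg.norm(w)
        t = Xi @ w                        # score t_i = X_i w_i
        p = Xi.T @ t / (t @ t)            # loading p_i
        Xi = Xi - np.outer(t, p)          # deflation: X_{i+1} = X_i - t_i p_i^T
        W[:, i], P[:, i], T[:, i] = w, p, t
    H = W @ np.linalg.inv(P.T @ W)        # rotation form of H; satisfies T = X @ H
    return T, P, W, H

def pls_estimator(X, y, n_comp):
    """PLS estimator b_p = H (T^T T)^{-1} H^T X^T y (Equation (7); Equation (13) when n_comp < q)."""
    T, P, W, H = pls_decompose(X, y, n_comp)
    return H @ np.linalg.solve(T.T @ T, H.T @ (X.T @ y))
```

With n_comp equal to the rank of X this reproduces the LS solution of Equation (20); with fewer components it gives the truncated estimator of Equation (13) below.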


Although it is possible to calculate as many PLS components as q (the rank of matrix X), not all of them are needed. The reason for this may be that the observation data are often polluted by noise, and some of the PLS components with small variance ($t_i^T t_i \approx 0$) describe noise. On the other hand, if there is high collinearity in the data set, some of the PLS components also have very small variance. From Equation (8) it is easy to understand that only one PLS component with small variance would make $\mathrm{MSE}(\hat{b}_p)$ very large and destroy the estimator $\hat{b}_p$. To tackle this problem, it is usual to leave out the components with very small variance from the model so that the noise can be removed and the collinearity of the data set can be reduced, i.e. greatly decrease $\mathrm{MSE}(\hat{b}_p)$ with little increase in bias.

When the number of PLS components introduced into the model is k, the matrices T, P and H can be partitioned as follows:

$$T = [t_1, t_2, \ldots, t_k \,\vdots\, t_{k+1}, t_{k+2}, \ldots, t_q] = [T_1 \,\vdots\, T_2]$$
$$P = [p_1, p_2, \ldots, p_k \,\vdots\, p_{k+1}, p_{k+2}, \ldots, p_q] = [P_1 \,\vdots\, P_2]$$
$$H = [h_1, h_2, \ldots, h_k \,\vdots\, h_{k+1}, h_{k+2}, \ldots, h_q] = [H_1 \,\vdots\, H_2] \qquad (9)$$

Inserting Equation (9) into Equation (1),

$$y = T_1 P_1^T b + T_2 P_2^T b + e$$

Letting $\alpha^T = [\alpha_1^T, \alpha_2^T] = [(P_1^T b)^T, (P_2^T b)^T]$, then

$$y = T_1 \alpha_1 + T_2 \alpha_2 + e \qquad (10)$$

The latter q − k PLS components have very small variance and are considered as representative of noise or the cause of collinearity in the data set. Therefore only the reduced-dimension model is taken into account:

$$y = T_1 \alpha_1 + e \qquad (11)$$

The least squares solution of Equation (11) is

$$\hat{\alpha}_1 = (T_1^T T_1)^{-1} T_1^T y \qquad (12)$$

The PLS estimator $\hat{b}_p$ of b with the latter q − k components left out of the model can be written as

$$\hat{b}_p = H_1 (T_1^T T_1)^{-1} T_1^T y \qquad (13)$$

and

$$\|\hat{b}_p\| = \|H_1 (T_1^T T_1)^{-1} H_1^T X^T X (X^T X)^{+} X^T y\| \le \|H_1 (T_1^T T_1)^{-1} H_1^T X^T X\| \, \|\hat{b}_L\| \le \|\hat{b}_L\| \qquad (14)$$

Equation (14) shows that $\hat{b}_p$ is a shrunken estimator of $\hat{b}_L$.
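A quick numerical check of the shrinkage property of Equation (14), reusing the pls_estimator sketch from Section 2.1 (the synthetic, nearly collinear data below are purely illustrative):

```python
import numpy as np

# Assumes pls_estimator() from the earlier sketch is in scope.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 8))
X[:, 7] = X[:, 0] + 1e-3 * rng.normal(size=20)   # make two columns nearly collinear
y = X @ rng.normal(size=8) + 0.1 * rng.normal(size=20)

q = np.linalg.matrix_rank(X)
b_ls = pls_estimator(X, y, q)                    # all q components: the LS solution, Eq. (20)
for k in range(1, q + 1):
    b_k = pls_estimator(X, y, k)                 # truncated PLS estimator, Eq. (13)
    print(k, np.linalg.norm(b_k), "<=", np.linalg.norm(b_ls))   # shrinkage, Eq. (14)
```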

2.2. Generalized PLS

Although the PLS estimator shrinks the LS estimator in general (Equation (14)), some elements of the LS estimator $\hat{b}_L$ are inflated [8]. The inflated elements can be highly detrimental because they increase both the squared bias and the variance of the model.

The performance will be better if some suitable adjustments are made to the PLS estimator. For model (1) the generalized PLS (GPLS) estimator is defined by

$$\hat{b}_g(R) = H R H^T X^T y = \sum_{i=1}^{q} r_i h_i h_i^T X^T y = \sum_{i=1}^{q} r_i \tilde{b}_i \qquad (15)$$

where

$$R = \mathrm{diag}[r_1, r_2, \ldots, r_q], \qquad \tilde{b}_i = h_i h_i^T X^T y \qquad (16)$$

diag[⋅] denotes a diagonal matrix and $r_i$ (i = 1, 2, …, q) are non-negative constants. The covariance matrix of $\tilde{b}_i$ and $\tilde{b}_j$ is

$$\mathrm{Cov}(\tilde{b}_i, \tilde{b}_j) = \begin{cases} 0, & i \ne j \\ \sigma^2 t_i^T t_i\, h_i h_i^T, & i = j \end{cases} \qquad (17)$$

Therefore $\tilde{b}_i$ (i = 1, 2, …, q) are mutually uncorrelated vectors. $\hat{b}_g(R)$ is a linear combination of the uncorrelated vectors $\tilde{b}_i$ (i = 1, 2, …, q). Every $\tilde{b}_i$ (i = 1, 2, …, q) depends only on the ith PLS component, and $r_i$ is its weight coefficient in $\hat{b}_g(R)$. The mean squared error of $\hat{b}_g(R)$ is

$$\mathrm{MSE}(\hat{b}_g(R)) = \sigma^2 \left( \sum_{i=1}^{q} r_i^2\, h_i^T h_i\, t_i^T t_i \right) + \|E(\hat{b}_g(R)) - b\|^2 \qquad (18)$$

The weight $r_i$ controls the effect of the ith PLS component on the MSE of $\hat{b}_g(R)$. The larger the value of $r_i$, the stronger is the effect of the ith component on $\hat{b}_g(R)$. Different matrices R correspond to different linear estimators $\hat{b}_g(R)$. Thus new estimators can be constructed according to some optimal criteria in practice. If $r_i = 1/t_i^T t_i$ for i = 1, 2, …, k and $r_i = 0$ for $i \ge k+1$, then

$$\hat{b}_g(R) = H_1 (T_1^T T_1)^{-1} T_1^T y \qquad (19)$$

Equation (19) shows that the PLS estimator $\hat{b}_p$ is a particular case of the generalized PLS estimator. This is the reason why the estimator $\hat{b}_g(R)$ is called the generalized PLS estimator. It should be pointed out that some other linear estimators can be included in the generalized PLS estimator. The LS solution of b in model (1) is the same as the PLS regression estimator with no PLS components left out of the model, i.e.

$$\hat{b}_L = X^{+} y = H (T^T T)^{-1} H^T X^T y \qquad (20)$$

$\hat{b}_L$ is a special case of $\hat{b}_g(R)$ with $r_i = 1/t_i^T t_i$, i = 1, 2, …, q. The Stein shrinkage estimator [23] of b in model (1) is

$$\tilde{b}_s = c\, \hat{b}_L \qquad (21)$$

corresponding to $r_i = c/t_i^T t_i$, i = 1, 2, …, q, where c is a constant. The Appendix shows that $\hat{b}_g(R)$ is a shrunken estimator for all $0 \le r_i \le 1/t_i^T t_i$, i = 1, 2, …, q.
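As a sketch of how Equation (15) can be evaluated (again reusing the pls_decompose helper above; all names are illustrative, not the authors'), the whole GPLS family reduces to a choice of the weight vector r:

```python
import numpy as np

def gpls_estimator(H, T, X, y, r):
    """Generalized PLS estimator b_g(R) = H R H^T X^T y (Equation (15)).
    H and T come from a PLS decomposition (e.g. the pls_decompose sketch above);
    r holds the non-negative weights r_1, ..., r_q."""
    return H @ (np.asarray(r) * (H.T @ (X.T @ y)))   # diag(r) applied elementwise

# Illustrative weight choices, with t2 = np.sum(T**2, axis=0) the variances t_i^T t_i:
#   PLS with k components : r_i = 1/t2_i for i <= k, else 0        (Equation (19))
#   LS solution           : r_i = 1/t2_i for all i                 (Equation (20))
#   Stein shrinkage       : r_i = c/t2_i for a constant c          (Equation (21))
```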

2.3. Ridge PLS regression

The ordinary ridge regression estimator of b for model (1) is [2]

$$\hat{b}_R = (X^T X + dI)^{-1} X^T y \qquad (22)$$

In References [19,20] it was suggested that PCR could be combined with RR by introducing a ridge parameter into the PCR estimator:

$$\hat{b} = V_1 (L_1 + dI)^{-1} V_1^T X^T y \qquad (23)$$

where $V_1 = [v_1, v_2, \ldots, v_p]$, $v_i$ is the ith eigenvector of matrix $X^T X$, $L_1 = \mathrm{diag}[\lambda_1, \lambda_2, \ldots, \lambda_p]$, $\lambda_i$ is the eigenvalue corresponding to $v_i$, and p is the number of components included in the model. This approach could produce better results than PCR. By virtue of this idea, the ridge parameter could be introduced into the PLS estimator. Let R in Equation (16) be defined by

$$R = \mathrm{diag}[1/(t_1^T t_1 + d), \ldots, 1/(t_q^T t_q + d)] \qquad (24)$$

Replacing $\hat{b}_g(R)$ by $\hat{b}_{RPLS}$, then

$$\hat{b}_{RPLS} = \sum_{i=1}^{q} h_i h_i^T X^T y / (t_i^T t_i + d) = H (T^T T + dI)^{-1} H^T X^T y \qquad (25)$$

If the rank of matrix X is m, then H is an invertible matrix:

$$\hat{b}_{RPLS}(m) = H (H^T X^T X H + dI)^{-1} H^T X^T y = (X^T X + d (H^T)^{-1} H^{-1})^{-1} X^T y \qquad (26)$$

The form of Equation (26) is similar to that of Equation (22). It is a kind of generalized ridge estimator [21,25,26]. Therefore $\hat{b}_{RPLS}$ is called the ridge PLS (RPLS) estimator. The mean squared error of $\hat{b}_{RPLS}$ is

$$\mathrm{MSE}(\hat{b}_{RPLS}) = \sigma^2 \left( \sum_{i=1}^{q} h_i^T h_i / (t_i^T t_i + d) \right) + \|E(\hat{b}_{RPLS}) - b\|^2 \qquad (27)$$

If the PLS estimator $\hat{b}_p$ of b is obtained by retaining the first k components, the corresponding RPLS estimator is

$$\hat{b}_{RPLS}(k) = H_1 (T_1^T T_1 + dI)^{-1} H_1^T X^T y \qquad (28)$$

and the matrix R is

$$R = \mathrm{diag}[1/(t_1^T t_1 + d), \ldots, 1/(t_k^T t_k + d), 0, \ldots, 0] \qquad (29)$$

$\hat{b}_{RPLS}$ also shrinks the PLS estimator $\hat{b}_p$. The rank of $H_1$ is k. There is an m × m elementary square matrix F such that

$$F H_1 = \begin{bmatrix} I_{k \times k} \\ 0 \end{bmatrix} \qquad (30)$$

where the upper part of $F H_1$ is the k × k identity matrix, and

$$\|\hat{b}_{RPLS}(k)\| = \|H_1 (T_1^T T_1 + dI)^{-1} T_1^T T_1 (F H_1)^T F H_1 (T_1^T T_1)^{-1} H_1^T X^T y\| \le \|H_1 (T_1^T T_1 + dI)^{-1} T_1^T T_1 (F H_1)^T F\| \, \|\hat{b}_p(k)\|$$
$$= \max_i \{ [t_i^T t_i / (t_i^T t_i + d)]^{1/2} \} \, \|\hat{b}_p(k)\| \le \|\hat{b}_p(k)\| \quad (\text{for } d \ge 0) \qquad (31)$$

It appears that PLS and RPLS methods shrink the regression coefficients in different ways. In PLS the latter q − k components are removed from the model in order to reduce the influence of noise or the high degree of collinearity in the data set. The limitations of PLS regression are that whether PLS components are retained or removed depends on some optimal criteria applied to the training set, that the discarded PLS components may contain more or less useful information about the model, and that there may be effects of noise remaining in the first k components of the model. On the other hand, according to Equation (8), if the components with very small variance are retained in the model, the variance of the estimator will be inflated. In RPLS the removed components are undoubtedly regarded as noise. The motive for introducing a ridge parameter is to tackle the problem caused by retaining PLS components with small variance and the effects of noise. As shown in Equation (27), the introduction of a small positive constant d has little effect on the PLS components with large variance ($t_i^T t_i \gg d$), but a significant effect on those with small variance. Since $\hat{b}_p(k)$ is a special case of $\hat{b}_{RPLS}(k)$, $\mathrm{MSE}(\hat{b}_{RPLS}(k))$ is expected to be reduced below $\mathrm{MSE}(\hat{b}_p(k))$ by a suitably chosen ridge parameter d.
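A corresponding sketch of the RPLS estimator of Equation (28) (a self-contained helper; H and T are the PLS matrices from a decomposition such as the pls_decompose sketch in Section 2.1, and the names are illustrative):

```python
import numpy as np

def rpls_estimator(H, T, X, y, k, d):
    """RPLS estimator b_RPLS(k) = H_1 (T_1^T T_1 + d I)^{-1} H_1^T X^T y (Equation (28)).
    k is the number of retained PLS components, d the common ridge parameter."""
    H1, T1 = H[:, :k], T[:, :k]
    return H1 @ np.linalg.solve(T1.T @ T1 + d * np.eye(k), H1.T @ (X.T @ y))
```

Setting d = 0 recovers the truncated PLS estimator of Equation (13).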

2.4. Generalized RPLS regression

Let R in Equation (16) be defined by

$$R = \mathrm{diag}[1/(t_1^T t_1 + d_1), \ldots, 1/(t_q^T t_q + d_q)] \qquad (32)$$

Replacing $\hat{b}_g(R)$ by $\hat{b}_{GRPLS}$ gives

$$\hat{b}_{GRPLS} = H (T^T T + D)^{-1} H^T X^T y \qquad (33)$$

where $D = \mathrm{diag}[d_1, d_2, \ldots, d_q]$. Comparing $\hat{b}_{GRPLS}$ (Equation (33)) with $\hat{b}_{RPLS}$ (Equation (25)), more ridge parameters are introduced into the model and different PLS components are treated with different parameters. In analogy with ordinary ridge regression, the regression estimator $\hat{b}_{GRPLS}$ is called the generalized ridge PLS regression (GRPLS) estimator. The mean squared error of $\hat{b}_{GRPLS}$ is

$$\mathrm{MSE}(\hat{b}_{GRPLS}) = \sigma^2 \left( \sum_{i=1}^{q} h_i^T h_i / (t_i^T t_i + d_i) \right) + \|E(\hat{b}_{GRPLS}) - b\|^2 \qquad (34)$$

Letting the latter q − k parameters tend to infinity gives

$$\hat{b}_{GRPLS}(k) = H_1 (T_1^T T_1 + D)^{-1} H_1^T X^T y \qquad (35)$$

$\hat{b}_{GRPLS}(k)$ contains the first k PLS components:

$$\|\hat{b}_{GRPLS}(k)\| = \|H_1 (T_1^T T_1 + D)^{-1} (T_1^T T_1 + dI) (F H_1)^T F H_1 (T_1^T T_1 + dI)^{-1} H_1^T X^T y\| \le \|H_1 (T_1^T T_1 + D)^{-1} (T_1^T T_1 + dI) (F H_1)^T F\| \, \|\hat{b}_{RPLS}(k)\|$$
$$= \max_i \{ [(t_i^T t_i + d)/(t_i^T t_i + d_i)]^{1/2} \} \, \|\hat{b}_{RPLS}(k)\| \qquad (36)$$

For a given d, choose suitable $d_i$ (i = 1, 2, …) such that $\max_i \{ [(t_i^T t_i + d)/(t_i^T t_i + d_i)]^{1/2} \} \le 1$. Thus the following inequality holds:

$$\|\hat{b}_{GRPLS}(k)\| \le \|\hat{b}_{RPLS}(k)\| \qquad (37)$$

$\hat{b}_{GRPLS}$ shrinks $\hat{b}_{RPLS}$ with suitably chosen ridge parameters. The goal of introducing more ridge parameters into the model is to reduce the mean squared error of the regression estimator. In Equations (18) and (25) there is only one ridge parameter to deal with all the PLS components included in the model, and it is unfitting that the components with large variance get the same treatment as those with very small variance. Because RPLS is a special case of GRPLS, it can be expected that the value of the MSE of the optimal GRPLS estimator $\hat{b}_{GRPLS}(k)$ will be smaller than that of the optimal RPLS estimator $\hat{b}_{RPLS}(k)$.
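The GRPLS estimator of Equation (35) differs from RPLS only in using one ridge parameter per retained component; a minimal sketch in the same illustrative setting:

```python
import numpy as np

def grpls_estimator(H, T, X, y, d_vec):
    """GRPLS estimator b_GRPLS(k) = H_1 (T_1^T T_1 + D)^{-1} H_1^T X^T y (Equation (35)),
    with D = diag(d_1, ..., d_k); the length of d_vec fixes the number k of retained components."""
    k = len(d_vec)
    H1, T1 = H[:, :k], T[:, :k]
    return H1 @ np.linalg.solve(T1.T @ T1 + np.diag(d_vec), H1.T @ (X.T @ y))
```

With all entries of d_vec equal this coincides with the RPLS estimator, and with d_vec = 0 with the truncated PLS estimator.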

3. CALCULATION AND EXPERIMENTAL

3.1. Determination of parameters

There are two kinds of parameters to be determined. One is the number of PLS components included in the model. The other is the values of the ridge parameters. The number of PLS components is selected by cross-validation [21]. The cross-validation is done by constructing the prediction residual error sum of squares (PRESS) [22], then minimizing it to determine k, the number of components included in the model:

$$\mathrm{PRESS}(k^{*}) = \min_{1 \le k \le q} \left( \sum_{i=1}^{n} (y_i - \hat{y}_{k/i})^2 \right) \qquad (38)$$

where $\hat{y}_{k/i}$ is the predicted value of the model computed on the training data set with the ith observation removed, and $y_i$ is the y value of the ith sample, when the number of PLS components included in the model is k.

In order to get a realistic comparison of results among RR, PLS, RPLS and GRPLS, the determination of parameters ought to be performed from the same starting point. Thus similar cross-validation processes should also be used to determine the optimal values of d for all the other methods. For RPLS (Equation (28)), firstly, for each k (k = 1, 2, …, q), the optimal d is obtained by minimizing PRESS; then the best of these d values is chosen. For GRPLS the computation is more complicated. The forward selection algorithm used in this paper is the following (a computational sketch is given after the list).

1. For a given k (k = 1, 2, …, q), do steps 2–4 in order to get the optimal parameters.
2. Minimize PRESS with respect to d (as done for RPLS; see Equations (25) and (28)) to get the optimal d*. Let D0 = diag[d*, d*, …, d*].
3. Let D1 = diag[d, d*, …, d*]. Minimize PRESS with respect to d to determine the optimal d (denoted d1*). Let D2 = diag[d1*, d, d*, …, d*]. Minimize PRESS with respect to d to determine the second optimal value d2*. The remaining di* are determined in the same way, by degrees.
4. Let D0 = diag[d1*, d2*, …, dk*]. Repeat steps 3 and 4 until convergence.
5. Choose the best k, with the minimized PRESS among the q loops, and the corresponding optimal di* (i = 1, 2, …, k). The best k is denoted k*.
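A compact sketch of the leave-one-out PRESS of Equation (38) and of a coordinate-wise variant of the forward selection above (illustrative only; it assumes the pls_decompose and grpls_estimator helpers from the earlier sketches and a user-supplied list d_grid of candidate ridge values):

```python
import numpy as np

def press(X, y, k, d_vec):
    """Leave-one-out PRESS (Equation (38)) for a GRPLS model with k components and
    ridge parameters d_vec (equal entries give RPLS, all zeros give PLS)."""
    n = len(y)
    resid = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        T, P, W, H = pls_decompose(X[keep], y[keep], k)
        b = grpls_estimator(H, T, X[keep], y[keep], d_vec)
        resid[i] = y[i] - X[i] @ b
    return float(np.sum(resid ** 2))

def forward_select_d(X, y, k, d_grid, n_sweeps=5):
    """Choose d_1, ..., d_k by repeatedly minimizing PRESS one parameter at a time,
    starting from the best common ridge value (the RPLS solution)."""
    d_star = min(d_grid, key=lambda d: press(X, y, k, np.full(k, d)))
    d_vec = np.full(k, float(d_star))
    for _ in range(n_sweeps):                      # "repeat until convergence"
        for j in range(k):
            best_d = min(d_grid, key=lambda d: press(X, y, k,
                         np.concatenate([d_vec[:j], [d], d_vec[j+1:]])))
            d_vec[j] = best_d
    return d_vec, press(X, y, k, d_vec)
```

Looping this over k = 1, …, q and keeping the (k, d_vec) pair with the smallest PRESS gives k* and the final ridge parameters.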

3.2. Calculation of mean squared error of prediction (MSEP)

After the optimal parameters and optimal PRESS are acquired, all training samples are used to build the final model. MSEP for the selected model is calculated for each method over test samples:

$$\mathrm{MSEP} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \qquad (39)$$

where $\hat{y}_i$ is the prediction for the ith observation of the test data and n is the size of the test set. All the computations mentioned above are performed on an IBM PC using Matlab Version 4.2.
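The evaluation step as a tiny sketch (names illustrative): refit on all training samples with the selected parameters, predict the test set, and apply Equation (39).

```python
import numpy as np

def msep(y_test, y_pred):
    """Mean squared error of prediction, Equation (39)."""
    y_test, y_pred = np.asarray(y_test, float), np.asarray(y_pred, float)
    return float(np.mean((y_test - y_pred) ** 2))

# Illustrative use with the earlier helpers (X_train, y_train, X_test, y_test, k, d_vec assumed given):
#   T, P, W, H = pls_decompose(X_train, y_train, k)
#   b = grpls_estimator(H, T, X_train, y_train, d_vec)
#   print(msep(y_test, X_test @ b))
```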

3.3. Explanation of data sets

3.3.1. Data set A. Data set A is taken from the work of Martens and Naes [1]. For this data set the dependent variable y is the percentage of n-propanol in a mixture of ethanol, methanol and n-propanol. Measurements of the mixture are performed at 101 wavelengths in the near-infrared region between 1100 and 1600 nm at intervals of 5 nm. Forty samples are collected. These samples are split into two parts of the same size. One part serves as the training set. The spectra of these samples are shown in Figure 1. It is evident that there is a serious spectral overlap in the data, i.e. there is a high degree of collinearity among the independent variables. The model is constructed using these samples. The other samples are used as the test set.

3.3.2. Data set B. The second data set is from our laboratory. These data consist of ultraviolet measurements of mixtures of naphthalene, anthracene, fluorene and phenanthrene. The UV spectra are collected at intervals of 1 nm between wavelengths 200 and 280 nm. The dependent variable y is the concentration (0.1 mg ml⁻¹) of fluorene. Thirty-five samples are collected. Their spectra are shown in Figure 2. These samples are split into two parts. One part consists of the last 18 samples. These serve as the training set and are used to establish the model. The others compose the test set and are used to compute the mean squared error of prediction of the selected model.

3.3.3. Data set C. The third data set is from Reference [24]. The data set is used to build the relationship between the logarithm of rate constants (log k) (or the reaction barrier BA) and six descriptors for radical addition reactions. The training set consists of 114 samples. The test set is a monitoring set of 55 samples. The training set is centred and normalized before computation.

4. RESULTS AND DISCUSSION

4.1. Data set A. There is a high degree of collinearity among the spectral variables in the training data set, because the condition number of matrix $X^T X$ is 3.0178 × 10²¹. Therefore the LS method is not suitable to analyse the data set. All the results are listed in Table I. PLS, RPLS and GRPLS all use six available PLS components. GRPLS gives the lowest PRESS. The result of GRPLS is better than those of PLS, RR and RPLS. The performances of PLS and RR are almost the same. These results are not beyond expectation, as discussed in Section 2.

Figure 1. Near-infrared spectra of training set of data set A.

Figure 2. Ultraviolet spectra of data set B.

Table I. Optimal parameters, PRESS and MSEP on basis of data set A for PLS, RPLS, GRPLS regression and RR methods

Method   k*   d*                                               PRESS    MSEP    R²       Norm of b̂
PLS      6    -                                                295.81   20.50   0.9974   195.66
RPLS     6    0.0055                                           289.41   35.03   0.9810   186.19
GRPLS    6    0, 1.52 × 10⁻², 0.32 × 10⁻², 0, 1.40 × 10⁻², 0    275.98   6.56    0.9626   186.20
RR       -    0.38 × 10⁻²                                      294.40   35.07   0.9982   191.54

The values of the square of the multiple correlation coefficient, R², for the selected models of PLS, RPLS, GRPLS and RR are 0.9974, 0.9810, 0.9626 and 0.9982 respectively.

In order to confirm the predictive ability of the models constructed on the basis of the training set, the fresh data of the test set are introduced. MSEP is calculated according to the selected model on the test set. One can see from Table I that GRPLS gives the best performance overall, followed by PLS, RPLS and RR. RR gives almost the same result as RPLS, and both of them perform distinctly worse than the other two biased procedures. For the case studied here, the values of MSEP for each method are not in the same order as those of PRESS. Unexpectedly, PLS performs better than RPLS, but GRPLS still gives a significant improvement over PLS. It can be seen from Table I that the GRPLS and RPLS estimators give almost the same overall shrinkage of the PLS estimator. Because of the high degree of collinearity of the data, further shrinkage of the PLS estimator could improve the capacity for prediction. On the other hand, the fact that the performance of the GRPLS estimator is obviously better than that of the RPLS estimator indicates that PLS components with different variances need different ridge parameters.

4.2. Data set B

The second data set is obtained from UV spectroscopy. The training set has 18 samples. There is also collinearity in these samples. The ordinary least squares regression method is also not suitable for this example. Table II lists the optimal PRESS and MSEP values for every method, together with the optimal parameters, R² and the norm of the regression vector on the basis of the training set. GRPLS and RPLS give the best performances, followed by PLS and RR. The R² values for the four methods are similar. A noteworthy point is that the numbers of PLS components introduced into the model by these methods are not the same. The models constructed by the PLS and GRPLS regression methods contain five PLS components, while the one constructed by the RPLS regression method contains six PLS components. The GRPLS estimator slightly shrinks the PLS estimator. It is worth noting that the RPLS estimator evidently inflates the PLS estimator. The reason for this might be that the RPLS estimator uses six PLS components, one component more than the PLS estimator does. Therefore this degrades the performance of the RPLS estimator.

Table II. Optimal parameters, PRESS and MSEP on basis of data set B for PLS, RPLS, GRPLS regression and RR methods

Method   k*   d*                                  PRESS         MSEP          R²       Norm of b̂
PLS      5    -                                   2.50 × 10⁻³   1.43 × 10⁻²   0.9945   2.37
RPLS     6    10⁻⁴                                2.40 × 10⁻³   1.53 × 10⁻²   0.9928   2.97
GRPLS    5    0, 2.20 × 10⁻³, 0, 1.60 × 10⁻³, 0   2.35 × 10⁻³   1.41 × 10⁻²   0.9899   2.35
RR       -    10⁻⁴                                2.62 × 10⁻³   1.53 × 10⁻²   0.9899   10.84

Figure 3 shows the overall behaviour of every method with respect to the number of PLS components contained in the model. An obvious difference between GRPLS (or RPLS) and PLS is that PRESS for GRPLS (or RPLS) is stable after it reaches the optimal point. With more PLS components involved in the model, collinearity is then introduced and subsequently the performance of the PLS method is degraded. In contrast, with the addition of ridge parameters, increasing the number of PLS components does not increase the collinearity (or effect of noise) of the data, and the performance of GRPLS (or RPLS) is not degraded.

The predictive abilities of the models determined on the basis of the training set for each method are validated on the fresh test set. MSEP is calculated as the criterion for the selected model. It can be seen from Table II that GRPLS gives the best performance. PLS performs worse than GRPLS, followed by RPLS and RR. All the methods perform in the same order as discussed for data set A. It is also noticed that RPLS and RR give almost the same performance and their optimal ridge parameters are the same. Figure 4 shows the plot of MSEP relative to the number of PLS components included in the model. All the regression methods show the same features as they do in the PRESS plot (Figure 3). GRPLS and RPLS are insensitive to the number of PLS components. This property indicates that GRPLS and RPLS are stable regression methods in the presence of noise or collinearity in the samples.

4.3. Data set C

As stated in Reference [24], there is collinearity among some variables. The linear model is an approximate model. It can be seen from Table III that the values of PRESS relating to log k for the four methods are almost the same. However, the values of MSEP are different. GRPLS and RPLS perform better than PLS and RR. For BA the values of PRESS have the same order as for log k and also show little difference, while the values of MSEP show larger differences. GRPLS (the best) and RPLS obviously perform better than PLS and RR.

Figure 3. PRESS based on training set of data set B for PLS, RPLS, GRPLS regression and RR methods in relation to number of PLS components included in model.


Figure 4. MSEP of test set of data set B for PLS, RPLS, GRPLS regression and RR methods in relation to number of PLS components included in model.

5. CONCLUSION

A class of generalized PLS regression methods is developed. The GPLS regression model gives a very concise mathematical formula for its estimators. Based on this formula, two special cases in the class of generalized PLS regression methods, GRPLS and RPLS, are discussed in detail. The GRPLS regression method can be regarded as a combination of the PLS and GRR methods. RPLS can also be regarded as a special case of GRPLS. The GPLS estimator is a shrunken PLS estimator, as are the RPLS and GRPLS estimators. The two methods, as well as the PLS regression and RR methods, are compared from the point of view of prediction.

According to the three data sets studied here, GRPLS and RPLS give lower PRESS values than PLS and RR. PRESS for PLS is more sensitive than those for GRPLS and RPLS with respect to the number of PLS components included in the model. The reason for this is that the high degree of collinearity and the noise in the model are reduced by introducing ridge parameters. A more important property for the constructed model is its prediction ability. The model constructed by GRPLS gives the best performance compared to RPLS, PLS and RR. It is beyond our expectation that the model constructed by PLS performs better than that constructed by RPLS for the first two data sets. This result seems to show that the introduction of only one ridge parameter makes it difficult to tackle both collinearity and noise in the model. PLS components with different variances may need different ridge parameters. It should be pointed out here that the prediction ability for the model constructed by PLS is better than that for the model constructed by RR, based on the three examples.

Table III. Optimal parameters, PRESS and MSEP on basis of data set C for PLS, RPLS, GRPLS regression and RR methods

log k
Method   k*   d*                                                   PRESS     MSEP    Norm of b̂
PLS      4    -                                                    130.16    1.24    0.66
RPLS     4    0.06                                                 129.98    1.23    0.66
GRPLS    4    3.80 × 10⁻², 0.13, 0.10, 0                           129.88    1.21    0.66
RR       -    3.00 × 10⁻³                                          130.06    1.24    2.38

BA
Method   k*   d*                                                   PRESS     MSEP    Norm of b̂
PLS      5    -                                                    5015.03   50.24   22.23
RPLS     6    0.08                                                 4998.94   48.67   19.73
GRPLS    6    7.5 × 10⁻², 0.16, 0.22, 0, 4.1 × 10⁻², 1.30 × 10⁻²   4970.15   47.65   18.43
RR       -    5.00 × 10⁻⁴                                          4990.23   52.50   22.06

ACKNOWLEDGEMENTS

This work was financially supported by the National Natural Science Foundation of the People's Republic of China. The authors are grateful to H. Martens and T. Fearn for data set A and to K. Héberger and A. P. Borosy for data set C.

APPENDIX. Estimator $\hat{b}_g(R)$ is a shrunken estimator of b

Letting ‖⋅‖ denote the spectral norm of a matrix and, when h is a vector, ‖h‖ its length, then

$$\hat{b}_g(R) = H R H^T X^T y = H R H^T X^T X (X^T X)^{+} X^T y = H R H^T X^T X \hat{b}_L$$
$$\|\hat{b}_g(R)\| = \|H R H^T X^T X \hat{b}_L\| \le \|H R H^T X^T X\| \, \|\hat{b}_L\| \qquad (40)$$

$$\|H R H^T X^T X\| \, \|\hat{b}_L\| = \|R H^T X^T X H\| \, \|\hat{b}_L\| \le \max_i \{ [r_i (t_i^T t_i)]^{1/2} \} \, \|\hat{b}_L\| \qquad (41)$$

If $0 \le r_i \le 1/t_i^T t_i$ for i = 1, 2, …, q, then

$$\|\hat{b}_g(R)\| \le \|\hat{b}_L\| \qquad (42)$$

REFERENCES

1. Martens H, Naes T. Multivariate Calibration. Wiley: New York, 1989.
2. Hoerl AE, Kennard RW. Technometrics 1970; 12: 55–67.
3. Hoerl AE, Kennard RW. Technometrics 1970; 12: 69–82.
4. Massy WF. J. Am. Statist. Assoc. 1965; 60: 234–246.
5. Webster JT, Gunst RT, Mason RL. Technometrics 1974; 16: 513–522.
6. Wold H. PLS regression. In Encyclopaedia of Statistical Sciences, vol. 6, Johnson NL, Kotz S (eds). Wiley: New York, 1984; 581–591.
7. Kowalski B, Gerlach R, Wold H. In Systems under Indirect Observation, Jöreskog K, Wold H (eds). North-Holland: Amsterdam, 1982; 191–209.
8. Frank E, Friedman JH. Technometrics 1993; 35: 109–135.
9. Stone M, Brooks RJ. J. R. Statist. Soc. B 1990; 52: 237–269.
10. Fearn T. J. R. Statist. Soc. B 1990; 52: 260.
11. Sundberg R. J. R. Statist. Soc. B 1993; 55: 653.
12. Lang PM, Brenchley JM, Nieves RJ, Kalivas JH. J. Multivariate Anal. 1998; 65: 58.
13. Brenchley JM, Lang PM, Nieves RJ, Kalivas JH. Chemometrics Intell. Lab. Syst. 1998; 41: 127.
14. Kalivas JH. Chemometrics Intell. Lab. Syst. 1998; 45: 209.
15. Kalivas JH. J. Chemometrics 1999; 13: 111–132.
16. Helland IS. Commun. Statist. Simul. 1988; 17: 581–607.
17. Helland IS. Scand. J. Statist. 1990; 17: 97–114.
18. Höskuldsson A. J. Chemometrics 1988; 2: 211–220.
19. Marquardt DW. Technometrics 1970; 12: 591–612.
20. Weisberg S. Applied Linear Regression. Wiley: New York, 1980.
21. Allen DM. Technometrics 1974; 16: 125–127.
22. Stone M. J. R. Statist. Soc. B 1974; 36: 111–147.
23. James W, Stein C. Proc. Fourth Berkeley Symp. on Mathematics, Statistics and Probability, vol. 1, 1961; 361–379.
24. Héberger K, Borosy AP. J. Chemometrics 1999; 13: 473–489.
25. Lowerre JM. Technometrics 1974; 16: 461–464.
26. Hocking RR, Speed FM, Lynn MJ. Technometrics 1976; 18: 425.
