Linear and nonlinear methods for regression and classification and ...

6 downloads 23 Views 794KB Size Report
http://www.statistik.tuwien.ac.at. Linear and nonlinear methods for regression and classification and applications in R. P. Filzmoser. Forschungsbericht CS-2008- ...

Institut f. Statistik u. Wahrscheinlichkeitstheorie 1040 Wien, Wiedner Hauptstr. 8-10/107 AUSTRIA http://www.statistik.tuwien.ac.at

Linear and nonlinear methods for regression and classification and applications in R P. Filzmoser

Forschungsbericht CS-2008-3 Juli 2008

Kontakt: [email protected]

Linear and Nonlinear Methods for Regression and Classification and applications in R Peter Filzmoser Department of Statistics and Probability Theory, Vienna University of Technology Wiedner Hauptstr. 8-10, Vienna, 1040, Austria, [email protected] Summary The field of statistics has drastically changed since the introduction of the computer. Computational statistics is nowadays a very popular field with many new developments of statistical methods and algorithms, and many interesting applications. One challenging problem is the increasing size and complexity of data sets. Not only for saving and filtering such data, but also for analyzing huge data sets new technologies and methods had to be developed. This manuscript is concerned with linear and nonlinear methods for regression and classification. In the first sections “classical” methods like least squares regression and discriminant analysis are treated. More advanced methods like spline interpolation, generalized additive models, and tree-based methods follow. Each section introducing a new method is followed by a section with examples from practice and solutions with R [R Development Core Team, 2008]. The results of different methods are compared in order to get an idea of the performance of the methods. The manuscript is essentially based on the book “The Elements of Statistical Learning”, Hastie et al. [2001].

1

Contents 1 Fundamentals 1.1 The linear regression model . . . . . . . . . 1.2 Comparison of models and model selection . 1.2.1 Test for several coefficients to be zero 1.2.2 Explained variance . . . . . . . . . . 1.2.3 Information criteria . . . . . . . . . . 1.2.4 Resampling methods . . . . . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

4 4 5 5 6 7 8

2 Linear regression methods 2.1 Least Squares (LS) regression . . . . . . . . . . 2.1.1 Parameter estimation . . . . . . . . . . . 2.2 Variable selection . . . . . . . . . . . . . . . . . 2.2.1 Bias versus variance and interpretability 2.2.2 Best subset regression . . . . . . . . . . 2.2.3 Stepwise algorithms . . . . . . . . . . . . 2.3 Methods using derived inputs as regressors . . . 2.3.1 Principal Component Regression (PCR) 2.3.2 Partial Least Squares (PLS) regression . 2.4 Shrinkage methods . . . . . . . . . . . . . . . . 2.4.1 Ridge regression . . . . . . . . . . . . . . 2.4.2 Lasso Regression . . . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

9 10 10 10 10 11 11 12 13 14 15 15 16

3 Linear regression methods in R 3.1 Least squares regression in R . . . . . . . . . . . 3.1.1 Parameter estimation in R . . . . . . . . 3.2 Variable selection in R . . . . . . . . . . . . . . 3.2.1 Best subset regression in R . . . . . . . . 3.2.2 Stepwise algorithms in R . . . . . . . . . 3.3 Methods using derived inputs as regressors in R 3.3.1 Principal component regression in R . . 3.3.2 Partial least squares regression in R . . . 3.4 Shrinkage methods in R . . . . . . . . . . . . . 3.4.1 Ridge regression in R . . . . . . . . . . . 3.4.2 Lasso regression in R . . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

17 17 18 18 18 19 20 20 21 22 22 23

. . . . . .

25 25 26 26 27 28 28

. . . .

31 31 32 32 32

. . . . . .

4 Linear methods for classification 4.1 Linear regression of an indicator matrix . . . . . . 4.2 Discriminant analysis . . . . . . . . . . . . . . . . 4.2.1 Linear Discriminant Analysis (LDA) . . . 4.2.2 Quadratic Discriminant Analysis (QDA) . 4.2.3 Regularized Discriminant Analysis (RDA) 4.3 Logistic regression . . . . . . . . . . . . . . . . . 5 Linear methods for classification in R 5.1 Linear regression of an indicator matrix in R 5.2 Discriminant analysis in R . . . . . . . . . . 5.2.1 Linear discriminant analysis in R . . 5.2.2 Quadratic discriminant analysis in R 2

. . . .

. . . .

. . . .

. . . . . .

. . . .

. . . . . .

. . . .

. . . . . .

. . . .

. . . . . .

. . . .

. . . . . .

. . . .

. . . . . .

. . . .

. . . . . .

. . . .

. . . . . .

. . . .

. . . . . .

. . . .

. . . . . .

. . . .

. . . . . .

. . . .

. . . . . .

. . . .

. . . . . .

. . . .

. . . . . .

. . . .

5.3

5.2.3 Regularized discriminant analysis in R . . . . . . . . . . . . . . . . . Logistic regression in R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

6 Basis expansions 6.1 Interpolation with splines . 6.1.1 Piecewise polynomial 6.1.2 Splines . . . . . . . . 6.1.3 Natural cubic splines 6.2 Smoothing splines . . . . . .

33 33

. . . . .

35 36 36 37 38 38

7 Basis expansions in R 7.1 Interpolation with splines in R . . . . . . . . . . . . . . . . . . . . . . . . . . 7.2 Smoothing splines in R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

39 39 41

8 Generalized Additive Models (GAM) 8.1 General aspects on GAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.2 Parameter estimation with GAM . . . . . . . . . . . . . . . . . . . . . . . .

42 43 43

9 Generalized additive models in R

44

10 Tree-based methods 10.1 Regression trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.2 Classification trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

46 46 47

11 Tree based methods in R 11.1 Regression trees in R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.2 Classification trees in R . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

48 48 50

Bibliography

52

. . . . . . functions . . . . . . . . . . . . . . . . . .

3

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

1

Fundamentals

In this section the linear model is introduced, and basic information about model comparison and model selection is provided. It is assumed that the reader has some knowledge about the theory of least squares regression including statistical testing.

1.1

The linear regression model

The aim of the linear regression model is to describe the output variable y through a linear combination of one or more input variables x1 , x2 , . . . , xp . “Linear combination” means that the input variables get first weighed with constants and are then summarized. The result should explain the output variable as good as possible. The classical linear model has the form y = f (x1 , x2 , . . . , xp ) + ε with the linear function f (x1 , x2 , . . . , xp ) = β0 +

p X

xj β j .

(1)

j=1

The goal is to find a functional relation f which justifies that y ≈ f (x1 , x2 , . . . , xp ), and thus we have y = f (x1 , x2 , . . . , xp ) + ε with the error term ε, which should be as small as possible. A request often made is normal distribution of the error term with expectation 0 and variance σ 2 . The variables xj can come from different sources: • quantitative inputs, for example the height of different people • transformations of quantitative inputs, such as log, square-root, or square • basis-expansions, such as x2 = x21 , x3 = x31 , leading to a polynomial representation • numeric or “dummy” coding of the levels of qualitative inputs • interactions between variables, for example x3 = x1 · x2 No matter the source of the xj , the model is linear in the parameters. Typically we estimate the parameters βj from a set of training data. These consist of measurements y1 , . . . , yn of the response variable and x1 , . . . , xn of the regressor variables. Each vector xi contains the measurements of the p variables xi1 , xi2 , . . . , xip . The estimated parameters are denoted by βˆ0 , βˆ1 , . . . , βˆp and by inserting these values in the linear model (1) for each observation one gets: yˆi = f (xi1 , xi2 , . . . , xip ) = βˆ0 +

p X

xij βˆj

j=1

The predicted values yˆi should be as close as possible to the real measured values yi . The definition of “as close as possible” as well as an estimation of how good the model actually is, will be given in the next sections.

4

1.2

Comparison of models and model selection

Let us consider a statistical model which combines several input variables x1 , x2 , . . . , xp linearly in order to explain an output variable y as good as possible. One question often asked is what “as good as possible” really means and if there would be another model giving an explanation of y as good as the one just found. A lot of times the differences of the quadratic error of the observed values yi and the estimated values yˆi are used to measure the performance of a model. These differences yi − yˆi are called residuals. There are different strategies to compare several models: One could be interested in reducing the number of input variables used to explain the output variable since this could simplify the model, making it easier to understand. In addition, the measurements of variables are often expensive, a smaller model would therefore be cheaper. Another strategy is to use all of the input variables and derive a small number of new input variables from them (see Section 2). Both strategies are aimed at describing the output variable as good as possible, not just for already measured and available values but also for values acquired in the future. 1.2.1

Test for several coefficients to be zero

Let us assume that we have two models of different size: M0 : y = β0 x0 + β1 x1 + · · · + βp0 xp0 + ε M1 : y = β0 x0 + β1 x1 + · · · · · · + βp1 xp1 + ε with p0 < p1 . By simultaneously testing several coefficients to be zero we want to find out if the additional variables xp0 +1 , . . . , xp1 in the larger model M1 provide a significant explanation gain to the smaller model M0 . Rephrasing we get: H0 : “the small model is true”, meaning that the model M0 is acceptable, versus H1 : “the larger model is true”. Basis for this test is the residual sum of squares à !2 p0 n n X X X RSS0 = yi − β0 − βj xij = (yi − yˆi )2 i=1

j=1

i=1

for model M0 and respectively RSS1 =

n X

à yi − β0 −

i=1

p1 X

!2 βj xij

j=1

for model M1 . This leads to the following test statistic RSS0 −RSS1 F =

p1 −p0

RSS1 n−p1 −1

The F -statistic measures the change in residual sum of squares per additional parameter in the bigger model, and it is normalized by an estimate of σ 2 . Under Gaussian assumptions εi ∼ N (0, σ 2 ) and the null hypothesis that the smaller model is correct, the F statistic will be distributed according to F ∼ Fp1 −p0 ,n−p1 −1 . For a better understanding of this test statistic we use the fact that the Total Sum of Squares (TSS) can be expressed by the sum of two variations: the sum of squares explained by the regression (Regression Sum of Squares - RegSS) and the Residual Sum of Squares - RSS: 5

• Total Sum of Squares TSS =

n P

(yi − y)2

i=1

• Residual Sum of Squares RSS =

n P

(yi − yˆi )2

i=1

• Regression Sum of Squares RegSS =

n P

(ˆ yi − y)2

i=1

It is easy to show that TSS = RegSS + RSS. For the models M0 and M1 we have TSS = RegSS0 + RSS0 = RegSS1 + RSS1 which leads to RegSS1 − RegSS0 = RSS0 − RSS1 ≥ 0 because p0 < p1 If RSS0 − RSS1 is large, M1 explains the data significantly better. Attention: A test for H0 : β0 = β1 = · · · = βp = 0 might not give the same result as single tests for H0 : βj = 0 with j = 0, 1, . . . , p - especially if the x-variables are highly correlated. 1.2.2

Explained variance

The multiple R-Square (coefficient of determination) describes the amount of variance that is explained by the model RSS RegSS = TSS TSS P (yi − yˆi )2 = 1− P (yi − y)2 2 = Cor (y, yˆ) ∈ [0, 1]

R2 = 1 −

The closer R-Square is to 1, the better the fit of the regression. If the model has no intercept, y = 0 is chosen. Since the denominator remains constant, R-Square grows with the size of the model. In general we are not interested in the model with the maximum R-Square. Rather, we would like to select that model which leads only to a marginal increase in R-Square with the addition of new variables. ˜ 2 is defined as The adjusted R-Square R ˜ 2 = 1 − RSS/(n − p − 1) . R TSS/(n − 1) It gives a reduced R-Square value and prevents the effect of getting a bigger R-Square even though the fit gets worse, by including the degrees of freedom in the calculation [see, for example, Fox, 1997].

6

1.2.3

Information criteria

We now want to approach the analysis of nested models M1 , . . . , Mq . The models are ranked with M1 ≺ M2 ≺ · · · ≺ Mq , therefore if Mi ≺ Mj , the model Mi can be seen as a special case of Mj which can be obtained by setting some parameters equal to zero, i.e. M1 : y = β0 + β1 x1 + ε M2 : y = β0 + β1 x1 + β2 x2 + ε Of course we have M1 < M2 , since M1 is a special case of M2 where β2 = 0. A possibility to evaluate nested models are information criteria. The most popular criteria are AIC and BIC, which try in different ways to combine the model complexity with a penalty term for the number of parameters used in the model. Akaike’s information criterion (AIC): The AIC is defined as ˆ AIC = −2 max log(L(θ|data)) + 2p ˆ where max log(L(θ|data)) is the maximized log-likelihood function for the parameters θ conditioned at the data at hand, and p is the number of parameters in the model. Considering a linear regression model with normally distributed errors, the maximum likelihood estimators for the unknown parameters are βˆM L = βˆLS = (X > X)−1 X > y 2 σ ˆM L = RSS/n It follows that n n max log L(βˆM L , σ ˆM L ) = − (1 + log(2π)) − log 2 2

µ

RSS n

¶ .

After dropping the constant, the AIC becomes AIC = n log(RSS/n) + 2p RSS = + 2p if σ 2 is known σ2 Within nested models the (log-)likelihood is monotonically increasing and the RSS monotonically decreasing. Bayes information criterion (BIC): Similar to AIC, BIC can be used if the model selection is based on the maximization of the log-likelihood function. The BIC is defined as BIC = −2 max log L + log(n)p RSS = + log(n)p if σ 2 is known. σ2 Mallow’s Cp : For known σ 2 the Mallow’s Cp is defined as Cp =

RSS + 2p − n . σ2 7

If the residual variance σ 2 is not known, it can be estimated by regression with the full model (using all variables). If a full model cannot be computed (too many variables, collinearity, etc.), a regression can be performed on the relevant principal components (see Section 2.3.1), and the variance is estimated from the resulting residuals. For the “true” model, Cp is approximately p, the number of parameters used in this model, and otherwise greater than p. Thus a model where Cp is approximately p should be selected, and preferably that model with smallest p. 1.2.4

Resampling methods

After choosing a model which provides a good fit to the training data, we want to model the test data as well, with the requirement yTest ≈ fˆ(xTest ) (fˆ was calculated based on the training data only). One could define a loss function L which measures the error between y and fˆ(x), i.e. ³ ´ ½ (y − fˆ(x))2 ... quadratic error ˆ L y, f (x) = |y − fˆ(x)| ... absolute error. The test error is the expected predicted value of an independent test set h ³ ´i Err = IE L y, fˆ(x) . With a concrete sample, Err can be estimated using n ´ 1X ³ ˆ c L yi , f (xi ) . Err = n i=1

In case of using the loss function with quadratic error, the estimation of Err is well-known under the name Mean Squared Error (MSE). So, the MSE is defined by n ´2 1 X³ MSE = yi − fˆ(xi ) . n i=1

Usually there is only one data set available. The evaluation of the model is then done by resampling methods (i.e. cross-validation, bootstrap). Cross validation: A given sample is randomly divided into q parts. One part is chosen to be the test set, the other parts are defined as training data. The idea is as follows: for the kth part, k = 1, 2, . . . , q, we fit the model to the other q − 1 parts of the data and calculate the prediction error of the fitted model when predicting the kth part of the data. In order to avoid complicated notation, fˆ denotes the fitted function, computed with the kth part of the data removed. The functions all differ depending on the part left out. The evaluation of the model (calculation of the expected prediction error) is done on the kth part of the data set, i.e. for k = 3: 1 2 3 Training Training Test

4 5 ··· Training Training · · ·

q Training

yˆi = fˆ(xi ) represents the prediction for xi , calculated by leaving out the kth part of the data set, with xi allocated in part k. Since each observation exists only once in each test set, we obtain n predicted values yˆi , i = 1, . . . , n. 8

The estimated cross-validation error is then n

X c CV = 1 Err L (yi , yˆi ) . n i=1 Choice of q: The case q = n is known as “leave-one-out cross-validation”. In this case the fit is computed using all the data except the ith. Disadvantages of this method are • high computational effort • high variance due to similarity of the n “training sets”. A 5-fold or 10-fold cross-validation, thus q = 5 or q = 10, should therefore be preferred. Bootstrap: The basic idea is to randomly draw data sets with replacement from the training data, each sample the same size as the original training set. This is done q times, producing q data sets. Then we refit the model to each of the bootstrap data sets, and examine the behavior of the fits over the q replications. The mean prediction error is then q

n

´ 1 1 XX ³ ˆ c ErrBoot = L yi , fk (xi ) , q n k=1 i=1 with fˆk indicating the function assessed on sample k and fˆk (xi ) indicating the prediction of observation xi of the kth data set. Problem and possible improvements: Due to the large overlap in test and training sets, c Boot is frequently too optimistic. The probability of an observation being included in a Err bootstrap data set is 1 − (1 − n1 )n ≈ 1 − e−1 = 0.632. A possible improvement would be to c Boot only for those observations not included in the bootstrap data set, which calculate Err is true for about 13 of the observations. “Leave-one-out bootstrap” offers another potential improvement: The prediction fˆk (xi ) is based only on the data sets which do not include xi : n ´ 1X 1 X ³ ˆ c ErrBoot−1 = L yi , fk (xi ) n i=1 |C −i | −i k∈C

C −i is the set of indices of the bootstrap samples k that do not contain observation i.

2

Linear regression methods

A linear regression model assumes that the regression function IE(y|x1 , x2 , . . . , xp ) is linear in the inputs x1 , x2 , . . . , xp . Linear models were largely developed in the pre-computer age of statistics, but even in today’s computer era there are still good reasons to study and use them since they are the foundation of more advanced methods. Some important characteristics of linear models are: • they are simple and • often provide an adequate and • interpretable description of how the inputs affect the outputs. • For prediction purposes they can often outperform fancier nonlinear models, especially in situations with 9

– small numbers of training data or – a low signal-to-noise ratio. • Finally, linear models can be applied to transformations of the inputs and therefore be used to model nonlinear relations.

2.1 2.1.1

Least Squares (LS) regression Parameter estimation

The multiple linear regression model has the form y = Xβ + ε with the n observed values y = (y1 , y2 , . . . , yn )> , the design matrix      X=   

1 x11 x12 · · · 1 x21 x22 · · · .. . 1 xi1 .. .

xi2 · · ·

1 xn1 xn2 · · ·

   x1p 1, x1   x2p    1, .x2    .    .  =  xip   1, xi    .    ..  xnp 1, xn

and the error term ε = (ε1 , ε2 , . . . , εn )> . The parameters we are looking for are the regression coefficients β = (β0 , β1 , . . . , βp )> . The most popular estimation method is least squares, in which we choose the coefficients β = (β0 , β1 , . . . , βp )> which minimize the residual sum of squares (RSS) n X RSS(β) = (yi − f (xi ))2 i=1

=

n X

à yi − β0 −

i=1

p X

!2 xij βj

.

j=1

This approach makes no assumptions about the validity of the model, it simply finds the best linear fit to the data. The solution of the minimization problem is given by βˆ = (X > X)−1 X > y. The estimated y-values are yˆ = X βˆ = X(X > X)−1 X > y.

2.2 2.2.1

Variable selection Bias versus variance and interpretability

There are two reasons why a least squares estimate of the parameters in the full model is unsatisfying:

10

1. Prediction quality: Least squares estimates often have a small bias but a high variance. The prediction quality can sometimes be improved by shrinkage of the regression coefficients or by setting some coefficients equal to zero. This way the bias increases, but the variance of the prediction reduces significantly which leads to an overall improved prediction. This tradeoff between bias and variance can be easily seen by decomposing the mean squared error (MSE) of an estimate θˆ of θ: ˆ := IE(θˆ − θ)2 = Var(θ) ˆ + [IE(θ) ˆ − θ ]2 MSE(θ) | {z } bias

For our BLUE (Best Linear Unbiased Estimator) βˆ we have ˆ = Var(β) ˆ + [IE(β) ˆ − β]2 . MSE(β) | {z } =0

However, an estimator β˜ with ˜ < MSE(β) ˆ MSE(β) ˜ 6= β (biased) and Var(β) ˜ < Var(β). ˆ A smaller MSE could thus could exist, with IE(β) lead to a better prediction of new values. Such estimators β˜ can be obtained from variable selection, by building linear combinations of the regressor variables (Section 2.3), or by shrinkage methods (Section 2.4). 2. Interpretability: If many predictor variables are available, it makes sense to identify the ones who have the largest influence, and to set the ones to zero that are not relevant for the prediction. Thus we eliminate variables that will only explain some details, but we keep those which allow for the major explanation of the response variable. With variable selection only a subset of all input variables is used, the rest is eliminated from the model. For this subset, the least squares estimate is used for parameter estimation. There are many different strategies to choose a subset. 2.2.2

Best subset regression

Best subset regression finds the subset of size k for each k ∈ {0, 1, . . . , p} that gives the smallest RSS. An efficient algorithm is the so-called Leaps and Bounds algorithm which can handle up to 30 or 40 regressor variables. Leaps and Bounds algorithm: This algorithm creates a tree model and calculates the RSS (or other criteria) for the particular subsets. In Figure 1 the AIC for the first subsets is shown. Subsequently, large branches are eliminated by trying to reduce the RSS. The AIC for the model x2 + x3 = 20. Through the elimination of x2 or x3 we get an AIC of at least 18, since the value can reduce by 2p = 2 at most (this follows from the formula for the AIC := n log(RSS/n) + 2p). By eliminating a regressor in x1 + x2 a smaller AIC (>8) can be obtained. Therefore, the branch x2 + x3 is left out for the future analysis. 2.2.3

Stepwise algorithms

With data sets larger than 40 input variables a search through all possible subsets becomes infeasible. The following algorithms can be used instead: 11

Figure 1: Tree of models where complete branches can be eliminated. • Forward stepwise selection: starts with the intercept and then sequentially adds into the model the predictor that most improves the fit. The improvement of fit is often based on the F -statistic ˆ − RSS(β) ˜ RSS(β) . F = ˜ RSS(β)/(n − k − 2) βˆ is the parameter vector of the current model with k parameters, β˜ the parameter vector of the model with k + 1 parameters. Typically, predictors are added sequentially until no predictor produces an F ratio greater than the 90th or 95th percentile of the F1,n−k−2 distribution. • Backward stepwise selection: starts with the full model and sequentially deletes the predictors. It typically uses the same F ratio like forward stepwise selection. Backward selection can only be used when n > p in order to have a well defined model. • There are also hybrid stepwise strategies that consider both forward and backward moves at each stage, and make the “best” move.

2.3

Methods using derived inputs as regressors

In many situations we have a large number of inputs, often highly correlated. For the model y = Xβ + ε with X fixed and ε ∼ N (0, σ 2 I) we have: βˆLS ∼ N (β, σ 2 (X > X)−1 ). In case of highly correlated regressors, X > X is almost singular and (X > X)−1 therefore ˆ As a consequence, tests for the regression very large which leads to a numerically instable β. parameters become unreliable: H0 : βj = 0 H1 : βj 6= 0

12

with

βˆj zj = p , σ ˆ dj

where dj represents the jth diagonal element of (X > X)−1 . To avoid this problem, we use methods that use derived input directions. The methods in this section produce a small number of linear combinations zk , k = 1, 2, . . . , q of the original inputs xj which are then used as inputs in the regression. The methods differ in how the linear combinations are constructed. 2.3.1

Principal Component Regression (PCR)

This method looks for transformations of the original data into a new set of uncorrelated variables called principal components, see, for example, Johnson and Wichern [2002]. This transformation ranks the new variables according to their importance, which means according to the size of their variance, and eliminates those of least importance. Then a least squares regression on the reduced set of principal components is performed. Since PCR is not scale invariant, one should scale and center the data first. Given a pdimensional random vector x = (x1 , . . . , xp )> with covariance matrix Σ. We assume that Σ is positive definite. Let V = (v1 , . . . , vp ) be a (p × p)-matrix with orthonormal column vectors, i.e. vi> vi = 1, i = 1, . . . , p, and V > = V −1 . We are looking for the linear transformation z = V >x zi = vi> x The variance of the random variable zi is given by Var(zi ) = IE[vi> xx> vi ] = vi> Σvi Maximizing the variance Var(zi ) under the conditions vi> vi = 1 with Lagrange gives φi = vi> Σvi − ai (vi> vi − 1) . Setting the partial derivation to zero we get ∂φi = 2Σvi − 2ai vi = 0 ∂vi which is (Σ − ai I)vi = 0 In matrix form we have ΣV = V A or Σ = V AV > where A = Diag(a1 , . . . , ap ). This is known as the eigenvalue problem, vi are the eigenvectors of Σ and ai the corresponding eigenvalues: a1 ≥ a2 . . . ≥ ap . Since Σ is positive definite, all eigenvalues are real, non negative numbers. zi is named the ith principal component of x, and we have: Cov(z) = V > Cov(x)V = V > ΣV = A 13

The variance of the ith principal component matches the eigenvalue ai , the variances are ranked in descending order. The last principal component has the smallest variance. The principal components are orthogonal to all the other principal components (they are even uncorrelated) since A is a diagonal matrix. In the following we will use q (1 ≤ q ≤ p) principal components for regression. The regression model for observed data X and y can then be expressed as y = Xβ + ε = XV V > β + ε = Zθ + ε with the n×q matrix of the empirical principal components Z = XV and the new regression coefficients θ = V > β. The solution of the least squares estimation is θˆk = (zk> zk )−1 zk> y and θˆ = (θˆ1 , . . . , θˆq )> . Since the zk are orthogonal, the regression is just a sum of univariate regressions yˆPCR =

q X

θˆk zk

k=1

Since the zk are linear combinations of the original xj , we can express the solution in terms of coefficients of the xj q X ˆ βPCR (q) = θˆk vk = V θˆ k=1

Note that if q = p, we would just get back the usual least squares estimates for the full model. For q < p we get a “reduced” regression. 2.3.2

Partial Least Squares (PLS) regression

This technique also constructs a set of linear combinations of the inputs for regression, but unlike principal components regression it uses y in addition to X for this construction [e.g., Tenenhaus, 1998]. We assume that both y and X are centered. Instead of calculating the parameters β in the linear model yi = x> i β + εi we estimate the parameters γ in the so-called latent variable model y i = t> i γ + εi . We assume: • The new coefficients γ are of dimension q ≤ p. • The values ti of the variables are put together in a (n × q) score matrix T . • Due to the reduced dimension, the regression of y on T should be more stable.

14

• T can not be observed directly; we get each T sequentially for k = 1, 2, . . . , q through the PLS criteria ak = argmax Cov(y, Xa) a

under the constraints kak k = 1 and Cov(Xak , Xaj ) = 0 for 1 ≤ j < k. The vectors ak with k = 1, 2, . . . , q are called loadings, and they are collected in the columns of the matrix A. The score matrix is then T = XA, and y can be written as: y = Tγ + ε = (XA)γ + ε = X (Aγ) +ε = Xβ + ε | {z } β

In other words, PLS does a regression on a weighted version of X which contains incomplete or partial information (thus the name of the method). The additional usage of the least squares method for the fit leads to the name Partial Least Squares. Since PLS uses also y to determine the PLS-directions, this method is supposed to have better prediction performance than for instance PCR. In contrast to PCR, PLS is looking for directions with high variance and large correlation with y.

2.4

Shrinkage methods

Shrinkage methods keep all variables in the model and assign different (continuous) weights. In this way we obtain a smoother procedure with a smaller variability. 2.4.1

Ridge regression

Ridge regression shrinks the coefficients by imposing a penalty on their size. The ridge coefficients minimize a penalized residual sum of squares,   Ã !2 p p n X  X X βˆRidge = argmin y i − β0 − xij βj + λ βj2 (2)   β i=1

j=1

j=1

Here λ ≥ 0 is a complexity parameter that controls the amount of shrinkage: the larger the value of λ, the greater the amount of shrinkage. The coefficients are shrunk towards zero (and towards each other). An equivalent way to write the ridge problem is  Ã !2  p n X  X ˆ βRidge = argmin yi − β0 − xij βj   β i=1

under the constraint

p X

j=1

βj2 ≤ s,

j=1

15

which makes explicit the size constraint on the parameters. By penalizing the RSS we try to avoid that highly correlated regressors (e.g. xj and xk ) cancel each other. An especially large positive coefficient βj can be canceled by a similarly large negative coefficient βk . By imposing a size constraint on the coefficients this phenomenon can be prevented. The ridge solutions βˆRidge are not equivariant for different scaling of the inputs, and so one normally standardizes the inputs. In addition, notice that the intercept β0 has been left out of the penalty term. Penalization of the intercept would make the procedure depend on the origin chosen for y; that is, adding a constant c to each of the targets yi would not simply result in a shift of the predictions by the same amount c. P We center the xij (each xij is replaced by xij − xj ) and estimate β0 by y = ni=1 yi /n. The remaining coefficients get estimated by a ridge regression without intercept, hence the matrix X has p rather than p + 1 columns. Rewriting (2) in matrix form gives RSS(λ) = (y − Xβ)> (y − Xβ) + λβ > β βˆRidge = (X > X + λI)−1 X > y .

and the solutions become

I is the (p × p) identity matrix. Advantages of the just described method are: • With the choice of quadratic penalty β > β, the resulting ridge regression coefficients are again a linear function of y. • The solution adds a positive constant to the diagonal of X > X before inversion. This makes the problem nonsingular, even if X > X is not of full rank. This was the main motivation of its introduction around 1970. • The effective degrees of freedom are df(λ) = tr(X(X > X + λI)−1 X > ), thus for λ = 0 ⇒ df(λ) = tr(X > X(X > X)−1 ) = tr(Ip ) = p λ → ∞ ⇒ df(λ) → 0 It can be shown that PCR is very similar to ridge regression: both methods use the principal components of the input matrix X. Ridge regression shrinks the coefficients of the principal components, the shrinkage depends on the corresponding eigenvalues; PCR completely discards the components to the smallest p − q eigenvalues. 2.4.2

Lasso Regression

The lasso is a shrinkage method like ridge, but the L1 norm rather than the L2 norm is used in the constraints. The lasso is defined by à !2 p n X X βˆLasso = argmin yi − β0 − xij βj β

with the constraint

i=1

p X

j=1

|βj | ≤ s.

j=1

16

Just as in ridge regression we standardize the data. The solution for βˆ0 is y, and thereafter we fit a model without an intercept. Lasso and ridge differ in their penalty term. The lasso solutions are nonlinear and a quadratic programming algorithm is used to compute them. Because of the nature of the constraint, making s sufficiently small will cause some of the coefficients to be exactly 0. PThus the lasso does a kind of continuous subset selection. If s is chosen larger than s0 = pj=1 |βˆj | (where βˆj is the least squares estimate), then the lasso estimates are the least squares estimates. On the other hand, for s = s0 /2, the least squares coefficients are shrunk by about 50% on average. However, the nature of the shrinkage is not obvious. Like the subset size in subset selection, or the penalty in ridge regression, s should be adaptly chosen to minimize an estimate of expected prediction error.

3

Linear regression methods in R

3.1

Least squares regression in R

The various regression methods introduced throughout the manuscript we will illustrated with the so-called Body fat data. The data set can be found in the R package library(UsingR) under the name “fat”. It consists of 15 physical measurements of 251 men. • • • • • • • • • • • • • • •

body.fat: percentage of body-fat calculated by Brozek’s equation age: age in years weight: weight (in pounds) height: height (in inches) BMI: adiposity index neck: neck circumference (cm) chest: chest circumference (cm) abdomen: abdomen circumference (cm) hip: hip circumference (cm) thigh: thigh circumference (cm) knee: knee circumference (cm) ankle: ankle circumference (cm) bicep: extended biceps circumference (cm) forearm: forearm circumference (cm) wrist: wrist circumference (cm)

To measure the percentage of body-fat in the body, an extensive (and expensive) underwater technique has to be performed. The goal here is to establish a model which allows the prediction of the percentage of body-fat with easily measurable and collectible variables in order to avoid the underwater procedure. Nowadays, a new, very effortless method called bio-impedance analysis provides a reliable method to determine the body-fat percentage. • Scanning of the data and explanation of the variables

17

> > > > # > # > # >

library(UsingR) data(fat) attach(fat) fat$body.fat[fat$body.fat == 0] lm.regsubset lm.regsubset2 plot(lm.regsubset2)

Figure 2: Model selection with leaps() This plot shows the two best, by regsubsets() computed models with 1-8 regressors each. The BIC, coded in grey scale, does not improve after the fifth stage (starting from the bottom, see Figure 2). The optimal model can then be chosen from the models with “saturated” grey, and preferable that model is taken with the smallest number of variables. 3.2.2

Stepwise algorithms in R

• Stepwise selection with step() >model.lmstep anova(model.lm, model.lmstep) Analysis of Variance Table Model 1: body.fat ~ age + weight + height + BMI + neck + chest + abdomen + hip + thigh + knee + ankle + bicep + forearm + wrist Model 2: body.fat ~ age + weight + neck + abdomen + hip + thigh + forearm + wrist Res.Df RSS Df Sum of Sq F Pr(>F) 1 235 3738.3 2 241 3786.2 -6 -47.9 0.5019 0.8066

By using the smaller model model.lmstep no essential information is lost, therefore it can be used for the prediction instead of model.lm.

3.3 3.3.1 > > > > + >

Methods using derived inputs as regressors in R Principal component regression in R

x model.ridge model.ridge$Inter [1] 1 > model.ridge$coef age weight 0.79271367 -2.74668538 abdomen hip 8.87964121 -1.43870319 forearm wrist 0.79583229 -1.41917220

height BMI neck chest 0.30184658 1.25208862 -1.06552555 -0.34544269 thigh knee ankle bicep 1.06181696 -0.04751459 0.19309025 0.44022071

• optimal selection of λ using GCV > model.ridge = lm.ridge(body.fat ~ ., data = fat, lambda = seq(0, + 15, by = 0.2)) > select(model.ridge) modified HKB estimator is 1.513289 modified L-W estimator is 4.41101 smallest value of GCV at 1.2

22

> plot(model.ridge$lambda, model.ridge$GCV, type = "l")

Generalized cross-validation (GCV) is a good approximation of the leave-one-out (LOO) cross-validation by linear fits with quadratic errors. For a linear fit we have yˆ = Sy with the corresponding transformation matrix S and this results in #2 " n n i2 X ˆ(xi ) 1 Xh y − f 1 i −i yi − fˆ (xi ) = n i n i 1 − Sii

0.069 0.068 0.067

GCV error

0.070

with fˆ−i (xi ) as fit without observation i and Sii as ith diagonal element of S. The GCV approximation is " #2 N yi − fˆ(xi ) 1 X GCV = N i 1 − trace(S)/n

0

5

10

15

lambda

Figure 6: Optimal selection of λ with GCV

3.4.2

Lasso regression in R

• Lasso regression >library(lars) >model.larsplot(model.lars)

• optimal selection of s with CV > set.seed(123) > cv.lasso x. Then the decision boundary between class k and l is the set of points for which fˆk (x) = fˆl (x), that is, {x : (βˆk0 − βˆl0 ) + (βˆk − βˆl )> x = 0}. New data points x can then be classified with this predictor.

4.1

Linear regression of an indicator matrix

Here each of the response categories is coded by an indicator variable. Thus if G has K classes, there will be K such indicators yk

with k = 1, 2, . . . , K

with

½ yk =

1 for G = k 0 otherwise.

These response variables are collected in a vector y = (y1 , y2 , . . . , yK ), and the n training instances form an (n × K) indicator response matrix Y , with 0/1 values as indicators for the class membership. We fit a linear regression model to each of the columns yk of Y which is given by βˆk = (X > X)−1 X > yk with k = 1, 2, . . . , K . ˆ since The βˆk can be combined in a ((p + 1) × K) matrix B ˆ = (βˆ1 , βˆ2 , . . . , βˆK ) B = (X > X)−1 X > (y1 , y2 , . . . , yK ) = (X > X)−1 X > Y . The estimated values are then Yˆ

ˆ = XB = X(X > X)−1 X > Y .

A new observation x can be classified as follows: • Compute the vector of length K of the fitted values: i> h i> h > ˆ ˆ ˆ ˆ f (x) = (1, x )B = f1 (x), . . . , fK (x) • Identify the largest component fˆ(x) and classify x accordingly: ˆ G(x) = argmax fˆk (x) k∈G

25

4.2 4.2.1

Discriminant analysis Linear Discriminant Analysis (LDA)

G consists of K classes. Let the probability of observation belonging to class k be (the Pan K prior probability) πk , k = 1, 2, . . . , K with k=1 πk = 1. Suppose hk (x) is the density function of x in class G = k. Then P (G = k|x) =

hk (x)πk K P

hl (x)πl

l=1

is the conditional probability that with a given observation x the random variable G is k. hk (x) is often assumed to be the density of a multivariate normal distribution ϕk ½ ¾ (x − µk )> Σ−1 1 k (x − µk ) exp − . ϕk (x) = p 2 (2π)p |Σk | LDA arises in the special case when we assume that the classes have a common covariance Σk = Σ, k = 1, 2, . . . , K [see, e.g., Huberty, 1994]. Comparing these two classes, it is sufficient to look at the log-ratio log

ϕk (x)πk ϕk (x) πk P (G = k|x) = log = log + log P (G = l|x) ϕl (x)πl ϕl (x) πl πk 1 − (µk + µl )> Σ−1 (µk − µl ) + x> Σ−1 (µk − µl ) . = log πl 2

The decision boundary between the classes k and l is P (G = k|x) = P (G = l|x) which is linear in x (in p dimensions this is a hyperplane). Using this equality, the log-ratio becomes log

P (G = k|x) ϕk (x)πk = log(1) = 0 = log P (G = l|x) ϕl (x)πl = log ϕk (x) + log πk − log ϕl (x) − log πl 1 1 − (x − µk )> Σ−1 (x − µk ) + log πk = log p/2 1/2 (2π) |Σ| 2 1 1 − log + (x − µl )> Σ−1 (x − µl ) + log πl p/2 1/2 (2π) |Σ| 2 1 1 > −1 −1 = − x Σ x +x> Σ−1 µk − µ> k Σ µk + log πk 2 2 | {z } δk (x)

1 + x> Σ−1 x −x> Σ−1 µl + 2 |

1 > −1 µ Σ µl − log πl . 2 {zl }

−δl (x)

The linear discriminant scores for class k 1 −1 δk (x) = x> Σ−1 µk − µ> k Σ µk + log πk 2 26

(for k = 1, . . . , K) provide a description of the decision rule. Observation x is classified to the group for which δk (x) is the largest, i.e., G(x) = argmax δk (x) . k

Considering the two classes k and l, then x is classified to class k if δk (x) > δl (x)

⇐⇒

1 πl x> Σ−1 (µk − µl ) − (µk + µl )> Σ−1 (µk − µl ) > log 2 πk

The parameters of the distribution (µk , Σk ) as well as the prior probabilities πk are usually unknown and need to be estimated from the training data. Common estimates are: nk with nk . . . number of observations in group k n X xi = nk g =k

π ˆk = ˆk µ

i

ˆ = Σ

K 1 XX ˆ k )(xi − µ ˆ k )> (xi − µ n − K k=1 g =k i

gi indicates the true group number of observation xi from the training data. Using the linear discriminant function we now can estimate the group membership of xi . This gives a value ˆ i ), which can now be compared to gi . A reliable classification rule should correctly for G(x classify as many observations as possible. The misclassification rate provides the relative frequency of incorrectly classified observations. 4.2.2

Quadratic Discriminant Analysis (QDA)

Just as in LDA we assume a prior probability πk for class k, k = 1, 2, . . . , K with 1. The conditional probability that G takes on the value k is P (G = k|x) =

PK

k=1

πk =

ϕk (x)πk K P ϕl (x)πl l=1

(ϕ is the density of the multivariate normal distribution). QDA does not assume the covariance matrices to be equal, which complicates the formulas. Considering the log-ratio of the conditional probabilities results in the quadratic discriminant scores (quadratic functions in x) 1 1 δk (x) = − log |Σk | − (x − µk )> Σ−1 k (x − µk ) + log πk 2 2 that can be used for the group assignment. Accordingly, observation x is classified to that group which yields the maximal value of the quadratic discriminant scores. πk and µk can be estimated from the training data as in LDA. Σk can be estimated by the sample covariance matrix of the observations from group k.

27

4.2.3

Regularized Discriminant Analysis (RDA)

This method is a compromise between linear and quadratic discriminant analysis, that allows to shrink the separate covariances of QDA towards a common covariance as in LDA. These common (pooled) covariance matrices have the form ÃK ! X 1 ˆ =P ˆk Σ nk Σ K n k=1 k k=1 where nk is the number of observations per group. With the pooled covariance matrices the regularized covariance matrices have the form ˆ k (α) = αΣ ˆ k + (1 − α)Σ ˆ Σ with α ∈ [0, 1]. α provides a compromise between LDA (α = 0) and QDA (α = 1). The idea is to keep the degrees of freedom flexible. α can be estimated using cross-validation.

4.3

Logistic regression

Logistic regression also deals with the problem of classifying observations that originate from two or more groups [see, e.g., Kleinbaum and Klein, 2002]. The difference to the previous classification methods is that the output of logistic regression includes an inference statistic which provides information about which variables are well suitable for separating the groups, and which provide no contribution to this goal. In logistic regression the posterior probabilities of the K classes are modeled by linear functions in x, with the constraint that the probabilities remain in the interval [0, 1] and that they sum up to 1. Let us consider the following models: P (G = 1|x) = β10 + β1> x P (G = K|x) P (G = 2|x) log = β20 + β2> x P (G = K|x) .. . P (G = K − 1|x) > log = β(K−1)0 + βK−1 x P (G = K|x) log

Here we chose class K to be the denominator, but the choice of the denominator is arbitrary in the sense that the estimates are equivariant under that choice. After a few calculations we get ª © P (G = k|x) = P (G = K|x) exp βk0 + βk> x for k = 1, 2, . . . , K − 1 K X

" P (G = k|x) = 1 = P (G = K|x) 1 +

k=1

K−1 X

# exp βk0 + βk> x ©

ª

k=1

P (G = K|x) =

K−1 P

1 ©

ª exp βl0 + βl> x l=1 © ª exp βk0 + βk> x P (G = k|x) = K−1 © ª P 1+ exp βl0 + βl> x 1+

l=1

28

for k = 1, 2, . . . , K − 1

In the special case of K = 2 classes the model is log

P (G = 1|x) = β0 + β1> x P (G = 2|x)

or exp{β0 + β1> x} P (G = 1|x) = 1 + exp{β0 + β1> x} 1 P (G = 2|x) = 1 + exp{β0 + β1> x} P (G = 1|x) + P (G = 2|x) = 1 | {z } | {z } p1 (x)

p2 (x)

The parameters can be estimated using the ML method by using the n training data points. The log-likelihood function is l(β) =

n X

log pgi (xi ; β) ,

i=1

where

µ β=

β0 β1



and xi includes the intercept. pgi is the probability that observation xi belongs to class gi = 1, 2. With p(x; β) = p1 (x; β) = 1 − p2 (x; β) and an indicator yi = 1 for gi = 1 and yi = 0 for gi = 2, l(β) can be written as l(β) = =

n X

{yi log p(xi ; β) + (1 − yi ) log [1 − p(xi ; β)]}

i=1 n ½ X i=1

¸ · ¸¾ exp(β > xi ) exp(β > xi ) yi log + (1 − yi ) log 1 − 1 + exp(β > xi ) 1 + exp(β > xi ) ·

n X ¤ £ ¤ª © > £ = yi β xi − yi log 1 + exp(β > xi ) − (1 − yi ) log 1 + exp(β > xi i=1 n X © > £ ¤ª = yi β xi − log 1 + exp(β > xi ) . i=1

To maximize the log-likelihood we set its derivative equal to zero. These score equations are p + 1 nonlinear functions in β of the form ¾ n ½ X ∂l(β) 1 > = yi xi − exp(β xi )xi >x ) ∂β 1 + exp(β i i=1 =

n X

[yi − p(xi ; β)] xi = 0 .

i=1

To solve these equations, we use the Newton-Raphson algorithm, which requires the second derivative or Hessian matrix: n X ∂ 2 l(β) = ... = − p(xi ; β)[1 − p(xi ; β)]xi x> i ∂β∂β > i=1 29

Starting with βold we get µ βnew = βold −

∂ 2 l(β) ∂β∂β >

¶−1

∂l(β) , ∂β

where the derivatives are evaluated at βold . Matrix notation provides a better insight in that method: y . . . (n × 1) vector of the yi X . . . (n × (p + 1)) matrix of observations xi p . . . (n × 1) vector of estimated probabilities p(xi , βold ) W . . . (n × n) diagonal matrix with weights p(xi , βold )(1 − p(xi , βold )) in the diagonal The above derivatives can now be written as: ∂l(β) = X > (y − p) ∂β ∂ 2 l(β) = −X > W X ∂β∂β > The Newton-Raphson algorithm has the form βnew = βold + (X > W X)−1 X > (y − p) = (X > W X)−1 X > W (Xβold + W −1 (y − p)) = (X > W X)−1 X > W z {z } | weighted LS

with the adjusted response z = Xβold + W −1 (y − p)) . | {z } adjustment

This algorithm is also referred to as IRLS (iteratively reweighted LS), since each iteration solves the weighted least squares problem βnew ←− argmin(z − Xβ)> W (z − Xβ) . β

Some remarks: • β = 0 is a good starting value for the iterative procedure, although convergence is never guaranteed. • For K ≥ 3 the Newton-Raphson algorithm can also be expressed as an IRLS algorithm, but in this case W is no longer a diagonal matrix. • The logistic regression is mostly used for modeling and inference. The goal is to understand the role of the input variables in explaining the outcome. Comparison of logistic regression and LDA: Logistic regression uses log

P (G = k|x) = βk0 + βk> x, P (G = q|x)

whereas LDA uses P (G = k|x) πk 1 log = log − (µk + µK )> Σ−1 (µk − µK ) + x> Σ−1 (µk + µK ) P (G = K|x) πK 2 > = αk0 − αk x The linearity in x in LDA is achieved by: 30

• the assumption of equal covariance matrices • the assumption of multivariate normally distributed groups Even though the models have the same form, the coefficients are estimated differently. The joint distribution of x and G can be expressed as P (x, G = k) = P (x)P (G = k|x) . Logistic regression does not specify P (x), LDA assumes a mixture model (with ϕ as a normal distribution): K X P (x) = πk ϕ(x; µk , Σ) k=1

5

Linear methods for classification in R

5.1

Linear regression of an indicator matrix in R

The different classification methods introduced in this manuscript will be illustrated with the so-called Pima Indian data. The data set can be found in the R package library(mlbench) under the name “PimaIndiansDiabetes2”. > library(mlbench) > data(PimaIndiansDiabetes2) > pid pidind pidind$diabetes set.seed(101) > train mod.reg > > >

predictind > > > > > >

set.seed(101) train > > >

set.seed(101) train > > >

set.seed(101) train predictGLM plot(predictGLM, pch = as.numeric(pid$diabetes) + 1)

0

100

200

300

400

Index

Figure 9: Prediction of the group memberships; the line is the separation from logistic regression, the symbol corresponds to the true group membership.

34

• Misclassification rate Simple evaluation of the misclassification rate: > > > > > >

set.seed(101) train N + λΩN )−1 N > y which is a generalized form of Ridge regression. The fit of the smoothing spline is given by fˆ(x) =

n X

Nj (x)θˆj .

j=1

7

Basis expansions in R

7.1

Interpolation with splines in R

• Model fit with B-splines Generation of a sine function overlayed with noise. > x = seq(1, 10, length = 100) > y = sin(x) + 0.1 * rnorm(x) > x1 = seq(-1, 12, length = 100)

Since we are looking for a linear relationship between the y variable and the spline bases, the function lm() can be used (Figure 12): > > > >

library(splines) lm1 = lm(y ~ bs(x, df = 6)) plot(x, y, xlim = range(x1)) lines(x1, predict.lm(lm1, list(x = x1)))

• Usage of natural cubic splines (Figure 13) > lm3 = lm(y ~ ns(x, df = 6)) > lines(x1, predict.lm(lm3, list(x = x1)))

39

1.0

● ●● ● ●●●●●● ●●●●● ● ●● ●● ● ● ●● ●● ●● ● ●●●● ● ● ●● ●● ●● ● ● ● ●● ● ●●● ●● ● ● ●● ● ●● ● ●● ● ●● ●● ●●●● ● ●● ●●

0.0 −1.0

y

0.5

●●● ● ●●●●● ● ●●● ●● ●● ● ● ● ● ● ●●

0

2

4

6

8

10

12

x

Figure 12: Prediction with 6 spline basis functions

1.0

● ●●● ● ●●●●● ● ●●● ●● ●● ● ● ● ● ● ●●

0.0 −1.0

y

0.5

●● ● ●●●●●● ●●●●● ● ●● ●● ● ● ●● ●● ●● ● ●●●● ● ● ●● ●● ●● ● ● ● ●● ● ●●● ●● ● ● ●● ● ●● ● ●● ● ●● ●● ●●●● ● ●● ●●

0

2

4

6

8

10

12

x

Figure 13: Prediction with natural cubic splines

40

7.2

Smoothing splines in R

• Fit with smooth.spline() from library(splines) library(splines) m1 = smooth.spline(x, y, df = 6) plot(x, y, xlim = range(x1), ylim = c(-1.5, 1.5)) lines(predict(m1, x1))

1.5

> > > >

−1.5

−0.5

y

0.5

● ●● ● ●●● ● ●●●●● ● ●●●●●● ●●●●● ● ●●● ● ●● ●● ●●●● ● ● ●● ●● ●● ● ● ● ● ●●●● ● ● ●● ● ●● ●● ●● ●● ● ●● ● ●●● ●● ● ● ●● ● ●● ●●● ● ●● ●● ●●●● ●● ●●●

0

2

4

6

8

10

12

x

Figure 14: Prediction at the boundaries with smoothing splines • Choice of degrees of freedom with cross-validation

1.5

> m2 = smooth.spline(x, y, cv = TRUE) > plot(x, y, xlim = range(x1), ylim = c(-1.5, 1.5)) > lines(predict(m2, x1), lty = 2)

−0.5

y

0.5

● ●● ● ●●● ● ●●●●● ● ●●●●●● ●●●●● ● ●●● ● ●● ●● ●●●● ● ● ●● ●● ●● ● ● ● ● ●●●● ● ● ●● ● ●● ●● ●● ●● ● ●● ● ●●● ●● ● ● ●● ● ●● ●●● ● ●● ●● ●●●● ●● ●●●

−1.5

df=6 df=11.7 0

2

4

6

8

10

12

x

Figure 15: Choice of degrees of freedom • Example: smoothing splines with the bone density data The bone density data consist of 261 teens from North America. The average age of the teens (age) and the relative change in the bone density (spnbmd) were measured at two consecutive visits.

41

> library(ElemStatLearn) > data(bone)

The following R code plots the data, separately for male and female persons, and fits smoothing splines to the data subsets.

0.20

plot(spnbmd ~ age, data = bone, pch = ifelse(gender == "male", 1, 3), xlab = "Age", ylab = "Relative Change in Spinal BMD") bone.spline.male + > + > > > + +







Male data Female data Model for male Model for female

● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●●● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ●●● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ●● ● ● ●● ●● ●● ● ● ● ● ● ●● ● ● ● ● ●●● ● ●● ● ● ● ● ● ●●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●

10

15

20

25

Age

Figure 16: Bone density data The regressor variable measures the relative change of the bone density. For each, the male and female measurements, a smoothing spline with 12 degrees of freedom was fitted. This corresponds to a value of λ ≈ 0.0002.

8

Generalized Additive Models (GAM)

For GAMs the weighted sum of the regressor variables is replaced by a weighted sum of transformed regressor variables [see Hastie and Tibshirani, 1990]. In order to achieve more flexibility, the relations between y and x are modeled in a non-parametric way, for instance by cubic splines. This allows to identify and characterize nonlinear effects in a better way.

42

8.1

General aspects on GAM

GAMs are generalizations of Generalized Linear Models (GLM) to nonlinear functions. Let us consider the special case of multiple linear regression. There, we model the conditional expectation of y by a linear function: IE(y|x1 , x2 , . . . , xp ) = α + β1 x1 + . . . + βp xp A generalization is to use unspecified nonlinear (but smooth) functions fj instead of the original regressor variables xj : IE(y|x1 , x2 , . . . , xp ) = α + f1 (x1 ) + f2 (x2 ) + . . . + fp (xp ) Another regression model is the logistic regression model, which is in case of 2 groups µ ¶ µ(x) log = α + β1 x1 + . . . + βp xp , 1 − µ(x) where µ(x) = P (y = 1|x). The generalization with nonlinear functions is called additive logistic regression model, and it replaces the linear terms by µ ¶ µ(x) log = α + f1 (x1 ) + . . . + fp (xp ) . 1 − µ(x) For GLMs, and analogously for GAMs, the “left hand side” is replaced by different other functions. The right hand side remains a linear combination of the input variables for GLM, or of nonlinear functions of the input variables for GAM. Generally we have for GAM: The conditional expectation µ(x) of y is related to an additive function of the predictors via a link function g: g [µ(x)] = α + f1 (x1 ) + . . . + fp (xp ) Examples for classical link functions are: • g(µ) = µ: is the identity link, used for linear and additive models of Gaussian response data • g(µ) = logit(µ) or g(µ) = probit(µ): the probit link function (probit(µ) = Φ−1 (µ)) is used for modeling binomial probabilities • g(µ) = log(µ): log-linear or log-additive models for Poisson count data. All these link functions arise from the exponential family, which forms the class of generalized linear models. Those are all extended in the same way to generalized additive models.

8.2

Parameter estimation with GAM

The model has the form yi = α +

p X

fj (xij ) + εi ,

j=1

43

i = 1, . . . , n

with IE(εi ) = 0. Given observations (xi , yi ), a criterion like the penalized residual sum of squares (PRSS) can be specified for this problem: PRSS(α, f1 , f2 , . . . , fp ) =

n X

( yi − α −

i=1

R

p X

)2 fj (xij )

+

j=1

p X

Z λj

00

fj (tj )2 dtj

j=1

00

fj (tj )2 is an indicator for how much the function is ≥ 0. With linear fj , the integral is 0, nonlinear fj have values larger than 0. λj are tuning parameters. They regularize the tradeoff between the fit of the model and the roughness of the function. The larger the λj , the smoother the function. It can be shown, that (independent of the choice of λj ) an additive model with cubic splines minimizes the PRSS. Each of the functions fj is a cubic spline in the component xj with knot xij , i = 1, . . . , n. Uniqueness of the solutions is obtained by the restrictions n X

n

fj (xij ) = 0 ∀j

=⇒

i=1

1X yi =: ave(yi ) α ˆ= n i=1

and the non-singularity of X. Iterative algorithm for finding the solution: 1. Initialization of α ˆ = ave(yi ), fˆj ≡ 0 ∀i, j 2. For the cycle j = 1, 2, . . . , p, . . . , 1, 2, . . . , p, . . . oi hn P • fˆj ← Sj yi − α ˆ − k6=j fˆk (xik ) , P • fˆj ← fˆj − n1 ni=1 fˆj (xij )

i = 1, . . . , n

until all fˆj are stabilized. P The functions Sj [x] = nj=1 Nj (x)θj denote cubic smoothing splines, see Section 6.2. This algorithm is also known as backfitting algorithm.

9

Generalized additive models in R • Fit of a model to the PimaIndianDiabetes data with gam() from library(mgcv) > > > > + + >

library(mgcv) set.seed(101) train = sample(1:nrow(pid), 300) mod.gam |t|) (Intercept) 0.33000 0.02067 15.96 gam.res 0.5 > gam.TAB gam.TAB 0 1 0 51 10 1 16 15

• Misclassification rate for test set 45

> mcrgam mcrgam [1] 0.2826087

MCR

INDR LDA QDA RDA GLM GAM 0.239 0.239 0.25 0.217 0.217 0.283

According to the table above, GAMs show a rather poor performance in comparison to the other methods. These results, however, are obtained from the selected training and test set, and the picture could be different for another selection. Therefore, it is advisable to repeat the random selection of training and test data several times, to apply the classification methods, and to average the resulting misclassification rates.

10

Tree-based methods

Tree-based methods partition the feature space of the x-variables into a set of rectangular regions which should be as homogeneous as possible, and then fit a simple model in each region [see, e.g., Breiman et al., 1984]. In each step, a decision rule is determined by a split variable and a split point which afterwards is used to assign an observation to the corresponding partition. Then a simple model (i.e. a constant) is fit to every region. To simplify matters, we restrict attention to binary partitions, therefore we always have only two branches in each split. Mostly, the result is presented in form of a tree leading to an easy to understand and interpretable representation. Tree models are nonparametric estimation methods, since no assumptions about the distribution of the regressors is made. They are very flexible in application which also makes them computationally intensive, and the results are highly dependent on the observed data. Even a small change in the observations can result in a severe change of the tree structure.

10.1

Regression trees

We have data of the form (xi , yi ), i = 1, . . . , n with xi = (xi1 , . . . , xip ). The algorithm has to make decision about the: • Split variable • Split point • Form of the tree Suppose that we have a partition into MPregions R1 , . . . , RM and we model the response M as a constant cm in each region: f (x) P = m=1 cm2I(x ∈ Rm ). If we adopt as our criterion minimization of the sum of squares (yi − f (xi )) , it is easy to see that the best cˆm is just the average of yi in each region: cˆm = ave(yi |xi ∈ Rm ) Now finding the best binary partition in terms of minimization of the above criterion of the sum of squares is generally computationally infeasible. Hence we look for an approximation of the solution. 46

Approximative solution: • Consider a split variable xj and a split point s and define a pair of half planes by: R1 (j, s) = {x|xj ≤ s} ;

R2 (j, s) = {x|xj > s}

• Search for the splitting variable xj and for the split point s that solves   X X (yi − c1 )2 + min (yi − c2 )2  . min min c1

j,s

c2

xi ∈R1 (j,s)

xi ∈R2 (j,s)

For any choice of j and s, the inner minimization is solved by cˆ1 = ave(yi |xi ∈ R1 (j, s)) and cˆ2 = ave(yi |xi ∈ R2 (j, s)) . For each splitting variable, the determination of the split point s can be done very quickly and hence by scanning through all of the inputs, the determination of the best pair (j, s) is feasible. Afterwards, this algorithm is applied to all other regions. In order to avoid overfitting, we use a tuning parameter, which tries to regulate the model’s complexity. The strategy is to grow a large tree T0 , stopping the splitting process when some minimum node size is reached. Then this large tree is pruned using cost complexity pruning. By pruning (thus reduction of inner nodes) of T0 , we get a “sub” tree T . We index terminal nodes by m, representing region Rm . Let |T | denote the number of terminal nodes in T and 1 X yi nm . . . number of observations in the space Rm nm x ∈R m i 1 X Qm (T ) = (yi − cˆm )2 nm x ∈R cˆm =

i

m

and thus we define the cost complexity criterion cα (T ) =

|T | X

nm Qm (T ) + α|T |

m=1

which has to be minimized. The tuning parameter α ≥ 0 regulates the compromise between tree size (large α results in a small tree) and goodness of fit (α = 0 results in a full tree T0 ). The optional value α ˆ for the final tree Tαˆ can be chosen by cross-validation. It can be shown that for every α there exists a unique smallest sub-tree Tα ⊆ T0 which minimizes cα . For finding Tα , we P eliminate successively that internal node which yields the smallest increase (per node) of m nm Qm (T ). This is done as long as no node is left. It can be shown that this sequence of sub-trees must include Tα .

10.2

Classification trees

The goal is to partition the x-variables into 1, . . . , K classes. They are classified by the known output variable y with values between 1, . . . , K. Afterwards, new data should be assigned to the corresponding class.

47

In a node m representing a region Rm with nm observations, let 1 X pˆmk = I(yi = k) nm x ∈R i

m

be the proportion of class k in node m. We classify the observations in node m to that class k(m) for which k(m) = argmaxk pˆmk . So, the observation if assigned to the majority class in node m. Different measures Qm (T ) of node impurity include the following: P 1. Misclassification error: n1m i∈Rm I(yi 6= k(m)) = 1 − pˆmk(m) P P 2. Gini index: k6=k0 pˆmk pˆmk0 = K ˆmk (1 − pˆmk ) k=1 p PK 3. Cross-entropy or deviance: k=1 pˆmk log pˆmk The three criteria are very similar, but cross-entropy and Gini index are differentiable, and hence more amenable to numerical optimization. In addition, cross-entropy and the Gini index are more sensitive to changes in the node probabilities than the misclassification rate. For this reason, either Gini index or cross-entropy should be used when growing a tree.

11 11.1

Tree based methods in R Regression trees in R

• Growing a regression tree with rpart() from library(rpart) > library(rpart) > mod.tree mod.tree n=250 (1 observation deleted due to missingness) node), split, n, deviance, yval * denotes terminal node 1) root 250 14557.340000 18.963200 2) abdomen< 91.9 131 3834.820000 13.912210 4) abdomen< 85.45 65 1027.789000 10.695380 8) hip< 92.9 28 362.304300 9.264286 16) ankle>=20.8 23 189.056500 8.143478 * 17) ankle< 20.8 5 11.448000 14.420000 * 9) hip>=92.9 37 564.742700 11.778380 18) wrist>=17.5 28 349.528600 10.657140 36) height>=173.355 25 258.945600 10.076000 * 37) height< 173.355 3 11.780000 15.500000 * .. . 15) abdomen>=112.3 11 220.900000 34.300000 30) weight>=219.075 9 65.040000 32.666670 * 31) weight< 219.075 2 23.805000 41.650000 *

48

For each branch of the tree we obtain results in the following order: – branch number – split – number of data following the split – deviations associated with the split – predicted value – ”*”, if node is a terminal node • Plot of the tree > plot(mod.tree) > text(mod.tree)

abdomen< 91.9

abdomen< 85.45

hip< 92.9

ankle>=20.8

abdomen>=85.45

hip>=92.9 height>=182.6

height>=173.4

abdomen< 104.9

height< 182.6

wrist>=17.5 ankle< 20.8 thigh>=60.5 wrist< 17.5knee>=39.05 thigh< 60.5

8.143514.42

abdomen>=91.9

ankle>=24.85

knee< 39.05

abdomen< 112.3

abdomen< 99.45

abdomen>=112.3

abdomen>=99.45 height>=183.2 weight>=219.1 height< 183.2 weight< 219.1

19.11

9.4515.277

23.128.906 32.66741.65

height=90 54.5 thigh>=54.5 neck>=40.65

10.07615.515.267

ankle< 24.85

abdomen>=104.9

neck< 40.65 height>=186.4

height< 186.4

11.97519.55

forearm>=29.85 abdomen< forearm< 96.9 29.85 ankle>=21.45 abdomen>=96.9 weight< ankle< 214 21.45 hip>=106.5 weight>=214hip< 106.5 13.4 15.320.106

17.93326.1

chest>=108.4ankle=23.6 ankle< 22.95 14.3 17.625.633 21.655 24.94427.7

ankle>=22.95

20.424.625 28.775

Figure 18: Regression tree of the body fat data The procedure to prune the tree and the choice of the optimal complexity are shown in the next section (Section 11.2).

49

11.2

Classification trees in R

The data set Spam consists of 4601 observations and 58 variables. The goal is to divide the variable Email in “good” and “spam” emails using classification trees. The data set is also available in the library(ElemStatLearn) with data(spam). > load("Spam.RData") > names(Spam) [1] [5] [9] [13] [17] [21] [25] [29] [33] [37] [41] [45] [49] [53] [57]

"make" "our" "order" "people" "business" "your" "hp" "lab" "data" "X1999" "cs" "re" "semicolon" "dollarSign" "capitalTotal"

"address" "over" "mail" "report" "email" "font" "hpl" "labs" "X415" "parts" "meeting" "edu" "parenthesis" "hashSign" "class"

"all" "remove" "receive" "addresses" "you" "X000" "george" "telnet" "X85" "pm" "original" "table" "bracket" "capitalAverage"

"X3d" "internet" "will" "free" "credit" "money" "X650" "X857" "technology" "direct" "project" "conference" "exclamationMark" "capitalLongest"

> set.seed(100) > train = sample(1:nrow(Spam), 3065) > library(mvpart)

• Prediction with the classification tree > tree1.pred tree1.tab = table(Spam[-train, "class"], tree1.pred) > tree1.tab tree1.pred good spam good 866 54 spam 69 547

• Misclassification rate for the test set > 1 - sum(diag(tree1.tab))/sum(tree1.tab) [1] 0.08007812

The misclassification rate is not necessarily the best measure for the evaluation, because one would rather accept that spam emails are not identified as spam, than “true” emails are incorrectly assigned as spam. • Results of the cv > plotcp(tree1, upper = "size")

Figure 19 shows the complexity of the tree and the cv estimate. The upper points are the MSEs within the cv, the lower points are the MSEs for the training data. Taking the minimum of the cv-errors, and increasing by one standard error gives the horizontal line. The smallest tree that has a MSE which is still below this line is achieved with α = 0.0035, which is our optimal tree complexity. The optimal size of the tree is approximately 20 nodes. 50

Size of tree 2

3

4



● ●

5

6

7

8

9

● ●

● ●

● ●

● ●

● ●

12 13 14 15 17 19 21 23 24 28 33 62 64 69 73

0.6

0.8



0.2

0.4



Min + 1 SE

● ●

● ●

● ●

● ●

● ●

● ●

● ● ●

● ●

● ●

● ● ●

● ●

















0.0

X−val Relative Error

1.0

1

Inf

0.072

0.036

0.02

0.012

0.0071 0.0048

0.004

0.0029

0.002

0.0014 0.0011

cp

Figure 19: Cross-validation results of tree1: cv-error with standard error (points on top) and error in test set (points below) • Pruning of the tree > tree2 plot(tree2) > text(tree2)

Pruning of the tree with the optimal α ˆ. exclamationMark< 0.0805

remove< 0.045

money< 0.01

free< 0.165

dollarSign< 0.174

remove>=0.045

money>=0.01 george>=0.08

free>=0.165 hp>=0.11

capitalAverage>=3.418 your< 0.31 your>=0.31

good good spam good good spam

capitalAverage< 2.306

george< 0.08

hp< 0.11 good spam

dollarSign>=0.174 our< 1.045 our>=1.045 good spam

capitalAverage< 3.418

exclamationMark>=0.0805

free< 0.165

capitalAverage>=2.306

free>=0.165 hp>=0.39

hp< 0.39

remove< 0.045 george>=0.19 remove>=0.045 meeting>=0.155 george< 0.19

internet< 0.535

business< 0.18

hp>=0.195

internet>=0.535 edu>=0.2 good spam good

business>=0.18 spam

hp< 0.195 spam

good good spam

Figure 20: Result of the pruning of tree1

51

meeting< 0.155

edu< 0.2

good good spam

• Prediction with the new tree > tree2.pred = predict(tree2, Spam[-train, ], type = "class") > tree2.tab = table(Spam[-train, "class"], tree2.pred) > tree2.tab tree2.pred good spam good 868 52 spam 80 536

• Misclassification rate for the test set > 1 - sum(diag(tree2.tab))/sum(tree2.tab) [1] 0.0859375

References L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Belmont, Wadsworth, CA, 1984. J. Fox. Applied Regression Analysis, Linear Models, and Related Methods. Sage Publications, Thousand Oaks, CA, USA, 1997. T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, New York, 2001. T.J. Hastie and R.J. Tibshirani. Generalized Additive Models. Chapman and Hall, London, 1990. C.J. Huberty. Applied Discriminant Analysis. John Wiley & Sons, Inc., New York, 1994. R.A. Johnson and D.W. Wichern. Applied Multivariate Statistical Analysis. Prentice Hall, Upper Saddle River, New Jersey, 5th edition, 2002. D.G. Kleinbaum and M. Klein. Logistic Regression. Springer, New York, 2nd edition, 2002. G. McLachlan and D. Peel. Finite Mixture Models. Wiley, New York, 2000. R Development Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria, 2008. http://www.R-project.org. M. Tenenhaus. La Regression PLS. Theorie et Practique. Editions Technip, Paris, 1998. (In French).

52