Pacific Symposium on Biocomputing 4:214-325 (1999)

COMPUTING MINIMUM DESCRIPTION LENGTH FOR ROBUST LINEAR REGRESSION MODEL SELECTION

GUOQI QIAN

Department of Statistical Science, La Trobe University, Melbourne, VIC 3083, Australia

A minimum description length (MDL) and stochastic complexity approach for model selection in robust linear regression is studied in this paper. Computational aspects and the implementation of this approach in practical problems are the focus of the study. In particular, we provide both algorithms and a package of S language programs for computing the stochastic complexity and carrying out the associated model selection. A simulation study is then presented to illustrate the MDL approach and to compare it with the commonly used AIC and BIC methods. Finally, an application is given to a physiological study of triathlon athletes.

1 Introduction

A powerful statistical tool for quantitative investigations in health and biological sciences is linear regression, where the simultaneous effects of a set of variables on a response variable can be analyzed. An important task in linear regression analysis is to screen a large number of potential explanatory variables and select the subset of them that fits the information contained in the response variable both efficiently and concisely. This is important because a regression model containing many irrelevant or superfluous explanatory variables may not only cause serious computational rounding errors, but may also render key statistical evidence undetectable. An equally important task is to see how the selected model is affected by outliers in the data. Namely, the model should be robust to a radical change in a small portion of the data or a small change in all of the data. A natural solution for robust model selection can be obtained using information-theoretic approaches such as Algorithmic Probability (ALP) (Solomonoff 1964), Minimum Message Length (MML) (Wallace and Freeman 1987), and Minimum Description Length (MDL) and Stochastic Complexity (Rissanen 1986, 1987 and 1996). Using the MDL and stochastic complexity approach, Qian and Künsch (1998 a&b) derived a new variable selection criterion for robust linear regression. This criterion chooses the subset of the explanatory variables relative to which the stochastic complexity of the data attains its minimum. The stochastic complexity of the data relative to the underlying regression model was shown to be approximated by the robust fitting error of the model plus the model complexity, a term depending on the robustness and the signal-to-noise ratio of the model, and on the weighted magnitude of the explanatory variables. Thus the new criterion substantially generalizes classic model selection criteria such as AIC and BIC, where the model complexity depends only on the number of parameters. An asymptotic study reveals that the new criterion selects with probability one the true model if it exists and can be finitely parameterized, and that it is able to avoid the two pitfalls of over-fitting and under-fitting that plague many model selection criteria like AIC and BIC.

The current paper focuses on the computational aspects and real applications of the stochastic complexity criterion for multiple robust regression model selection. Specifically, we will address the methods, and their properties, for computing the robust parameter estimates, the weight function and the criterion function that are involved in the model selection procedure. We will also introduce a package of S language programs called msrob that we have written for the computations. We will then present a simulation study comparing the new criterion with the commonly used AIC and BIC. Finally, we will give an application: determining an athlete's total time in a triathlon from candidate variables measuring the athlete's gross physical characteristics, training load and physiological makeup.

Some other closely related works are Baxter and Dowe (1996) and Dom (1996). In Baxter and Dowe (1996) the problem of order selection for polynomial regression models is studied in a non-robust context using the MML principle, which has been developed by Wallace and co-workers since 1968. Dom (1996) also studied mostly non-robust polynomial regression order selection, but using the MDL principle. A polynomial regression model concerns the relationship between a response variable and a polynomial function of a certain explanatory variable. So the statistical problems studied in these two papers are very different from ours, which concerns the significant relationship between a response and a subset of many explanatory variables in a robust framework. MML and MDL both use code length as a criterion function for model selection, but there are also many significant differences between the two principles. This relationship will not be expounded further here.

2 The Stochastic Complexity Criterion

When studying the dependence of a response variable y on a p-dimensional explanatory variable x, a linear model is usually assumed between y and x. Namely, for a sample of independent observations $(x_1^t, y_1), \ldots, (x_n^t, y_n)$ from $(x^t, y)$, we assume

$$
y_i = x_i^t \beta + r_i, \qquad (1)
$$

where $\beta$ is a p-dimensional unknown parameter and $r_i$ is the error with mean 0 conditional on $x_i$. Provided that the model (1) is valid, information about the indicated dependence can be obtained from a statistical inference about $\beta$ based on the data. For validity of the model (1), in practice we usually include in (1) all the explanatory variables available at first consideration, which results in a so-called full model. The validation of the full model can usually be carried out based on proper subject knowledge. However, if the full model retains many explanatory variables, its statistical inference is typically inefficient and non-informative. Therefore, a variable selection procedure is indispensable for a good regression analysis. With such a procedure, no important explanatory variable should be missed, while at the same time no superfluous variables should be included in the model.

Of the many attractive information-theoretic approaches, we choose to use MDL and the associated stochastic complexity. It is formalized by identifying a model with the length of an instantaneously decipherable code which is obtained from an optimal two-step coding scheme determined by this model. For a parametric model, the two-step scheme first encodes the parameter space, then encodes the data for each fixed parameter value. The shortest code length obtained in this way is called the stochastic complexity of the data relative to the employed model. According to the MDL principle, the smaller the stochastic complexity, the better the corresponding model. From Rissanen (1996) and Qian and Künsch (1998a) it follows that the stochastic complexity relative to a class of parametric probability densities can be expressed as the minus maximum log-likelihood of the data plus a model complexity term determined by the Fisher information and the maximum likelihood estimator (MLE) of the parameter. This result can be directly applied to the regression model (1) if the ordinary least squares method is used, i.e., if the error $r_i$ is given a normal distribution. But parameter estimation and model selection based on least squares can be seriously affected by one or a few outliers in the data. Thus, in robust regression, one only assumes $r_i$ to follow some distribution in an infinite-dimensional neighbourhood of the normal. An optimal representation of this neighbourhood is known to be the so-called least favorable distribution (cf. Hampel et al. (1986, section 7.4d) and Huber (1964)). When the least favorable distribution is used to describe the data, the length of the code constructed will be robust against a radical change in a small portion of the data or a small change in all of the data; hence the model selection procedure based on this robust code will also be robust. With this argument and other ideas underlying the two-step coding scheme, it has been shown that the stochastic complexity of $Y_n = (y_1, \ldots, y_n)^t$ relative to the regression model (1) can be well approximated by

$$
SC(Y_n \mid X_n) = \sum_{i=1}^{n} \rho_c\!\left\{\frac{w_i}{\sigma}(y_i - x_i^t\hat\beta)\right\} + \frac{p}{2}\ln E\rho_c'' + \frac{1}{2}\ln\left|X_n^t W_n^2 X_n\right| + \ln \prod_{j=1}^{p}\left(|\hat\beta_j| + n^{-1/4}\right) \qquad (2)
$$

plus terms irrelevant to model selection. The technical details of the derivation of equation (2) can be found in Qian and Künsch (1998b). In equation (2), $\rho_c(t) = \frac{1}{2}t^2$ for $|t| < c$ and $c|t| - \frac{1}{2}c^2$ for $|t| \ge c$ is the Huber function, used to prevent the model selection from being heavily affected by outliers in the data, and $\rho_c''(t) = 1$ for $|t| < c$ and $0$ for $|t| \ge c$. The constant c in the Huber function adjusts the degree of efficiency of the associated robust estimation procedure. The expectation $E\rho_c'' = (2\Phi(c) - 1)/(2\Phi(c) - 1 + 2c^{-1}\phi(c))$, where $\Phi$ and $\phi$ are respectively the cumulative distribution function and the density function of the standard normal, is obtained by taking the expectation with respect to the least favorable distribution of the error term in equation (1). In addition, $X_n = (x_1, \ldots, x_n)^t$ is an $n \times p$ design matrix, $W_n = \mathrm{diag}(w_1, \ldots, w_n)$ with $w_i = w(x_i) \in (0, 1]$ a weight function measuring the outlyingness of $x_i$, and $\sigma$ measures the scale of $w(x_i)r_i$. The M-estimator $\hat\beta = (\hat\beta_1, \ldots, \hat\beta_p)^t$ is defined by

$$
\hat\beta = \arg\min_{\beta} \sum_{i=1}^{n} \rho_c\!\left\{\frac{w_i}{\sigma}(y_i - x_i^t\beta)\right\}. \qquad (3)
$$
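These ingredients are straightforward to compute. The following S sketch (a minimal illustration of ours, not code from the msrob package; the names rho.c, psi.c and E.rho2 are our own) implements the Huber function, its derivative $\psi_c = \rho_c'$, and the expectation $E\rho_c''$ using the standard normal functions pnorm and dnorm:

    ## Huber function rho_c and its derivative psi_c (tuning constant c)
    rho.c <- function(t, c = 1.345)
      ifelse(abs(t) < c, t^2/2, c*abs(t) - c^2/2)
    psi.c <- function(t, c = 1.345)
      pmin(pmax(t, -c), c)
    ## E rho''_c under the least favorable distribution
    E.rho2 <- function(c = 1.345)
      (2*pnorm(c) - 1)/(2*pnorm(c) - 1 + 2*dnorm(c)/c)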

It can be shown that $\hat\beta$ is also the MLE relative to the least favorable distribution. Since the objective is to select an optimal model, the terms of the stochastic complexity that are irrelevant to this can be removed. Note that each term in (2) has a clear interpretation. The first term in (2) is the sum of the robustified fitting errors, which shows the goodness of the robust fit to the observations. It will decrease if additional explanatory variables are included in the model. This implies that the more explanatory variables are included in (1), the shorter the code length for encoding the data. But the stochastic complexity also depends on the other terms in (2), which represent the model complexity. The second term gives the cost of using a robust method; it is 0 if $c = +\infty$ and negative otherwise. Note that $c = +\infty$ corresponds to the least squares method, which is non-robust. Thus a robust method is preferred. The third term gives the weighted magnitude of the explanatory variables, and the last one the generalized signal-to-noise ratio. Therefore, the model complexity in (2) is much more comprehensive than that in many other criteria, e.g. AIC, BIC and Mallows' $C_p$, where it depends only on the dimension of the parameter. One can also see that the model complexity in (2) depends on the Fisher information $I_n(\beta) = \sigma^{-2}(E\rho_c'')\,X_n^t W_n^2 X_n$.

The expression (2) has to be modified to be invariant. Qian and Künsch (1998b) proposed the following modification:

$$
SC'(Y_n \mid X_n) = \sum_{i=1}^{n} \rho_c\!\left\{\frac{w_i}{\sigma}(y_i - x_i^t\hat\beta)\right\} + \frac{p}{2}\ln E\rho_c'' + \frac{1}{2}\ln\left|X_n^t W_n^2 X_n\right| + \ln \prod_{j=2}^{p}\left(|\hat\beta_j| + s_{x(j)}^{-1}\,\sigma\, n^{-1/4}\right), \qquad (4)
$$

where $s_{x(j)}^2 = (\sum_{i=1}^n w_i^2)^{-1} \sum_{i=1}^n w_i^2 (x_{ij} - \bar{x}_j)^2$ and $\bar{x}_j = (\sum_{i=1}^n w_i^2)^{-1} \sum_{i=1}^n w_i^2 x_{ij}$. The quantity $s_{x(j)}^2$ can be regarded as an estimate of the variance of the j-th component of x. Assuming that $x_{i1} \equiv 1$, i.e. the regression contains an intercept, that the p components of x are linearly independent, and that w(x) is invariant, it can be shown that $SC'(\cdot)$ is invariant under both scale and shift transformations of y and x.

Suppose that the regression model (1) is the full model under consideration. The set of all candidate models can be identified with $A = \{\alpha : \alpha \text{ any non-empty subset of } \{1, \ldots, p\}\}$, or with a subset of A. Each $\alpha$ in A corresponds to the sub-model of (1) containing those components of x indexed by $\alpha$, and vice versa. Based on (4), we propose the following model selection procedure:

1. For each candidate model $\alpha \in A$, compute $SC'(Y_n \mid X_{n\alpha})$, where $X_{n\alpha}$ consists of those columns of $X_n$ indexed by $\alpha$.

2. Select the model $\alpha^*$ which minimizes $SC'(Y_n \mid X_{n\alpha})$ among all candidate models in A.

By an asymptotic expansion of $SC'(Y_n \mid X_{n\alpha})$, it can be shown under some very general regularity conditions that the stochastic complexity (4) for a model that incorrectly describes the dependence between y and x exceeds that for a correct model by a term of order O(n) with probability 1, and that the stochastic complexity for a correct model exceeds that for the simplest correct model by a term of order O(log n) with probability 1. Therefore, the proposed procedure selects with probability 1 the simplest model among those in A which correctly describe the dependence between y and x. We refer to Qian and Künsch (1998b) for a rigorous proof of this result. In addition, it can be shown using section 6.3 of Hampel et al. (1986) that the above procedure is robust, with bounded influence against outliers in both y and x, provided that the weight function w(x) is properly chosen.
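To make the criterion concrete, here is a hypothetical S sketch of (4) as reproduced above (our own helper, not the msrob implementation; rho.c and E.rho2 come from the earlier sketch, and beta, sigma and w are assumed to have been computed as described in Section 3):

    ## Criterion (4) for the model given by the columns of X
    SC.prime <- function(X, y, w, beta, sigma, c = 1.345) {
      n <- nrow(X); p <- ncol(X)
      fit <- sum(rho.c((w/sigma) * drop(y - X %*% beta), c))
      logdet <- 0.5 * log(det(t(X * w^2) %*% X))     # (1/2) ln |X^t W^2 X|
      xbar <- colSums(X * w^2) / sum(w^2)
      s.x <- sqrt(colSums(sweep(X, 2, xbar)^2 * w^2) / sum(w^2))
      last <- if (p > 1)                             # product over j = 2..p
        sum(log(abs(beta[2:p]) + sigma * n^(-1/4) / s.x[2:p])) else 0
      fit + (p/2) * log(E.rho2(c)) + logdet + last
    }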

3 Computing the Stochastic Complexity

To compute the stochastic complexity (4), we must be able to compute $\hat\beta$ and $\sigma$. In addition, we need a procedure for choosing the weight function w(x) and the tuning parameter c.

Computing the M-estimator $\hat\beta$. From (3) it follows that $\hat\beta$ is the solution of

$$
\sum_{i=1}^{n} \frac{w_i}{\sigma}\,\psi_c\!\left\{\frac{w_i}{\sigma}(y_i - x_i^t\beta)\right\} x_i = 0, \qquad (5)
$$

where $\psi_c(t) = \rho_c'(t) = t$ for $|t| < c$ and $c\,\mathrm{sign}(t)$ for $|t| \ge c$. Define $u_i = w_i^2 v_i$ with $v_i = \psi_c\{\frac{w_i}{\sigma}(y_i - x_i^t\beta)\} / \{\frac{w_i}{\sigma}(y_i - x_i^t\beta)\}$. Equation (5) is equivalent to

$$
\frac{1}{\sigma^2} \sum_{i=1}^{n} u_i (y_i - x_i^t\beta)\, x_i = 0. \qquad (6)
$$

It follows from (6) that

$$
\hat\beta = \Big(\sum_{i=1}^{n} u_i x_i x_i^t\Big)^{-1} \Big(\sum_{i=1}^{n} u_i y_i x_i\Big). \qquad (7)
$$

So $\hat\beta$ can be computed with a recursive procedure provided that $\sigma$, the $w_i$'s and c are given. Namely, starting from an initial value of $\beta$, we compute the weights $u_i$, then compute a new value of $\beta$ from (7), and continue this process until the difference between two successive iterates is negligible. This procedure is referred to as the iteratively reweighted least squares (IRLS) method. By Huber (1981, section 7.8) it can be shown that the IRLS method used here converges provided that the design matrix $X_n$ has full rank.
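A minimal S sketch of this IRLS iteration (our own illustration under the above assumptions, with psi.c from the Section 2 sketch; the actual msrob function xrlm may be organized differently) is:

    ## IRLS for the M-estimator (3), given weights w, scale sigma and constant c
    irls.beta <- function(X, y, w, sigma, c = 1.345, tol = 1e-8, maxit = 50) {
      beta <- qr.coef(qr(X), y)                    # least-squares start
      for (it in 1:maxit) {
        s <- (w/sigma) * drop(y - X %*% beta)      # scaled weighted residuals
        v <- ifelse(s == 0, 1, psi.c(s, c)/s)      # v_i; equals 1 when |s_i| < c
        u <- w^2 * v                               # u_i = w_i^2 v_i
        beta.new <- solve(t(X * u) %*% X, t(X * u) %*% y)   # update (7)
        done <- max(abs(beta.new - beta)) < tol
        beta <- beta.new
        if (done) break
      }
      beta
    }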

Computing an estimator of $\sigma$. The scale parameter $\sigma$ is treated as a nuisance parameter in our selection procedure. It could be estimated differently for each candidate model considered, but this would entail encoding the parameter $\sigma$ and including its code length in the stochastic complexity (4), which is not the case in our approach. Thus we adopt the simpler strategy of estimating $\sigma$ from the full model and using the same estimate for all candidate models. This also ensures the desirable property that the accumulated robust fitting error, i.e. the first term of (4), decreases as additional explanatory variables are included in the model. Usually, a robust estimate of $\sigma$ can be obtained using essentially Huber's Proposal 2 (Huber 1981, p.137) or Hampel's median absolute deviation (Hampel 1974, p.388). Using the former method, $\hat\sigma$ is the solution of the equation

$$
\sum_{i=1}^{n} \psi_c^2\!\left\{\frac{w_i}{\hat\sigma}(y_i - x_i^t\hat\beta)\right\} = (n - p)\,\gamma(c), \qquad (8)
$$

where $\gamma(c) = 2\Phi(c) - 1 - 2c\phi(c) + 2c^2(1 - \Phi(c))$ is chosen so that the left-hand side of (8) has expectation equal to the right-hand side (with the degrees-of-freedom correction $n - p$ in place of n) when $w_i(y_i - x_i^t\beta) = w_i r_i$ has a $N(0, \sigma^2)$ distribution. Using the $v_i$'s defined above, equation (8) can again be solved by a convergent recursive method. When using Hampel's method, $\sigma$ is estimated by $1.4826 \times \mathrm{median}_i\, |w_i (y_i - x_i^t\hat\beta)|$.
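Both scale estimates are easy to code. The sketch below (again our own illustration, not the msrob code) implements γ(c), a fixed-point iteration for Proposal 2 of the assumed form σ² ← σ² Σψ_c²(e_i/σ)/((n−p)γ(c)), and Hampel's MAD-type estimate:

    ## gamma(c) of equation (8)
    gamma.c <- function(c = 1.345)
      2*pnorm(c) - 1 - 2*c*dnorm(c) + 2*c^2*(1 - pnorm(c))

    ## Huber's Proposal 2 for sigma, from residuals e_i = w_i (y_i - x_i^t beta)
    sigma.prop2 <- function(e, p, c = 1.345, tol = 1e-8, maxit = 50) {
      n <- length(e)
      sigma <- 1.4826 * median(abs(e))             # MAD start
      for (it in 1:maxit) {
        sigma.new <- sigma * sqrt(sum(psi.c(e/sigma, c)^2) / ((n - p)*gamma.c(c)))
        done <- abs(sigma.new - sigma) < tol * sigma
        sigma <- sigma.new
        if (done) break
      }
      sigma
    }

    ## Hampel's estimate: 1.4826 * median_i |w_i (y_i - x_i^t beta)|
    sigma.mad <- function(e) 1.4826 * median(abs(e))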

Choosing the weight function w(·). Ideally, w(x) should be determined by a model which correctly describes the dependence between y and x. But whether a model is correct or not is unknown before the model selection is carried out. In addition, the penalty for using a wrong model to determine w(x) does not appear in the criterion (4). For these reasons, we suggest that w(x) be determined from the full model. Based on the full model, we propose

$$
w(x) = w_b(x^t B x), \qquad (9)
$$

where $w_b(t) = \min(1, \sqrt{b/t})$ with b chosen a priori (e.g. b = p), and B is a positive definite matrix determined by

$$
\frac{2\Phi(c) - 1}{2\Phi(c) - 1 + 2c^{-1}\phi(c)} \cdot \frac{1}{n} \sum_{i=1}^{n} w_b(x_i^t B x_i)^2\, x_i x_i^t = B^{-1}. \qquad (10)
$$

With (9) and (10), the M-estimator $\hat\beta$ possesses a robustness property called bounded self-standardized sensitivity. The form (9), often used for weight functions in robust statistics, implies that the influence of x is down-weighted whenever $x^t B x$ is larger than the given value b. Clearly, the matrix B can be computed with a recursive procedure once b and c are fixed. But this procedure may not converge, since the solution B of (10) may not exist or may not be unique. Empirical study shows that the procedure converges if b is large enough, but all $w_i$'s equal 1 if b is too large. Further investigation of this problem is needed.
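A simple fixed-point iteration for (10), sketched below in S (our own illustration; as just noted, convergence is not guaranteed), starts from B = I and alternates between computing the weights and re-solving for B:

    ## Weight function w_b of (9)
    w.b <- function(t, b) pmin(1, sqrt(b/t))

    ## Fixed-point iteration for B in (10); returns the weights w(x_i)
    weights.x <- function(X, c = 1.345, b = ncol(X), tol = 1e-6, maxit = 100) {
      kappa <- (2*pnorm(c) - 1)/(2*pnorm(c) - 1 + 2*dnorm(c)/c)
      n <- nrow(X)
      B <- diag(ncol(X))
      for (it in 1:maxit) {
        q <- rowSums((X %*% B) * X)               # q_i = x_i^t B x_i
        u <- w.b(q, b)^2
        B.new <- solve(kappa * t(X * u) %*% X / n)    # invert the left side of (10)
        if (max(abs(B.new - B)) < tol) break
        B <- B.new
      }
      w.b(rowSums((X %*% B) * X), b)
    }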


Choosing the tuning parameter c. The smaller the parameter c, the more robust the model selection procedure, but at the same time the less efficient it is. We choose the well-known value c = 1.345, so that $\hat\beta$ has efficiency 0.95 when $r_i$ follows a normal distribution. See Huber (1981, p.91) and Hampel et al. (1986, p.399) for details.
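The value 0.95 can be checked numerically: the asymptotic efficiency of the Huber M-estimator at the normal model is $(E\psi_c')^2 / E\psi_c^2 = (2\Phi(c)-1)^2/\gamma(c)$ (a standard fact, included here as our own check), so in S:

    ## Asymptotic efficiency of the Huber estimator at the normal model
    eff <- function(c) (2*pnorm(c) - 1)^2 / gamma.c(c)
    eff(1.345)    # approximately 0.95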

4 Software for implementing the stochastic complexity criterion

The S language (Becker, Chambers and Wilks, 1988) provides a very flexible environment for analyzing data. We have written a package of S functions, called msrob, for robust regression model selection using the stochastic complexity criterion and some other related criteria. There are two key functions in this package: xrlm.select and xrlm. The function xrlm.select selects the optimal regression model by one of the following four criteria: stochastic complexity, Ronchetti's robust AIC (Ronchetti, 1985), Hampel's robust AIC (Hampel, 1983) and the robust BIC (Machado, 1993). The function xrlm fits a robust regression model according to (3). The package msrob can be obtained free of charge from the World Wide Web address http://lib.stat.cmu.edu/S/msrob or by sending an e-mail message containing the text "send msrob from S" to [email protected].
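For readers who prefer a self-contained picture of the whole procedure, the selection loop of Section 2 can be sketched using the helper functions introduced above (our own illustration, not the xrlm.select interface; in particular, alternating a few times between beta and sigma on the full model is an assumption of this sketch):

    ## All-subsets selection by criterion (4); column 1 of X is the intercept
    select.sc <- function(X, y, c = 1.345) {
      p <- ncol(X)
      w <- weights.x(X, c)                       # weights from the full model
      sigma <- 1.4826 * median(abs(y - median(y)))   # rough starting scale
      for (k in 1:5) {                           # crude joint iteration
        beta  <- irls.beta(X, y, w, sigma, c)
        sigma <- sigma.prop2(w * drop(y - X %*% beta), p, c)
      }
      grid <- as.matrix(expand.grid(rep(list(c(FALSE, TRUE)), p - 1)))
      best <- NULL; best.sc <- Inf
      for (i in 1:nrow(grid)) {
        cols <- c(1, 1 + which(grid[i, ]))       # candidate sub-model
        Xa   <- X[, cols, drop = FALSE]
        ba   <- irls.beta(Xa, y, w, sigma, c)
        sc   <- SC.prime(Xa, y, w, ba, sigma, c)
        if (sc < best.sc) { best.sc <- sc; best <- cols }
      }
      list(columns = best, criterion = best.sc)
    }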

5 Simulation and Example

Simulation results. We carried out a simulation study to evaluate the robustness performance of our stochastic complexity criterion. For comparison, results for three other criteria were also obtained: the two versions of the robust AIC given by Ronchetti (1985) and Hampel (1983), and the robust BIC of Machado (1993). In the study we considered

$$
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \beta_6 X_6 + r
$$

as the full model, so there were in total $2^6 = 64$ possible sub-models with an intercept term. The sample size n was chosen to be 30. The six explanatory variables $X_1$ to $X_6$ were generated independently and uniformly on [0, 1], except that the first observation of each $X_i$ was 3 and the second was 5. Thus the first two sample points were leverage points, and they had large influence on the regression procedure. Six distributions for the error r were chosen to represent various deviations from normality: the standard normal N(0,1); Student's t with 3 degrees of freedom ($t_{(3)}$); the Cauchy ($t_{(1)}$); the log-normal with mean 0 and scale 1, which is asymmetric; the slash, i.e. a standard normal divided by a uniform on [0, 1]; and the contaminated ε-normal 0.9N(0,1) + 0.1N(0,3). The observations of Y were obtained from

$$
Y = 1 + 2.5X_1 + 3X_2 - 3X_3 + r \qquad (11)
$$

with r generated from one of the six error distributions. The coefficient values were selected so that they would give t-values of about 4 if r were normally distributed. Clearly, model (11) is the true model, but other models containing $X_1$, $X_2$ and $X_3$ are also correct models. We carried out 200 simulation runs. Table 1 gives the frequencies with which each of the four criteria selected the three types of models: true, other correct and incorrect.

Table 1: Frequencies of Different Models Being Selected in 200 Simulations

                                      Error Distribution
  Model Category        N(0,1)   t(3)   Cauchy   Log-N(0,1)   Slash   ε-N
  Stochastic Complexity Criterion
    True                  143     117      30        129          7    135
    Other correct          55      49       9         44          5     50
    Incorrect               2      34     161         27        188     15
  Ronchetti's Robust AIC
    True                  118     111      42        122         20    122
    Other correct          81      72      17         64          8     70
    Incorrect               1      17     141         14        172      8
  Hampel's Robust AIC
    True                  126     116      40        127         17    129
    Other correct          72      64      16         55          7     63
    Incorrect               2      20     144         18        176      8
  Machado's Robust BIC
    True                  168     134      28        151          6    156
    Other correct          27      19       7         15          2     27
    Incorrect               5      47     165         34        192     17

From Table 1 we see that all four criteria perform quite well even when the error distribution deviates considerably from the normal (i.e. $t_{(3)}$, log-normal and ε-normal): the relative frequencies of selecting the true model lie between 55.5% and 78% for these three error distributions (compared with 59% to 84% for the normal error), and the frequencies of selecting incorrect models lie between 4% and 23.5%. But when the error distribution is Cauchy or slash, none of the criteria works well in selecting the correct models. This is probably because the Cauchy and the slash deviate so much from the normal that their population expectations do not exist. A more robust and efficient procedure would be required for this situation.

Comparing the four criteria with each other, we see that the AIC methods usually have lower frequencies of selecting the true model, but higher frequencies of selecting other, superfluous correct models, than the other two criteria. The stochastic complexity criterion may have slightly lower frequencies of selecting the true model than the BIC method, but it also has lower frequencies of selecting incorrect models, and so performs more stably. Since in practice one generally does not know which candidate model is exactly the true model, reducing the chance of selecting an incorrect model is as important as enhancing the chance of selecting the true one. From this point of view we prefer the stochastic complexity criterion to the robust BIC. Indeed, a further simulation study of ours reveals that the stochastic complexity method performs more stably than the robust BIC, especially when the β values in the true model have more moderate t-values than those mentioned above, namely when the signal-to-noise ratio is weaker.
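For concreteness, the error distributions and the design with leverage points can be generated along the following lines in S (a sketch; the paper does not record random seeds, and we read "0.1N(0,3)" as a normal with standard deviation 3, which is an assumption):

    ## Simulated design: 6 uniform covariates with two leverage points
    n <- 30
    X <- cbind(1, matrix(runif(n * 6), n, 6))
    X[1, -1] <- 3; X[2, -1] <- 5                 # leverage points
    ## Some of the error distributions
    r.t3    <- rt(n, 3)
    r.slash <- rnorm(n) / runif(n)               # slash
    r.eps   <- ifelse(runif(n) < 0.1, rnorm(n, 0, 3), rnorm(n))  # ε-normal
    ## Response from the true model (11); X1 is column 2 of X
    y <- 1 + 2.5*X[, 2] + 3*X[, 3] - 3*X[, 4] + r.t3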

An actual example. To illustrate the application of our proposed method to practical problems, we present a real data example arising in a physiological study of triathlon athletes. The data used in this example were taken from Kohrt et al. (1987), who studied the performance of a group of 65 male athletes in a half-triathlon event over a 6-week period. The data can also be found in Glantz and Slinker (1990, pp.647-648). There are 10 variables in the data: half-triathlon performance time (t, min.), age (A, years), weight (W, kg), years of triathlon experience (E, years), amount of training in running ($T_R$, km/week), biking ($T_B$, km/week) and swimming ($T_S$, km/week), and maximum oxygen consumption while running ($V_R$, mL/min/kg), biking ($V_B$, mL/min/kg) and swimming ($V_S$, mL/min/kg). These 10 variables represent the athletes' half-triathlon performance, gross physical characteristics, training, and exercise capacity. The objective of the study is to see which variables best determine the athletes' final time when they compete in the triathlon. This was addressed by conducting a variable selection on the full regression model

$$
t = \beta_0 + \beta_1 A + \beta_2 W + \beta_3 E + \beta_4 T_R + \beta_5 T_B + \beta_6 T_S + \beta_7 V_R + \beta_8 V_B + \beta_9 V_S + r. \qquad (12)
$$

We applied the stochastic complexity criterion, as well as Ronchetti's and Hampel's robust AIC and Machado's robust BIC, to the variable selection. There were in total $2^9 = 512$ sub-models for selection, considering only those including an intercept term. Table 2 lists the 8 best sub-models selected from these 512 models by each of the four criteria; within each set, the 8 models are displayed in ascending order of the associated criterion values.

Table 2: Eight Best Models Selected by Each Criterion in the Example

  Stochastic Complexity              Ronchetti's Robust AIC
  A + E + TR + TB + VR               A + E + TR + TB + VR
  A + E + TR + TB + VR + VB          A + E + TR + TB + VR + VB
  A + E + TB + VR                    A + E + TR + TB + TS + VR
  A + E + TR + TB + TS + VR          A + E + TR + TB + VR + VS
  A + E + TR + TB + VR + VS          A + E + TR + TB + TS + VR + VB
  A + E + TS + VR + VB               A + W + E + TR + TB + VR
  A + W + E + TR + TB + VR           A + E + TR + TB + VR + VB + VS
  A + E + TB + VR + VB               A + W + E + TR + TB + VR + VB

  Hampel's Robust AIC                Machado's Robust BIC
  A + E + TR + TB + VR               A + E + TR + TB + VR
  A + E + TR + TB + VR + VB          A + E + TB + VR
  A + E + TR + TB + TS + VR          A + E + TR + TB + VR + VB
  A + E + TR + TB + VR + VS          A + E + TR + TB + TS + VR
  A + W + E + TR + TB + VR           A + E + TR + TB + VR + VS
  A + E + TR + TB + TS + VR + VB     A + W + E + TR + TB + VR
  A + E + TR + TB + VR + VB + VS     E + TS + VR + VB
  A + W + E + TR + TB + VR + VB      A + E + TS + VR + VB

From Table 2 we see that all the criteria selected the same best model, which includes the five explanatory variables A, E, $T_R$, $T_B$ and $V_R$. These five variables are also included in most of the other 28 models, whereas each of the other four explanatory variables appears only a small number of times. This conclusion agrees with that of Glantz and Slinker (1990, pp. 256-261), who used Mallows' $C_p$ criterion. From Table 2 we can also see that the robust AIC methods tend to select more complicated models, while the robust BIC tends in the opposite direction. The stochastic complexity method gives an improvement over the robust BIC.

References

1. Baxter, R.A. and Dowe, D.L. (1996). Model selection in linear regression using the MML criterion. Technical Report 96/276, Dept. of Computer Science, Monash Univ., Melbourne, Australia.
2. Becker, R., Chambers, J.M. and Wilks, A. (1988). The New S Language. Wadsworth, Belmont, CA.
3. Dom, B.E. (1996). MDL estimation for small sample sizes and its application to linear regression. IBM Research Report RJ-10030, June 1996.
4. Glantz, S.A. and Slinker, B.K. (1990). Primer of Applied Regression and Analysis of Variance. McGraw-Hill, New York.
5. Hampel, F.R. (1974). The influence curve and its role in robust estimation. J. Am. Statist. Assoc. 69, 383-393.
6. Hampel, F.R. (1983). Some aspects of model choice in robust statistics. Proceedings of the 44th Session of the ISI, Book 2, Madrid, 767-771.
7. Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J. and Stahel, W.A. (1986). Robust Statistics: The Approach Based on Influence Functions. Wiley, New York.
8. Huber, P.J. (1964). Robust estimation of a location parameter. Ann. Math. Stat. 35, 73-101.
9. Huber, P.J. (1981). Robust Statistics. Wiley, New York.
10. Kohrt, W.M., Morgan, D.W., Bates, B. and Skinner, J.S. (1987). Physiological responses of triathletes to maximal swimming, cycling, and running. Med. Sci. Sports Exerc. 19, 51-55.
11. Machado, J.A.F. (1993). Robust model selection and M-estimation. Econometric Theory 9, 478-493.
12. Qian, G. and Künsch, H. (1998a). Some notes on Rissanen's stochastic complexity. IEEE Trans. Inform. Theory 44, 782-786.
13. Qian, G. and Künsch, H. (1998b). On model selection in robust linear regression. J. Stat. Plan. & Infer., in press.
14. Rissanen, J. (1986). Stochastic complexity and modeling. Annals of Statistics 14, 1080-1100.
15. Rissanen, J. (1987). Stochastic complexity. J. Roy. Statist. Soc. B 49, 223-239 and 252-265 (discussion).
16. Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Trans. Inform. Theory 42, 40-47.
17. Ronchetti, E. (1985). Robust model selection in regression. Stat. Prob. Lett. 3, 21-23.
18. Solomonoff, R.J. (1964). A formal theory of inductive inference I, II. Information and Control 7, 1-22 and 224-254.
19. Wallace, C.S. and Freeman, P.R. (1987). Estimation and inference by compact coding. J. Roy. Statist. Soc. B 49, 240-251 and 252-265 (discussion).