Pruning with Minimum Description Length

Jon Sporring
E-mail: [email protected]
Department of Computer Science, University of Copenhagen
Universitetsparken 1, Copenhagen East, Denmark
Phone: +45 35321454
June 29, 1995

Abstract

The number of parameters in a model and its ability to generalize to the underlying data-generating machinery are tightly coupled entities. Neural networks usually consist of a large number of parameters, and pruning (the process of setting single parameters to zero) has been used to reduce a net's complexity in order to increase its generalization ability. Another, less obvious, approach is to use Minimum Description Length (MDL) to increase generalization. MDL is the only model selection criterion giving a uniform treatment of a) the complexity of the model and b) how well the model fits a specific data set. This article investigates pruning based on MDL, and it is shown that the derived algorithm results in a scheme identical to the well-known Optimal Brain Damage pruning. Furthermore, an example is given on a well-known benchmark data set, yielding fine results.

1 Introduction

This paper is about modelling data. Models are vital in interpreting data. Consider a typical statistical experiment: examination of repeated flips of a coin. A first step in an analysis could be to calculate the frequency of heads in the data set. To describe and understand this experiment, the data is typically modelled by probability theory. This is in fact the real reason for doing the experiment: to understand the behavior of the coin by means of the model.

In the example above, most would agree that the concept of probability is a good framework to represent the stochastic process. But in more complex experiments, such as measuring the position of a ball in a gravitational field, it is more convenient to utilize more complex models with more than one parameter in order to describe and understand the process. The fundamental problems are therefore:

- How does one choose the framework of the model, i.e. the model class, to describe an experiment?
- How does one decide which model, within the chosen framework, describes the experimental data best?

The choice of model class, and thereby the structure of the description, is commonly dominated by opinions of suitability, or even guess-work. In no way can it be avoided that seeing is in the eye of the beholder. This paper will investigate the class of feed forward neural networks. They are parallel in nature and are therefore fast to apply, but they have been shown to be theoretically deceptive, and as a consequence the field is dominated by `ad hoc' techniques.

Appeared in: A. Aamodt and J. Komorowski, editors, SCAI-95: Fifth Scandinavian Conference on Artificial Intelligence, pages 157-168, IOS Press, Amsterdam, 1995.


Figure 1: An example of a well-fitted problem with poor generalization.

The choice of model from a class need, on the other hand, not be a subjective matter. The chosen model should be able to extract the important information in a given data set in order to minimize the mistakes on future data points. This is the concept of generalization. Figure 1 illustrates that choosing the right model for the data is not a trivial task. The graph shows three functions: a data generator (the slowly varying curve), noisy samples of it (the circles), and a neural network fitted to the samples (the faster varying curve). What can be seen is that the model fits the samples perfectly, but it is far from the data generator, i.e. the model generalizes poorly. The remedy for this is either to choose a better model, e.g. one with fewer parameters, or to increase the number of samples.

The concept of generalization and that of statistics are closely interconnected, and much work has already been expended on generalization in a statistical perspective. Especially the MDL principle deserves attention. MDL defines the choice of the `best' model from a class as the one minimizing the coding cost of the model parameters and of the deviation of the model from the data set. This is the only model selection criterion giving a uniform treatment of the complexity of the model and its deviation from a specific data set. This paper will use MDL as a pruning criterion, and an algorithm will be derived showing behavior identical to the Optimal Brain Damage pruning algorithm derived by Le Cun et al. [1].

2 Minimum Description Length

When modelling a data set it is almost never a very good idea to choose one of the two extremes: a) neither a model so complex that each data point can be fitted exactly, nor b) a model so simple that even the slightest regularity is considered pure chance, i.e. noise. But the problem remains: how should the complexity of the model be compared to the amount of noise given the model? MDL is the only principle (known to this author) to answer this: choose the model minimizing the code length of both the model and the data. The main point is that the complexity of the model and the amount of noise are measured as the number of bits needed to encode or compress them. This is an objective, information-based criterion, implying that the model class is fixed! Therefore the real problem becomes the subjective choice of a proper model class. In contrast to validation-like methods, MDL obtains the most reasonable model choice based on the whole data set, and thereby gives the most reasonable generalization of the underlying data generator available from the data set.

The intuitively simplest implementation of the MDL principle is Two-Part Coding, by which the model parameters are transmitted as a preamble to the deviations, i.e. the code length is given by

$$L(y, \theta) = L(y|\theta) + L(\theta) \qquad (1)$$

where $L(\theta)$ is the code length of the parameters $\theta$ specifying the model, and where $L(y|\theta)$ is the code length of the deviation of the data $y$ from the model. Unfortunately this implementation overestimates the coding cost, because several data sets will fit a specific set of parameters, giving an inherent redundancy [8]. Alternative implementations are Predictive and Mixture Coding [8]. In this paper Two-Part Coding will be used. Shannon has shown that the mean information per symbol in a code is given by the entropy function [6], and the optimal prefix code length for a symbol is thereby given as,

$$L(a) = -\log_2 p(a) \qquad (2)$$

where $a \in A$ and $(A, p)$ is a discrete probability field. This implies that explicit code lengths induce distributions and vice versa. It is therefore easy to see that the Bayesian Maximum A Posteriori (MAP) method is equivalent to the Two-Part Minimum Description Length method. Rewriting equation 1 in terms of distributions,

$$L(y, \theta) = -\log_2 P(y|\theta) - \log_2 P(\theta) \qquad (3)$$

implies that the distributions are defined on discrete domains. While the deviations in practice are always given to some fixed precision, the parameters are usually continuously defined. In order to estimate the coding length of the parameters, a discretization must be performed,

$$P(\hat{\theta}) = \int_{\hat{\theta} - \frac{1}{2}\delta}^{\hat{\theta} + \frac{1}{2}\delta} P(\theta)\, d\theta \simeq P(\hat{\theta}) \prod_i \delta_i \qquad (4)$$

This is equivalent to truncating the parameters to some precision. An optimal precision vector $\delta$, in terms of the minimum of equation 3, can be found numerically. Equation 3 can, for large data sets and using this optimal precision, be written as,

$$L(y, \theta) = -\log_2 P(y|\hat{\theta}) - \log_2 P(\hat{\theta}) + \frac{k}{2}\log_2 n + O(k) \qquad (5)$$

where $\hat{\theta}$ is $\theta$ truncated to optimal precision, $k$ is the number of parameters, and $n$ is the cardinality of the data set. See Rissanen [8] for details. The introduction of this precision on the model parameters is of great advantage: it opens for an analysis of the `importance' of each parameter. In view of generalization, the parameters should never be given with higher precision than necessary, in order not to introduce non-justified behavior of the model.
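As a small illustration of this truncation step (not code from the paper), the sketch below rounds a parameter vector to the precision $\delta = 2^{-q}$ used when coding it; the function name and the example values are chosen here purely for illustration.

```python
import numpy as np

def truncate(theta, q):
    """Truncate parameters to precision delta = 2**-q, as used when coding them."""
    delta = 2.0 ** -q
    return np.round(np.asarray(theta, dtype=float) / delta) * delta

# Example: at q = 4 (precision 1/16) the value 0.20 is coded as 0.1875.
print(truncate([0.20, -1.43], 4))   # -> [ 0.1875 -1.4375]
```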

2.1 A parameter distribution

When nothing is known, Rissanen's universal distribution of integers is often used. It is based on Elias' coding scheme [8], and the code length is given as,

$$L(n-1) = \log_2 c + \log_2 n + \log_2 \log_2 n + \dots \qquad (6)$$

where $c \simeq 2.865$ is a normalizing constant and all positive terms of successive $\log_2$'s are included. The lower bound $L(n) = \log_2 c + \log_2 n$ is similar to Jeffrey's prior of positive reals [4]: $p(\theta) = \frac{1}{\theta}$. Even though this is not a true distribution, it incorporates scale invariance, i.e. a parameter in the interval $[1, 10]$ is just as probable as one in the interval $[10, 100]$. In terms of codes, if the code length is kept constant, then the ratio between precision and value, $\frac{\delta_y}{y}$, is also constant.
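As a hedged sketch (not from the paper), the code length of equation 6 can be computed by summing the iterated logarithms while they stay positive; the stopping rule and the function name are assumptions made here.

```python
import math

def universal_code_length(n):
    """Rissanen's universal code length for a positive integer n (in bits), cf. equation (6)."""
    bits = math.log2(2.865)        # the normalizing constant c
    term = math.log2(n)
    while term > 0:                # include all positive terms: log2 n, log2 log2 n, ...
        bits += term
        term = math.log2(term)
    return bits

print(universal_code_length(1000))   # roughly 17.3 bits
```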


2.2 A distribution for the deviations

The most novel assumption on the distribution of the deviations is the normal distribution [5]. With zero mean and a diagonal covariance matrix this is given by,

$$\Pr(y|\theta, \sigma) = \prod_{t=1}^{T} \frac{\delta_y}{\sigma\sqrt{2\pi}} \exp\!\left(\frac{-[y_t - f(\theta, t)]^2}{2\sigma^2}\right) = \left(\frac{\delta_y}{\sigma\sqrt{2\pi}}\right)^{T} \exp\!\left(\sum_{t=1}^{T} \frac{-[y_t - f(\theta, t)]^2}{2\sigma^2}\right) \qquad (7)$$

where $\sigma$ is the standard deviation of the deviations and $\delta_y$ is the implicit precision in each point $y_t$. Again, this is a discretization of a continuous probability function, but let it be absolutely clear that the coding length is not to be minimized with respect to this precision parameter. MDL is lossless coding; otherwise the minimum point would be the one where all data is lost and nothing is sent. The $\delta_y$'s are therefore constant during the minimization and can be ignored. Hence the resulting code length is relative,

$$L(y|\theta, \sigma) = \frac{T \ln(\sigma\sqrt{2\pi})}{\ln 2} + \sum_{t=1}^{T} \frac{[y_t - f(\theta, t)]^2}{2\sigma^2 \ln 2} \qquad (8)$$
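The relative code length of equation 8 translates directly into a few lines of code. The sketch below is an illustration only (not code from the paper); the argument names are assumptions.

```python
import numpy as np

def deviation_code_length(y, f_vals, sigma):
    """Relative code length L(y | theta, sigma) in bits, cf. equation (8).

    y      : the observed data points y_1..y_T
    f_vals : the model values f(theta, t) at the same points
    sigma  : the assumed standard deviation of the deviations
    """
    y, f_vals = np.asarray(y, dtype=float), np.asarray(f_vals, dtype=float)
    T = len(y)
    return (T * np.log(sigma * np.sqrt(2.0 * np.pi))
            + np.sum((y - f_vals) ** 2) / (2.0 * sigma ** 2)) / np.log(2.0)
```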

3 The Feed Forward Neural Network model

This paper will investigate models of the feed forward neural network class. Feed forward neural networks (networks) are highly parallel in nature and easy to implement in hardware. They show surprising versatility in practice, but a closer theoretical understanding is only just emerging. They have been shown to be universal, in the sense that they can approximate any given function to any given precision [3]. A thorough introduction to these and other types of networks can be found in the literature [2]. Specifically, this paper will concentrate on the following function as a model for data:

$$f(x) = v_0 + V \cdot g(w_0 + W \cdot x) \qquad (9)$$

where $v_0$, $w_0$, and $x$ are vectors, $V$ and $W$ are matrices, and $g$ is a non-linear vector function such that $y_i = g_i(x)$. In this paper $g_i = g = \tan^{-1}$, but other sigmoid functions can be used. The structure of a neural network is often described with a graph with a special associated nomenclature: the nodes are called neurons or units. They are organized in layers, starting with the input layer $x$, from which the input is propagated through the hidden layer $g$ to the output layer $f$. The number of nodes in the hidden layer, corresponding to the number of rows in the $W$ matrix, is also called the degree of the network.

Fitting the parameters of a neural network to a set of data points $D = \{(x, y)\}$, the training set, is often called `learning', even though standard optimization techniques are used. Specifically, fitting will here be the usual minimization of the sum of $l_2$ distances between the net and the data set,

$$c(\theta; D) = \frac{1}{2}\sum_t \|y_t - f(x_t)\|_2^2 = \frac{1}{2}\sum_t \sum_j (y_{t,j} - f_j(x_t))^2 \qquad (10)$$

where $\theta$ is the collection of parameters for the net. This procedure can be divided into a linear and a non-linear part. The linear part, i.e. the parameters outside the non-linear function, $V$, can be calculated precisely and directly using e.g. the Singular Value Decomposition method [7]. The non-linear part is fitted using a gradient descent method. Depending on the data set, this procedure almost never finds the global minimum, but the local minima do not seem to be a grave problem since their behavior seems similar [9]; in order not to favor specific `basins of attraction', however, the algorithm should be started at a random point in weight space. This is therefore a stochastic fitting procedure! Other minimization procedures have been investigated in the literature, e.g. Simulated Annealing [7].
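To make the model of equation 9 and the cost of equation 10 concrete, here is a minimal NumPy sketch (not from the paper); the parameter shapes and names are assumed, and the tiny example at the end is purely illustrative.

```python
import numpy as np

def forward(x, w0, W, v0, V):
    """Feed forward model of equation (9): f(x) = v0 + V.g(w0 + W.x), with g = arctan."""
    return v0 + V @ np.arctan(w0 + W @ x)

def cost(predict, data):
    """Quadratic cost of equation (10) over a training set of (x, y) pairs."""
    return 0.5 * sum(np.sum((y - predict(x)) ** 2) for x, y in data)

# Example: a 1-2-1 network (degree 2) evaluated on a toy training set.
rng = np.random.default_rng(0)
w0, W = rng.normal(size=2), rng.normal(size=(2, 1))
v0, V = rng.normal(size=1), rng.normal(size=(1, 2))
data = [(np.array([x]), np.array([np.sin(x)])) for x in np.linspace(-1.0, 1.0, 11)]
print(cost(lambda x: forward(x, w0, W, v0, V), data))
```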


Figure 2: Example of MDL using neural nets with 1 input, 8 hidden, and 1 output nodes as model. The data-generating polynomial is shown as a dashed curve, the noisy samples as dots, and the fitted network as a solid line. In (a) there are 11 points, in (b) 101 points, and in (c) 1001 points.

Points    q    Mean MDL
11        1     2.489
101       4    -0.6851
1001     12    -1.168

Table 1: The numerical results of applying MDL with neural nets to the example. q is the precision parameter. Note that a negative Mean MDL signifies a relative reduction in code length.

3.1 Examples

To demonstrate the use of MDL, a simple experiment has been conducted. The analysis concerns how well networks perform with various sizes of data sets from a fixed data generator, the polynomial:

$$p(t) = 0.5262 t^3 - 0.1845 t^2 + 0.1988 t + 1.59 \qquad (11)$$

This is sampled at the different sampling rates $\frac{2}{10}$, $\frac{2}{100}$, and $\frac{2}{1000}$ in the interval $[-1, 1]$, giving 11, 101, and 1001 points. Normally distributed noise $N(0, 0.1)$ is added to each sample; a code sketch of this sampling setup is given at the end of this subsection. Only one precision parameter has been used, i.e. all parameters are coded to equal precision from the interval $[2^{-1}, 2^{-15}]$. The class of neural networks searched was that with 1 input, 1 output, and hidden units ranging from 1 to 8. The results are shown as graphs in figure 2 and as equations in table 1. There are two conclusions to be drawn from these experiments:

- When the training set grows, the relative cost of coding the parameters is reduced, i.e. the Mean MDL converges towards the complexity of the noise.
- Networks can be used to model polynomials, but it is obvious from the above experiments that they are not a good model of noisy polynomials.

In general, the data generator is almost never known in detail, so the choice of model class will always be of a subjective nature.
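A sketch of the sampling setup (assuming the 0.1 in $N(0, 0.1)$ denotes the standard deviation of the noise; the function name is chosen here) could look as follows.

```python
import numpy as np

def sample_polynomial(n_points, noise_std=0.1, seed=0):
    """Noisy samples of the data generator p(t) of equation (11) on the interval [-1, 1]."""
    rng = np.random.default_rng(seed)
    t = np.linspace(-1.0, 1.0, n_points)
    p = 0.5262 * t**3 - 0.1845 * t**2 + 0.1988 * t + 1.59
    return t, p + rng.normal(0.0, noise_std, size=n_points)

t, y = sample_polynomial(101)   # the 101-point data set of Figure 2(b)
```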

4 Pruning Neural Networks

Pruning a neural network is the process of explicitly setting specific weights to zero, thereby fine-tuning the network's degree. For networks with input or output dimension 1, pruning the corresponding layer coincides with choosing the number of internal nodes, but for data of higher order pruning works as a fine-tuning for a fixed degree, or even as a choice of degree. I.e. pruning can be viewed either as a richer scheme or as a fine-tuning of a previously chosen degree.

4.1 Optimal Brain Damage

An estimate of the increase in the cost function, when removing connections (parameters) from the model, can be obtained by examining the second order derivatives of the cost function (equation 10) used in the training scheme. This is the Optimal Brain Damage method suggested by Le Cun et al. [1], and it will be derived below. Approximating the cost function as a Taylor series up to second order gives,

$$c(\theta + \Delta\theta) \simeq c(\theta) + \nabla c(\theta) \cdot \Delta\theta + \frac{1}{2}\Delta\theta^T \cdot H_c(\theta) \cdot \Delta\theta \qquad (12)$$

where $\nabla c$ and $H_c$ are the gradient and the Hessian matrix of the cost function. Assume a single step pruning scheme, i.e.

Assumption 4.1 Assume that only a single parameter is removed in each pruning step.

This implies that the change vector can be written as $\Delta\theta = [0, 0, \dots, \Delta\theta_i, \dots, 0, 0]^T$, leaving only a single element in the gradient vector and the Hessian matrix terms,

$$\Delta c(\theta, \Delta\theta) \equiv c(\theta + \Delta\theta) - c(\theta) \simeq \Delta\theta_i \frac{\partial c}{\partial \theta_i}(\theta) + \frac{1}{2}\Delta\theta_i^2 \frac{\partial^2 c}{\partial \theta_i^2}(\theta) \qquad (13)$$

Assumption 4.2 Assume that the function is well fitted to the data, i.e. $[y_{t,j} - f_j(x_t)] \simeq 0$.

This implies that the gradient vector term is negligible,

$$\frac{\partial c}{\partial \theta_i} = -\sum_t \sum_j [y_{t,j} - f_j(x_t)] \frac{\partial}{\partial \theta_i} f_j(x_t) \simeq 0 \qquad (14)$$

and that the diagonal elements of the Hessian matrix can be approximated as,

$$\frac{\partial^2 c}{\partial \theta_i^2} = \sum_t \sum_j \left( \left(\frac{\partial}{\partial \theta_i} f_j(x_t)\right)^2 - [y_{t,j} - f_j(x_t)] \frac{\partial^2}{\partial \theta_i^2} f_j(x_t) \right) \simeq \sum_t \sum_j \left(\frac{\partial}{\partial \theta_i} f_j(x_t)\right)^2 \qquad (15)$$

leaving only,

$$\Delta c(\theta, \Delta\theta) \simeq \frac{\Delta\theta_i^2}{2} \sum_t \sum_j \left(\frac{\partial}{\partial \theta_i} f_j(x_t)\right)^2 \qquad (16)$$

This is then the pruning step: remove the connection causing the smallest absolute change as indicated by the second order derivatives of the cost function (equation 16). But when should the iterations be stopped? The original article [1] suggests MDL as a stopping criterion, but others have used a simple threshold, i.e. to stop pruning when the absolute value of the smallest second order derivative is above some predefined value. The implicit assumption is then that the MDL minimum coincides with the chosen threshold value, which is very unlikely.
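As an illustration of the OBD step (not code from the original papers), the sketch below estimates the saliency of equation 16 for a single-output model using central-difference derivatives and prunes the weight with the smallest saliency; all names and the numerical-differentiation choice are assumptions made here.

```python
import numpy as np

def obd_saliencies(f, theta, xs, eps=1e-5):
    """Saliency of each parameter, cf. equation (16):
    delta_c_i ~ (theta_i**2 / 2) * sum_t (d f(x_t)/d theta_i)**2,
    with the derivative estimated by central differences."""
    theta = np.asarray(theta, dtype=float)
    sal = np.zeros_like(theta)
    for i in range(len(theta)):
        tp, tm = theta.copy(), theta.copy()
        tp[i] += eps
        tm[i] -= eps
        grads = np.array([(f(tp, x) - f(tm, x)) / (2.0 * eps) for x in xs])
        sal[i] = 0.5 * theta[i] ** 2 * np.sum(grads ** 2)
    return sal

def prune_one(theta, sal):
    """Set the parameter with the smallest saliency to zero (one pruning step)."""
    theta = np.asarray(theta, dtype=float).copy()
    theta[np.argmin(sal)] = 0.0
    return theta
```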


4.2 Pruning with MDL

The minimum description length method of modelling data with neural networks can be refined. Previously, the MDL has been calculated as a function of the number of hidden nodes, i.e. the set of parameters has been examined in steps of the number of output plus input nodes. Extending the idea of pruning to MDL is therefore tempting, but not obvious. The brute force method would be to examine all combinations of pruned weights, but this is an exponentially growing function: $2^{(n+1)m} + 2^{(m+1)o}$, where $n$, $m$, and $o$ are the numbers of input, hidden, and output nodes. Even if the search is restricted to fine-tuning, i.e. not counting the cases where a hidden node is inactivated (because all its connections are set to zero), the search is still very large: $2^{nm} + 2^{mo}$. No simple solution to this problem exists. Instead a greedy search similar to Optimal Brain Damage has been derived: instead of pruning based on the least increase in the cost function, the least increase in the Two-Part MDL is estimated. Approximating the Two-Part MDL as a Taylor series to second order yields,

$$L(y, \theta + \Delta\theta) \simeq L(y, \theta) + \nabla L(y, \theta) \cdot \Delta\theta + \frac{1}{2}\Delta\theta^T \cdot H_L(y, \theta) \cdot \Delta\theta \qquad (17)$$

where $\nabla L(y, \theta)$ is the gradient vector and $H_L(y, \theta)$ is the Hessian matrix of the coding length $L(y, \theta)$.

Assumption 4.3 Assume that only a single parameter is removed in each pruning step.

Again this simplifies the equations,

$$\Delta L(y, \theta, \Delta\theta) \simeq \Delta\theta_i \frac{\partial L}{\partial \theta_i}(y, \theta) + \frac{1}{2}\Delta\theta_i^2 \frac{\partial^2 L}{\partial \theta_i^2}(y, \theta) \qquad (18)$$

Assumption 4.4 Assume that the standard deviation is independent of the parameters.

The derivatives of equation 5 (using the normal distribution and Jeffrey's prior) are found to be,

$$\frac{\partial L}{\partial \theta_i}(y, \theta) = -\frac{1}{\sigma^2 \ln 2} \sum_{t=1}^{T} [y_t - f(\theta, t)] \frac{\partial f(\theta, t)}{\partial \theta_i} + \frac{\log_2 e}{\theta_i} \qquad (19)$$

$$\frac{\partial^2 L}{\partial \theta_i^2}(y, \theta) = \frac{1}{\sigma^2 \ln 2} \sum_{t=1}^{T} \left( \left(\frac{\partial f(\theta, t)}{\partial \theta_i}\right)^2 - [y_t - f(\theta, t)] \frac{\partial^2 f(\theta, t)}{\partial \theta_i^2} \right) - \frac{\log_2 e}{\theta_i^2} \qquad (20)$$

From these equations it is clear that the effect of coding the parameters is a constant, i.e. $\Delta\theta_i \frac{\partial L(\theta)}{\partial \theta_i} = -\log_2 e$ and $\Delta\theta_i^2 \frac{\partial^2 L(\theta)}{\partial \theta_i^2} = -\log_2 e$ since $\Delta\theta_i = -\theta_i$, and they do therefore not affect the ordering of the weights. What should also be noted is that equation 18 is very similar to the OBD method, except for a linear shift,

$$\Delta L(y, \theta, \Delta\theta) = \frac{1}{\sigma^2 \ln 2} \Delta c(\theta, \Delta\theta) - \frac{3}{2}\log_2 e \qquad (22)$$

Since only the ordering of the weights matters in both OBD and this MDL-Hessian method, this linear shift has no effect and they are in principle identical algorithms. Of course one could further assume (as with OBD),

Assumption 4.5 Assume that the function is well fitted to the data, i.e. that $[y_t - f(\theta, t)] \simeq 0$,

and this would reduce equation 18 to

$$\Delta L(y, \theta, \Delta\theta) \simeq -\frac{3}{2}\log_2 e + \frac{\Delta\theta_i^2}{2\sigma^2 \ln 2} \sum_{t=1}^{T} \left(\frac{\partial f(\theta, t)}{\partial \theta_i}\right)^2 \qquad (23)$$

which will be studied further experimentally in the next section.
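Under the reconstruction of equations (22) and (23) above, the MDL-based measure is the OBD saliency scaled by $1/(\sigma^2 \ln 2)$ minus a constant, so a sketch can simply reuse the `obd_saliencies` helper from the previous section; the constant $\frac{3}{2}\log_2 e$ below follows that reconstruction and is therefore an assumption.

```python
import numpy as np

def mdl_saliencies(obd_sal, sigma):
    """MDL-based pruning measure, cf. equations (22)-(23): a linear shift of the OBD saliency.
    The additive constant (here 1.5 * log2(e), per the reconstruction above) does not change
    which weight is pruned first."""
    return np.asarray(obd_sal, dtype=float) / (sigma ** 2 * np.log(2.0)) - 1.5 * np.log2(np.e)
```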


Figure 3: The Sunspot series, usually divided into a training set (years 1700-1920) and two test sets (years 1921-1955 and 1956-1979).

4.3 Experiments on the sunspot series

All the experiments in this section were done on the sunspot series shown in figure 3. The functions considered were predictions of $x(t)$ based on the previous $n$ years $\{x(t-n), x(t-n+1), \dots, x(t-1)\}$, where $n$ is also called the lag. This series has also been studied intensively by Svarer et al. and others [10]. Note that the three sets are chosen so that the years 1700-1920 are used for training, and the years 1921-1955 and 1956-1979 are used to judge generalization. For the last 24 years a new tendency is present, and all methods should show definitely worse prediction abilities here. The test sets could and should be included in the training when using MDL, but they are used as `future' data for comparison with earlier work. In all experiments a normalized error was used, defined as,

$$E_{\mathrm{normalized}} = \frac{\sum_{t=1}^{T} [x_t - f(\theta, t)]^2}{T \sigma_{\mathrm{data}}^2} \simeq \frac{\sigma^2}{\sigma_{\mathrm{data}}^2} \qquad (24)$$

where $\sigma_{\mathrm{data}}^2 \simeq 0.2$ is the variance of the data set. All the results are presented in table 2.
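For completeness, the normalized error of equation 24 is simply a mean squared prediction error divided by the data variance; the sketch below is illustrative only, and the names are assumed.

```python
import numpy as np

def normalized_error(x, predictions, data_variance):
    """Normalized prediction error of equation (24)."""
    x, predictions = np.asarray(x, dtype=float), np.asarray(predictions, dtype=float)
    return np.mean((x - predictions) ** 2) / data_variance
```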

4.3.1 Experiments with 12-8-1 networks

First a simple experiment was conducted to compare the fitting algorithm with that used by Svarer et al.: a 12-8-1 network was trained on the sequence. As can be seen in table 2, the fit is worse than Svarer's. The precise reason for this is unclear. An MDL-Hessian pruning experiment was then conducted on a 12-8-1 network. In contrast to Svarer et al., these networks usually chose to include a lag of only up to the three previous years. An example of a fit with precision 4 is given by,

$$\hat{x}_t = \begin{bmatrix} 0.0625 \\ -0.3125 \\ -0.6250 \end{bmatrix}^T \begin{bmatrix} 1 \\ \arctan\!\left( \begin{bmatrix} 0 & 2.9375 & 0 \\ 0 & 0 & -2.9375 \end{bmatrix} \begin{bmatrix} x_{t-3} \\ x_{t-2} \\ x_{t-1} \end{bmatrix} \right) \end{bmatrix} \qquad (25)$$

The matrices have been reduced (zeros removed) to simplify the results. Note that for large networks with few remaining weights, a shorter coding scheme can be obtained by sending the position of each weight along with its value, as opposed to sending a large but almost zero matrix. This method was adopted in these experiments.


Figure 4: Fitting data for networks with various knowledge of the past. The results are shown as statistics over 20 fits for each lag. The dashed curves are minimum and maximum, and the solid lines are mean. In (a) the MDL is shown, in (b) the normalized training error (1700-1920), and in (c) and (d) the two normalized test errors (1921-1955) and (1956-1979).


Figure 5: A fit on the Sunspot series with a fixed 2-2-1 network (Mean MDL = -1.381). The dashed curve is the original sequence, and the solid curve is the fitted function. Only the years 1700-1921 have been used for training!

4.3.2 Experiments determining optimal degree and lag

To determine the optimal degree and lag, the MDL measure was minimized without pruning over 1 to 8 hidden nodes, 1 to 15 years of past knowledge (lag), and precisions from $2^{-1}$ to $2^{-15}$. This was done 20 times for each lag, and the results are presented in figure 4: the MDL, the normalized training error, and the two normalized test errors, all as functions of the lag. From these graphs it seems reasonable, especially considering the MDL statistics, that a lag of 2 (or maybe 3) is close to optimum in all 4 measures. An example of a fit (on the years 1700-1921), of degree 2, with precision 1, and with MDL = -1.38 is,

$$\hat{x}_t = \begin{bmatrix} -1.5 \\ -3.5 \\ -5.5 \end{bmatrix}^T \begin{bmatrix} 1 \\ \arctan\!\left( \begin{bmatrix} 1.5 \\ -0.5 \end{bmatrix} + \begin{bmatrix} -2.0 & 1.0 \\ 0.5 & -1.0 \end{bmatrix} \begin{bmatrix} x_{t-2} \\ x_{t-1} \end{bmatrix} \right) \end{bmatrix} \qquad (26)$$

and is shown in figure 5. Note that this figure shows all three sets: the fit on the training set, and the agreement with both test sets.
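The model search of this subsection ranges over the number of hidden nodes (1 to 8), the lag (1 to 15), and the precision, with 20 repeated fits per setting. A hedged sketch of that outer loop is given below; the helpers train_network(series, lag, hidden) and mean_mdl(net, series, q) are not given in the paper and are assumed here.

```python
def search_best_model(series, train_network, mean_mdl, repeats=20):
    """Minimize the mean MDL over lag (1-15), hidden nodes (1-8), and precision 2**-1..2**-15."""
    best = None
    for lag in range(1, 16):
        for hidden in range(1, 9):
            for _ in range(repeats):                # repeated stochastic fits for each setting
                net = train_network(series, lag, hidden)
                for q in range(1, 16):              # candidate precisions 2**-q
                    score = mean_mdl(net, series, q)
                    if best is None or score < best[0]:
                        best = (score, lag, hidden, q, net)
    return best
```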


4.3.3 Alternative models for the sunspot series

In the previous experiments various network optimization techniques have been tested on the sunspot series, but consider as an aside: are neural networks really a good model of the sunspot series? This will not be answered here, but consider two very simple models (a code sketch follows at the end of this subsection):

- Model the data as a constant (the mean value) and a list of residuals, i.e. $x(t) = k + r(t)$. This is a one-parameter model class, which in itself is a considerable reduction in the number of parameters. For the training set (1700-1920) the mean value is 0.2496. This results in an MDL value of -0.38 at precision 5 ($\lfloor 2^5 \cdot 0.2496 \rfloor / 2^5 = 0.22$).
- Model the data by repetition of the previous point, i.e. $x(t) = x(t-1) + r(t)$. This model also has only one parameter: the first point must be sent. In the calculations, the sending of this first parameter has not been included, i.e. the MDL value of -1.15 is a bit too optimistic. Nonetheless, this gives an indication of the convergence.

Note how well the repetition model performs in the MDL sense. It has the advantage of having very few parameters and still giving a fairly good indication of the future. This illustrates the main problem with the MDL method: it explicitly concentrates on reduction of the coding length. If what is sought is an understanding of the underlying machinery, then the MDL limit of much data compared to the number of parameters should be examined. In that perspective the above results can be seen as transient models having optimal properties for short sequences, but for long sequences models closer to the data-generating machinery will conquer. In other terms, when humans examine a problem, vast amounts of past experience are used as guidance in selecting a model. The unanswered problem when using MDL for studying data generators is how this information can be used in a compact way to design the model class, without having to present it as pure information (under the assumption that the model class is universal): how is the limit found for moderate sizes of data?
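A minimal sketch of these two baseline models (illustrative only; the function names are chosen here):

```python
import numpy as np

def mean_model_residuals(series):
    """x(t) = k + r(t): one parameter (the training-set mean) plus residuals."""
    series = np.asarray(series, dtype=float)
    k = float(np.mean(series))
    return k, series - k

def repetition_model_residuals(series):
    """x(t) = x(t-1) + r(t): predict each point by the previous one."""
    series = np.asarray(series, dtype=float)
    return series[1:] - series[:-1]
```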

4.3.4 Experimental results

Table 2 shows the MDL results compared to several other methods (see Svarer et al. [10]). The fit of the 12-8-1 network (without pruning) is worse than reported by Svarer et al. This seems to indicate that their fitting procedure is better. The Hessian MDL experiments show a high degree of agreement on the test errors: the mean training error is 0.15, the mean test1 error is 0.23, and the mean test2 error is 0.33. Although higher than some of the previous experiments by other authors, these are obtained with far fewer parameters, and in view of generalization they are almost certain to be more general.

5 Conclusion

Generalization is the ability of a model to extract the relevant data from a noisy set of samples. This paper has emphasized Feed Forward Neural Networks as models, since they are fast in applications but elusive in theory. The peril of generalization is that the model fits well on the training data but works poorly on future data. There are two ways to overcome this problem:

- increase the size of the training set, or
- reduce the number of parameters.

Model                     Training      Test1         Test2         Number of     Mean MDL
                          (1700-1920)   (1921-1955)   (1956-1979)   parameters
Tong & Lim (AR)           0.097         0.097         0.28          16
Weigend (Network)         0.082         0.086         0.35          43
Svarer (Linear)           0.13          0.13          0.37          13
Svarer (Network)          0.078         0.10          0.46          113
Svarer (Pruned NN)        0.090         0.082         0.35          12-16
Mean value                0.79          1.16          2.55          1             -0.38
Repetition                0.29          0.44          0.69          1             -1.15
Net0 (12-8-1)             0.27          0.33          0.65          113            3.46
MDL1 (Hessian 12-8-1)     0.15          0.27          0.33          5             -1.26
MDL2 (2-2-1)              0.14          0.18          0.32          9             -1.38

Table 2: A comparison of the normalized errors obtained on the Sunspot series. The experiments done in this report are shown in the bottom 5 rows of the table. The Mean value and the Repetition experiments are very simple alternative models. Index (0) is a 12-8-1 network without pruning, to compare with Svarer et al., (1) is a Hessian-pruned 12-8-1 network, and (2) is a 2-2-1 network without pruning.

Several methods for analysing the relationship between the generalization error and the number of parameters have been suggested. This paper argues that Rissanen's Minimum Description Length (MDL) is the best measure of generalization: validation-like methods for measuring generalization are endless screws in the sense that enough information can never be included in the training set. A fit therefore has to be chosen based on incomplete data in order to estimate this same incompleteness, i.e. the generalization. MDL chooses the simplest model in an information coding sense. But it is also the only method (known to the author) where both data and parameters are treated in an equal manner, and MDL must therefore be the only objective generalization measure (to date).

Pruning schemes like Le Cun et al.'s Optimal Brain Damage (OBD) method are therefore relevant to investigate in an MDL perspective. While OBD uses the cost function to implement the pruning scheme, the present work develops a similar scheme from a Two-Part MDL derived measure. It is shown that this MDL measure yields the same scheme as OBD under the assumptions of:

1. normal distribution of the residuals,
2. independence of the standard deviation of the residuals from the parameters, and
3. Jeffrey's prior of positive reals on the parameters.

If not, a different scheme is obtained. MDL as a pruning criterion was tested experimentally on the sunspot series considered as a time sequence. As indicated by the experiments, a much smaller number of parameters should be used than found by Svarer et al. [10] and others. Functions were seen having considerably smaller normalized error on the most difficult test set than any of the previously mentioned results published by other researchers, but the smallest mean MDL seen in the experiments does not coincide with the minimal prediction error on the two test sets. From this two conclusions can be drawn:

- Either the approximations in the MDL measure need fine-tuning (it is already known to over-estimate the MDL), or
- the `best' generalization ability obtainable from the 1700-1920 training set alone only gives rise to approximately 0.32 normalized error on the 1956-1979 test set.

If the last conclusion is the case, then the Hessian MDL measure has really proven its worth, and the complete set could be used, ensuring the best generalization on future data. This is of course the ultimate goal, but the true test will be on real future data!

References

[1] Y. Le Cun, J. Denker, and S. Solla. Optimal brain damage. In D. Touretzky, editor, Advances in Neural Information Processing Systems, pages 598-605, San Mateo, 1990.

[2] John A. Hertz, Anders Krogh, and Richard G. Palmer. Introduction to the Theory of Neural Computation. Addison-Wesley Publishing Company, 1991.

[3] K. Hornik. Multilayer feedforward networks are universal approximators. Neural Networks, 2:359-366, 1989.

[4] E. T. Jaynes. Prior probabilities. IEEE Transactions on Systems Science and Cybernetics, 4(3):227-241, 1968.

[5] M. Lehtokangas, J. Saarinen, P. Huuhtanen, and K. Kaski. Neural network modeling and prediction of multivariate time series using predictive MDL principle. In Proceedings of the International Conference on Artificial Neural Networks (ICANN-93), pages 826-829, 1993.

[6] Ming Li and Paul Vitanyi. An Introduction to Kolmogorov Complexity and Its Applications. Springer-Verlag, 1993.

[7] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C. Cambridge University Press, 1992.

[8] J. Rissanen. Stochastic Complexity in Statistical Inquiry. World Scientific, 1989.

[9] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructures of Cognition, volume 1, pages 318-362. MIT Press, 1986.

[10] C. Svarer, L. K. Hansen, and J. Larsen. On design and evaluation of tapped-delay neural network architectures. In H. R. Berenji et al., editors, Proceedings of the 1993 IEEE International Conference on Neural Networks (ICNN-93), pages 45-51, 1993.
