
Minimum Description Length Principle in the Field of Image Analysis and Pattern Recognition¹

A. S. Potapov

Vavilov State Optical Institute, Birzhevaya line 12, St. Petersburg, 199034 Russia
email: [email protected]

Abstract—Problems of the decision criterion in the tasks of image analysis and pattern recognition are considered. Overlearning, as a practical consequence of fundamental paradoxes in inductive inference, is illustrated with examples. Theoretical (on the basis of algorithmic complexity) and practical formulations of the minimum description length (MDL) principle are given. A decrease of the overlearning effect is shown in examples of modern recognition, grouping, and segmentation methods modified with the MDL principle. Novel possibilities for constructing learnable image analysis algorithms by representation optimization on the basis of the MDL principle are described.

DOI: 10.1134/S1054661811020908

1. INTRODUCTION

Tasks of image analysis and pattern recognition are rather various. At the same time they possess certain similarity, because they contain induction as an essential part. Inductive inference consists in the search for regularities in observation data and in the construction of models of the data source. Methods of inductive inference contain general components such as a model space, a decision criterion, and an optimization or search algorithm. Methods of image analysis and pattern recognition always include these components in explicit or implicit form.

Research in these fields frequently invents particular ad hoc optimality criteria and data description models on the basis of heuristic considerations. The application area of such methods turns out to be strictly restricted, because of the narrowing of the model space (the detectable regularities) and the inaccuracy of the optimality criteria. As a result, overlearning effects arise in the field of pattern recognition, and problems of constructing nontrivially learnable algorithms arise in the field of image analysis. These problems appear not only in purely heuristic methods of recognition and analysis, but also in attempts to apply seemingly correct mathematical approaches, such as the Bayesian rule for selecting the most probable model.

In this work, the minimum description length principle is described, the usage of which helps to introduce a correct decision criterion for solving the overlearning problem in the field of pattern recognition and for constructing learnable image analysis via representation optimization.

¹ This article was translated by the author.

Received December 29, 2010

2. BAYESIAN CRITERION

One of the most widely used mathematical criteria in inductive inference is based on the Bayesian rule:

P(H|D) = P(D|H) P(H) / P(D),    (1)

where P(H|D) is the posterior probability of the model H given the data D, P(H) and P(D) are prior probabilities, and P(D|H) is the likelihood of the data D given the model H.

For example, the Bayesian rule can be applied directly to the classification problem. Let D be one pattern, and let H be one of the classes. If one has probability density distributions P(D|H) of patterns within each class and unconditional probabilities P(H), one can easily select the most probable class for the given pattern by maximizing P(H|D).

Learning in statistical pattern recognition consists in inducing probability distributions on the basis of a training set {d_i, h_i}, where d_i is the ith pattern and h_i is its class. Prior probabilities P(H) can be estimated here from the frequencies of each class in the training set. The distribution P(D|H) should be represented as an element of some family P(D|H, w), where w is an indicator (e.g., a parameter vector) of a specific distribution. Using the Bayesian rule and supposing independence of patterns, one obtains

P(w|D) = P(w) ∏_i P(d_i | h_i, w) / P(D).    (2)

Values of P(d_i|h_i, w) can be explicitly calculated for the specific distribution defined by w. However, there is a problem with the evaluation of the prior probabilities P(w). In order to correctly specify these probabilities, one would need many training sets, for each of which the true probability is known. This is impossible, because such true probabilities are unknown even to the human expert who constructed the training set.
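As a minimal illustration of Eqs. (1) and (2), the following Python sketch estimates the priors P(H) from class frequencies and fits a one-dimensional Gaussian P(D|H, w) per class by maximum likelihood, then selects the class maximizing the posterior. The data, the Gaussian family, and all names are illustrative assumptions, not material from the original experiments.

import numpy as np

# Toy training set {d_i, h_i}: one-dimensional patterns with class labels 0/1.
# The numbers below are illustrative, not taken from the paper.
d = np.array([1.0, 1.2, 0.8, 3.1, 2.9, 3.3])
h = np.array([0, 0, 0, 1, 1, 1])

classes = np.unique(h)
# Prior probabilities P(H) estimated from class frequencies in the training set.
priors = {c: np.mean(h == c) for c in classes}
# ML estimates of the parameters w = (mean, std) of a Gaussian P(D | H, w) per class.
params = {c: (d[h == c].mean(), d[h == c].std()) for c in classes}

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def classify(x):
    # Maximize P(H | D) ∝ P(D | H) P(H); the denominator P(D) is the same for all classes.
    posteriors = {c: gaussian_pdf(x, *params[c]) * priors[c] for c in classes}
    return max(posteriors, key=posteriors.get)

print(classify(1.1), classify(3.0))   # expected: class 0, then class 1

Note that the prior P(w) over the distribution parameters is simply ignored in this sketch, which is exactly the gap discussed next.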


Many researchers prefer to ignore prior probabilities by using the maximum likelihood (ML) approach. The same result is obtained if one supposes that the prior probabilities are equal. This supposition is evidently incorrect, because prior distributions over infinite model spaces then become nonnormalized. In practice, it leads to the overlearning problem. For example, consider Gaussian mixture models. The likelihood of the data is maximized by the maximum number of components in the mixture (equal to the number of training patterns), leading to a degenerate distribution. This is the so-called overlearning (or overfitting) effect.

The same effect appears in the task of regression. For example, if one tries to find a polynomial that fits the given points with minimum error (maximum likelihood), one obtains a polynomial of maximum degree that follows all the errors in the data and possesses no generalization or extrapolation capabilities. The oversegmentation effect of the same origin appears in various image segmentation tasks [1]: models with more segments are more precise.

As is pointed out by researchers [2, 7], the problem of a priori probabilities is a fundamental one. It is connected with paradoxes of inductive inference such as Goodman's "grue emerald" paradox (grue emeralds are green before some future date and blue after it). The paradox consists in the fact that experiments give the same evidence for emeralds being green or grue.

Many criteria with a heuristically introduced penalty for model complexity exist, and new criteria are still being invented for particular tasks of information processing. This is surprising, because a correct general criterion was proposed half a century ago and is well known; however, its importance is still underestimated.

3. ALGORITHMIC PROBABILITY

Consider the following notion of (prefix) algorithmic complexity of a binary string β introduced by A. N. Kolmogorov:

K_U(β) = min_α { l(α) : U(α) = β },    (3)

where U is a universal Turing machine (UTM), α is an arbitrary algorithm (a program for the UTM), and l(α) is its length. In accordance with this notion, the quantity of information contained in the data (string) equals the length of the shortest program that can produce this data. In contrast to the Shannon theory, this notion of information relies not on probability, but on purely combinatorial assumptions.

R. Solomonoff proposed to derive probability from algorithmic complexity and to use it in induction. Indeed, if one can derive optimal codes from probabilities, one can invert this task and find probabilities from compression. The probability of a program α is connected with its length l(α) as follows:

P(α) = 2^(–l(α)).    (4)

An arbitrary string β can be generated by a number of programs α_i, so its algorithmic probability can be calculated using the equation

P(β) = ∑_i 2^(–l(α_i)).    (5)

This is the so-called universal distribution.
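The universal distribution itself is incomputable, but its intuition can be glimpsed with an ordinary compressor: the compressed length of a string is a crude, compressor-specific upper bound on K(β), so regular strings receive shorter codes and hence a larger weight 2^(–length). The sketch below is only such an approximation, assuming Python's zlib for illustration; it is not an evaluation of algorithmic probability itself.

import os
import zlib

# Crude stand-in for algorithmic complexity: the zlib-compressed length in bits
# upper-bounds K(beta) only up to the overhead of this particular compressor.
def approx_complexity_bits(beta: bytes) -> int:
    return 8 * len(zlib.compress(beta, 9))

regular = b"01" * 500            # highly regular string, 1000 bytes
random_like = os.urandom(1000)   # incompressible with high probability

print(approx_complexity_bits(regular))       # far below the raw 8000 bits
print(approx_complexity_bits(random_like))   # near (or slightly above) 8000 bits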

The algorithmic probability can be called the information-theoretic formalization of Occam's razor, and it solves the problem of a priori probabilities [7]. Indeed, if some string has any regularity, it can be generated by a short program, so its probability is higher. If one supposes that there are no nonalgorithmic regularities, the set of algorithms gives a universal model space. With the correct decision criterion, this yields a universal inductive inference method. Unfortunately, the algorithmic probability is incomputable.

4. MINIMUM DESCRIPTION LENGTH PRINCIPLE

More practical schemes, such as Wallace's Minimum Message Length (MML) principle and Rissanen's Minimum Description Length (MDL) principle, avoid the incomputability problem by considering restricted subsets of models. Li and Vitanyi's ideal MDL principle also utilizes computable models. In general, these principles can be obtained from algorithmic complexity if one divides the program for the UTM, α = μδ, into the algorithm itself (the regular component of the model) μ and its input data (the random component) δ:

K(β|μ) = min_δ { l(δ) : U(μδ) = β },
K(β) = min_μ [ l(μ) + K(β|μ) ].    (6)

Consequently, the equation

μ* = argmin_μ [ l(μ) + K(β|μ) ]    (7)

gives the best model via minimization of the model complexity l(μ) and the model "precision" K(β|μ) = l(δ), where δ describes the deviations of the data β from the model μ. Equation (7) is similar to the Bayesian rule if one assumes l(μ) = –log₂P(μ) and K(β|μ) = –log₂P(β|μ). But here, prior probabilities can be computed.

If one considers the computable complexity, the best model can be calculated for any string. However, computability is not enough for practical use, because the inductive inference problem in these settings appears to be NP-hard. The universal inductive inference method turns out to be practically impossible even with the correct decision criterion, because of the essential search problem.
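To make the decomposition α = μδ of Eqs. (6) and (7) concrete, here is a toy sketch assuming a deliberately simplified coder: a nearly constant byte string is described either literally or as a constant-value model μ plus a list of exception positions and values δ, and the alternative with the smaller total code length is preferred.

import math
from collections import Counter

# Toy two-part code: model mu = "constant byte value"; data component delta =
# the list of (position, value) exceptions. The bit accounting is a simplified
# illustration, not an optimal coder and not the paper's own scheme.
data = bytes([7] * 200 + [3, 9] + [7] * 198)   # almost constant, two outliers

def literal_bits(s: bytes) -> float:
    return 8.0 * len(s)                         # 8 bits per byte, no model at all

def two_part_bits(s: bytes) -> float:
    value, _ = Counter(s).most_common(1)[0]     # model mu: the dominant byte value
    exceptions = [(i, b) for i, b in enumerate(s) if b != value]
    model_bits = 8.0                            # l(mu): describe the constant value
    # l(delta): each exception needs a position (log2 n bits) and a byte value (8 bits)
    delta_bits = len(exceptions) * (math.log2(len(s)) + 8.0)
    return model_bits + delta_bits

print(literal_bits(data), two_part_bits(data))  # the two-part description is far shorter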

Fig. 1. Example of polynomial approximation (fitted curves shown for M = 2, 3, 5, and 8).

Nevertheless, the MDL principle (or its equivalents) in the following verbal form [2] appears very fruitful in the field of image analysis and pattern recognition: the best model of the given data source is the one that minimizes the sum of the length, in bits, of the model description and the length, in bits, of the data encoded with the use of the model.

Heuristic coding schemes are introduced for each particular task in order to apply this principle in practice. For example, when considering some parametric subfamily of algorithms, one needs to describe only the values of the parameters specifying the concrete model. Search within such a subfamily can be easy, but only an a priori restricted set of regularities can be captured.

5. PRACTICAL USE OF THE MDL PRINCIPLE

Consider the task of polynomial approximation. A polynomial can be considered as an algorithm that produces a string y₁y₂…yₙ from the given input string x₁x₂…xₙ, where (x_i, y_i) are the points to be fit (possibly with some residuals) by the polynomial with an M-dimensional parameter vector w:

f(x|w) = ∑_{k=0}^{M–1} w_k x^k.    (8)

Results of polynomial approximation

M    e       e [–100, 350]    L
1    20.8    64.5             45.5
2    18.0    62.8             45.1
3    8.4     6.0              35.6
4    8.1     27.0             36.9
5    8.0     70.9             38.2
6    7.6     326.8            39.3
7    7.5     590.9            40.8
8    6.6     8332             40.6
9    6.0     34912            40.9

The model description contains information about the parameters w. If the polynomial precisely fits the given data, then the data description length is zero. Otherwise, one also needs to describe the deviations of the data from the model. The data encoded with the use of the model include the deviations e_i = y_i – f(x_i|w), the description length of which can be estimated on the basis of entropy as L_e = nH({e_i}) under the assumption of their independence. If the deviations have a Gaussian distribution, the entropy is proportional to the logarithm of the dispersion; that is, optimization of L_e leads to the least squares method (LSM), which is a kind of ML method.

In order to avoid overfitting, one needs to take into account the description length of the model parameters, L_w, which depends on the precision (number of bits per parameter) with which they are described. Each parameter can be rounded with a different precision (this results in changes both in L_e and L_w) and the optimum found. There exists a rough estimate [5] L_w = 0.5 M log₂ n for M parameters and n data elements. The resulting equation for the MDL criterion is the following:

L(w) = nH({y_i – f(x_i|w)}) + (M/2) log₂ n.    (9)

Figure 1 shows an example of fitting different polynomials to a set of noisy points. One point was not included in the training set; it is used for checking extrapolation quality. The table shows the average error at the points of the training set, the average error in some wider interval, and the description length for polynomials of different degree (M – 1). It can be seen that the average error on the points from the training set decreases with the degree of the polynomial. This does not correspond to the true extrapolation precision. The MDL criterion helps to choose the optimal model complexity. Of course, similar results can be achieved with the help of the cross-validation technique. The latter, however, has several drawbacks: it does not give an understanding of the origin of overlearning; the model is constructed only on a portion of the data, which results in precision loss; and cross-validation is difficult to apply in unsupervised learning or image analysis tasks.
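The following Python sketch applies the criterion of Eq. (9) numerically: polynomials with M = 1, …, 9 parameters are fitted by least squares, the residual cost is estimated through the Gaussian entropy H ≈ 0.5 log₂(2πeσ²) bits per point, the parameter cost 0.5 M log₂ n is added, and the model with the smallest total is kept. The synthetic data and the Gaussian-entropy shortcut are assumptions for illustration; the table above was computed on different data.

import numpy as np

rng = np.random.default_rng(0)
# Synthetic noisy points; illustrative only, not the data behind Fig. 1 or the table.
n = 20
x = np.linspace(-1.0, 1.0, n)
y = 5.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0.0, 0.3, n)

def mdl_bits(m: int) -> float:
    # Least-squares fit with M = m parameters (polynomial degree m - 1): the ML step.
    coeffs = np.polyfit(x, y, deg=m - 1)
    residuals = y - np.polyval(coeffs, x)
    sigma2 = max(float(residuals.var()), 1e-12)
    # Gaussian residuals: H({e_i}) ≈ 0.5 * log2(2 * pi * e * sigma^2) bits per point,
    # so L_e = n * H({e_i}); the parameter cost is L_w = 0.5 * M * log2(n), as in Eq. (9).
    data_bits = n * 0.5 * np.log2(2.0 * np.pi * np.e * sigma2)
    model_bits = 0.5 * m * np.log2(n)
    return data_bits + model_bits

scores = {m: mdl_bits(m) for m in range(1, 10)}
print(min(scores, key=scores.get))   # typically M = 3 (a quadratic) for this synthetic data

The training error keeps falling as M grows, but the total code length does not, which is the overfitting protection described above.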


Fig. 2. Example of contour segmentation.

Fig. 3. Example of image segmentation.

The same results can be achieved in the task of pattern recognition, when it is reduced to the approximation problem and nonlinear (e.g., polynomial) discrimination functions may have different complexity [6]. Similarly, the MDL criterion helps to choose the optimal complexity of support vector models [3] and the number of components in mixture models [8].

There are also many applications of the MDL principle in image analysis tasks such as texture segmentation, feature extraction, structural description, object recognition, spatial transformation estimation, optical flow estimation, change detection, and many others. Figures 2 and 3 show some initial images and the results of segmentation based on the MDL criterion (Fig. 3 also shows results with the least squares criterion for comparison).
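In the spirit of [8], the number of mixture components can likewise be chosen by an MDL-style score instead of raw likelihood. The sketch below assumes scikit-learn's GaussianMixture (not used in the paper) and a simple count of free parameters for a one-dimensional mixture; the data and the parameter counting are illustrative assumptions.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Synthetic one-dimensional data drawn from two Gaussian clusters (illustrative).
X = np.concatenate([rng.normal(-2.0, 0.5, 150), rng.normal(2.0, 0.5, 150)]).reshape(-1, 1)
n = X.shape[0]

def mdl_bits(k: int) -> float:
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    # Negative log-likelihood in bits: -log2 P(D | model); score() is the mean log-likelihood.
    data_bits = -gmm.score(X) * n / np.log(2.0)
    # Free parameters of a 1-D mixture: k means, k variances, k - 1 weights.
    m = 3 * k - 1
    return data_bits + 0.5 * m * np.log2(n)

print(min(range(1, 7), key=mdl_bits))   # raw likelihood keeps improving with k; MDL typically picks 2 here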

CONCLUSIONS

One of the general problems in inductive inference consists in specifying the decision criterion, without which correct data interpretation is impossible. Even the widely used Bayesian criterion frequently runs into the problem of a priori probabilities, which (when ignored) results in the overlearning effect.

Algorithmic complexity gives a reliable theoretical basis for solving these problems. However, the universal a priori distribution over an algorithmically complete model space does not allow a practically realizable search. Instead, restricted families of models and heuristic coding schemes are applied, resulting in a practical MDL principle, the usage of which has shown many positive results in solving tasks of pattern recognition and image analysis.

One of the current research directions consists in filling the gap between the theoretical and practical versions of the MDL principle. It is shown [4] that the notion of representation can be correctly incorporated into this principle, yielding a criterion for automatic representation optimization instead of using static heuristic coding schemes. In particular, application of this approach gives essentially learnable image analysis algorithms.


ACKNOWLEDGMENTS

This work was supported by a grant of the Council of the President of the Russian Federation (MD-2040.2010.9).

REFERENCES

1. T. Lee, "A Minimum Description Length Based Image Segmentation Procedure, and Its Comparison with a Cross-Validation Based Segmentation Procedure," J. Am. Stat. Assoc. 95, 259–270 (2000).
2. M. Li and P. Vitanyi, "Philosophical Issues in Kolmogorov Complexity," in Proc. ICALP'92 (Vienna, 1992), pp. 1–15.
3. U. von Luxburg, O. Bousquet, and B. Schölkopf, "A Compression Approach to Support Vector Model Selection," J. Mach. Learn. Res. 5, 293–323 (2004).
4. A. S. Potapov, I. A. Malyshev, A. E. Puysha, and A. N. Averkin, "New Paradigm of Learnable Computer Vision Algorithms Based on the Representational MDL Principle," Proc. SPIE 7696, 769606 (2010).
5. J. J. Rissanen, "Modeling by the Shortest Data Description," Automat. J. IFAC 14, 465–471 (1978).
6. M. Sato, M. Kudo, J. Toyama, and M. Shimbo, "Construction of a Nonlinear Discrimination Function Based on the MDL Criterion," in Proc. 1st Workshop on Statistical Techniques in Pattern Recognition (Prague, 1997), pp. 141–146.
7. R. Solomonoff, Does Algorithmic Probability Solve the Problem of Induction? (Cambridge, MA, 1997).
8. H. Tenmoto, M. Kudo, and M. Shimbo, "MDL-Based Selection of the Number of Components in Mixture Models for Pattern Classification," Adv. Pattern Recogn., Lecture Notes in Computer Science 1451, 831–836 (1998).
