
Geometric Complexity and Minimum Description Length Principle

In Jae Myung, Shaobo Zhang and Mark A. Pitt
Department of Psychology, Ohio State University, Columbus, Ohio 43210
{myung.1, zhang.194, pitt.2}@osu.edu

June 3, 1999

Submission for the Symposium on Model Complexity at the Annual Mathematical Psychology Meeting (Santa Cruz, CA: July 31 - Aug 1, 1999)

[DRAFT MANUSCRIPT: Please do not cite without permission]

Abstract

The question of how one should decide among competing explanations of data is at the heart of the scientific enterprise. Quantitative methods of selecting among models have been advanced over the years without an underlying theoretical framework to guide the enterprise and evaluate new developments. In this paper, we show that differential geometry provides a unified understanding of the model selection problem. Foremost among its contributions is a reconceptualization of the problem as one of counting probability distributions. This reconceptualization naturally leads to the development of a "geometric" complexity measure, which turns out to be equal to the Minimum Description Length (MDL) complexity measure that Rissanen (1996) recently proposed. We demonstrate the application of the geometric complexity measure to model selection in cognitive psychology, with examples from three areas of cognitive modeling (psychophysics, information integration, categorization).

INTRODUCTION

How does one decide among competing explanations of data given limited observations? The problem of model selection is at the core of progress in science. Over the decades, scientists have used an assortment of statistical tools to select among alternative models of data. Missing from this endeavor is a theoretical framework within which to understand the problem of model selection. In this paper, we show that differential geometry, a subfield of mathematics, provides a useful framework. Not only does it recast model selection in a more intuitive and meaningful light, but it also provides valuable new insight into the interrelations among model selection methods.

Statistical Model Selection: Issues and Problems

From a statistical standpoint, data $D = \{d_1, d_2, \ldots, d_n\}$ is a sample generated from a true but unknown probability distribution, which is the regularity underlying the data. The goal of model selection in this case is straightforward: For a given set of observations (data), corrupted by noise, select from a set of competing explanations the model that best captures the regularities underlying the data. How to achieve the goal is not

straightforward because of the difficulty in reconciling two desirable yet conflicting properties of model behavior, goodness of fit and generalizability. Goodness of fit refers to how well a model fits a particular pattern of observed data. It is measured by comparing the predicted outcomes of a model, optimized with respect to its parameter values, with the observed data; a discrepancy measure such as mean squared error (MSE) is frequently used. Generalizability refers to a model's ability to fit well all data samples that arise from the same regularity. The model that generalizes best in these circumstances should be preferred. These two behavioral properties are at odds with each other because the very features of a model that improve goodness of fit can decrease generalizability (e.g., Myung, in press). The following example illustrates this relationship. The two properties of a model that influence goodness of fit and generalizability are the number of parameters in a model and its functional form, which refers to the way the parameters are combined in the model equation. Together they contribute to a model's complexity, which we define as the flexibility inherent in a model that enables it to fit diverse patterns of data. In Table 1, three models were compared on their abilities to fit two samples of data generated by one of the models (M1). Each model's parameter values were obtained by fitting the model to the first sample to assess goodness of fit; with the parameter values then fixed, generalizability was assessed by the fit to the second data sample. In the first row of Table 1 are the models' fits to the data. Models M2 and M3, with two more parameters than model M1, always fit the data better than the model that generated the data (M1). The drops in MSE relative to M1 (0.44, 0.61) represent the extent of overfitting caused by the additional parameters, $\theta_2$ and $\theta_3$. Essentially, the extra parameters allowed M2 and M3 to absorb random error in the data, improving fit well beyond that of the true model (M1), and therefore beyond what is necessary to capture the underlying regularity. Also note that M3 provided an even better fit than M2. This improvement in fit must be due to functional form, because these two models differ only in how the parameters and data are combined in the model equation. The results in the second row of Table 1 demonstrate that poor generalizability is the cost of overfitting a specific sample of data. Not only are the MSEs now greater for M2 and M3 than for M1, but these two models also provided the best fit to the data a much smaller percentage of the time: M1 provided the best fit to the second sample 59% of the time.

Table 1. Goodness of fit and generalizability of models differing in complexity.

                     M1 (true model)   M2           M3
Goodness of fit      4.28 (0%)         3.84 (25%)   3.67 (75%)
Generalizability     5.37 (59%)        5.62 (23%)   5.78 (18%)

Note: Mean squared error (MSE) of each model's fit to the data, with the percentage of samples in which the particular model fitted the data best in parentheses. The three models were as follows: M1: $Y = \theta_0 + \theta_1 X + \text{error}$; M2: $Y = \theta_0 + \theta_1 X + \theta_2 X^2 + \theta_3 \exp(X) + \text{error}$; M3: $Y = \theta_0 + \theta_1 X + \theta_2 X^2 + \theta_3 X^3 + \text{error}$, where the error was normally distributed with a mean of zero and a standard deviation of 5. A thousand pairs of samples were generated from model M1 using ($\theta_0 = 1$, $\theta_1 = 2$) on the same 10 points for X, which ranged from 0.1 to 4.6 in half-step increments.
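A simulation of this kind can be sketched in a few lines. The sketch below follows the design in the note (model equations as reconstructed there); the least-squares fitting routine, seeds, and starting values are our choices, not the authors'.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)
X = np.arange(0.1, 4.7, 0.5)  # ten points from 0.1 to 4.6 in half-step increments

def m1(x, t0, t1):              # true model: linear
    return t0 + t1 * x

def m2(x, t0, t1, t2, t3):      # two extra parameters, exponential term
    return t0 + t1 * x + t2 * x**2 + t3 * np.exp(x)

def m3(x, t0, t1, t2, t3):      # two extra parameters, cubic polynomial
    return t0 + t1 * x + t2 * x**2 + t3 * x**3

models = {"M1": m1, "M2": m2, "M3": m3}
fit_mse = {name: [] for name in models}
gen_mse = {name: [] for name in models}

for _ in range(1000):
    y_first = m1(X, 1, 2) + rng.normal(0, 5, X.size)    # calibration sample
    y_second = m1(X, 1, 2) + rng.normal(0, 5, X.size)   # fresh sample, same regularity
    for name, f in models.items():
        n_params = f.__code__.co_argcount - 1
        theta, _ = curve_fit(f, X, y_first, p0=np.zeros(n_params))
        fit_mse[name].append(np.mean((y_first - f(X, *theta)) ** 2))   # goodness of fit
        gen_mse[name].append(np.mean((y_second - f(X, *theta)) ** 2))  # generalizability

for name in models:
    print(name, np.mean(fit_mse[name]), np.mean(gen_mse[name]))
```

Because the extra terms absorb sampling noise, M2 and M3 should show lower fit MSE but higher generalization MSE than M1, reproducing the pattern in Table 1.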

This example illustrates that the best-fitting model does not necessarily generalize the best. How best to satisfy these two opposing goals, which conceptually are the edges of Occam’s razor, is at the root of model selection. The problem of model selection can be couched as one of developing a quantitative measure of model complexity, since both the number of parameters and functional form affect model behavior. Solutions to date

have met with varying degrees of success. Next we briefly summarize this work.

Prior Approaches to Measuring Model Complexity

The overarching goal of many prior model selection approaches was the estimation of a model's generalizability (for a review, see Linhart & Zucchini, 1986). Four representative methods are the Akaike Information Criterion (AIC; Akaike, 1973), the Bayesian Information Criterion (BIC; Schwarz, 1978), Stochastic Complexity (SC; Rissanen, 1987), and the Information-theoretic Measure of Complexity (ICOMP; Bozdogan, 1990):

$$\mathrm{AIC} = -2\ln f(x|\hat{\theta}) + 2k$$
$$\mathrm{BIC} = -2\ln f(x|\hat{\theta}) + k\ln n$$
$$\mathrm{SC} = -\ln f(x|\hat{\theta}) + \frac{1}{2}\ln|H(\hat{\theta})|$$
$$\mathrm{ICOMP} = -\ln f(x|\hat{\theta}) + \frac{k}{2}\ln\frac{\mathrm{trace}(\Omega(\hat{\theta}))}{k} - \frac{1}{2}\ln|\Omega(\hat{\theta})|$$

In the above equations, $f(x|\hat{\theta})$ is the maximized likelihood (ML) of the data given the model, $\hat{\theta}$ is the parameter estimate, $k$ is the number of parameters, $n$ is the sample size, $H(\hat{\theta})$ is the Hessian matrix of the minus log-likelihood, and $\Omega(\hat{\theta})$ is the covariance matrix of the parameter estimates. Each criterion consists of two parts: The first term in the equation, which is similar across criteria, represents lack of fit to the data. The remaining term(s) represent model complexity. Together they provide an estimate of the lack of generalizability; therefore, the model that minimizes a given criterion should be chosen. AIC and BIC are the most commonly used selection methods. As the above equations reveal, they consider only the number of parameters (k) and thus are incomplete solutions because they fail to take into account a model's functional form. SC and ICOMP overcome this shortcoming by incorporating a term that is sensitive not only to the number of parameters but also to functional form, through H or Ω.¹ Although SC and ICOMP represent an improvement over AIC and BIC, they fail to meet a crucial requirement: Neither is invariant under reparameterization, a condition that any meaningfully interpretable measure of complexity must satisfy. From a statistical standpoint, the parameters of a model simply index the collection of probability distributions defined by the model; thus, it should not matter what parameterization is employed to describe the model's probability distributions as long as the same collection of distributions is indexed (Batchelder, 1997).

The Minimum Description Length Approach

The Minimum Description Length criterion (MDL; Rissanen, 1996), which was proposed as an improvement over SC, represents a unique departure from the prior approaches and is the most satisfactory selection method proposed to date. MDL is defined as

$$\mathrm{MDL} = -\ln f(x|\hat{\theta}) + \frac{k}{2}\ln\frac{n}{2\pi} + \ln\int |I_1(\theta)|^{1/2}\, d\theta$$

where $I_1(\theta)$ is the Fisher information matrix of a single observation. This criterion is not only sensitive to functional form and the number of parameters but is also invariant under reparameterization.
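In code, the criterion is a direct sum of a lack-of-fit term and the two complexity terms. A minimal sketch, with the Fisher-volume integral supplied as a precomputed number (the values below are illustrative, not from any real model):

```python
import math

def mdl(max_loglik, k, n, log_fisher_volume):
    """MDL = -ln f(x|theta_hat) + (k/2) ln(n/(2 pi)) + ln ∫|I_1(theta)|^(1/2) dtheta.

    log_fisher_volume is the logarithm of the integral term, the part of
    the complexity that carries the model's functional form.
    """
    return -max_loglik + 0.5 * k * math.log(n / (2 * math.pi)) + log_fisher_volume

# Two hypothetical models with equal parameter counts but different
# functional-form terms: the larger Fisher volume is penalized.
print(mdl(max_loglik=-118.9, k=2, n=50, log_fisher_volume=1.1))
print(mdl(max_loglik=-120.3, k=2, n=50, log_fisher_volume=4.9))
```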

The MDL of a model has its origin in algorithmic coding theory in computer science and is interpreted as the shortest code length, in bits, that unambiguously describes the observed data with the help of the model (Kolmogorov, 1968; Chaitin, 1966; Solomonoff, 1964; for a review, see Li & Vitanyi, 1997). The idea behind MDL is the notion that knowledge (regularity in the data) and data redundancy are interchangeable. There must be redundancy in data: without it, every point in the data would be unique and unrelated to any other point, and there would be no regularity to be learned or extracted from the data. Put another way, the more we compress the data by extracting redundancy from it, the more we learn about the regularity underlying the data and thus the more knowledge is gained (Grunwald, in press). Although considerable progress has been made in defining complexity and tackling the selection problem, MDL and the other approaches are all stand-alone heuristic solutions. Sorely lacking is a clear, detailed conceptualization of complexity. It has been thought of primarily in terms of some combination of the parameters in a model and its functional form. As will be shown below, conceptualizing model complexity in this way can be more of a vice than a virtue. What is needed is a theoretical framework for understanding complexity. Differential geometry provides such a framework, one that unifies our understanding of complexity beyond a set of disparate model properties (e.g., functional form) and does so in an elegant and intuitive way.

COMPLEXITY AS DEFINED IN DIFFERENTIAL GEOMETRY

In differential geometry, the parametric family of probability distributions defined by a model forms a Riemannian manifold embedded in the space of probability distributions (Rao, 1945; Efron, 1975; Amari, 1983, 1985; Amari et al., 1987; Atkinson & Mitchell, 1981; Murray & Rice, 1993). Each probability distribution is mapped onto a point in this hyper-dimensional space, and the collection of all such points that a model generates by varying its parameter values gives rise to a compact differentiable manifold, in such a way that similar distributions are mapped onto nearby points on the manifold, as illustrated in Figure 1.

[Figure 1. The space of probability distributions.]

On such a manifold, a theoretically well-justified measure of model complexity, one that fits with our intuitions, can be defined. As the preceding data-fitting example reveals (Table 1), model complexity is an inherent characteristic of a model that enables it to fit a wide range of probability distributions, statistically speaking. The wider the range of distributions that a model can describe, the more complex it is. This relationship suggests that model complexity is related to the number of probability distributions that a model can generate. This naive

intuition immediately runs afoul of a problem: The number of all such distributions is uncountably infinite, so the value cannot be determined. Or can it? Given that not all distributions are equally similar to one another, the solution is to count only "distinguishable" distributions. That is, if two or more probability distributions on a model's manifold are sufficiently similar to one another to be indistinguishable for statistical purposes, they are counted as one distribution, with an ensemble of such indistinguishable distributions occupying a local neighborhood of the manifold. This procedure yields a countably infinite set of "distinguishable" distributions, the relative size of which is a natural measure of complexity.

How To Count Distinguishable Probability Distributions

The definition of distinguishability between probability distributions is based on a measure of distance between points on a model's manifold. The standard theory of Riemannian geometry states that, under suitable technical conditions, the following positive definite quadratic differential form furnishes the basis for deriving the distance between any two points on the manifold:

$$ds^2 = \sum_{i,j=1}^{k} I_{ij}(\theta)\, d\theta_i\, d\theta_j, \qquad \text{where } I_{ij}(\theta) = E\!\left[\frac{\partial \ln f(x|\theta)}{\partial \theta_i}\, \frac{\partial \ln f(x|\theta)}{\partial \theta_j}\right]$$

where $\theta = (\theta_1, \ldots, \theta_k)$ is the parameter vector of the model, $k$ is the number of parameters, and $E$ denotes the expectation. In the equation, $I_{ij}$ is a Riemannian metric tensor, which is the Fisher information matrix (Atkinson & Mitchell, 1981; Amari, 1983). The differential distance $ds$ is a natural measure of the degree of dissimilarity between two adjacent distributions. Two probability distributions are said to be indistinguishable if the probability that one is mistaken for the other is close to 1 even in the presence of an infinite amount of data. From this definition of indistinguishability and the above distance measure, it can be shown (Balasubramanian, 1997) that the volume of the parameter space that contains indistinguishable distributions is given by

$$V(\theta) = \frac{\varepsilon\, (2\pi)^{k/2}}{|I(\theta)|^{1/2}}$$

In the equation, $\varepsilon$ is an infinitesimally small positive constant (i.e., $\varepsilon \to 0^+$), which does not depend upon the model or its parameters. Note that $1/V(\theta)$ is interpreted as the number of distinguishable distributions in a unit volume of the parameter space. The number of all distinguishable distributions defined by a model is then obtained by integrating $1/V(\theta)$ over the entire parameter space as

$$N = \frac{\int |I(\theta)|^{1/2}\, d\theta}{\varepsilon\, (2\pi)^{k/2}}$$

We define the logarithm of N (up to the additive constant $\ln \varepsilon$) as a measure of model complexity, which will be referred to as "geometric" complexity throughout this paper:

$$\text{Geometric Complexity} := \ln \varepsilon N = \frac{k}{2}\ln\frac{n}{2\pi} + \ln\int |I_1(\theta)|^{1/2}\, d\theta$$

where the relationship $|I(\theta)| = n^k\, |I_1(\theta)|$ is applied. The exponential of the geometric complexity turns out to be a well-known quantity in differential geometry, the Riemannian volume (e.g., Murray & Rice, 1993); consequently, invariance of the geometric complexity under reparameterization follows. Also note in the above equation that the second, functional-form term of the geometric complexity does not depend upon the sample size n, whereas the first term increases logarithmically as n increases. This means that the effects of complexity due to functional form, relative to those of complexity due to the number of parameters, become negligible as sample size increases.
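For a concrete case, the geometric complexity of a one-parameter Bernoulli model can be computed directly: its per-observation Fisher information is 1/(θ(1-θ)), and the functional-form integral evaluates to π analytically, so numerical quadrature provides a check. A minimal sketch:

```python
import numpy as np
from scipy.integrate import quad

def fisher_info(theta):
    """Per-observation Fisher information of a Bernoulli(theta) model."""
    return 1.0 / (theta * (1.0 - theta))

# Functional-form term: integral of |I_1(theta)|^(1/2) over the parameter
# range. For the Bernoulli model this integral equals pi exactly.
volume, _ = quad(lambda t: np.sqrt(fisher_info(t)), 0.0, 1.0)

def geometric_complexity(k, n, volume):
    return 0.5 * k * np.log(n / (2 * np.pi)) + np.log(volume)

print(volume)                                      # ~3.14159
print(geometric_complexity(k=1, n=100, volume=volume))
```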

In essence, the geometric complexity represents the logarithm of a quantity proportional to the number of all distinguishable distributions (N) that a model can describe. Thus a complex model is one that generates many distinguishable probability distributions, enabling it to improve goodness of fit (e.g., by absorbing random noise) without necessarily capturing the underlying regularity in the data. In contrast, simple models generate fewer distributions.

Model Selection: A Differential Geometric View

Differential geometry provides two valuable insights into model complexity and model selection. One is a new explication of MDL. It turns out that geometric complexity is equal to the MDL complexity measure. Knowing this, we can rewrite the MDL criterion as follows:

$$\mathrm{MDL} = -\ln f(x|\hat{\theta}) + \ln \varepsilon N = -\ln\frac{f(x|\hat{\theta})}{N} + \ln \varepsilon = -\ln\left\{\text{"normalized } f(x|\hat{\theta})\text{"}\right\} + \text{constant}$$

This redefinition of MDL provides a clearer picture of what MDL does in model selection. It selects the model that gives the highest value of the maximized likelihood per distinguishable distribution, which may be called the "normalized maximized likelihood." The best model is the one with the fewest distinguishable distributions that are closest to the true distribution that generated the data. Perhaps the most important insight provided by differential geometry is that what matters when measuring a model's complexity is the size of its manifold in the space of probability distributions, not the functional form of the model or its number of parameters. The latter two properties can be red herrings when it comes to measuring complexity, as they are simply the apparatus by which the collection of distributions defined by the model is indexed. When examined individually, they can lead to an insufficient, even misleading, understanding of complexity. Differential geometry shows that it does not matter what parameterization or what functional form is used in the indexing as long as the same collection of distributions is cataloged on the manifold. For example, the following two models, although they assume different functional forms, are equivalent and equally complex:

$$\text{Model A: } Y = \frac{ab}{ab + (1-a)(1-b)} + \text{error} \qquad (0 < a, b < 1)$$
$$\text{Model B: } Y = \frac{1}{1 + e^{\alpha+\beta}} + \text{error} \qquad (-\infty < \alpha, \beta < \infty)$$

where the error has a zero mean and follows the same distribution for both models. The parameter $\theta = (a, b)$ of model A is related to the parameter $(\alpha, \beta)$ of model B through $\alpha = \ln((1-a)/a)$ and $\beta = \ln((1-b)/b)$ (Batchelder, 1997).
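A quick numerical check of this equivalence, under the parameter mapping just given (our sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = rng.uniform(0.01, 0.99, size=2)   # any point in model A's parameter space

# Model A's mean prediction in the (a, b) parameterization.
pred_a = a * b / (a * b + (1 - a) * (1 - b))

# Map to model B's parameterization: alpha = ln((1-a)/a), beta = ln((1-b)/b).
alpha = np.log((1 - a) / a)
beta = np.log((1 - b) / b)

# Model B's mean prediction in the (alpha, beta) parameterization.
pred_b = 1.0 / (1.0 + np.exp(alpha + beta))

print(np.isclose(pred_a, pred_b))        # True: same distribution, two indexings
```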

APPLICATION EXAMPLES

The geometric complexity measure and the MDL criterion constitute a powerful pair of model evaluation tools. When used together in model testing, a deeper understanding of the relationships between models can be gained. In this section we demonstrate the application of geometric complexity and MDL to model selection in three areas of

cognitive modeling (psychophysics, information integration, categorization). In each example, two competing models with the same number of parameters but different functional forms were fit to data sets generated by each model. The model recovery rates of three selection methods (MDL with the geometric complexity term, AIC, and BIC) were compared.

Psychophysics

Consider two models of psychophysics, which were developed to describe the relationship between physical dimensions (e.g., light intensity) and their psychological counterparts (e.g., brightness):

$$\text{Stevens' model: } Y = aX^b + \text{error}$$
$$\text{Fechner's model: } Y = a\ln(X + b) + \text{error}$$

When the two models were fitted to data sets generated by each model, under AIC (or BIC) Stevens' model was selected more often than Fechner's model for both data sets, even when the data were generated by the latter (63% vs. 37%; see Table 2). That is, AIC (or BIC) overestimated the generalizability of Stevens' model relative to that of Fechner's model. This result suggests that Stevens' model is more complex than Fechner's. Interestingly, Townsend (1975) made the same conjecture 24 years ago, pointing out that Stevens' model can fit data patterns with a negative, positive, or zero curvature, whereas Fechner's model can fit only data patterns with a negative curvature. Calculation of the geometric complexity of each model confirms Townsend's suspicion: Stevens' model is 44.9 times more complex than Fechner's.² This means that for every distinguishable distribution Fechner's model can account for, there exist about 45 distinguishable distributions that Stevens' model can describe. Obviously, this difference in complexity between the two models must be due to functional form, because they have the same number of parameters. When MDL was used to choose between the models, model recovery was nearly perfect for both models, because the effect of complexity due to functional form was appropriately taken into account.

Table 2. Model Recovery Rates of Two Psychophysics Models.

                                    Data from:
Selection Method   Model Fitted   Stevens   Fechner
AIC, BIC           Stevens        100%      63%
                   Fechner        0%        37%
MDL                Stevens        99%       2%
                   Fechner        1%        98%

Note: The percentage of samples in which the particular model fitted the data best. A thousand samples were generated from each model using the same four points for X, which ranged from 1 to 4 in one-step increments. The random error was normally distributed with a mean of zero and a standard deviation of 1. The parameter values used to generate the simulated data were a = 2 and b = 2 for Stevens' model and a = 2 and b = 5 for Fechner's model.
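A minimal sketch of this recovery simulation under the design in the note. Because both models have two parameters, AIC and BIC reduce here to comparing residual error, so the sketch scores the models by minimized sum of squared error; the fitting routine, bounds, and starting values are our choices.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(2)
X = np.array([1.0, 2.0, 3.0, 4.0])

def stevens(x, a, b):
    return a * x**b

def fechner(x, a, b):
    return a * np.log(x + b)

def min_sse(f, y):
    theta, _ = curve_fit(f, X, y, p0=[1.0, 1.0], bounds=(0.0, np.inf))
    return np.sum((y - f(X, *theta)) ** 2)

stevens_wins = 0
for _ in range(1000):
    y = fechner(X, 2.0, 5.0) + rng.normal(0.0, 1.0, X.size)  # Fechner is true
    if min_sse(stevens, y) < min_sse(fechner, y):
        stevens_wins += 1

# With equal parameter counts, AIC/BIC pick whichever model fits better;
# Stevens' extra flexibility wins on many Fechner-generated samples.
print("Stevens chosen on Fechner data:", stevens_wins / 1000)
```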

Information Integration

In a typical information integration experiment, a range of stimuli are generated from a factorial manipulation of two or more stimulus dimensions (e.g., visual and auditory) and then presented to participants for categorization as one of two possible response alternatives. Data are scored as the proportion of responses in one category across the various combinations of stimulus dimensions. Consider two models of information

integration, the Fuzzy Logical Model of Perception (FLMP) by Oden and Massaro (1978) and the Linear Integration Model (LIM) by Anderson (1981). Each assumes that the response probability ($p_{ij}$) of one category, say A, upon the presentation of a stimulus of the specific i and j feature dimensions in a two-factor information integration experiment takes the following form:

$$\text{FLMP: } p_{ij} = \frac{\theta_i \lambda_j}{\theta_i \lambda_j + (1-\theta_i)(1-\lambda_j)}$$
$$\text{LIM: } p_{ij} = \frac{\theta_i + \lambda_j}{2}$$

where $\theta_i$ and $\lambda_j$ ($i = 1, \ldots, q_1$; $j = 1, \ldots, q_2$; $0 < \theta_i, \lambda_j < 1$) are parameters representing the corresponding feature dimensions. When the two models were fitted to data sets generated by each model, under AIC (or BIC) there was again a large asymmetry in model recovery, suggesting that the two models are not equally complex. As can be seen in Table 3, FLMP was always selected when it generated the data, and it competed with LIM when LIM generated the data. LIM, on the other hand, never once provided the best fit to data generated by FLMP. When the geometric complexity of each model was calculated, the ratio of FLMP's complexity to LIM's, which arises from the difference in functional form between the two models, was 4744.8.³ When MDL was used to choose between the models, the bias in model recovery rate was corrected, and even completely eliminated in one case.

Table 3. Model Recovery Rates of Two Information Integration Models.

                                                 Data from:
Sample size   Selection Method   Model Fitted   FLMP   LIM
n = 20        AIC, BIC           FLMP           100%   51%
                                 LIM            0%     49%
              MDL                FLMP           89%    0%
                                 LIM            11%    100%
n = 60        AIC, BIC           FLMP           100%   33%
                                 LIM            0%     67%
              MDL                FLMP           100%   0%
                                 LIM            0%     100%

Note: The percentage of samples in which the particular model fitted the data best. Simulated data in a 2 x 8 factorial design (q1 = 2; q2 = 8) were created from predetermined values of the ten parameters, θ = (0.15, 0.85) and λ = (0.05, 0.10, 0.26, 0.42, 0.58, 0.74, 0.90, 0.95). From these, sixteen binomial response probabilities ($p_{ij}$) were computed using each model equation. For each probability, a series of n (i.e., n = 20 or 60) independent binary outcomes (0 or 1) was generated according to the binomial probability distribution. The number of ones in the series was summed and divided by n to obtain an observed proportion. This way, each sample consisted of sixteen observed proportions. For each sample size, a thousand samples were generated from each of the two models. In parameter estimation as well as complexity calculation, θ1 was fixed at 0.15 so that the models became identifiable.
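A minimal sketch of the data-generation step for this design, following the note's parameter values (model fitting and criterion computation proceed as in the earlier examples):

```python
import numpy as np

rng = np.random.default_rng(3)
theta = np.array([0.15, 0.85])                                      # q1 = 2
lam = np.array([0.05, 0.10, 0.26, 0.42, 0.58, 0.74, 0.90, 0.95])   # q2 = 8

def flmp(t, l):
    return t * l / (t * l + (1 - t) * (1 - l))

def lim(t, l):
    return (t + l) / 2.0

# 2 x 8 tables of true response probabilities under each model.
p_flmp = flmp(theta[:, None], lam[None, :])
p_lim = lim(theta[:, None], lam[None, :])

# One simulated sample: n binary outcomes per cell, scored as a proportion.
n = 20
sample = rng.binomial(n, p_flmp) / n
print(sample.round(2))
```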

Categorization

Two models of categorization were considered in the present demonstration: the generalized context model (GCM; Nosofsky, 1986) and the prototype model (PRT; Reed, 1972). Each model assumes that categorization responses follow a multinomial probability distribution with $p_{iJ}$ (the probability of a category $C_J$ response given stimulus $X_i$), which is given by:

$$\text{GCM: } p_{iJ} = \frac{\sum_{j \in C_J} s_{ij}}{\sum_K \sum_{k \in C_K} s_{ik}}, \qquad \text{where } s_{ij} = \exp\!\left[-c \cdot \left(\sum_{m=1}^{M} w_m |x_{im} - x_{jm}|^r\right)^{1/r}\right]$$

$$\text{PRT: } p_{iJ} = \frac{s_{iJ}}{\sum_K s_{iK}}, \qquad \text{where } s_{iJ} = \exp\!\left[-c \cdot \left(\sum_{m=1}^{M} w_m |x_{im} - x_{Jm}|^r\right)^{1/r}\right]$$

In the equations, $s_{ij}$ is a similarity measure between multidimensional stimuli $X_i$ and $X_j$, $s_{iJ}$ is a similarity measure between stimulus $X_i$ and the prototypic stimulus $X_J$ of category $C_J$, $M$ is the number of stimulus dimensions, $c$ is a sensitivity parameter, $w_m$ is an attention weight parameter, and $r$ is the Minkowski metric parameter.

Table 4. Model Recovery Rates of Two Categorization Models.

                                                 Data from:
Sample size   Selection Method   Model Fitted   GCM    PRT
n = 20        AIC, BIC           GCM            98%    15%
                                 PRT            2%     85%
              MDL                GCM            96%    7%
                                 PRT            4%     93%
n = 60        AIC, BIC           GCM            98%    4%
                                 PRT            2%     96%
              MDL                GCM            98%    4%
                                 PRT            2%     96%

Note: The percentage of samples in which the particular model fitted the data best. Simulated data were created from predetermined values of the seven parameters, c = 2.5 and w = (0.2, 0.2, 0.2, 0.2, 0.1, 0.1). From these, twenty-seven trinomial response probabilities ($p_{iJ}$, i = 1, ..., 9; J = 1, 2, 3) were computed using each model equation. For each probability, a series of n (i.e., n = 20 or 60) independent ternary outcomes was generated according to the trinomial probability distribution. The number of outcomes of each type in the series was summed and divided by n to obtain an observed proportion. This way, each sample consisted of twenty-seven observed proportions. For each sample size, a hundred samples were generated from each of the two models.
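A minimal sketch of the two models' response probabilities under the Euclidean metric (r = 2). The stimulus coordinates and category assignments below are placeholders, not the Shin and Nosofsky (1992) scaling solution, and taking each prototype as a category's mean exemplar is our simplifying assumption.

```python
import numpy as np

rng = np.random.default_rng(4)
M = 6                                          # number of stimulus dimensions
coords = rng.normal(size=(9, M))               # placeholder stimulus coordinates
cats = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])   # placeholder category assignments
c = 2.5                                        # sensitivity parameter (from the note)
w = np.array([0.2, 0.2, 0.2, 0.2, 0.1, 0.1])   # attention weights (from the note)
r = 2.0                                        # Euclidean metric

def similarity(x, y):
    """s = exp(-c * (sum_m w_m |x_m - y_m|^r)^(1/r))."""
    return np.exp(-c * np.sum(w * np.abs(x - y) ** r) ** (1.0 / r))

def gcm(i, J):
    """Summed similarity to category J's exemplars, normalized over all exemplars."""
    num = sum(similarity(coords[i], coords[j]) for j in np.where(cats == J)[0])
    den = sum(similarity(coords[i], coords[j]) for j in range(len(coords)))
    return num / den

def prt(i, J):
    """Similarity to category prototypes (here, each category's mean exemplar)."""
    protos = [coords[cats == K].mean(axis=0) for K in range(3)]
    sims = np.array([similarity(coords[i], p) for p in protos])
    return sims[J] / sims.sum()

print(gcm(0, 0), prt(0, 0))
```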

The two models were fitted to data sets generated by each model using the six-dimensional scaling solution for Experiment 1 of Shin and Nosofsky (1992) under the Euclidean distance metric of r = 2. As shown

in Table 4, under AIC (or BIC), virtually no bias in model recovery rate was observed, except for a modest tendency toward choosing GCM for the PRT data at n = 20. This result suggests that GCM and PRT are more or less equally complex. Calculation of the geometric complexity of each model reveals that GCM is slightly more complex than PRT, 1.58 times, to be exact.⁴ When MDL was used to choose between the models, the bias in model recovery rate observed in the small-sample-size condition for the PRT data was corrected.

To summarize, the above results demonstrate that MDL is superior to the other selection methods in recovering the model that generated the data, especially with small sample sizes, which tend to be most common in experimental psychology. Together, these results clearly indicate that functional form must be taken into account in model selection. Furthermore, this new selection method provides a means of quantifying the relative complexity of the models being compared, something no other selection method has given us.

SUMMARY AND CONCLUSION

Model selection can proceed far more confidently and knowledgeably when a theoretically well-justified framework for understanding its central concept, complexity, is available. Differential geometry provides such a framework, one that is conceptually intuitive and readily reveals why one model is more complex than another. Geometric complexity is a higher-order measure of complexity that is naturally all-inclusive because it is based on a model's probability distributions, not on seemingly independent and possibly incomplete model properties such as the number of parameters and functional form. Together with the MDL selection method, these two tools provide an effective means of combating the contribution of error in model selection and of developing a clearer understanding of the relationships between models.

AUTHOR NOTES

The authors wish to thank Michael Browne, Richard Shiffrin and James Townsend for valuable comments on an earlier version of the paper. This research was supported by NIMH Grant MH57472 to I.J.M. and M.A.P. The present paper is based largely on a paper submitted for publication (Myung, Balasubramanian, & Pitt, 1999).

ENDNOTES

1. As the sample size n goes to infinity, |H| becomes proportional to $n^k$ whereas |Ω| becomes proportional to $n^{-k}$.

2. In computing geometric complexity measures, the following parameter ranges were assumed: 0 < a < ∞, 0 < b < 3 for Stevens' model, and 0 < a, b < ∞ for Fechner's model.

3. In computing geometric complexity measures, the following parameter ranges were assumed: 0.001