On the Consistency of Minimum Complexity Nonparametric Estimation



Zhiyi Chi and Stuart Geman

Abstract—Nonparametric estimation is usually inconsistent without some form of regularization. One way to impose regularity is through a prior measure. Barron and Cover [1], [2] have shown that complexity-based prior measures can ensure consistency, at least when restricted to countable dense subsets of the infinite-dimensional parameter (i.e., function) space. Strangely, however, these results are independent of the actual complexity assignment: the same results hold under an arbitrary permutation of the match-up of complexities to functions. We will show that this phenomenon is related to the weakness of the convergence measures used. Stronger convergence can only be achieved through complexity measures that relate to the actual behavior of the functions.

Index Terms—Consistency, minimum complexity estimation, minimum description length, nonparametric estimation.

Manuscript received December 9, 1996; revised October 20, 1997. This work was supported by the Army Research Office under Contract DAAL03-92-G-0115, the National Science Foundation under Grant DMS-9217655, and the Office of Naval Research under Grant N00014-96-1-0647. The authors are with the Division of Applied Mathematics, Brown University, Providence, RI 02912 USA. Publisher Item Identifier S 0018-9448(98)04795-6.

I. INTRODUCTION

Maximum-likelihood, least squares, and other estimation techniques are generally inconsistent for nonparametric (infinite-dimensional) problems. Some variety of regularization is needed. An appealing and principled approach is to base regularization on complexity: define an encoding of the (infinite-dimensional) parameter, and adopt codelength as a penalty. Barron and Cover [1], [2] have shown how to make this work. They get consistent estimation for densities and regressions, as well as some convergence-rate bounds, by constructing complexity-based penalty terms for maximum-likelihood and least squares estimators.

Can we cite the results of Barron and Cover as an argument for complexity-based regularization (or, equivalently, for complexity-based priors)? Apparently not: the results are independent of the particular assignment of complexities. Specifically, the results are unchanged by an arbitrary permutation of the matching of complexities to parameters. Of course there are many ways to define convergence of functions. We will show here that the surprising indifference of convergence results to complexity assignments is in fact related to the convergence measures used. Stronger convergence requires a stronger tie between the parameters (functions) and their complexity measures.

Section II is a review of some Barron and Cover results. Then some new results about consistency for nonparametric regression are presented in Section III. (Proofs are in the Appendix.) Taken together, the results of Section III establish the principle that stronger types of convergence are sensitive to the particulars of the complexity assignment. We work here with regression, but the situation is analogous in density estimation. Our results are about consistency only. The important practical issue of relating complexity measures to rates of convergence remains open.

II. COMPLEXITY-BASED PRIORS

Barron and Cover [1] have shown that the problem of estimating a density nonparametrically can be solved using a complexity-based prior by limiting the prior to a countable dense subset of the space of densities. More specifically, given a sequence of countable sets of densities $\Gamma_n$, and numbers $L_n(q)$ for densities $q$ in $\Gamma_n$, let $\Gamma = \cup_n \Gamma_n$. Set $L_n(q) = \infty$ for $q$ not in $\Gamma_n$. For independent random variables $X_1, X_2, \cdots, X_n$ drawn from an unknown probability density function $p$, a minimum complexity density estimator $\hat p_n$ is defined as a density achieving the following minimization:

$$\min_{q \in \Gamma}\left[L_n(q) - \sum_{i=1}^{n} \log q(X_i)\right].$$

If we think of $L_n(q)$ as the description length of the density $q$, then the minimization is over total description length—accounting for both the density and the data. Barron and Cover showed that if $L_n$ satisfies the summability condition

$$\sup_n \sum_{q \in \Gamma_n} 2^{-L_n(q)} < +\infty$$

and the growth restriction

$$\lim_{n \to \infty} \frac{L_n(q)}{n} = 0, \quad \text{for every } q \in \Gamma \tag{1}$$

then for each measurable set $S$

$$\lim_{n \to \infty} \hat P_n(S) = P(S) \quad \text{with probability one}$$


provided that $p$ is in the information closure $\bar\Gamma$ of $\Gamma$. Here, $\hat P_n$ and $P$ are the probability measures associated with the densities $\hat p_n$ and $p$, respectively, and "$p$ is in the information closure $\bar\Gamma$ of $\Gamma$" means that $\inf_{q \in \Gamma} D(p\|q) = 0$, where $D(p\|q)$ is the relative entropy of $p$ to $q$. Barron and Cover also showed that if $L_n$ satisfies a "light tail condition," i.e., if for some $0 < \alpha < 1$ and $b$

$$\sum_{q \in \Gamma_n} 2^{-\alpha L_n(q)} \le b, \quad \text{for all } n \tag{2}$$

and if $L_n$ also satisfies the growth restriction (1), then for $p \in \bar\Gamma$, with probability one

$$\lim_{n \to \infty} \int |p - \hat p_n| = 0.$$
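To make the two-part-code criterion concrete, here is a small illustrative sketch, not taken from [1]: the candidate class (a grid of Beta densities), the toy codelengths, and the function names are all our own assumptions, chosen only to show the shape of the minimization.

```python
import numpy as np
from scipy import stats

# Hypothetical candidate class Gamma_n: Beta(a, b) densities on a coarse
# parameter grid, with a made-up codelength that grows slowly with (a, b).
def candidates():
    for a in range(1, 6):
        for b in range(1, 6):
            codelength = 4.0 + np.log2(a * b)      # toy L_n(q), in bits
            yield codelength, stats.beta(a, b)

def minimum_complexity_density(x):
    """Pick the candidate minimizing  L_n(q) - sum_i log2 q(X_i)."""
    best_cost, best_q = np.inf, None
    for codelength, q in candidates():
        # total description length: bits for the density plus bits for the data
        cost = codelength - np.sum(np.log2(np.clip(q.pdf(x), 1e-300, None)))
        if cost < best_cost:
            best_cost, best_q = cost, q
    return best_q

x = stats.beta(2, 3).rvs(size=200, random_state=0)
print(minimum_complexity_density(x).args)   # usually recovers (2, 3)
```

With so few candidates the codelength term hardly matters in this toy example, which foreshadows the indifference discussed below.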

A second paper by Barron [2] offers a minimum-complexity solution to the regression problem. Let $(X_i, Y_i)_{i=1}^n$ be independent observations drawn from the unknown joint distribution of random variables $X, Y$, where the support of $X$ is in $R^d$. Here $X$ is the vector of explanatory variables and $Y$ is the response variable. Functions $f(X)$ are used to predict the response. The error incurred by a prediction is measured by a distortion function $d(Y, f(X))$, the most common form being $(Y - f(X))^2$. Let $h$ be a function which minimizes $E(d(Y, f(X)))$, which is to say that $h(x) = E(Y \mid X = x)$ in the squared-error case. When a function $f$ is used in place of the optimum function $h$, the "regret" is measured by the difference between the expected distortions

$$r(f, h) = E(d(Y, f(X))) - E(d(Y, h(X))).$$

Barron defines the statistical risk for a given estimator $\hat h_n$ to be $E(r(\hat h_n, h))$. Given a sequence of countable collections of functions $\Gamma_n$, and numbers $L_n(f)$, $f \in \Gamma_n$, satisfying the summability condition

$$\sup_n \sum_{f \in \Gamma_n} 2^{-L_n(f)} < \infty$$

the index of resolvability is defined as

$$R_n(h) = \min_{f \in \Gamma_n}\left[r(f, h) + \frac{\lambda}{n} L_n(f)\right]$$

and a minimum complexity estimator is a function $\hat h_n \in \Gamma_n$ which achieves

$$\min_{f \in \Gamma_n}\left[\frac{1}{n}\sum_{i=1}^{n} d(Y_i, f(X_i)) + \frac{\lambda}{n} L_n(f)\right].$$

Again there is a coding interpretation: if $d(Y, f(X))$ is the log probability of $Y$ given $X$, then $\hat h_n$ minimizes total description length for the model $f$, plus the data $Y_1, \cdots, Y_n$ given $X_1, \cdots, X_n$. Barron showed that if the support of $Y$ and the range of each function $f(X)$ lie in a known interval of length $b$, then with $\lambda \ge 5b^2/3\log e$, the mean-squared error converges to zero at a rate bounded by $R_n(h)$, i.e.,

$$E(r(\hat h_n, h)) \le O(R_n(h)). \tag{3}$$
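The regression criterion has the same two-part form. The sketch below is not Barron's construction: the model class (piecewise-constant fits on $m$ equal bins of $[0,1]$), the toy codelength $L_n(f) = m$, and the choice $\lambda = 1$ are all our own illustrative assumptions.

```python
import numpy as np

def min_complexity_regression(x, y, lam=1.0, max_bins=50):
    """Minimize (1/n) sum_i (Y_i - f(X_i))^2 + (lam/n) L_n(f) over toy models."""
    n = len(y)
    best = None
    for m in range(1, max_bins + 1):                      # candidate model index
        bins = np.minimum((x * m).astype(int), m - 1)     # bin of each X_i
        means = np.array([y[bins == j].mean() if np.any(bins == j) else 0.0
                          for j in range(m)])
        fitted = means[bins]
        cost = np.mean((y - fitted) ** 2) + lam / n * m   # distortion + penalty
        if best is None or cost < best[0]:
            best = (cost, m, means)
    return best[1], best[2]

rng = np.random.default_rng(0)
x = rng.uniform(size=400)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(400)
m, _ = min_complexity_regression(x, y)
print(m)    # a moderate bin count: the penalty trades fit against complexity
```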

Taken together, these results offer a general prescription for nonparametric estimation of densities and regressions. Furthermore, the connection to complexity is appealing: it is not hard to invent suitable functions $L_n(\cdot)$ by counting the bits involved in a natural encoding of $\Gamma_n$ (cf. [1]). There is, however, a disturbing indifference of the results to the details of the complexity measure. For any set of permutations $\pi_n$ on $\Gamma_n$, define $L_n'(\cdot) = L_n(\pi_n(\cdot))$ and observe that $L_n'$ satisfies whatever conditions $L_n$ does, and hence the same results are obtained (with the same bound on rate in (3)) using $L_n'$ in place of $L_n$! In general $L_n'$ will have no meaningful interpretation as a complexity measure.
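The indifference is elementary: the summability and light-tail conditions see only the multiset of codelength values, not which function carries which value. A short numerical check, with arbitrary made-up codelengths:

```python
import numpy as np

rng = np.random.default_rng(1)
L = rng.integers(1, 20, size=1000).astype(float)   # toy codelengths L_n(f)
L_perm = L[rng.permutation(L.size)]                # L'_n = L_n composed with a permutation

# Kraft-type sums are identical, so every condition stated above still holds.
print(np.sum(2.0 ** -L), np.sum(2.0 ** -L_perm))
```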

III. WHAT TIES CONSISTENCY TO COMPLEXITY?

Suppose that $X$ is a random variable from a probability space $(\Omega, \mathcal{F}, P)$ to $([0,1], \mathcal{B})$. $X$ induces a measure $P_X$ on $[0,1]$ through the relation $P_X(B) = P(X^{-1}(B))$, for $B \in \mathcal{B}$. Choose a countable dense subset $\Gamma$ in $L^2([0,1], P_X)$, and define a "complexity function" $L\colon \Gamma \to \mathbb{N}$. For any random variable $Y$ from $(\Omega, \mathcal{F}, P)$ to $(\mathbb{R}, \mathcal{B})$ with

$$h(x) = E(Y \mid X = x) \in L^2([0,1], P_X)$$

define the estimator $\hat h_n$ to be a function in $\Gamma$ which achieves

$$\min_{f \in \Gamma}\left[\frac{L(f)}{n} + \frac{1}{n}\sum_{i=1}^{n}(Y_i - f(X_i))^2\right].$$

We will always assume that $L$ satisfies a much stronger tail condition than (2):

$$\sum_{f \in \Gamma} e^{-\epsilon L(f)} < \infty, \quad \text{for every } \epsilon > 0. \tag{4}$$
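One way to manufacture an $L$ satisfying (4) mirrors the footnote recipe recalled in the Appendix: start from summable weights $a(f)$ and apply a superlinearly growing $F$. The enumeration, the weights $a_k = 1/k^2$, and $F(x) = \lceil x^2 \rceil$ below are our own illustrative choices.

```python
import math

# Enumerate a countable Gamma by k = 1, 2, ... with weights a_k = 1/k^2
# (summable).  With F(x) = ceil(x^2), so that F(x)/x -> infinity,
# L(f_k) = F(-log a_k) = ceil(4 (log k)^2) is integer-valued and satisfies (4):
# sum_k exp(-eps * L(f_k)) < infinity for every eps > 0.
def L(k: int) -> int:
    a_k = 1.0 / k ** 2
    return max(1, math.ceil((-math.log(a_k)) ** 2))

eps = 0.5
print(sum(math.exp(-eps * L(k)) for k in range(1, 10_000)))  # partial sums stabilize quickly
```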

The first proposition demonstrates that for a weak form of convergence, consistency is essentially independent of the complexity measure.

Proposition 1: If $EY^4 < \infty$, then

$$\hat h_n \xrightarrow{P} h, \quad \text{a.s.}$$

Obviously, the proposition remains true for any permutation $\pi$ of $\Gamma$ and resulting complexity function $L'(f) = L(\pi(f))$. But suppose we were to ask for consistency in $L^2$ (a.s.) in place of consistency in probability (a.s.)? Then, despite the strength of the tail condition (4), we would evidently need to pay closer attention to the complexity measure.

Proposition 2: There exists a random variable $X$, a countable dense subset $\Gamma$ in $L^2([0,1], P_X)$, and a function $L\colon \Gamma \to \mathbb{N}$ satisfying (4) such that for any $Y$ with $h(x) \notin \Gamma$, the $L^2$ norm of $\hat h_n$ (in $L^2([0,1], P_X)$) goes to $+\infty$ with probability one.

(We are focusing on the regression problem, but analogous arguments apply to probability density estimation. For example, by a construction similar to the one used for Proposition 2, the minimum complexity density estimator discussed in Barron and Cover [1] may not converge to the actual density $p$ in the sense of Kullback–Leibler divergence

$$\int p \log \frac{p}{\hat p_n} \not\to 0$$

even though the coding $L$ satisfies the strong condition (4).)

One way to rescue consistency is to tie the complexity measure $L(f)$ more closely to $f$:

Proposition 3: Suppose that for every $f \in \Gamma$, $Ef^4(X) < \infty$. Assume $EY^4$ is finite (and hence so is $Eh^4(X)$). Construct a complexity function as follows: First, define

$$C_1(f) = (Ef^4(X) + e)e^{2Ef^2(X)}$$

and

$$C(f) = C_1(f)\log C_1(f).$$


Then, given any $L_1\colon \Gamma \to \mathbb{N}$ which satisfies (4), let $L(f) = C(f)L_1(f)$. Then

$$\hat h_n \xrightarrow{L^2} h \quad \text{a.s.}$$
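To give a feel for how this assignment tracks the behavior of $f$, here is a small Monte Carlo sketch of $C(f)$; the sampling scheme and test functions are our own, and the exponent $2Ef^2(X)$ follows the reconstruction of $C_1$ above.

```python
import numpy as np

def C(f, x):
    """C1(f) = (E f^4 + e) exp(2 E f^2);  C(f) = C1(f) log C1(f), by Monte Carlo."""
    fx = f(x)
    c1 = (np.mean(fx ** 4) + np.e) * np.exp(2.0 * np.mean(fx ** 2))
    return c1 * np.log(c1)

rng = np.random.default_rng(0)
x = rng.uniform(size=100_000)                   # stand-in for samples from P_X

# Functions with larger moments receive a much larger complexity multiplier,
# which is what ties the codelength to the function in Proposition 3.
print(C(lambda t: np.sin(2 * np.pi * t), x))    # modest
print(C(lambda t: 5.0 * t, x))                  # orders of magnitude larger
```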

Proofs for the propositions are in the Appendix.

APPENDIX

Recall that $X$ is a random variable defined on a probability space $(\Omega, \mathcal{F}, P)$, taking values in $([0,1], \mathcal{B})$. $P_X$ is defined on $[0,1]$ by $P_X(B) = P(X^{-1}(B))$, for $B \in \mathcal{B}$. $\Gamma$ is then a countable dense subset of $L^2([0,1], P_X)$. (Take, for example, $\Gamma$ to be a countable dense set in $L^2([0,1], dx)$; this will work for any $P_X$ which is absolutely continuous with respect to Lebesgue measure and has bounded derivative $dP_X/dx$.) The complexity function $L\colon \Gamma \to \mathbb{N}$ is always assumed to satisfy the "strong tail condition" (4). (For example, choose $a(\cdot)$ strictly positive such that $\sum_f a(f) < \infty$. If $F(x)$ is any strictly positive function satisfying $F(x)/x \to \infty$ as $x \to \infty$, then $L(f) = F(-\log a(f))$ satisfies (4).) Finally, we assume that the response variable $Y$ (a random variable on $(\Omega, \mathcal{F}, P)$) has an $L^2$-valued regression

$$h(x) = E(Y \mid X = x) \in L^2([0,1], P_X).$$

The regression $h(x)$ is estimated by a function $\hat h_n$ which achieves the minimum in

$$\min_{f \in \Gamma}\left[\frac{L(f)}{n} + \frac{1}{n}\sum_{i=1}^{n}(Y_i - f(X_i))^2\right].$$

We begin with Proposition 2.

Proposition 2: There exists a random variable $X$, a countable dense subset $\Gamma$ in $L^2([0,1], P_X)$, and a function $L\colon \Gamma \to \mathbb{N}$ satisfying (4) such that for any $Y$ with $h(x) \notin \Gamma$, the $L^2$ norm of $\hat h_n$ (in $L^2([0,1], P_X)$) goes to $+\infty$ with probability one.

Proof: Choose $X$ so that $P_X$ is Lebesgue measure. Fix $\Gamma = \{f_1, \cdots, f_n, \cdots\}$ dense in $L^2([0,1], P_X)$. Let $B_1, \cdots, B_n, \cdots$ be a sequence of measurable subsets of $[0,1]$, each of which has positive probability, such that

$$P(\exists\, 1 \le i \le n\colon X_i \in B_n, \text{ i.o. for } n) = 0.$$

This condition can be achieved, for instance, if the $B$'s satisfy

$$\sum_{k=1}^{\infty}\left[1 - (1 - P_X(B_k))^k\right] < \infty.$$

Now for $i = 1, 2, \cdots$, define $g_i(x)$ as

$$g_i(x) = \begin{cases} f_i(x), & \text{if } x \notin B_i\\ A_i, & \text{if } x \in B_i. \end{cases}$$

We first select $A_1$ such that $E(g_1 - f_n)^2 > 0$ for all $n \in \mathbb{N}$. This can be done since there are only countably many $f$'s while there are uncountably many choices of $A_1$. We then inductively select $A_i$ such that $E(g_i - f_n)^2 > 0$, for all $n \in \mathbb{N}$, and $E(g_i - g_k)^2 > 0$, for $k = 1, \cdots, i-1$. We also require of $A_i$ that $Eg_i^2 \to +\infty$. Then $g_1, g_2, \cdots$ are distinct and none of them are in $\Gamma$. Modify $\Gamma$ to include $g_1, g_2, \cdots$. Define $L\colon \Gamma \to \mathbb{N}$ such that $L(f_n) > L(g_n)$ and

$$\sum_{f \in \Gamma} e^{-\epsilon L(f)} < \infty, \quad \text{for every } \epsilon > 0.$$

Now given $Y$, with $h(x) = E(Y \mid X = x) \in L^2([0,1], P_X)$ and $h(x) \notin \Gamma$, the set of $\omega$ which satisfies

$$\frac{1}{n}\sum_{i=1}^{n}(Y_i - f(X_i))^2 \to E(h(X) - f(X))^2 + E(Y - h(X))^2, \quad \forall f \in \Gamma$$

and

$$X_i(\omega) \notin B_n, \quad \forall\, 1 \le i \le n, \ \forall \text{ large } n$$

is of probability one. For any $\omega$ in this set, let

$$I_n(\omega) = \arg\min_k\left[\frac{L(f_k)}{n} + \frac{1}{n}\sum_{i=1}^{n}(Y_i - f_k(X_i))^2\right].$$

Then since $h \notin \Gamma$, $I_n(\omega) \to \infty$ as $n \to \infty$. For large $n$, $X_i(\omega) \notin B_{I_n(\omega)}$ for all $1 \le i \le I_n(\omega)$, and hence

$$g_{I_n(\omega)}(X_i(\omega)) = f_{I_n(\omega)}(X_i(\omega)), \quad \forall\, 1 \le i \le I_n(\omega).$$

Therefore, for large $n$

$$\frac{L(g_{I_n(\omega)})}{n} + \frac{1}{n}\sum_{i=1}^{n}(Y_i - g_{I_n(\omega)}(X_i))^2 < \frac{L(f_{I_n(\omega)})}{n} + \frac{1}{n}\sum_{i=1}^{n}(Y_i - f_{I_n(\omega)}(X_i))^2$$

so that, for large $n$, the minimizing function $\hat h_n$ is one of the $g$'s. Since any fixed $g_j$ is eventually excluded (its criterion converges to a limit strictly larger than the minimum achievable over $\Gamma$), the index of the selected $g$ tends to infinity; and because $Eg_i^2 \to +\infty$, the $L^2$ norm of $\hat h_n$ goes to $+\infty$ with probability one.

The proofs of Propositions 1 and 3 rest on the following elementary exponential bound.

Lemma 1: Suppose $\delta > 0$. Let $Z_1, Z_2, \cdots, Z_n$ be a sequence of independent and identically distributed (i.i.d.) random variables satisfying a) $Z_1 \ge 0$; b) $EZ_1^2 < \infty$. Then if

$$K \ge (\operatorname{Var}(Z_1) + \delta^2)e^{EZ_1} \quad \text{and} \quad K \ge \delta$$

then

$$P\left(\frac{1}{n}\sum_{i=1}^{n}(Z_i - EZ_1) \le -\delta\right) \le \left(1 - \frac{\delta^2}{2K}\right)^n.$$

Proof: For any $t \in (0, 1]$

$$P\left(\frac{1}{n}\sum_{i=1}^{n}(Z_i - EZ_1) \le -\delta\right) \le \left(Ee^{t(-Z_1 + EZ_1 - \delta)}\right)^n.$$

Let $\phi(t) = Ee^{t(-Z_1 + EZ_1 - \delta)}$. Then $\phi(0) = 1$, $\phi'(0) = -\delta$, and

$$\phi''(t) = E\left((Z_1 - EZ_1 + \delta)^2 e^{t(-Z_1 + EZ_1 - \delta)}\right) \le (\operatorname{Var}(Z_1) + \delta^2)e^{EZ_1} \le K, \quad \text{for } t \in (0, 1].$$

Hence

$$\phi'(t) \le -\delta + Kt \quad \text{and} \quad \phi(t) \le 1 - \delta t + \tfrac{1}{2}Kt^2, \quad \text{for } t \in (0, 1].$$

Taking $t = \delta/K$, which lies in $(0, 1]$ because $K \ge \delta$, gives

$$P\left(\frac{1}{n}\sum_{i=1}^{n} Z_i < EZ_1 - \delta\right) \le \left(1 - \frac{\delta^2}{2K}\right)^n.$$

Lemma 2: Suppose $EY^4 < \infty$ and $h(x) = E(Y \mid X = x) \in L^2([0,1], P_X)$. Fix $\epsilon > 0$ and $M > 0$ with $E(h - h_M)^2 < \epsilon$, where $h_M$ denotes $h$ truncated to $[-M, M]$, and suppose that $h_M \in \Gamma$ and that every $f \in \Gamma$ is bounded by $M$. Then with probability one

$$E(\hat h_n - h_M)^2 \le 9\epsilon, \quad \text{for all sufficiently large } n.$$

Proof: For $f \in \Gamma$, define

$$B = B(h_M) = \left\{f \in \Gamma\colon E(f(X) - Y)^2 \ge E(h_M(X) - Y)^2 + 3\epsilon\right\}$$

$$T_{f,n} = T_{f,n}(h_M) = \left\{\frac{1}{n}\sum_{i=1}^{n}(f(X_i) - Y_i)^2 - E(f(X) - Y)^2 \le -\epsilon - \frac{L(f)}{n} = -\delta_{f,n}\right\}$$

$$R_n = R_n(h_M) = \left\{\frac{1}{n}\sum_{i=1}^{n}(h_M(X_i) - Y_i)^2 \le E(h_M(X) - Y)^2 + \epsilon\right\}$$

$$V_n = V_n(h_M) = \{\hat h_n \in B\}.$$

By the strong law of large numbers, $R_n$ is true with probability one for sufficiently large $n$; and since eventually $L(h_M)/n \le \epsilon$, the minimizing property of $\hat h_n$ gives, on $R_n \cap V_n$ with $\hat h_n = f \in B$,

$$\frac{1}{n}\sum_{i=1}^{n}(f(X_i) - Y_i)^2 + \frac{L(f)}{n} \le \frac{1}{n}\sum_{i=1}^{n}(h_M(X_i) - Y_i)^2 + \epsilon \le E(h_M(X) - Y)^2 + 2\epsilon \le E(f(X) - Y)^2 - \epsilon.$$

Hence $R_n \cap V_n \subset \cup_{f \in B}(R_n \cap T_{f,n})$. Furthermore,

$$\frac{L(f)}{n} \le \frac{1}{n}\sum_{i=1}^{n}(h_M(X_i) - Y_i)^2 + \epsilon$$

so that on $R_n$

$$\delta_{f,n} \le 2\epsilon + \frac{1}{n}\sum_{i=1}^{n}(h_M(X_i) - Y_i)^2 \le 3\epsilon + E(h_M(X) - Y)^2 = H.$$

Fix $K$ such that

$$K \ge (E(M + |Y|)^4 + H^2)e^{E(M + |Y|)^2} \quad \text{and} \quad \delta_{f,n} < K.$$

Then by Lemma 1, applied with $Z_i = (f(X_i) - Y_i)^2$, for any $f \in B$ with $R_n \cap T_{f,n} \ne \emptyset$

$$P(R_n \cap T_{f,n}) \le \left(1 - \frac{\delta_{f,n}^2}{2K}\right)^n \le \left(1 - \frac{(L(f)/n + \epsilon)^2}{2K}\right)^n.$$

Using $1 - x < e^{-x}$ and $(L(f)/n + \epsilon)^2 \ge \epsilon^2 + 2\epsilon L(f)/n$, this is bounded by

$$\exp\left(-\frac{n\epsilon^2}{2K}\right)\exp\left(-\frac{\epsilon L(f)}{K}\right).$$

Therefore

$$P(R_n \cap V_n) \le \sum_{f \in B} P(R_n \cap T_{f,n}) \le \exp\left(-\frac{n\epsilon^2}{2K}\right)\sum_{f \in \Gamma}\exp\left(-\frac{\epsilon L(f)}{K}\right)$$

and by the strong tail condition (4), $\sum_{f}\exp(-\epsilon L(f)/K) < \infty$. Since $K$ is independent of $n$, $P(R_n \cap V_n)$ is exponentially small and $\sum_n P(R_n \cap V_n)$ converges. By Borel–Cantelli, $P(\limsup_n V_n) = 0$; that is, with probability one, eventually $E(\hat h_n(X) - Y)^2 < E(h_M(X) - Y)^2 + 3\epsilon$. Writing $E(\hat h_n - Y)^2 - E(h_M - Y)^2 = E(\hat h_n - h_M)^2 + 2E[(\hat h_n - h_M)(h_M - h)]$ and using $E(h - h_M)^2 < \epsilon$ then gives $E(\hat h_n - h_M)^2 \le 9\epsilon$ eventually, with probability one.

Lemma 3: Let $\mu$ be a finite measure, and let $f$ and $f_n$, $n = 1, 2, \cdots$, be measurable functions, with $f$ finite $\mu$-a.s. If

$$\liminf_{M \to \infty}\limsup_{n \to \infty} E(f_{n,M} - f_M)^2 = 0$$

then $f_n \xrightarrow{\mu} f$.

Proof: Suppose $M_k \to \infty$ is a sequence such that

$$\lim_{k \to \infty}\limsup_{n \to \infty} E(f_{n,M_k} - f_{M_k})^2 = 0.$$

Fix $\epsilon > 0$. Then

$$\mu(\{|f_n - f| > \epsilon\}) \le \mu(\{|f| \ge M_k - \epsilon\}) + \mu(\{|f| < M_k - \epsilon,\ |f_{n,M_k} - f_{M_k}| > \epsilon\}) \le \mu(\{|f| \ge M_k - \epsilon\}) + \frac{1}{\epsilon^2}E(f_{n,M_k} - f_{M_k})^2.$$

Let $n \to \infty$ and then $k \to \infty$ to complete the proof.

Proposition 1: If $EY^4 < \infty$, then

$$\hat h_n \xrightarrow{P} h, \quad \text{a.s.}$$

Proof: The idea is to choose $M_k \to \infty$ and then truncate the functions in $\Gamma$ at level $M_k$. Then by Lemma 2, we will get $E(\hat h_{n,M} - h_M)^2 \to 0$, where $\hat h_{n,M}$ is the truncated $\hat h_n$ and $h_M$ is the truncated $h$. We then use Lemma 3 to get $\hat h_n \xrightarrow{P} h$.

Filling in the details, given $\epsilon > 0$, there is $M = M(\epsilon) > 0$ such that $E(h - h_M)^2 < \epsilon$ and

$$\int_{|Y| > M}(|Y| + M)^2\, dP < \epsilon.$$

Consider

$$\frac{1}{n}\sum_{i=1}^{n}(Y_i - \hat h_{n,M}(X_i))^2.$$

Since $|Y_i - \hat h_{n,M}(X_i)| > |Y_i - \hat h_n(X_i)|$ implies $|Y_i| > M$,

$$\frac{1}{n}\sum_{i=1}^{n}(Y_i - \hat h_{n,M}(X_i))^2 \le \frac{1}{n}\sum_{i=1}^{n}(Y_i - \hat h_n(X_i))^2 + \frac{1}{n}\sum_{i=1}^{n}(|Y_i| + M)^2 \cdot I_{|Y_i| > M}.$$

With probability one, for sufficiently large $n$

$$\frac{1}{n}\sum_{i=1}^{n}(|Y_i| + M)^2 \cdot I_{|Y_i| > M} \le \int_{|Y| > M}(|Y| + M)^2\, dP + \epsilon < 2\epsilon.$$

Let

$$\Gamma_M = \{f_M\colon f \in \Gamma\} \cup \{h_M\}$$

which is a countable dense subset of $L^2([0,1], P_X) \cap \{\|f\|_\infty \le M\}$, and define $L'\colon \Gamma_M \to \mathbb{N}$ as

$$L'(\cdot) = \min\{L(f)\colon f_M = \cdot,\ f \in \Gamma\}$$

(with $L'(h_M)$ set to a fixed finite value if $h_M$ is not of the form $f_M$); $L'$ satisfies the strong tail condition (4). Since $L'(\hat h_{n,M}) \le L(\hat h_n)$, with probability one, for large $n$

$$\frac{L'(\hat h_{n,M})}{n} + \frac{1}{n}\sum_{i=1}^{n}(Y_i - \hat h_{n,M}(X_i))^2 \le \frac{L(\hat h_n)}{n} + \frac{1}{n}\sum_{i=1}^{n}(Y_i - \hat h_n(X_i))^2 + 2\epsilon \le \frac{1}{n}\sum_{i=1}^{n}(Y_i - h_M(X_i))^2 + 3\epsilon.$$

According to Lemma 2 (whose proof applies, with $\Gamma_M$ and $L'$ in place of $\Gamma$ and $L$, to the near-minimizer $\hat h_{n,M}$), with probability one, for sufficiently large $n$

$$E(\hat h_{n,M} - h_M)^2 \le 9\epsilon.$$

Let $S(\epsilon)$ be the set of points for which the above relation eventually holds, i.e.,

$$S(\epsilon) = \liminf_{n \to \infty}\left\{\omega\colon E(\hat h_{n,M} - h_M)^2 \le 9\epsilon\right\}$$

so that $P(S(\epsilon)) = 1$. Choose a sequence $\epsilon_n \to 0$, and let $M_n = M(\epsilon_n)$ and $S_n = S(\epsilon_n)$. Then on $S = \cap_n S_n$, which has probability one,

$$\limsup_{k \to \infty}\limsup_{n \to \infty} E(\hat h_{n,M_k} - h_{M_k})^2 = 0.$$

By Lemma 3, for any $\omega \in S$, $\hat h_n \xrightarrow{P} h$, which completes the proof.

Proposition 3: Suppose that for every $f \in \Gamma$, $Ef^4(X) < \infty$. Assume $EY^4$ is finite (and hence so is $Eh^4(X)$). Construct a complexity function as follows: First, define

$$C_1(f) = (Ef^4(X) + e)e^{2Ef^2(X)} \quad \text{and} \quad C(f) = C_1(f)\log C_1(f).$$

Then, given any $L_1\colon \Gamma \to \mathbb{N}$ which satisfies (4), let $L(f) = C(f)L_1(f)$. Then

$$\hat h_n \xrightarrow{L^2} h \quad \text{a.s.}$$

Proof: We will follow closely the proof and the notation of Lemma 2. As in Lemma 2, we need to show that $P(\limsup_n V_n) = 0$. Fixing a number $D = D(Y, h, \epsilon)$, which will be determined later, we first decompose $V_n$ as

$$V_n = V_n' \cup V_n'', \quad V_n' = V_n \cap \{L_1(\hat h_n) \ge D\}, \quad V_n'' = V_n \cap \{L_1(\hat h_n) < D\}.$$

Since there are only finitely many $f$ with $L_1(f) < D$, by the strong law of large numbers, $P(\limsup_n V_n'') = 0$. Thus in order to get $P(\limsup_n V_n) = 0$, we need only show that $P(\limsup_n V_n') = 0$. Similar to Lemma 2, it is enough to check

$$\sum_{n=1}^{\infty} P(V_n' \cap R_n) < \infty.$$

Derive again the constant $H$, as in the proof of Lemma 2. Then for each $f$ define $K(f)$ as

$$K(f) = (\operatorname{Var}((f(X) - Y)^2) + H^2)e^{E(f(X) - Y)^2}.$$

Then for any $f \in B$ with $R_n \cap T_{f,n} \ne \emptyset$, as in the proof of Lemma 2,

$$P(R_n \cap T_{f,n}) \le \left(1 - \frac{\delta_{f,n}^2}{2K(f)}\right)^n$$

and hence

$$\sum_{n=1}^{\infty} P(R_n \cap T_{f,n}) \le \frac{2K(f)}{\epsilon^2}\exp\left(-\frac{\epsilon C(f)L_1(f)}{K(f)}\right) = \frac{2}{\epsilon^2}\exp\left(L_1(f)J(f, \epsilon)\right)$$

where

$$J(f, \epsilon) = -\frac{\epsilon C(f)}{K(f)} + \frac{\log K(f)}{L_1(f)}.$$

It is easy to see that there is a constant $c = c(Y, h) > 0$ such that $C(f) \ge cK(f)\log K(f) > 0$. Now choose $D = D(Y, h, \epsilon)$ such that $c\epsilon D \ge 2$. Then for $L_1(f) \ge D$, since $K(f) > e$,

$$\frac{\log K(f)}{L_1(f)} \le \frac{\log K(f)}{D} \le \frac{c\epsilon \log K(f)}{2} \le \frac{\epsilon C(f)}{2K(f)}$$

so

$$J(f, \epsilon) \le -\frac{\epsilon C(f)}{2K(f)} \le -\frac{c\epsilon \log K(f)}{2} \le -\frac{c\epsilon}{2}.$$

Hence

$$\sum_{n=1}^{\infty} P(R_n \cap V_n') \le \frac{2}{\epsilon^2}\sum_{f \in \Gamma} e^{-c\epsilon L_1(f)/2} < \infty$$

by the strong tail condition (4) applied to $L_1$. We can now conclude that for any $0 < \epsilon < 1$, the set

$$S(\epsilon) = \{\omega\colon E(\hat h_n - h)^2 < 3\epsilon, \text{ for sufficiently large } n\}$$

has probability one. Finally, then, for $\omega \in \cap_{k=1}^{\infty} S(k^{-1})$,

$$E(\hat h_n - h)^2 \to 0 \quad \text{as } n \to \infty$$

that is, $\hat h_n \xrightarrow{L^2} h$ a.s.