Communications in Statistics—Theory and Methods, 35: 1971–1983, 2006 Copyright © Taylor & Francis Group, LLC ISSN: 0361-0926 print/1532-415X online DOI: 10.1080/03610920600762780

Ordered Data Analysis

Statistical Evidence in Experiments and in Record Values

A. HABIBI, N. R. ARGHAMI, AND J. AHMADI

Department of Statistics, School of Mathematical Sciences, Ferdowsi University of Mashhad, Mashhad, Iran

According to the law of likelihood, statistical evidence for one (simple) hypothesis against another is measured by their likelihood ratio. When the experimenter can choose between two or more experiments (of approximately the same cost) to obtain data, he would want to know which experiment provides (on average) stronger true evidence for one hypothesis against another. In this article, after defining a pre-experimental criterion for the potential strength of evidence provided by an experiment, based on entropy distance, we compare the potential statistical evidence in lower record values with that in the same number of iid observations from the same parent distribution. We also establish a relation between Fisher information and Kullback–Leibler distance.

Keywords: Exponentially twisted; Fisher information; Kullback–Leibler information; Law of likelihood; Record values; Statistical evidence.

Mathematics Subject Classification: Primary 62A10; Secondary 62B10.

1. Introduction and Preliminaries

Let $p(x; \theta)$ be the joint probability density function (pdf) of $n$ iid observations from a distribution with pdf $f(x; \theta)$. Then the likelihood ratio

$$R_E(x) = \frac{p(x; \theta_1)}{p(x; \theta_0)}$$

measures the strength of evidence favorable to the simple hypothesis $H_1: \theta = \theta_1$ against the simple hypothesis $H_0: \theta = \theta_0$ (Royall, 1997, 2000).

Received February 12, 2005; Accepted February 24, 2006.
Address correspondence to N. R. Arghami, Department of Statistics, School of Mathematical Sciences, Ferdowsi University of Mashhad, P.O. Box 1159, Mashhad 91775, Iran; E-mail: [email protected]


Suppose $E_1$ and $E_2$ are two experiments (or sampling schemes) with (approximately) the same cost, having outcomes $x$ and $y$, which are the realizations of random vectors $X$ and $Y$, with densities $p(x; \theta)$ and $q(y; \theta)$, respectively, where $\theta$ is an unknown parameter. When the objective of the study is to produce statistical evidence for one hypothesis against another (in the above sense), it is desirable to have a measure of performance of the experiments $E_1$ and $E_2$. This can be defined as

$$S_\psi(E) = E_{\theta_1}\big[\psi(R_E(X))\big] + E_{\theta_0}\big[\psi(1/R_E(X))\big],$$

where $\psi(\cdot)$ is a non-decreasing function. If

$$\psi(t) = \begin{cases} 1, & t \ge K, \\ 0, & t < K, \end{cases}$$

then $S_\psi(E)$ is the sum of the probabilities of observing strong true evidence under $H_1$ and $H_0$, where $K$ is arbitrary and is usually between 8 and 32 (Royall, 1997). If $\psi(t) = t/(1+t)$, then $S_\psi(E) = \mathrm{abc}(E)$, the area between the cumulative distribution function (cdf) curves (under $H_1$ and $H_0$) of $R_E(x)/(1 + R_E(x))$ (Emadi and Arghami, 2003). If $\psi(t) = \log(t)$, then

$$S_\psi(E) = E_{\theta_1}\left[\log \frac{p(X; \theta_1)}{p(X; \theta_0)}\right] + E_{\theta_0}\left[\log \frac{p(X; \theta_0)}{p(X; \theta_1)}\right] = D(p_{\theta_1}, p_{\theta_0}) + D(p_{\theta_0}, p_{\theta_1}) = J(p_{\theta_1}, p_{\theta_0}),$$

where $D(p_{\theta_1}, p_{\theta_0})$ and $J(p_{\theta_1}, p_{\theta_0})$ are, respectively, the asymmetric and the symmetric Kullback–Leibler (K–L) distance (information) of $p_{\theta_1}$ and $p_{\theta_0}$. In this article, it is the last of the above three criteria that we shall use, and it is what we shall mean by $S(E)$.

Example 1.1 (Bernoulli Trials). Let $E_1$ and $E_2$ be as follows.

$E_1$: Take a random sample of size $n$ from a Bernoulli distribution with parameter $\theta$ ($B(\theta)$).

$E_2$: Continue sampling from a $B(\theta)$ distribution until $p_{\theta_1}(x)/p_{\theta_0}(x) < 1/K$ or $p_{\theta_1}(x)/p_{\theta_0}(x) > K$.

The question may be which one of $E_1$ and $E_2$ has more potential statistical evidence regarding the unknown parameter $\theta$. We have

$$S(E_1) = E_{\theta_1}\left[\log \frac{\theta_1^X (1-\theta_1)^{n-X}}{\theta_0^X (1-\theta_0)^{n-X}}\right] + E_{\theta_0}\left[\log \frac{\theta_0^X (1-\theta_0)^{n-X}}{\theta_1^X (1-\theta_1)^{n-X}}\right] = n(\theta_1 - \theta_0)\log \frac{\theta_1(1-\theta_0)}{\theta_0(1-\theta_1)}.$$


Ignoring the "overshoot" we have

$$S(E_2) \approx \log(K)\, P_{\theta_1}(R > K) + \log\!\left(\frac{1}{K}\right) P_{\theta_1}\!\left(R < \frac{1}{K}\right) + \log(K)\, P_{\theta_0}\!\left(R < \frac{1}{K}\right) + \log\!\left(\frac{1}{K}\right) P_{\theta_0}(R > K)$$
$$\approx 2\log(K)\left(1 - \frac{1}{K+1}\right) - 2\log(K)\,\frac{1}{1+K} = 2\log(K)\,\frac{K-1}{K+1},$$

where the last approximate equality follows from the Wald inequalities in the theory of the SPRT (Rohatgi, 1976, pp. 616–617). Experiments $E_1$ and $E_2$ will have, on average, approximately the same cost if $n = E_\theta(N)$, where $N$ is the final sample size of $E_2$. If $\theta_0 = 1/3$ and $\theta_1 = 2/3$, then

$$E_{\theta_i}(N) \approx \frac{\left(1 - \frac{2}{K+1}\right)\log K}{\frac{1}{3}\log 2}, \quad i = 0, 1.$$

So with $n$ equal to the integer part of $\left(1 - \frac{2}{K+1}\right)\log K \big/ \left(\frac{1}{3}\log 2\right)$ we have $n \approx E_{\theta_i}(N)$, $i = 0, 1$, and

$$S(E_1) \approx \frac{\frac{K-1}{K+1}\log K}{\frac{1}{3}\log 2} \times \frac{2}{3}\log 2 = 2\log(K)\,\frac{K-1}{K+1} = S(E_2).$$
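As a quick numerical sanity check of the approximate equality above, the following minimal Python sketch (not part of the original article; the choices $\theta_0 = 1/3$, $\theta_1 = 2/3$, $K = 8$ are illustrative) simulates experiment $E_2$ and compares the simulated $S(E_2)$ with $S(E_1)$ computed for $n \approx E(N)$ and with the Wald approximation $2\log K\,(K-1)/(K+1)$.

```python
import math
import random

# Illustrative sketch (not from the paper): simulate experiment E2 of
# Example 1.1 and compare S(E2) with S(E1) and with the Wald approximation.
theta0, theta1, K = 1 / 3, 2 / 3, 8.0
log_up = math.log(theta1 / theta0)                  # log-LR contribution of a success
log_down = math.log((1 - theta1) / (1 - theta0))    # log-LR contribution of a failure

def run_e2(theta, rng):
    """Sample Bernoulli(theta) until the log likelihood ratio leaves (-log K, log K)."""
    s, n = 0.0, 0
    while -math.log(K) < s < math.log(K):
        s += log_up if rng.random() < theta else log_down
        n += 1
    return s, n

rng = random.Random(1)
reps = 20_000
runs1 = [run_e2(theta1, rng) for _ in range(reps)]   # data generated under H1
runs0 = [run_e2(theta0, rng) for _ in range(reps)]   # data generated under H0

# S(E2) = E_theta1[log R] + E_theta0[log(1/R)]
s_e2 = (sum(s for s, _ in runs1) / reps) - (sum(s for s, _ in runs0) / reps)

# S(E1) with n = E[N]: n times the symmetric K-L distance per Bernoulli observation.
n_fixed = int(sum(n for _, n in runs1) / reps)
j_per_obs = (theta1 - theta0) * math.log(theta1 * (1 - theta0) / (theta0 * (1 - theta1)))
s_e1 = n_fixed * j_per_obs

print("S(E2) simulated     :", round(s_e2, 3))
print("S(E1) with n =", n_fixed, ":", round(s_e1, 3))
print("2 log K (K-1)/(K+1) :", round(2 * math.log(K) * (K - 1) / (K + 1), 3))
```

The simulated $S(E_2)$ is slightly larger than the approximation because the overshoot of the boundaries is ignored in the Wald calculation.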

This is in contrast to sequential testing, in which $E_2$ has smaller error probabilities. $E_2$ may be preferable to $E_1$ because $E_2$ has no probability of weak evidence (that is, $P_{\theta_i}(1/K < R < K) = 0$, $i = 0, 1$). On the other hand, $E_1$ has the advantage of having a fixed sample size.

Example 1.2 (Strength of Wooden Beams, Glick, 1978). Pressure is continuously increased until the beam breaks. The cost of the experiment is assumed to be equal to the number of broken beams. Let $E_1$ and $E_2$ be as follows.

$E_1$: We measure the strength of each one of $n$ wooden beams, so the number of broken beams is equal to $n$.

$E_2$: We break only the beams that are weaker than all previous beams. We continue until $n$ beams are broken, so the number of broken beams is equal to the number of lower record values.

The question is: "Do the first $n$ lower record values have more (less, equal) expected statistical true evidence than the same number, $n$, of iid observations from the same parent population?" Ahmadi and Arghami (2001, 2003) classified many classic families of distributions into three classes, RMI, RLI, and REI, according to whether record values contained More, Less, or an Equal amount of Fisher information when compared with the same number of iid observations.


A similar classification was carried out by Hofmann (2004). It is desirable to make a similar classification on the basis of K–L information. The rest of the article is organized as follows. In Sec. 2, we prove that for two exponential families, their order in terms of K–L information is the same as their order in terms of Fisher information. In Sec. 3, we establish a general relation between K–L and Fisher information. In Sec. 4, we compare some specific families of distributions with the families of distributions of their respective record values on the basis of K–L information.

2. Utilizing a Relation between K–L and Fisher Information

Comparing two families of distributions with respect to their Fisher information is often easier than establishing their order in terms of K–L distance. Theorem 2.2 in this section uses an established link between K–L and Fisher information to prove a result for exponential families.

Theorem 2.1 (Kullback, 1959, p. 27).

(i) $\displaystyle\lim_{\Delta \to 0} \frac{1}{\Delta^2}\, D(p_{\theta_0 + \Delta}, p_{\theta_0}) = \frac{1}{2}\, I_X(\theta_0)$;

(ii) $\displaystyle\left.\frac{\partial^2}{\partial \theta^2}\, D(p_\theta, p_{\theta_0})\right|_{\theta = \theta_0} = I_X(\theta_0)$,

where $I_X(\theta)$ is the Fisher information function of the family of distributions of $X$.

Definition 2.1. Let $p_0$ and $p_1$ be two densities with common support $\mathcal{X}$. Then the family of densities defined by

$$p_t(x) = \frac{1}{N(t)}\, p_0(x) \left(\frac{p_1(x)}{p_0(x)}\right)^{t}, \quad x \in \mathcal{X},\ t \in [0, 1],$$

is called the exponentially twisted family of densities of $p_0$ and $p_1$ and is denoted by ET$(p_0, p_1)$. Here

$$N(t) = \int p_0(x)^{1-t}\, p_1(x)^{t}\, dx < \infty.$$

Since we can write

$$p_t(x) = \frac{1}{N(t)}\, p_0(x)\, \exp\left\{ t \log \frac{p_1(x)}{p_0(x)} \right\},$$

the above family is an exponential family (for any pair of densities $p_0$ and $p_1$). The following relations were proved, under some mild conditions, by Dabak and Johnson (2002):

$$\text{(i)}\ \ D(p_1, p_0) = \int_0^1 u\, I^*(u)\, du, \qquad \text{(ii)}\ \ J(p_1, p_0) = \int_0^1 I^*(u)\, du, \tag{2.1}$$


where

$$I^*(t) = \int_{\mathcal{X}} \left(\frac{\partial}{\partial t} \log p_t(x)\right)^2 p_t(x)\, dx$$

is the Fisher information of the family ET$(p_0, p_1)$. Since ET$(p_0, p_1)$ is an exponential family, we have

$$I^*(t) = \frac{N''(t)\, N(t) - N'(t)^2}{N(t)^2}.$$

Lemma 2.1. If $p_0 = p(x; \theta_0)$ and $p_1 = p(x; \theta_1)$ are two members of the exponential family $P = \{p(x; \theta);\ \theta \in \Theta\}$, then ET$(p_0, p_1)$ is a subfamily of $P$.

Proof. Suppose $p(x; \theta) = c(\theta)\, h(x)\, e^{\theta T(x)}$, $x \in \mathcal{X}$, $\theta \in \Theta$; then

$$p_t(x) \propto e^{[\theta_0 + (\theta_1 - \theta_0) t]\, T(x)}\, h(x),$$

which is a member of $P$ with $\theta = \theta_0 + (\theta_1 - \theta_0) t$. □

Theorem 2.2. If $P = \{p(x; \theta);\ x \in \mathcal{X},\ \theta \in \Theta\}$ and $Q = \{q(y; \theta);\ y \in \mathcal{Y},\ \theta \in \Theta\}$ are exponential families, then the following three statements are equivalent:

(i) $I_X(\theta) > (<, =)\ I_Y(\theta)$ for all $\theta \in \Theta$;
(ii) $D(p_{\theta_1}, p_{\theta_0}) > (<, =)\ D(q_{\theta_1}, q_{\theta_0})$ for all $\theta_0, \theta_1 \in \Theta$;
(iii) $J(p_{\theta_1}, p_{\theta_0}) > (<, =)\ J(q_{\theta_1}, q_{\theta_0})$ for all $\theta_0, \theta_1 \in \Theta$,

where $p_\theta$ and $q_\theta$ are members of $P$ and $Q$, respectively.

Proof. Assuming $\theta_0 < \theta_1$, by (2.1) we can write

$$D(p_{\theta_1}, p_{\theta_0}) = \int_0^1 u\, I_X^*(u)\, du.$$

But

$$I_X^*(u) = I_X\big(\theta_0 + (\theta_1 - \theta_0) u\big) \left[\frac{d}{du}\big(\theta_0 + (\theta_1 - \theta_0) u\big)\right]^2$$

(Lehmann, 1983, p. 118). Thus

$$D(p_{\theta_1}, p_{\theta_0}) = (\theta_1 - \theta_0)^2 \int_0^1 u\, I_X\big(\theta_0 + (\theta_1 - \theta_0) u\big)\, du = \int_{\theta_0}^{\theta_1} (\theta - \theta_0)\, I_X(\theta)\, d\theta.$$

Thus the result follows. □
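As an illustration of how (2.1) is used in the proof above, the following short Python sketch (not from the paper; the exponential-distribution example and all names are assumptions made for illustration) checks numerically that $D(p_{\theta_1}, p_{\theta_0}) = \int_{\theta_0}^{\theta_1} (\theta - \theta_0)\, I_X(\theta)\, d\theta$ for the exponential family with rate $\theta$.

```python
import math
from scipy.integrate import quad

# Illustrative sketch (not from the paper): check the identity
#   D(p_theta1, p_theta0) = integral_{theta0}^{theta1} (theta - theta0) * I_X(theta) d(theta)
# used in the proof of Theorem 2.2, for Exp(rate = theta) distributions.
theta0, theta1 = 1.0, 2.5

# Closed-form K-L distance: D(Exp(theta1) || Exp(theta0)) = log(theta1/theta0) + theta0/theta1 - 1.
d_closed = math.log(theta1 / theta0) + theta0 / theta1 - 1

# Fisher information of Exp(theta) is I_X(theta) = 1 / theta**2.
d_via_fisher, _ = quad(lambda t: (t - theta0) / t**2, theta0, theta1)

print("closed form:", d_closed)      # ~0.316
print("via (2.1)  :", d_via_fisher)  # should agree to numerical precision
```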


3. A General Theorem

Classifying a distribution family with respect to K–L distance is not usually (algebraically) easy, while it is relatively easy with respect to Fisher information.


In this section, we shall establish a general relation between K–L distance and Fisher information which enables us to classify families with respect to K–L distance, that is, with respect to the $D$ distance (and thus with respect to the $J$ distance).

Theorem 3.1. Let $\{p_\theta;\ \theta \in \Theta\}$ and $\{q_\theta;\ \theta \in \Theta\}$ be two families of distributions of $X$ and $Y$, respectively. Assume that $I_X(\theta)$, $D_X(p_{\theta_1}, p_{\theta_0})$, $I_Y(\theta)$, and $D_Y(q_{\theta_1}, q_{\theta_0})$ are the Fisher information and the K–L information of $X$ and $Y$, respectively, and that they are all continuous functions of their arguments. Assume that:

(i) $I_X(\theta) - I_Y(\theta) \ge d_0 > 0$ for all $\theta \in I = [\theta_0, \theta_1]$;
(ii) the third derivatives of $D_X(p_{\theta+\Delta}, p_\theta)$ and $D_Y(q_{\theta+\Delta}, q_\theta)$ with respect to (w.r.t.) $\Delta$ are bounded for every $\theta \in I$ and every $\Delta$ in the neighborhood $I_0 = (0, c]$ of zero.

Then

$$D_X(p_{\theta_1}, p_{\theta_0}) > D_Y(q_{\theta_1}, q_{\theta_0}).$$

To prove Theorem 3.1, we need the following lemmas.

Lemma 3.1. Let $\theta$ be fixed, and let

$$h_\theta(\Delta) = D_X(p_{\theta+\Delta}, p_\theta) = \int_{\mathcal{X}} \log\left(\frac{p_{\theta+\Delta}(x)}{p_\theta(x)}\right) p_{\theta+\Delta}(x)\, dx.$$

Then

(i) $h_\theta(0) = 0$;
(ii) $h_\theta'(0) = 0$;
(iii) $H_\theta(\Delta) = \frac{1}{2}\, h_\theta'''(\Delta_2) - \frac{1}{3}\, h_\theta'''(\Delta_1)$ for some $0 < \Delta_1, \Delta_2 < \Delta$, where $H_\theta(\Delta) = \frac{\partial}{\partial \Delta}\left[\frac{1}{\Delta^2}\, h_\theta(\Delta)\right]$.

Proof. (i) and (ii) are known properties of the K–L distance. (iii) We have

$$H_\theta(\Delta) = \frac{-2}{\Delta^3}\, h_\theta(\Delta) + \frac{1}{\Delta^2}\, h_\theta'(\Delta).$$

By Taylor expansion we have

$$H_\theta(\Delta) = \frac{-2}{\Delta^3}\left[h_\theta(0) + h_\theta'(0)\Delta + \frac{h_\theta''(0)}{2}\Delta^2 + \frac{h_\theta'''(\Delta_1)}{6}\Delta^3\right] + \frac{1}{\Delta^2}\left[h_\theta'(0) + h_\theta''(0)\Delta + \frac{h_\theta'''(\Delta_2)}{2}\Delta^2\right]$$
$$= \frac{1}{2}\, h_\theta'''(\Delta_2) - \frac{1}{3}\, h_\theta'''(\Delta_1). \qquad \square$$

Corollary 3.1. If the third derivative of $D_X(p_{\theta+\Delta}, p_\theta)$ w.r.t. $\Delta$ is bounded in $B = I_0 \times I$, then $H_\theta(\Delta)$ is bounded in $B$.


Corollary 3.2. If the third derivatives of $D_X(p_{\theta+\Delta}, p_\theta)$ and $D_Y(q_{\theta+\Delta}, q_\theta)$ w.r.t. $\Delta$ are bounded in $B$, then $\frac{\partial}{\partial \Delta} G(\Delta, \theta)$ is bounded in $B$, where

$$G(\Delta, \theta) = \frac{1}{\Delta^2}\left[D_X(p_{\theta+\Delta}, p_\theta) - D_Y(q_{\theta+\Delta}, q_\theta)\right].$$

Lemma 3.2. Under the assumptions of Theorem 3.1, there exists $\Delta_0 > 0$ such that for every $\theta \in I$ and every $\Delta \le \Delta_0$,

$$D_X(p_{\theta+\Delta}, p_\theta) - D_Y(q_{\theta+\Delta}, q_\theta) > \frac{d_0}{8}.$$

Proof. Let

$$G(\Delta, \theta) = \frac{1}{\Delta^2}\left[D_X(p_{\theta+\Delta}, p_\theta) - D_Y(q_{\theta+\Delta}, q_\theta)\right].$$

By part (i) of Theorem 2.1,

$$\lim_{\Delta \to 0} G(\Delta, \theta) = \frac{1}{2}\left[I_X(\theta) - I_Y(\theta)\right] = d(\theta) \ge \frac{d_0}{2} > 0 \quad \forall\, \theta \in I. \tag{3.1}$$

To prove the lemma by contradiction, suppose it does not hold. Then we have

$$\forall\, \Delta_0 > 0\ \ \exists\, \Delta \le \Delta_0,\ \theta \in I \ \text{ s.t. }\ G(\Delta, \theta) \le \frac{d_0}{8}.$$

Thus, for some $\Delta_1 > 0$ in $I_0$, there exists $\theta_1 \in I$ such that $G(\Delta_1, \theta_1) \le d_0/8$. But by (3.1),

$$\exists\, \delta_1 > 0 \ \text{ s.t. }\ G(\Delta, \theta_1) > \frac{d_0}{4} \quad \forall\, 0 < \Delta \le \delta_1,$$

where (obviously) $\delta_1 < \Delta_1$. Now, let $\Delta_2 = \frac{1}{2}\delta_1$; again

$$\exists\, \theta_2 \in I \ \text{ s.t. }\ G(\Delta_2, \theta_2) \le \frac{d_0}{8},$$

and again by (3.1),

$$\exists\, \delta_2 > 0 \ \text{ s.t. }\ G(\Delta, \theta_2) > \frac{d_0}{4} \quad \forall\, 0 < \Delta \le \delta_2,$$

where $\delta_2 < \Delta_2$. If we continue in this manner, we shall have three sequences $\Delta_1, \Delta_2, \ldots$; $\delta_1, \delta_2, \ldots$; and $\theta_1, \theta_2, \ldots$ such that for every $n \ge 1$ we have $\delta_n < \Delta_n$ and

$$G(\delta_n, \theta_n) - G(\Delta_n, \theta_n) > \frac{d_0}{4} - \frac{d_0}{8} = \frac{d_0}{8}.$$

But by the mean value theorem we have

$$G(\delta_n, \theta_n) - G(\Delta_n, \theta_n) = (\delta_n - \Delta_n)\, \frac{\partial}{\partial \Delta} G(\tilde{\Delta}_n, \theta_n) > \frac{d_0}{8},$$


where $\delta_n < \tilde{\Delta}_n < \Delta_n$. We know that $(\delta_n - \Delta_n) \to 0$ as $n \to \infty$, so $\left|\frac{\partial}{\partial \Delta} G(\tilde{\Delta}_n, \theta_n)\right| \to \infty$ as $n \to \infty$. But this contradicts the fact that $\frac{\partial}{\partial \Delta} G(\Delta, \theta)$ is bounded (Corollary 3.2). Thus the lemma is proved. □

Lemma 3.3. Under the assumptions of Theorem 3.1, there exists $\tilde{\Delta} > 0$ such that for every $\theta \in I$ and every $\Delta \le \tilde{\Delta}$,

$$-\frac{d_0}{32} < D_X(p_\theta, p_{\theta_0}) - \int \log\left(\frac{p_\theta(x)}{p_{\theta_0}(x)}\right) p_{\theta+\Delta}(x)\, dx < \frac{d_0}{32},$$

$$-\frac{d_0}{32} < D_Y(q_\theta, q_{\theta_0}) - \int \log\left(\frac{q_\theta(y)}{q_{\theta_0}(y)}\right) q_{\theta+\Delta}(y)\, dy < \frac{d_0}{32}.$$

Proof. Considering the fact that the expressions between the inequality signs tend to zero as $\Delta \to 0$, the proof is similar to that of Lemma 3.2. □

Proof of Theorem 3.1. Let $\Delta^* = \min(\Delta_0, \tilde{\Delta})$ and $K = \lceil (\theta_1 - \theta_0)/\Delta^* \rceil$, where $\Delta_0$ and $\tilde{\Delta}$ are as in Lemmas 3.2 and 3.3, respectively, and $\lceil \cdot \rceil$ denotes the smallest integer greater than or equal to its argument. By Lemma 3.2 the inequality

$$D_X(p_{\theta_0 + k\Delta^*}, p_{\theta_0}) - D_Y(q_{\theta_0 + k\Delta^*}, q_{\theta_0}) > 0 \tag{3.2}$$

holds for $k = 1$. Aiming to give a proof by induction, we assume (3.2) holds for $k \le K$. Let

$$Q = D_X(p_{\theta_0 + (k+1)\Delta^*}, p_{\theta_0}) - D_Y(q_{\theta_0 + (k+1)\Delta^*}, q_{\theta_0}).$$

It is easy to show that $Q = Q' + Q''$, where

$$Q' = D_X(p_{\theta_0 + (k+1)\Delta^*}, p_{\theta_0 + k\Delta^*}) - D_Y(q_{\theta_0 + (k+1)\Delta^*}, q_{\theta_0 + k\Delta^*})$$

and

$$Q'' = \int \log\left(\frac{p_{\theta_0 + k\Delta^*}(x)}{p_{\theta_0}(x)}\right) p_{\theta_0 + (k+1)\Delta^*}(x)\, dx - \int \log\left(\frac{q_{\theta_0 + k\Delta^*}(y)}{q_{\theta_0}(y)}\right) q_{\theta_0 + (k+1)\Delta^*}(y)\, dy.$$

By Lemma 3.2,

$$Q' > \frac{d_0}{8},$$

and by Lemma 3.3,

$$Q'' > D_X(p_{\theta_0 + k\Delta^*}, p_{\theta_0}) - D_Y(q_{\theta_0 + k\Delta^*}, q_{\theta_0}) - \frac{d_0}{16}.$$

So, $Q > d_0/8 - d_0/16 > 0$. Thus (3.2) holds for $k + 1$. The result follows by induction. □
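As a small numerical illustration of the conclusion of Theorem 3.1 (not from the paper; the two normal location families are an assumption made for the example), take $X \sim N(\theta, 1)$ and $Y \sim N(\theta, 2)$, so that $I_X(\theta) = 1 > I_Y(\theta) = 1/2$ uniformly; the corresponding K–L distances are then ordered in the same way.

```python
# Illustrative check (not from the paper) of the conclusion of Theorem 3.1:
# X ~ N(theta, 1) has I_X(theta) = 1, Y ~ N(theta, 2) has I_Y(theta) = 1/2,
# so I_X - I_Y >= d0 = 1/2 > 0 on any interval, and D_X should exceed D_Y.
theta0, theta1 = 0.0, 1.3

def kl_normal_shift(theta_a, theta_b, var):
    """K-L distance D(N(theta_a, var), N(theta_b, var)) = (theta_a - theta_b)**2 / (2 var)."""
    return (theta_a - theta_b) ** 2 / (2 * var)

d_x = kl_normal_shift(theta1, theta0, 1.0)
d_y = kl_normal_shift(theta1, theta0, 2.0)
print("D_X =", d_x, " D_Y =", d_y, " D_X > D_Y:", d_x > d_y)
```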

4. Record Values and iid Observations

Let $X_i$, $i \ge 1$, be a sequence of iid continuous random variables. An observation $X_j$ will be called a lower record value if its value is smaller than that of all previous observations; thus, $X_j$ is a lower record value if $X_j < X_i$ for all $i < j$. By convention, $X_1$ is the first lower record value.


The times at which lower record values appear are given by the random variables $T_j$, which are called record times and are defined by $T_1 = 1$ with probability 1 and, for $j \ge 2$,

$$T_j = \min\{i : X_i < X_{T_{j-1}}\}.$$

The waiting time between the $i$th lower record value and the $(i+1)$th lower record value is called the inter-record time (IRT) and is denoted by $\Delta_i = T_{i+1} - T_i$, $i = 1, 2, \ldots$. Record times and inter-record times for upper record values are defined analogously. Let $L_1, L_2, \ldots, L_n$ be the first $n$ lower record values from a distribution with cdf $F(x; \theta)$ and pdf $f(x; \theta)$. Then the pdf of the joint distribution of the first $n$ lower record values is given by

$$q(\mathbf{l}; \theta) = \left[\prod_{i=1}^{n-1} \frac{f(l_i; \theta)}{F(l_i; \theta)}\right] f(l_n; \theta),$$

and the marginal density of $L_i$ (the $i$th lower record value, $i \ge 1$) is given by

$$q_i(l_i; \theta) = \frac{[-\log F(l_i; \theta)]^{i-1}}{(i-1)!}\, f(l_i; \theta).$$

The joint distribution of the lower record values and their IRTs has density

$$q(\mathbf{l}, \boldsymbol{\delta}; \theta) = \prod_{i=1}^{n} f(l_i; \theta)\left[1 - F(l_i; \theta)\right]^{\delta_i - 1},$$

and the joint density of $L_i$ and $\Delta_i$ is

$$q_i(l_i, \delta_i; \theta) = \frac{[-\log F(l_i; \theta)]^{i-1}}{(i-1)!}\, f(l_i; \theta)\, F(l_i; \theta)\left[1 - F(l_i; \theta)\right]^{\delta_i - 1}.$$

See Arnold et al. (1998) for more details. In experiments such as that of Example 1.2, where the experimenter has a choice between observing $n$ iid random variables or $n$ record values from the same distribution (at almost the same cost), it is desirable to know which experiment provides us (on average) with more statistical true evidence, that is, which one of $J(p_{\theta_1}, p_{\theta_0})$ and $J(q_{\theta_1}, q_{\theta_0})$ is greater. We shall call the family of distributions $\{f(x; \theta);\ \theta \in \Theta\}$ RMI, RLI, or REI according to whether $J(q_{\theta_1}, q_{\theta_0})$ is More than, Less than, or Equal to $J(p_{\theta_1}, p_{\theta_0})$, respectively, where $q$ and $p$ are the densities of the distributions of $\mathbf{L}$ and $\mathbf{X}$, respectively. We should note that throughout this section we consider lower record values without their IRTs; the sampling scheme itself is illustrated by the short simulation sketch below.
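The following minimal Python sketch (not from the paper; the standard-normal parent and the sample size are assumptions made for illustration) extracts lower record values, record times, and inter-record times from a simulated iid sequence, mirroring the definitions above.

```python
import random

# Illustrative sketch (not from the paper): extract lower record values and
# record times from an iid sequence, following the definitions above.
def lower_records(xs):
    """Return (record values, record times) of the sequence xs (1-based times)."""
    records, times = [], []
    current_min = float("inf")
    for i, x in enumerate(xs, start=1):
        if x < current_min:          # a new lower record (X_1 is a record by convention)
            records.append(x)
            times.append(i)
            current_min = x
    return records, times

rng = random.Random(2)
xs = [rng.gauss(0.0, 1.0) for _ in range(50)]            # iid N(0, 1) parent, for illustration
recs, times = lower_records(xs)
irts = [t2 - t1 for t1, t2 in zip(times, times[1:])]     # inter-record times
print("lower records:", [round(r, 3) for r in recs])
print("record times :", times)
print("IRTs         :", irts)
```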


Example 4.1 (Extreme Value Distribution, Scale Family). The family with cdf $F(x; \theta) = \exp(-e^{-\theta x})$, $\theta > 0$, is called the extreme value (scale) family. Its pdf is

$$f(x; \theta) = \theta\, e^{-\theta x} \exp\left(-e^{-\theta x}\right), \quad x \in \mathbb{R},\ \theta > 0.$$

This family can be shown to be RMI. We present the proof by induction. First let $n = 2$; we know that

$$D(p_{\theta_1}, p_{\theta_0}) = 2 E_{\theta_1}\left[\log \frac{f(X; \theta_1)}{f(X; \theta_0)}\right] = 2 E_{\theta_1}\left[\log \frac{\theta_1 e^{-\theta_1 X} \exp(-e^{-\theta_1 X})}{\theta_0 e^{-\theta_0 X} \exp(-e^{-\theta_0 X})}\right] = 2\left[\log \frac{\theta_1}{\theta_0} - \gamma\left(1 - \frac{\theta_0}{\theta_1}\right) + \Gamma\left(1 + \frac{\theta_0}{\theta_1}\right) - 1\right],$$

where $\Gamma(\cdot)$ is the complete gamma function and $\gamma$ is Euler's constant. Also,

$$D(q_{\theta_1}, q_{\theta_0}) = E_{\theta_1}\left[\log \frac{q(\mathbf{L}; \theta_1)}{q(\mathbf{L}; \theta_0)}\right] = E_{\theta_1}\left[\log \frac{f(L_1; \theta_1)\, F(L_1; \theta_0)\, f(L_2; \theta_1)}{f(L_1; \theta_0)\, F(L_1; \theta_1)\, f(L_2; \theta_0)}\right] = E_{\theta_1}\left[\log \frac{f(L_1; \theta_1)}{f(L_1; \theta_0)}\right] + E_{\theta_1}\left[\log \frac{F(L_1; \theta_0)}{F(L_1; \theta_1)}\right] + E_{\theta_1}\left[\log \frac{f(L_2; \theta_1)}{f(L_2; \theta_0)}\right] = A + B + C,$$

where

$$A = E_{\theta_1}\left[\log \frac{f(L_1; \theta_1)}{f(L_1; \theta_0)}\right] = \log \frac{\theta_1}{\theta_0} - \gamma\left(1 - \frac{\theta_0}{\theta_1}\right) + \Gamma\left(1 + \frac{\theta_0}{\theta_1}\right) - 1,$$

$$B = E_{\theta_1}\left[\log \frac{F(L_1; \theta_0)}{F(L_1; \theta_1)}\right] = E_{\theta_1}\left[e^{-\theta_1 L_1}\right] - E_{\theta_1}\left[e^{-\theta_0 L_1}\right] = 1 - \Gamma\left(1 + \frac{\theta_0}{\theta_1}\right),$$

and

$$C = E_{\theta_1}\left[\log \frac{f(L_2; \theta_1)}{f(L_2; \theta_0)}\right] = \log \frac{\theta_1}{\theta_0} - (\theta_1 - \theta_0)\, E_{\theta_1}(L_2) - 2 + \Gamma\left(2 + \frac{\theta_0}{\theta_1}\right) = \log \frac{\theta_1}{\theta_0} + (1 - \gamma)\left(1 - \frac{\theta_0}{\theta_1}\right) + \Gamma\left(2 + \frac{\theta_0}{\theta_1}\right) - 2.$$

Thus we have

$$D(p_{\theta_1}, p_{\theta_0}) - D(q_{\theta_1}, q_{\theta_0}) = 2\Gamma\left(1 + \frac{\theta_0}{\theta_1}\right) - \Gamma\left(2 + \frac{\theta_0}{\theta_1}\right) - \left(1 - \frac{\theta_0}{\theta_1}\right) = (1 - c)\left(c\,\Gamma(c) - 1\right),$$

where $c = \theta_0/\theta_1 < 1$, so

$$J(p_{\theta_1}, p_{\theta_0}) - J(q_{\theta_1}, q_{\theta_0}) = (1 - c)\left(c\,\Gamma(c) - 1\right) + \left(1 - \frac{1}{c}\right)\left(\frac{1}{c}\,\Gamma\!\left(\frac{1}{c}\right) - 1\right) < 0.$$
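A quick Monte Carlo check of the $n = 2$ computation above can be run as follows (not from the paper; the parameter values $\theta_0 = 1$, $\theta_1 = 2$ are an assumption for the example). It estimates $D(p_{\theta_1}, p_{\theta_0}) - D(q_{\theta_1}, q_{\theta_0})$ by simulation and compares it with the closed form $(1-c)(c\,\Gamma(c)-1)$.

```python
import math
import random

# Illustrative Monte Carlo check (not from the paper) of the n = 2 case in
# Example 4.1, for the extreme value scale family F(x; theta) = exp(-exp(-theta*x)).
theta0, theta1 = 1.0, 2.0
c = theta0 / theta1

def inv_cdf(p, theta):
    """Inverse of F(x; theta) = exp(-exp(-theta*x))."""
    return -math.log(-math.log(p)) / theta

def log_f(x, theta):
    return math.log(theta) - theta * x - math.exp(-theta * x)

def log_F(x, theta):
    return -math.exp(-theta * x)

rng = random.Random(3)
reps = 100_000
d_p_sum, d_q_sum = 0.0, 0.0
for _ in range(reps):
    # log likelihood ratio of two iid observations generated under theta1
    x1, x2 = inv_cdf(rng.random(), theta1), inv_cdf(rng.random(), theta1)
    d_p_sum += sum(log_f(x, theta1) - log_f(x, theta0) for x in (x1, x2))

    # first two lower records under theta1: L1 is the first parent draw, and
    # given L1 = l1 the next record L2 follows the parent cdf truncated to (-inf, l1)
    l1 = inv_cdf(rng.random(), theta1)
    l2 = inv_cdf(rng.random() * math.exp(log_F(l1, theta1)), theta1)
    d_q_sum += (log_f(l1, theta1) - log_f(l1, theta0)
                + log_F(l1, theta0) - log_F(l1, theta1)
                + log_f(l2, theta1) - log_f(l2, theta0))

diff_mc = (d_p_sum - d_q_sum) / reps
diff_closed = (1 - c) * (c * math.gamma(c) - 1)
print("Monte Carlo estimate of D(p) - D(q):", round(diff_mc, 4))
print("closed form (1 - c)(c Gamma(c) - 1):", round(diff_closed, 4))
```

Both values should be negative, in agreement with the RMI classification of this family.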


Now, we assume this family is RMI when $n = m - 1$. Then for $n = m$, we have

$$D_m(p_{\theta_1}, p_{\theta_0}) = D_{m-1}(p_{\theta_1}, p_{\theta_0}) + E_{\theta_1}\left[\log \frac{f(X; \theta_1)}{f(X; \theta_0)}\right] \tag{4.1}$$

and

$$D_m(q_{\theta_1}, q_{\theta_0}) = D_{m-1}(q_{\theta_1}, q_{\theta_0}) + E_{\theta_1}\left[\log \frac{f(L_m; \theta_1)\, F(L_{m-1}; \theta_0)}{f(L_m; \theta_0)\, F(L_{m-1}; \theta_1)}\right] \tag{4.2}$$

(the indices of $D$ in (4.1) and (4.2) represent the sample size). It is tedious but straightforward to show that the second term in (4.2) is greater than the second term in (4.1). Thus the result follows. □

Example 4.2 (Extreme Value Distribution, Location Family). The parametric family with pdf

$$f(x; \theta) = \frac{1}{\sigma}\, e^{-(x-\theta)/\sigma} \exp\left(-e^{-(x-\theta)/\sigma}\right), \quad x \in \mathbb{R},\ \theta \in \mathbb{R},$$

where $\sigma > 0$ is assumed to be known, is called the extreme value (location) family. This family is REI; that is, lower record values contain an equal amount of K–L information when compared with the same number of iid observations from the original distribution. This is implied by Theorem 4.1 below.

We shall denote by $C$ the class of all continuous distribution functions $F$ such that $F(x; \theta) = e^{-a(\theta) b(x)}$, where $a(\cdot)$ and $b(\cdot)$ are real positive functions. This class includes several important distributions, such as:

• the extreme value distribution (location family) with cdf $F(x; \theta) = \exp(-e^{-(x-\theta)})$, $x \in \mathbb{R}$, $\theta \in \mathbb{R}$;
• the power distribution with cdf $F(x; \theta) = x^\theta = \exp(-\theta(-\log x))$, $0 < x < 1$, $\theta > 0$;
• the Fréchet distribution (scale family) with cdf $F(x; \theta) = \exp(-\theta x^{-\beta})$, $x > 0$, $\theta > 0$, with $\beta > 0$ known.

Theorem 4.1. All members of the class $C$ are REI.

Theorem 4.1 is proved easily with the aid of the following two lemmas.

Lemma 4.1. For all members of the class $C$ of families of distributions, $L_n$ (the $n$th lower record value) is a sufficient statistic for the entire set of the first $n$ lower record values (Ahmadi and Arghami, 2001).


Lemma 4.2. For all members of the class $C$ of families of distributions, $b(L_n)$ is distributed as $T(\mathbf{X}) = \sum_{i=1}^{n} b(X_i)$.

Proof. The proof is easy and thus omitted. □

Proof of Theorem 4.1. We have

$$f(x; \theta) = -a(\theta)\, b'(x)\, \exp\left(-a(\theta) b(x)\right),$$

so

$$q_n(l_n; \theta) = \frac{-a(\theta)\, b'(l_n)}{(n-1)!}\, \left[a(\theta)\, b(l_n)\right]^{n-1} \exp\left(-a(\theta)\, b(l_n)\right).$$

Thus

$$R(l_n) = \left(\frac{a(\theta_1)}{a(\theta_0)}\right)^{n} \exp\left\{-\left[a(\theta_1) - a(\theta_0)\right] b(l_n)\right\}.$$

Hence,

$$D(q_{\theta_1}, q_{\theta_0}) = E_{\theta_1}\left[\log R(L_n)\right] = n \log \frac{a(\theta_1)}{a(\theta_0)} - \left[a(\theta_1) - a(\theta_0)\right] E_{\theta_1}\left[b(L_n)\right] = n \log \frac{a(\theta_1)}{a(\theta_0)} - \left[a(\theta_1) - a(\theta_0)\right] E_{\theta_1}\left[\sum_{i=1}^{n} b(X_i)\right] = D(p_{\theta_1}, p_{\theta_0}). \qquad \square$$
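As an illustrative check of Theorem 4.1 (not from the paper; the power-distribution family $F(x; \theta) = x^\theta$ and the parameter values are assumptions made for the example), the following sketch estimates $D(q_{\theta_1}, q_{\theta_0})$ for $n$ lower records by Monte Carlo and compares it with the closed form of $D(p_{\theta_1}, p_{\theta_0})$ for $n$ iid observations.

```python
import math
import random

# Illustrative check (not from the paper) of Theorem 4.1 for the power
# distribution F(x; theta) = x**theta on (0, 1), which belongs to class C
# with a(theta) = theta and b(x) = -log(x).
theta0, theta1, n = 1.0, 2.0, 5

# Closed form for n iid observations:
# D(p_theta1, p_theta0) = n * [log(theta1/theta0) - (theta1 - theta0)/theta1].
d_iid = n * (math.log(theta1 / theta0) - (theta1 - theta0) / theta1)

rng = random.Random(4)
reps = 100_000
total = 0.0
for _ in range(reps):
    # generate the first n lower records under theta1: given L_i, the next
    # record is the parent truncated to (0, L_i), i.e. L_{i+1} = L_i * V**(1/theta1)
    l = rng.random() ** (1 / theta1)          # L_1 ~ F(.; theta1)
    for _ in range(n - 1):
        l *= rng.random() ** (1 / theta1)
    # by sufficiency of L_n, the log likelihood ratio of the record data is log R(l_n)
    total += n * math.log(theta1 / theta0) - (theta1 - theta0) * (-math.log(l))

print("Monte Carlo D(q):", round(total / reps, 4))
print("closed form D(p):", round(d_iid, 4))
```

The two values agree, illustrating the REI property of class $C$.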


Note: Theorem 4.1 can also be proved by using Lemmas 4.1 and 4.2, Theorem 2.2 above, and Theorem 3.1 of Ahmadi and Arghami (2001). By using Theorem 3.1, we derive Table 1 below from Table II of Ahmadi and Arghami (2003).

Table 1
Families of distributions classified on the basis of K–L information

cdf                                                     Without IRT    With IRT
N(θ, σ²)                                                RLI            RMI
N(μ, θ²)                                                RMI            RMI
Γ(θ, β), 0 < β < 1                                      RLI            RLI
Γ(θ, β), β = 1                                          RLI            REI
Γ(θ, β), β > 1                                          RLI            RMI
x^θ, 0 < x < 1, 0 < θ < 1                               REI            RMI
exp(−exp(−(x − θ)/σ)), x ∈ ℝ, θ ∈ ℝ, σ > 0              REI            RMI
exp(−exp(−(x − μ)/θ)), x ∈ ℝ, μ ∈ ℝ, θ > 0              RMI            REI
L(θ, σ)                                                 RLI            RMI
L(μ, θ)                                                 RLI            RMI


References

Ahmadi, J., Arghami, N. R. (2001). On the Fisher information in record values. Metrika 53:195–205.
Ahmadi, J., Arghami, N. R. (2003). Comparing the Fisher information in record values and iid observations. Statistics 37:435–441.
Arnold, B. C., Balakrishnan, N., Nagaraja, H. N. (1998). Records. New York: John Wiley.
Dabak, A. G., Johnson, D. H. (2002). Relation between Kullback–Leibler distance and Fisher information. http://cmc.rice.edu/docs/docs/Dab2002Sep1Relationsb.pdf
Emadi, M., Arghami, N. R. (2003). Some measures of support for statistical hypotheses. J. Statist. Theor. Appl. 2:165–176.
Emadi, M., Ahmadi, J., Arghami, N. R. (2005). Comparison of record data and random observations based on statistical evidence. To appear in Statist. Pap.
Glick, N. (1978). Breaking records and breaking boards. Amer. Math. Monthly 85:2–26.
Hofmann, G. (2004). Comparing the Fisher information in record data and random observations. Statist. Pap. 45:517–528.
Kullback, S. (1959). Information Theory and Statistics. New York: John Wiley.
Lehmann, E. L. (1983). Theory of Point Estimation. New York: John Wiley.
Rohatgi, V. K. (1976). An Introduction to Probability Theory and Mathematical Statistics. New York: John Wiley.
Royall, R. M. (1997). Statistical Evidence: A Likelihood Paradigm. London: Chapman and Hall.
Royall, R. M. (2000). On the probability of observing misleading statistical evidence. J. Amer. Statist. Assoc. 95:760–780.