Probabilistic Knowledge and Probabilistic Common Knowledge

Paul Krasucki[1], Rohit Parikh[2], and Gilbert Ndjatou[3]

Abstract: In this paper we develop a theory of probabilistic common knowledge and probabilistic knowledge in a group of individuals whose knowledge partitions are not wholly independent.

1 Introduction

Our purpose in this paper is to extend conventional information theory and to address the issue of measuring the amount of knowledge that n individuals have in common. If, for example, two individuals have partitions which correspond closely, then we would expect that they share a great deal. However, the conventional definition of mutual knowledge may give us the conclusion that there is no fact which is mutually known, or even known to one as being known to another. This is unfortunate because [CM] and [HM] both give us arguments that seem to show that common knowledge (mutual knowledge if two individuals are involved) is both difficult to attain and necessary for certain tasks. If, however, we can show that probabilistic knowledge is both easier to attain and a suitable substitute in many situations, then we have made progress. See [Pa2] for a description of situations where partial knowledge is adequate for communication.

To this end, we shall develop a theory of probabilistic common knowledge which turns out to have surprising and fruitful connections both with traditional information theory and with Markov chains. To be sure, these theories have their own areas of intended application. Nonetheless, it will turn out that our mathematical theory has many points in common with these two theories.

The standard Logics of Knowledge tend to use Kripke models with S5 accessibility relations, one for each knower. One can easily study instead the partitions corresponding to these accessibility relations, and we shall do this. We also assume that the space W of possible worlds has a probability measure µ given with it.

[1] Department of Computer Science, Rutgers-Camden.
[2] Department of Computer Science, CUNY Graduate Center, 33 West 42nd Street, New York, NY 10036. Email: [email protected].
[3] Department of Computer Science, College of Staten Island, CUNY and CUNY Graduate Center.


In Figure I below, Ann has partition A = {A_1, A_2} and Bob has partition B = {B_1, B_2}, so that each of the sets A_i, B_j has probability .5 and the intersections A_i ∩ B_j have probability .45 when i = j and .05 otherwise. The vertical line divides A_1 from A_2; the slanted line divides B_1 from B_2.

[Figure I: the two partitions drawn over a square; the four regions A_i ∩ B_j carry probabilities .45, .05, .05, .45.]

Since the meet of the partitions is trivial, there is no common knowledge in the usual sense of [Au], [HM]. In fact there is no nontrivial proposition p such that Ann knows that Bob knows p. It is clear, however, that Ann and Bob have nearly the same information, and if the partitions are themselves common knowledge, then Ann and Bob will be able to guess, with high probability, what the other knows. We would like then to say that Ann and Bob have probabilistic common knowledge, but how much? One purpose of this paper is to answer this question and to prove properties of our definition that show why the answer is plausible.

A closely related question is that of measuring indirect probabilistic knowledge. For example, we would expect that what Ann knows about Bob's knowledge is less than or equal to what Bob himself knows, and what Ann knows of Bob's knowledge of Carol is in turn less than or equal to the amount of knowledge that Bob has about Carol's knowledge. We would expect in the limit that what Ann knows about what Bob knows about what Ann knows ... about what Bob knows will approach whatever ordinary common knowledge they have.

It turns out that to tackle these questions successfully, we need a third notion. This is the notion of the amount of information acquired when one's probabilities change as a result of new information (which does not invalidate old information). Suppose for example that I am told that a certain fruit is a peach. I may then assign a probability of .45 to the proposition that it is sweet. If I learn then that it just came off a tree, then I will expect that it was probably picked for shipping and the probability may drop to .2; but if I learn further that it fell off the tree, then it will rise to .9. In each case I am getting information, consistent with previous information and causing me to revise my probabilities, but how much information am I getting?

2 Preliminaries

We start by giving some definitions, some old, some apparently new. If a space has 2^n points, all equally likely, then the amount of information gained by knowing the identity of a specific point x is n bits. If one only knows a set X in which x falls, then the information gained is less, in fact equal to I(X) = −log(µ(X)), where µ(X) is the probability[4] of X. If P = {P_1, ..., P_k} is a partition of the whole space W, then the expected information when one discovers the identity of the P_i which contains x is

H(P) = \sum_{i=1}^{k} \mu(P_i) I(P_i) = \sum_{i=1}^{k} -\mu(P_i) \log(\mu(P_i))

These definitions so far are standard in the literature [Sh], [Ab], [Dr]. We now introduce a notion which is apparently new. Suppose I have a partition P = {P_1, ..., P_k} whose a priori probabilities are y_1, ..., y_k, but some information that I receive causes me to change them to u_1, ..., u_k. How much information have I received?

Definition 1:

IG(\vec{u}, \vec{y}) = \sum_{i=1}^{k} u_i (\log u_i - \log y_i) = \sum_{i=1}^{k} u_i \log\frac{u_i}{y_i}

Here IG stands for "information gain". Clearly this definition needs some justification. We will first provide an intuitive explanation, and then prove some properties of this notion IG which will make it more plausible that it is the right one.

(a) Suppose that the space had 2^n points, and the distribution of probabilities that we had was the flat distribution. Then the set P_i has 2^n · y_i points.[5] After we receive our information, the points are no longer equally likely, and each point in P_i has probability u_i/|P_i| = u_i/(y_i 2^n). Thus the expected information of the partition into the 2^n singleton sets is

-\sum_{i=1}^{k} (y_i \cdot 2^n) \frac{u_i}{y_i 2^n} \log\Big(\frac{u_i}{y_i 2^n}\Big)

which comes out to

\alpha = n - \sum_{i=1}^{k} u_i (\log u_i - \log y_i)

Since the flat distribution had expected information n, we have gained information equal to

n - \alpha = n - \Big(n - \sum_{i=1}^{k} u_i (\log u_i - \log y_i)\Big) = \sum_{i=1}^{k} u_i (\log u_i - \log y_i) = \sum_{i=1}^{k} u_i \log\frac{u_i}{y_i}

[4] We will use the letter µ for both absolute and relative probabilities, to save the letter p for other uses. All logs will be to base 2, and since x log(x) → 0 as x → 0, we will take x log(x) to be 0 when x is 0.
[5] There is a tacit assumption here that the y_i are of the form k/2^n. But note that numbers of this form are dense in the unit interval, and if we assume that the function IG is continuous, then it is sufficient to consider numbers of this form.

(b) In information theory, we have a notion of the information that two partitions P and Q share, also called their mutual information, and usually denoted by I(P; Q):

I(P; Q) = \sum_{i,j} \mu(P_i \cap Q_j) \log\frac{\mu(P_i \cap Q_j)}{\mu(P_i) \cdot \mu(Q_j)}

We will recalculate this quantity using the function IG. If Ann has partition P, then with probability µ(P_i) she knows that P_i is true. In that case she will revise her probabilities of Bob's partition from µ(\vec{Q}) to µ(\vec{Q}|P_i), and her information gain about Bob's partition is IG(µ(\vec{Q}|P_i), µ(\vec{Q})). Summing over all the P_i we get

\sum_{i} \mu(P_i) \cdot IG(\mu(\vec{Q}|P_i), \mu(\vec{Q})) = \sum_{i} \mu(P_i) \Big(\sum_{j} \mu(Q_j|P_i) \log\frac{\mu(Q_j|P_i)}{\mu(Q_j)}\Big)

and an easy calculation shows that this is the same as

I(P; Q) = \sum_{i,j} \mu(P_i \cap Q_j) \log\frac{\mu(P_i \cap Q_j)}{\mu(P_i) \cdot \mu(Q_j)}

Since the calculation through IG gives the same result as the usual formula, this gives additional support to the claim that our formula for the information gain is the right one.
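The identity just verified is easy to check numerically. The following minimal Python sketch (not part of the original paper; the joint distribution is the Ann-Bob example of Figure I) computes IG directly from Definition 1 and confirms that the expected information gain coincides with the mutual information I(P; Q):

    from math import log2

    def ig(u, y):
        # IG(u, y) = sum_i u_i * log2(u_i / y_i), taking 0 * log 0 = 0.
        return sum(ui * log2(ui / yi) for ui, yi in zip(u, y) if ui > 0)

    # Joint distribution of Figure I: mu(A_i ∩ B_j) = .45 if i == j, else .05.
    joint = [[0.45, 0.05],
             [0.05, 0.45]]
    mu_A = [sum(row) for row in joint]                  # (.5, .5)
    mu_B = [sum(col) for col in zip(*joint)]            # (.5, .5)

    # Expected IG when Ann learns which A_i holds and revises her view of Bob's cells.
    expected_ig = sum(mu_A[i] * ig([joint[i][j] / mu_A[i] for j in range(2)], mu_B)
                      for i in range(2))

    # Mutual information from the standard formula.
    mutual = sum(joint[i][j] * log2(joint[i][j] / (mu_A[i] * mu_B[j]))
                 for i in range(2) for j in range(2))

    print(round(expected_ig, 3), round(mutual, 3))      # both 0.531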

3 Properties of information gain

Theorem 1:
(a) IG(\vec{u}, \vec{v}) ≥ 0, and IG(\vec{u}, \vec{v}) = 0 iff \vec{u} = \vec{v}.
(b1) If \vec{p} = µ(\vec{P}) and there is a set X such that u_i = µ(P_i|X) for all i, then IG(\vec{u}, \vec{p}) ≤ −log(µ(X)). Thus the information received, by way of a change of probabilities, is less than or equal to the information I(X) contained in X.
(b2) Equality obtains in (b1) above iff for all i, either µ(P_i|X) = µ(P_i), or else µ(P_i ∩ X) = 0.

Thus if all nonempty sets involved have non-zero measure, every P_i is either a subset of X or disjoint from it.

Proof: (a) It is straightforward to show using elementary calculus that log x < (x − 1) log e except when x = 1, when the two are equal.[6] Replacing x by 1/x we get log x > (1 − 1/x) log e, except again at x = 1. This yields

IG(\vec{u}, \vec{v}) = \sum_i u_i \log\frac{u_i}{v_i} \ge \Big(\sum_i u_i \big(1 - \frac{v_i}{u_i}\big)\Big) \log e = \Big(\sum_i u_i - \sum_i v_i\Big) \log e = 0

with equality holding iff, for all i, either u_i/v_i = 1 or u_i = 0. However, the case u_i = 0 (with v_i > 0) cannot arise, since we know that \sum_i u_i = \sum_i v_i = 1 and u_i ≤ v_i for all i.

(b1) Let u_i = µ(P_i|X) and \vec{p} = (µ(P_1), ..., µ(P_k)). Then

IG(\vec{u}, \vec{p}) = \sum_{i=1}^{k} \mu(P_i|X) \log\frac{\mu(P_i|X)}{\mu(P_i)} = \sum_{i=1}^{k} \mu(P_i|X) \log\frac{\mu(P_i \cap X)}{\mu(P_i)\,\mu(X)} = \sum_{i=1}^{k} \mu(P_i|X) \log\frac{\mu(P_i \cap X)}{\mu(P_i)} - \sum_{i=1}^{k} \mu(P_i|X) \log\mu(X) = \alpha + I(X)

where \alpha = \sum_{i=1}^{k} \mu(P_i|X) \log\frac{\mu(P_i \cap X)}{\mu(P_i)} ≤ 0, since µ(P_i ∩ X)/µ(P_i) ≤ 1 for all i and \sum_{i=1}^{k} µ(P_i|X) = 1.

(b2) α = 0 only if, for all i, µ(P_i|X) = 0 or µ(P_i ∩ X) = µ(P_i), i.e. either P_i ∩ X = ∅ or P_i ⊆ X (X is a union of the P_i's). □

[6] e is of course the number whose natural log is 1. Note that log e = log_2 e = 1/ln 2. The line y = (x − 1) log e is tangent to the curve y = log x at (1, 0), and lies above it.
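A small numerical illustration of (b1) and (b2) may help (a sketch, not from the paper; the four-cell partition and the sets X below are made up for the purpose): relativising to a set X that cuts across the cells gives a strict inequality, while relativising to a union of cells gives equality.

    from math import log2

    def ig(u, y):
        return sum(ui * log2(ui / yi) for ui, yi in zip(u, y) if ui > 0)

    mu_P = [0.1, 0.2, 0.3, 0.4]              # a priori probabilities of P_1, ..., P_4

    # X cuts across the cells: strict inequality in (b1).
    mu_P_and_X = [0.05, 0.05, 0.0, 0.3]      # mu(P_i ∩ X), so mu(X) = 0.4
    mu_X = sum(mu_P_and_X)
    posterior = [m / mu_X for m in mu_P_and_X]
    print(ig(posterior, mu_P), -log2(mu_X))  # about 0.636 < about 1.322

    # X = P_3 ∪ P_4, a union of cells: equality, as in (b2).
    mu_P_and_X = [0.0, 0.0, 0.3, 0.4]
    mu_X = sum(mu_P_and_X)
    posterior = [m / mu_X for m in mu_P_and_X]
    print(ig(posterior, mu_P), -log2(mu_X))  # both about 0.515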

If we learn that one of the sets that we had initially considered possible (its probability was greater than zero) can be excluded, then our information gain is least if the probability of the excluded piece is distributed over all the other elements of the partition, proportionately to their initial probabilities. The gain is greatest when the probability of the excluded piece is shifted to a single element of the partition, and this element was initially one of the least likely elements.

Theorem 2: Let \vec{v} = (v_1, ..., v_{k-1}, v_k) and \vec{u} = (u_1, ..., u_{k-1}, u_k), where u_k = 0, u_i = v_i + a_i v_k for i = 1, ..., k − 1, a_i ≥ 0, \sum_{i=1}^{k-1} a_i = 1, and v_k > 0. Then:
(a) IG(\vec{u}, \vec{v}) is minimum when a_i/v_i = c is the same for all i = 1, ..., k − 1, and c = 1/(1 − v_k); moreover, this minimum value is just −log(1 − v_k).
(b) IG(\vec{u}, \vec{v}) is maximum when a_i = 1 for some i such that v_i = min_{j=1,...,k-1}(v_j) and the other a_j are 0.

Proof: (a) Let \vec{a} = (a_1, ..., a_{k-2}, a_{k-1}). Since \sum_{i=1}^{k-1} a_i = 1 we have a_{k-1} = 1 - \sum_{i=1}^{k-2} a_i.

So we need only look at f : [0, 1]^{k-2} → R, defined by

f(\vec{a}) = IG(\vec{u}, \vec{v}) = \sum_{i=1}^{k-2} (v_i + a_i v_k) \log\frac{v_i + a_i v_k}{v_i} + \Big(v_{k-1} + v_k\big(1 - \sum_{j=1}^{k-2} a_j\big)\Big) \log\frac{v_{k-1} + v_k\big(1 - \sum_{j=1}^{k-2} a_j\big)}{v_{k-1}}

To find the extrema of f in [0, 1]^{k-2}, consider the partial derivatives

\frac{\partial f}{\partial a_i} = v_k \Big( \log\frac{v_i + a_i v_k}{v_i} - \log\frac{v_{k-1} + v_k\big(1 - \sum_{j=1}^{k-2} a_j\big)}{v_{k-1}} \Big)

Now ∂f/∂a_i = 0 iff (v_i + a_i v_k)/v_i = (v_{k-1} + a_{k-1} v_k)/v_{k-1}, i.e. iff a_i/v_i = a_{k-1}/v_{k-1}; it is > 0 iff a_i/v_i > a_{k-1}/v_{k-1} and < 0 iff a_i/v_i < a_{k-1}/v_{k-1}. Setting a_i/v_i = c for all i, from \sum_{i=1}^{k-1} a_i = 1 and \sum_{i=1}^{k-1} v_i = 1 - v_k we get c = 1/(1 - v_k). Thus f has a minimum when a_i = v_i/(1 - v_k) for all i. The fact that this minimum value is −log(1 − v_k) is easily calculated by substitution. Note that this quantity is exactly equal to I(X) where X is the complement of the set P_k whose probability was v_k. Thus we have an exact correspondence with parts (b1) and (b2) of the previous theorem.

(b) To get the maximum, note that since the first derivatives ∂f/∂a_i are always increasing, and the second derivatives are all positive, the maxima can only occur at the vertices of [0, 1]^{k-1}. (If they occurred elsewhere, we could increase the value by moving in some direction.) Now the values of f at the points p_j = (0, ..., 0, 1, 0, ..., 0) (a_i = δ(i, j)) are IG(\vec{u}, \vec{v}) = g(v_j), where g(x) = (x + v_k) log((x + v_k)/x). But g(x) is a decreasing function of x, so IG(\vec{u}, \vec{v}) is maximum when a_j = 1 for some j such that v_j is minimal. □

Example 1: Suppose for example that a partition {P_1, P_2, P_3, P_4} is such that all the P_i have probabilities equal to .25. If we now receive the information that P_4 is impossible, then we will have gained information approximately equal to IG((.33, .33, .33, 0), (.25, .25, .25, .25)) ≈ 3 · (.33) log(.33/.25) ≈ log(4/3) ≈ .42; similarly if we discover instead that it is P_3 which is impossible.

If, however, we only discover that the total probability of the set P_3 ∪ P_4 has decreased to .33, then our information gain is only IG((.33, .33, .17, .17), (.25, .25, .25, .25)) ≈ .08, which is much less. And this makes sense, since knowing that the set P_3 ∪ P_4 has gone down in weight tells us less than knowing that half of it is no longer to be considered, and moreover which half. If we discover that P_4 is impossible and all the cases that we had thought to be in P_4 are in fact in P_1, then the information gain is IG((.50, .25, .25, 0), (.25, .25, .25, .25)) = (1/2) log 2 = .5, which is more than our information gain in the two previous cases.
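The figures in Example 1 can be reproduced directly (a sketch; exact fractions are used in place of the rounded .33 and .17):

    from math import log2

    def ig(u, y):
        return sum(ui * log2(ui / yi) for ui, yi in zip(u, y) if ui > 0)

    prior = [0.25, 0.25, 0.25, 0.25]
    print(ig([1/3, 1/3, 1/3, 0], prior))     # log2(4/3), about 0.42: P_4 excluded
    print(ig([1/3, 1/3, 1/6, 1/6], prior))   # about 0.08: P_3 ∪ P_4 shrinks to 1/3
    print(ig([0.5, 0.25, 0.25, 0], prior))   # 0.5: all of P_4's weight moves to P_1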

Example 2: As the following example shows, IG doesn't satisfy the triangle inequality; i.e., if we revise our probabilities from \vec{y} to \vec{u} and then again to \vec{v}, our total gain can be less than the gain from revising them straight from \vec{y} to \vec{v}. This may perhaps explain why we do not notice gradual changes, but are struck by the cumulative effect of all of them. Take \vec{v} = (0.1, 0.9), \vec{u} = (0.25, 0.75), \vec{y} = (0.5, 0.5). Then IG(\vec{v}, \vec{u}) + IG(\vec{u}, \vec{y}) ≈ 0.10 + 0.19 = 0.29, while IG(\vec{v}, \vec{y}) ≈ 0.53. Also IG(\vec{y}, \vec{v}) ≈ 0.74, so that IG is not symmetric.

Another way to see that this failure of the triangle inequality is reasonable is to notice that we could have gained information by first relativising to a set X, and then to another set Y, gaining information ≤ −log(µ(X)) and ≤ −log(µ(Y)) respectively. However, to get the cumulative information gain, we might need to relativise to X ∩ Y, whose probability might be much less than µ(X)µ(Y).

We have defined the mutual knowledge I(P; Q) of two partitions P, Q. If we denote their join by P + Q, then the quantity usually denoted in the literature as H(P, Q) is merely H(P + Q). The connection between mutual information and entropy is well known [Ab]:

H(P + Q) = H(P) + H(Q) - I(P; Q)

Moreover, the equivocation H(P|Q) of P with respect to Q is defined as H(P|Q) = H(P) − I(P; Q). If i and j are agents with partitions P_i and P_j respectively, then inf(ij) will be just I(P_i; P_j). The equivocations are non-negative, and I is symmetric, so we have

I(P; Q) ≤ min(H(P), H(Q))

Thus what Ann knows about Bob's knowledge is never more than what Bob knows, nor more than what Ann herself knows.

We want now to generalise these notions to more than two people, for which we will need a notion from the theory of Markov chains, namely stochastic matrices. We start by making a connection between boolean matrices and the usual notion of knowledge.
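First, though, a quick numerical check (a sketch, not part of the paper) of Example 2 and of the entropy identities above, again on the Figure I distribution:

    from math import log2

    def ig(u, y):
        return sum(ui * log2(ui / yi) for ui, yi in zip(u, y) if ui > 0)

    def entropy(p):
        return -sum(x * log2(x) for x in p if x > 0)

    # Example 2: IG fails the triangle inequality and is not symmetric.
    y, u, v = [0.5, 0.5], [0.25, 0.75], [0.1, 0.9]
    print(ig(v, u) + ig(u, y), ig(v, y), ig(y, v))   # about 0.29, 0.53, 0.74

    # H(P + Q) = H(P) + H(Q) - I(P; Q) and I(P; Q) <= min(H(P), H(Q)).
    joint = [0.45, 0.05, 0.05, 0.45]                 # cells of the join P + Q
    P = Q = [0.5, 0.5]
    I = entropy(P) + entropy(Q) - entropy(joint)
    print(round(I, 3), min(entropy(P), entropy(Q)))  # 0.531 <= 1.0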

4 Common knowledge and Boolean matrices

We start by reviewing some notions from ordinary knowledge theory [Au], [HM], [PK].

Definition 2: Suppose that {1, ..., k} are individuals and i has knowledge partition P_i. If w ∈ W then i knows E at w iff P_i(w) ⊆ E, where P_i(w) is the element of the partition P_i containing w. K_i(E) = {w | i knows E at w}. Note that K_i(E) is always a subset of E.

Write w ≈_i w′ if w and w′ are in the same element of the partition P_i (iff P_i(w) = P_i(w′)). Then i knows E at w iff for all w′, w ≈_i w′ → w′ ∈ E. Also, it follows that i knows that j knows E at w iff w ∈ K_i(K_j(E)) iff

\bigcup_{l \le n} \{P_j^l \mid P_j^l \cap P_i(w) \ne \emptyset\} \subseteq E

i.e. {w′ | ∃v such that w ≈_i v ≈_j w′} ⊆ E.

Definition 3: An event E is common knowledge between a group of individuals i_1, ..., i_m at w iff (∀ j_1, ..., j_k ∈ {i_1, ..., i_m}) (w ≈_{j_1} w_1, ..., w_{k-1} ≈_{j_k} w′) → (w′ ∈ E), iff for all X ∈ {K_1, ..., K_n}*, w ∈ X(E).

We now analyse knowledge and common knowledge using boolean transition matrices.[7]

Definition 4: The boolean transition matrix B_ij of ij is defined by letting B_ij(k, l) = 1 if P_i^k ∩ P_j^l ≠ ∅, and 0 otherwise.

We can extend this definition to a string of individuals x = i_1...i_k:

Definition 5: The boolean transition matrix B_x for a string x = i_1...i_k is B_x = B_{i_1 i_2} ⊗ B_{i_2 i_3} ⊗ ... ⊗ B_{i_{k-1} i_k}, where ⊗ is defined as normalised matrix multiplication: if (B × B′)(k, l) > 0 then (B ⊗ B′)(k, l) is set to 1, otherwise it is 0. We can also define ⊗ as:

(B ⊗ B′)(k, l) = \bigvee_{m=1}^{n} (B(k, m) \wedge B′(m, l))

We say that there is no non-trivial common knowledge iff the only event that is common knowledge at any w is the whole space W.

Fact 1: There is no non-trivial common knowledge iff for every string x including all individuals, lim_{n→∞} B_x^n = 1, where 1 is the matrix filled with 1's only.

We now consider the case of stochastic matrices.
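A minimal sketch of Definitions 4-5 and Fact 1 for two agents (illustrative code, not from the paper; the two Boolean matrices below are made up): in the first case the ⊗-products along the string ijij... reach the all-ones matrix, so there is no non-trivial common knowledge; in the second the block structure survives, and common knowledge exists.

    def bool_mult(B, C):
        # Normalised Boolean product: (B ⊗ C)(k, l) = OR_m (B(k, m) AND C(m, l)).
        return [[int(any(B[k][m] and C[m][l] for m in range(len(C))))
                 for l in range(len(C[0]))]
                for k in range(len(B))]

    def transpose(B):
        return [list(col) for col in zip(*B)]

    def no_nontrivial_common_knowledge(B_ij, steps=20):
        # Fact 1, two-agent case: iterate B_x for x = ijij... and test for the all-ones matrix.
        B_ji = transpose(B_ij)
        P = B_ij
        for n in range(steps):
            if all(all(row) for row in P):
                return True
            P = bool_mult(P, B_ji if n % 2 == 0 else B_ij)
        return False

    # Ann/Bob from Figure I: every A_i meets every B_j.
    print(no_nontrivial_common_knowledge([[1, 1], [1, 1]]))   # True
    # A block-diagonal B_ij: the meet is non-trivial, so common knowledge exists.
    print(no_nontrivial_common_knowledge([[1, 0], [0, 1]]))   # False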

5 Information via a string of agents

When we consider boolean transition matrices, we may lose some information. If we know the probabilities of all the elements of the σ-field generated by the join of the partitions P_i, the boolean transition matrix B_ij is created by putting a 1 in position (k, l) iff µ(P_j^l | P_i^k) > 0, and 0 otherwise. We keep more of the information by having µ(P_j^l | P_i^k) in position (k, l). We denote this matrix by M_ij and we call it the transition matrix from i to j.

[7] The subscripts to the matrices will denote the knowers, and the row and column will be presented explicitly as arguments. Thus B_ij(k, l) is the entry in the kth row and lth column of the matrix B_ij.

Definition 6: For every i, j, the ij-transition matrix M_ij is defined by M_ij(a, b) = µ(P_j^b | P_i^a). For all i, M_ii is the unit matrix of dimension equal to the size of the partition P_i.

Definition 7: If x is a string of elements of {1, ..., k} (x ∈ {1, ..., k}*, x = x_1...x_n), then M_x = M_{x_1 x_2} × ... × M_{x_{n-1} x_n} is the transition matrix for x.

We now define inf(ixj), where x is a sequence of agents. inf(ixj) will be the information that i has about j via x. If, e.g., i = 3, x = 1, j = 2, we should interpret inf(ixj) as the amount of information 3 has about 1's knowledge of 2.

Example 3: In our example from the introduction, if i were Ann and j were Bob, then we would get

M_{ij} = \begin{pmatrix} .9 & .1 \\ .1 & .9 \end{pmatrix}

The matrix M_ji equals the matrix M_ij, and the matrix M_iji is

M_{iji} = \begin{pmatrix} .82 & .18 \\ .18 & .82 \end{pmatrix}

Thus it turns out that each of Ann and Bob has .53 bits of knowledge about the other, and Ann has .32 bits of knowledge about Bob's knowledge of her.

Definition 8: Let \vec{m}_l = (m_{l1}, ..., m_{lk}) be the lth row vector of the transition matrix M_ixj (m_{lt} = µ(P_j^t |_x P_i^l), where µ(P_j^t |_x P_i^l) is the probability that a point in P_i^l will end up in P_j^t after a random move within P_i^l followed by a sequence of random moves respectively within the elements of those P_{x_r} which form x). Then:

inf(ixj) = \sum_{l=1}^{k} \mu(P_i^l)\, IG(\vec{m}_l, \mu(\vec{P}_j))

where IG(\vec{m}_l, µ(\vec{P}_j)) is the information gain of the distribution \vec{m}_l over the distribution µ(\vec{P}_j).

The intuitive idea is that the a priori probabilities of j's partition are µ(\vec{P}_j). However, if w is in P_i^l, the lth set in i's partition, then these probabilities will be revised according to the lth row of the matrix M_ixj, and the information gain will be IG(\vec{m}_l, µ(\vec{P}_j)). The expected information gain for i about j via x is then obtained by multiplying by the µ(P_i^l)'s and summing over all l.

Example 4: Consider M_iji. For convenience we'll denote elements P_i^m by A_m and elements P_j^m by B_m (so that the A's are elements of i's partition, and the B's are elements of j's partition). Therefore M_iji = M_ij × M_ji, where:

M_{ij} = \begin{pmatrix} \mu(B_1|A_1) & \cdots & \mu(B_k|A_1) \\ \mu(B_1|A_2) & \cdots & \mu(B_k|A_2) \\ \vdots & \ddots & \vdots \\ \mu(B_1|A_k) & \cdots & \mu(B_k|A_k) \end{pmatrix}
\qquad
M_{ji} = \begin{pmatrix} \mu(A_1|B_1) & \cdots & \mu(A_k|B_1) \\ \mu(A_1|B_2) & \cdots & \mu(A_k|B_2) \\ \vdots & \ddots & \vdots \\ \mu(A_1|B_k) & \cdots & \mu(A_k|B_k) \end{pmatrix}

M_iji is the matrix of probabilities µ(A_l |_j A_m) for l, m = 1, ..., k, where µ(A_l |_j A_m) is the probability that a point in A_m will end up in A_l after a random move within A_m followed by a random move within some B_s.

M_{iji} = \begin{pmatrix} \mu(A_1|_j A_1) & \mu(A_1|_j A_2) & \cdots & \mu(A_1|_j A_k) \\ \mu(A_2|_j A_1) & \mu(A_2|_j A_2) & \cdots & \mu(A_2|_j A_k) \\ \vdots & \vdots & \ddots & \vdots \\ \mu(A_k|_j A_1) & \mu(A_k|_j A_2) & \cdots & \mu(A_k|_j A_k) \end{pmatrix}

Note that for x = λ, where λ is the empty string, inf(ij) = I(P_i; P_j), as in the standard definition:

inf(ij) = \sum_{l=1}^{k} \mu(P_i^l)\, IG(\mu(\vec{P}_j|P_i^l), \mu(\vec{P}_j)) = \sum_{l=1}^{k} \mu(P_i^l) \sum_{t=1}^{k} \mu(P_j^t|P_i^l) \log\frac{\mu(P_j^t|P_i^l)}{\mu(P_j^t)} = \sum_{l,t=1}^{k} \mu(P_j^t \cap P_i^l) \log\frac{\mu(P_j^t \cap P_i^l)}{\mu(P_j^t)\,\mu(P_i^l)}
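As a numerical check of Definition 8 and Example 3 (a sketch using numpy, not part of the paper; the joint distribution is again the one from Figure I), the following computes M_ij, M_iji, and the quoted values inf(ij) ≈ .53 and inf(iji) ≈ .32:

    import numpy as np

    def ig(u, y):
        u, y = np.asarray(u, float), np.asarray(y, float)
        nz = u > 0
        return float(np.sum(u[nz] * np.log2(u[nz] / y[nz])))

    joint = np.array([[0.45, 0.05],          # joint[a, b] = mu(A_a ∩ B_b)
                      [0.05, 0.45]])
    mu_A = joint.sum(axis=1)                 # (.5, .5)
    mu_B = joint.sum(axis=0)                 # (.5, .5)

    M_ij = joint / mu_A[:, None]             # M_ij(a, b) = mu(B_b | A_a) = [[.9, .1], [.1, .9]]
    M_ji = (joint / mu_B[None, :]).T         # M_ji(b, a) = mu(A_a | B_b)
    M_iji = M_ij @ M_ji                      # [[.82, .18], [.18, .82]]

    inf_ij = sum(mu_A[a] * ig(M_ij[a], mu_B) for a in range(2))    # about 0.531
    inf_iji = sum(mu_A[a] * ig(M_iji[a], mu_A) for a in range(2))  # about 0.320
    print(M_iji, round(inf_ij, 3), round(inf_iji, 3))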

6 Properties of transition matrices

The results in this section are either from the theory of Markov chains, or easily derived from them.

Definition 9: A matrix M is stochastic if all elements of M are reals in [0, 1] and the sum of every row is 1.

Fact 2: For every x, the matrix M_x is stochastic.

Definition 10: A matrix M is regular if there is an m such that M^m(k, l) > 0 for all (k, l).

The following fact establishes a connection between regular stochastic matrices and common knowledge:

Fact 3: The matrix M_ixi is regular iff there is no common knowledge between i and the individuals from x.

Fact 4: For every regular stochastic matrix M there is a matrix M′ such that lim_{n→∞} M^n = M′. M′ is stochastic, and all the rows in M′ are the same. Moreover the rate of convergence is exponential: for a given column r, let d_n(r) be the difference between the maximum and the minimum in M^n in that column. Then there is an ε < 1 such that for all columns r and all sufficiently large n, d_n(r) ≤ ε^n.

By combining the last two facts we get the following corollary:

Fact 5: If there is no common knowledge between i and the individuals in x, then lim_{n→∞} (M_ixi)^n = M, where M is stochastic and all rows in M are equal to the vector \vec{u}_i of probabilities of the sets in the partition P_i. A matrix with all rows equal represents the situation that all information is lost and all that is known is the a priori probabilities.

Fact 6: If L, S are stochastic matrices and all the rows of L are equal, then S × L = L, and L × S = L′, where all rows in L′ are equal (though they may be different from those of L).

Fact 7: For any stochastic matrix S and regular matrix M_ixi: S × lim_{n→∞} (M_ixi)^n = M′, where M′ = lim_{n→∞} (M_ixi)^n.

Definition 11: For a given partition P_i and string x = x_1 x_2 ... x_k we can define a relation ≈_x between the partitions P_i and P_j: P_i^m ≈_x P_j^n iff there are w ∈ P_i^m, w′ ∈ P_j^n, and v_1, ..., v_{k-1} such that w ≈_{x_1} v_1 ≈_{x_2} ... v_{k-1} ≈_{x_k} w′.

Definition 12: ≈*_x is the transitive closure of ≈_x. It is an equivalence relation.

Fact 8: Assume that x contains all the individuals j. Then the relation ≈*_x does not depend on the particular x, and we may drop the x. P_i^m ≈* P_j^n iff P_i^m and P_j^n are subsets of the same element of P^−, where P^− is the meet of the partitions of all the individuals.

Observation: We can permute the elements of the partition P_i so that the elements of the same equivalence class of ≈* have consecutive numbers, and then M_ixi looks as follows:

M_{ixi} = \begin{pmatrix} M_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & M_r \end{pmatrix}

where M_l for l ≤ r is the matrix corresponding to the transitions within one equivalence class of ≈*. All submatrices M_l are square and regular.

Note that if there is no common knowledge then ≈* has a single equivalence class. Since we can always renumber the elements of the partitions so that the transition matrix is in the form described above, we will assume from now on that the transition matrix is always given in such a form.

Fact 9: If x contains all the individuals j, then lim_{n→∞} (M_ixi)^n = M, where M is stochastic, the submatrices M_l of M are regular (in fact positive), and all the rows within every submatrix M_l are the same.
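Facts 4 and 5 can be watched happening in the Ann-Bob example (a numpy sketch, not from the paper): M_iji is regular, and its powers converge exponentially fast to the matrix whose rows are the a priori probabilities (.5, .5).

    import numpy as np

    M_iji = np.array([[0.82, 0.18],
                      [0.18, 0.82]])          # regular: all entries already positive

    P = np.eye(2)
    for n in range(1, 11):
        P = P @ M_iji
        spread = float((P.max(axis=0) - P.min(axis=0)).max())   # d_n(r) of Fact 4
        print(n, P[0].round(4), round(spread, 6))
    # The rows approach (.5, .5) and the spread shrinks like 0.64**n,
    # 0.64 being the second eigenvalue of M_iji.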

7 Properties of inf(ixj)

Theorem 3: If there is no common knowledge and x includes all the individuals, then lim_{n→∞} inf(i(jxj)^n) = 0.

Proof: The matrix M = lim_{n→∞} (M_jxj)^n has all rows positive and equal. Let \vec{m} be a row vector of M. Then lim_{n→∞} inf(i(jxj)^n) = IG(\vec{m}, µ(\vec{P}_j)). Since the limiting vector \vec{m} is equal to the distribution µ(\vec{P}_j), we get lim_{n→∞} inf(i(jxj)^n) = IG(µ(\vec{P}_j), µ(\vec{P}_j)) = 0. □

The last theorem can easily be generalised to the following:

Fact 10: If there is no common knowledge among the individuals in x, and i, j occur in x, then inf(ix^n j) goes to zero as n → ∞.
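Theorem 3 can be seen numerically in the same two-agent example (a sketch; there is no common knowledge there since the meet of the partitions is trivial): the information carried along longer and longer Ann-Bob-Ann-... chains decays to zero.

    import numpy as np

    def ig(u, y):
        nz = u > 0
        return float(np.sum(u[nz] * np.log2(u[nz] / y[nz])))

    M_iji = np.array([[0.82, 0.18], [0.18, 0.82]])
    mu_A = np.array([0.5, 0.5])

    M = np.eye(2)
    for n in range(1, 8):
        M = M @ M_iji                        # transition matrix of the chain repeated n times
        inf_n = sum(mu_A[a] * ig(M[a], mu_A) for a in range(2))
        print(n, round(inf_n, 4))            # 0.32, 0.1247, 0.0501, ... -> 0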

8 Probabilistic common knowledge

Common knowledge is very rare. But even if there is no common knowledge in the system, we often have probabilistic common knowledge.

Definition 13: Individuals {1, ..., n} have probabilistic common knowledge if inf(x) > 0 for all x ∈ {1, ..., n}*.

We note that there is no probabilistic common knowledge in the system iff there is some string x such that for some i, M_xi is a matrix with all rows equal and M_xi(·, t) = µ(P_i^t) for all t.


Theorem 4: If there is common knowledge in the system then there is probabilistic common knowledge, and inf(x) ≥ H(P^−) for all x ∈ {1, ..., n}*.

Proof: We know from Fact 9 that

M_{ixi} = \begin{pmatrix} M_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & M_r \end{pmatrix}

where M_l for l ≤ r is the matrix corresponding to the transitions within one equivalence class of ≈*_x, and all submatrices M_l are square and regular. Here r is the number of elements of P^−. Suppose that the probabilities of the sets in the partition P_i are u_1, ..., u_k and that the probabilities of the partition P^− are w_1, ..., w_r. Each w_j is going to be the sum of those u_l where the lth set in the partition P_i is a subset of the jth set in the partition P^−. Let \vec{m}_l be the lth row of the matrix M_ixi. Then inf(ixi) is \sum_{l=1}^{k} u_l IG(\vec{m}_l, \vec{u}). The row \vec{m}_l consists of zeroes, except in places corresponding to subsets of the appropriate element P_j^− of P^−. Then, by Theorem 2, part (a),

IG(\vec{m}_l, \vec{u}) \ge \log\frac{1}{1 - (1 - w_j)} = -\log w_j

This quantity may repeat, since several elements of P_i may be contained in P_j^−. When we add up all the multipliers u_l that occur with log w_j, these multipliers also add up to w_j. Thus we get

inf(ixi) \ge \sum_{j=1}^{r} -w_j \log(w_j) = H(P^-)  □
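To illustrate Theorem 4, here is a made-up example with a non-trivial meet (a sketch, not from the paper): W has four points of probabilities .1, .3, .2, .4, Ann's partition is {{1}, {2}, {3, 4}}, Bob's is {{1, 2}, {3}, {4}}, so the meet P^− is {{1, 2}, {3, 4}}. The expected gain inf(iji) comes out ≥ H(P^−); here it is in fact an equality, because within each block the rows of M_iji coincide with the conditional a priori distribution, the minimum allowed by Theorem 2(a).

    import numpy as np

    def ig(u, y):
        nz = u > 0
        return float(np.sum(u[nz] * np.log2(u[nz] / y[nz])))

    # joint[a, b] = mu(A_a ∩ B_b) for Ann's cells {1}, {2}, {3,4} and Bob's {1,2}, {3}, {4}.
    joint = np.array([[0.1, 0.0, 0.0],
                      [0.3, 0.0, 0.0],
                      [0.0, 0.2, 0.4]])
    mu_A = joint.sum(axis=1)                     # (.1, .3, .6)
    mu_B = joint.sum(axis=0)                     # (.4, .2, .4)

    M_ij = joint / mu_A[:, None]
    M_ji = (joint / mu_B[None, :]).T
    M_iji = M_ij @ M_ji                          # block-diagonal over the meet

    inf_iji = sum(mu_A[a] * ig(M_iji[a], mu_A) for a in range(3))
    w = np.array([0.4, 0.6])                     # probabilities of the meet P^-
    H_meet = float(-np.sum(w * np.log2(w)))
    print(round(inf_iji, 4), round(H_meet, 4))   # both about 0.971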

We can also show:

Theorem 5: If x contains i, j and there is common knowledge between i, j and all the components of x, then the limiting information always exists and lim_{n→∞} inf(i(jxj)^n) = H(P^−).

We postpone the proof to the full paper.

References

[Ab] Abramson, N., Information Theory and Coding, McGraw-Hill, 1963.
[AH] Abadi, M. and Halpern, J., "Decidability and Expressiveness for First-Order Logics of Probability", Proc. of the 30th Annual Conference on Foundations of Computer Science, 1989, pp. 148-153.
[Au] Aumann, R., "Agreeing to Disagree", Annals of Statistics, 4, 1976, pp. 1236-1239.
[Ba] Bacchus, F., "On Probability Distributions over Possible Worlds", Proc. of the 4th Workshop on Uncertainty in AI, 1988, pp. 15-21.
[CM] Clark, H. H. and Marshall, C. R., "Definite Reference and Mutual Knowledge", in Elements of Discourse Understanding, ed. Joshi, Webber and Sag, Cambridge University Press, 1981.
[Dr] Dretske, F., Knowledge and the Flow of Information, MIT Press, 1981.
[Ha] Halpern, J., "An Analysis of First-Order Logics of Probability", Proc. of the 11th International Joint Conference on Artificial Intelligence (IJCAI 89), 1989, pp. 1375-1381.
[HM] Halpern, J. and Moses, Y., "Knowledge and Common Knowledge in a Distributed Environment", Proc. 3rd ACM Conf. on Principles of Distributed Computing, 1984, pp. 50-61.
[KS] Kemeny, J. and Snell, L., Finite Markov Chains, Van Nostrand, 1960.
[Pa] Parikh, R., "Levels of Knowledge in Distributed Computing", Proc. IEEE Symposium on Logic in Computer Science, 1986, pp. 322-331.
[Pa2] Parikh, R., "A Utility Based Approach to Vague Predicates", to appear.
[PK] Parikh, R. and Krasucki, P., "Levels of Knowledge in Distributed Computing", research report, Brooklyn College, CUNY, 1986. Revised version of [Pa] above.
[Sh] Shannon, C., "A Mathematical Theory of Communication", Bell System Technical Journal, 28, 1948. (Reprinted in: Shannon and Weaver, A Mathematical Theory of Communication, University of Illinois Press, 1964.)
