Outlink Estimation for Pagerank Computation under Missing Data

Sreangsu Acharyya, Department of E.C.E., University of Texas, Austin ([email protected])

Joydeep Ghosh, Department of E.C.E., University of Texas, Austin ([email protected])

ABSTRACT

The enormity and rapid growth of the web-graph force quantities such as its pagerank to be computed under a significant amount of missing information, consisting mainly of the outlinks of pages that have not yet been crawled. This paper examines the role played by the size and distribution of this missing data in determining the accuracy of the computed pagerank, focusing on questions such as (i) the accuracy of pageranks under missing information, (ii) the size at which a crawl process may be aborted while still ensuring reasonable accuracy of pageranks, and (iii) algorithms to estimate pageranks under such missing information. The first two questions are addressed on the basis of certain simple bounds relating the expected distance between the true and computed pageranks to the size of the missing data. The third question is explored by devising algorithms to predict the pageranks when full information is not available. A key feature of the proposed “dangling link estimation” and “clustered link estimation” algorithms is that they do not need to run the pagerank iteration afresh once the outlinks have been estimated.

1. INTRODUCTION

Selecting and ordering query results from the over 3 billion hyperlinked pages that now constitute the web graph G(V, E) is a difficult web mining problem of extreme importance, and one in which link analysis plays a key role. Link analysis leverages the fact that the web can be looked upon as a noisy, distributed recommendation system where each page recommends other pages through its outlinks. Pagerank [16] is perhaps the most well known among such algorithms and is used by the popular search engine Google to rank results. The algorithm requires complete knowledge of the graph for which the pageranks are to be computed. This knowledge is acquired by crawling. On the web, however, it is rarely practical to wait long enough for the crawl to have visited all pages that are reachable. Henceforth we call such pages that have not been reached the “uncrawled pages”. It is the effect of these pages that is investigated in this paper. The enormity and dynamic nature of the web-graph, and especially its rapid rate of growth, force link analysis based ranking schemes like Pagerank to operate under a significant amount of outdated and missing data, present in the form of unknown outlinks from uncrawled pages.

The computational intensity of the task of updating the information of the web-graph forces Google to operate under latencies of months between re-crawling and re-ranking. This time lag is sufficiently large that deferring the presentation of search results till all pages have been freshly crawled is impractical, even when done offline. The Stanford WebBase project [9] reports that for a 290 million vertex web subgraph, they base their pagerank computation on some 30 million vertices, as the rest of the vertices either could not be crawled due to time constraints or were crawled but had no outlinks. This naturally raises questions about the accuracy of the pageranks under such severe conditions, for example, how much of the web graph needs to be traversed so that enough faith can be ascribed to the computed pagerank values. In this paper we examine the role played by the size of the missing data and its distribution in determining how accurate and stable the pageranks would be when they have been computed on partial crawls, hoping to answer the following questions: (i) given an incomplete crawl, how accurate are the pageranks, (ii) when is it safe to stop the crawling process and yet guarantee reasonable accuracy, and (iii) how to get better estimates of pageranks under such missing information, and what assumptions are involved. An apparently ad hoc but efficient method was suggested in the original Pagerank paper to deal with the problem of dangling links, posed by vertices having no known outlinks [16]. We show that their proposed scheme, with some minor corrections, can be derived from certain assumptions about the distribution of the unknown outlinks. In this paper, schemes for estimating the unknown outlinks and incorporating them in the pagerank calculation are suggested and compared experimentally.

Figure 1: Crawled (C), uncrawled (C′), and forward (F) sets of pages.

The set of pages Vk known to the pageranking engine consists of pages that have either been crawled or are pointed to by the crawled pages. We would like to know what the relationship between the size of Vk and the crawled vertex set Vc ⊂ Vk should be, such that the calculated pageranks are close enough to the true pageranks. Here the term “true pageranks” is used in a well-defined sense explained later. As a measure of “close enough” we propose that the distance between the pageranks be of constant order, i.e. given several instances of Vc and Vk of growing sizes, the distance between true and estimated pageranks should not grow similarly. This is explored through very simple probabilistic bounds relating the expected distance between the true pageranks and those calculated under missing information. The focus here is not on getting the tightest bounds, but on those which are simple and easily computable. The outlinks of a page can be represented as a |V|-dimensional binary vector, where the ith component indicates the presence or absence of a link to page i. We show that, under the assumption that the outlink vector is chosen uniformly at random, this distance is of constant order if the size of the known but uncrawled set of pages, i.e. |Vk| − |Vc|, is O(√Nk), where Nk is the size of the known set. This tells us when it is safe to rely on the calculated pagerank values. Studies on the accuracy of such rankings under uncertainty should ideally be based on the distribution from which the local distributions are drawn, but as a preliminary study we base it on the simplifying assumption that the individual preferences are sampled from a uniform distribution, though it is known that they are biased towards sparse distributions, in which case the distance between the correct and the estimated pageranks is even higher. In the following section we formalize the missing information under which the Pagerank algorithm operates. In Section 3 the effects of the missing information on pagerank calculations are quantified, illustrating that they can be quite detrimental unless its size is small compared to the size of the known vertex set. Having established that missing information may pose accuracy issues, we examine schemes aimed at mitigating the effect in Sections 4, 5 and 6, following which some experimental results are presented in Section 7.

2. PAGERANK ITERATION AND INCOMPLETE DATA

The Pagerank algorithm ranks web documents according to the stationary distribution of a random surfer traversing the directed web graph G(V, E). The surfer is assumed to be Markovian and traverses the web graph mostly by following outlinks, with occasional jumps to a random page on the web. The surfer either picks one outlink from the current page uniformly at random or resets to a random page on the web, with probabilities 1 − β and β respectively. The escape probability β was introduced as a mechanism to ensure non-zero pageranks and irreducibility of the resulting Markov chain [16], which is required for the Markov chain to have a unique stationary distribution. If A is the adjacency matrix of the graph induced by the known vertices of the web graph and D_out the diagonal matrix formed by the out-degrees, the transition matrix T of the surfer can be expressed in terms of the out-degree normalized adjacency matrix nA = D_out^{-1} A and the random jump probability β as

T = \begin{bmatrix} (1-\beta)\, {}_{n}A & e' \\ 1/N \;\cdots\; 1/N & 0 \end{bmatrix}   (1)

where the components of the vector e' are given by

e'_i = \begin{cases} \beta & \text{if row } i \text{ of } A \neq 0 \\ 1 & \text{otherwise.} \end{cases}   (2)
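As a concrete illustration of this construction, the following minimal NumPy sketch builds the (N+1)-state surfer chain of equations (1)-(2), where the extra state is the random-jump node, and runs a plain power iteration for its stationary distribution. It assumes a small dense 0/1 adjacency matrix; a real crawl would of course use sparse data structures.

```python
import numpy as np

def surfer_chain(A, beta=0.15):
    """Build the (N+1)-state transition matrix of equations (1)-(2).
    Row i of a crawled page: (1-beta) * normalized outlinks, plus e'_i = beta
    into the jump node; a dangling row gets e'_i = 1.  The jump node spreads
    its mass uniformly over the N pages."""
    N = A.shape[0]
    outdeg = A.sum(axis=1)
    nA = np.divide(A.astype(float), outdeg[:, None],
                   out=np.zeros((N, N)), where=outdeg[:, None] > 0)
    e = np.where(outdeg > 0, beta, 1.0)           # e'_i of equation (2)
    T = np.zeros((N + 1, N + 1))
    T[:N, :N] = (1.0 - beta) * nA
    T[:N, N] = e
    T[N, :N] = 1.0 / N
    return T

def pagerank(T, iters=100):
    """Power iteration for the stationary distribution pi, i.e. pi = T^T pi."""
    pi = np.full(T.shape[0], 1.0 / T.shape[0])
    for _ in range(iters):
        pi = pi @ T
        pi /= pi.sum()
    return pi
```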

The Pagerank iteration converges to the primary eigenvector π of T^T or, equivalently, the steady-state probabilities of the Markov chain defined by it. The convergence rate is proportional to λ1/λ2, the ratio of the first and second eigenvalues of T [16]. In order to compute the pageranks, it becomes necessary for the pageranking engine to crawl or traverse the web graph to reconstruct it, a process which is both incomplete and error prone because of the size and dynamic nature of the web. Thus pertinent questions to ask are: would the computed ranks be the same as those calculated had the entire graph G(V, E) been available, or, if we are somewhat less demanding, would the rankings be the same if we assume that the vertices of the web-graph consist only of Vk, the pages that are currently known. We first need to define what we mean by the true pagerank of a subgraph. Correctness of the pagerank computed on the reconstructed subgraph is defined as:

Definition 1. Given a subset Vk of the vertices of the web-graph G(V, E), the true pageranks of Vk are defined to be those that are calculated on the subgraph G′(Vk, Ek) induced by the vertices Vk, i.e. G′ contains all and only those edges xy ∈ E such that x, y ∈ Vk.

At any stage of operation of a pageranking engine, the entire set of web pages V can be divided into the crawled set C and its complement, the uncrawled set C′. We define the forward set of C as F = {p : ∃(q ∈ C) | (q, p) ∈ E}. Henceforth the shorthand q → p is used to denote (q, p) ∈ E. For pages in the set of known but uncrawled pages FC′ = {F ∩ C′} the pageranking engine has no knowledge of their outward links. This constitutes the missing information under which the ranking scheme has to operate. We call the set {C ∪ FC′} the known set Vk and represent its cardinality by Nk = |Vk|. The set C is used to build up the incomplete adjacency matrix A_incomplete. Without any loss of generality one can consider the last m rows of the matrix A_incomplete to belong to the known uncrawled pages FC′, where m = |FC′|. The matrix can now be represented as

A_{incomplete} = \begin{bmatrix} a & v \\ 0 & 0 \end{bmatrix},

where the fully specified adjacency sub-matrix a is square of size (Nk − m) and v is the matrix of inlink vectors of size (Nk − m) × m. In the absence of knowledge about the outlinks of the m uncrawled pages, the matrix A_incomplete is used to calculate the approximate pagerank r as the left eigenvector of the transition matrix

T_{incomplete} = \begin{bmatrix} t & w \\ * & * \end{bmatrix} \simeq (1-\beta)\, D_{effective}^{-1} A_{incomplete} + \beta \left( \tfrac{1}{N} \right) \mathbf{1} \mathbf{1}^T

induced by appropriate normalization of A_incomplete. There are various ways of imputing the missing values denoted by the asterisks above. The original Pagerank paper [16] refers to the problem of having an adjacency matrix with null rows as the problem of dangling links. There they take an apparently ad hoc approach of ignoring the last m rows and working with a only. Once the stationary distribution of the induced transition matrix t has been computed, the final ranks are computed by running a fixed number of iterations of the Pagerank algorithm applied to the vertices in FC′ to estimate their pageranks. We show that a single iteration suffices under certain assumptions. Other possibilities that have been suggested include replacing the last m rows by the uniform distribution {1/Nk, 1/Nk, · · · , 1/Nk}, a baseline used for experimental comparison. Whatever the method, it requires a principled justification. Specifically, one needs to make careful note of all the implicit or explicit assumptions made about the missing data, especially because the actual outlinks contained in these known uncrawled pages may have a significant impact on the pagerank of other pages, especially if they have high pagerank and point to other pages [15]. Only if the pages do not have any outlinks will the estimated pagerank be the same as the correct pagerank. This paper examines schemes with which one may apportion a pagerank value to the pages in FC′ in a way that incorporates the unknown outlinks. The first scheme shows how, and under what assumptions, the method of handling dangling links used in [16] can be derived. The second estimates the conditional probability of outlinks given a unit neighborhood of the page, under naive assumptions of independence. A third scheme estimates a latent variable such that it renders the outlinks from a page statistically independent. Any scheme used for predicting the missing outlinks requires some extra computation. In the next section, we show whether this extra computation is justified or necessary. For that we use results to quantify the effect of perturbations on the pagerank vector. It is tempting to believe that for large graphs like that of the web, missing or noisy data would not distort the pagerank significantly. But this depends a lot on the size of the set that has been perturbed, i.e. FC′, as we shall show in the next section.
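For reference, the uniform-distribution baseline mentioned above amounts to a one-line imputation; the sketch below is a hypothetical helper, not the authors' code, which fills the m unknown rows of the known transition structure with 1/Nk before the surfer chain is formed.

```python
import numpy as np

def impute_uniform(A_incomplete, m):
    """Baseline imputation: normalize the crawled rows and replace the last m
    rows (the uncrawled pages in F_C') with a uniform distribution over the
    Nk known pages."""
    Nk = A_incomplete.shape[0]
    P = A_incomplete.astype(float).copy()
    outdeg = P.sum(axis=1)
    crawled = outdeg > 0
    P[crawled] /= outdeg[crawled, None]
    P[Nk - m:] = 1.0 / Nk
    return P
```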

3. PERTURBATION BOUNDS

Sensitivity of pagerank to missing data is an important and valid concern because of the significant amount of missing data under which it operates. In order to have faith in the pagerank values produced, one should be in a position to quantify or bound the deviation of the computed pageranks from the actual ones. In this section the sensitivity of the pageranks to perturbations of the rows of the induced transition matrix is examined.

Definition 2. Robustness: Given an incomplete transition matrix of size N, and a distribution p(·) from which the unspecified rows (normalized outlink vectors) have been drawn, the resulting pageranks are called robust if the expected distance between the calculated and true pageranks is of order O(1).

Later, we go on to prove the following proposition, under the condition that p(·) is uniform.

Proposition 1. For outlinks drawn from a uniform distribution, the pagerank computation is robust if the size of the perturbed set is of the order O(√Nk).

This is of some concern, because during the initial phase of crawling the size of the unknown set grows linearly in Nk. The rate decays only after a majority of the vertices of the web graph have been crawled. In order to explain our bounds, we will need to consider the fundamental matrix of a Markov chain [10] and its properties. Let T1 be the irreducible transition matrix induced by the adjacency matrix A. Suppose that the original matrix T1 is perturbed to T2, resulting in a shift of the stationary probability from π1 to π2. In order to quantify the effect of this perturbation, let us examine the fundamental matrix M = [I − T_1^T + T_1^{∞T}], where T^∞ = lim_{n→∞} T^n and T^T denotes the transpose of the matrix T, having the property

M(\pi_2 - \pi_1) = (T_2^T - T_1^T)\, \pi_2.   (3)

Lemma 1. The L2 matrix norm of M^{-1} can be expressed in terms of the second eigenvalue λ2 of the transition matrix T as ||M^{-1}||_2 = 1/(1 − λ2).

From equation (3) we get Δπ = M^{-1} ΔT^T π2, therefore

E[\|\Delta\pi\|_2] \le \|M^{-1}\|_2\, E[\|\Delta T^T\|_2]\, \|\pi_2\|_2 \le \frac{1}{1-\lambda_2}\, E[\|\Delta T^T\|_2] \cdot 1 \le \frac{1}{1-\lambda_2}\, E[\|\Delta T^T\|_1] \le \frac{m}{1-\lambda_2}\, E[D],   (4)

where E[D] is the expected L1 distance between two columns of T and m = |FC′|. Notice that each column of T^T is an Nk-dimensional probability distribution. Assuming that the simplex is sampled uniformly and that Nk is high enough for approximation by continuous distributions, one can evaluate the expected distance E[D] between two Nk-dimensional discrete probability vectors sampled uniformly at random over the unit Nk regular simplex. This is non-trivial, as the components of the vector cannot be assumed to be independent. Even for small dimensions the multiple integral

D = E[D] = \int \cdots \int_{\sum_i p_i = 1} \|x - y\|_1 \, dp

becomes difficult to compute directly. To avoid this, a computational trick introduced by Crofton in the field of geometrical probability in 1877 [14] is used; alternatively, one could use Dirichlet integrals.

Proposition 2. The average distance E[D] between two uniformly sampled discrete probability distributions of dimension N is E[D] = 2(N − 1)^{3/2} / (N(2N − 1)).

The following simple lemma is needed for our proof.

Lemma 2. For two disjoint sets A and B of finite dimensional and bounded vectors, the expected L1 distance E_{x∈A, y∈B}[d(x, y)] = d(x̄, ȳ).

We prove Proposition 2 as follows.

Proof. Let the vertices of the N regular simplex be {(l, 0, · · · , 0), (0, l, · · · , 0), · · · , (0, · · · , 0, l)}. The volume V(N, l) of this simplex S0 is

V(N, l) = \frac{(\sqrt{2}\, l)^{N-1}}{(N-1)!} \left( \frac{N}{2^{N-1}} \right)^{1/2}.

Enlarge the simplex S0 by making an infinitesimal change Δl in the lengths along the coordinate axes. The infinitesimal volume added can be divided into N strips on each of the N faces S1 · · · SN of the simplex. We evaluate the average distance as follows. Let the two vectors be distributed uniformly over the enlarged simplex. We consider the cases where

• both vectors are in S0,
• one vector is in S0 and the other is in Si, i ≠ 0,
• both vectors are in Si, i ≠ 0,

so that E[D] = Σ_i E[D | Ai] P(Ai). Therefore, using the shorthand D for E[D],

D(l + \Delta l) = D(l) \left( \frac{V(N, l)}{V(N, l+\Delta l)} \right)^2 + N \cdot \frac{V(N-1, l)}{V(N, l+\Delta l)} \cdot \frac{V(N, l)}{V(N, l+\Delta l)} \cdot \frac{\Delta l}{\sqrt{N}} \cdot E_{x \in S_0, y \in S_1}[d(x, y)].

Using Lemma 2, E_{x∈S0, y∈S1}[d(x, y)] = 2l/N. Substituting the values we obtain

D(l + \Delta l) = D(l) \left( 1 - 2(N-1)\frac{\Delta l}{l+\Delta l} \right) + \frac{2(N-1)}{\sqrt{N}} \left( \frac{N-1}{N} \right)^{1/2} \left( 1 - 2(N-1)\frac{\Delta l}{l+\Delta l} \right) \Delta l.

Re-arranging and taking lim_{Δl→0}, this simplifies to

\frac{\partial D}{\partial l} = 2\, \frac{N-1}{N}\, (N-1)^{1/2} - 2\, \frac{(N-1) D}{l}.

Solving the differential equation, we obtain

E[D] = \frac{2(N-1)^{3/2}}{N(2N-1)},

which is upper bounded by 1/√N for all positive values of N, thus proving Proposition 1.
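To make the bound concrete, the following sketch simply evaluates the right-hand side of (4) with the closed form of Proposition 2 plugged in for E[D]; the value of λ2 used here is a made-up placeholder, since the second eigenvalue depends on the particular graph.

```python
def pagerank_shift_bound(N_k, m, lambda2=0.85):
    """Right-hand side of bound (4): m * E[D] / (1 - lambda2), with E[D]
    taken from Proposition 2 for dimension N_k."""
    E_D = 2.0 * (N_k - 1) ** 1.5 / (N_k * (2 * N_k - 1))
    return m * E_D / (1.0 - lambda2)

# With m ~ sqrt(N_k) the bound stays roughly constant (Proposition 1);
# with m growing linearly in N_k it grows roughly like sqrt(N_k).
for N_k in (10_000, 40_000, 160_000):
    print(N_k, pagerank_shift_bound(N_k, int(N_k ** 0.5)),
          pagerank_shift_bound(N_k, N_k // 10))
```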

So far we have established the expected behavior of the distance between the true and the computed pageranks. But it might be the case that the expected behavior is not representative. Such doubts can be mitigated only by concentration bounds, which are presented below. Using equation (4) and the following well-known lemma, which we state without proof, we show the concentration of the distribution about its mean.

Lemma 3. [13] Let X = (X1, X2, · · · , Xn) be a set of independent random variables with Xi taking values in R+. Suppose that a real valued function f : (R+)^n → R satisfies ||f(X) − f(X′)|| ≤ ck whenever the vectors X and X′ differ only in the kth coordinate. Let µ be the expected value of the random variable f(X). Then, for any t ≥ 0,

Pr\{|f(X) - \mu| \ge t\} \le e^{-2t^2 / \sum_k c_k^2}.

The L1 norm of a matrix is given by the maximum absolute column sum of the matrix. Considering the perturbations in each column of the matrix ΔT to be independent, and observing that the change in the L1 norm is bounded by 2 for a change in any one of the columns, we obtain

Pr\{\|\Delta T\|_1 \ge t\} \le e^{-(t - E\|\Delta T\|_1)^2 / 2m}.

Using equation (4) this can be expressed as

Pr\left\{\|\Delta\pi\|_1 \ge \frac{\sqrt{2}}{1-\lambda_2}\, t \right\} \le e^{-(t - E\|\Delta T\|_1)^2 / 2m} \le e^{-(t - m E[D])^2 / 2m},

which shows that the bounds are exponentially tight around the mean as long as m is of order O(N). From a practical standpoint, however, the assumption of a uniform distribution of outlinks is debatable. Web statistics suggest that outlinks are a lot sparser than those generated by uniform sampling over the entire regular unit N-simplex. A closer approximation would be that they are sampled from the lower dimensional simplices that constitute the boundary of our original N-dimensional simplex. This would increase the expected distance for a distribution uniform over the lower dimensional simplex.

4. DANGLING LINK ESTIMATION

Having established that significant error might be introduced into the pagerank values because of the perturbations entailed by the missing entries, we take a look at prediction schemes aimed at reducing this effect. Unlike the method where the unknown rows of the transition matrix are filled by a non-committal uniform distribution, one may use the expected distribution, i.e. the distribution of the state transition events averaged over infinite time, which under mild conditions converges to the stationary distribution, i.e. the pagerank values. Replacing unknown values by their expectation is a standard method of imputation [12]; in this case, however, we can argue for it even more strongly by drawing upon studies of the power law nature of web graphs [11] [1]. It has been shown that preferential attachment of links is crucial for explaining such power laws. A model where web vertices link to other vertices with probability proportional to the pagerank values generates power laws similar to those observed in practice [17]. One is obviously tempted to run this scheme iteratively, where the next pagerank estimates are computed by replacing the unknown rows by the current pagerank values, until convergence. Thus we are looking for a vector (equivalently, a probability distribution) r such that if we substitute it in place of the unknown rows we get back r as our pagerank, i.e. r is a fixed point. The fixed point can, however, be computed analytically without the need for such iterative updates.

Proposition 3. Calculating the pagerank of the fully specified transition sub-matrix t, followed by a single pagerank iteration on the incomplete matrix, provides a valid pagerank under the assumption that the unknown rows have the same distribution as the converged pagerank vector.

Proof. Partitioning the vector r conformally with T̂_incomplete, the condition [T̂_incomplete]^T [r] = [r] becomes

\begin{bmatrix} t^T & r_1 \\ w^T & r_2 \end{bmatrix} \begin{bmatrix} r_1 \\ r_2 \end{bmatrix} = \begin{bmatrix} r_1 \\ r_2 \end{bmatrix}.

For simplicity we have taken w to be a vector, as a result of which r2 is a scalar. This does not, however, change the analysis or the final answer. The above leads to two equations:

t^T r_1 = (1 - r_2)\, r_1   (5)

w^T r_1 + r_2^2 = r_2   (6)

Let r_incomplete be the L1 normalized pagerank of the sub-matrix t^T. From equation (5) we can see that r1 is nothing but the eigenvector of t^T and is proportional to the L1 normalized pagerank r_incomplete calculated from the incomplete transition matrix. Let r1 = λ r_incomplete. This allows us to rewrite equation (6) as

\lambda\, w^T r_{incomplete} + r_2^2 = r_2   (7)

\lambda\, r_{incomplete}^T \mathbf{1} + r_2 = \lambda + r_2 = 1   (8)

Equation (8) follows from the condition of L1 normalization. Substituting the value of λ into equation (7), we obtain the quadratic equation

(r_2 - w^T r_{incomplete})(r_2 - 1) = 0,

from which we obtain the solution

r_2 = w^T r_{incomplete}, \qquad r_1 = (1 - w^T r_{incomplete})\, r_{incomplete}.

This is exactly equivalent to calculating the pagerank by ignoring the missing rows and running a single round of the pagerank iteration with the missing rows after convergence.
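A minimal sketch of this Dangling Link (DL) scheme, assuming the crawled-to-crawled block t and a single aggregated column w into the uncrawled state (w a vector, as in the proof above):

```python
import numpy as np

def dangling_link_pagerank(t, w):
    """DL estimation: get the stationary vector of t^T, then apply the closed
    form r2 = w^T r_inc, r1 = (1 - w^T r_inc) * r_inc, so no fresh pagerank
    iteration over the full incomplete matrix is needed."""
    vals, vecs = np.linalg.eig(t.T)
    r_inc = np.real(vecs[:, np.argmax(np.real(vals))])
    r_inc = np.abs(r_inc) / np.abs(r_inc).sum()   # L1-normalized pagerank of t^T
    r2 = float(w @ r_inc)
    r1 = (1.0 - r2) * r_inc
    return r1, r2
```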

5. NAIVE LINK ESTIMATION

Considering the last m rows as m random L1 normalized vectors z ∼ p(z), we represent the incomplete transition matrix as

T_z = \begin{bmatrix} t & w \\ z & 0 \end{bmatrix}

and the pageranks r(T(z)) as a function of these random variables, where r is a vector valued matrix function, r : R^{N×N} → R^N, defined as the principal eigenvector of the random transition matrix T(z). In the absence of further information one should return the expected pageranks E_{p(z)}[r(T(z))] in order to minimize the expected error (measured in the L2 sense). However, this is difficult to calculate or bound. We use r̂ = r(E_{p(z)}[T_z]) as a first order Taylor approximation. In the Naive Link estimation method we temporarily relax the componentwise dependence condition to arrive at an estimation scheme; subsequently, however, an algorithm will be proposed which does not require such stringent conditions of independence.

In order to compute r̂ we need to estimate E_{p(z)}[z] from the available data, i.e., a. Here we assume that each row is independent of the others and that each component of z is independent, which simplifies our task to estimating E_{p(z_ij|a)}[z_ij|a]. Estimation of p(z_ij|a), a multivariate distribution in |a| parameters, is a formidable task because of the size of a and the associated curse of dimensionality. This can be simplified by the Markovian assumption whereby the probability is conditionally independent of the rest of the graph given a small neighborhood of the page. Fixing the size of the neighborhood to one, we need to consider only the in- and outlinks of the page to determine the probability distribution over the incomplete data. Given that a page receives inlinks from {v_i},

p(z_{ij} \mid a) = p(z_{ij} \mid \{v_i\}, z_{i\kappa}) \;\; \forall \kappa \ne j \; = p(z_{ij} \mid \{v_i\}).

The parameters of this model can then be estimated through the maximum likelihood principle, which reduces in this case to keeping track of frequency counts, as sketched below.

Figure 2: Predicting outlinks for uncrawled page P, which has multiple inlinks. Consider an incomplete breadth first crawl. The uncrawled page P is linked to by two crawled pages A and B. The common outlink neighbor of A and B is Q. The predicted outlink of page P is a linear combination of the outlinks of the common page(s) Q_i. If there are no common pages the predicted outlink is a null vector.
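A toy sketch of the idea in Figure 2 is given below: the outlink distribution of an uncrawled page P is predicted as a linear combination of the outlink distributions of crawled pages Q that are co-cited with P by its in-neighbors. The uniform weighting is an assumption of this sketch, not something fixed by the paper.

```python
from collections import defaultdict

def naive_outlink_estimate(outlinks, in_neighbors_of_P):
    """Naive Link (NL) sketch: `outlinks` maps each crawled page to the set of
    pages it links to; `in_neighbors_of_P` are crawled pages known to link to
    the uncrawled page P.  Returns a dict j -> estimated probability P -> j."""
    siblings = set()
    for v in in_neighbors_of_P:            # A, B in Figure 2
        for q in outlinks.get(v, ()):      # candidate common pages Q
            if q in outlinks:              # keep only crawled pages with known outlinks
                siblings.add(q)
    estimate = defaultdict(float)
    for q in siblings:
        for j in outlinks[q]:              # spread q's outlink distribution uniformly
            estimate[j] += 1.0 / (len(outlinks[q]) * len(siblings))
    return dict(estimate)                  # empty (null vector) if there are no common pages
```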

5.1 Sparsity and Smoothing

The small size of the sample a compared to the size of the complete web-graph, together with the sparsity of z, makes maximum likelihood a poor choice of estimator for the problem described above. Smoothing is a technique applied to compensate for its undesirable effects by apportioning a specified amount of probability mass to the unobserved events and distributing it among them; it has been motivated from different models, for example regularization and its equivalent Bayesian models, or mixture models [18]. Noting that the state transition probabilities are multinomial, one may assume Dirichlet priors with hyper-parameters α_i, which form a conjugate prior to the multinomial distribution, leading to a Bayesian mean a posteriori estimate of (n_i + α_i) / Σ_i (n_i + α_i), where n_i is the frequency of occurrence of z_i.

A popular smoothing technique based on mixture models is Jelinek-Mercer smoothing [8]. The probability of page i linking to page j, conditioned on the set of vertices {v_i} linking to page i, is, according to Jelinek-Mercer, the mixture distribution

P_{JM}(i \to j \mid \{v_i\} \to i) = \nu \cdot P_{ML}(i \to j \mid \{v_i\} \to i) + (1-\nu) \cdot P_{ML}(i \to j) = \nu \cdot P_{ML}(i \to j \mid \{v_i\} \to i) + (1-\nu) \cdot r(j),

where the subscript ML denotes the maximum likelihood estimate. In the context of web graphs, Jelinek-Mercer smoothing has a direct interpretation as the mixture of preferential and non-preferential attachment of links [17] that has been used to explain and match the power laws observed on the web. With an appropriate choice of parameters the two smoothing models are in fact equivalent, as seen by taking ν = Σ_i n_i / Σ_i (n_i + α_i) and α_i ∝ r_i.
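As a small illustration, the mixture above is a one-liner once the conditional maximum-likelihood estimate and the global (pagerank) distribution are available; ν here is a hypothetical mixture weight.

```python
def jelinek_mercer(p_ml_conditional, pagerank, nu=0.8):
    """Mix the conditional ML outlink estimate of a page with the global
    pagerank distribution; both inputs are dicts mapping page -> probability."""
    pages = set(p_ml_conditional) | set(pagerank)
    return {j: nu * p_ml_conditional.get(j, 0.0) + (1.0 - nu) * pagerank.get(j, 0.0)
            for j in pages}
```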

6. CLUSTERED LINK ESTIMATION

In the previous section we examined a naive scheme for predicting the outlinks by assuming independence among them. Such assumptions have been effective for modeling the Bayes decision boundary in classification tasks, but are poor estimators of the actual probability values [4]. In this section we would like to incorporate some level of dependence into the model and yet not sacrifice computational efficiency. In this new setup we assume, more realistically, that the outlinks are no longer independent of each other, but that there exists an underlying latent variable Z the knowledge of which would render the information from other links redundant. In other words we assume that the links are conditionally independent given the latent variable Z, which takes values in a much smaller set than that of the outlinks. Such latent variable based modeling is not new [5], and has been popularized in the machine learning community by Hofmann through his models of dyadic and co-occurrence data [7] and the application of those ideas to text mining using PLSI [6]. We explain this model briefly and comment on its limitation when applied to a dynamic scenario such as ours, thereby motivating a different but related model which does not suffer from this problem. Hofmann's model is easiest to understand in the document/word setup, though a word/word co-occurrence model could equally be considered. Consider the set of all words X = {x_i} in a dictionary, members of which occur in each document y present in the document set Y = {y_j}. The requirement now is to predict which members of X a member of Y would contain. This is specified by two distributions involving a latent variable Z: (i) the distribution over the latent variable Z, specified a priori as well as a posteriori, and (ii) a (multinomial) distribution over X given an instance of the latent variable Z. As long as the members of the set X are known and fixed, this technique is very effective, as shown by its success in the text domain. The EM framework can be used to estimate the set of distributions, which constitutes the parameters of the model. An application of the same framework to model the outlinks seems straightforward and has been tried [2], where the set X now represents the set of urls or, as in [3], the product space of words and urls suitably decomposed assuming independence. Such a model, however, is of limited use in a setting where the set of urls is continuously growing. Unlike words in a dictionary, it is not possible to capture the set of all possible urls that would ever be present on the web, making it impossible to define the multinomial distribution over it. So if we are to apportion any probabilities to and from these new links, one has to perform the entire estimation process every time the set X changes. It should also be noted that the set of words is not static either, but its rate of change is slow enough for it to be considered so. We would like to make use of the helpful concept of conditional independence given a latent variable, but design our model in such a way that allows the set X to grow or change. Note that the scope of this model is not limited to outlink prediction alone; for example, it would be very useful in solving the cold start problem in recommendation and collaborative filtering applications. The model is introduced in the next subsection.

6.1 Formulation and Algorithm

Consider the set of current urls X, members of which may be contained in a web document y drawn from a set Y. Note that for this particular application the sets X and Y are identical, but we continue to use both X and Y in order to distinguish whether we are talking of y as a container of outlinks or of the links themselves. The task at hand is to predict whether web document y links to another web document by including a link x. This is modeled as follows. Each web document y_i has the same a priori distribution over the space of paired latent variables (ZO, ZI) of fixed cardinality K, (K = |ZI| · |ZO| ≪ |Y|), which is used to draw a sample zO(y_i), zI(y_i) independently of other pages. The probability that y_i links to y_j is given in terms of the joint distribution p(zO, zI). The link probabilities can be calculated using independence and the law of total probability as

P(y_1 \to y_2) = \sum_{zO, zI} P(y_1 \to y_2 \mid zO(y_1), zI(y_2)) \cdot P(zO(y_1), zI(y_2)) = \sum_{zO, zI} P(zO(y_1), zI(y_2))\, P(y_1 \mid zO(y_1)) \cdot P(y_2 \mid zI(y_2)).   (9)

Note that we need two latent variables ZO and ZI to have a well defined joint distribution P(ZO, ZI). However, for the model to be useful for predicting the outlinks of a page with known inlinks, we pose the model in terms of a single random variable Z by introducing the constraint that the marginals of ZO and ZI are identically distributed. This constraint also has another significance: unlike a general joint distribution over discrete random variables, this one can be mapped directly into a Markov chain on K states, thereby making it possible to compute the pagerank in the coarser representation and to estimate the final pageranks from it. Examining our comment above in some more detail, we observe that the pageranks are determined by the conditional probability of a uniform random walk on the web-graph, which in turn has a well defined joint distribution P(y_i → y_j) of which the pageranks are the marginals. Because of this restriction, the row and column marginals of P(y_i → y_j) must be identically distributed, which can be shown to be ensured by a similar constraint of identically distributed marginals on the joint distribution P(Z, Z) of the latent variable Z.

Proposition 4. If the joint distribution P(y_1 → y_2) can be factored as in (9) with identically distributed marginals for ZO and ZI, then P(y_1 → y_2) has identically distributed row and column marginals and corresponds to a Markov chain.

In the absence of multiple samples of transitions from the joint distribution P(y_i → y_j), the web-graph itself is used as a sample, such that each directed edge denotes a transition from the source vertex to the destination, and is used to estimate the joint distribution subject to the conditional independence entailed by the latent variable Z as described above. More important is the fact that estimating P(Z, Z) implicitly precomputes the pagerank of future pages that may be added to the web-graph, as will be shown shortly. Note that the decomposition of the joint distribution P(y_1 → y_2) as in equation (9) requires a distribution over the set Y; therefore it still does not address the problem of computing pageranks and outlink distributions for pages added dynamically to the set Y, which was our main argument against using the models proposed in [7]. At this point we would like to remind the reader that our objective here is to estimate the unspecified rows of the pagerank matrix T, i.e. the conditional distribution table P(y_2|y_1), and its corresponding stationary distribution r as the new pagerank vector. Let us examine the quantity P(y_2|y_1) in the light of the latent variable decomposition above. For notational brevity we use z_i to denote z(y_i).

P(y_2 \mid y_1) = \frac{P(y_1 \to y_2)}{P(y_1)} = \frac{1}{P(y_1)} \sum_{z_1} \sum_{z_2} P(y_1|z_1) P(z_1) \cdot P(z_2|z_1) P(y_2|z_2) = \sum_{z_1} \sum_{z_2} P(z_1|y_1) P(z_2|z_1) \cdot P(y_2|z_2) = P(y_2) \cdot \sum_{z_1} \sum_{z_2} P(z_1|y_1)\, \frac{P(z_2, z_1)}{P(z_1) P(z_2)}\, P(z_2|y_2)   (10)

Now, except for P(y_2), which is essentially the pagerank of y_2, no other term assigns probabilities over the set Y, and each can therefore be modeled by a fixed set of parameters irrespective of whether the set Y changes or not; this property allows us to use the model even in a dynamic scenario. However, we do seem to have a chicken and egg problem, because to estimate the pagerank P(y_2) we need the transition probabilities P(y_2|y_1), which in turn require P(y_2). Before resolving the issue, let us express the equation above in a more compact matrix notation: let Λ[1, 2] = P(z_2, z_1) / (P(z_1) · P(z_2)), U[i, j] = P(Z(y_j) = i | y_j), R be the diagonal matrix with R[i, i] = P(y_i) = r_i, and r[i] = P(y_i). Using equation (10) above and the stationarity property of the pageranks, one has

T^T r = R\, [U^T \Lambda^T U]\, r = r \quad \text{or} \quad [U^T \Lambda^T U]\, r = R^{-1} r = \mathbf{1}.   (11)
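Equation (11) is an Nk-dimensional linear system in r once Λ and U are known; as the next paragraph notes, it can be attacked with an iterative solver such as Gauss-Seidel. A generic dense sketch of one such solver (the matrices here are placeholders, and convergence is only guaranteed for suitable systems):

```python
import numpy as np

def gauss_seidel(M, b, iters=50):
    """Plain Gauss-Seidel sweeps for M x = b."""
    x = np.zeros(len(b))
    for _ in range(iters):
        for i in range(len(b)):
            s = M[i] @ x - M[i, i] * x[i]   # contribution of all other unknowns
            x[i] = (b[i] - s) / M[i, i]
    return x

# e.g. with the estimated K x K matrix Lambda and K x Nk matrix U of the text:
# M = U.T @ Lambda.T @ U                    # Nk x Nk system matrix of equation (11)
# r = gauss_seidel(M, np.ones(M.shape[0]))
```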

Given the matrices Λ and U, the linear equation above in |Y| = Nk unknowns can be solved using an iterative method, for example Gauss-Seidel, in which case the computational complexity of each iteration matches that of running pagerank iterations on Nk pages. In the subsequent paragraphs we examine how one can obtain the matrices Λ and U. We use the maximum likelihood principle to estimate the parameters of the model. However, the hidden variables Z make it difficult to maximize the likelihood directly, prompting the use of the EM method, which alternates between (i) estimating the posterior probabilities of Z and (ii) finding the joint distribution P(Z, Z) by maximizing the expected log likelihood, to reach a local maximum of the likelihood function. The E step consists of using Bayes law to estimate the posteriors; the M step consists of a constrained maximization step which reduces to the solution of a set of linear equations, as explained in the algorithm below. For each y_i, y_j in the edge set E of the graph G(V, E), with r_i the pagerank of the vertex y_i and d_i its degree, the difference in the log likelihood at parameter values Θ2 and Θ1 is given by

L(\Theta_2) - L(\Theta_1) = \sum_{y_i, y_j \in E} \frac{r_i}{d_i} \log \frac{P(y_i \to y_j | \Theta_2)}{P(y_i \to y_j | \Theta_1)} = \sum_{y_i, y_j \in E} \frac{r_i}{d_i} \log \left( \sum_z \frac{P(y_i \to y_j, z | \Theta_2) \cdot P(z | \Theta_1)}{P(y_i \to y_j, z | \Theta_1)} \right) \ge \sum_{y_i, y_j \in E} \frac{r_i}{d_i} \sum_z P(z | \Theta_1) \log \frac{P(y_i \to y_j, z | \Theta_2)}{P(y_i \to y_j, z | \Theta_1)} = \sum_{y_i, y_j \in E} \frac{r_i}{d_i} \sum_{z(y_i), z(y_j)} P(z(y_i) | y_i, \Theta_1)\, P(z(y_j) | y_j, \Theta_1) \cdot \log \frac{P(z(y_i) \to z(y_j) | \Theta_2)}{P(z(y_i) \to z(y_j) | \Theta_1)}.   (12)

Thus it suffices to maximize the following expression with respect to Θ2:

\sum_{y_i, y_j \in E} \frac{r_i}{d_i} \sum_{z(y_i), z(y_j)} P(z(y_i) | y_i, \Theta_1)\, P(z(y_j) | y_j, \Theta_1)\, \log P(z(y_i) \to z(y_j) | \Theta_2)   (13)

s.t. the marginals of P(Z, Z) are identically distributed.

• E Step: For each page y_i let {oe_1, · · ·} be its outlinks and {ie_1, · · ·} its inlinks; then

P(Z(y_i) = z \mid oe_1, oe_2, \cdots, ie_1, ie_2, \cdots) = \frac{P(oe_1 | z(y_i))\, P(oe_2 | z(y_i)) \cdots P(ie_1 | z(y_i)) \cdots}{P(oe_1, oe_2, \cdots, ie_1, ie_2, \cdots)}, \qquad P(oe_1 | Z = z) = \sum_{z'} P(oe_1 | Z = z, z') \cdot P(z') = \sum_{z'} P(Z = z, z') \cdot P(z').   (14)

• M Step:

\operatorname{ArgMax}_{P(Z \to Z)} \sum_{y_i, y_j \in E} \frac{r_i}{d_i} \sum_{z(y_i), z(y_j)} P(z(y_i) | y_i)\, P(z(y_j) | y_j)\, \log P(z(y_i) \to z(y_j) | \Theta_2)   (15)

s.t. the marginals of P(Z, Z) are identically distributed. Using Lagrange multipliers {1/λ_o, · · · , 1/λ_K}, the Lagrangian for a single edge y_1, y_2 can be written as

L = \frac{r_i}{d_i} \sum_{z_i, z_j} P(z_i | y_1)\, P(z_j | y_2)\, \log P(z_i \to z_j | \Theta_2) + \frac{1}{\lambda_o} \sum_{z_i} \sum_{z_j} P(z_i, z_j) + \sum_{k=1}^{K} \frac{1}{\lambda_k} \sum_{l} \left( P(z_k, z_l) - P(z_l, z_k) \right).   (16)

Setting ∂L/∂P(z_i, z_j) = 0 we obtain

P(z_i, z_j) = (\lambda_o + \lambda_i - \lambda_j)\, P(z_i | y_1)\, P(z_j | y_2)   (17)

P(z_i, z_i) = \lambda_o\, P(z_i | y_1)\, P(z_i | y_2)   (18)

Setting ∂L/∂λ_k = 0 leads to a system of linear equations in the K + 1 unknown Lagrange multipliers, hence solvable as long as the coefficient matrix is non-singular. Note, however, that we need not solve this linear system in every step; it suffices to perform a single Gauss-Seidel iteration, thus constituting a GEM algorithm. Interestingly, it can be shown that for a symmetric graph the marginal equality constraints are satisfied automatically at the maxima, and one need not use the expensive mechanism of Lagrange multipliers.

Proposition 5. Given a symmetric graph, the solution of the unconstrained M Step satisfies the condition that the row marginals and the column marginals are identical.
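To illustrate the payoff of the coarse K-state model, the sketch below evaluates equation (10) for a single page once P(Z, Z), the posteriors U and the pagerank vector r have been estimated (all of them assumed given here); this is how an outlink row for a newly seen page can be reconstructed without re-estimating anything over Y.

```python
import numpy as np

def predicted_outlink_row(u_new, U, joint_zz, r):
    """Equation (10): P(y2 | y1) for one page y1 with latent posterior u_new
    (length K), given U[:, j] = P(z | y_j) for the known pages (K x Nk),
    the estimated joint P(Z, Z) (K x K) and the pagerank vector r (length Nk)."""
    pz = joint_zz.sum(axis=1)                 # marginal P(z); identical marginals assumed
    Lam = joint_zz / np.outer(pz, pz)         # Lam[z1, z2] = P(z1, z2) / (P(z1) P(z2))
    row = r * (u_new @ Lam @ U)               # P(y2) * sum_{z1,z2} P(z1|y1) Lam P(z2|y2)
    return row / row.sum()                    # renormalize to a probability distribution
```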

7. EXPERIMENTAL RESULTS

Data was collected by crawling pages belonging to the domain utexas.edu with depth restricted to 20 links. A total of 113826 outlinks were encountered and 30850 unique web pages were crawled, of which 7459 pages had outlinks.

Figure 3: Proportion of crawled pages during the linear growth of the crawling process (number of pages crawled N_C versus size of the known set N_k).

We present snapshots at various points of the crawling process to show that at the initial stages of the breadth-first crawl a substantial proportion of the pages consisted of uncrawled pages; as can be seen in Figure 3, the relationship is roughly linear, which is of some significance when we consider the √Nk bound. The schemes explained in the paper were pitted against each other and against two base-case predictors. Base 1 always substituted a uniform distribution for the unknown outlinks, and Base 2 always decided that the page in question had no outlinks. Since the matrix generated by Base 2 would not be a valid transition matrix, it was implemented using an extra rank-distribution vertex, such that all pages with no outlinks pointed to this rank-distribution node with probability 1; the outlinks of the rank-distribution node were taken to be uniform. The distance measure used is the L1 distance between the predicted and the actual outlink distributions, which were computed from the available ground truth. We report results only for the test pages, i.e. the set of pages in FC′ whose outlinks were not used for estimation.

Figure 4: Comparison of outlink predictions by Dangling Link (DL) estimation at various sizes of the crawl set. The quality of prediction increases with the size of the crawled set.

Figure 5: Comparison of outlink prediction by DL estimation and the uniform distribution at various sizes of the crawl set. The plots with high L1 distance correspond to those predicted by the uniform distribution. The vertical axis denotes the L1 distance and the horizontal axis the number of pages for which the error was less than the value indicated on the ordinate. Each curve denotes the results for a different size of the crawled set FC.

Figure 6: Comparison of outlink prediction by DL estimation and the null distribution at various sizes of the crawled set. The average L1 distance for prediction with the null distribution is significantly higher than with DL estimation.

Figure 7: Comparison of outlink prediction by DL, Naive Link, null and uniform distributions for a crawl set of 500 pages. The NL estimates are seen to be better than DL on pages with no outlinks and similar on others.

Clearly Base 1, which predicted a uniform distribution, fared worst. Base 2 did better, since a significant fraction of the pages did not have outlinks to any pages; however, on those pages which had outlinks Base 2 suffered the maximum penalty. We see that both Scheme 1 and Scheme 2 result in predictions that are significantly better than the base cases, with Scheme 2 being able to utilize the fact that many pages do not have any outlinks.
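The evaluation itself reduces to a per-page L1 comparison over the held-out test pages; a minimal sketch (the distribution dicts are placeholders for however the predictions are stored):

```python
def l1_distance(p, q):
    """L1 distance between two outlink distributions given as dicts page -> prob."""
    return sum(abs(p.get(j, 0.0) - q.get(j, 0.0)) for j in set(p) | set(q))

def evaluate(predicted, actual):
    """Per-page L1 errors over the test pages in F_C', as plotted in Figures 4-7;
    pages with no true outlinks are represented by an empty dict."""
    return {page: l1_distance(predicted.get(page, {}), actual[page]) for page in actual}
```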

8. CONCLUSION

In this paper we explored the effects of missing outlink information on the accuracy of computed pagerank values, under the assumption that outlink distributions are chosen uniformly at random; this also indicated where it might be possible to stop the time-consuming crawling process and yet keep pagerank errors under control. Two algorithms for estimating the outlinks of uncrawled pages were introduced, one based on approximating the outlink distribution by the pagerank itself and another based on a latent variable characterization. In both cases the changed pagerank can be computed without restarting the Pagerank iteration, and they perform better estimation of the outlinks than the common practice of substituting a uniform distribution or a null distribution. Lack of time prevented reporting extensive results on the performance of the second algorithm, which we would like to incorporate later. The latent variable characterization, though computationally much more intensive, has the advantage that it can assign different outlink distributions taking into consideration the inlinks of a page, while the resulting clustering of the web-graph can be used for other purposes.

9. ACKNOWLEDGEMENT

We would like to thank the author of the Larbin crawler [email protected] for making the code available under GPL.

10. REFERENCES

[1] R. A. Barabasi and H. Jeong. Mean-field theory for scale-free random graphs. Physica, 272(A):173–187, 1999.
[2] D. Cohn and H. Chang. Learning to probabilistically identify authoritative documents. In Proc. 17th International Conf. on Machine Learning, pages 167–174. Morgan Kaufmann, San Francisco, CA, 2000.
[3] D. Cohn and T. Hofmann. The missing link - a probabilistic model of document content and hypertext connectivity. In Neural Information Processing Systems 13, 2001.
[4] P. Domingos and M. J. Pazzani. Beyond independence: Conditions for the optimality of the simple Bayesian classifier. In International Conference on Machine Learning, pages 105–112, 1996.
[5] B. S. Everitt. An Introduction to Latent Variable Models. Chapman and Hall, London, 1984.
[6] T. Hofmann. Probabilistic latent semantic analysis. In Proc. of Uncertainty in Artificial Intelligence, UAI'99, Stockholm, 1999.
[7] T. Hofmann and J. Puzicha. Unsupervised learning from dyadic data. Technical Report TR-98-042, University of California, Berkeley, Berkeley, CA, 1998.
[8] F. Jelinek, R. Mercer, L. Bahl, and J. Baker. Interpolated estimation of Markov source parameters from sparse data. Pattern Recognition in Practice, pages 381–397, 1980.
[9] S. Kamvar, T. Haveliwala, C. Manning, and G. Golub. Exploiting the block structure of the web for computing pagerank. Technical report, Stanford University, March 2003.
[10] J. G. Kemeny and J. L. Snell. Finite Markov Chains. Van Nostrand Reinhold, New York, 1960.
[11] R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal. Stochastic models for the web graph. In FOCS: IEEE Symposium on Foundations of Computer Science (FOCS), 2000.
[12] R. J. A. Little and D. B. Rubin. Statistical Analysis with Missing Data. John Wiley, New York, 1987.
[13] C. McDiarmid. On the method of bounded differences. In Surveys in Combinatorics, pages 148–188. Cambridge University Press, 1989.
[14] M. W. Crofton. Geometrical theorems relating to mean values. Proceedings of the London Mathematical Society, 8:304–309, 1877.
[15] A. Y. Ng, A. X. Zheng, and M. I. Jordan. Link analysis, eigenvectors and stability. In IJCAI-01, pages 903–910, 2001.
[16] L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: Bringing order to the web, 1998.
[17] G. Pandurangan, P. Raghavan, and E. Upfal. Using PageRank to characterize web structure. In 8th Annual International Computing and Combinatorics Conference (COCOON), 2002.
[18] C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval. In Research and Development in Information Retrieval, pages 334–342, 2001.