
Technical University of Ilmenau, Institute for Theoretical and Technical Computer Science, Automata and Formal Languages

PAGERANK REVISITED

MICHAEL BRINKMEIER

Abstract. PageRank, one part of the search engine Google, is one of the most prominent link-based rankings of documents in the World Wide Web. Usually it is described as a Markov chain modelling a specific random surfer. In this paper an alternative representation as a power series is given. Nonetheless it is possible to interpret the values as probabilities in a random surfer setting, differing from the usual one. Using the new description we restate and extend some results concerning the convergence of the standard iteration used for PageRank. Furthermore we take a closer look at sinks and sources, leading to some suggestions for faster implementations.

Key words and phrases: pagerank, link-analysis, world wide web, random surfer, ranking algorithm, web graph, web search, personalisation, dynamical update, Markov chain, web page scoring.

1. Introduction

The World Wide Web is one of the most rapidly developing and perhaps the largest source of information. Due to its decentralized nature, access to the relevant and wanted information becomes increasingly difficult. Due to the sheer amount of content, more and more search engines, indexes and archives try to harvest the information implicit in the link structure to improve speed and quality of search. One of the tools in this field is the ranking of web pages according to their relevance. PageRank, one of the most prominent, was presented by Page, Brin et al. in [BP98, BPMW99]. It is an essential part of Google's ranking scheme. Together with content-based measures this purely link-based value is the basis of the order of search results produced by this widely used and accepted search engine.

Since the ranking seems to be quite successful (considering the number of users and their confidence in the results), the theoretical properties of PageRank raise some interesting questions. Usually PageRank is viewed as a Markov chain (e.g. [BPMW99, KHMG03a, HK03, Hav02, BGS03]), even though its original definition does not constitute one, as pointed out by several authors. But if sinks, i.e. nodes without outgoing edges, are removed or connected to all other nodes, one obtains a Markov chain producing the same ranking as the original definition [BGS03]. More generally, PageRank is usually calculated by iteratively multiplying a (ranking) vector with a form of normalized adjacency matrix of the graph. Standard results of linear algebra and numerical mathematics show that this iteration converges to the principal eigenvector of the normalized adjacency matrix.

In this paper we do not directly use Markov chains. Instead we describe PageRank as a power series. This new perspective allows us to re-prove and extend many known results and provides new insight into this ranking. First of all, our description allows an exact interpretation of PageRank as the probability that a specific type of random surfer ends his walk at the page.


This interpretation differs from the one usually suggested in the literature, but has some advantages. It inherently treats the sinks in a special way, but also works if there are no sinks. In fact it provides a detailed explanation why sinks cause a global loss of ranking. Furthermore we extend the known results of [HK03] and [BGS03] about the convergence properties of the standard iteration and prove that the rate of convergence is exactly the damping factor d and only slightly influenced by the topology of the graph. Regarding the problem of personalization of the ranking we observe (as Jeh and Widom already implicitly did in [JW02] and [JW03]) that PageRank is a linear map from the personalization vector to a ranking.

In addition to these restatements of known results (but with a different flavor) we will take a closer look at sources and sinks, i.e. nodes with no incoming or outgoing links. As a consequence of our observations we suggest a way to speed up the calculation of PageRank by excluding sources, sinks and related nodes from the iteration. Finally we shortly discuss the usage of arbitrary probabilities for the choices of the random surfer, instead of the uniform distribution originally suggested by Page and Brin. We observe that all of our results also hold in this more general setting, leading to a whole range of rankings.

We have to thank an anonymous referee for some very good and helpful remarks and corrections, which forced the author to gain an even deeper insight into the nature of PageRank than was initially attempted (esp. regarding the difference between the linear system and the Markov model).

1.1. Preliminaries. Throughout this paper G = (V, E) is an arbitrary directed graph with node set V and edge set E ⊂ V², without multiple edges. An edge (v, u) ∈ E is written as v → u. Summations of x(v) over nodes v satisfying a condition cond(v) are written as $\sum_{v \mid \mathrm{cond}(v)} x(v)$. If there is no restriction we simply write $\sum_v x(v)$. Similar notations are used for edges.

A path $\pi : u \xrightarrow{l} v$ from u to v of length l(π) := l is a sequence of l edges u = u₀ → u₁ → u₂ → · · · → u_{l−1} → u_l = v. If we do not restrict the length of a path we write $\pi : u \xrightarrow{*} v$.

1.2. PageRank. In [BP98] and [BPMW99] Page, Brin et al. described an approach for the estimation of the importance of a web page based purely on the link structure of the world wide web. Their proposed score PageRank is based on the assumption that a document spreads its relevance equally to all documents to which it links. To generate rank, a fixed value e(u), the personalization value, is given for each node u, and (1 − d)e(u) is added to the rank, resulting in:

(1) $\mathrm{PageRank}(u) = d \sum_{v \mid v \to u} \frac{\mathrm{PageRank}(v)}{\mathrm{out}(v)} + (1-d)e(u).$¹

¹In fact, in the original paper [BP98] the factor with which the personalization is multiplied was given as d. But in later publications it was replaced by (1 − d). As we will see, this has influence on the absolute values, but not on the ranking.

Using the normalized link matrix of the web, i.e. the matrix M = (m_{uv}) with $m_{uv} = \frac{1}{\mathrm{out}(u)}$ if there exists a link from u to v and 0 otherwise, this equation may


be reformulated as the following linear system:

(2) $(I - dM^T)\,\mathrm{PageRank} = (1-d)e,$

where I is the unit matrix. Under certain circumstances² the equation may be solved using the iteration

(3) $r_{i+1} = dM^T r_i + (1-d)e,$

which corresponds to the iterative algorithm suggested by Page and Brin³. Translated to the underlying graph, the iteration leads to the following equation and Algorithm 1:

(4) $r_{i+1}(u) = d \sum_{v \mid v \to u} \frac{r_i(v)}{\mathrm{out}(v)} + (1-d)e(u).$

²The spectral radius of the matrix $dM^T$ has to be less than 1.

³Since they normalized the ranking vector, they could describe PageRank as an eigenvector of a specific matrix.
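For illustration, the iteration (3)/(4) can be sketched in a few lines of Python. The toy graph, the damping factor d = 0.85 and the uniform personalization vector are illustrative choices, not taken from the paper.

# Minimal sketch of iteration (4); all concrete values are illustrative.
d = 0.85
graph = {0: [1, 2], 1: [2], 2: [0], 3: []}    # successors; node 3 is a sink
nodes = sorted(graph)
e = {v: 1.0 / len(nodes) for v in nodes}      # uniform personalization vector

r = {v: (1 - d) * e[v] for v in nodes}        # start vector r_0 = (1 - d)e
for _ in range(100):                          # fixed number of sweeps for brevity
    r = {u: d * sum(r[v] / len(graph[v]) for v in nodes if u in graph[v])
            + (1 - d) * e[u]
         for u in nodes}

print(r)  # approximates the unique solution of the linear system (2)

Note that the sink never appears as a predecessor in the sum, so no division by out(v) = 0 occurs; its rank simply drains, as discussed below.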

The starting vector r₀ can be chosen arbitrarily; in Algorithm 1 it is called r. Page and Brin suggested an interpretation of the iteration as the behavior of a random surfer, who occasionally jumps or teleports to another page instead of following a link, leading to a Markov model. But unfortunately the normalized adjacency matrix M is not stochastic, since sinks, i.e. nodes without outgoing edges, have columns summing to 0. Usually this problem is solved by adding virtual edges from each sink to all other nodes (including the sink) weighted by the personalisation vector, leading to a proper Markov model (see e.g. [KHMG03a, HK03, Hav02, BGS03]). Writing $\overline{\mathrm{PageRank}}$ for the ranking of this modified model, this results in a fixpoint condition of the following form:

(5) $\overline{\mathrm{PageRank}}(u) = d \sum_{v \mid v \to u} \frac{\overline{\mathrm{PageRank}}(v)}{\mathrm{out}(v)} + d \sum_{v \mid \mathrm{out}(v)=0} \overline{\mathrm{PageRank}}(v)\,\frac{e(u)}{\|e\|_1} + (1-d)e(u).$

The convergence of this iteration was proved in several papers (e.g. [BGS03]). As we will see, both equations are closely related, since (thm. 2.9)

$\overline{\mathrm{PageRank}}(v) = \frac{\|e\|_1}{\|\mathrm{PageRank}\|_1}\,\mathrm{PageRank}(v).$
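This relation can be checked numerically. The following Python sketch (an assumed toy graph, with ‖e‖₁ = 1) iterates both fixpoint conditions and prints the node-wise ratio of the two rankings, which should be the same constant for every node.

# Compare the linear system (2)/(3) with the teleportation variant (5);
# graph and parameters are illustrative.
d = 0.85
graph = {0: [1, 2], 1: [2], 2: [0], 3: []}     # node 3 is a sink
nodes = sorted(graph)
e = {v: 1.0 / len(nodes) for v in nodes}       # ||e||_1 = 1 by construction

def iterate(teleport, steps=200):
    r = {v: (1 - d) * e[v] for v in nodes}
    for _ in range(steps):
        sink_mass = sum(r[v] for v in nodes if not graph[v])
        nxt = {}
        for u in nodes:
            s = sum(r[v] / len(graph[v]) for v in nodes if u in graph[v])
            nxt[u] = d * s + (1 - d) * e[u]
            if teleport:                        # extra term of equation (5)
                nxt[u] += d * sink_mass * e[u]  # e(u)/||e||_1 with ||e||_1 = 1
        r = nxt
    return r

plain, tele = iterate(False), iterate(True)
print({v: tele[v] / plain[v] for v in nodes})   # constant ||e||_1 / ||plain||_1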

We are going another way. Instead of using Markov models, we look directly at the linear system (2) and provide an interpretation of the solution PageRank(u) using a random surfer similar to the one described by Page and Brin in [BP98] and [BPMW99], but without the quite artificial teleportation. In addition our description provides detailed insights into the influence of the parameters of the ranking formula.

2. PageRank as a Power-Series

In this section we are going to prove that PageRank is a power series over the damping factor d. Furthermore we will see that its coefficients reflect a kind of weighted reachability of the respective node, which may be computed from the graph structure and the personalization vector e(u).


for v ∈ V do
    PageRank^(0)[v] = r[v];
od
while not converged do
    for v ∈ V do
        PageRank^(i+1)[v] = (1 − d)e(v);
        for u with u → v do
            PageRank^(i+1)[v] = PageRank^(i+1)[v] + d · PageRank^(i)[u]/out(u);
        od
    od
od

Algorithm 1: The basic PageRank algorithm

Theorem 2.1. For each node u ∈ V we have

(6) $\mathrm{PageRank}(u) = \sum_{l=0}^{\infty} d^l (1-d) \sum_{v} a_l(v,u)\,e(v)$

with

(7) $a_0(v,u) = \begin{cases} 1 & \text{if } v = u \\ 0 & \text{otherwise} \end{cases}$ and $a_{l+1}(v,u) = \sum_{w \mid w \to u} \frac{a_l(v,w)}{\mathrm{out}(w)}.$

Proof: Inserting the power series into the right side of the fixpoint condition (1) leads to

$d \sum_{w \mid w \to u} \frac{1}{\mathrm{out}(w)} \left( \sum_{l=0}^{\infty} d^l (1-d) \sum_v a_l(v,w)\,e(v) \right) + (1-d)e(u)$
$= \sum_{l=0}^{\infty} d^{l+1} (1-d) \sum_v e(v) \sum_{w \mid w \to u} \frac{a_l(v,w)}{\mathrm{out}(w)} + (1-d)e(u)$
$= \sum_{l=0}^{\infty} d^{l+1} (1-d) \sum_v e(v)\,a_{l+1}(v,u) + (1-d)e(u)$
$= \sum_{l=0}^{\infty} d^l (1-d) \sum_v a_l(v,u)\,e(v).$

Hence the power series is at least formally a solution of (1). It remains to prove its convergence, which is done in section 2.1. □

2.1. The a_l and the Random-Walk Interpretation. In this section we will take a closer look at the a_l(v,u). This will lead to a nice interpretation of these values and of PageRank as a probability closely connected to specific random walks. Furthermore we will prove that the series converges.

Lemma 2.2. For two arbitrary nodes v, u ∈ V and l = k + h with l ≥ 0 and k, h ≥ 0 we have:

(8) $a_l(v,u) = \sum_{w \in V} a_k(v,w)\,a_h(w,u).$


Proof: Equation (8) is proved by induction over l. For l = 0 it is obviously true, since then k = h = 0. Now assume that it holds for l ≥ 0 and that l + 1 = k + h. This implies h > 0 or k > 0. If h > 0 we have

$a_{l+1}(v,u) = \sum_{w \mid w \to u} \frac{a_l(v,w)}{\mathrm{out}(w)} = \sum_{w \mid w \to u} \frac{1}{\mathrm{out}(w)} \sum_{w'} a_k(v,w')\,a_{h-1}(w',w)$
$= \sum_{w'} a_k(v,w') \sum_{w \mid w \to u} \frac{a_{h-1}(w',w)}{\mathrm{out}(w)} = \sum_{w'} a_k(v,w')\,a_h(w',u).$

If h = 0 then k = l + 1 and equation (8) becomes trivial. □

Corollary 2.3. For v ∈ V and l ≥ 0 we have

$a_{l+1}(v,u) = \begin{cases} \frac{1}{\mathrm{out}(v)} \sum_{w \mid v \to w} a_l(w,u) & \text{if } \mathrm{out}(v) \neq 0 \\ 0 & \text{if } \mathrm{out}(v) = 0. \end{cases}$

Proof: If out(v) ≠ 0 then

$a_{l+1}(v,u) = \sum_w a_1(v,w)\,a_l(w,u) = \sum_w \sum_{w' \mid w' \to w} \frac{a_0(v,w')}{\mathrm{out}(w')}\,a_l(w,u) = \sum_{w \mid v \to w} \frac{a_l(w,u)}{\mathrm{out}(v)},$

where we used that a₀(v,w') ≠ 0 if and only if v = w'. If out(v) = 0 then the statement is proved by induction over l ≥ 1. For l = 1 we have

$a_1(v,u) = \sum_{w \mid w \to u} \frac{a_0(v,w)}{\mathrm{out}(w)}.$

Since a₀(v,w) = 0 for w ≠ v and since out(v) = 0 this implies a₁(v,u) = 0. Now assume the statement holds for l ≥ 1. Then

$a_{l+1}(v,u) = \sum_{w \mid w \to u} \frac{a_l(v,w)}{\mathrm{out}(w)} = 0. \qquad \Box$

For the proof of the convergence of the power series (6) we require one more result.

Corollary 2.4. Let G = (V, E) be a directed graph and l ≥ 0. For two arbitrary nodes u, v ∈ V we have

(9) $a_l(v,u) \leq 1.$

Proof: For l = 0 the correctness follows directly from the definition. Now assume that it holds for l ≥ 0. If out(v) = 0 the inequality (9) follows from the preceding corollary. In the case out(v) ≠ 0 we have

$a_{l+1}(v,u) = \frac{1}{\mathrm{out}(v)} \sum_{w \mid v \to w} a_l(w,u) \leq \frac{1}{\mathrm{out}(v)} \sum_{w \mid v \to w} 1 = \frac{1}{\mathrm{out}(v)}\,\mathrm{out}(v) = 1. \qquad \Box$

Now the convergence is clear.


Theorem 2.5. The power series

$\mathrm{PageRank}(u) = \sum_{l=0}^{\infty} d^l (1-d) \sum_{v \in V} a_l(v,u)\,e(v)$

converges for all u ∈ V and 0 < d < 1.

Proof: We have

$\mathrm{PageRank}(u) = \sum_{l=0}^{\infty} d^l (1-d) \sum_{v \in V} a_l(v,u)\,e(v) \leq \sum_{l=0}^{\infty} d^l (1-d) \sum_{v \in V} e(v) = \sum_{v \in V} e(v). \qquad \Box$
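To make the coefficients concrete, the recursion (7) and a truncation of the series (6) can be evaluated directly. The following Python sketch uses an illustrative toy graph; the cut-off L is an arbitrary choice.

# Sketch of the coefficients a_l(v,u) from (7) and the truncated series (6).
d = 0.85
graph = {0: [1, 2], 1: [2], 2: [0], 3: []}     # node 3 is a sink
nodes = sorted(graph)
e = {v: 1.0 / len(nodes) for v in nodes}

# a[l][(v, u)]: a_0 is the identity, a_{l+1}(v,u) = sum_{w->u} a_l(v,w)/out(w)
a = [{(v, u): 1.0 if v == u else 0.0 for v in nodes for u in nodes}]
L = 100
for l in range(L):
    prev = a[-1]
    a.append({(v, u): sum(prev[(v, w)] / len(graph[w])
                          for w in nodes if u in graph[w])
              for v in nodes for u in nodes})

# partial sum (6): PageRank(u) ~ sum_l d^l (1-d) sum_v a_l(v,u) e(v)
pagerank = {u: sum(d**l * (1 - d) * sum(a[l][(v, u)] * e[v] for v in nodes)
                   for l in range(L + 1))
            for u in nodes}
print(pagerank)  # agrees with the power iteration of Algorithm 1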

In the following we assume that e(v) ≥ 0 for each node v and $\sum_v e(v) = 1$. Otherwise the interpretation of PageRank(u) as a probability is impossible. Nonetheless the results are still true if these conditions do not hold.

The power series may be interpreted in terms of an 'unforgiving' random surfer. Before the surfer starts surfing she chooses a starting node. The probability that a specific node v is chosen is e(v). As she reaches a node, including the starting point, she decides whether to continue the walk. With a probability of (1 − d) she chooses to end her travel. Otherwise the next node is chosen with equal probability among the neighbours of the current position, if they exist. If no neighbours exist, but the surfer decides to continue the walk, she becomes annoyed and simply stops doing anything. PageRank(u) is the probability that the surfer voluntarily ends her walk at node u⁴.

Assuming that she decided to start at node v, she stops in the node u with probability $\sum_{l=0}^{\infty} P(v \xrightarrow{l} u)(1-d)$. Here $P(v \xrightarrow{l} u)$ is the probability that she is in u after l steps. These may be calculated using the following iteration:

$P(v \xrightarrow{0} u) = \begin{cases} 1 & \text{if } u = v \\ 0 & \text{if } u \neq v \end{cases} = d^0\,a_0(v,u)$

$P(v \xrightarrow{1} u) = \begin{cases} \frac{d}{\mathrm{out}(v)} & \text{if } v \to u \\ 0 & \text{otherwise} \end{cases} = d\,a_1(v,u).$

By induction we obtain

$P(v \xrightarrow{l+1} u) = \sum_{w \mid w \to u} \frac{d}{\mathrm{out}(w)}\,P(v \xrightarrow{l} w) = d^{l+1}\,a_{l+1}(v,u).$

Therefore the probability that the surfer stops in u is given as

$P(u) = \sum_v e(v) \sum_{l=0}^{\infty} (1-d)\,P(v \xrightarrow{l} u) = \sum_v e(v) \sum_{l=0}^{\infty} (1-d)\,d^l\,a_l(v,u) = \mathrm{PageRank}(u).$

⁴Observe that the rankings are not the whole truth. They are only a part of a probability distribution. There exists an additional hidden state, describing an annoyed surfer. This state causes a loss of ranking, as shown in theorem 2.7.
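This reading can be checked empirically. The following Monte-Carlo sketch (illustrative graph and parameters) simulates the unforgiving surfer and tallies where she voluntarily stops and how often she becomes annoyed.

# Monte-Carlo sketch of the 'unforgiving' surfer; all values illustrative.
import random

d = 0.85
graph = {0: [1, 2], 1: [2], 2: [0], 3: []}     # node 3 is a sink
nodes = sorted(graph)
e = [1.0 / len(nodes)] * len(nodes)            # uniform start distribution

stops = {v: 0 for v in nodes}                  # voluntary stops per node
annoyed = 0                                    # walks ending in the hidden state
trials = 200_000
for _ in range(trials):
    v = random.choices(nodes, weights=e)[0]    # start node chosen with e(v)
    while random.random() < d:                 # continue with probability d
        if not graph[v]:                       # sink: surfer becomes annoyed
            annoyed += 1
            break
        v = random.choice(graph[v])            # uniform choice among successors
    else:
        stops[v] += 1                          # voluntary stop: counts for PageRank(v)

print({v: stops[v] / trials for v in nodes})   # estimates PageRank(v)
print(annoyed / trials)                        # estimates d/(1-d) * sum of sink ranks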

Using paths, the interpretation of the a_l as probabilities leads to a formula already used in [JW02, JW03], namely the expected meeting distance.

Proposition 2.6. For all nodes u, v we have

$a_l(v,u) = \sum_{\pi : v \xrightarrow{l} u} P(\pi)$


where P(π) is defined inductively over the length l of the paths:

$P(\pi) = \begin{cases} 0 & \text{if } l = 0 \text{ and } u \neq v \\ 1 & \text{if } l = 0 \text{ and } u = v \\ \frac{P(\pi')}{\mathrm{out}(w)} & \text{if } l > 0 \text{ and } \pi : u \xrightarrow{\pi'} w \to v, \end{cases}$

where π is the path π' from u to w followed by the edge (w, v). This leads to

(10) $\mathrm{PageRank}(u) = \sum_v e(v) \sum_{\pi : v \xrightarrow{*} u} P(\pi)(1-d)\,d^{l(\pi)}.$
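Equation (10) can also be evaluated by enumerating paths up to a cut-off length. This is only feasible for tiny graphs and small cut-offs, but it makes the path probabilities P(π) concrete; the graph and the cut-off below are illustrative assumptions.

# Sketch of (10): sum P(pi)(1-d)d^l(pi) over all paths up to max_len.
d = 0.85
graph = {0: [1, 2], 1: [2], 2: [0], 3: []}    # illustrative toy graph
nodes = sorted(graph)
e = {v: 1.0 / len(nodes) for v in nodes}

def walk(v, weight, length, score, max_len):
    # weight = e(start) * P(pi) for the path pi that led to v
    score[v] = score.get(v, 0.0) + weight * (1 - d) * d ** length
    if length < max_len:
        for w in graph[v]:                    # extend pi by one edge v -> w
            walk(w, weight / len(graph[v]), length + 1, score, max_len)

score = {}
for start in nodes:
    walk(start, e[start], 0, score, max_len=30)
print(score)  # approaches PageRank as max_len grows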

The interpretation of PageRank as probabilities of a random walk is destroyed if e(v) < 0 for some nodes or $\sum_v e(v) \neq 1$. If all e(v) are non-negative, PageRank(u) is the probability multiplied by $\|e\|_1 = \sum_{v \in V} e(v)$. But if some of the e(v) are negative the random walk interpretation becomes very questionable. Nonetheless the ranking is well-defined and exists.

To get a complete picture of the random surfer we have to calculate the probability that she becomes annoyed, i.e. that she runs into a sink and decides to continue. The probability that she becomes annoyed after the l-th step, having started in v, is

$\sum_{w \mid \mathrm{out}(w)=0} d^{l+1}\,a_l(v,w).$

Hence the probability of the surfer becoming annoyed when starting in v is

$\sum_{l=0}^{\infty} \sum_{w \mid \mathrm{out}(w)=0} d^{l+1}\,a_l(v,w)$

and hence in total

$\sum_v e(v) \sum_{l=0}^{\infty} d^{l+1} \sum_{w \mid \mathrm{out}(w)=0} a_l(v,w) = \frac{d}{1-d} \sum_{w \mid \mathrm{out}(w)=0} \mathrm{PageRank}(w).$

This is confirmed by the following result.

Theorem 2.7.

$\sum_u \mathrm{PageRank}(u) = (1-d) \sum_v e(v) + d \sum_{v \mid \mathrm{out}(v) \neq 0} \mathrm{PageRank}(v)$
$= \sum_v e(v) - \frac{d}{1-d} \sum_{v \mid \mathrm{out}(v)=0} \mathrm{PageRank}(v).$


Proof: We have

$\sum_u \mathrm{PageRank}(u) = \sum_u \sum_v e(v) \sum_{l=0}^{\infty} d^l (1-d)\,a_l(v,u)$
$= (1-d) \sum_v e(v) + \sum_v \sum_u \sum_{l=1}^{\infty} e(v)\,d^l (1-d)\,a_l(v,u)$
$= (1-d) \sum_v e(v) + d \sum_v \sum_u \sum_{l=0}^{\infty} e(v)\,d^l (1-d)\,a_{l+1}(v,u)$
$= (1-d) \sum_v e(v) + d \sum_u \sum_{w \mid w \to u} \sum_v \sum_{l=0}^{\infty} e(v)\,d^l (1-d)\,\frac{a_l(v,w)}{\mathrm{out}(w)}$
$= (1-d) \sum_v e(v) + d \sum_{w \to u} \frac{\mathrm{PageRank}(w)}{\mathrm{out}(w)}$
$= (1-d) \sum_v e(v) + d \sum_{w \mid \mathrm{out}(w) \neq 0} \sum_{u \mid w \to u} \frac{\mathrm{PageRank}(w)}{\mathrm{out}(w)}$
$= (1-d) \sum_v e(v) + d \sum_{w \mid \mathrm{out}(w) \neq 0} \mathrm{PageRank}(w).$

This proves the first equality. The second is a consequence of

$\sum_u \mathrm{PageRank}(u) = (1-d) \sum_v e(v) + d \sum_{w \mid \mathrm{out}(w) \neq 0} \mathrm{PageRank}(w)$
$= (1-d) \sum_v e(v) + d \left( \sum_w \mathrm{PageRank}(w) - \sum_{w \mid \mathrm{out}(w)=0} \mathrm{PageRank}(w) \right)$

and therefore

$(1-d) \sum_u \mathrm{PageRank}(u) = (1-d) \sum_v e(v) - d \sum_{w \mid \mathrm{out}(w)=0} \mathrm{PageRank}(w). \qquad \Box$

2.2. Uniqueness. So far we have proved that the power series is one solution of the fixpoint condition (1). But is it the only one and, if not, how is it related to the other solutions? To answer these questions, we describe equation (1) in terms of linear algebra. If M is the normalized adjacency matrix, i.e. M = (m_{uv}) with $m_{uv} = \frac{1}{\mathrm{out}(u)}$ if there exists a link from u to v and 0 otherwise, then it may be written as

(11) $\mathrm{PageRank} = dM^T \mathrm{PageRank} + (1-d)e$

where PageRank is the column vector containing the ranks and e is the column vector of the e(v). Now assume that there exists another ranking r(v) that satisfies (11). Then we have

$\mathrm{PageRank} - r = dM^T \mathrm{PageRank} + (1-d)e - dM^T r - (1-d)e = dM^T(\mathrm{PageRank} - r).$

This implies that the difference of two solutions of (11) is an eigenvector of $M^T$ to the eigenvalue $\frac{1}{d}$. Since 0 < d < 1 this requires that $M^T$ has an eigenvalue $\frac{1}{d} > 1$.


But following standard results of linear algebra, the spectral radius $\rho(M^T)$ is less than or equal to the norm

$\|M^T\|_1 = \sup_{\|x\|_1 \neq 0} \frac{\|M^T x\|_1}{\|x\|_1}.$

For an arbitrary vector x with $\|x\|_1 \neq 0$ one obtains

$\|M^T x\|_1 \leq \sum_{(u,v) \in E} \frac{|x_u|}{\mathrm{out}(u)} = \sum_{u \mid \mathrm{out}(u) \neq 0} |x_u| \leq \sum_u |x_u| = \|x\|_1$

and hence $\|M^T\|_1 \leq 1$. This implies that $M^T$ has no eigenvalue larger than 1 and therefore the solution of (1) is uniquely determined.

Theorem 2.8. For each directed graph G = (V, E) the values

$\mathrm{PageRank}(u) = \sum_{l=0}^{\infty} d^l (1-d) \sum_v a_l(v,u)\,e(v)$

form the unique solution of the fixpoint condition (1).

2.3. The Teleportation. Now that we know the unique solution of the system (11), we will relate it to the adaptation of PageRank often found in the literature, obtained by adding teleportations from each sink to each vertex u with probability e(u)/‖e‖₁. This approach is described by the fixpoint condition (5). In [BGS03, Prop. 2.2] the following proposition was already proved for e(v) = 1; here we state the result for arbitrary personalization vectors with e(v) ≥ 0 for each vertex.

Proposition 2.9. Let e be an arbitrary personalization vector with non-negative entries, i.e. e(v) ≥ 0. Then the solution $\overline{\mathrm{PageRank}}$ of equation (5) is given by

$\overline{\mathrm{PageRank}} = \frac{\|e\|_1}{\|\mathrm{PageRank}\|_1}\,\mathrm{PageRank}.$

Proof: First observe that thm. 2.7 implies

$\|e\|_1 = \|\mathrm{PageRank}\|_1 + \frac{d}{1-d} \sum_{v \mid \mathrm{out}(v)=0} \mathrm{PageRank}(v).$

With

$\overline{\mathrm{PageRank}}(v) := \frac{\|e\|_1}{\|\mathrm{PageRank}\|_1}\,\mathrm{PageRank}(v)$


this leads to

$\overline{\mathrm{PageRank}}(u) = \frac{\|e\|_1}{\|\mathrm{PageRank}\|_1}\,\mathrm{PageRank}(u)$
$= d \sum_{v \mid v \to u} \frac{\overline{\mathrm{PageRank}}(v)}{\mathrm{out}(v)} + \frac{(1-d)\|e\|_1}{\|\mathrm{PageRank}\|_1}\,e(u)$
$= d \sum_{v \mid v \to u} \frac{\overline{\mathrm{PageRank}}(v)}{\mathrm{out}(v)} + \frac{(1-d)e(u)}{\|\mathrm{PageRank}\|_1} \left( \|\mathrm{PageRank}\|_1 + \frac{d}{1-d} \sum_{v \mid \mathrm{out}(v)=0} \mathrm{PageRank}(v) \right)$
$= d \sum_{v \mid v \to u} \frac{\overline{\mathrm{PageRank}}(v)}{\mathrm{out}(v)} + (1-d)e(u) + d \sum_{v \mid \mathrm{out}(v)=0} \frac{e(u)}{\|\mathrm{PageRank}\|_1}\,\mathrm{PageRank}(v)$
$= d \sum_{v \mid v \to u} \frac{\overline{\mathrm{PageRank}}(v)}{\mathrm{out}(v)} + (1-d)e(u) + d \sum_{v \mid \mathrm{out}(v)=0} \frac{e(u)}{\|e\|_1}\,\overline{\mathrm{PageRank}}(v).$

Therefore $\overline{\mathrm{PageRank}}$ satisfies eq. (5). □

3. Convergence of PageRank

In this section we are going to describe an algorithm for the calculation of PageRank(u) based on our representation. As we will see, it exactly corresponds to the power iteration suggested by Page and Brin in [BP98] and [BPMW99]. But our representation of the unique solution provides a better insight into the convergence properties of the iteration. Let PageRank^{(i)}(u) be the partial sum

(12) $\mathrm{PageRank}^{(i)}(u) := \sum_{l=0}^{i} (1-d)\,d^l \sum_v a_l(v,u)\,e(v).$


This leads to

$\mathrm{PageRank}^{(i+1)}(u) = \sum_{l=0}^{i+1} (1-d)\,d^l \sum_v a_l(v,u)\,e(v)$
$= \sum_{l=1}^{i+1} (1-d)\,d^l \sum_v a_l(v,u)\,e(v) + (1-d)e(u)$
$= d \sum_{l=1}^{i+1} (1-d)\,d^{l-1} \sum_v e(v) \sum_{w \mid w \to u} \frac{a_{l-1}(v,w)}{\mathrm{out}(w)} + (1-d)e(u)$
$= d \sum_{w \mid w \to u} \frac{1}{\mathrm{out}(w)} \sum_{l=0}^{i} (1-d)\,d^l \sum_v e(v)\,a_l(v,w) + (1-d)e(u)$
$= d \sum_{w \mid w \to u} \frac{1}{\mathrm{out}(w)}\,\mathrm{PageRank}^{(i)}(w) + (1-d)e(u).$

Comparing this formula with the iteration (4) reveals that the standard iteration with start values $r^{(0)}(u) = (1-d)e(u)$ computes the partial sums. This in turn allows us to estimate the error of the iteration. We have

$\left| \mathrm{PageRank}(u) - r^{(i)}(u) \right| = \sum_{l=i+1}^{\infty} d^l (1-d) \sum_v e(v)\,a_l(v,u) \leq \sum_{l=i+1}^{\infty} d^l (1-d) \sum_v e(v) = d^{i+1}\,\|e\|_1.$

More precisely we obtain

Theorem 3.1. The iteration (4) with start vector (1 − d)e converges linearly with rate d, i.e.

$\|\mathrm{PageRank} - \mathrm{PageRank}^{(i+1)}\|_1 \leq d\,\|\mathrm{PageRank} - \mathrm{PageRank}^{(i)}\|_1.$

Proof: Since the partial sums PageRank^{(i)} can be calculated by the iteration (4) and since PageRank is the unique solution of (1), we have

$\|\mathrm{PageRank} - \mathrm{PageRank}^{(i+1)}\|_1 = d\,\|M^T(\mathrm{PageRank} - \mathrm{PageRank}^{(i)})\|_1.$

Since the (natural) matrix norm $\|M^T\|_1$ is compatible with the vector norm $\|\cdot\|_1$, this leads to

$\|\mathrm{PageRank} - \mathrm{PageRank}^{(i+1)}\|_1 \leq d\,\|M^T\|_1\,\|\mathrm{PageRank} - \mathrm{PageRank}^{(i)}\|_1 \leq d\,\|\mathrm{PageRank} - \mathrm{PageRank}^{(i)}\|_1,$

since $\|M^T\|_1 \leq 1$ as observed in section 2.2. □

This result is an extension of the convergence result in [HK03] to arbitrary graphs. Haveliwala and Kamvar required the graph to contain at least two disjoint closed subsets, i.e. two sets S₁ and S₂ such that no edge leads out of these⁵. It remains to check the effect of the start vector r₀.

⁵The irreducibility required in [HK03] is equivalent to the requirement of the two sets being disjoint.
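The geometric decay can be observed numerically. The following sketch (illustrative graph and parameters) compares successive 1-norm errors against a well-converged reference solution and prints ratios, which stay at or below d.

# Sketch checking the linear convergence rate of Theorem 3.1.
d = 0.85
graph = {0: [1, 2], 1: [2], 2: [0], 3: []}
nodes = sorted(graph)
e = {v: 1.0 / len(nodes) for v in nodes}

def step(r):
    return {u: d * sum(r[v] / len(graph[v]) for v in nodes if u in graph[v])
               + (1 - d) * e[u] for u in nodes}

exact = {v: (1 - d) * e[v] for v in nodes}     # well-converged reference
for _ in range(500):
    exact = step(exact)

r = {v: (1 - d) * e[v] for v in nodes}         # start vector (1-d)e
prev_err = sum(abs(exact[v] - r[v]) for v in nodes)
for i in range(10):
    r = step(r)
    err = sum(abs(exact[v] - r[v]) for v in nodes)
    print(i, err / prev_err)                   # ratios stay at most d = 0.85
    prev_err = err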


for v_1 . . . v_n do
    PageRank[v_i] = (1 − d)e(v_i);
od
while not converged do
    for v_1 . . . v_n do
        PageRank[v_i] = (1 − d)e(v_i);
        for u with u → v_i do
            PageRank[v_i] = PageRank[v_i] + d · PageRank[u]/out(u);
        od
    od
od

Algorithm 2: The adapted PageRank algorithm

Lemma 3.2. Let r₀ be an arbitrary vector and i ≥ 1. Then the iteration (4) yields

(13) $r_i(u) = d^i \sum_v a_i(v,u)\,r_0(v) + \mathrm{PageRank}^{(i-1)}(u).$

Proof: For i = 1 we have

$r_1(u) = d \sum_{v \mid v \to u} \frac{r_0(v)}{\mathrm{out}(v)} + (1-d)e(u) = d \sum_v a_1(v,u)\,r_0(v) + \mathrm{PageRank}^{(0)}(u).$

Now assume (13) holds for i ≥ 1. Then

$r_{i+1}(u) = d \sum_{v \mid v \to u} \frac{r_i(v)}{\mathrm{out}(v)} + (1-d)e(u)$
$= d \sum_{v \mid v \to u} \frac{d^i \sum_w a_i(w,v)\,r_0(w) + \mathrm{PageRank}^{(i-1)}(v)}{\mathrm{out}(v)} + (1-d)e(u)$
$= d^{i+1} \sum_w r_0(w) \sum_{v \mid v \to u} \frac{a_i(w,v)}{\mathrm{out}(v)} + \mathrm{PageRank}^{(i)}(u)$
$= d^{i+1} \sum_w r_0(w)\,a_{i+1}(w,u) + \mathrm{PageRank}^{(i)}(u).$

□

Corollary 3.3. For every vector r₀ the iteration (4) converges to PageRank.

4. The Gauß-Seidel-Iteration

Examining the iteration (4) and the basic Algorithm 1, one observes that two rankings have to be stored for each node, the old and the new one. In this section we prove that this is not necessary. We are going to use only one value per node, and at the same time we get a faster convergence for free, as already confirmed by experiments of Arasu et al. [ANTT01]. In terms of iterative solutions of linear systems, Algorithm 1 corresponds to a Jacobi iteration, while Algorithm 2 is a Gauß-Seidel iteration. The new algorithm is obtained from Algorithm 1 by simply ignoring the 'generation' of the ranking, and iterating through the nodes in a predefined but arbitrary order v₁, . . . , vₙ. For technical reasons we furthermore require the starting vector to be (1 − d)e. This results in Algorithm 2.
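A minimal Python sketch of this in-place variant follows; the graph, the node order and the fixed sweep count are illustrative assumptions.

# Sketch of Algorithm 2: a single rank array updated in a fixed node order.
d = 0.85
graph = {0: [1, 2], 1: [2], 2: [0], 3: []}
order = sorted(graph)                           # the fixed order v_1, ..., v_n
e = {v: 1.0 / len(order) for v in order}

rank = {v: (1 - d) * e[v] for v in order}       # required start vector (1-d)e
for _ in range(50):                             # 'while not converged' placeholder
    for v in order:
        # predecessors read the current array: already-updated entries for
        # earlier nodes, old entries for later ones (cf. equation (14))
        rank[v] = (1 - d) * e[v] + d * sum(
            rank[u] / len(graph[u]) for u in order if v in graph[u])

print(rank)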


Since the new algorithm calculates the rank of v_j using the new rankings for nodes v_k with k < j and the old ones for k > j, it corresponds to the following iteration equation:

(14) $s_{i+1}(v_j) = d \left( \sum_{k < j \mid v_k \to v_j} \frac{s_{i+1}(v_k)}{\mathrm{out}(v_k)} + \sum_{k > j \mid v_k \to v_j} \frac{s_i(v_k)}{\mathrm{out}(v_k)} \right) + (1-d)e(v_j)$

with starting vector s₀ = r₀ = (1 − d)e. As we will see in the following, this iteration converges to PageRank and we have

$\|\mathrm{PageRank} - s_i\|_1 \leq \|\mathrm{PageRank} - r_i\|_1,$

i.e. the absolute error is at most as large as the one of the usual iteration. First observe that with r₀ = (1 − d)e the r_i is the partial sum PageRank^{(i)}, and hence a monotone increasing sequence.

Lemma 4.1. For r₀ = s₀ = (1 − d)e, 1 ≤ j ≤ n and i ≥ 0 we have

$r_i(v_j) \leq s_i(v_j) \leq r_{j(n+1)^i + 1}(v_j).$

Proof: We prove the claim by induction over the 'generation' i. For i = 0 we obviously have (1 − d)e(v_j) = r₀(v_j) = s₀(v_j) ≤ r_{j+1}(v_j). For i > 0 we prove the inequality by induction over the number j of the nodes. For j = 1 we have

$r_i(v_1) = (1-d)e(v_1) + d \sum_{v_k \mid v_k \to v_1} \frac{r_{i-1}(v_k)}{\mathrm{out}(v_k)}$
$\leq (1-d)e(v_1) + d \sum_{k > 1 \mid v_k \to v_1} \frac{s_{i-1}(v_k)}{\mathrm{out}(v_k)} = s_i(v_1)$
$\leq (1-d)e(v_1) + d \sum_{k > 1 \mid v_k \to v_1} \frac{r_{k(n+1)^{i-1}+1}(v_k)}{\mathrm{out}(v_k)}$
$\leq (1-d)e(v_1) + d \sum_{k > 1 \mid v_k \to v_1} \frac{r_{(n+1)^i}(v_k)}{\mathrm{out}(v_k)}$
$= r_{(n+1)^i + 1}(v_1).$


For j > 1 we obtain

$r_i(v_j) = (1-d)e(v_j) + d \sum_{v_k \mid v_k \to v_j} \frac{r_{i-1}(v_k)}{\mathrm{out}(v_k)}$
$= (1-d)e(v_j) + d \left( \sum_{k < j \mid v_k \to v_j} \frac{r_{i-1}(v_k)}{\mathrm{out}(v_k)} + \sum_{k > j \mid v_k \to v_j} \frac{r_{i-1}(v_k)}{\mathrm{out}(v_k)} \right)$
$\leq (1-d)e(v_j) + d \left( \sum_{k < j \mid v_k \to v_j} \frac{s_i(v_k)}{\mathrm{out}(v_k)} + \sum_{k > j \mid v_k \to v_j} \frac{s_{i-1}(v_k)}{\mathrm{out}(v_k)} \right) = s_i(v_j)$
$\leq (1-d)e(v_j) + d \left( \sum_{k < j \mid v_k \to v_j} \frac{r_{k(n+1)^i+1}(v_k)}{\mathrm{out}(v_k)} + \sum_{k > j \mid v_k \to v_j} \frac{r_{k(n+1)^{i-1}+1}(v_k)}{\mathrm{out}(v_k)} \right)$
$\leq (1-d)e(v_j) + d \sum_{v_k \mid v_k \to v_j} \frac{r_{j(n+1)^i}(v_k)}{\mathrm{out}(v_k)}$
$= r_{j(n+1)^i + 1}(v_j),$

where we used $k(n+1)^{i-1} + 1 \leq n(n+1)^{i-1} \leq (n+1)^i \leq j(n+1)^i$ if 1 ≤ j < k and $k(n+1)^i + 1 < j(n+1)^i + 1$ if 1 ≤ k < j. □

Corollary 4.2. For i → ∞ and 1 ≤ j ≤ n the sequence s_i(v_j) converges to PageRank(v_j) and

$\|\mathrm{PageRank} - s_i\|_1 \leq \|\mathrm{PageRank} - r_i\|_1.$

Proof: Both $r_i(v_j)$ and $t_i(v_j) := r_{j(n+1)^i + 1}(v_j)$ converge to PageRank(v_j), and hence so does the sequence s_i(v_j). Since $r_i(v_j) \leq s_i(v_j) \leq \mathrm{PageRank}(v_j)$, the inequality holds. □

5. Personalization – PageRank as a linear map

To examine the effect of the personalization vector e, we rewrite the power series slightly and use e as a parameter:

$\mathrm{PageRank}_e(u) := \sum_{v \in V} e(v) \sum_{l=0}^{\infty} (1-d)\,d^l\,a_l(v,u) = \sum_{v \in V} e(v)\,a(v,u)$

with

$a(v,u) := \sum_{l=0}^{\infty} (1-d)\,d^l\,a_l(v,u).$


This formulation makes clear that PageRank_e(u) is linear in the personalization vector e, i.e. for two arbitrary vectors e and e' and a real number c we have

(15) $\mathrm{PageRank}_{ce}(u) = c\,\mathrm{PageRank}_e(u)$
(16) $\mathrm{PageRank}_{e+e'}(u) = \mathrm{PageRank}_e(u) + \mathrm{PageRank}_{e'}(u)$

(cf. [JW03, Hav03]).

Proposition 5.1. PageRank is a linear map of the personalization vector e to a ranking vector PageRank_e. Its (V × V)-matrix (i.e. its columns and rows are indexed by nodes) has the entry a(v,u) in row u and column v.

This interpretation allows a fast but restricted personalization. Instead of calculating the rank vectors PageRank_e for arbitrary e, one restricts the choice of personalization vectors to a certain k-dimensional linear subspace of $\mathbb{R}^V$, the personalization space P, described by a basis {e₁, . . . , e_k}. In this case the ranking induced by a personalization vector $e = \sum_{i=1}^{k} c_i e_i \in P$ is given by

$\mathrm{PageRank}_e(u) = \sum_{i=1}^{k} c_i\,\mathrm{PageRank}_{e_i}(u).$

Therefore one has to know only the k partial scores for each node and the coefficients c_i of the personalization vector, as sketched below. Suggestions in these directions can be found in [Hav03, JW03].
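A sketch of this restricted personalization follows; the graph, the 2-dimensional basis and the coefficients are illustrative assumptions.

# Precompute PageRank_{e_i} for basis vectors, then combine linearly.
d = 0.85
graph = {0: [1, 2], 1: [2], 2: [0], 3: []}
nodes = sorted(graph)

def pagerank(e, steps=200):
    r = {v: (1 - d) * e[v] for v in nodes}
    for _ in range(steps):
        r = {u: d * sum(r[v] / len(graph[v]) for v in nodes if u in graph[v])
                + (1 - d) * e[u] for u in nodes}
    return r

# basis of a 2-dimensional personalization space
e1 = {0: 1.0, 1: 0.0, 2: 0.0, 3: 0.0}
e2 = {0: 0.0, 1: 0.0, 2: 1.0, 3: 0.0}
pr1, pr2 = pagerank(e1), pagerank(e2)

c1, c2 = 0.3, 0.7                               # coefficients of e = c1*e1 + c2*e2
combined = {v: c1 * pr1[v] + c2 * pr2[v] for v in nodes}
direct = pagerank({v: c1 * e1[v] + c2 * e2[v] for v in nodes})
print(combined, direct)                         # agree by linearity (15)/(16)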

6. Sinks and Sources

6.1. Sinks. The sinks, i.e. pages without outgoing links, play a special role. Theorem 2.7 proves that the sum of all ranks is the 1-norm of the personalization vector minus $\frac{d}{1-d}$ times the sum of the ranks of the sinks. Hence, as already noted in [BGS03], sinks cause a loss of rank. In the random surfer model this corresponds to the probability that the surfer is forced to end his tour in a sink.

In addition corollary 2.3 has an interesting and intuitively very obvious interpretation. For a sink ω all values a_l(ω,u) for l ≥ 1 are equal to zero. Hence only a₀(ω,ω) is not zero and the sink has no influence on the PageRank of all other nodes. Therefore the sinks can be omitted from the iteration.

Iterating the argument one can exclude even more nodes from the iteration. Consider the following iteratively defined sets of nodes:

Ω₀ = {ω | out(ω) = 0}
Ω_{i+1} = Ω_i ∪ {ω | ω → u implies u ∈ Ω_i}

Ω₀ is the set of sinks. Ω_{i+1} is constructed from Ω_i by adding all nodes which only link to nodes in Ω_i. We choose Ω := Ω_i such that i is the minimal index with Ω_i = Ω_{i+1}. The definition implies an iterative algorithm, which in step i + 1 runs through the nodes not in Ω_i and adds them to Ω_{i+1} if appropriate. This leads to an order ω₁, . . . , ω_k on the elements of Ω.

Lemma 6.1. $a_l(\omega, v) = 0$ for ω ∈ Ω, v ∉ Ω and l ≥ 0.

Proof: Obviously $a_l(\omega, v) = 0$ for each ω ∈ Ω₀ and v ≠ ω. Now assume that $a_l(\omega, v) = 0$ for ω ∈ Ω_i, v ∉ Ω and l ≥ 0. Then for ω ∈ Ω_{i+1} and v ∉ Ω we have $a_0(\omega, v) = 0$


and

$a_{l+1}(\omega, v) = \frac{1}{\mathrm{out}(\omega)} \sum_{u \mid \omega \to u} a_l(u,v) = 0,$

because each u with ω → u is in Ω_i and hence $a_l(u,v) = 0$. □

As a consequence the nodes in Ω have no effect on the other nodes. Hence they may be omitted from the iteration. Their PageRank can be calculated from the known rankings via

$\mathrm{PageRank}(\omega) = d \sum_{u \mid u \to \omega} \frac{\mathrm{PageRank}(u)}{\mathrm{out}(u)} + (1-d)e(\omega).$

To be sure that the ranking of each predecessor u is known at the time of calculation, one simply has to calculate the ranks in the reversed sequence ω_k, . . . , ω₁ obtained during the construction of Ω. This guarantees that a predecessor either was not a member of Ω or was in a higher level of Ω and hence its value is already known.

6.2. Sources. Regarding sources, i.e. nodes α without incoming edges, the situation is the other way round. Obviously we have PageRank(α) = (1 − d)e(α). By iterative construction of the sets

A₀ = {α | in(α) = 0}
A_{i+1} = A_i ∪ {α | w → α implies w ∈ A_i}

one obtains a sequence α₁, . . . , α_h of nodes which can be omitted from the iteration. For α ∈ A_i its PageRank either is known in advance (i = 0), or it can be calculated directly from the ranks of nodes in A_{i−1}. Let A = A_i be given such that i is the minimal index with A_i = A_{i+1}. Since the ranks of nodes in A may be used during the iteration, they have to be computed before the iteration starts.

6.3. A faster Algorithm? The observations about sinks and sources lead to an adapted algorithm (sketched in code below):

(1) Iteratively construct A = {α₁, . . . , α_h} and Ω = {ω₁, . . . , ω_k}.
(2) Ordered by i, set
    $\mathrm{PageRank}(\alpha_i) = d \sum_{w \mid w \to \alpha_i} \frac{\mathrm{PageRank}(w)}{\mathrm{out}(w)} + (1-d)e(\alpha_i).$
(3) Calculate PageRank iteratively on all nodes v ∉ A ∪ Ω.
(4) In reverse order, set
    $\mathrm{PageRank}(\omega_i) = d \sum_{w \mid w \to \omega_i} \frac{\mathrm{PageRank}(w)}{\mathrm{out}(w)} + (1-d)e(\omega_i).$
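A compact sketch of this adapted algorithm follows; the graph is an illustrative example, and the helper `peel` plays the role of the iterative construction of A and Ω.

# Peel off iterated sources A and iterated sinks Omega, iterate the core,
# then score the peeled nodes directly. All concrete values illustrative.
d = 0.85
graph = {0: [1], 1: [2, 3], 2: [1], 3: [4], 4: [], 5: [0]}  # 5 source, 4 sink
nodes = sorted(graph)
e = {v: 1.0 / len(nodes) for v in nodes}
preds = {u: [v for v in nodes if u in graph[v]] for u in nodes}

def peel(targets):
    # repeatedly collect nodes whose edges (in the given direction) all lead
    # into the set collected so far; returns them in construction order
    done, seq = set(), []
    changed = True
    while changed:
        changed = False
        for v in nodes:
            if v not in done and all(w in done for w in targets(v)):
                done.add(v); seq.append(v); changed = True
    return seq

omega = peel(lambda v: graph[v])    # iterated sinks: all out-edges into Omega
alpha = peel(lambda v: preds[v])    # iterated sources: all in-edges from A

def score(v, rank):
    # by construction every predecessor needed here is already scored;
    # the membership guard is purely defensive
    return d * sum(rank[w] / len(graph[w]) for w in preds[v] if w in rank) \
           + (1 - d) * e[v]

rank = {}
for a in alpha:                     # step (2): sources in construction order
    rank[a] = score(a, rank)
core = [v for v in nodes if v not in set(alpha) | set(omega)]
for v in core:
    rank[v] = (1 - d) * e[v]
for _ in range(100):                # step (3): iterate only on the core
    rank.update({v: score(v, rank) for v in core})
for w in reversed(omega):           # step (4): sinks in reverse order
    rank[w] = score(w, rank)
print(rank)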

Depending on the graph, the pre- and post-processing may significantly reduce the size of the graph on which the iteration has to be performed. Experiments based on samples of a crawl of the uk domain⁶ [web] show that the percentage of time saved due to the extraction of iterated sinks and sources is about 2.6% (see tab. 1), while the number of edges used in the iterations was reduced by 6.1%. These values are, up to moderate variations, independent of the number of vertices and edges.

⁶The crawls were obtained by depth-first searches starting from 10 random vertices.


Comparison of Runtimes of the Standard and the Improved Algorithm

Vertices  Edges     Sources  Sinks   Rem. Edges  Standard  Improved  Edge Ratio  Time Ratio
50000     917233    7        6251    869141      68584     67492     0,948       0,984
50000     942304    9        8042    888356      69460     68156     0,943       0,981
50000     943389    8        8302    888861      69294     67786     0,942       0,978
50000     949646    6        8035    896677      69830     68402     0,944       0,980
50000     988359    5        7063    937987      72508     71398     0,949       0,985
100000    1720248   7        17927   1613699     128680    124658    0,938       0,969
100000    1753619   7        15445   1630987     130200    125690    0,930       0,965
100000    1829260   8        15352   1714334     136302    132324    0,937       0,971
100000    2008708   5        15373   1915968     147028    145704    0,954       0,991
100000    2594222   9        17379   2416041     185246    180488    0,931       0,974
200000    3608664   3        31018   3404691     266768    261040    0,943       0,979
200000    3789710   5        32075   3591242     279836    274578    0,948       0,981
200000    4021397   7        29780   3793958     293638    287910    0,943       0,980
200000    4031564   5        30536   3785828     294944    287596    0,939       0,975
200000    4090879   8        32491   3759175     299758    286882    0,919       0,957
300000    5323560   4        48710   4988322     394950    383466    0,937       0,971
300000    5667435   5        46873   5364362     417494    409432    0,947       0,981
300000    5737053   8        52724   5335334     421866    407034    0,930       0,965
300000    5758545   7        45159   5439614     427448    418134    0,945       0,978
300000    6300042   7        48764   5871255     452642    440310    0,932       0,973
500000    8891571   6        88348   8267953     619466    597132    0,930       0,964
500000    9116317   7        85174   8502149     670740    648200    0,933       0,966
500000    9399429   6        83240   8776220     698504    675930    0,934       0,968
500000    9591599   4        82251   8910032     707290    681502    0,929       0,964
500000    10619428  5        75457   10063884    761588    744906    0,948       0,978
750000    12996280  8        126849  12177586    930228    904326    0,937       0,972
750000    13285463  7        127735  12426479    984928    953990    0,935       0,969
750000    13489963  8        123166  12606496    997201    966366    0,935       0,969
750000    13752241  5        111797  12923384    1020713   994887    0,940       0,975
750000    15323023  6        119842  14390760    1110343   1083363   0,939       0,976
1000000   17474000  160      167953  16375208    1308330   1265920   0,937       0,968
1000000   18496113  3        167951  17431133    1362314   1326006   0,942       0,973
1000000   19249868  6        160471  18100707    1421159   1385099   0,940       0,975
1000000   19758854  4        157466  18603814    1453838   1420842   0,942       0,977
1000000   20074026  8        157268  19011465    1481932   1460930   0,947       0,986

Average                                                              0,939       0,974
Standard Deviation                                                   0,007       0,007

Remaining Edges: Number of edges leading to non-sink and non-source vertices
Standard/Improved Algorithm: Avg. runtime of the algorithm (100 iterations, 5 runs) in ms
Edge Ratio: Number of remaining edges divided by total number of edges
Time Ratio: Improved time divided by standard time

Table 1. Runtimes of the standard and the improved algorithm

7. Arbitrary Probabilities

During the whole paper we assumed that the random surfer chooses his next destination uniformly among the neighbors of his current position v, leading to a probability of $\frac{1}{\mathrm{out}(v)}$ for each outgoing link. But all results also hold in a more general setting.


Assume that the probability of going from v to u is given by 0 ≤ p_{v,u} ≤ 1 if the link from v to u exists, and 0 otherwise. These values form a V × V-matrix P = (p_{v,u})_{v,u∈V}, such that the sums $\sum_u p_{v,u}$ are either 1 if v has outgoing links, or 0 if not. Therefore P is not quite stochastic. Nonetheless one can calculate the PageRank for this setting, using the fixpoint condition

$\mathrm{PageRank}(u) = d \sum_{v \mid v \to u} p_{v,u}\,\mathrm{PageRank}(v) + (1-d)e(u).$

As before this leads to

$\mathrm{PageRank}(u) = \sum_{l=0}^{\infty} (1-d)\,d^l \sum_v a_l(v,u)\,e(v)$

with

$a_0(v,u) = \begin{cases} 1 & \text{if } v = u \\ 0 & \text{otherwise} \end{cases}$ and $a_{l+1}(v,u) = \sum_{w \mid w \to u} a_l(v,w)\,p_{w,u}.$
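In code the generalization is a one-line change: the uniform weight 1/out(v) is replaced by an edge-specific probability. The weights in the following sketch are illustrative choices.

# Sketch of the generalized setting: edge (v,u) carries its own probability.
d = 0.85
p = {                                           # p[v][u] sums to 1 per non-sink v
    0: {1: 0.3, 2: 0.7},
    1: {2: 1.0},
    2: {0: 1.0},
    3: {},                                      # sink: row sums to 0
}
nodes = sorted(p)
e = {v: 1.0 / len(nodes) for v in nodes}

r = {v: (1 - d) * e[v] for v in nodes}
for _ in range(200):
    r = {u: d * sum(r[v] * p[v].get(u, 0.0) for v in nodes)
            + (1 - d) * e[u] for u in nodes}
print(r)  # the uniform case p[v][u] = 1/out(v) recovers ordinary PageRank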

Obviously, all results presented in this paper can be directly transferred to this situation. Hence PageRank is only a quite canonical member of a whole class of rankings.

8. Conclusion

As we have seen, the representation of PageRank as a power series provides deeper insight into the nature and properties of this ranking. The effects of the parameters, i.e. the graph, the personalization vector and the damping factor, are clearly separated from each other, and their influence on the resulting scores becomes clear. Due to this separation we were able to give an exact interpretation in terms of a random surfer, interpret the coefficients a_l as probabilities of paths, and gain insight into the convergence properties of the standard iteration (4) without using the theory of Markov chains. Furthermore we were able to establish a special, supposedly more efficient treatment of sources and sinks and closely related nodes. Additionally the new formulation of the solution provides new tools for a more detailed analysis of adaptations and suggestions found in the literature.

References

[ANTT01] A. Arasu, J. Novak, A. Tomkins, and J. Tomlin. PageRank computation and the structure of the web: Experiments and algorithms, 2001. citeseer.ist.psu.edu/arasu02pagerank.html.
[BGS03] M. Bianchini, M. Gori, and F. Scarselli. Inside PageRank. ACM Trans. Internet Tech., 2003.
[BP98] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine. In Proc. of the 7th World Wide Web Conference (WWW7), 1998.
[BPMW99] Sergey Brin, Lawrence Page, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford Digital Library Technologies Project, 1999. http://dbpubs.stanford.edu:8090/pub/1999-66.
[Hav02] Taher H. Haveliwala. Topic-sensitive PageRank. In Proc. of the 11th WWW Conference (WWW11), pages 517–526, 2002.
[Hav03] Taher H. Haveliwala. Topic-sensitive PageRank: A context-sensitive ranking algorithm for web search. IEEE Trans. Knowl. Data Eng., 15(4):784–796, 2003.

[HK03] T. Haveliwala and S. Kamvar. The second eigenvalue of the Google matrix. Technical Report 2003-20, Stanford University, 2003. http://dbpubs.stanford.edu/pub/2003-20.
[HKJ03] T. Haveliwala, S. Kamvar, and G. Jeh. An analytical comparison of approaches to personalizing PageRank. Technical report, Stanford University, 2003.
[JW02] Glen Jeh and Jennifer Widom. SimRank: A measure of structural-context similarity. In Proc. 8th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, July 2002.
[JW03] G. Jeh and J. Widom. Scaling personalized web search. In Proc. 12th Intl. WWW Conf., May 2003.
[KHG03] S. Kamvar, T. Haveliwala, and G. Golub. Adaptive methods for the computation of PageRank. Technical report, Stanford University, 2003.
[KHMG03a] Sepandar D. Kamvar, Taher H. Haveliwala, Christopher D. Manning, and Gene H. Golub. Exploiting the block structure of the web for computing PageRank. Technical Report 2003-17, Stanford University, 2003. http://dbpubs.stanford.edu:8090/pub/2003-17.
[KHMG03b] Sepandar D. Kamvar, Taher H. Haveliwala, Christopher D. Manning, and Gene H. Golub. Extrapolation methods for accelerating PageRank computations. In WWW, pages 261–270, 2003.
[web] Webgraph. University of Milano. http://webgraph.dsi.unimi.it/.

Michael Brinkmeier, Technical University of Ilmenau, Faculty of Computer Science and Automation, Institute for Theoretical and Technical Computer Science
E-mail address: [email protected]