Communication-optimal Parallel Minimum Spanning Tree Algorithms

Micah Adler¹   Wolfgang Dittrich²   Ben Juurlink³   Miroslaw Kutylowski⁴   Ingo Rieping³

Abstract

Lower and upper bounds for finding a minimum spanning tree (MST) in a weighted undirected graph on the BSP model are presented. We provide the first non-trivial lower bounds on the communication volume required to solve the MST problem. Let p denote the number of processors, n the number of nodes of the input graph, and m the number of edges of the input graph. We show that in the worst case, a total of Ω(β · min(m, pn)) bits need to be communicated in order to solve the MST problem, where β is the number of bits required to represent a single edge weight. This implies that if each message communicates at most β bits, any BSP algorithm for finding an MST requires communication time Ω(g · min(m/p, n)), where g is the gap parameter of the BSP model.

In addition, we present two algorithms with communication requirements that match our lower bound in different situations. Both algorithms perform linear work for appropriate values of n, m and p, and use a number of supersteps that is bounded for arbitrarily large input sizes. The first algorithm is simple but can employ at most m/n processors efficiently. Hence, it should be applied in situations where the input graph is relatively dense. The second algorithm is a randomized algorithm that performs linear work with high probability, provided that m ≥ n · log p.







¹ Department of Computer Science, University of Toronto, Canada. Email: [email protected]. Supported by an operating grant from the Natural Sciences and Engineering Research Council of Canada, and by ITRC, an Ontario Centre of Excellence. This research was conducted in part while the author was at the Heinz Nixdorf Institute Graduate College, Paderborn, Germany.
² Bosch Telecom GmbH, Dept. UC-ON/ERS, Backnang, Germany. Email: [email protected]. This research was done while the author was working at the University of Paderborn, Germany.
³ Heinz Nixdorf Institute, Dept. of Computer Science, University of Paderborn, Germany. Email: {benj,[email protected]. This research was partially supported by DFG-SFB 376 "Massive Parallelität" and EU ESPRIT Long Term Research Project 20244 (ALCOM-IT).
⁴ Computer Science Institute, University of Wroclaw, Poland. Email: [email protected].


1 Introduction

Parallel computation models like the BSP [Val90], the BSP* [BDM95], the QSM [GMR97], the CGM [DFRC93] and the LogP [CKP+93] are used more and more often for designing and analyzing parallel algorithms. In contrast to PRAM models, these models account for communication cost by introducing parameters that correspond to considerations such as communication bandwidth, latency or the startup cost of a message transmission. For this reason, algorithms derived using these models often have more practical relevance than PRAM algorithms.

All models mentioned above provide the incentive to design parallel algorithms that use less communication time than computation time. Such algorithms are called communication-efficient. For "regular" problems such as matrix multiplication, FFT and LU decomposition, this is relatively easy to achieve. Communication-efficient algorithms have also been obtained for a number of "irregular" problems such as sorting [GV94, ABK95, Goo96, GS96], multisearch [BDM95, BD96, Goo97] and parallel priority queues [BDMR96a]. However, for many graph problems such as list ranking, connected components and minimum spanning tree, it seems difficult to find communication-efficient algorithms. For many of these problems, very fast (linear time) sequential algorithms exist, but efficient parallel algorithms are unknown. In this paper, we consider one of these problems, that of finding a minimum spanning tree (MST), and use the BSP model [Val90] as our cost model.

Briefly, a BSP computer consists of a set of processor/memory modules, a routing network, and a mechanism for synchronizing the processors in a barrier style. The performance of a BSP computer depends on three parameters that specify the number of processors p, the computation-to-communication throughput ratio g, and the latency/synchronization cost L. A BSP computation consists of a sequence of supersteps, separated by barrier synchronizations. In a superstep, a processor can perform local operations, send messages, and receive messages. Messages received during one superstep cannot be used until the next superstep. The cost of a superstep is given by w + g·h + L, where w is the maximum number of local operations performed by any processor, and h is the maximum number of messages sent or received by any processor, also called the (per-processor) BSP communication volume of the superstep. (Note that the last notion is somewhat vague, since it is usually not specified how large a single message in the BSP model might be. However, h is widely used as a measure of communication efficiency.) The total cost of a BSP algorithm is simply the sum of the costs of its constituent supersteps. In this paper we often analyze the three cost components separately, and thus, we let W denote the total (per-processor) local work performed by the algorithm, H denote the total (per-processor) BSP communication volume, and S denote the number of supersteps. The communication time of a BSP algorithm is then computed as g·H + L·S. For more details and motivation for the BSP model, the reader is referred to [Val90].
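To make the cost accounting concrete, the following sketch (our illustration, not part of the original paper; all names are ours) tallies W, H and S over a sequence of supersteps and reports the communication time g·H + L·S separately from the local work:

class BSPCost:
    """Accumulate the BSP cost of an algorithm, superstep by superstep."""

    def __init__(self, g, L):
        self.g, self.L = g, L   # machine parameters as defined above
        self.W = 0              # total (per-processor) local work
        self.H = 0              # total (per-processor) communication volume
        self.S = 0              # number of supersteps

    def superstep(self, w, h):
        """Charge one superstep: w local operations, an h-relation."""
        self.W += w
        self.H += h
        self.S += 1

    def communication_time(self):
        return self.g * self.H + self.L * self.S

    def total_cost(self):
        """Total cost W + g*H + L*S of all supersteps so far."""
        return self.W + self.communication_time()

# Example: three supersteps on a machine with g = 4 and L = 100.
cost = BSPCost(g=4, L=100)
cost.superstep(w=1000, h=50)
cost.superstep(w=500, h=20)
cost.superstep(w=200, h=5)
print(cost.total_cost())  # 1700 + 4*75 + 100*3 = 2300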

1.1 Definition of the MST Problem

Given a connected undirected graph G = (V, E), each of whose edges e has a weight w(e), we let n = |V| denote the number of nodes and m = |E| denote the number of edges of the input graph. A tree T = (V, F) with F ⊆ E and |F| = |V| − 1 is called a spanning tree of G if it connects all the vertices. The weight w(T) of the spanning tree is

    w(T) = Σ_{e∈F} w(e).


T is a minimum spanning tree (MST) if w(T) ≤ w(T′) for all other spanning trees T′. In the parallel version of this problem, we assume that each edge is stored exactly once, and that the edges can be distributed over the processors in any manner where each processor has the same number of edges.
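Throughout the paper, T_seq(n, m) denotes the running time of a sequential MST algorithm on n nodes and m edges. As a concrete baseline (a textbook method, shown here for illustration only and not one of the algorithms of this paper), the following sketch computes a minimum spanning forest with Kruskal's algorithm and union-find:

def msf_kruskal(n, edges):
    """Minimum spanning forest by Kruskal's algorithm.

    n     -- number of nodes, labeled 0..n-1
    edges -- list of (weight, u, v) triples
    Returns the list of forest edges; runs in O(m log m) time.
    """
    parent = list(range(n))

    def find(x):                      # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    forest = []
    for w, u, v in sorted(edges):     # scan edges by increasing weight
        ru, rv = find(u), find(v)
        if ru != rv:                  # edge joins two components: keep it
            parent[ru] = rv
            forest.append((w, u, v))
    return forest

# The 6-node example graph of Figure 5 (Section 7), nodes renumbered 0..5:
edges = [(3, 0, 1), (1, 0, 2), (4, 1, 2), (2, 1, 5), (6, 1, 3),
         (9, 2, 3), (8, 2, 4), (5, 3, 5), (7, 4, 5)]
print(msf_kruskal(6, edges))          # 5 edges of total weight 1+2+3+5+7 = 18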

1.2 Overview of Results

A natural first step for any parallel MST algorithm is for each processor i to compute a minimum spanning forest on the edges local to i. Once this has been accomplished, the number of edges remaining in the graph is at most min(m, np). In this paper, we demonstrate that the quantity min(m, np) defines the communication requirements of the MST problem.

We provide the first non-trivial lower bounds on the communication volume required to solve the MST problem. In particular, we show that for any values of p, m, and n, the total number of bits that need to be communicated, in the worst case, is Ω(β · min(m, pn)), where β is the number of bits required to represent an edge weight. The lower bound is information-theoretic, and holds for Las Vegas randomized as well as deterministic algorithms. The worst-case distribution of edges to processors required by this lower bound is natural: we assume only that at the start of the algorithm, the edges are sorted by source node for an arbitrary orientation of the edges.

A lower bound on the communication volume required to solve a problem implies a lower bound on the running time of any BSP algorithm solving that problem. However, this lower bound depends on b, the number of message bits contained in each BSP message packet. When b = β, L ≥ g, and b = Ω(log p), the lower bound implies that the communication time required by any BSP algorithm is Ω(g · min(m/p, n)). Note that the requirement b = Ω(log p) is equivalent to saying that the number of message bits in each packet is not dominated by the number of bits needed to describe the packet destination. Furthermore, the techniques used to prove this bound do not assume the BSP model, and thus can be used to provide lower bounds for other models that limit interprocessor communication, including the LogP [CKP+93] and the QSM [GMR97]. The lower bound techniques we develop are also quite general. For example, they can be used to provide similar lower bounds on the communication volume required for the problem of finding a minimum matching.

We also demonstrate upper bounds that asymptotically match the lower bound on communication volume provided that m ≥ n · log p, i.e., on all but very sparse graphs. To do this, we present two different algorithms: one for the case m ≥ np, which we call mst-dense, and one for the case m ≤ np, which we call mst-sparse.

Algorithm mst-dense requires W = O(n · log log p + T_seq(n, m/p)) local computation time, H = O(n) BSP communication volume and S = O(log p) supersteps. Here, T_seq(n, m/p) denotes the running time of a sequential MST algorithm on an input graph with n nodes and m/p edges. Notice that the BSP communication volume H is independent of m (provided that m ≥ np) and that the number of supersteps is independent of the problem size. Furthermore, when m ≥ np, the communication time matches the lower bound. However, the main drawback of the algorithm is that it can employ at most m/n processors efficiently.

mst-sparse is a Las Vegas randomized algorithm based on the sequential linear-time algorithm presented in [KKT95]. This algorithm requires a stronger assumption on the initial distribution of edges to processors. Specifically, we assume that every edge is stored twice (once in each direction), and that the edges are sorted by source node. However, when m/p ≥ p^ε, we can use integer sorting to convert an arbitrary distribution of edges to processors

into the distribution we use, without significantly affecting any of the performance measures of our algorithm.

mst-sparse uses several known ideas, but combines them together in a novel way. For example, it uses a BSP implementation of the algorithm due to Awerbuch and Shiloach [AS87]. This is a CRCW PRAM algorithm, and thus the best known work-preserving emulation on the BSP model needs (n + m)/p ≥ p^ε slackness for some constant ε > 0. We show that the amount of slackness required as well as the number of supersteps can be reduced by keeping the graph in a representation in which all edges incident to a node are stored consecutively. Furthermore, mst-sparse uses an observation due to Johnson and Metaxas [JM95], which states that in order to reduce the number of nodes by a factor of k, it is sufficient to consider only the k cheapest edges incident to each node. In our algorithm, however, edges are discarded much more aggressively in order to keep the communication time low. Overall, mst-sparse requires W = O(T_seq(n/p, m/p) + (n · log p)/p) local work, H = (12 + ε)·(m/p) + o(m/p) + O((n · log p)/p) BSP communication volume, and S = O(log p · ⌈log p / log((m + n)/p)⌉) supersteps. Thus, it performs linear work provided that m ≥ n · log p, and when np ≥ m ≥ n · log p, the communication time matches the lower bound.

The total BSP costs of algorithms mst-dense and mst-sparse are summarized in the following table.

Algorithm  | Randomized | W                             | H                  | S
mst-dense  | No         | T_seq(n, m/p) + n·log log p   | n                  | log p
mst-sparse | Yes        | T_seq(n/p, m/p) + (n/p)·log p | m/p + (n/p)·log p  | log p · ⌈log p / log((m+n)/p)⌉

We compare our algorithms with two sequential MST algorithms: (1) mst-rand, the randomized linear-time algorithm due to Karger, Klein and Tarjan [KKT95], and (2) mst-dijkstra, the algorithm commonly attributed to Dijkstra implemented using Fibonacci heaps [CLR90], which requires O(n · log n + m) time. Given a sequential algorithm A, we say that a parallel algorithm for the MST problem is work-optimal with respect to A if it performs the same amount of local work as A. It is said to be communication-optimal if the per-processor communication volume H is O(min(n, m/p)). The following table shows the conditions under which the presented algorithms are work-optimal and communication-optimal.

Algorithm  | Compared with | Work-optimal            | Com.-optimal
mst-dense  | mst-dijkstra  | m ≥ n · log n · p       | m ≥ np
mst-dense  | mst-rand      | m ≥ n · p · log log p   | m ≥ np
mst-sparse | mst-dijkstra  | m ≥ n · log p ∨ n ≥ p   | np ≥ m ≥ n · log p
mst-sparse | mst-rand      | m ≥ n · log p           | np ≥ m ≥ n · log p

1.3 Previous Work

Many PRAM algorithms for computing the MST exist. The algorithm due to Awerbuch and Shiloach [AS87] runs in O(log n) time on an (n + m)-processor PRIORITY CRCW PRAM, where the priority of a processor is determined by the weight of the edge assigned to it. This is a very powerful contention resolution rule. Johnson and Metaxas [JM95] gave an algorithm

running in O(log^{3/2} n) time on an (n + m)-processor EREW PRAM. Using the same number of processors, Chong [Cho97] presented an algorithm running in O(log n · log log n) time. Karger [Kar93] was the first to use randomization in parallel MST algorithms. He gave an EREW PRAM algorithm that requires O(log n) time and uses m/log n + n^{1+ε} processors. A randomized sequential linear-time algorithm is given in [KKT95]. A parallel version of this algorithm, running in O(log n) time and performing linear work on a CRCW PRAM, is described in [CKT96].

Concurrently to our work, several groups have investigated the MST problem on various parallel computation models. Dehne and Götz [DG97] presented MST algorithms for the CGM model. They described a randomized algorithm that can be implemented on the BSP in time O(((n+m)/p)·log p + g·((n+m)/p)·log p + L·log p). If bit operations on the edge weights are possible, the algorithm can be made deterministic. They also gave an algorithm for dense graphs. If m = Ω(np), this algorithm takes O((n+m)/p + g·(n+m)/p + L·log p) time on the BSP model. In comparison, algorithm mst-dense has an extra term O(n·log log p) = O((m/p)·log log p) in its running time. On the other hand, the communication time of mst-dense is independent of m, and we also show that a slight modification of mst-dense needs only a constant number of supersteps. Furthermore, Dehne and Götz always assume that sufficient slackness is available ((n+m)/p ≥ p), whereas we assume arbitrary slackness (n + m ≥ p). Moreover, their algorithms are not work-efficient for sparse input graphs.

Poon and Ramachandran [PR97] considered the MST problem on the EREW PRAM, the QSM and the BSP. They designed a randomized algorithm for the EREW PRAM that performs linear expected work and runs in time T(n, m) = O(2^{log* f} · log n · log log n), where f = 1 + log n · log log n · n/m. For all m and n, this provides a linear-work algorithm with running time O(log n · log log n · 2^{log* n}), and when m = Ω(n·(log n · log log n)²), this is improved to time O(log n · log log n). They also show that this implies a linear-work algorithm for the BSP model with running time O(g · T(n, m) · ((L/g) + g·log p)). When m = O(n·p), the resulting BSP algorithm is also communication-optimal.

In contrast to the Poon-Ramachandran algorithm, we only provide work-optimal algorithms for the case that m ≥ n · log p, but our algorithms also have several advantages. First, our algorithms require at most O(log² p) supersteps, and in many cases only O(log p) supersteps; thus the number of supersteps is bounded for arbitrarily large input sizes. The Poon-Ramachandran BSP algorithm needs T(n, m) supersteps. Second, our algorithms are designed directly for the BSP model, which obviates the need to hash the memory address space. Third, we introduce some novel techniques in order to reduce the constant in the term O(g·m/p) of the communication time required by algorithm mst-sparse. Finally, our algorithms are communication-optimal as long as m = Ω(n · log p). The Poon-Ramachandran BSP algorithm, on the other hand, is no longer communication-optimal when m = ω(np).

1.4 Organization of the Paper

In Section 2 the lower bound technique is described and applied to the problem of finding the MST on the BSP model. In Section 3, BSP algorithms are presented for some basic operations that are used frequently in our algorithms. Section 4 describes the main ideas and techniques employed in the MST algorithms. Algorithm mst-dense is described in Section 5. A top-level description of algorithm mst-sparse is given in Section 6. It employs three subalgorithms, which are described in Sections 7 through 9.

2 Communication Lower Bounds for the MST Problem

2.1 Communication Volume

We here present a general, model-independent framework for proving lower bounds on the amount of communication required by an algorithm. We then demonstrate the power of this framework by describing how to use it to obtain lower bounds on the amount of time required for communication in a BSP algorithm.

We here introduce the concept of communication volume. This takes into account the amount of information processors obtain both directly, i.e., from bits that are sent and received, as well as indirectly, i.e., from information such as "no message was received at time t", or "the third message received came from processor i". It is known that such indirect methods of communication can be used efficiently. An example of this is the algorithm of [CDR86] that computes the OR of n bits on an EREW PRAM in approximately 0.72·log n steps. Any EREW or CREW PRAM algorithm computing this function which uses only direct communication requires at least log n steps.

Our only assumption about the model of computation is that each processor communicates with the other processors through some communication facility, which we call here a router. We assume that when an algorithm terminates, at least one processor produces each bit of the output correctly, and no bit of the output is produced incorrectly. For any model M, algorithm A, processor P and problem input I, we define a transcript T(M, A, I, P), a string of bits which encodes all the information discernible to P and to any other processor regarding the flow of information between P and the router. Any encoding to binary strings is allowed, provided that the following condition is adhered to: for any algorithm A and two inputs I₁ and I₂ such that on algorithm A any processor is able to differentiate between I₁ and I₂ based solely on the flow of information between P and the router, T(M, A, I₁, P) ≠ T(M, A, I₂, P).

Definition 1 Let |T(M, A, I, P)| be the number of bits in the transcript T(M, A, I, P). When the input I is chosen randomly from a distribution of inputs ℐ, then |T(M, A, ℐ, P)| is a random variable, and has an expectation E[|T(M, A, ℐ, P)|] that depends on the method of encoding the transcript. We say that the per-processor communication volume of a processor P in model M running algorithm A on distribution ℐ is the minimum, over all encodings of the transcript, of E[|T(M, A, ℐ, P)|]. The total communication volume is the sum of the communication volumes of all processors.

Thus, for example, when ℐ consists of z equally likely inputs, all of which must have a different transcript for processor P, then the per-processor communication volume of processor P must be at least log z. Note that the per-processor communication volume is essentially equivalent to the entropy of the communication between P and the remainder of the processors, on an input chosen from ℐ.

The crucial feature of the above definition is that if we find a lower bound v on the communication volume at processor P (by choosing an appropriate probability distribution of the inputs), then for each method of encoding the transcripts there is an input with a transcript at P of length at least v. (Note that which input has this property may depend on the method of encoding the transcript; one can adjust a communication protocol to favor a given input.) Therefore, if we have a lower bound of v on the communication volume at processor P, then we may less formally say that "there is an input where the communication volume at P is at least v".

This information-theoretic approach to the definition of the communication volume has two advantages over an analysis performed for a specific model. First, proving a lower bound for a specific model is no more general than the model used. In particular, for the lower bound to apply to other models, it is necessary to demonstrate that the model where the lower bound applies is at least as powerful as the other models. Second, even when significant details of the communication mechanism for a model are left unspecified (for example, in the BSP model, the maximum size of a message as well as the method of specifying the destination address of a message), lower bounds on communication volume apply regardless of the actual realization of these details. Below, we show how a general information-theoretic bound may be applied to a specific model: the BSP model.

Lemma 1 A lower bound of V on the per-processor communication volume, for any processor, implies a lower bound of Ω(gV/b) on the worst-case time required for communication in the BSP model, provided that L ≥ g and that the BSP model computes with messages consisting of b bits, where b = Ω(log p).

Proof: We show that the communication at a processor P during any execution of a BSP algorithm that completes in time gV/b can be encoded into a transcript of length O(V), from which the lemma follows directly. Each superstep is represented by a sequence of messages sent or received by P. Each message is encoded by a triple: source address, destination address and message bits. The sequences describing supersteps are separated by end-of-superstep marks. This allows, for example, processors to identify supersteps where they do not receive messages. By the assumption L ≥ g, the computation considered may consist of at most V/b supersteps. Hence, separating supersteps requires O(V/b) bits. The total number of messages sent in time gV/b is at most V/b, and thus the total number of message bits is at most V. Also, by the assumption that each message has length Ω(log p), the total number of bits required for source and destination addresses is at most O(V). □

Note that in most implementations, the destination address is included in a packet that is sent to the router. When this is the case, the assumption that b = Ω(log p) holds whenever the number of message bits is not asymptotically smaller than the number of bits used to describe a packet destination.

2.2 Lower Bounds for the Vector Minimum Problem

To show a lower bound for the MST problem we first show a lower bound on a related problem, and then reduce this problem to the MST problem.

Definition 2 Given n sets X₁, …, X_n, where each set X_i consists of p keys x_{i,0}, …, x_{i,p−1}, the vector minimum of the X_i's is the vector (min(X₁), min(X₂), …, min(X_n)). We assume that there are p processors P₀, …, P_{p−1}, and, for every j < p, processor P_j holds the numbers x_{1,j}, …, x_{n,j} (one number from each X_i). The vector minimum (VM) problem is to compute the vector minimum of X₁, …, X_n.

Consider the VM problem where each element of the sets X_i is a β-bit number. For the majority of inputs, the total communication volume of this problem is far below Θ(β·np). However, it turns out that the worst-case complexity is quite large:

Theorem 1 Consider the VM problem where each of the n sets X_i consists of β-bit numbers and the number of processors is p. For any deterministic algorithm solving this problem, there is an input for which the total communication volume is Ω(β·np). Furthermore, for any randomized algorithm that always answers correctly, there is an input where the expected total communication volume is Ω(β·np).

Proof: We consider the behavior of deterministic algorithms running on an input drawn from a distribution D which we define below. We show that under the distribution D, the communication volume is large. Thus, for any deterministic algorithm and method of encoding the transcript, there must exist some input which also requires a large communication volume. Furthermore, from Yao's Lemma [Yao83], this also implies the stated result about randomized algorithms. An input from the distribution D is chosen as follows.

- We choose n seeds s₁, …, s_n independently and uniformly at random from the set of all (β − 1)-bit binary strings. For 1 ≤ i ≤ n and 0 ≤ j < p, let x_{i,j} = s_i.

- With probability 1/2, we choose one processor P_u uniformly at random, and replace each number x_{i,u} held by P_u with a completely random (β − 1)-bit string. This is referred to as scrambling processor P_u. With probability 1/2, we do not scramble any processor.

- For each j, 0 ≤ j < p, we append a 1 to each string x_{i,j}, 1 ≤ i ≤ n, except for the strings x_{h,j} with h ≡ j (mod p), to which we append a 0. (The bit appended is the least significant bit.)

Lemma 2 For any processor P_j, and for the distribution D on inputs to the VM problem, the per-processor communication volume at processor P_j is Ω(βn).

From the linearity of expectation, Lemma 2 gives us that the total communication volume is Ω(βpn). Thus, Theorem 1 follows from Lemma 2.

Proof of Lemma 2: We use a variant of a technique developed by Yao for the study of communication complexity [Yao79]. We consider the actions of processor P_j under the assumption that no other processor has been scrambled. Since this occurs with probability at least 1/2, this can increase the expected communication volume at P_j by at most a factor of 2. We allow the processors different from P_j to determine all the seed values s₁, …, s_n with no charge (this can only decrease the complexity of the problem). We call the string that processor P_j would hold if it had not been scrambled its standard string, and denote this by S. We denote by (S, u) the input where the standard string is S and processor P_j has not been scrambled, and by (S, S̄) the input where the standard string is S and P_j has been scrambled so that it holds S̄. We consider pairs of inputs, called distinguished pairs, of the form (S, u) and (S′, u), where S and S′ differ on some bit in a seed s_i with i ≢ j (mod p).

Claim: The communication transcript between processor P_j and the rest of the processors must differ for the inputs (S, u) and (S′, u), if (S, u) and (S′, u) are a distinguished pair.

Proof of Claim: We assume that (S, u) and (S′, u) are a distinguished pair such that the communication between P_j and the rest of the processors is the same for the inputs (S, u) and (S′, u). We show that there is an input where some processor produces a false bit of the

output. Let s′₁, …, s′_n be the seeds corresponding to S′. Let s_i be a seed such that s_i ≠ s′_i and i ≢ j (mod p), and assume without loss of generality that s′_i < s_i. Let a_i = min X_i. Now, consider the input (S, S′). If processor P_j outputs the least significant bit of a_i on the input (S, S′), then it must output the same bit on the input (S′, u), since the computations on (S, S′) and (S′, u) are indistinguishable for P_j. However, on input (S, S′), a_i = s′_i1 (the string s′_i with 1 appended), and on input (S′, u), a_i = s′_i0. Therefore, some other processor must output the least significant bit of a_i on input (S, S′). But this processor must output the same value on input (S, u), since P_j behaves in exactly the same way while communicating with the remaining processors for the inputs (S, S′) and (S, u). However, on input (S, S′), a_i = s′_i1, and on input (S, u), a_i = s_i0. □

To complete the proof of Lemma 2, we condition upon the event that the seeds s_h, for h ≡ j (mod p), are set. The number of completions of these seeds into a string S is at least z = 2^{(β−1)·⌊n/p⌋·(p−1)}. Each such input (S, u) is equally probable, and together they have probability 1/2. So, the per-processor communication volume is Ω(log z) = Ω(βn). □
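For concreteness, the following sketch (our illustration; the function name and the integer representation are ours) draws an input for the VM problem from the distribution D of the proof, representing each key as an integer whose least significant bit is the appended bit:

import random

def vm_hard_input(n, p, beta, rng=random):
    """Sample an n x p key matrix from the distribution D above.

    x[i][j] is the beta-bit key of set X_{i+1} held by processor P_j.
    """
    # n seeds, each a uniformly random (beta-1)-bit string
    seeds = [rng.randrange(2 ** (beta - 1)) for _ in range(n)]
    x = [[seeds[i]] * p for i in range(n)]

    # with probability 1/2, scramble one uniformly chosen processor
    if rng.random() < 0.5:
        u = rng.randrange(p)
        for i in range(n):
            x[i][u] = rng.randrange(2 ** (beta - 1))

    # append a 1 to every key, except a 0 where the set index is
    # congruent to the processor index modulo p
    for i in range(n):
        for j in range(p):
            bit = 0 if i % p == j else 1
            x[i][j] = (x[i][j] << 1) | bit
    return x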

2.3 Reducing the VM Problem to the MST Problem

We show that the vector minimum problem can be reduced to the minimum spanning tree problem. Therefore, a lower bound on the communication volume for the VM problem implies a lower bound on the communication volume for the MST problem:

Corollary 1 Let m and n be arbitrary positive integers such that m ≥ 2n. Consider the MST problem for graphs with n vertices and m edges with weights encoded by β-bit numbers on a p-processor system. Then there is an input that requires total communication volume Ω(β · min(m, np)).

Applying Lemma 1 we get immediately:

Corollary 2 Solving the MST problem on the BSP model requires communication time Ω(g · (β/b) · min(m/p, n)), provided that L ≥ g and that the BSP model computes with messages consisting of b bits, where b = Ω(log p).

Proof of Corollary 1: We construct a graph G with n nodes and m edges for which solving the MST problem is difficult. The construction is slightly different in two cases.

Case 1: p ≤ m/n. The nodes of G are partitioned into sets A and B, each of size n/2. The nodes of A are partitioned into p disjoint groups, called A₁, …, A_p, each of cardinality n/(2p). The edges of G fall into the following categories (see Figure 1):

- There are n/2 − 1 edges of weight 0 connecting the nodes of A.

- There are edges connecting the nodes of A with the nodes of B. This is arranged so that each node of A is connected with p nodes of B, and the nodes of each A_i, 1 ≤ i ≤ p, are connected with all nodes of B. This is possible, since p · |A_i| = p · n/(2p) = n/2. The weights of these edges are arbitrary odd numbers in the interval [1, 2^{β−1} − 1].

[The figure shows the graph G of Case 1: the set A = A₁ ∪ … ∪ A_p of n/2 nodes (n/(2p) nodes per group) connected by weight-0 edges, and a bipartite layer of edges with arbitrary odd weights between A and the set B of n/2 nodes, numbered n/2 + 1 through n.]

Figure 1: Graph G constructed in the proof of Corollary 1, Case 1.

- In order to make the total number of edges equal to m, we insert m − (n/2 − 1) − pn/2 additional edges connecting arbitrary nodes, each of weight 2^β − 1. Note that the number of these additional edges is nonnegative. Indeed, since p ≤ m/n and n ≤ m,

    m − (n/2 − 1) − pn/2 ≥ m − (m/2 − 1) − m/2 = 1.

It is easy to see that the minimum spanning tree of G must contain all edges of the first kind (if not, then we add them and for each added edge remove some other edge from the resulting cycle; this operation decreases the total weight). Similarly, the minimum spanning tree does not contain any edge of the third kind. Indeed, such an edge cannot connect two nodes of A, since the edges of weight 0 span A. If such an edge connects a node in A with a node b ∈ B, then we may remove this edge and replace it by an arbitrary edge pointing to b taken from the second set of edges. The graph obtained spans G and has a smaller weight. Similarly, we show that an edge connecting two nodes in B can be removed and replaced by an edge from the second set of edges (of a smaller weight). We see that the minimum spanning tree contains all edges of the first kind and, for each b ∈ B, the edge of minimal weight from A to b.

The distribution of the edges to processors is such that the ith processor gets the adjacency list of the edges of the second kind for the nodes of A_i. Thus, finding a minimum spanning tree for G is at least as difficult as computing the minimum over the weights of the edges pointing to node b, over all nodes b ∈ B. This is a VM problem.

Case 2: p ≥ m/n, m ≥ 2n. There are 3 groups of nodes in G: group A consisting of n/2 nodes with p subgroups A₁, …, A_p of size n/(2p) each, group B consisting of m/(2p) nodes, and group C consisting of the remaining n − n/2 − m/(2p) nodes. The edges of G are defined as follows (see Figure 2):

- The nodes of A are connected by n/2 − 1 edges of weight 0.

- Each node of A is connected with m/n nodes of B, so that the nodes of a single group A_i are connected with all m/(2p) = (m/n) · n/(2p) nodes of B. The weights of these edges are arbitrary odd numbers from the interval [1, 2^{β−1} − 1].

[The figure shows the graph G of Case 2: the set C, the set A = A₁ ∪ … ∪ A_p of n/2 nodes connected by weight-0 edges, and a bipartite layer between A and the set B of m/(2p) nodes, numbered n/2 + 1 through n/2 + m/(2p).]

Figure 2: Graph G constructed in the proof of Corollary 1, Case 2.

- The nodes of C are connected to the first node of A by |C| edges of weight 0. Since the number of remaining edges equals m − (n/2 − 1) − (n/2) · (m/n) ≥ n/2 + 1 > |C|, there are enough edges to connect all elements of C.

- We place arbitrary edges of weight 2^β − 1 to make the total number of edges equal to m.

It is easy to see that the minimum spanning tree for G contains, for each b ∈ B, the minimum-weight edge pointing from A to b. The input allocation is such that the ith processor gets the edges pointing from A_i to B. So, we have to solve a vector minimum problem with m/p sets of (β − 2)-bit numbers.

In each case considered, we have to solve a VM problem for min(m/p, n) sets. Therefore, by Theorem 1, there is an input that requires communication volume Ω(min(m/p, n) · p · (β − 2)) = Ω(β · min(m, np)). □

The proof of Corollary 1 is presented for the case where each processor stores exactly m/p edges. Small changes suffice to cover the case where m is not divisible by p.

3 Basic Algorithms

Our algorithms frequently employ some basic operations like broadcast and parallel prefix. In this section, BSP algorithms for these operations are presented. Let k ≥ 1 be an integer. In the k-item broadcast operation, k data items need to be sent from one processor to all other processors. In the k-item parallel prefix operation, k independent parallel prefix operations need to be performed. We present two algorithms for both these problems.

Lemma 3 If k ≥ p^ε for some constant ε > 0, then a k-item broadcast and a k-item parallel prefix operation can be performed in time O(k + g·k + L).

Proof: The two-phase broadcast/prefix algorithm described in [JW96] takes O(k + g·k + L) time, provided that k ≥ p. If k < p, we first broadcast the items to a group of p^ε processors. Since k ≥ p^ε, this takes O(k + g·k + L) time. After that, every processor containing the broadcast items sends them to another group of p^ε processors, and so on. The total BSP complexity can be seen to be O(ε⁻¹ · (k + g·k + L)) = O(k + g·k + L), since ε > 0 is a constant. Using this technique, a k-item parallel prefix operation is possible within the same time. □

A different broadcast/prefix algorithm is obtained by sending the packets along a d-ary tree and by pipelining the packets.

Lemma 4 ([Val90]) For any d ≥ 2, a k-item broadcast or parallel prefix operation can be performed in time O(d·k + g·d·k + L·⌈log_d p⌉).

Thus, by increasing the degree of the tree, one increases the communication volume but reduces the number of supersteps. This fact is employed as follows. Since our MST algorithms for sparse graphs require Ω(g·(n + m)/p) communication time anyhow, we can use a broadcast tree with a fairly large degree without increasing the communication time asymptotically.
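To illustrate the degree/superstep trade-off, here is a sketch (our own, for the single-item case only; the pipelining of k items from Lemma 4 is omitted) of the communication schedule of a tree broadcast in which every informed processor forwards the item to d others per superstep:

def broadcast_schedule(p, d):
    """Superstep schedule of a one-item broadcast from processor 0.

    In every superstep, each processor that already holds the item
    sends it to at most d processors that do not, so the number of
    informed processors grows by a factor of d+1 per superstep and
    ceil(log_{d+1} p) = O(ceil(log_d p)) supersteps suffice.
    Returns a list of supersteps, each a list of (source, dest) pairs.
    """
    informed = 1
    schedule = []
    while informed < p:
        step, senders = [], informed
        for s in range(senders):
            for _ in range(d):
                if informed == p:
                    break
                step.append((s, informed))
                informed += 1
        schedule.append(step)
    return schedule

# p = 16, d = 3: two supersteps, since (1 + 3)^2 >= 16.
for r, step in enumerate(broadcast_schedule(16, 3), 1):
    print("superstep", r, step)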

Corollary 3 A single-item broadcast or parallel prefix operation can be done in time

    T_prefix = O(g·√((n+m)/p) + L·⌈log p / log((m+n)/p)⌉) = o(g·(n+m)/p) + O(L·⌈log p / log((m+n)/p)⌉).

Proof: Set d = √((n+m)/p) in Lemma 4. □

We also need an algorithm for selection.

Lemma 5 Let s ≥ p and let s keys be distributed evenly among the processors. Selecting the key with rank k takes time

    g·(s/p) + O(g·√(s·(n+m))/p) + O(s/p) + L·⌈log p / log((m+n)/p)⌉

with probability at least 1 − n^{−c} for an arbitrary constant c.

Proof: We adapt the algorithm described in [BDMR96a, BDMR96b]. In this algorithm, a ⌈√(s/p)⌉-item broadcast operation needs to be performed. By setting d = √((n+m)/p) in Lemma 4, we obtain the stated time bound. □

Finally, we need algorithms for list ranking and Euler tour construction. Given a linked list of n nodes, the list ranking problem is to determine for each node the distance to the head of the list. Given a tree T with n nodes, the Euler tour is an array of size 2n − 1 that corresponds to a depth-first traversal of the tree. Cáceres et al. [CDF+97] solve both problems deterministically using O((n/p)·log p) computation time and O(log p) communication rounds on the CGM model. This immediately translates to a BSP algorithm with a running time of O((n/p + g·(n/p) + L)·log p). However, these algorithms assume polynomial slackness (n ≥ p^{1+ε} for some constant ε > 0). If less slackness is available (p ≤ n < p^{1+ε}), we simply apply Wyllie's pointer-jumping algorithm [Wyl79], which can be implemented on the BSP in time O((n/p + g·(n/p) + L)·log n) = O((n/p + g·(n/p) + L)·log p) when p ≤ n < p^{1+ε}.
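For reference, here is a sequential simulation of Wyllie's pointer jumping (a standard technique; on the BSP, each of the O(log n) rounds costs O(1) supersteps of an (n/p)-relation):

def list_ranking_wyllie(succ):
    """Wyllie's pointer-jumping list ranking, simulated sequentially.

    succ[i] is the successor of node i, with succ[i] == i at the last
    node.  Returns rank[i], the distance from i to the last node; the
    distance to the head is obtained symmetrically.
    """
    n = len(succ)
    succ = list(succ)
    rank = [0 if succ[i] == i else 1 for i in range(n)]
    changed = True
    while changed:                    # O(log n) rounds
        changed = False
        new_rank, new_succ = rank[:], succ[:]
        for i in range(n):            # all nodes jump "in parallel"
            j = succ[i]
            if succ[j] != j:          # jump over the successor
                new_rank[i] = rank[i] + rank[j]
                new_succ[i] = succ[j]
                changed = True
        rank, succ = new_rank, new_succ
    return rank

print(list_ranking_wyllie([1, 2, 3, 4, 4]))   # [4, 3, 2, 1, 0]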

[The figure maps the techniques to the algorithms: the Borůvka step reduces nodes, while the merge step and random sampling reduce edges; these are combined in bsp-merge, bsp-as, bsp-as-select, bsp-verify, and the top-level algorithms bsp-dense and bsp-sparse.]

Figure 3: Algorithms map.

Lemma 6 Ranking a list with n nodes and computing the Euler tour of a tree with n nodes have BSP complexity O((n/p + g·(n/p) + L)·log p).

4 Outline of the Algorithms

In this section we present the main ideas and techniques employed in the MST algorithms. All algorithms are based on three main techniques: Borůvka steps, merge steps and random sampling. These techniques, and how they are used in the algorithms, are shown schematically in Figure 3.

Our first algorithm for dense graphs, called bsp-merge, is extremely simple. It is based on the idea of merging MSFs. Let G_i = (V, E_i), 1 ≤ i ≤ r, be r subgraphs of G = (V, E) with ∪_{i=1}^r E_i = E, and let F_i be the MSF of G_i, for 1 ≤ i ≤ r. Then the MST of ∪_{i=1}^r F_i is the MST of G, and we will call merging two or more MSFs a merge step. Note that this technique reduces the number of edges that can possibly belong to the MST (if |E_i| ≥ |V|).

The communication cost incurred in algorithm bsp-merge does not match the lower bound. In order to achieve this, we present a different algorithm called bsp-dense. bsp-dense is similar to bsp-merge but in addition executes a limited number of so-called Borůvka steps [Rei93]. Such a step consists of two substeps. First, the cheapest edge incident to each node is selected. It is well known that the selected edges form a pseudo-forest with a cycle of length two in every pseudo-tree (a pseudo-tree is a tree plus an edge that creates a cycle). In the second substep, every pseudo-tree is contracted into a single "supervertex". After eliminating any self-loops, one obtains a multigraph G′, and one can easily show that the selected edges plus the MST of G′ form the MST of the input graph G. A Borůvka step reduces the number

of nodes, because the number of supervertices is decreased by a factor of at least two in every Borůvka step. A sequential sketch of one such step is given at the end of this section.

One disadvantage of the merge technique is that it needs runtime Ω(n). Thus, it can only be applied if the number of edges is large (m ≥ np). For the other case we present algorithm bsp-sparse. One module of this algorithm is bsp-as. It is a BSP implementation of the algorithm due to Awerbuch and Shiloach [AS87], which runs in O(log n) time on an (n + m)-processor PRAM. Unfortunately, because the model used in the Awerbuch-Shiloach algorithm is the CRCW PRAM, a work-preserving emulation on the BSP model requires (n + m)/p ≥ p^ε slackness (i.e., the ratio of the problem size n + m to the number of processors p) for some constant ε > 0. We show in Section 7 that the amount of slackness required can be reduced by keeping the graph in a representation in which all edges incident to a node are stored consecutively, so that concurrent reads and writes can be implemented using segmented parallel prefix operations. Furthermore, a direct implementation of the PRAM algorithm on the BSP model would result in Ω(log n) supersteps. We show that O(log p · ⌈log p / log((m+n)/p)⌉) supersteps are sufficient. Thus, if the slackness is polynomial, this results in O(log p) supersteps.

The amount of local work performed by algorithm bsp-as, as well as its communication volume, is not linear in the problem size m + n, since the number of edges does not necessarily decrease by more than an additive term. To achieve this, we use an observation due to Johnson and Metaxas [JM95], which states that in order to increase the size of every supervertex by a factor of k, it is sufficient to consider only the k cheapest edges incident to each node. This observation is used in algorithm bsp-as-select (Section 8). Briefly, algorithm bsp-as-select works as algorithm bsp-as except that in round i only the k/2^i cheapest edges incident to each node are considered. This algorithm is used to decrease the number of nodes in the first step of algorithm bsp-sparse in linear time.

The third technique used in our algorithms is the sampling technique due to Karger, Klein and Tarjan [KKT95]. This technique also reduces the number of edges and can briefly be described as follows. Given a graph G with n nodes, let H be a subgraph of G obtained by including each edge with probability P, and let F be the MSF of H. The sampling lemma of [KKT95] shows that the number of F-light edges of G (the edges which cannot be discarded because they can belong to the MST of G) is at most n/P with high probability. To apply this sampling lemma we need a verification algorithm which determines the F-light edges. The verification algorithm presented in Section 9 is derived from the algorithm given in [KPRS97]. This is an EREW PRAM algorithm that needs linear work and O(log n) time. Our algorithm requires only O(log p) supersteps and needs less slackness than a PRAM simulation on the BSP model. Furthermore, it employs a simple load-balancing technique in order to reduce the number of times the edges have to be communicated. The sampling technique will be used in algorithm mst-sparse.
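The promised sketch of a single Borůvka step follows (our illustration; the contraction is done here with union-find rather than with the pseudo-forest representation used later by bsp-as, and distinct edge weights are assumed so that the cheapest incident edges are unique):

def boruvka_step(n, edges):
    """One Boruvka step on nodes 0..n-1 with edge list (weight, u, v).

    Selects the cheapest edge incident to every node (these are all MST
    edges), contracts the resulting pseudo-trees into supervertices,
    removes self-loops, and relabels.  The number of supervertices is
    at most n/2.  Returns (n2, edges2, selected).
    """
    cheapest = [None] * n
    for e in edges:
        w, u, v = e
        for x in (u, v):
            if cheapest[x] is None or w < cheapest[x][0]:
                cheapest[x] = e
    selected = {e for e in cheapest if e is not None}

    # contract: connected components of the selected edges
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for w, u, v in selected:
        parent[find(u)] = find(v)

    roots = sorted({find(x) for x in range(n)})
    label = {r: i for i, r in enumerate(roots)}    # new names 0..n2-1
    edges2 = [(w, label[find(u)], label[find(v)])
              for w, u, v in edges if find(u) != find(v)]
    return len(roots), edges2, selected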

5 Simple Algorithms for Dense Graphs

In this section we present two simple algorithms for computing the minimum spanning tree of relatively dense graphs. By "relatively dense" we mean graphs for which m ≥ p·n, but not necessarily m = Ω(n²). The first algorithm, called mst-merge, is extremely simple. It uses a parameter d which denotes the degree of the merge tree. The right choice for d will be discussed momentarily.

The algorithm does not require any special distribution of edges to processors. It is only assumed that the edges are distributed evenly among the processors.

Algorithm 1 mst-merge(d)

(1) Every processor locally computes the MSF of the graph induced by the edges stored in its local memory. Edges not in any MSF cannot belong to the MST and can therefore be discarded. Thus, after this step, every processor contains a subgraph of the original graph consisting of n nodes and at most n − 1 edges.

(2) The p MSFs, one on every processor, are merged to form the MST of G. This is accomplished as follows. W.l.o.g., assume that p = d^k for some integer k. First, each processor i sends its MSF to processor ⌊i/d⌋. Thereupon, every processor i, 0 ≤ i < p/d, merges the MSFs it received using a sequential MST algorithm. This process is repeated recursively by the processors 0, 1, …, p/d − 1. Then, after O(log_d p) supersteps, processor 0 has computed the MST of the graph G.

- End of Algorithm -

Theorem 2 The BSP cost of algorithm mst-merge is given by W + g·H + L·S, where

    W = T_seq(n, m/p) + O(log_d(p) · T_seq(n, d·n)),
    H = O(log_d(p) · d·n), and
    S = O(log_d p).

When the density m/n of the input graph is slightly larger than p, the parameter d can be chosen in such a way that the algorithm needs a constant number of supersteps and is communication-efficient at the same time.

Theorem 3 If m/n ≥ p^{1+ε} for some constant ε > 0, then algorithm mst-merge computes the MST in

    W = O(T_seq(n, m/p)),
    H = O(m/p^{1+ε/2}), and
    S = O(1)

time on the BSP model.

Proof: Step (1) obviously takes T_seq(n, m/p) time. In Step (2) we employ a d-ary tree, where d = p^{ε/2}. Thus, Step (2) consists of a constant number of rounds, and the computation time needed in each round is given by T_seq(n, p^{ε/2}·n) = O(T_seq(n, m/p^{1+ε/2})) = O(T_seq(n, m/p)), where we assumed that T_seq is a non-decreasing function in both its parameters, i.e., if n₁ ≤ n₂ and m₁ ≤ m₂, then T_seq(n₁, m₁) ≤ T_seq(n₂, m₂). Furthermore, the per-processor communication volume is given by H = O(p^{ε/2}·n) = O(m/p^{1+ε/2}). □

Because d ≥ 2, the BSP communication volume incurred in algorithm mst-merge is at least Ω(n·log p). This is larger than the lower bound of Ω(n) proved in Section 2. In algorithm mst-dense, described below, the communication volume is reduced by performing a limited number of Borůvka steps between Steps (1) and (2) of algorithm mst-merge, in order to reduce the number of nodes participating in the merge phase.

Algorithm 2 mst-dense

(1) Identical to Step (1) of algorithm mst-merge.

(2) Perform 2·log log p Borůvka steps, as follows. Let n_k, 1 ≤ k ≤ 2·log log p, denote the number of distinct supervertices at the start of iteration k. Initially, every vertex is a supervertex by itself, so that n₁ = n. We maintain the following invariant: prior to the start of iteration k, the supervertices are labeled from 0 to n_k − 1. Each edge is stored as a record containing the names of its endpoints and the labels of the supervertices to which these endpoints belong. The algorithm proceeds as follows. For k ← 1 to 2·log log p, perform Steps (2.1) through (2.5).

(2.1) Every processor locally computes its cheapest edge incident to each supervertex i, 0 ≤ i < n_k. If a processor does not contain any edge with endpoint i, it creates a dummy edge with weight ∞.

(2.2) The processors collectively determine the cheapest edge incident to each supervertex by performing a vector minimum. The selected edges are marked as MST edges.

(2.3) The edges selected during Step (2.2) are sent to processor 0. This processor sequentially computes the connected components of the graph defined by the selected edges and labels them from 0 through n_{k+1} − 1. Simultaneously, a vector V of length n_k is built, where entry i, 0 ≤ i < n_k, contains the new label for supervertex i.

(2.4) Processor 0 broadcasts V to all other processors.

(2.5) Every processor relabels the edges it contains according to the new labels of the supervertices given by V.

(3) Let SV denote the set of supervertices remaining after Step (2). Note that the edges of the original graph induce a multigraph on SV. This step proceeds as follows. First, each processor locally removes its multiple edges between elements of SV, always keeping the edge with the smallest weight. After that, every processor locally computes the MSF of the resulting graph G′ on SV.

(4) Identical to Step (2) of algorithm mst-merge. The MST of the original graph consists of the edges identified in Step (2) and the MST of G′.

- End of Algorithm -


Theorem 4 If m ≥ p·n, then the BSP cost of algorithm mst-dense is given by W + g·H + L·S, where

    W = T_seq(n, m/p) + T_seq(n/log² p, n) + log p · T_seq(n/log² p, n/log² p) + O(n·log log p),
    H = O(n), and
    S = O(log p).

Proof: First we determine the local computation time W. Step (1) obviously takes T_seq(n, m/p) time. In every iteration of Step (2), the number of supervertices is decreased by a factor of at least two. The number of edges in every processor, however, does not decrease by more than an additive term. Therefore, every iteration of Step (2) takes O(n) time, for a total of O(n·log log p). Because the number of supervertices remaining after Step (2) is at most n/log² p, Step (3) takes T_seq(n/log² p, n) + O(n) time, and Step (4) requires log p · T_seq(n/log² p, n/log² p) local work. The total computation time is obtained by simply adding the time needed for each step.

The communication cost is determined as follows. In Steps (1), (2.1), (2.5) and (3), no communication is performed. For Steps (2.2), (2.3) and (2.4), we observe that the vector size n/2^i during iteration i is always greater than n/log² p ≥ p/log² p ≥ p^ε for some constant ε > 0. Thus we may apply Lemma 3 and find that the communication cost of these steps during iteration i is at most O(g·n/2^i + L). The total communication cost incurred in Step (2) is therefore at most O(g·n + L·log log p). Finally, the communication cost of Step (4) is given by O(g·n/log p + L·log p). □

It is perhaps not immediately clear what the result of Theorem 4 means. In particular, it is unclear whether the algorithm attains a p-fold speedup. The following corollary shows that this is exactly the case. We use Valiant's notion of c-optimality to judge the efficiency of our algorithms. Let T_p = W + g·H + L·S be the runtime of a parallel algorithm A and let T₁ be the runtime of the corresponding sequential algorithm. A is said to be c-optimal if (1) W = (c + o(1)) · T₁/p, and (2) the ratio of the communication time taken by A to T₁/p is in o(1).

Corollary 4 If p·log log p = o(m/n), then algorithm mst-dense is 1-optimal compared to (1) the randomized algorithm due to Karger et al. [KKT95], and (2) Dijkstra's algorithm implemented using Fibonacci heaps [CLR90].

Proof: The randomized algorithm takes T_seq(n, m) = O(n + m) time. If this algorithm is applied in algorithm mst-dense, its local time consumption simplifies to W = T_seq(n, m/p) + O(n·log log p), which is T_seq(n, m/p) + o(T_seq(n, m/p)) when p·log log p = o(m/n). Furthermore, it can be seen that the communication volume H = O(n) grows at a rate that is asymptotically strictly smaller than the local computation time. If Dijkstra's algorithm is applied, then T_seq(n, m) = O(n·log n + m), and the analysis proceeds analogously. □

Algorithms mst-merge and mst-dense cover the dense case, i.e., graphs for which m ≥ n·p. Furthermore, recent experiments conducted by Götz [Got98] show that they perform very well in practice.


[The figure illustrates the five steps of mst-sparse: (1) G (n nodes, m edges) is contracted to G′ (n/k nodes, m edges); (2) a sample G′_s with about m/k edges is drawn; (3) MST(G′_s) is computed; (4) verification reduces G′ to G″ (n/k nodes, O(n) edges); (5) MST(G″) is computed.]

Figure 4: Overview of algorithm mst-sparse.

6 Algorithm for All But Very Sparse Graphs

The main drawback of algorithm mst-dense is that it can employ at most m/n processors efficiently. In this section, we give a top-level description of an algorithm for the case that more than m/n processors are available. It uses three subroutines: bsp-as, bsp-as-select, and bsp-verify, described in Sections 7, 8 and 9, respectively. The algorithm consists of five steps, which are illustrated in Figure 4. Let k = m/n be the density of the input graph.

Algorithm 3 mst-sparse(G, k)

(1) Reduce the number of nodes of G by a factor of k by performing log k rounds of algorithm bsp-as-select. This algorithm maintains a forest to represent the connected components which are created in the course of the algorithm but, in contrast to algorithm mst-merge, does not contract them in each round. Form the graph G′ = (V′, E′) by contracting every tree into a single supervertex. G′ consists of at most n′ ≤ n/k nodes and m′ ≤ m edges.

(2) Construct a sample graph G′_s of G′ by including every edge with probability k⁻¹. G′_s consists of at most O(m′/k) = O(n) edges, with high probability.

(3) Find the minimum spanning forest F′_s of the graph G′_s using algorithm bsp-as.

(4) Using algorithm bsp-verify, determine and delete all edges of G′ which cannot belong to the MST. By the sampling lemma of Karger, Klein and Tarjan [KKT95], the remaining graph G″ has at most O(k·n′) = O(n) edges, with high probability.

(5) Compute the MST T″ of the graph G″, again by using algorithm bsp-as. The MST of the original graph consists of the edges of T″ plus the edges found in Step (1).

- End of Algorithm -

The total BSP complexity of algorithm mst-sparse is given in the following theorem.

Theorem 5 Let n ≥ p·log p, m ≥ n·log p and ε > 0 be a constant. There exists a BSP algorithm for computing the MST with time complexity W + g·H + L·S, where

    W = O(T_seq(n/p, m/p) + (n/p)·log p),
    H = (12 + ε)·(m/p) + o(m/p) + O((n/p)·log p), and
    S = O(log p · ⌈log p / log((m+n)/p)⌉)

W = O(Tseq( np ; mp ) + np  log p); H = (12 + )  mp + o( mp ) + O( np  log p); and S = O(log p  d log((mlog+pn)=p) e) 18

with probability at least 1 − n^{−c} for any constant c > 0.

Proof: A step-by-step analysis of algorithm mst-sparse is given below.

(1) By Lemma 12, algorithm bsp-as-select requires W + g·H + L·S time, where W = O(m/p + log p·(n/p)), H = (6 + ε)·(m/p) + o(m/p) + O(log p·(n/p)), and S = O(log p · ⌈log p / log((m+n)/p)⌉).

(2) Choosing the sample can be done locally using work linear in the number of edges.

(3) The graph G′_s consists of at most O(n/k) nodes and O(m/k) edges. By Theorem 6, algorithm bsp-as takes time W = O(T_seq(n/(k·p), m/(k·p))), H = O(log p·(n/p)), and S = O(log p · ⌈log p / log((m+n)/p)⌉).

(4) The input for the verification algorithm consists of a spanning tree of size n/k and m edges to be verified. Its BSP complexity is given by (cf. Theorem 7): W = O(m/p + log p·(n/p)), H = 6·(m/p) + o(m/p) + O(log p·(n/p)), and S = O(log p).

(5) After the verification algorithm, at most k·n′ = n edges remain with high probability. Furthermore, G″ consists of at most n/k nodes. This step therefore takes time W = O(T_seq(n/(k·p), n/p)), H = O(log p·(n/p)), and S = O(log p · ⌈log p / log((m+n)/p)⌉) under the BSP cost model (cf. Theorem 6).

The total BSP cost of algorithm mst-sparse is obtained by simply adding the cost of every step. The probability that mst-sparse requires more than the claimed running time is determined as follows. In Step (1), the runtime bound of algorithm bsp-as-select fails with probability at most k·n^{−c}, for any constant c > 0. Since k ≤ p ≤ n, this probability is small enough. In Step (2), a sample graph is constructed by including every edge independently with probability k⁻¹. By applying Chernoff bounds, the sample graph consists of Θ(m/k) edges with probability at least 1 − exp(−cn). Given a graph G with n′ nodes, let H be a subgraph of G obtained by including each edge with probability P, and let F be the minimum spanning forest of H. The sampling lemma of Karger, Klein and Tarjan [KKT95] shows that the number of F-light edges of G is at most n′/P with probability at least 1 − exp(−c·n′). In algorithm mst-sparse we have n′ = n/k and P = k⁻¹, where k = m/n, which implies that at most n edges remain with probability at least 1 − exp(−c·n/k). This probability is high enough when n/k ≥ log n. Since k = m/n, this is the case if m ≤ n²/log n. The remaining case (m > n²/log n) can be dealt with by choosing k = m/(n·log n). It is easy to verify that this choice affects the running time of the algorithm by at most a constant factor. □

A sequential sketch of the sampling-and-verification filter of Steps (2)-(4) follows.
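In this sketch (our illustration), the F-light test is done by a naive path-maximum search in the sampled forest, standing in for the linear-work bsp-verify of Section 9; msf_kruskal is the routine sketched in Section 1.1:

import random

def sample_and_filter(n, edges, P, rng=random):
    """Discard the edges that the KKT sampling lemma allows us to drop.

    Every edge (weight, u, v) enters the sample H with probability P;
    F is the MSF of H.  An edge is F-light (kept) iff its endpoints lie
    in different components of F or its weight does not exceed the
    heaviest edge on the F-path between them.  With high probability
    only O(n/P) edges survive.
    """
    F = msf_kruskal(n, [e for e in edges if rng.random() < P])

    adj = [[] for _ in range(n)]
    for w, u, v in F:
        adj[u].append((v, w))
        adj[v].append((u, w))

    def path_max(u, v):
        # DFS in the forest, carrying the maximum weight on the path
        stack = [(u, -1, 0)]
        while stack:
            x, par, mx = stack.pop()
            if x == v:
                return mx
            for y, w in adj[x]:
                if y != par:
                    stack.append((y, x, max(mx, w)))
        return None          # u and v are in different components of F

    kept = []
    for w, u, v in edges:
        pm = path_max(u, v)
        if pm is None or w <= pm:      # F-light: may still be in the MST
            kept.append((w, u, v))
    return kept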

7 BSP Implementation of the Awerbuch-Shiloach Algorithm

In this section, a BSP implementation of the CRCW PRAM algorithm due to Awerbuch and Shiloach [AS87] is described. This algorithm will be denoted by bsp-as.

Briefly, the Awerbuch-Shiloach algorithm works as follows. It employs n + m processors, and each processor is assigned either to a vertex ("vertex-processor") or to an edge ("edge-processor"). Concurrent write conflicts are resolved as follows. If several edge-processors write

to the same shared memory location simultaneously, the processor whose edge has the smallest weight succeeds. A rooted forest is maintained, which represents the connected components or "supervertices" of the original graph. Thereupon, the algorithm iteratively executes three steps. In the first step, all edge-processors assigned to edges emanating from a star (a tree of depth 1) try to hook the star onto another tree. Since the processor representing the cheapest edge succeeds, this step can be viewed as a Borůvka step. As a result of this step, cycles of length 2 may be formed. In Step (2), these cycles are eliminated by deleting the edge pointing from the smaller-numbered vertex to the larger-numbered vertex. Finally, in Step (3), the height of each tree is reduced by an operation called shortcutting, in which every vertex adopts its grandparent as its new parent. By an induction proof, it can be shown that the total height of all trees is reduced by a factor of 3/2 in every iteration, and therefore the algorithm terminates after O(log n) iterations. Since every iteration takes constant time, the total time consumption is also O(log n). A sketch of the shortcutting and star-detection primitives is given below.

The crux of our algorithm is to show how concurrent reads and writes can be implemented efficiently on the BSP model. This is done by using a data structure in which all edges incident to a node are stored consecutively. We show that this technique, which is similar to a technique developed by Blelloch [Ble89] for the SCAN model of parallel computation, can also be applied on the BSP model. One novelty of our technique is that two edge arrays are needed: one to represent the graph G and the other to represent the forest which is created in the course of the algorithm. Furthermore, to maintain each data structure, the other one is needed. This technique is applicable to other graph problems as well, for example, to the connected components problem.

This section is organized as follows. First, the data structure is described. After that, it is shown how this data structure can be used to resolve concurrent reads and writes, and how the data structure is maintained if a hooking or a shortcutting step is applied. Finally, we describe algorithm bsp-as and analyze its running time.
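As promised, here is a sequential sketch of shortcutting and of the standard star test (our code; parent[v] plays the role of the field V[v].P introduced in Section 7.1 below):

def shortcut(parent):
    """One shortcutting step: every node adopts its grandparent."""
    return [parent[parent[v]] for v in range(len(parent))]

def star_test(parent):
    """star[v] is True iff v belongs to a tree of depth at most 1."""
    n = len(parent)
    star = [True] * n
    for v in range(n):
        gp = parent[parent[v]]
        if gp != parent[v]:       # v has a proper grandparent:
            star[v] = False       # neither v nor its grandparent
            star[gp] = False      # can be in a star
    for v in range(n):
        star[v] = star[v] and star[parent[v]]
    return star

# A chain 3 -> 2 -> 1 -> 0 (not a star) and a star 5 -> 4:
parent = [0, 0, 1, 2, 4, 4]
print(star_test(parent))          # [False, False, False, False, True, True]
print(shortcut(parent))           # [0, 0, 0, 1, 4, 4]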

7.1 Data Structure

We use the following data structures in order to represent the graph G. Each undirected edge {v, w} is represented by two directed edges (v, w) and (w, v), which are stored in an array E[1..2m]. Every entry e = E[i] is a record consisting of the following fields: e.Src is the source node of the edge, e.Dest is its destination node, e.Weight is its weight, and e.Forest is a boolean value that indicates whether the edge belongs to the minimum spanning forest. In order to simulate concurrent accesses efficiently, we also maintain a field e.Opp for every edge (v, w), which is a pointer (array index) to the opposite edge (w, v). Initially, all edges incident to a node are stored consecutively in the array (although not necessarily sorted). Hence, the array E can be viewed as an adjacency list representation of the graph G. Furthermore, the array is distributed evenly among the p processors, so that every processor holds a contiguous block of 2m/p array elements. This distribution is dynamic, meaning that a processor can store different edges at different times. An example is given in Figure 5.

In the course of algorithm bsp-as, connected components are being formed, which are represented by a collection of rooted trees. The edges of these trees are stored in a forest edge array F[1..2n]. Every entry f = F[i] also consists of the fields f.Src, f.Dest and f.Opp. In addition, a boolean field f.Parent is used to indicate whether the edge points from a child to its parent or vice versa. The distribution of these edges is also dynamic and ordered by f.Src.

[Figure: an undirected graph on the nodes 1-6 with edge weights {1,2} = 3, {1,3} = 1, {2,3} = 4, {2,4} = 6, {2,6} = 2, {3,4} = 9, {3,5} = 8, {4,6} = 5, {5,6} = 7.]

Index  Src  Dest  Weight  Opp
  1     1    2      3      3
  2     1    3      1      7
  3     2    1      3      1
  4     2    3      4      8
  5     2    6      2     17
  6     2    4      6     13
  7     3    1      1      2
  8     3    2      4      4
  9     3    4      9     12
 10     3    5      8     14
 11     4    6      5     18
 12     4    3      9      9
 13     4    2      6      6
 14     5    3      8     10
 15     5    6      7     16
 16     6    5      7     15
 17     6    2      2      5
 18     6    4      5     11

Figure 5: A graph and its edge representation.

Index  Src  Dest  Parent  Opp
  1     1    1      x       2
  2     1    1              1
  3     2    2      x       4
  4     2    2              3
  5     3    3      x       6
  6     3    3              5
  7     4    4      x       8
  8     4    4              7
  9     5    5      x      10
 10     5    5              9
 11     6    6      x      12
 12     6    6             11

Figure 6: Example of the forest array F.

Initially, for each node v, there exist two edges (v, v) in F; Figure 6 shows how F is initialized. A third array V[1..n] is used to store node information. Every entry v = V[i] consists of the fields v.P and v.Star. The parent of node u in the collection of rooted trees is given by V[u].P, and V[u].Star is an extra field which is used to indicate whether a node belongs to a star or not. The array V[1..n] is distributed evenly among the processors, such that each processor holds a contiguous block of n/p array elements. This distribution is static throughout the algorithm, so that the index of the processor storing the information about node u is always given by ⌊u/(n/p)⌋.
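The sketch below shows, under the same conventions as the previous example, how F and V could be initialized and how the static owner of a V entry is computed; the identifiers are ours, and p is assumed to divide n:

    from dataclasses import dataclass

    @dataclass
    class ForestEdge:
        src: int
        dest: int
        parent: bool   # f.Parent: does the edge point from a child to its parent?
        opp: int       # f.Opp: index of the twin edge

    def init_forest_and_nodes(n):
        """Two self-loop edges (v, v) per node, twins of each other (cf. Figure 6)."""
        F, V = [], []
        for v in range(n):
            i = len(F)
            F.append(ForestEdge(v, v, True, i + 1))
            F.append(ForestEdge(v, v, False, i))
            V.append({"P": v, "Star": True})   # V[v].P and V[v].Star
        return F, V

    def owner(u, n, p):
        """Processor holding V[u] under the static distribution (assumes p | n)."""
        return u // (n // p)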

7.2 Simulating Concurrent Accesses and Maintaining the Data Structures

In this section we show how the concurrent accesses occurring in the Awerbuch-Shiloach algorithm can be simulated efficiently on the BSP model using the data structures described in the previous section. Simulating concurrent accesses is usually done by sorting (see, e.g., [Val90]), but sorting requires polynomial slackness for a work-optimal simulation. Furthermore, the simulation scheme presented here is deterministic, since there is no need to hash the memory address space. In the Awerbuch-Shiloach algorithm, all concurrent accesses take one of two forms: (a) for each edge (u, v), read or write information about node u, or (b) for each edge (u, v), read or write information about node v.

One typical example is when every edge (u, v) wants to know the parent of its source node u in the forest, as needed in the shortcutting step.

Lemma 7 Every concurrent access of type (a) or type (b) can be simulated in O(g·(m + n)/p + L + T_prefix) time on the BSP model.

Proof: Divide the edge array into n segments, each consisting of the edges with the same source node. The first edge in each segment is called the segment leader. Then a concurrent read of type (a) can be realized by the following two steps: (1) each segment leader fetches the requested information from its source node; (2) using a segmented broadcast, this information is distributed across the segment. A concurrent write can be implemented by performing a segmented parallel prefix before Step (1) in order to combine the information to write; in this case, Step (2) can be omitted. Furthermore, an access of type (b) can be realized by sending the request from edge e to its opposite edge, performing an access of type (a), and sending the answer back to e afterwards. The time taken by this procedure is easily seen to be O(g·(m + n)/p + L + T_prefix). □

In order to apply Lemma 7, it needs to be shown how the arrays F and E are restored in every round of the Awerbuch-Shiloach algorithm, so that prior to each round all edges incident to the same node are stored consecutively. The data structure is modified in the hooking step, in which the minimum edge emanating from a star is used to connect it to another supervertex, and in the shortcutting step, in which each node adopts its grandparent as its new parent. Both operations modify F as well as E. We first show how the ordering of the edges in E is maintained when every edge (u, v) is replaced by the edge (u.P, v.P). Below, an edge f in the forest array F is called a child edge if it points from a node to its child, and a parent edge otherwise.
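As an illustration of the proof of Lemma 7, the sketch below renders the two supersteps of a type-(a) access sequentially: a segment is a maximal run of edges with the same e.Src, only the segment leader touches V, and the value is then spread by a segmented broadcast (for a read) or combined by a segmented prefix (here: minimum, for a write). The function names and the field name "Min" are ours:

    def concurrent_read_type_a(E, V, field="P"):
        answers = [None] * len(E)
        for i, e in enumerate(E):
            if i == 0 or E[i - 1].src != e.src:
                answers[i] = V[e.src][field]   # (1) the segment leader fetches
            else:
                answers[i] = answers[i - 1]    # (2) segmented broadcast
        return answers

    def concurrent_write_min_type_a(E, V, values, field="Min"):
        i = 0
        while i < len(E):
            j, combined = i, values[i]         # segmented prefix: combine one
            while j + 1 < len(E) and E[j + 1].src == E[i].src:
                j += 1
                combined = min(combined, values[j])
            V[E[i].src][field] = combined      # only the leader writes back
            i = j + 1

In the BSP algorithm, both steps are supersteps over distributed blocks of E, so each loop above corresponds to O((m + n)/p) local work plus one h-relation.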

Algorithm 4 Maintaining the edge array E.

(1) In array E, count the number degree(v) of edges incident to each node v. After that, the segment leader associated with node v writes degree(v) into position v of array V.
(2) Every parent edge (v, w) in the forest edge array F reads degree(v) from position v in V, and sends this number to the position of its twin pointer. Now every child edge (v, w) is labeled with degree(w). The label of each parent edge is set to 0.
(3) Perform a prefix sum operation on the values degree(v) in F.
(4) By reversing the actions of Steps (2) and (1), every segment leader in E is correctly labeled with its new position. The new indices of the remaining edges are computed by a segmented parallel prefix with the increment operator (+1).
(5) All edges in E are permuted to their new positions.
- End of Algorithm

An example is shown in Figures 7 and 8. Updating the F array after a shortcutting step can be done almost identically, but in this case the E array is not needed. Finally, it needs to be shown how the F array is maintained after a hooking step; this is done by Algorithm 5 below.

[Figure 7: Example graph on the nodes 0-4; dotted lines are non-tree edges and solid lines are tree edges.]

[Figure 8: Example of maintaining the edge array E for the graph depicted in Figure 7, showing the arrays E, V and F together with their Degree, prefix and new-index values.]
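In place of the example tables, the following sequential sketch shows what Algorithm 4 computes once every edge (u, v) has been relabeled to (u.P, v.P); the message traffic of Steps (2) and (4) is collapsed into direct indexing, and the function name is ours:

    def restore_edge_order(E, n):
        degree = [0] * n
        for e in E:
            degree[e.src] += 1              # Step (1): per-node degree counts
        start, total = [0] * n, 0
        for v in range(n):                  # Step (3): prefix sums of the degrees
            start[v] = total
            total += degree[v]
        offset = [0] * n                    # Step (4): leader position plus a
        new_pos = [0] * len(E)              # segmented (+1) prefix inside segments
        for i, e in enumerate(E):
            new_pos[i] = start[e.src] + offset[e.src]
            offset[e.src] += 1
        out = [None] * len(E)
        for i in range(len(E)):             # Step (5): permute to the new positions
            out[new_pos[i]] = E[i]
        # In the real algorithm, each edge would also send new_pos[i] along its
        # twin pointer so that the e.Opp fields can be updated.
        return out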

Algorithm 5 Maintaining the forest edge array F after a hooking step.

(1) In array F, count the number degree(v) of edges incident to each node v. After that, the segment leader associated with node v writes degree(v) into position v of array V.
(2) Every minimum edge v.Min in E sends a message to its twin pointer and labels the twin edge with 1. Perform a segmented prefix sum on E with the labeled edges. Increase degree(v) (in V) by this sum. Decrease degree(v) by 1 if v has a minimum edge. (Note that not every node has a minimum edge.)
(3) Perform a prefix sum operation on the values degree(v) in V. These are the new index positions of the segments in F.
(4) In E, every segment leader gets the new index position and the old degree(v) of Step (1). These two values are broadcast among all labeled edges in the same segment. Together with the segmented prefix value of Step (2), these three integers are summed up and sent via the twin pointer to array V.
(5) Every twin pointer of a parent pointer in F whose source and destination are the same gets the values stored in Step (4) in V, which give the new array positions. The new positions of the other edges are determined as follows: every segment leader gets the new index position of its segment, and the position of every edge is computed by a segmented parallel prefix with the increment operator (+1).
(6) All edges in F are permuted to their new positions.
- End of Algorithm

The runtime of both Algorithm 4 and Algorithm 5 can be seen to be O(g·(m + n)/p + L + T_prefix). This shows:

Lemma 8 Maintaining the edge arrays E and F takes O(g·(m + n)/p + L + T_prefix) time under the BSP cost model.

7.3 Algorithm and Analysis

Now we are ready to describe algorithm bsp-as.

Algorithm 6 bsp-as

(1) Simulate the Awerbuch-Shiloach algorithm for 2·(log p + log log p) rounds, as follows.
(1.1) Star recognition. For each node v, determine whether it belongs to a star. This is done in three substeps: (1) for each node v, set V[v].Star := true; (2) if V[v].P ≠ V[V[v].P].P, then set V[v].Star := false and V[V[V[v].P].P].Star := false; (3) if V[V[v].P].Star = false, set V[v].Star := false. Concurrent reads and writes in this step are avoided by using the F array.
(1.2) Hooking and tie breaking. For each active node u which is a star, determine its minimum edge E[i_u] and set E[i_u].Forest := true. Tie breaking can be done by sending a message from each minimum edge to its twin pointer. Afterwards, each node u sets V[V[E[i_u].Src].P].P := V[E[i_u].Dest].P. Concurrent accesses in this step are avoided by the technique described in Algorithm 5.
(1.3) Shortcutting. For each node u which does not belong to a star, set V[u].P := V[V[u].P].P. This can be done by first applying shortcutting in the array F, after which every parent edge (v, w) in F updates the parent field of V[v].
(1.4) Maintain the edge array E. For each edge E[i], get the new supervertex names of its endpoints by setting E[i].Src := V[E[i].Src].P and E[i].Dest := V[E[i].Dest].P. This is done using the technique described in Lemma 7. Erase internal edges (edges whose endpoints are the same) and restore the edge ordering by E[i].Src, as explained in Algorithm 4.
(2) Contract every connected component as defined by the forest edges into a single vertex by first applying the Euler tour algorithm of [CDF+97], after which every segment leader fetches the new node number.

(3) Every processor locally computes the MSF of the graph defined by the remaining supervertices and the edges stored in its local memory.
(4) Merge the MSFs as in Step (4) of algorithm mst-dense.
- End of Algorithm

The time complexity of the Awerbuch-Shiloach algorithm is derived by proving that after s rounds the total height of all trees is at most (2/3)^s · n; for our purposes, however, we need that the size of each supervertex increases by a constant factor > 1 in every round.

Lemma 9 After s rounds of bsp-as, the size of each remaining supervertex is at least 2^(s/2).

Proof: By induction on s. The base case s = 0 trivially holds. Now suppose the claim holds for s − 1 and consider a supervertex C during iteration s. If C is a star, then it gets hooked and the resulting supervertex has size at least 2 · 2^((s−1)/2) > 2^(s/2). Otherwise, let s′ be the previous iteration in which C took part in a hooking step. Note that s′ is well-defined because every supervertex gets hooked in the first iteration. Because C did not become a star in s − s′ iterations, it must have contained a chain of length at least 2^(s−s′) (actually, 2^(s−s′) + 1, since it takes ⌈log(h − 1)⌉ shortcutting steps to reduce a tree of h levels into a star). By induction, every supervertex on that chain had size at least 2^((s′−1)/2). It follows that the size of C is at least 2^(s−s′) · 2^((s′−1)/2) = 2^(s−(s′+1)/2) ≥ 2^(s/2), since s ≥ s′ + 1. □

Theorem 6 Let p ≤ n + m. Algorithm bsp-as solves the minimum spanning tree problem in time W + g·H + L·S, where

W = O(Tseq(n/p, m/p) + ((m + n)/p)·log p);
H = O(((m + n)/p)·log p); and
S = O(log p · ⌈log p / log((m + n)/p)⌉)

under the BSP cost model.

Proof: Each round of Step (1) takes O(g·(m + n)/p + L + T_prefix) time by Lemma 7 and Lemma 8. The total BSP cost of Step (1) is therefore O((g·(m + n)/p + L + T_prefix)·log p). Step (2) takes O(g·(m/p + (n/p)·log p) + L·log p) time by Lemma 6 and Lemma 7. Lemma 9 shows that after 2·(log p + log log p) rounds, each supervertex has size at least p·log p, which implies that at most n/(p·log p) supervertices remain. Step (3) therefore requires Tseq(n/(p·log p), m/p) local computation time. It also implies that the communication cost incurred in Step (4) is at most O(g·n/p + L·log p). Furthermore, the total local computation time Tseq(n/(p·log p), m/p) + log p·Tseq(n/(p·log p), n/(p·log p)) performed in Step (4) does not exceed O(Tseq(n/p, m/p)), because the minimum spanning tree problem needs at least linear time. In case n ≤ p·log p, only (3/2)·log n = O(log p) rounds of Step (1) are executed, and Steps (2), (3) and (4) are omitted; after these rounds only one supervertex remains, which implies that the entire MST has been found. □
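As a compact summary of Step (1), the sketch below gives a simplified sequential reference for one round of the simulated Awerbuch-Shiloach algorithm, operating directly on parent pointers and eliding the BSP bookkeeping of Lemma 7 and Algorithms 4 and 5; the 2-cycle removal follows the description of the original algorithm given at the beginning of this section, and all names are ours:

    def as_round(P, edges, forest_flag):
        """P[v] = parent of v; edges = list of (u, v, weight); forest_flag marks MSF edges."""
        n = len(P)
        star = [True] * n
        for v in range(n):                     # (1.1) star recognition
            if P[v] != P[P[v]]:
                star[v] = False
                star[P[P[v]]] = False
        for v in range(n):
            star[v] = star[v] and star[P[v]]
        best = {}                              # (1.2) cheapest outgoing edge per star root
        for i, (u, v, wt) in enumerate(edges):
            ru, rv = P[u], P[v]
            if star[ru] and ru != rv and (ru not in best or wt < best[ru][0]):
                best[ru] = (wt, i, rv)
        for ru, (wt, i, rv) in best.items():   # hook each star onto a neighbouring tree
            forest_flag[i] = True
            P[ru] = rv
        for ru in best:                        # break 2-cycles: the smaller-numbered
            if P[P[ru]] == ru and ru < P[ru]:  # root withdraws its hook
                P[ru] = ru
        old = list(P)                          # (1.3) shortcutting for non-star nodes
        for v in range(n):
            if not star[v]:
                P[v] = old[old[v]]
        return P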

8 Towards a Work-Optimal Algorithm

In the previous section an efficient BSP implementation of the Awerbuch-Shiloach algorithm was described. The purpose of this section is to modify algorithm bsp-as in order to reduce the communication and computation requirements of the first phase of algorithm mst-sparse. We achieve this goal by employing the following observation. Since the purpose of the first phase is to reduce the number of nodes by a factor of k, it is sufficient to consider only the k cheapest edges incident to each node in the first round of algorithm bsp-as. Similarly, in the second round, each supervertex (of size at least 2) needs at most k/2 other supervertices (of size at least 2) to hook with in order to attain size k. Thus, in general, it is sufficient to consider only the k/2^i cheapest edges incident to each supervertex in round i of algorithm bsp-as, so that after log k rounds every supervertex has size at least k. Since the number of edges decreases geometrically, the per-processor communication volume of the sketched algorithm is given by O(g·(m/p + k·n/p + log k·n/p)). By choosing k = m/n, a runtime of O(g·m/p) is achieved. The idea of growing supervertices only by a bounded factor, and therefore not having to consider all edges, was originally used by Johnson and Metaxas [JM95] in their EREW PRAM algorithm. In our algorithm, however, edges are deleted much more aggressively in order to keep the communication requirements low. Because of this, it cannot be assured that after log k rounds the size of every supervertex is at least k, but only k^(1/3). This does not cause any problems, since the algorithm can be applied three times to achieve a supervertex size of at least k.

There are two other problems that need to be resolved. First, the edges that remain after each round should be non-internal (i.e., their endpoints should not belong to the same supervertex). However, nodes belonging to a tree are unable to identify which of their incident edges are internal, because they do not know a unique supervertex number (this number is not available until the supervertex becomes a star). Therefore, after deleting all but the best k/2^i edges, it could happen that a node has only internal incident edges left, while the cheapest edge incident to the entire supervertex has been deleted. This supervertex therefore has to stop hooking (become an inactive supervertex), or otherwise an edge may be misidentified as belonging to the minimum spanning tree. It may seem that a supervertex that becomes inactive soon will not have the required size, but Lemma 11 will show that this is not the case. Inactive supervertices no longer take part in hooking steps, but they still perform shortcutting until they become stars.

A second problem is caused by multiple edges pointing from a star to the same other supervertex. In algorithm bsp-as it is impossible to identify these edges, but this causes no problems because the cheapest edge is always selected and edges are never erased. However, in the modified algorithm problems would occur if all but the k/2^i cheapest edges were erased and all remaining edges happened to point to the same supervertex. To circumvent this problem, we do not erase edges incident to supervertices but edges incident to original nodes. This means that in round i, the best k/2^i edges incident to each original node remain. The number of edges still decreases geometrically, and we are sure that we have not prevented a supervertex from growing by erasing edges: if a node erased edges and after round i all its remaining edges point to the same supervertex, we know that this supervertex consists of at least k/2^i nodes, and Lemma 11 shows that this is sufficient.
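The pruning rule used by the algorithm described next (bsp-as-select) can be sketched as follows; the function keeps, in round i, the k/2^i cheapest edges incident to each original node together with their opposite edges (ties are broken arbitrarily here, whereas the algorithm keeps all edges of weight at most that of the selected edge; names are ours):

    import heapq

    def prune_round(alive, k, i, n):
        """alive: list of (weight, src, dest) directed edges; src is an ORIGINAL node."""
        budget = max(k // (2 ** i), 1)      # k/2^i surviving edges per node
        per_node = [[] for _ in range(n)]
        for e in alive:
            per_node[e[1]].append(e)
        keep = set()
        for v in range(n):
            for (wt, s, d) in heapq.nsmallest(budget, per_node[v]):
                keep.add((s, d, wt))        # the edge itself stays alive ...
                keep.add((d, s, wt))        # ... and so does its opposite edge
        return [e for e in alive if (e[1], e[2], e[0]) in keep]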


Algorithm 7 bsp-as-select(k)

(1) For each node v, select its k-th cheapest edge e_v. All edges incident to node v with weight less than or equal to that of e_v (as well as their opposite edges) stay "alive". All other edges are (temporarily) deleted. The remaining edges are distributed evenly among the processors.
(2) Repeat 3 times:
(2.1) For i = 1, ..., log k do:
(2.1.1) For every original node v, select the (k/2^i)-th cheapest edge e_v.
(2.1.2) All edges incident to node v with weight less than or equal to that of e_v (as well as their opposite edges) stay alive. All other edges are (temporarily) deleted.
(2.1.3) Determine which stars become inactive. If a node v erased one of its edges in Step (2) of this or a previous round, and v has no edges left anymore, then set V[v.P].Active := false. Afterwards, each star node v sets V[v].Active := V[v.P].Active.
(2.1.4) Finish this round by executing one round of Step (1) of algorithm bsp-as. Only active stars take part in hooking steps.
(2.2) Using list ranking and the Euler tour technique, each connected component is contracted into a single node.
- End of Algorithm

We now determine the runtime of algorithm bsp-as-select.

Lemma 10 Round i of loop (2.1) of algorithm bsp-as-select takes O((k_i·n + n)/p + g·(k_i·n + n)/p + L·⌈log p / log((m + n)/p)⌉) time under the BSP model, where k_i = k/2^i denotes the number of live edges per node.

Proof: Using the techniques described in Section 7.2 (concurrent accesses), Steps (2.1.2) and (2.1.3) can be implemented in time O(g·(k_i·n + n)/p + L + T_prefix). As shown in the previous section, Step (2.1.4) has the same complexity. Furthermore, since in Step (2.1.1) each node has degree at most k ≤ m/p, this step can be implemented using any sequential linear-time selection algorithm. □

Lemma 11 After log k rounds of loop (2.1), each supervertex consists of at least k^(1/3) nodes.

Proof: Consider an arbitrary supervertex C and assume that it is active for j ≤ log k rounds. In the first j rounds the behaviour of C is the same as in algorithm bsp-as, and thus Lemma 9 shows that the size of the supervertex is at least 2^(j/2). Furthermore, because C becomes inactive in round j, it contains at least one node which has deleted edges by selection (and therefore had a large number of incident edges) and which has no incident edges anymore after the internal edges are deleted in Step (4) of algorithm bsp-as. Let v be this node. Since v had k/2^j incident edges left prior to round j, the supervertex that contains v consists of at least k/2^j nodes, because these edges originally pointed to different nodes. Thus, each supervertex consists of at least min over 1 ≤ j ≤ log k of max{2^(j/2), k/2^j} nodes. The minimum of this function is attained at j = (2/3)·log k, where both terms equal k^(1/3), which proves the lemma. □

Since the number of edges decreases geometrically in each round, the following result is obtained:

Lemma 12 Let k·p ≤ n and k = α·m/n ≥ log n for some constant α > 0. For any constant ε > 0, algorithm bsp-as-select determines minimum spanning tree supervertices of size at least k in time W + g·H + L·S, where

W = O(m/p + (n/p)·log p);
H = (6 + ε)·m/p + o(m/p) + O((n/p)·log p); and
S = O(log p · ⌈log p / log((m + n)/p)⌉).

This runtime bound holds with probability at least 1 − n^(−c) for any constant c > 0.

Proof: By Lemma 5, selection takes g·2·m/p + O(m/p + T_prefix) time. However, in this case multiple independent selections have to be performed on groups of consecutively numbered processors. Since k ≥ log n is assumed, the runtime bound still holds with high probability. It also takes g·2·m/p + O(m/p + T_prefix) time to notify every twin edge whether it has to stay alive, as well as to redistribute the surviving edges evenly among the processors. The total BSP cost of Step (1) is therefore g·6·m/p + O(m/p + T_prefix). In order to reduce the number of times every edge has to be communicated, the following technique is applied. Let c be a constant such that the per-processor communication volume in Step (2) of algorithm bsp-as-select(k) is at most c·k·n/p + O((n/p)·log p). We set k₀ = ε·m/(c·n) and call bsp-as-select(k₀). Then, the per-processor communication volume incurred in Step (2) is ε·m/p + O((n/p)·log p). □

9 Parallel Verification

We are given a forest F with n′ ≤ n/k ≤ n/log p nodes and a set of m′ ≤ 2·m non-forest edges. For every non-forest edge e = (v, w), it needs to be verified whether w(e) is greater than the weight of the heaviest edge on the path between v and w in F (taken to be ∞ if v and w are not connected). Such edges are called F-heavy and cannot belong to the MST. For simplicity, we assume that we are given a tree T instead of a forest F; T can easily be obtained from F by creating a new root node, connecting the roots of the forest to this new node, and giving the new edges weight ∞. The algorithm described in this section, called bsp-verify, is derived from the EREW PRAM algorithm presented in [KPRS97], which in turn is based on the CREW PRAM algorithm given in [DT94]. Briefly, the EREW PRAM algorithm works as follows. First, it converts the tree T into a so-called Borůvka-tree B by repeatedly applying parallel shunt operations (tree contraction [JaJ92]). This Borůvka-tree has three important properties: (1) the weight of the heaviest edge on the path between any pair of nodes v and w in T is equal to the weight of the heaviest edge on the path between v and w in B; (2) all non-tree edges are incident only on leaves of B; and (3) B has constant degree and logarithmic depth. After that, the algorithm finds the least common ancestors (LCAs) of the endpoints of every non-tree edge. Finally, it verifies that the non-tree edges do not violate the minimality condition.
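The following sequential routine specifies what has to be decided for each non-tree edge; it is a naive O(n)-per-query reference against which bsp-verify can be understood, not part of the algorithm itself (names are ours):

    import math

    def heaviest_on_path(parent, wt_to_parent, v, w):
        """Max edge weight on the tree path v..w; math.inf if v, w are disconnected."""
        anc_v, hv, x = {v: -math.inf}, -math.inf, v
        while parent[x] != x:                  # record max weight from v to each ancestor
            hv = max(hv, wt_to_parent[x])
            x = parent[x]
            anc_v[x] = hv
        hw, x = -math.inf, w
        while x not in anc_v and parent[x] != x:
            hw = max(hw, wt_to_parent[x])      # climb from w until the LCA is reached
            x = parent[x]
        if x not in anc_v:
            return math.inf                    # different trees of the forest
        return max(hw, anc_v[x])               # combine both halves at the LCA

    def is_f_heavy(parent, wt_to_parent, v, w, weight):
        return weight > heaviest_on_path(parent, wt_to_parent, v, w)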

The main differences between bsp-verify and the PRAM algorithm of [KPRS97] can be summarized as follows. First, bsp-verify requires only O(log p) supersteps, and there is no need to hash the memory address space. Second, the PRAM algorithm converts the tree B of size Θ(n) into a tree B′ of size Θ(m) in order to avoid concurrent reads while determining the LCAs; since we are using a non-optimal Euler tour construction algorithm (cf. Lemma 6), applying this technique here would lead to an Ω(g·(m/p)·log p) time algorithm, which is too slow. Finally, algorithm bsp-verify employs a simple load-balancing technique in order to minimize the number of times the edges have to be communicated.

9.1 Formation of the Borůvka-Tree

Since tree contraction can only be applied to a binary tree, T is first converted into a binary tree T′. On the BSP model, this can be done as follows. First, the number of children of every internal node is computed. This takes O(n′/p + T_prefix) time, since T is represented by adjacency lists. After that, a dummy child node is added to all nodes with only one child, and a binary tree of depth d − 1 is created for every node with d ≥ 3 children. The parent node f will be the root of this tree, its left child will be the first child of f in the tree T (as encountered in the adjacency list for f), and a new node u becomes the right child of f. Similarly, the left child of u is the second child of f in T, and its right child will be a new node u + 1, and so on. The numbers of the new nodes can be determined by another parallel prefix operation. New edges are given weight −∞, so that they are not heavier than any non-tree edges. The total BSP cost of the binarization process, including distributing T′ evenly among the processors, can be seen to be O(g·n′/p + T_prefix) = O(g·n′/p + L·log p).

After the tree has been binarized, the Borůvka-tree is formed. Let n_l = O(n′) denote the number of leaves of T′. First, the leaves are numbered from 0 through n_l − 1 in left-to-right order. This can be done by constructing an Euler tour and performing list ranking, both of which can be done in time O((n′/p)·log p + g·(n′/p)·log p + L·log p) by Lemma 6. Now, we apply tree contraction as in [KPRS97]. Tree contraction shrinks a binary tree by repeatedly applying the shunt operation. The shunt operation, applied to a leaf vertex v, removes v and its parent from the tree, and connects the sibling v′ of v to the grandparent of v. In order to prevent the tree from becoming disconnected, the shunt operation is first applied in parallel to all odd-numbered leaves that are left children, and then in parallel to all other odd-numbered leaves. We apply only log p parallel shunt operations (or log n′ if n′ < p). Thus, when n ≥ p, since every complete iteration gives rise to at most two levels in B [KPRS97], only the lower 2·log p levels of B are actually constructed. The tree that remains after log p shunt operations has size O(n′/p) and is broadcast to all processors in time O(n′/p + g·n′/p + L·log p).

Lemma 13 Forming the lower 2·log p levels of the Borůvka-tree B and broadcasting the remaining tree to all processors can be performed in time

O((n′/p)·log p + g·(n′/p)·log p + L·log p) = O(n/p + g·n/p + L·log p)

in the BSP model.

Proof: Every parallel shunt operation takes O(n′/p + g·n′/p + L) time. Together with the other steps, the total time consumption can be estimated as O((n′/p)·log p + g·(n′/p)·log p + L·log p). This can be simplified to O(n/p + g·n/p + L·log p), because n′ ≤ n/log p and, when n′ < p, the term L·log p dominates, since L ≥ g ≥ 1. □


9.2 Lowest Common Ancestors

We are now given a tree which is divided into two parts. The lower 2·log p levels form a forest F′ which is distributed evenly among the processors; the trees of F′ are designated as microtrees. The remaining upper levels (the macrotree, or head H) consist of O(n′/p) nodes, of which every processor holds a local copy. In this section it is shown how the lowest common ancestor (LCA) of the endpoints of each non-tree edge is computed. We use a variant of the non-optimal algorithm described in [JaJ92], where it is shown that the LCA problem can be reduced to the range-minima problem, defined as follows: given an array A of size n = 2^l and a set of range-minimum queries RMQ(i, j), find for every query RMQ(i, j) the minimum of {A[i], A[i+1], ..., A[j]}. The algorithm constructs a complete binary tree whose leaves correspond to the elements of A, and associates with every internal node v two arrays that correspond to the prefix and suffix minima of the subarray defined by the leaves of the subtree rooted at v. After this preprocessing step, every range-minimum query can be answered in constant time, provided read conflicts are allowed (i.e., on the CREW PRAM).

In order to reduce the communication time required for determining the LCAs, the following load-balancing technique is employed. The edge array E is divided into two edge arrays E_low and E_high. E_low stores the edges starting at nodes with degree ≤ log² p (low-degree nodes), whereas E_high contains the edges starting at nodes with degree > log² p (high-degree nodes). Both arrays can be distributed evenly among the processors in time at most 2·g·m/p + O(L·log p). Note that the size of E_low is at most O(n′·log² p) = O(n·log p), and that every processor contains the segment leaders of at most 2·m/(p·log² p) high-degree nodes. First, we give a top-level description of the LCA algorithm, which will be called bsp-lca; the sketch below illustrates the Euler-tour reduction used in its Steps (2) and (3). After the algorithm, we describe each step in more detail and determine its time complexity.
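The following sketch illustrates the reduction for two distinct leaves, as used in Steps (2) and (3) of Algorithm 8: build the Euler tour with node levels, record leftmost and rightmost appearances, and answer the LCA as the minimum-level entry of the corresponding range. Helper names are ours, and the linear scan stands in for the constant-time query from precomputed prefix/suffix minima:

    def euler_tour(children, root):
        """children[v] is the list of children of v."""
        A, level = [], []
        def visit(v, d):
            A.append(v); level.append(d)
            for c in children[v]:
                visit(c, d + 1)
                A.append(v); level.append(d)
        visit(root, 0)
        return A, level

    def leaf_lca(children, root, u, v):
        """LCA of two distinct leaves u, v via one range-minimum query."""
        A, level = euler_tour(children, root)
        l, r = {}, {}
        for i, x in enumerate(A):
            l.setdefault(x, i)     # leftmost appearance l(x)
            r[x] = i               # rightmost appearance r(x)
        # For distinct leaves, exactly one of the two orderings holds.
        lo, hi = (r[u], l[v]) if r[u] <= l[v] else (r[v], l[u])
        i_min = min(range(lo, hi + 1), key=lambda i: level[i])
        return A[i_min]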

Algorithm 8 bsp-lca

(1) Divide the edge array E into two edge arrays E_low and E_high, as described above. Distribute both arrays evenly among the processors.
(2) Construct the Euler tour array A corresponding to a depth-first search of the forest F′, and use it to compute the level of each element of A.
(3) For each node v, compute its leftmost appearance l(v) in A and its rightmost appearance r(v) in A. After that, replace every query LCA(u, v) by the range-minimum query RMQ(r(u), l(v)) if r(u) ≤ l(v), and by RMQ(r(v), l(u)) otherwise.
(4) For each node v in the forest F′, compute the prefix and suffix minima arrays as described in [JaJ92]. Then O(log p) prefix and suffix minima will be computed for every leaf u. Store these values in an array associated with u.
(5) The range-minimum queries are answered as follows. (a) Every non-tree edge incident to a low-degree node v sends a prefix or suffix minimum request to the processor responsible for v. (b) If v is a high-degree node, then the array associated with node v is broadcast to all processors storing non-tree edges incident to v.
(6) If the LCA of two nodes is not within F′, then the LCA can be determined by looking at the head H without any communication.
- End of Algorithm

We now describe each step in more detail and determine its time complexity. As described above, Step (1) can be implemented in 2·g·m/p + O(L·log p) time. By Lemma 6, Step (2) takes O(g·(n′/p)·log p + L·log p) = O(g·n/p + L·log p) time. The time needed for determining l(v) and r(v) is O(g·n′/p + L), because v = A[i] is the leftmost (rightmost) appearance iff level(A[i−1]) = level(A[i]) − 1 (respectively, level(A[i+1]) = level(A[i]) − 1). To replace every query LCA(u, v) by the corresponding range-minimum query, the concurrent-read technique of Section 7.2 is applied. First, every segment leader corresponding to a node v fetches l(v) (which is equal to r(v) because v is a leaf). Since every processor is responsible for at most O((n/p)·log p) low-degree nodes and at most 2·m/(p·log² p) high-degree nodes, the cost of this substep is O(g·((n/p)·log p + m/(p·log² p)) + L) = o(g·m/p) + O(g·(n/p)·log p + L). After that, a segmented parallel prefix operation is performed to broadcast l(v) to all edges with starting node v, which takes O(m/p + L·log p) time. Finally, every edge (v, w) sends l(v) to its opposite edge (w, v) in time 2·g·m/p + L. Using the algorithm described in [JaJ92], Step (4) takes O(g·(n′/p)·log p + L·log p) = O(g·n/p + L·log p) time. To determine the time consumption of Step (5), we observe that low-degree nodes cause at most O(g·(n/p)·log p + L·log p) communication cost. As noted before, every processor is responsible for at most 2·m/(p·log² p) high-degree nodes; for each of these nodes, it receives an array of size O(log p). The total time consumption of Step (5) is therefore o(g·m/p) + O(g·(n/p)·log p + L·log p). Finally, Step (6) takes O(n′/p + m/p) local computation time. The total time needed by algorithm bsp-lca is given in the following lemma.

Lemma 14 Let n ≥ p and m ≥ n·log p. The BSP cost of algorithm bsp-lca is given by

W = O(m/p + (n/p)·log p);
H = 4·m/p + o(m/p) + O((n/p)·log p); and
S = O(log p).

9.3 Verification

It remains to verify that the weight of every non-tree edge (u, v) is at least as large as the weight of any edge on the path between u and v. This is done as follows. First, every non-tree edge (u, v) is replaced by the query edge (u, lca(u, v)). Since the opposite edge (v, u) also appears in the array of non-tree edges, the query edge (v, lca(u, v)) is also present. From here on, it will be assumed that every non-tree edge (u, x) is between a leaf u and one of its ancestors x. From determining the LCAs, we know for each edge (u, x) whether x lies within the same microtree as u or in the head H. If x lies in H, then the edge (u, x) is split into two edges (u, r) and (r, x) with the same weight as (u, x), where r is the root of the microtree that contains u. The edges (r, x) can be verified locally in O((n′ + m)/p) time using the sequential linear-time algorithm given in [Kin95], since every processor has a local copy of H.

The edges within the microtrees are verified as follows. First, for every leaf v, we compute the heaviest edge on all paths leading from v to the root r of the microtree that contains v. This is done identically to the macrotree verification algorithm given in [KPRS97]. The algorithm proceeds in 2·log p iterations. In the first iteration, each node v sends the weight of the edge (v, f(v)) to all of its children, where f(v) is the parent of v. In the remaining iterations, every node forwards the weight of the edge it received in the previous iteration to all of its children. An array Path_To_Anc[v] of size O(log p) is associated with every leaf v, and every leaf saves the i-th value received in position i of this array. Every iteration takes O(g·n′/p + L) time, since every node has degree at most 5 [KPRS97]. The total time complexity of this step is therefore O(g·(n′/p)·log p + L·log p) = O(g·n/p + L·log p). Finally, a prefix maximum operation is performed on every Path_To_Anc[v] array, which takes O(n/p) local computation time. As in bsp-lca, the requests are answered either by broadcasting the Path_To_Anc array to the processors that need it or by letting the edges send requests. Using the same argument as before, the communication time of this step is o(g·m/p) + O(g·(n/p)·log p + L). In order to combine the results, every edge simply sends the outcome to its opposite edge, which takes 2·g·m/p + L time.
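A sequential sketch of this Path_To_Anc computation, assuming each microtree is given by parent pointers and edge weights to parents, and that leaves are not roots (names are ours):

    def microtree_path_maxima(parent, wt_to_parent, leaves, depth):
        n = len(parent)
        # What each node sends downwards: initially its own parent-edge weight,
        # afterwards whatever its parent sent in the previous iteration.
        sending = [wt_to_parent[v] if parent[v] != v else None for v in range(n)]
        path = {u: [sending[u]] for u in leaves}     # Path_To_Anc[u][0]
        for _ in range(depth - 1):
            sending = [sending[parent[v]] if parent[v] != v else None
                       for v in range(n)]
            for u in leaves:
                if sending[u] is not None:
                    path[u].append(sending[u])
        for u in leaves:                # prefix maxima: heaviest edge from u up to
            arr = path[u]               # each of its ancestors in the microtree
            for i in range(1, len(arr)):
                arr[i] = max(arr[i], arr[i - 1])
        return path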

Lemma 15 Let n ≥ p and m ≥ n·log p. Comparing the weights of all non-tree edges (u, v) with the weights of the paths from u to v in the minimum spanning tree can be done in

W = O((n + m)/p);
H = 2·m/p + o(m/p) + O((n/p)·log p); and
S = O(log p)

time under the BSP model.

By combining Lemma 13, Lemma 14, and Lemma 15 we obtain:

Theorem 7 Let n ≥ p and m ≥ n·log p. Given a tree T of size ≤ n/log p distributed evenly among the processors, and an array E of non-tree edges of size 2·m, also distributed evenly among the processors, it can be verified for each edge e = (v, w) whether the weight of e is at least as large as the weight of the heaviest edge on the path between v and w in T, in time W + g·H + L·S, where

W = O(m/p + (n/p)·log p);
H = 6·m/p + o(m/p) + O((n/p)·log p); and
S = O(log p)

in the BSP model.

10 Conclusions

In this paper we presented lower and upper bounds for finding the minimum spanning tree on the BSP model. In particular, it was shown that when p ≤ m/n, any BSP algorithm for computing the minimum spanning tree requires communication time Ω(g·n), and an algorithm matching this lower bound was also presented. On the other hand, when p > m/n, it takes at least Ω(g·m/p) communication time to find the minimum spanning tree on the BSP model. For this case we also presented an algorithm matching the lower bound, but it requires that m/n ≥ log p.

The graph representation that we have used seems natural in some ways, because it allows many simple graph problems to be solved in linear time. Examples of such problems include computing the degree of each node and, as was shown in Section 7, the shortcut operation. When using this graph representation, a (segmented) prefix computation becomes an important primitive operation. It is not unreasonable to assume that prefix computations can be performed in time O(L), for the following reasons. First, several parallel architectures have dedicated hardware to perform parallel prefix operations. Second, many BSP libraries execute an all-to-all routing when routing an h-relation, which takes at least the time needed for a parallel prefix operation. Under this assumption, the presented algorithms require only O(log p) supersteps.

One reason for developing BSP algorithms is that BSP algorithms appear to be more practical than PRAM algorithms. Götz [Got98] implemented some MST algorithms for dense graphs; among other things, these experiments showed that algorithm bsp-dense achieves good speed-ups in practice. A practical comparison between these algorithms and PRAM algorithms for sparse graphs is still lacking.

There are several possible directions for future research. First, we remark that a deterministic sequential linear-time algorithm for finding a minimum spanning tree is still an open problem. Second, a logarithmic-time, linear-work EREW PRAM algorithm for the minimum spanning tree problem also remains to be discovered.

References

[ABK95] Micah Adler, John W. Byers, and Richard M. Karp. Parallel sorting with limited bandwidth. In Annual ACM Symposium on Parallel Algorithms and Architectures, pages 129-136, 1995.

[AS87] B. Awerbuch and Y. Shiloach. New connectivity and MSF algorithms for Shuffle-Exchange network and PRAM. IEEE Transactions on Computers, C-36(10):1258-1263, 1987.

[BD96] A. Baumker and W. Dittrich. Fully dynamic search trees for an extension of the BSP model. In Annual ACM Symposium on Parallel Algorithms and Architectures, 1996.

[BDM95] Armin Baumker, Wolfgang Dittrich, and Friedhelm Meyer auf der Heide. Truly efficient parallel algorithms: c-optimal multisearch for an extension of the BSP model. In Proc. of the Annual European Symposium on Algorithms, 1995.

[BDMR96a] Armin Baumker, Wolfgang Dittrich, Friedhelm Meyer auf der Heide, and Ingo Rieping. Realistic parallel algorithms: Priority queue operations and selection for the BSP* model. In Proc. of the Annual International EURO-PAR Conference, LNCS, 1996.

[BDMR96b] Armin Baumker, Wolfgang Dittrich, Friedhelm Meyer auf der Heide, and Ingo Rieping. Realistic parallel algorithms: Priority queue operations and selection for the BSP* model. Technical Report TR-RSFB-96-027, University of Paderborn, Computer Science Dept., 1996.

[Ble89] Guy E. Blelloch. Scans as primitive parallel operations. IEEE Transactions on Computers, pages 1526-1538, 1989.

[CDF+97] E. Caceres, F. Dehne, A. Ferreira, P. Flocchini, I. Rieping, A. Roncato, N. Santoro, and S.W. Song. Efficient parallel graph algorithms for coarse grained multicomputers and BSP. In Annual International Colloquium on Automata, Languages and Programming (ICALP), 1997.

[CDR86] S. Cook, C. Dwork, and R. Reischuk. Upper and lower time bounds for parallel random access machines without simultaneous writes. SIAM Journal on Computing, 15(1):87-97, 1986.

[Cho97] K.W. Chong. Finding minimum spanning trees on the EREW PRAM. Manuscript, 1997.

[CKP+93] D. Culler, R. Karp, D. Patterson, A. Sahay, K. Schauser, E. Santos, R. Subramonian, and T. von Eicken. LogP: Towards a realistic model of parallel computation. In Proc. 4th Symp. on Principles and Practice of Parallel Programming, pages 1-12. ACM SIGPLAN, May 1993.

[CKT96] Richard Cole, Philip N. Klein, and Robert E. Tarjan. Finding minimum spanning forests in logarithmic time and linear work using random sampling. In Annual ACM Symposium on Parallel Algorithms and Architectures, pages 243-250. ACM, 1996.

[CLR90] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. The MIT Press, Cambridge, Massachusetts; McGraw-Hill, 1990.

[DFRC93] Frank Dehne, Andreas Fabri, and Andrew Rau-Chaplin. Scalable parallel computational geometry for coarse grained multicomputers. In Annual ACM Symposium on Computational Geometry, 1993.

[DG97] F. Dehne and S. Gotz. Efficient parallel minimum spanning tree algorithms for coarse grained multicomputers and BSP. Manuscript, 1997.

[DT94] B. Dixon and R.E. Tarjan. Optimal parallel verification of minimum spanning trees in logarithmic time. Parallel and Distributed Computing, Theory and Practice, LNCS 805, pages 13-22, 1994.

[GMR97] Philip B. Gibbons, Yossi Matias, and Vijaya Ramachandran. Can a shared-memory model serve as a bridging model for parallel computation? In Annual ACM Symposium on Parallel Algorithms and Architectures, 1997.

[Goo96] Michael T. Goodrich. Communication-efficient parallel sorting. In Proc. of the ACM Symposium on Theory of Computing, 1996.

[Goo97] Michael T. Goodrich. Randomized fully-scalable BSP techniques for multisearching and convex hull construction. In ACM-SIAM Symposium on Discrete Algorithms, 1997.

[Got98] S. Gotz. Communication-efficient parallel algorithms for minimum spanning tree computation. Master's thesis, University of Paderborn, May 1998.

[GS96] Alexandros V. Gerbessiotis and Constantinos J. Siniolakis. Deterministic sorting and randomized median finding on the BSP model. In Annual ACM Symposium on Parallel Algorithms and Architectures, 1996.

[GV94] Alexandros V. Gerbessiotis and Leslie Valiant. Direct bulk-synchronous parallel algorithms. Journal of Parallel and Distributed Computing, 22:251-267, 1994.

[JaJ92] Joseph JaJa. An Introduction to Parallel Algorithms. Addison-Wesley Publishing Company, 1992.

[JM95] Donald B. Johnson and Panagiotis Metaxas. A parallel algorithm for computing minimum spanning trees. Journal of Algorithms, 19:383-410, 1995.

[JW96] Ben H. H. Juurlink and Harry A. G. Wijshoff. Communication primitives for BSP computers. Information Processing Letters, 58(6):303-310, 1996.

[Kar93] David R. Karger. Random sampling for minimum spanning trees and other optimization problems. In Annual IEEE Symposium on Foundations of Computer Science, 1993.

[Kin95] Valerie King. A simpler minimum spanning tree verification algorithm. In Workshop on Algorithms and Data Structures, LNCS 955, pages 440-448. Springer-Verlag, 1995.

[KKT95] D.R. Karger, P.N. Klein, and R.E. Tarjan. A randomized linear-time algorithm to find minimum spanning trees. Journal of the ACM, 42:321-328, 1995.

[KPRS97] V. King, C.K. Poon, V. Ramachandran, and S. Sinha. An optimal EREW PRAM algorithm for minimum spanning tree verification. Information Processing Letters, 62(3):153-159, 1997.

[PR97] C.K. Poon and V. Ramachandran. A randomized linear work EREW PRAM algorithm to find a minimum spanning forest. In International Symposium on Algorithms and Computation (ISAAC), LNCS 1350, pages 212-222, 1997.

[Rei93] J.H. Reif, editor. Synthesis of Parallel Algorithms. Morgan Kaufmann Publishers, 1993.

[Val90] Leslie Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8), August 1990.

[Wyl79] J.C. Wyllie. The Complexity of Parallel Computation. Technical Report 79/387, Cornell University, Ithaca, NY, 1979.

[Yao79] A. C. Yao. Some complexity questions related to distributive computing. In Proc. of the ACM Symposium on Theory of Computing, pages 209-213, 1979.

[Yao83] A.C. Yao. Lower bounds by probabilistic arguments. In 24th IEEE Symposium on Foundations of Computer Science, pages 420-428, 1983.