
Efficient Algorithms for the All Nearest Neighbor and Closest Pair Problems on the Linear Array with a Reconfigurable Pipelined Bus System

Yuh-Rau Wang, Shi-Jinn Horng, and Chin-Hsiung Wu

Abstract—In this paper, we present two $O(1)$ time algorithms for solving the 2D all nearest neighbor (2D_ANN) problem, the 2D closest pair (2D_CP) problem, the 3D all nearest neighbor (3D_ANN) problem, and the 3D closest pair (3D_CP) problem of $n$ points on the linear array with a reconfigurable pipelined bus system (LARPBS) from the computational geometry perspective. The first $O(1)$ time algorithm, which invokes the ANN properties (introduced in this paper) only once, can solve the 2D_ANN and 2D_CP problems of $n$ points on an LARPBS of size $\frac{1}{2}n^{5/3+\epsilon}$, and the 3D_ANN and 3D_CP problems of $n$ points on an LARPBS of size $\frac{1}{2}n^{7/4+\epsilon}$, where $0 < \epsilon = \frac{1}{2^{c+1}-1} \le 1$ and $c$ is a constant and positive integer. The second $O(1)$ time algorithm, which recursively invokes the ANN properties $k$ times, can solve the kD_ANN and kD_CP problems of $n$ points on an LARPBS of size $\frac{1}{2}n^{3/2+\epsilon}$, where $k = 2$ or 3, $0 < \epsilon = \frac{1}{2^{c+1}-1} \le 1$, and $c$ is a constant and positive integer. To the best of our knowledge, all results derived above are the best $O(1)$ time ANN algorithms known.

Index Terms—Parallel algorithm, all nearest neighbors, closest pair, LARPBS, reconfigurable bus model.

Y.-R. Wang is with the Department of Computer Science and Information Engineering, St. John's & St. Mary's Institute of Technology, 499, Section 4, Tam King Road, Tamsui, Taipei, Taiwan. E-mail: [email protected].
S.-J. Horng is with the Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, 43, Section 4, Kee-Lung Road, Taipei, Taiwan. E-mail: [email protected].
C.-H. Wu is with the Department of Information Technology & Communication, Shih Chien University Kaohsiung Campus, 200 University Rd., Neimen Shiang, Kaohsiung, Taiwan. E-mail: [email protected].

Manuscript received 8 Jan. 2004; revised 14 June 2004; accepted 6 Aug. 2004; published online 20 Jan. 2005. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TPDS-0008-0104.

1 INTRODUCTION

The ANN problem can be solved from either the computational geometry or the image processing perspective. From a computational geometry perspective, the ANN problem is defined as follows: Given a set of points, to find a nearest neighbor from the set for each point of the set is defined as the all nearest neighbor (ANN) problem. Given a set of points, to find the closest pair from the given set is defined as the closest pair (CP) problem. If the given set $V$ of $n$ points is in the two-dimensional (2D) Euclidean space ($E^2$), then we define the ANN problem as the 2D all nearest neighbor (2D_ANN) problem and the closest pair (CP) problem as the 2D closest pair (2D_CP) problem. The 3D_ANN and 3D_CP problems can be defined similarly if the given set $V$ of $n$ points is in the three-dimensional (3D) Euclidean space ($E^3$). After we solve the ANN problem, the CP problem can be easily solved by simply finding the minimum of the ANN distances over all points in the given set. The ANN problem has been applied in pattern recognition, geography, mathematical ecology, and solid-state physics [26]. It has been introduced by several

researchers under different parallel computation models as follows. For the 2D_ANN problem of $n$ points, Miller and Stout [17] proposed an $O(n)$ time algorithm on a 2D mesh-connected computer of $n \times n$ processors. Prasanna and Raghavendra [25] proposed an $O(n^{1/3})$ time algorithm on a 2D $n \times n$ mesh with multiple broadcasting (MMB). Miller et al. [16] proposed an $O(\log n)$ time algorithm on a 2D $n \times n$ MMB. Eshaghian [2] proposed two algorithms for the optical mesh with reconfigurable free-space optical beams (OMC): One runs in $O(\log n)$ time on a 2D $\frac{n}{\log n} \times \frac{n}{\log n}$ array and the other runs in $O(1)$ time on a 2D $n^2 \times n^2$ array. Tsai et al. [34] proposed two algorithms on the hyper-bus broadcast network (HBBN): One runs in $O(\frac{\log n}{\log \log n})$ time on a 2D HBBN using $n \times n$ processors, and the other runs in $O(1)$ time on a 2D HBBN using $n \times n$ processors with $n^{1/c}$-bit bus width, where $n^{1/c} > \log n$, and $c$ is a constant and positive integer. Miller and Stout [18] proposed an $O(n^{1/2})$ time algorithm on a regular (nonconfigurable) mesh computer of $n^{1/2} \times n^{1/2}$ processors. Jang et al. [5] proposed an $O(1)$ time algorithm on an $n \times n$ reconfigurable mesh. Up to now, very few parallel algorithms have been introduced for solving the higher-dimensional ANN problem. For $n$ points in a fixed $k$-dimensional Euclidean space ($E^k$), where $k \ge 2$, Lai and Sheng [8] proposed an $O(1)$ time algorithm for the ANN problem on an $n \times n$ reconfigurable mesh.

Some researchers solved the 2D_ANN problem from an image processing perspective, where the domain is a 2D image. In that setting, the 2D_ANN problem is redefined as follows: Given a 2D $N \times N$ image, to find a nearest neighbor (i.e., black pixel) from the image for each black pixel of the image is defined as the 2D_ANN problem. Assume that there are $n$ black pixels in a 2D $N \times N$ image. Clearly, $0 \le n \le N^2$. Note that the 2D_ANN problem appears closely related to the 2D Euclidean distance transform (2D_EDT) problem; however, the two are very different (e.g., their properties are very different). The 2D_EDT is defined as an operation that converts an image (of size $N \times N$) consisting of black and white pixels to an image where each pixel has a value that represents the Euclidean distance to its nearest black pixel. If a pixel is a black pixel, then its nearest black pixel is the pixel itself. See [10], [36], [37] for reference. Some papers have been proposed to solve the 2D_ANN problem from the image processing perspective as follows. Kao and Horng [6] proposed an $O(1)$ time algorithm using an $N \times N$ reconfigurable array of processors (RAP) with $N^{2/c}$-bit bus width, where $N^{2/c} > 2 \log N$, and $c$ is a constant and positive integer. Also, Pan et al. [22] proposed three algorithms on the linear array with a reconfigurable pipelined bus system (LARPBS): One runs in $O(\log \log N)$ time on $N^2$ processors, another runs in $O(1)$ time on $N^{2+\epsilon}$ processors for any fixed constant $\epsilon > 0$, and the other runs in $O(1)$ time on $N^2$ processors with high probability.

Basically, the ANN problem is considered to be a different problem if it is approached from a different perspective. Usually, the application itself decides which approach should be chosen; hence, it is very difficult to judge which one is better. Assuming both approaches suit a given ANN application, the image processing approach might be a good choice if $n = O(N^2)$; however, if $n = O(N)$, $n = O(\log N)$, or even $n = O(1)$, then the image processing approach might be very inefficient. In this paper, we consider the ANN problem from the computational geometry perspective.

Recently, two related models based on reconfigurable optical buses have been proposed: the array with reconfigurable optical buses (AROB) proposed by Pavel and Akl [23], and the LARPBS proposed by Pan and Hamdi [20], [21]. The AROB model can be distinguished into the basic AROB and the extended AROB. Counting and online switching are allowed in the extended AROB model during a bus cycle, but are not permitted in the basic AROB or the LARPBS. The AROB can extend to higher dimensions, while the LARPBS is only a 1D array. In this paper, we base our algorithms on the LARPBS. Since the LARPBS can simulate the basic AROB in $O(1)$ steps, all the algorithms derived (on the LARPBS) in this paper can also be implemented on a basic AROB with the same time complexity and the same number of processors.

In this paper, by devising some innovative and efficient minimum finding and sorting algorithms, proving some new geometric properties of 3D_ANN, and fully taking advantage of the LARPBS, we present two parallel algorithms for solving the 2D_ANN, 2D_CP, 3D_ANN, and 3D_CP problems of $n$ points in $O(1)$ time on the LARPBS model from the computational geometry perspective: The first algorithm, which invokes the geometric properties as described in Section 4 only once, can solve the 2D_ANN and 2D_CP problems using $\frac{1}{2}n^{5/3+\epsilon}$ processors and the 3D_ANN and 3D_CP problems using $\frac{1}{2}n^{7/4+\epsilon}$ processors; the second algorithm, which recursively invokes the geometric properties $k$ times, can solve the kD_ANN and kD_CP problems using $\frac{1}{2}n^{3/2+\epsilon}$ processors, where $k = 3$ (or 2), $0 < \epsilon = \frac{1}{2^{c+1}-1} \le 1$, and $c$ is a constant and positive integer. Compared with the best previous $O(1)$ time ANN algorithms proposed in [5] and [8], which are also implemented by recursively invoking the geometric properties, the result of our second algorithm is far better than those derived in [5] and [8]. Even our first algorithm still uses fewer processors and executes faster. To the best of our knowledge, all results derived in this paper are the best $O(1)$ time algorithms known, compared with all the previously published results.

The remainder of this paper is organized as follows: In Section 2, we introduce the computation model used in this paper. In Section 3, we introduce some basic algorithms which will be used as the building blocks of this paper. In Section 4, we introduce some basic geometric properties of ANN. In Sections 5 and 6, two parallel algorithms for solving the 2D_ANN, 2D_CP, 3D_ANN, and 3D_CP problems are presented. Finally, some concluding remarks are included in the last section.

Fig. 1. (a) An LARPBS of size N. (b) An LARPBS is split into two independent subarrays.

2 THE LARPBS MODEL

A linear array with a reconfigurable pipelined bus system (LARPBS) [20], [21], which is in fact the same as a basic 1D AROB, is an array of $N$ processors $P_0, P_1, \ldots, P_{N-1}$ connected by optical buses, as shown in Fig. 1. Each processor has a local memory and is identified by a unique index, denoted $P_i$, $0 \le i \le N-1$. The optical bus is constructed from three identical waveguides: the message, select, and reference waveguides. The message waveguide is used for sending data, and the select and reference waveguides are used for sending address information. Each waveguide is conceptually divided into two segments, the transmitting segment and the receiving segment. Each processor $P_i$ is connected to the optical bus with two couplers. One coupler, a $1 \times 2$ optical switch (denoted RST(i)), is used to write data on the upper (transmitting) segment of the bus, and the other, a $2 \times 1$ optical switch (denoted RSR(i)), is used to read data from the lower (receiving) segment of the bus. The optical signals propagate unidirectionally, from left to right on the transmitting segment and from right to left on the receiving segment. This bus system is referred to as the folded-bus connection in [3], [15], [29], [30]. Since each processor $P_i$ has to connect to three waveguides and each waveguide needs a pair of RST(i) and RSR(i), each processor $P_i$ needs six local optical switches: three for connecting to its three transmitting segments and three for connecting to its three receiving segments. These optical switches can be set to either cross or straight by the local processor $P_i$ for reconfiguration. In the following discussion, these switches will be called reconfigurable switches due to their function.

Due to this reconfigurability, an LARPBS can be partitioned into $\lambda$ (where $\lambda \ge 2$) independent subarrays. All these subarrays can operate independently to solve different subproblems without interference [19]. For example, if all reconfigurable switches are set to straight, the bus system operates as a regular pipelined bus system, as shown in Fig. 1a. If all six reconfigurable switches of processor $P_i$ are set to cross, then the LARPBS is split into two separate systems, one consisting of processors $P_0, P_1, \ldots, P_i$, whose leader processor is $P_0$, and the other consisting of processors $P_{i+1}, \ldots, P_{N-1}$, whose leader processor is $P_{i+1}$, as shown in Fig. 1b. Each processor uses a set of control registers to store the information needed to control the transmission and reception of messages by that processor.

The LARPBS model assumes that the optical distances between any two consecutive processors on the bus are the same (i.e., $d_0$ as shown in Fig. 1a), which results in the same propagation delay between any two consecutive processors. In order to ensure that all processors can write their messages on the bus simultaneously without collision, the collision-free condition, $d_0 > b w c_g$, must be satisfied, where $b$ is the maximum number of binary bits in each message, $w$ is the width of an optical pulse used to represent one bit in each message, and $c_g$ is the velocity of light in the waveguide. Besides, Guo et al. [3] also showed that when $d_0$ approaches $b w c_g$, the pipelined bus can accommodate $N$ messages in the same bus cycle due to the property of pipelining.

Fig. 2. (a) An LARPBS of size 5. (b) The switch-controlled conditional delays implemented using $2 \times 2$ optical switches.

Fig. 2a shows an LARPBS of size 5. Only the select and reference waveguides are shown; the message waveguide is similar to the reference waveguide. A unit delay is defined to be the spatial length of a single optical pulse, shown as a loop on the receiving segments of the reference and message waveguides. Further, a switch-controlled conditional delay is added between every pair of consecutive processors on the transmitting segment of the select waveguide. If the switch is set to the cross state, one unit of delay is introduced in the select waveguide; if the switch is set to the straight state, no delay is introduced. See Fig. 2a for illustration. Note that, in Fig. 1, all the unit delays and conditional switches are omitted to avoid confusion.

As in many other synchronous parallel computing systems, an LARPBS computation is a sequence of alternating global communication and local computation steps. The time complexity of an algorithm is measured in terms of the number of bus cycles and the number of arithmetic operations. In the LARPBS model, a petit cycle ($\pi$) is defined as the time needed for a pulse to traverse the optical distance between two consecutive processors on the bus; that is, $\pi = d_0 / c_g$. A bus cycle is defined as the end-to-end propagation delay time $\tau$ (where $\tau = 2N\pi$) of the messages on the optical bus, together with the message processing time for message generation, buffering, encoding, decoding, etc., in the source and destination processors. That is, the bus cycle can be considered to be $O(\tau)$. Although the message communication time is proportional to the system size $N$, in parallel computing the time for a message to propagate along a global broadcasting bus is assumed to be constant [1]. In [15], Melhem et al. assumed that a bus cycle in a pipelined optical bus is compatible with the computation time of any arithmetic or logic operation in the processors. Since the ratio $r$ between the message processing time and a petit cycle $\pi$ (i.e., the time needed for communication over a distance $d_0$) is on the order of $10^3$ [3], [12], [21], [23], a bus cycle can be regarded as constant time for systems of size $O(r)$ (e.g., $r = 10^3$). In order for the message processing time to be of the same order of magnitude as the end-to-end communication time $\tau$, the number of processors should satisfy $N \approx r$. Obviously, the bus cycle $O(\tau)$, where $\tau = 2N\pi$, is proportional to $N$. However, for a bus cycle to be considered as one step, with each step taking $O(1)$ time, $N$ should be restricted to a reasonable range. An example is given as follows: Assume that each message is 10 bits and its transmission speed is 20 Gb/s in an LARPBS. Then, $bw = 10 \text{ bits} / (20 \times 10^9 \text{ bits/s}) = 0.5$ ns. Since $d_0 > b w c_g$ and $\pi = d_0 / c_g$, we have $\pi > bw = 0.5$ ns. Therefore, when $N = 10^3$ and $\pi = 0.5$ ns, we get $\tau = 2N\pi = 2 \times 10^3 \times 0.5 \text{ ns} = 1\ \mu s$. That is, if $N$ is on the order of $10^3$, then a bus cycle time of $1\ \mu s$ is compatible with the time of a CPU operation [15], [28]. This shows that a bus cycle can be regarded as constant time for an LARPBS system of size $O(r)$ (e.g., $r = 10^3$), and the size of a k-D reconfigurable optical bus system (e.g., a k-D AROB) can be quite large (e.g., $1000^k$). In summary, since the message communication time on the whole optical pipelined bus is compatible with the computation time of any arithmetic or logic operation, we can regard a bus cycle, as well as a processor cycle, as one step, with each step taking $O(1)$ time. This assumption for the time complexity measure has been shown and adopted widely in the literature [3], [4], [12], [15], [23], [24], [32], [33], [37], [38], [39]. In this paper, we adopt the same time complexity measure.

There are several ways, such as the time waiting function [3], the time-division multiplexing scheme [30], and the coincident pulse technique [11], [30], for the AROB computation model to route messages from one processor to another. However, only the coincident pulse technique is used on the LARPBS computation model for addressing [13].
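To make the arithmetic above concrete, the following short Python sketch (our illustration, not part of the model definition) recomputes the collision-free bound $bw$ and the bus cycle $\tau = 2N\pi$ for the example values:

```python
# Sanity check of the bus-cycle example above (b = 10 bits at 20 Gb/s, N = 10^3).
b = 10                       # bits per message
rate = 20e9                  # transmission speed in bits/s, so w = 1/rate
bw = b / rate                # temporal length of one message
print(f"bw  = {bw * 1e9:.1f} ns")   # 0.5 ns; the petit cycle pi must exceed this

N = 1000                     # number of processors on the bus
pi = 0.5e-9                  # petit cycle, chosen at the collision-free bound
tau = 2 * N * pi             # end-to-end bus cycle tau = 2 * N * pi
print(f"tau = {tau * 1e6:.1f} us")  # 1.0 us, comparable to one CPU operation
```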

2.1 Some Primitive Operations on the LARPBS

In parallel processing systems, communication efficiency often determines the effectiveness of processor utilization, which, in turn, determines performance. Limited bandwidth, capacitive loading, and cross-talk caused by mutual inductance are the fundamental constraints in electronic systems. From a technological point of view, optical buses offer numerous advantages, including high bandwidth and low interference. However, the two most relevant properties of optical buses are their unidirectional propagation and predictable propagation delay per unit length. These two properties enable synchronized concurrent access to an optical bus in a pipelined fashion [3], [15], [29], [30]. This makes architectures built on optical bus interconnections suitable for a wide range of applications, especially communication-intensive ones. In the following, we introduce some primitive operations which are communication intensive.

1. (Multiple) one-to-one communication: Assume that a subset of processors $P_{s_0}, P_{s_1}, \ldots, P_{s_{m-1}}$ are senders and another subset of processors $P_{d_0}, P_{d_1}, \ldots, P_{d_{m-1}}$ are receivers. For a fixed $i$, processor $P_{s_i}$ sending a message to processor $P_{d_i}$ is defined as one-to-one communication. For all $i$, $0 \le i \le m-1$, the processors $P_{s_i}$ sending messages to the processors $P_{d_i}$ simultaneously is defined as multiple one-to-one communication.
2. Broadcasting: For a fixed integer $s$, $0 \le s \le N-1$, a source processor $P_s$ sending a message to all the $N$ processors $P_0, P_1, \ldots, P_{N-1}$ is defined as broadcasting.
3. Multicasting: For a fixed integer $s$, $0 \le s \le N-1$, a source processor $P_s$ sending a message to a subset of the $N$ processors $P_0, P_1, \ldots, P_{N-1}$ is defined as multicasting.
4. Multiple multicasting: Assume that there are $i$ sending processors $P_{s_0}, P_{s_1}, \ldots, P_{s_{i-1}}$ and $i$ disjoint groups of receiving processors, $GP_{d_0}, GP_{d_1}, \ldots, GP_{d_{i-1}}$. For all $j$, $0 \le j \le i-1$, processor $P_{s_j}$ broadcasting a message to all the processors in group $GP_{d_j}$ is defined as multiple multicasting.
5. Binary prefix sum operation: Assume that each processor $P_i$, $0 \le i \le N-1$, holds a binary value $v_i$. The computation of all the prefix sums $ps_i = v_0 + v_1 + \ldots + v_i$ is defined as the binary prefix sum operation.
6. Binary summation operation: Assume that each processor $P_i$, $0 \le i \le N-1$, holds a binary value $v_i$ of one or zero. The task of the binary summation operation is to compute the sum of these binary values and move the sum to the first processor.
7. Compression operation: Each processor is marked either as active or as inactive by setting a bit in one of its registers. Also, each processor $P_i$, $0 \le i \le N-1$, holds a data item $d_i$. The compression operation brings all the data items of the active processors into contiguous processors at one end of the array, in their original order. Note that this operation is named compaction on the AROB model.

The following lemma covers the primitive operations stated above; see [12], [21] for the detailed implementation. These powerful primitive operations, plus the reconfigurability of the LARPBS model, make the LARPBS very attractive for solving problems that are both computation and communication intensive, such as the ANN and CP problems.

Lemma 1 [12], [21]. (Multiple) one-to-one communication, broadcasting, multicasting, multiple multicasting, binary prefix sum, binary summation, and compression can be done in $O(1)$ bus cycles on an LARPBS.
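As an illustration of how these primitives compose, here is a minimal sequential Python sketch of the compression operation (operation 7): a binary prefix sum over the active flags gives each active item its destination index, which the bus then delivers with one multiple one-to-one communication. The function name and data layout are our own, not from [12], [21]:

```python
def compress(items):
    """items: list of (active, data) per processor; returns active data, order kept."""
    prefix = 0
    dest = []
    for active, _ in items:
        dest.append(prefix)            # ps_i counts active processors before P_i
        prefix += 1 if active else 0
    out = [None] * prefix
    for (active, data), j in zip(items, dest):
        if active:
            out[j] = data              # one-to-one send: P_i -> P_{ps_i}
    return out

print(compress([(1, 'a'), (0, 'b'), (1, 'c'), (1, 'd')]))  # ['a', 'c', 'd']
```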

3 SOME FUNDAMENTAL ALGORITHMS

3.1 $O(1)$ Time Sorting Algorithms on the LARPBS

Lemma 2 [22]. $N$ items can be sorted in $O(1)$ time using an LARPBS of $N^2$ processors.

Lemma 2 is denoted as the basic sorting algorithm. It can be implemented by reconfiguration, comparison, and several primitive operations such as multicasting, binary summation, and (multiple) one-to-one communication. See [22] for details. Based on Lemma 2 and the column sort algorithm presented by Leighton [9], in the following we derive two $O(1)$ time sorting algorithms on the LARPBS. A brief description of column sort is first stated as follows: Column sort is a generalization of Batcher's odd-even merge. It may be used to sort $N$ input data items and produces a sorted array in column major order if these items can be arranged as a matrix $M$ with $r$ rows by $s$ columns, where $r \ge s^2$, $r \cdot s = N$, and $r \bmod s = 0$. See [9], [14], [32] for reference. Here, we choose $r = N^{2/3}$ and $s = N^{1/3}$. Column sort can be implemented in eight steps. In Steps 1, 3, 5, and 7, each column of matrix $M$ is sorted in ascending order. Step 2 performs a "transpose" operation; i.e., we pick up all the items of $M$ in column major order and put them back in row major order. Step 4 performs an "untranspose" operation, which is the reverse of Step 2. That is, we pick up all the items of $M$ in row major order and put them back in column major order. Step 6 performs a "shift" operation: we shift all the items of $M$ in column major order by $\lfloor r/2 \rfloor$. Clearly, this step increases the number of columns by one; i.e., $M$ becomes an $r \times (s+1)$ matrix. Step 8 performs an "unshift" operation, which is the reverse of Step 6. In this step, the number of columns decreases by one; i.e., $M$ becomes an $r \times s$ matrix again. Steps 2, 4, 6, and 8 can be performed in $O(1)$ time on an LARPBS of $N$ processors by performing multiple one-to-one communication. Since in each of Steps 1, 3, 5, and 7, each column of size $N^{2/3}$ can be sorted in $O(1)$ time using an LARPBS of $N^{2/3} \times N^{2/3} = N^{4/3}$ processors, and $s$ columns (where $s = N^{1/3}$) perform column sort in parallel, $N$ numbers can be sorted in $O(1)$ time using an LARPBS of $N^{4/3} \times N^{1/3} = N^{5/3}$ processors. This concludes Lemma 3.

Lemma 3. The task of sorting an $N$-item sequence can be performed in $O(1)$ time on an LARPBS of $N^{5/3}$ processors.

Lemma 3 will be invoked by the first $O(1)$ ANN algorithm as described in Section 5. In the following, we derive another sorting algorithm for the second $O(1)$ ANN algorithm as described in Section 6. Since $N^{(2/3)^{t-1}} = N^{(2/3)^t} \cdot N^{(2/3)^t/2}$, $N^{(2/3)^{t-1}}$ numbers can be sorted in $O(1)$ time using an LARPBS of $N^{(2/3)^t} \cdot N^{(2/3)^t} \cdot N^{(2/3)^t/2} = N^{(2/3)^t} \cdot N^{(2/3)^{t-1}}$ processors, based on Lemma 2 and column sort. Lemma 3 is the case of $t = 1$. In the case of $t = 2$, $N^{2/3} = N^{4/9} \cdot N^{2/9}$ and, hence, $N^{2/3}$ numbers can be sorted in $O(1)$ time using an LARPBS of $N^{4/9} \cdot N^{4/9} \cdot N^{2/9} = N^{10/9}$ processors. So, if we sort each column of size $N^{2/3}$ in $O(1)$ time using an LARPBS of $N^{10/9}$ processors in each of Steps 1, 3, 5, and 7, and $s$ columns (where $s = N^{1/3}$) perform column sort in parallel, then $N$ numbers can be sorted in $O(1)$ time using an LARPBS of $N^{10/9} \cdot N^{1/3} = N^{13/9}$ processors. This concludes Lemma 4. Of course, we can recursively sort each column of size $N^{(2/3)^t}$, where $t \ge 3$, and conclude that the task of sorting an $N$-item sequence can be performed in $O(1)$ time on an LARPBS of $N^{1+(2/3)^t}$ processors, where $t$ stands for the number of times the column sort algorithm is invoked. This can be proven easily by induction, and we skip the proof here for simplicity. However, the lemma that sorts an $N$-item sequence in $O(1)$ time on an LARPBS of $N^{1+(2/3)^t}$ processors is not recommended, because it leads to a larger $O(1)$ time factor, and sorting is not the bottleneck of this paper.

Lemma 4. The task of sorting an $N$-item sequence can be performed in $O(1)$ time on an LARPBS of $N^{13/9}$ processors.

3.2 A Scalable $O(1)$ Time Minimum Finding Algorithm

Lemma 5 [21]. The minimum value of $N$ data items can be computed in $O(1)$ time on an LARPBS of size $\frac{1}{2}N^2$.

Lemma 5, which introduces the basic minimum finding algorithm, can be implemented by performing the multicasting, broadcasting, and binary summation operations, as described in Lemma 2. Here, we skip the details. Based on Lemma 5 and [35], we derive a scalable $O(1)$ time minimum finding algorithm, named Algorithm MFA, for finding the minimum value of $N$ numbers in constant time on an LARPBS of size $\frac{1}{2}N^{1+\epsilon}$, where $0 < \epsilon = \frac{1}{2^{c+1}-1} \le 1$ and $c$ is a constant and positive integer.

Algorithm MFA
Input: $N$ data items. Suppose that each data item with index $i$, $0 \le i \le N-1$, is initially allocated in the processor with index $\frac{1}{2}iN^{\epsilon}$ (i.e., $P_{\frac{1}{2}iN^{\epsilon}}$) of an LARPBS of size $\frac{1}{2}N^{1+\epsilon}$.
Output: The minimum value of these $N$ input data items.

Step 1: Assume $\epsilon = \frac{1}{2^{c+1}-1}$, where $c$ is a constant. As described above, we assume that the $N$ data items with index $i$, where $0 \le i \le N-1$, are initially given in the processor with index $\frac{1}{2}iN^{\epsilon}$ (i.e., $P_{\frac{1}{2}iN^{\epsilon}}$) of an LARPBS of size $\frac{1}{2}N^{1+\epsilon}$ (i.e., $\frac{1}{2}N^{1+\frac{1}{2^{c+1}-1}}$). First, we partition the processors of this LARPBS, in ascending order of processor index, into $N^{1-\epsilon}$ (i.e., $N^{\frac{2^{c+1}-2}{2^{c+1}-1}}$) subsets, each of size $\frac{1}{2}N^{2\epsilon}$. The corresponding $N$ data items with index $i$ (which have been given in processor $P_{\frac{1}{2}iN^{\epsilon}}$, $0 \le i \le N-1$) are automatically partitioned into $N^{1-\epsilon}$ subsets, each of size $N^{\epsilon}$. Refer to Fig. 3a for illustration.

Fig. 3. An illustration of Lemma 6.

Each subset of $\frac{1}{2}N^{2\epsilon}$ contiguous processors is responsible for finding the minimum of its subset of $N^{\epsilon}$ data items in $O(1)$ time by invoking Lemma 5 (which is implemented by performing the multicasting, broadcasting, and binary summation operations). In each subset, only one data item (i.e., the minimum) survives. Then, we allocate each of the surviving data items in the processor with index $\frac{1}{2}iN^{2\epsilon}$ (i.e., $P_{\frac{1}{2}iN^{2\epsilon}}$), where $0 \le i \le N^{1-\epsilon}-1$ (by performing one-to-one communication), in $O(1)$ time on the LARPBS of size $\frac{1}{2}N^{1+\epsilon}$. So, at the end of this step, we are left with $N^{1-\epsilon}$ surviving data items, each of them stored in a processor $P_{\frac{1}{2}iN^{2\epsilon}}$.

Step 2: In this step, we partition the processors, in ascending order of processor index, into $N^{1-3\epsilon}$ (i.e., $N^{\frac{2^{c+1}-4}{2^{c+1}-1}}$) subsets, each of size $\frac{1}{2}N^{4\epsilon}$. The surviving $N^{1-\epsilon}$ data items (which have been placed in the processors $P_{\frac{1}{2}iN^{2\epsilon}}$, where $0 \le i \le N^{1-\epsilon}-1$) are also automatically partitioned into $N^{1-3\epsilon}$ subsets, each of size $N^{2\epsilon}$. Refer to Fig. 3b for illustration. Each subset of $\frac{1}{2}N^{4\epsilon}$ processors is responsible for finding the minimum of its subset of the surviving $N^{2\epsilon}$ data items in $O(1)$ time. Similarly, after this step, only the minimum in each subset survives. We then allocate each surviving data item in the processor with index $\frac{1}{2}iN^{4\epsilon}$ (i.e., $P_{\frac{1}{2}iN^{4\epsilon}}$). So, at the end of this step, we are left with $N^{1-3\epsilon}$ surviving data items, each of them stored in a processor $P_{\frac{1}{2}iN^{4\epsilon}}$, where $0 \le i \le N^{1-3\epsilon}-1$.

Step c: The steps described above can be performed iteratively. In Step $c$, where $c$ is a constant and positive integer, the $\frac{1}{2}N^{1+\epsilon}$ processors are partitioned into $N^{1-(2^c-1)\epsilon}$ subsets, each of size $\frac{1}{2}N^{2^c\epsilon}$. The surviving $N^{1-(2^{c-1}-1)\epsilon}$ data items are also automatically partitioned into $N^{1-(2^c-1)\epsilon}$ subsets, each of size $N^{2^{c-1}\epsilon}$. Each subset of $\frac{1}{2}N^{2^c\epsilon}$ processors is responsible for finding the minimum of its subset of $N^{2^{c-1}\epsilon}$ data items in $O(1)$ time. Only the minimum in each subset survives. We then allocate each surviving data item in the processor with index $\frac{1}{2}iN^{2^c\epsilon}$ (i.e., $P_{\frac{1}{2}iN^{2^c\epsilon}}$). So, at the end of this step, we are left with $N^{1-(2^c-1)\epsilon}$ surviving data items, each of them stored in a processor $P_{\frac{1}{2}iN^{2^c\epsilon}}$, where $0 \le i \le N^{1-(2^c-1)\epsilon}-1$.

Step c+1: Finally, in Step $c+1$, the $\frac{1}{2}N^{1+\epsilon}$ processors are partitioned into $N^{1-(2^{c+1}-1)\epsilon} = N^0 = 1$ subset of size

$\frac{1}{2}N^{2^{c+1}\epsilon} = \frac{1}{2}N^{1+\epsilon}$ (since $\epsilon = \frac{1}{2^{c+1}-1}$, we have $2^{c+1}\epsilon = 2^c\epsilon + 2^c\epsilon = 1+\epsilon$ and $1-(2^{c+1}-1)\epsilon = 0$). The surviving $N^{1-(2^c-1)\epsilon}$ data elements are also automatically partitioned into $N^{1-(2^{c+1}-1)\epsilon} = N^0 = 1$ subset of size $N^{2^c\epsilon}$. Refer to Fig. 3c for illustration. All the $\frac{1}{2}N^{2^{c+1}\epsilon}$ processors are responsible for finding the minimum of the subset of $N^{2^c\epsilon}$ surviving data items in $O(1)$ time. So, the minimum of the $N$ data items is computed at the end of Step $c+1$. We allocate it in processor $P_0$. Since each step can be computed in $O(1)$ time using $\frac{1}{2}N^{1+\epsilon}$ processors and the minimum can be computed in $c+1$ steps, where $c$ is a constant, this general minimum finding algorithm runs in $(c+1) \cdot O(1) = O(1)$ time on an LARPBS of size $\frac{1}{2}N^{1+\epsilon}$, where $0 < \epsilon = \frac{1}{2^{c+1}-1} \le 1$. This concludes the following lemma.

Lemma 6. The task of finding the minimum of an $N$-item sequence can be performed in $O(1)$ time on an LARPBS of size $\frac{1}{2}N^{1+\epsilon}$, where $0 < \epsilon = \frac{1}{2^{c+1}-1} \le 1$ and $c$ is a constant and positive integer.

Clearly, Lemma 5 is a special case of Lemma 6 (when $c = 0$). Note that in each step of this algorithm, the processors are always fully utilized, and $\epsilon$ converges toward zero very fast. This implies that the number of processors needed drops dramatically even if $c$ increases very slowly. For example, let $N = 10^{15}$. If we invoke Lemma 5, it takes $5 \times 10^{29}$ processors to find the minimum in $O(1)$ time; however, if we invoke Lemma 6 and choose $c = 3$, it takes only $5 \times 10^{15}$ processors. Obviously, a huge saving in processors ($10^{14}$ times) is rewarded by this algorithm when we increase $c$ from 0 to 3. Although $c$ can be any nonnegative integer, to take full advantage of this algorithm we suggest that $c \le 3$ or 5. This lemma can be used as a subalgorithm to solve many other related parallel computation problems on the LARPBS more efficiently.
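The exponent bookkeeping of Algorithm MFA can be traced with a few lines of Python (our own illustration): with $\epsilon = \frac{1}{2^{c+1}-1}$, the subset size doubles in the exponent each round, so exactly $c+1$ rounds reduce the $N$ survivors to one:

```python
def mfa_rounds(N, c):
    """Trace the survivor counts of Algorithm MFA on N items (exponents of N)."""
    eps = 1.0 / (2 ** (c + 1) - 1)
    exponent = 1.0                       # survivors = N^exponent, initially N^1
    for step in range(1, c + 2):         # Steps 1 .. c+1
        group = (2 ** (step - 1)) * eps  # each subset holds N^(2^(step-1)*eps) items
        exponent -= group                # one survivor (the minimum) per subset
        print(f"Step {step}: subsets of N^{group:.4f}, survivors N^{max(exponent, 0):.4f}")

mfa_rounds(N=10 ** 15, c=3)              # after Step c+1 = 4, N^0 = 1 item is left
```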

4 SOME BASIC GEOMETRIC PROPERTIES OF ANN

First, we introduce the 2D geometric properties upon which our 2D_ANN and 2D_CP algorithms are based. See Fig. 4 for reference.

Fig. 4. The geometric properties of the 2D_ANN and 2D_CP problems.

Lemma 7 [18]. Given an arbitrary set $S$ of points in a 2D plane, and arbitrary numbers $x_1 < x_2$ and $y_1 < y_2$, let $R$ be a rectangle defined by $R = \{(x, y) \mid x_1 \le x \le x_2 \text{ and } y_1 \le y \le y_2\}$, let $p$ be any point of $R \cap S$, let $d(p, q)$ be the Euclidean distance between points $p$ and $q$, let $D(p) = \min\{d(p, q) \mid q \ne p, q \in S\}$ (i.e., the distance from point $p$ to its global nearest neighbor), and let $D'(p) = \min\{d(p, q) \mid q \ne p,\ x_1 \le X_q \le x_2 \text{ or } y_1 \le Y_q \le y_2,\ q = (X_q, Y_q) \in S\}$ (i.e., the distance from point $p$ to its local nearest neighbor in the interval $[x_1, x_2] \cup [y_1, y_2]$). Then, the following statements are true.

1. If, for all four corners $c_i$, $0 \le i \le 3$, of the rectangle $R$, $D'(p) < d(p, c_i)$ holds, then $D(p) = D'(p)$. That is, the distance from point $p$ to its local nearest neighbor (i.e., $D'(p)$) turns out to be the distance from point $p$ to its global nearest neighbor (i.e., $D(p)$).
2. If $p$ is any point of $R \cap S$ such that $D(p) < D'(p)$ (i.e., if the global nearest neighbor of point $p$ is neither in the interval $[x_1, x_2]$ nor in the interval $[y_1, y_2]$), then there exists at least one corner $c_i$ of the rectangle $R$ such that $d(p, c_i) < D'(p)$.
3. There are at most two points $p \in R \cap S$ such that $d(p, c_i) < D'(p)$ for each corner $c_i$, $0 \le i \le 3$, of the rectangle $R$; that is, over all four corners of the rectangle $R$, there are in total at most eight points $p \in R \cap S$ such that $d(p, c_i) < D'(p)$ for some corner $c_i$.
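In our algorithms, Lemma 7 is applied as a pure filter: a point $p$ keeps $D'(p)$ as its final answer unless some rectangle corner is closer than $D'(p)$, in which case $p$ must be re-examined globally. A minimal Python sketch of this test (our own illustration) is:

```python
import math

def needs_global_search(p, corners, d_local):
    """True iff some corner of p's rectangle is closer than D'(p) (Lemma 7(2))."""
    return any(math.dist(p, c) < d_local for c in corners)

corners = [(0, 0), (4, 0), (0, 4), (4, 4)]
print(needs_global_search((2, 2), corners, 1.0))      # False: D'(p) = D(p)
print(needs_global_search((0.5, 0.5), corners, 2.0))  # True: corner (0,0) beats D'(p)
```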

We then extend the geometric properties from the 2D plane to the 3D space as follows.

Lemma 8. Given an arbitrary set $S$ of points in a 3D space, and arbitrary numbers $x_1 < x_2$, $y_1 < y_2$, and $z_1 < z_2$, let $R$ be a rectangular parallelepiped (or box) defined by $R = \{(x, y, z) \mid x_1 \le x \le x_2 \text{ and } y_1 \le y \le y_2 \text{ and } z_1 \le z \le z_2\}$, let $p$ be any point of $R \cap S$, let $d(p, q)$ be the Euclidean distance between points $p$ and $q$, let $D(p) = \min\{d(p, q) \mid q \ne p, q \in S\}$ (i.e., the distance from point $p$ to its global nearest neighbor), and let $D'(p) = \min\{d(p, q) \mid q \ne p,\ x_1 \le X_q \le x_2 \text{ or } y_1 \le Y_q \le y_2 \text{ or } z_1 \le Z_q \le z_2,\ q \in S\}$ (i.e., the distance from point $p$ to its local nearest neighbor in the interval $[x_1, x_2] \cup [y_1, y_2] \cup [z_1, z_2]$). Then, the following statements are true.

1. If $p$ is any point of $R \cap S$ and, for all eight corners $c_i$, $0 \le i \le 7$, of the rectangular box $R$, $D'(p) < d(p, c_i)$ holds, then $D'(p) = D(p)$. That is, $D'(p)$ is the distance from point $p$ to its global nearest neighbor.
2. If $p$ is any point of $R \cap S$ such that $D(p) < D'(p)$, then there exists a corner $c_i$ of the rectangular box $R$ such that $d(p, c_i) < D'(p)$.
3. There are at most three points $p \in R \cap S$ such that $d(p, c_i) < D'(p)$ for each corner $c_i$, $0 \le i \le 7$, of the rectangular box $R$; that is, over all eight corners of the rectangular box $R$, there are in total at most 24 points $p \in R \cap S$ such that $d(p, c_i) < D'(p)$ for some corner $c_i$.

Proof.

1. If, for all eight corners $c_i$, $0 \le i \le 7$, of the rectangular box $R$, $D'(p) < d(p, c_i)$ holds, it means that the distance $D'(p)$ is smaller than the distance from point $p$ to any point $q$ which is not in the interval $[x_1, x_2] \cup [y_1, y_2] \cup [z_1, z_2]$. This implies that $D'(p)$ is the distance from point $p$ to its global nearest neighbor, i.e., $D'(p) = D(p)$.

2. If $p \in R \cap S$ is such that $D(p) < D'(p)$ (which implies that the global nearest neighbor of point $p$ is not in the interval $[x_1, x_2] \cup [y_1, y_2] \cup [z_1, z_2]$), then there is a point $q = (x, y, z) \in S$ such that $D(p) = d(p, q)$, where $x$ is not in the interval $[x_1, x_2]$, $y$ is not in the interval $[y_1, y_2]$, and $z$ is not in the interval $[z_1, z_2]$. Assume $x > x_2 > x_1$, $y > y_2 > y_1$, and $z > z_2 > z_1$; then there exists a corner $c_i = (x_2, y_2, z_2)$ such that $d(p, c_i) < d(p, q) = D(p) < D'(p)$. All other seven combinations of $x, y, z$ values can be proved similarly.

3. First, for clarity and without loss of generality, let $c_i = (0, 0, 0)$ be any of the eight corners of the rectangular box $R$, and suppose $p, q, r \in R \cap S$ are such that $d(p, c_i) < D'(p)$, $d(q, c_i) < D'(q)$, and $d(r, c_i) < D'(r)$. See Fig. 5a for illustration. Then, it must be that $\angle p c_i q \ge \frac{\pi}{3}$, $\angle p c_i r \ge \frac{\pi}{3}$, and $\angle q c_i r \ge \frac{\pi}{3}$. Hence, there can be at least three points (i.e., $p, q, r$) of $R \cap S$ which are closer to $c_i$ than to any other points of $R \cap S$. We now show that there are at most three points of $R \cap S$ which are closer to a given corner $c_i$ than to any other points of $R \cap S$. This can be proven by contradiction. Assume that there exists another point $w$ which is closer to $c_i$ than to any other points of $R \cap S$. There are two cases to be analyzed as follows. For clarity and without loss of generality, we assume that all four points are on the surface of a unit ball, as shown in Fig. 5.

Fig. 5. The geometric properties of the 3D_ANN and 3D_CP problems.

In case one, assume these four points are organized as a star. In order for point $w$ to have enough space, we move points $p, q, r$ to coordinates $(1, 0, 0)$, $(0, 1, 0)$, and $(0, 0, 1)$, respectively, to expand the area covered by $\triangle pqr$ to be as large as possible. See Fig. 5b for illustration. In this case, the best coordinates at which to place point $w$ are those with equal distances to all of $p$, $q$, and $r$; that is, the coordinates of $w$ are $(\frac{1}{\sqrt{3}}, \frac{1}{\sqrt{3}}, \frac{1}{\sqrt{3}})$. Then,

$d(w, p) = d(w, q) = d(w, r) = \left( \left(1 - \frac{1}{\sqrt{3}}\right)^2 + 2\left(\frac{1}{\sqrt{3}}\right)^2 \right)^{1/2} \approx 0.92 < 1.$

A contradiction.

In case two, assume these four points are organized as a ring. In order for point $w$ to have enough space, we move point $q = (0, 1, 0)$ along the arc $qq'p$ to $q'$ and move point $r = (0, 0, 1)$ along the arc $rr'p$ to $r'$, shrinking the area covered by $\triangle pq'r'$ to be as small as possible. However, $d(p, q') \ge 1$, $d(p, r') \ge 1$, and $d(q', r') \ge 1$ must hold. Now, letting $d(p, q') = 1$ and $d(p, r') = 1$, we get

$q' = \left(X_{q'}, \sqrt{1 - X_{q'}^2}, 0\right) = \left(\frac{1}{2}, \frac{\sqrt{3}}{2}, 0\right)$

and

$r' = \left(X_{r'}, 0, \sqrt{1 - X_{r'}^2}\right) = \left(\frac{1}{2}, 0, \frac{\sqrt{3}}{2}\right).$

To fully utilize the remaining area for placing point $w$ on the surface of the unit ball at corner $c_i$, point $w$ should be on the arc $rwq$, as shown in Fig. 5c. Letting $d(r', w) = 1$, i.e.,

$w = \left(0, Y_w, \sqrt{1 - Y_w^2}\right) = \left(0, \sqrt{\tfrac{2}{3}}, \sqrt{\tfrac{1}{3}}\right),$

we then get $d(w, q') \approx 0.765 < 1$. A contradiction. □
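The two contradictions in the proof of statement 3 can be verified numerically; the short Python check below (ours) reproduces the constants 0.92 and 0.765:

```python
import math

p, q, r = (1, 0, 0), (0, 1, 0), (0, 0, 1)  # points on the unit ball around c_i = (0,0,0)

# Case one (star): w equidistant from p, q, and r.
w = (1 / math.sqrt(3),) * 3
print(round(math.dist(w, p), 2))           # 0.92 < 1, a contradiction

# Case two (ring): q, r moved so that d(p,q') = d(p,r') = 1; w with d(r',w) = 1.
q1 = (1 / 2, math.sqrt(3) / 2, 0)
r1 = (1 / 2, 0, math.sqrt(3) / 2)
w2 = (0, math.sqrt(2 / 3), math.sqrt(1 / 3))
print(round(math.dist(p, q1), 3), round(math.dist(r1, w2), 3))  # 1.0 1.0
print(round(math.dist(w2, q1), 3))         # 0.765 < 1, a contradiction
```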

In [8], Lai and Sheng proved that there are at most $k \cdot 2^{k+1}$ points $p \in R \cap S$ such that $d(p, c_i) < D'(p)$ for each corner $c_i$, $0 \le i \le 2^k - 1$, of the supercube $R$ in the $k$-dimensional space (i.e., at most 16 points for each 2D corner and at most 48 points for each 3D corner). More precisely, in this paper we show that there exist at most two such points for each 2D corner, and we further prove that there are at most three points for each 3D corner such that $d(p, c_i) < D'(p)$. See Lemmas 7 and 8 for details.

5 THE FIRST $O(1)$ TIME ALGORITHM FOR SOLVING THE 2D_ANN, 2D_CP, 3D_ANN, AND 3D_CP PROBLEMS

In this section, we present an algorithm for computing the kD_ANN and kD_CP problems of $n$ points in $O(1)$ time on an LARPBS of size $\frac{1}{2}n^{(2k+1)/(k+1)+\epsilon}$, where $k = 3$ (or 2), $0 < \epsilon = \frac{1}{2^{c+1}-1} \le 1$, and $c$ is a constant and positive integer. That is, the 2D_ANN and 2D_CP problems of $n$ points can be solved in $O(1)$ time on an LARPBS of size $\frac{1}{2}n^{5/3+\epsilon}$, and the 3D_ANN and 3D_CP problems of $n$ points can be solved in $O(1)$ time on an LARPBS of size $\frac{1}{2}n^{7/4+\epsilon}$, where $0 < \epsilon = \frac{1}{2^{c+1}-1} \le 1$ and $c$ is a constant and positive integer.

5.1 Description of Algorithm

Solving the ANN problem depends deeply on the sorting algorithm used to presort the input points and on the minimum finding algorithm used to find the nearest neighbors of all the input points. In this algorithm, we use the scalable $O(1)$ time minimum finding algorithm (i.e., Lemma 6) and the efficient $O(1)$ time sorting algorithm (i.e., Lemma 3, which is derived by invoking column sort only once) as building blocks, and we invoke the geometric properties described in Lemma 8 (or 7) only once to devise the $O(1)$ time kD_ANN algorithm, where $k = 3$ (or 2). It is well known that giving up recursive invocation of Lemma 8 (or 7) results in using more processors to devise an $O(1)$ time algorithm. Fortunately, compared with the $O(1)$ time algorithms introduced in [5] and [8], which are the best previous algorithms known, this $O(1)$ time algorithm not only uses fewer processors but also executes faster; hence, it is a very efficient algorithm.

In the following, we solve the kD_ANN problem of $n$ points in four steps. In Step 1, we presort the $n$ input points by ascending $h$th coordinate, where $1 \le h \le k$, $k = 3$ (or 2). In Step 2, we partition the sorted input set into $n^{1/(k+1)}$ slabs, each of $n^{k/(k+1)}$ points. In Step 3, we solve the kD_ANN problem of each slab locally (i.e., within its own slab). In each coordinate, there are $n^{1/(k+1)}$ slabs to be processed in parallel. Finally, in Step 4, based on the geometric properties stated in Lemma 8 (or 7), all the $n$ input points find their global kD_ANNs within the input set $V$ of $n$ points. After we solve the kD_ANN problem, the kD_CP problem can be easily solved by simply taking the minimum over the kD_ANN distances of all points in the given set. We state this algorithm in detail as follows. Note that in the remainder of this paper, we describe the algorithms in terms of the 3D case, while some illustrations are based on the 2D case for simplicity. In the 2D case, one simply omits all the notations and computations related to the Z-coordinate.

Algorithm 3D_ANN_1 (or 2D_ANN_1)
Input: A set $V = \{v_0, v_1, \ldots, v_i, \ldots, v_{n-1}\}$ of $n$ points in the 3D space (or 2D plane).
Output: The nearest point for each point and the closest pair from the given set.

Step 1: // Presorting points in $V$ by ascending $h$th coordinate. // Sort the points in $V$ by ascending $h$th coordinate, where $1 \le h \le 3$. Based on Lemma 3, it takes $O(1)$ time on an LARPBS of $n^{5/3}$ processors to sort the $n$ points in each coordinate. Let the sorted sequence be $s_0^h, s_1^h, \ldots, s_i^h, \ldots, s_{n-1}^h$, denoted $S^h$. Assume that $\psi_h(i)$ is the index such that $v_{\psi_h(i)} = s_i^h$, where $0 \le i, \psi_h(i) \le n-1$. Then, $S^h = \{s_i^h \mid 0 \le i \le n-1\} = \{v_{\psi_h(i)} \mid 0 \le i \le n-1\}$. We store the point $v_i$, $s_i^h$, and the index $\psi_h(i)$ in $P(i, 0)$. As an example, consider a set $V = \{v_0, v_1, v_2, v_3, v_4, v_5, v_6, v_7\}$ of eight points in the 2D plane, as shown in Fig. 6a. Fig. 6b shows the sorted sequence $s_0^1, s_1^1, \ldots, s_7^1$ in ascending X-coordinate (i.e., the 1st coordinate), and Fig. 6c shows the sorted sequence $s_0^2, s_1^2, \ldots, s_7^2$ in ascending Y-coordinate (i.e., the 2nd coordinate). Table 1 shows the index mapping between $s_i^1$ and $v_{\psi_1(i)}$. Table 2 shows the index mapping between $s_i^2$ and $v_{\psi_2(i)}$.

Fig. 6. An illustration of the index mapping.

TABLE 1. Index Mapping between $s_i^1$ and $v_{\psi_1(i)}$.

TABLE 2. Index Mapping between $s_i^2$ and $v_{\psi_2(i)}$.

Step 2: // Partitioning each $S^h$ into $n^{1/(k+1)}$ slabs, each slab containing $n^{k/(k+1)}$ points. // For $h = 1$ (i.e., the X-coordinate), we divide the sorted sequence $S^1$ into $\lfloor n^{1/(k+1)} \rfloor$ slabs, each containing $\lceil n / \lfloor n^{1/(k+1)} \rfloor \rceil$ points. For readability and without loss of generality, in this paper $\lfloor n^{1/(k+1)} \rfloor$ is written as $n^{1/(k+1)}$ and $\lceil n / \lfloor n^{1/(k+1)} \rfloor \rceil$ is written as $n^{k/(k+1)}$; in the remainder of this paper, all partitioning steps are written in a similar way. We denote each slab by $S_\alpha^1 = \{s_i^1 \mid \alpha n^{k/(k+1)} \le i \le (\alpha+1)n^{k/(k+1)} - 1\}$, where $0 \le \alpha \le n^{1/(k+1)} - 1$. The slabs $S_\alpha^1$ and $S_{\alpha+1}^1$ are separated by a plane (for the 3D case) or a line (for the 2D case) which passes through the first point of each slab $S_{\alpha+1}^1$, where $0 \le \alpha \le n^{1/(k+1)} - 2$, and is perpendicular to the X-axis. The Y and Z-coordinates are processed similarly. That is, $S^2$ is divided into $n^{1/(k+1)}$ slabs, each slab $S_\beta^2 = \{s_i^2 \mid \beta n^{k/(k+1)} \le i \le (\beta+1)n^{k/(k+1)} - 1\}$, and $S^3$ is also divided into $n^{1/(k+1)}$ slabs, each slab $S_\gamma^3 = \{s_i^3 \mid \gamma n^{k/(k+1)} \le i \le (\gamma+1)n^{k/(k+1)} - 1\}$, where $0 \le \beta, \gamma \le n^{1/(k+1)} - 1$. Then, in the 3D case, let $S_\alpha^1 \cap S_\beta^2 \cap S_\gamma^3$ be defined as the rectangular box $R(\alpha, \beta, \gamma)$. The 3D space is thus divided into $(n^{1/(k+1)})^3 = n^{3/4}$ rectangular boxes. Similarly, in the 2D case, let $S_\alpha^1 \cap S_\beta^2$ be defined as the rectangle $R(\alpha, \beta)$. The 2D plane is divided into $(n^{1/(k+1)})^2 = n^{2/3}$ rectangles. See Fig. 7 for illustration.
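The following sequential Python sketch (our own; the names and the choice $k = 2$ are for illustration) mirrors Steps 1-2: presort by each coordinate and cut each sorted order into $n^{1/(k+1)}$ slabs of $n^{k/(k+1)}$ points; intersecting the slabs of the two coordinates yields the rectangles $R(\alpha, \beta)$:

```python
def build_slabs(points, k=2):
    """Steps 1-2 for the 2D case: slabs[h][a] corresponds to S^(h+1)_alpha."""
    n = len(points)
    m = round(n ** (1 / (k + 1)))        # n^(1/(k+1)) slabs per coordinate
    size = -(-n // m)                    # ceil(n/m) = n^(k/(k+1)) points per slab
    slabs = []
    for h in range(k):                   # h-th coordinate: 0 = X, 1 = Y
        order = sorted(points, key=lambda v: v[h])   # Step 1: presort
        slabs.append([order[a * size:(a + 1) * size] for a in range(m)])
    return slabs

pts = [(7, 1), (3, 6), (5, 2), (1, 8), (8, 4), (2, 3), (6, 7), (4, 5)]
S = build_slabs(pts)                     # n = 8, m = 2: two slabs per coordinate
print(S[0][0])                           # the x-slab S^1_0: the four leftmost points
```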

Step 3: // For the 3D case, each point $v_i \in S_\alpha^1 \cap S_\beta^2 \cap S_\gamma^3 \cap V$ finds its local 3D_ANN within the bounded 3D space $S_\alpha^1 \cup S_\beta^2 \cup S_\gamma^3$. For the 2D case, each point $v_i \in S_\alpha^1 \cap S_\beta^2 \cap V$ finds its local 2D_ANN within the bounded 2D plane $S_\alpha^1 \cup S_\beta^2$. // We use both the notations $P_l$ and $P(i, j)$ to represent an LARPBS of $\frac{1}{2}n^{(2k+1)/(k+1)+\epsilon}$ processors, where $l = i \cdot \frac{1}{2}n^{k/(k+1)+\epsilon} + j$, $0 \le i \le n-1$, $0 \le j \le \frac{1}{2}n^{k/(k+1)+\epsilon} - 1$. Fig. 8 shows the processor partition for the 2D case. We assign an LARPBS of $\frac{1}{2}n^{\frac{k}{k+1}(1+\epsilon)}$ processors to each point in slab $S_\alpha^1$ to find its nearest neighbor within its own slab (i.e., $S_\alpha^1$) in $O(1)$ time. Since there are $n^{k/(k+1)}$ points in slab $S_\alpha^1$, we assign an LARPBS of $n^{k/(k+1)} \cdot \frac{1}{2}n^{\frac{k}{k+1}(1+\epsilon)} = \frac{1}{2}n^{\frac{2k}{k+1}+\frac{k}{k+1}\epsilon}$ processors for all these $n^{k/(k+1)}$ points in slab $S_\alpha^1$ to be processed in parallel. See the dashed block of Fig. 8 for the illustration of the 2D case. Since the sorted sequence $S^1$ of $n$ points is divided into $n^{1/(k+1)}$ slabs $S_\alpha^1$, each containing $n^{k/(k+1)}$ points, it takes $\frac{1}{2}n^{\frac{2k+1}{k+1}+\frac{k}{k+1}\epsilon}$ processors for all the $n$ points to find their respective local kD_ANNs, denoted $N^\alpha(s_i^1)$, and the associated distances, denoted $min^\alpha(s_i^1)$, within their respective slabs in parallel in $O(1)$ time. The $N^\beta(s_i^2)$ and $min^\beta(s_i^2)$ in ascending Y-coordinate, and the $N^\gamma(s_i^3)$ and $min^\gamma(s_i^3)$ in ascending Z-coordinate, are computed similarly and sequentially. Then, for each $i$, the computed $min^\alpha(s_i^1)$, $min^\beta(s_i^2)$, and $min^\gamma(s_i^3)$ are routed from $P(i, 0)$ to $P(\psi_1(i), 0)$, $P(\psi_2(i), 0)$, and $P(\psi_3(i), 0)$, respectively, where $0 \le \psi_1(i), \psi_2(i), \psi_3(i) \le n-1$. As a result, each processor $P(i, 0)$, $0 \le i \le n-1$, holds all $k$ distances, denoted $min^1(v_i)$, $min^2(v_i)$, and $min^3(v_i)$, which are the distances from point $v_i$ to its nearest neighbors in slabs $S_\alpha^1$, $S_\beta^2$, and $S_\gamma^3$, respectively. Each processor $P(i, 0)$ then finds the minimum, denoted $min^{\alpha,\beta,\gamma}(v_i)$, of these three distances. That is, $min^{\alpha,\beta,\gamma}(v_i) = \min\{min^1(v_i), min^2(v_i), min^3(v_i)\}$. The point $N^{\alpha,\beta,\gamma}(v_i)$ associated with $min^{\alpha,\beta,\gamma}(v_i)$ is thereby found as well.

Fig. 7. An illustration of Algorithm 2D_ANN_1.

Fig. 8. An LARPBS of size $\frac{1}{2}n^{5/3+\epsilon}$ for Algorithm 2D_ANN_1, where $0 < \epsilon = \frac{1}{2^{c+1}-1} \le 1$ and $c$ is a constant and positive integer. A processor with index $l$ (i.e., $P_l$) is also denoted $P(i, j)$, where $l = i \cdot \frac{1}{2}n^{2/3+\epsilon} + j$, $0 \le i \le n-1$, $0 \le j \le \frac{1}{2}n^{2/3+\epsilon} - 1$.

Step 4: // Finally, we use $\frac{1}{2}n^{(2k+1)/(k+1)+\epsilon}$ processors to find the global kD_ANN and kD_CP, where $k$ is 3 (or 2), of all the $n$ points in $O(1)$ time. // For each point $v_i \in R(\alpha, \beta, \gamma) \cap V$ (or $v_i \in R(\alpha, \beta) \cap V$), we compute the distances from point $v_i$ to each of the $2^k$ corners of the rectangular box $R(\alpha, \beta, \gamma)$. We check whether any of the $2^k$ computed distances is smaller than $min^{\alpha,\beta,\gamma}(v_i)$, which was computed in Step 3. If the answer is affirmative, we mark this point (i.e., $v_i$). Based on Lemma 8 (or 7), for each unmarked point $v_i$, $min^{\alpha,\beta,\gamma}(v_i)$ turns out to be the distance from point $v_i$ to its global nearest neighbor, and we denote it by $min(v_i)$. We then collect the marked points by invoking the compression primitive operation introduced in Lemma 1. Based on Lemma 8 (or 7), for any of the $n^{k/(k+1)}$ rectangular boxes, each of its $2^k$ corners has at most $k$ marked points. So, there are at most $k \cdot 2^k \cdot n^{k/(k+1)}$ marked points in the k-D space. The marked points can be partitioned into at most $k \cdot 2^k$ groups, each of $n^{k/(k+1)}$ points. We assign $\frac{1}{2}n^{1+\epsilon}$ processors to each of the marked points to find its kD_ANN and associated distance $min(v_i)$ within the set $V$ of $n$ points. Hence, it takes at most $n^{k/(k+1)} \cdot \frac{1}{2}n^{1+\epsilon} = \frac{1}{2}n^{(2k+1)/(k+1)+\epsilon}$ processors for all the marked points $v_i$ in each group to find their kD_ANNs and associated distances $min(v_i)$ in parallel. This is repeated at most $k \cdot 2^k$ times so that the groups are processed sequentially. Finally, we use $\frac{1}{2}n^{1+\epsilon}$ processors to find the kD_CP by computing $\min\{min(v_i) \mid 0 \le i \le n-1\}$. This concludes the following theorem.

Theorem 1. The 2D_ANN and 2D_CP problems of $n$ points can be solved in $O(1)$ time on an LARPBS of size $\frac{1}{2}n^{5/3+\epsilon}$, and the 3D_ANN and 3D_CP problems of $n$ points can be solved in $O(1)$ time on an LARPBS of size $\frac{1}{2}n^{7/4+\epsilon}$, where $0 < \epsilon = \frac{1}{2^{c+1}-1} \le 1$ and $c$ is a constant and positive integer.
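For intuition, the whole first algorithm can be condensed into the following sequential Python reference of its logic for the 2D case (our own sketch; it assumes distinct points with $n = m^3$ and uses the slab point extents as rectangle corners, which can only over-mark and so remains correct):

```python
import math

def ann_cp_2d(pts):
    """Sequential reference of Algorithm 2D_ANN_1's logic (Steps 1-4)."""
    n = len(pts)
    m = round(n ** (1 / 3))
    size = n // m
    orders, slab_of = [], []
    local = {p: math.inf for p in pts}                 # D'(p)
    for h in (0, 1):                                   # Steps 1-3: per-axis slabs
        order = sorted(pts, key=lambda v: v[h])
        slab_of.append({p: i // size for i, p in enumerate(order)})
        orders.append(order)
        for a in range(m):
            slab = order[a * size:(a + 1) * size]
            for p in slab:                             # local NN inside the slab
                d = min((math.dist(p, q) for q in slab if q != p), default=math.inf)
                local[p] = min(local[p], d)
    ann = {}
    for p in pts:                                      # Step 4: corner test (Lemma 7)
        a, b = slab_of[0][p], slab_of[1][p]
        xs = (orders[0][a * size][0], orders[0][(a + 1) * size - 1][0])
        ys = (orders[1][b * size][1], orders[1][(b + 1) * size - 1][1])
        corners = [(x, y) for x in xs for y in ys]
        if any(math.dist(p, c) < local[p] for c in corners):   # marked point
            ann[p] = min(math.dist(p, q) for q in pts if q != p)
        else:                                          # unmarked: D'(p) = D(p)
            ann[p] = local[p]
    return ann, min(ann.values())                      # ANN distances and the CP

pts = [(7, 1), (3, 6), (5, 2), (1, 8), (8, 4), (2, 3), (6, 7), (4, 5)]
print(ann_cp_2d(pts)[1])                               # the closest pair distance
```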

6 THE SECOND $O(1)$ TIME ALGORITHM FOR SOLVING THE 2D_ANN, 2D_CP, 3D_ANN, AND 3D_CP PROBLEMS

In this section, we present the second algorithm, which recursively invokes Lemma 8 (or 7) $k$ times, for computing the kD_ANN and kD_CP problems of $n$ points in $O(1)$ time on an LARPBS of size $\frac{1}{2}n^{3/2+\epsilon}$, where $k = 3$ (or 2), $0 < \epsilon = \frac{1}{2^{c+1}-1} \le 1$, and $c$ is a constant and positive integer. The result is far better than those derived in [5] and [8]. First, we introduce Lemma 9, upon which the second algorithm is based.

Lemma 9. The ANN problem of $n^{(k+i)/2k}$ points in the k-D space can be solved in $O(1)$ time on an LARPBS of $\frac{1}{2}n^{\frac{2k+i}{2k}+\frac{k+i}{2k}\epsilon}$ processors, where $k = 3$ (or 2), $1 \le i \le k$, $\epsilon = \frac{1}{2^{c+1}-1}$, and $c$ is a constant and positive integer.

Proof. This lemma can be proven by the following steps:

1. For $i = 1$:

a. It takes $(n^{(k+1)/2k})^{13/9} = n^{\frac{13}{18}+\frac{13}{18k}}$ processors to presort the $n^{(k+1)/2k}$ points in ascending X-coordinate by invoking Lemma 4. Since $n^{(k+1)/2k} = n^{1/2k} \cdot n^{k/2k}$, the sorted sequence of $n^{(k+1)/2k}$ points can be divided into $m = n^{1/2k}$ slabs, each of $n^{k/2k} = n^{1/2}$ points. Then, based on Algorithm MFA, the ANN problem of $n^{1/2}$ points can be solved in $O(1)$ time on an LARPBS of $n^{1/2} \cdot \frac{1}{2}n^{\frac{1}{2}(1+\epsilon)} = \frac{1}{2}n^{1+\frac{1}{2}\epsilon}$ processors. So, in total it takes $n^{1/2k} \cdot \frac{1}{2}n^{1+\frac{1}{2}\epsilon} = \frac{1}{2}n^{\frac{2k+1}{2k}+\frac{1}{2}\epsilon}$ processors for all of the $n^{1/2k}$ slabs, each of $n^{1/2}$ points, to be processed in parallel. The Y and Z-coordinates are processed similarly and sequentially.

b. Since the sorted sequence of $n^{(k+1)/2k}$ points can be divided into $n^{1/2k}$ slabs in each coordinate, the slab of $n^{(k+1)/2k}$ points in the k-D space is divided into $(n^{1/2k})^k = n^{1/2}$ rectangular boxes (or rectangles). Based on Lemma 8 (or 7), and following Steps 3 and 4 of the first algorithm as described in Section 5, there are at most $k \cdot 2^k \cdot n^{1/2}$ marked points in the k-D space. The marked points can be partitioned into at most $k \cdot 2^k$ groups, each of $n^{1/2}$ points. Then, we assign $\frac{1}{2}n^{\frac{k+1}{2k}(1+\epsilon)}$ processors to each of the marked points to find its kD_ANN within these $n^{(k+1)/2k}$ points. In total, it takes at most $n^{1/2} \cdot \frac{1}{2}n^{\frac{k+1}{2k}(1+\epsilon)} = \frac{1}{2}n^{\frac{2k+1}{2k}+\frac{k+1}{2k}\epsilon}$ processors for all the marked points $v_i$ in a group to find their kD_ANNs and associated distances $min(v_i)$ within these $n^{(k+1)/2k}$ points in $O(1)$ time. Each group is processed sequentially.

From a and b, the number of processors needed is $\max(n^{\frac{13}{18}+\frac{13}{18k}}, \frac{1}{2}n^{\frac{2k+1}{2k}+\frac{1}{2}\epsilon}, \frac{1}{2}n^{\frac{2k+1}{2k}+\frac{k+1}{2k}\epsilon}) = \frac{1}{2}n^{\frac{2k+1}{2k}+\frac{k+1}{2k}\epsilon}$. So, the lemma holds when $i = 1$.

2. For $i = 2$:

a. Similarly, it takes $(n^{(k+2)/2k})^{13/9} = n^{\frac{13}{18}+\frac{26}{18k}}$ processors to presort the $n^{(k+2)/2k}$ points in ascending X-coordinate by invoking Lemma 4. The sorted sequence of size $n^{(k+2)/2k}$ can be divided into $m = n^{1/2k}$ slabs, each of $n^{(k+1)/2k}$ points. Based on the result of 1), the ANN problem of $n^{(k+1)/2k}$ points can be solved in $O(1)$ time on an LARPBS of $\frac{1}{2}n^{\frac{2k+1}{2k}+\frac{k+1}{2k}\epsilon}$ processors. So, in total it takes $n^{1/2k} \cdot \frac{1}{2}n^{\frac{2k+1}{2k}+\frac{k+1}{2k}\epsilon} = \frac{1}{2}n^{\frac{2k+2}{2k}+\frac{k+1}{2k}\epsilon}$ processors for all of the $n^{1/2k}$ slabs, each of $n^{(k+1)/2k}$ points, to be processed in parallel. The Y and Z-coordinates are processed similarly and sequentially.

b. Since the sorted sequence of $n^{(k+2)/2k}$ points can be divided into $n^{1/2k}$ slabs in each coordinate, the slab of $n^{(k+2)/2k}$ points in the k-D space is divided into $(n^{1/2k})^k = n^{1/2}$ rectangular boxes (or rectangles). Similarly, based on Lemma 8 (or 7), and following Steps 3 and 4 of the first algorithm as described in Section 5, there are at most $k \cdot 2^k \cdot n^{1/2}$ marked points in the k-D space. The marked points can be partitioned into at most $k \cdot 2^k$ groups, each of $n^{1/2}$ points. Then, we assign $\frac{1}{2}n^{\frac{k+2}{2k}(1+\epsilon)}$ processors to each of the marked points to find its kD_ANN within these $n^{(k+2)/2k}$ points. In total, it takes at most $n^{1/2} \cdot \frac{1}{2}n^{\frac{k+2}{2k}(1+\epsilon)} = \frac{1}{2}n^{\frac{2k+2}{2k}+\frac{k+2}{2k}\epsilon}$ processors for all the marked points in a group to find their kD_ANNs in $O(1)$ time. Each group is processed sequentially.

From a and b, the number of processors needed is $\max(n^{\frac{13}{18}+\frac{26}{18k}}, \frac{1}{2}n^{\frac{2k+2}{2k}+\frac{k+1}{2k}\epsilon}, \frac{1}{2}n^{\frac{2k+2}{2k}+\frac{k+2}{2k}\epsilon}) = \frac{1}{2}n^{\frac{2k+2}{2k}+\frac{k+2}{2k}\epsilon}$. So, the lemma holds when $i = 2$ as well. Note that for the 2D case, the maximum value of $i$ is 2 and, hence, the number of processors needed is $\frac{1}{2}n^{3/2+\epsilon}$.

3. For $i = 3$ (only for $k = 3$):

a. Finally, it takes $(n^{(k+3)/2k})^{13/9} = n^{\frac{13}{18}+\frac{39}{18k}}$ processors to presort the $n^{(k+3)/2k} = n$ points in ascending X-coordinate by invoking Lemma 4. The sorted sequence of $n$ points can be divided into $m = n^{1/2k} = n^{1/6}$ slabs, each of $n^{(k+2)/2k} = n^{5/6}$ points. Based on the result of 2), the ANN problem of $n^{(k+2)/2k}$ points can be solved in $O(1)$ time on an LARPBS of $\frac{1}{2}n^{\frac{2k+2}{2k}+\frac{k+2}{2k}\epsilon}$ processors. So, in total it takes $n^{1/2k} \cdot \frac{1}{2}n^{\frac{2k+2}{2k}+\frac{k+2}{2k}\epsilon} = \frac{1}{2}n^{\frac{2k+3}{2k}+\frac{k+2}{2k}\epsilon}$ processors for all of the $n^{1/2k}$ slabs, each of $n^{(k+2)/2k} = n^{5/6}$ points, to be processed in parallel. The Y and Z-coordinates are processed similarly and sequentially.

b. Since the sorted sequence of $n^{(k+3)/2k}$ points can be divided into $n^{1/2k}$ slabs in each coordinate, the slab of $n^{(k+3)/2k}$ points in the k-D space is divided into $(n^{1/2k})^k = n^{1/2}$ rectangular boxes. Based on Lemma 8, and following Steps 3 and 4 of the first algorithm as described in Section 5, there are at most $k \cdot 2^k \cdot n^{1/2}$ marked points in the 3D space. The marked points can be partitioned into at most $k \cdot 2^k = 24$ groups, each of $n^{1/2}$ points. Then, we assign $\frac{1}{2}n^{\frac{k+3}{2k}(1+\epsilon)}$ processors to each of the marked points to find its 3D_ANN within these $n^{(k+3)/2k} = n$ points. In total, it takes at most $n^{1/2} \cdot \frac{1}{2}n^{\frac{k+3}{2k}(1+\epsilon)} = \frac{1}{2}n^{\frac{2k+3}{2k}+\frac{k+3}{2k}\epsilon}$ processors for all the marked points in a group to find their kD_ANNs in $O(1)$ time. Each group is processed sequentially.

From a and b, the number of processors needed is $\max(n^{\frac{13}{18}+\frac{39}{18k}}, \frac{1}{2}n^{\frac{2k+3}{2k}+\frac{k+2}{2k}\epsilon}, \frac{1}{2}n^{\frac{2k+3}{2k}+\frac{k+3}{2k}\epsilon}) = \frac{1}{2}n^{\frac{2k+3}{2k}+\frac{k+3}{2k}\epsilon} = \frac{1}{2}n^{3/2+\epsilon}$.

So, the lemma holds when $i = 3$ as well. Note that for the 3D case, the maximum value of $i$ is 3. □

6.1 Description of Algorithm

In this algorithm, we employ the divide-and-conquer strategy to solve the kD_ANN and kD_CP problems of $n$ points by recursively invoking Lemma 8 (or 7) $k$ times, where $k = 3$ (or 2). A high-level description of the algorithm is as follows: First, we perform the "divide" phase. We presort the $n$ points by ascending $h$th coordinate and then divide them into $n^{1/2k}$ slabs, each of $n^{(2k-1)/2k}$ points. Now, in each coordinate, the ANN problem of $n$ points is divided into $n^{1/2k}$ ANN problems, each of $n^{(2k-1)/2k}$ points. In the k-D space, there are in total $k \cdot n^{1/2k}$ ANN problems, each of $n^{(2k-1)/2k}$ points. Similarly, the ANN problem of $n^{(2k-1)/2k}$ points is solved by dividing the $n^{(2k-1)/2k}$ points into $n^{1/2k}$ ANN problems, each of $n^{(2k-2)/2k}$ points. The above steps are applied recursively $k$ times until each slab contains $n^{(2k-k)/2k} = n^{1/2}$ points. Each coordinate is processed similarly and sequentially. Then, we recursively invoke Lemma 8 (or 7) to solve an ANN problem of $n^{(k+i)/2k}$ points by merging the solved $n^{1/2k}$ ANN problems, each of $n^{(k+i-1)/2k}$ points, sequentially from $i = 1$ to $k$, until the ANN problem of $n$ points is solved. We state this algorithm in detail as follows.

Algorithm 3D_ANN_2 (or 2D_ANN_2)
Input: A set $V = \{v_0, v_1, \ldots, v_i, \ldots, v_{n-1}\}$ of $n$ points in the 3D space (or 2D plane).
Output: The nearest point for each point and the closest pair from the given set.

Step 1: // Presorting points in $V$ by ascending $h$th coordinate. // Sort the points in $V$ by ascending $h$th coordinate, where $1 \le h \le 3$. Based on Lemma 4, it takes $O(1)$ time on an LARPBS of $n^{13/9}$ processors to sort the $n$ points in each coordinate. The sorted sequence is denoted $S^h$.


Step 2: // Recursively partitioning $S^h$ until each slab contains $n^{1/2}$ points. // For $h = 1$ (i.e., the X-coordinate), we divide the sorted sequence $S^1$ of $n$ points into $n^{1/2k}$ slabs, each containing $n^{(2k-1)/2k}$ points. We then presort and divide each of the new slabs of $n^{(2k-1)/2k}$ points into $n^{1/2k}$ slabs of $n^{(2k-2)/2k}$ points in all the $k$ coordinates. The above steps are processed recursively until each of the new slabs of $n^{(2k-(k-1))/2k}$ points is divided into $n^{1/2k}$ slabs in all the $k$ coordinates, each containing $n^{k/2k} = n^{1/2}$ points. Note that in each recursion, each slab must be divided into subslabs in all the $k$ coordinates; otherwise, if we divide it in only one coordinate (e.g., only the X-coordinate), the result will be incorrect. Each $S^h$ is recursively divided similarly and sequentially.

Step 3: // Solving the ANN problem of each slab of $n^{1/2}$ points. // Based on Algorithm MFA, the ANN problem of $n^{1/2}$ points can be solved in $O(1)$ time on an LARPBS of $n^{1/2} \cdot \frac{1}{2}n^{\frac{1}{2}(1+\epsilon)} = \frac{1}{2}n^{1+\frac{1}{2}\epsilon}$ processors; hence, it takes $O(1)$ time on an LARPBS of $n \cdot \frac{1}{2}n^{\frac{1}{2}(1+\epsilon)} = \frac{1}{2}n^{\frac{3}{2}+\frac{1}{2}\epsilon}$ processors for all the $n$ points in $S^1$ to find their respective (local) ANNs within their own slabs, each of $n^{1/2}$ points, in parallel. Each $S^h$ is processed similarly and sequentially.

Step 4: // Solving the ANN problem of $n^{\frac{k+i}{2k}}$ points sequentially from i = 1 to k. // Based on Lemma 9, for i = 1 the ANN problem of $n^{\frac{k+1}{2k}}$ points in the k-D space can be solved in $O(1)$ time on an LARPBS of $\frac{1}{2} n^{\frac{2k+1}{2k}+\frac{k+1}{2k}\epsilon}$ processors. Since the sorted sequence $S^h$ of n points can be divided into $n^{\frac{k-1}{2k}}$ slabs, each containing $n^{\frac{k+1}{2k}}$ points, it takes $n^{\frac{k-1}{2k}} \cdot \frac{1}{2} n^{\frac{2k+1}{2k}+\frac{k+1}{2k}\epsilon} = \frac{1}{2} n^{\frac{3}{2}+\frac{k+1}{2k}\epsilon}$ processors for all the $n^{\frac{k-1}{2k}}$ slabs to be processed in parallel. In general, since the input sequence $S^h$ of n points can be divided into $n^{\frac{k-i}{2k}}$ slabs, each containing $n^{\frac{k+i}{2k}}$ points, and, based on Lemma 9, the ANN problem of $n^{\frac{k+i}{2k}}$ points in the k-D space can be solved in $O(1)$ time on an LARPBS of $\frac{1}{2} n^{\frac{2k+i}{2k}+\frac{k+i}{2k}\epsilon}$ processors, it always takes, from i = 1 to k, $n^{\frac{k-i}{2k}} \cdot \frac{1}{2} n^{\frac{2k+i}{2k}+\frac{k+i}{2k}\epsilon} = \frac{1}{2} n^{\frac{3}{2}+\frac{k+i}{2k}\epsilon}$ processors for all the $n^{\frac{k-i}{2k}}$ slabs to be processed in parallel. For the k-D case, if i = k, then $\frac{k+i}{2k} = 1$. This establishes the following theorem.

Theorem 2. The 2D_ANN, 2D_CP, 3D_ANN, and 3D_CP problems of n points can be solved in $O(1)$ time on an LARPBS of size $\frac{1}{2} n^{\frac{3}{2}+\epsilon}$, where $0 < \epsilon = \frac{1}{2^{c+1}-1} \le 1$, and c is a constant and positive integer.
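To make the exponent bookkeeping behind Theorem 2 concrete, the per-stage processor totals of Step 4 for $k = 3$ are:
\begin{align*}
i = 1:\quad & n^{\frac{2}{6}} \cdot \tfrac{1}{2}\, n^{\frac{7}{6}+\frac{4}{6}\epsilon} = \tfrac{1}{2}\, n^{\frac{3}{2}+\frac{2}{3}\epsilon},\\
i = 2:\quad & n^{\frac{1}{6}} \cdot \tfrac{1}{2}\, n^{\frac{8}{6}+\frac{5}{6}\epsilon} = \tfrac{1}{2}\, n^{\frac{3}{2}+\frac{5}{6}\epsilon},\\
i = 3:\quad & n^{0} \cdot \tfrac{1}{2}\, n^{\frac{9}{6}+\epsilon} = \tfrac{1}{2}\, n^{\frac{3}{2}+\epsilon},
\end{align*}
so the last stage ($i = k$) dominates, giving the $\frac{1}{2} n^{\frac{3}{2}+\epsilon}$ bound of Theorem 2.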

7 CONCLUDING REMARKS

The LARPBS computation model used in this paper includes electronic digital processors, fiber interconnections, and opto-electronic couplers. Currently, fiber interconnection techniques have been applied to establish reconfigurability and simultaneous switching for massively parallel processing systems [31]. All the opto-electronic devices assumed by the LARPBS model exist [24]. Present commercial devices limit the operation speed to a few hundred MHz. For example, the Jitney Optical Bus with 20 channels (500 Mb/s/ch) has been designed for high-speed parallel computing and successfully demonstrated in IBM AS/400 and RS6000 power parallel system testbeds [7]. Demonstration of optical transmission at 250 Gb/s has already been reported [27].


TABLE 3 Summary of Comparison Results for Parallel 2D_ANN Algorithms

TABLE 4 Summary of Comparison Results for Parallel 3D_ANN Algorithms

In fact, many commercial massively parallel computers, such as the Cray T90 supercomputer system, use optical technologies in their communication subsystems. As described above, unlike many theoretical models such as the PRAM, models with reconfigurable optical buses such as the LARPBS are more realistic and are likely to become feasible architectures in the near future.

To demonstrate the computational power of the LARPBS, we have developed two parallel algorithms for solving the 2D_ANN, 2D_CP, 3D_ANN, and 3D_CP problems from the computational geometry perspective. The first algorithm can solve the 2D_ANN and 2D_CP problems of n points in $O(1)$ time on an LARPBS of size $\frac{1}{2} n^{\frac{5}{3}+\epsilon}$, and the 3D_ANN and 3D_CP problems of n points in $O(1)$ time on an LARPBS of size $\frac{1}{2} n^{\frac{7}{4}+\epsilon}$; the second algorithm can solve the 2D_ANN, 2D_CP, 3D_ANN, and 3D_CP problems of n points in $O(1)$ time on an LARPBS of size $\frac{1}{2} n^{\frac{3}{2}+\epsilon}$, where $0 < \epsilon = \frac{1}{2^{c+1}-1} \le 1$, and c is a constant and positive integer. Clearly, the second algorithm improves the processor complexity considerably, although the constant factor of its time complexity is a little larger than that of the first algorithm due to the recursion overhead. Assume $n = 10^{15}$ and $c = 10$. Then $n \times n = 10^{30}$, $\frac{1}{2} n^{\frac{5}{3}+\epsilon} \approx 5 \times 10^{24}$, and $\frac{1}{2} n^{\frac{3}{2}+\epsilon} \approx 5 \times 10^{21.5}$. Clearly, compared with [5] and [8], a huge amount of processors (about $2 \times 10^{5}$ times) is saved by our first algorithm, and even more (about $2 \times 10^{7.5}$ times) by our second algorithm. This is a great improvement. To the best of our knowledge, all results derived in this paper are the best $O(1)$ time algorithms known, compared with all previously published results. The comparison results are shown in Tables 3 and 4.
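These figures are easy to verify directly; a quick check in plain Python (our own illustration):

```python
n, c = 1e15, 10
eps = 1 / (2 ** (c + 1) - 1)            # epsilon = 1/(2^{c+1} - 1) ~ 4.9e-4
mesh = n * n                            # n x n processors, as in [5], [8]
first = 0.5 * n ** (5 / 3 + eps)        # first algorithm, 2D case
second = 0.5 * n ** (3 / 2 + eps)       # second algorithm
print(f"{mesh:.1e}  {first:.1e}  {second:.1e}")
print(f"savings: {mesh / first:.1e}x and {mesh / second:.1e}x")
# ~1.0e+30  ~5.1e+24  ~1.6e+22  ->  savings ~2.0e+05x and ~6.2e+07x
# (6.2e7 is roughly the 2 x 10^{7.5} quoted above)
```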

ACKNOWLEDGMENTS

This work was partly supported by the National Science Council under contract numbers NSC-93-2213-E-129-011, NSC-93-2213-E-011-066, and NSC-92-2213-E-012-001.

REFERENCES

[1] S.G. Akl, Parallel Computation: Models and Methods. Upper Saddle River, N.J.: Prentice-Hall, 1997.
[2] M.M. Eshaghian, "Parallel Algorithms for Image Processing on OMC," IEEE Trans. Computers, vol. 40, pp. 827-833, 1991.
[3] Z. Guo, R.G. Melhem, R.W. Hall, D.M. Chiarulli, and S.P. Levitan, "Pipelined Communications in Optically Interconnected Arrays," J. Parallel and Distributed Computing, vol. 12, no. 3, pp. 269-282, 1991.
[4] Z. Guo, "Optically Interconnected Processor Arrays with Switching Capability," J. Parallel and Distributed Computing, vol. 23, pp. 314-329, 1994.
[5] J. Jang, M. Nigam, V.K. Prasanna Kumar, and S. Sahni, "Constant Time Algorithms for Computational Geometry on the Reconfigurable Mesh," IEEE Trans. Parallel and Distributed Systems, vol. 8, no. 1, pp. 1-12, Jan. 1997.
[6] T.W. Kao and S.J. Horng, "Efficient Algorithms for Computing Two Nearest Neighbor Problems on a RAP," Pattern Recognition, vol. 27, no. 12, pp. 1707-1716, Dec. 1994.
[7] D.M. Kuchta, J. Crow, P. Pepeljugoski, K. Stawiasz, J. Trewhella, D. Booth, W. Nation, C. DeCusatis, and A. Muszynski, "Low Cost 10 Gigabit/s Optical Interconnects for Parallel Processing," Proc. Fifth Int'l Conf. Massively Parallel Processing, pp. 210-215, 1998.
[8] T.H. Lai and M.J. Sheng, "Constructing Euclidean Minimum Spanning Trees and All Nearest Neighbors on Reconfigurable Meshes," IEEE Trans. Parallel and Distributed Systems, vol. 7, no. 8, pp. 806-817, Aug. 1996.
[9] T. Leighton, "Tight Bounds on the Complexity of Parallel Sorting," IEEE Trans. Computers, vol. 34, no. 4, pp. 344-354, Apr. 1985.
[10] Y.H. Lee, S.J. Horng, and J. Seitzer, "Parallel Computation of the Euclidean Distance Transform on a Three-Dimensional Image Array," IEEE Trans. Parallel and Distributed Systems, vol. 14, no. 3, pp. 203-212, Mar. 2003.
[11] S.P. Levitan, D.M. Chiarulli, and R.G. Melhem, "Coincident Pulse Technique for Multiprocessor Interconnection Structures," Applied Optics, vol. 29, pp. 2024-2033, 1990.
[12] K. Li, Y. Pan, and S.-Q. Zheng, "Fast and Processor Efficient Parallel Matrix Multiplication Algorithms on a Linear Array with a Reconfigurable Pipelined Bus System," IEEE Trans. Parallel and Distributed Systems, vol. 9, no. 8, pp. 705-720, Aug. 1998.
[13] K. Li, Y. Pan, and S.-Q. Zheng, "Efficient Deterministic and Probabilistic Simulations of PRAMs on Linear Arrays with Reconfigurable Pipelined Bus Systems," J. Supercomputing, vol. 15, pp. 163-181, 2000.
[14] R. Lin and S. Olariu, "Efficient VLSI Architectures for Columnsort," IEEE Trans. Very Large Scale Integration (VLSI) Systems, vol. 7, no. 1, pp. 135-139, 1999.
[15] R.G. Melhem, D.M. Chiarulli, and S.P. Levitan, "Space Multiplexing of Waveguides in Optically Interconnected Multiprocessor Systems," Computer J., vol. 32, no. 4, pp. 362-369, 1989.
[16] R. Miller, V. Prasanna Kumar, D. Reisis, and Q.F. Stout, "Image Computations on Reconfigurable VLSI Arrays," Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 925-930, 1988.
[17] R. Miller and Q.F. Stout, "Geometric Algorithms for Digitized Pictures on a Mesh-Connected Computer," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 7, no. 2, pp. 216-228, Feb. 1985.
[18] R. Miller and Q.F. Stout, "Mesh Computer Algorithms for Computational Geometry," IEEE Trans. Computers, vol. 38, pp. 321-340, 1989.
[19] Y. Pan, "Basic Data Movement Operations on the LARPBS Model," Parallel Computing Using Optical Interconnections, K. Li, Y. Pan, and S.-Q. Zheng, eds. Norwell, Mass.: Kluwer Academic Publishers, 1998.
[20] Y. Pan and M. Hamdi, "Quicksort on a Linear Array with a Reconfigurable Pipelined Bus," Proc. Int'l Symp. Parallel Architectures, Algorithms, and Networks, pp. 313-319, 1996.
[21] Y. Pan and K. Li, "Linear Array with a Reconfigurable Pipelined Bus System—Concepts and Applications," J. Information Sciences, pp. 237-258, 1998.
[22] Y. Pan, K. Li, and S.-Q. Zheng, "Fast Nearest Neighbor Algorithms on a Linear Array with a Reconfigurable Pipelined Bus System," Parallel Algorithms and Applications, vol. 13, pp. 1-25, 1998.
[23] S. Pavel and S.G. Akl, "On the Power of Arrays with Reconfigurable Optical Bus," Proc. Int'l Conf. Parallel and Distributed Processing Techniques and Applications, pp. 1443-1454, 1996.
[24] S. Pavel and S.G. Akl, "Integer Sorting and Routing in Arrays with Reconfigurable Optical Buses," Int'l J. Foundations of Computer Science, vol. 9, no. 1, pp. 99-120, 1998.
[25] V. Prasanna Kumar and C.S. Raghavendra, "Array Processor with Multiple Broadcasting," J. Parallel and Distributed Computing, vol. 4, no. 2, pp. 173-190, 1987.
[26] F.P. Preparata and M.I. Shamos, Computational Geometry: An Introduction, third corrected printing. Berlin: Springer-Verlag, 1990.
[27] P.R. Prucnal, I. Glesk, and J.P. Sokoloff, "Demonstration of All-Optical Self-Clocked Demultiplexing of TDM Data at 250 Gb/s," Proc. First Int'l Workshop Massively Parallel Processing Using Optical Interconnections, pp. 106-117, 1994.
[28] C. Qiao, "A Unique Design of Fiber-Optic Interconnection Networks and Algorithms," Parallel Computing Using Optical Interconnections, K. Li, Y. Pan, and S.-Q. Zheng, eds. Norwell, Mass.: Kluwer Academic Publishers, 1998.
[29] C. Qiao, R.G. Melhem, D. Chiarulli, and S. Levitan, "Optical Multicasting in Linear Arrays," Int'l J. Optical Computing, vol. 2, no. 1, pp. 31-48, 1991.
[30] C. Qiao and R.G. Melhem, "Time-Division Communications in Multiprocessor Arrays," IEEE Trans. Computers, vol. 42, pp. 577-590, 1993.
[31] C. Qiao, R.G. Melhem, D. Chiarulli, and S. Levitan, "Dynamic Reconfiguration of Optically Interconnected Networks with Time Division Multiplexing," J. Parallel and Distributed Computing, vol. 22, no. 8, pp. 268-278, 1994.
[32] S. Rajasekaran and S. Sahni, "Sorting, Selection and Routing on the Arrays with Reconfigurable Optical Buses," IEEE Trans. Parallel and Distributed Systems, vol. 8, no. 11, pp. 1123-1132, Nov. 1997.
[33] J.L. Trahan, A.G. Bourgeois, Y. Pan, and R. Vaidyanathan, "An Optimal and Scalable Algorithm for Permutation Routing on Reconfigurable Linear Arrays with Optically Pipelined Buses," J. Parallel and Distributed Computing, vol. 60, no. 9, pp. 1125-1136, 2000.
[34] H.R. Tsai, S.J. Horng, T.W. Kao, S.S. Lee, and S.S. Tsai, "Fundamental Data Movement Operations and Its Applications on a Hyper-Bus Broadcast Network," Parallel Computing, vol. 25, no. 2, pp. 137-157, Feb. 1999.
[35] Y.R. Wang and S.J. Horng, "An O(1) Time Parallel Algorithm for the 3D Euclidean Distance Transform on the CRCW PRAM Model," Proc. IASTED Int'l Conf. Networks, Parallel and Distributed Processing and Applications (NPDPA '02), pp. 419-424, Oct. 2002.
[36] Y.R. Wang and S.J. Horng, "An O(1) Time Parallel Algorithm for the 3D Euclidean Distance Transform on the CRCW PRAM Model," IEEE Trans. Parallel and Distributed Systems, vol. 14, no. 10, pp. 973-982, Oct. 2003.
[37] Y.R. Wang and S.J. Horng, "Parallel Algorithms for Arbitrary Dimensional Euclidean Distance Transforms with Applications on Arrays with Reconfigurable Optical Buses," IEEE Trans. Systems, Man, and Cybernetics—Part B: Cybernetics, vol. 34, no. 1, pp. 517-532, 2004.
[38] C.H. Wu and S.J. Horng, "L2 Vector Median Filters on Arrays with Reconfigurable Optical Buses," IEEE Trans. Parallel and Distributed Systems, vol. 12, no. 12, pp. 1281-1292, Dec. 2001.
[39] C.H. Wu and S.J. Horng, "Fast and Scalable Selection Algorithms with Applications to Median Filtering," IEEE Trans. Parallel and Distributed Systems, vol. 14, no. 9, pp. 983-992, Sept. 2003.

Yuh-Rau Wang received the BS degree in electrical engineering from National Cheng Kung University, Taiwan, in 1976, the MS degree in computer science from Polytechnic University, Brooklyn Campus, New York, in 1987, and the PhD degree in electrical engineering from National Taiwan University of Science and Technology, Taiwan, in 2003.
From 1987 to 1992, he was the director of the Image Products Research and Development Department, Peripheral Products Business Unit of Acer Group, Taipei, Taiwan. From 1993 to 2003, he was a faculty member of Lan Yang Institute of Technology, I-Lan, Taiwan. From 1993 to 1996, he was the director of the Computer Center and the chairman of the Department of Information Management at Lan Yang Institute of Technology. Currently, he is an associate professor in the Department of Computer Science and Information Engineering at St. John’s & St. Mary’s Institute of Technology, Taipei, Taiwan. His research interests include image processing, computer vision, and parallel algorithms.


Shi-Jinn Horng received the BS degree in electronic engineering from National Taiwan Institute of Technology, the MS degree in information engineering from National Central University, and the PhD degree in computer science from National Tsing Hua University in 1980, 1984, and 1989, respectively. He has published more than 100 research papers and received many awards. Currently, he is a full professor in the Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology. His research interests include VLSI design, multiprocessing systems, and parallel algorithms.


Chin-Hsiung Wu received the BS degree from Chinese Naval Academy, the MS degree in information management from National Defence Management College, and the PhD degree in electrical engineering from National Taiwan University of Science and Technology, Taiwan, in 1986, 1991, and 2000, respectively. Currently, he is a full professor in the Department of Information Technology & Communication at Shih Chien University Kaohsiung Campus. His research interests include image processing, computer vision, and parallel and distributed computing.
