Efficient BSP/CGM Algorithms for the Maximum Subarray Sum and Related Problems

Anderson C. Lima, Rodrigo G. Branco, and Edson N. Cáceres

Faculdade de Computação, Universidade Federal de Mato Grosso do Sul, Campo Grande, Brazil
{anderson.correa.lima,rodrigo.g.branco}@gmail.com, [email protected]
http://www.facom.ufms.br

Abstract. Given an n × n array A of integers with at least one positive value, the maximum subarray sum problem consists in finding the maximum sum among the sums of all rectangular subarrays of A. The maximum subarray problem appears in several scientific applications, particularly in Computer Vision. The algorithms that solve this problem have been used to help identify the brightest regions of images used in astronomy and medical diagnosis. The best known sequential algorithm that solves this problem has O(n³) time complexity. In this work we revisit the BSP/CGM parallel algorithm that solves this problem and we present BSP/CGM algorithms for the following related problems: the maximum largest subarray sum, the maximum smallest subarray sum, the number of subarrays of maximum sum, the selection of the subarray with k-maximum sum, and the location of the subarray with the maximum relative density sum. To the best of our knowledge there are no parallel BSP/CGM algorithms for these related problems. Our algorithms use p processors and require O(n³/p) parallel time with a constant number of communication rounds. In order to show the applicability of our algorithms, we have implemented them on a cluster of computers using MPI and on a machine with a GPGPU using CUDA and OpenMP. We have obtained good speedup results in both environments. We also tested the maximum relative density sum algorithm with an image from the Cancer Imaging Archive.

Keywords: Parallel algorithms · BSP/CGM · Computer vision · GPGPU · Maximum subarray sum

1 Introduction

The maximum subarray sum problem consists in finding the maximum sum among the sums of all rectangular subarrays of an n × n array A of integers with at least one positive number [3]. This problem arises in many areas of Science. One such area is Computer Vision, where many applications require the solution of the maximum subarray sum problem. Among these, the identification

© Springer International Publishing Switzerland 2015. O. Gervasi et al. (Eds.): ICCSA 2015, Part I, LNCS 9155, pp. 392–407, 2015. DOI: 10.1007/978-3-319-21404-7_29


of the brightest region of an image is an important application. This task can aid in medical diagnosis based on images. The best known sequential algorithm for the maximum subarray sum problem has O(n³) time complexity [3]. Previous works have reported good parallel solutions for the maximum subarray sum problem. Qiu and Akl presented an algorithm that works on interconnection networks (hypercube and star) of size p, using O(log n) parallel time with O(n³/log n) processors [10]. Zhaofang Wen presented a PRAM parallel algorithm using O(log n) parallel time with O(n³/log n) processors [13]. Perumalla and Deo also presented a PRAM algorithm with the same time complexity and number of processors [9]. A BSP/CGM algorithm for the problem was presented by Alves et al. using O(n³/p) parallel time with p processors and a constant number of communication rounds [1]. Bae [2] presented parallel solutions for the maximum subarray problem on a mesh. Most of these algorithms do not explore the characteristics of their solutions, and their output returns only the value of the maximum subarray sum. In this work, we revisit the maximum subarray sum problem and propose solutions to five related problems: the maximum largest subarray sum, the maximum smallest subarray sum, the number of subarrays of maximum sum, the selection of the subarray with k-maximum sum, and the location of the subarray with the maximum relative density sum. The latter is especially useful when the concentration of elements is more important than the global maximum sum. To the best of our knowledge there are no parallel BSP/CGM algorithms for these problems. Our algorithms use p processors with O(n³/p) parallel time and a constant number of communication rounds. In order to show their efficiency not only in theory but also in practice, we implemented the algorithms in distributed and shared memory environments.
In the distributed memory environment we implemented the algorithms on a cluster of computers using MPI, and in the shared memory environment we implemented them on a GPGPU using CUDA and OpenMP. We obtained good speedup results in both environments. The remainder of this paper is organized as follows: Section 2 presents the background and the works related to this study. Section 3 presents our proposed BSP/CGM algorithms. The implementations and results are presented in Section 4. Finally, Section 5 presents the conclusions and future work.

2 Notation and Computational Model

Let A be an n × n array of integers with at least one positive value, and let (g, h) be a pair of integers, where 1 ≤ g ≤ h ≤ n. We denote by R_{g,h} the set of all subarrays A[i₁ . . . i₂, g . . . h], where 1 ≤ i₁ ≤ i₂ ≤ n. Similarly, we denote by C^{gh} the sequence of size n whose j-th element is the sum of the elements of row j of A[1 . . . n, g . . . h] that lie between columns g and h, inclusive, i.e., C_j^{gh} = Σ_{k=g}^{h} a_{jk}. We define the maximum subarray sum problem as the task of obtaining the subarray with the maximum sum among all the subarrays R_{1,n} of A [3]. In


the array represented by Figure 1 there are three subarrays of maximum sum. All have sum equal to 16.


Fig. 1. Array A_{8×8} with A[2 . . . 6, 2 . . . 6], A[1 . . . 3, 8 . . . 8] and A[8 . . . 8, 7 . . . 8], three subarrays of maximum sum

2.1 Related Problems

Given an n × n array A of integers, as illustrated in Figure 1, there might be more than one maximum subarray sum. In this array there are three subarrays of maximum sum. All have sum equal to 16, but with different sizes (numbers of elements): 3, 25 and 2, respectively. In this context, at least three new problems arise: the maximum largest subarray sum, the maximum smallest subarray sum and the number of subarrays of maximum sum. In the first two problems, the size is the number of elements that make up the subarray. Besides these three, we expanded our interest to two other problems related to the general problem: the selection of the subarray with k-maximum sum and the location of the subarray with the maximum relative density sum.

2.2 Computational Model

In this paper we use the BSP/CGM parallel computation model [5]. It consists of a set of p processors, each having a local memory of size O(n/p), where n is the size of the problem. An algorithm in this model executes supersteps, i.e., a series of rounds alternating well-defined local computation and global communication phases, separated by a barrier synchronization. The cost of communication considers the number of rounds required. The implementation of a BSP/CGM algorithm generally presents good results, with performance similar to that predicted by the theoretical analysis [4]. Since the MPI library is designed for distributed memory environments, a BSP/CGM algorithm can be mapped onto an MPI implementation using the message-passing resources of this library. On the other hand, to implement BSP/CGM algorithms in a shared memory environment, like GPGPUs (General Purpose Graphics Processing Units) using CUDA, some abstractions are necessary. In this context, the supersteps of the BSP/CGM model are represented by


Fig. 2. Abstraction between the BSP/CGM model [7] and GPGPU

sequential invocations of each CUDA kernel. Furthermore, we relate the set of processors of the BSP/CGM model to the set of blocks of CUDA. Figure 2 illustrates our suggestion for this process, where the BSP/CGM model [7] is represented on the right.

3 Maximum Subarray Sum

The BSP/CGM algorithm proposed by Alves et al. [1] is well suited for computing the maximum subarray sum in distributed memory environments, since it compresses the subarrays in order to decrease the size of the messages between the processors. It is not clear whether this compression gives any advantage in a shared memory environment, like a GPGPU. Besides that, the compression makes it difficult to compute the size and location of the subarrays with maximum sum. On the other hand, the PRAM algorithm for this problem, proposed by Perumalla and Deo [9], does not compress the subarrays. Since we want to explore the information on the size and location of the subarrays that have maximum sum, we devised, based on the ideas of the latter algorithm, a new BSP/CGM parallel algorithm for the maximum subarray sum. This new algorithm can be used in shared or distributed memory environments. In addition, we extended the algorithm to also compute the five related problems described in Section 2 and implemented the solutions using MPI, OpenMP and GPGPU.

3.1 The BSP/CGM Algorithms for the General and Related Problems

Initially we designed a new BSP/CGM algorithm that solves the general maximum subarray sum (MSS) problem. Since we want to compute the size and location of the maximum subarray sum and test the algorithm in distributed and


shared memory environments, we based the new BSP/CGM algorithm on the results proposed by Perumalla and Deo [9]. The MSS algorithm transforms the computation of the maximum subarray sum into the computation of the maximum subsequence sum (MSqS) on arrays of sequences of numbers.

Algorithm 1. Maximum Subarray Sum (MSS)
Input: (1) A set of P processors; (2) the identification of each processor p_i, where 1 ≤ i ≤ P; (3) array A[1 . . . n][1 . . . n] of integers.
Output: (1) Array M of the maximum values of all subarrays from A; (2) max, the maximum value in M.
1. Each processor p_i in parallel do PS[1 . . . n][1 . . . n] ← the prefix sums of the rows of array A;
2. Each processor p_i in parallel do Temp_j[1 . . . n] ← C^{(g,h)}(PS);
3. Each processor p_i in parallel do max_local_j ← MSqS(Temp_j[1 . . . n]);
4. Each processor p_i ≠ 1 in parallel do send(max_local_j, p_1);
5. if p_i = 1 then
6.   for k = 2 to P do
7.     receive(max_local_j, p_k)
8.   end for
9.   for j = 1 to n(n + 1)/2 do
10.    M[j] ← max_local_j
11.  end for
12. end if
13. Each processor p_i in parallel do max ← maximum of M[1, . . . , n(n + 1)/2];
14. return M, max
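For readers who want to experiment with the flow of Algorithm 1, the following is a minimal sequential sketch in Python: the distribution of the C^{(g,h)} arrays over processors and the communication steps are omitted, and the function name `max_subarray_sum` is ours, not the paper's.

```python
def max_subarray_sum(A):
    """Sequential sketch of Algorithm 1: reduce the 2-D problem to
    n(n+1)/2 one-dimensional maximum subsequence sum problems."""
    n = len(A)
    # Step 1: prefix sums PS of the rows of A (with a leading 0 column).
    PS = [[0] * (n + 1) for _ in range(n)]
    for i in range(n):
        for j in range(n):
            PS[i][j + 1] = PS[i][j] + A[i][j]
    best = float("-inf")
    # Steps 2-3: for every column pair (g, h), build C^{(g,h)} in O(n)
    # and run the maximum subsequence sum (Kadane) on it.
    for g in range(1, n + 1):
        for h in range(g, n + 1):
            C = [PS[i][h] - PS[i][g - 1] for i in range(n)]
            max_here = max_so_far = 0
            for x in C:
                max_here = max(max_here + x, 0)
                max_so_far = max(max_so_far, max_here)
            best = max(best, max_so_far)
    return best
```

In the BSP/CGM algorithm the (g, h) pairs are split evenly among the p processors, which is what yields the O(n³/p) bound.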

The input of our main algorithm (Algorithm 1) is an n × n array A of integers, such as the example shown in Figure 1. The first step of Algorithm 1 is the computation of the array PS. Each row of PS is the prefix sum of the respective row of A, i.e., PS[i, j] = Σ_{k=1}^{j} A[i, k]. Figure 3 illustrates the computation of the array PS.
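This first step can be sketched with ordinary row-wise accumulation; in the BSP/CGM algorithm each processor would compute the prefix sums of its own slice of rows (`row_prefix_sums` is a hypothetical helper name):

```python
from itertools import accumulate

def row_prefix_sums(A):
    # PS[i][j] = A[i][0] + ... + A[i][j] (0-based), one row at a time;
    # each row is independent, so rows can be split among processors.
    return [list(accumulate(row)) for row in A]
```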


Fig. 3. Computation of the array PS

After that we apply Algorithm 2 to the array PS to compute a set of one-dimensional subarrays C^{(g,h)}, where 1 ≤ g ≤ h ≤ n [9]. Since each row of PS is the prefix sum of the respective row of A, the subarrays C^{(g,h)} can be computed in constant time per element with C^{gh}[i] = PS[i][h] − PS[i][g − 1]. The computation


Algorithm 2. C^{(g,h)}
Input: (1) A set of P processors; (2) the identification of each processor p_i, where 1 ≤ i ≤ P; (3) array PS[1 . . . n][1 . . . n] of integers.
Output: Set of one-dimensional arrays Temp_j[1 . . . n].
1. if g = 1 then
2.   Temp_j[1 . . . n] ← PS[i][h];
3. end if
4. Each processor p_i in parallel do, for g ≠ 1: Temp_j[1 . . . n] ← PS[i][h] − PS[i][g − 1];
5. return Temp_j[1 . . . n]
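A sequential sketch of this constant-time lookup, assuming 0-based Python lists where each row of PS already holds the accumulated row sums of A (`column_pair_array` is our name for the hypothetical helper):

```python
def column_pair_array(PS, g, h):
    """Sketch of Algorithm 2: C^{(g,h)}[i] is the sum of row i of A
    restricted to columns g..h (1-based), read off the prefix sums PS
    in O(1) per entry. The pair g = 1 has no preceding column to
    subtract, hence the special case."""
    if g == 1:
        return [row[h - 1] for row in PS]
    return [row[h - 1] - row[g - 2] for row in PS]
```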

Fig. 4. The steps for computing the subarrays C^{g,h}: (a) the number of subarrays C^{g,h} is n(n + 1)/2; (b) the sequential maximum subsequence sum algorithm is applied to each array C^{g,h} (illustrated for C^{3,6})

and the number of subarrays C^{(g,h)} derived from the array A can be seen in Figure 4. In the next step, for each subarray C^{gh} (compressed in a sequence) the algorithm finds the maximum subsequence sum and the respective maximum. In the last step, the algorithm finds the subarray C^{gh} with the overall maximum sum. Each maximum subsequence sum is computed with the best sequential algorithm for the problem, whose time complexity is O(n) (Algorithm 3). These tasks are illustrated in Figure 4(b).

Algorithm 3. Maximum Subsequence Sum (MSqS)
Input: Integer sequence S of size n.
Output: The value of the maximum subsequence sum of S.
1. MaxSoFar ← 0;
2. MaxEndingHere ← 0;
3. for i = 1 to n do
4.   MaxEndingHere ← max(MaxEndingHere + S[i], 0);
5.   MaxSoFar ← max(MaxSoFar, MaxEndingHere);
6. end for
7. return MaxSoFar
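Algorithm 3 is the classical linear-time maximum subsequence sum (Kadane's algorithm); a direct Python transcription, allowing the empty subsequence of sum 0 as in the pseudocode:

```python
def max_subsequence_sum(S):
    # Kadane's algorithm: O(n) maximum subsequence sum. The floor at 0
    # corresponds to allowing the empty subsequence, as in Algorithm 3.
    max_ending_here = max_so_far = 0
    for x in S:
        max_ending_here = max(max_ending_here + x, 0)
        max_so_far = max(max_so_far, max_ending_here)
    return max_so_far
```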

We adapted the MSS algorithm (Algorithm 1) in order to produce an output with more information than the previous BSP/CGM algorithms. With this we


can work on our second goal and propose solutions for five related problems: (1) the maximum largest subarray sum; (2) the maximum smallest subarray sum; (3) the number of subarrays of maximum sum; (4) the selection of the subarray with k-maximum sum; and (5) the location of the subarray with the maximum relative density. The strategies to solve these five problems are described by Algorithms 4 and 5.

Algorithm 4. Arrays of Maximum Subarray Sum
Input: (1) A set of P processors; (2) the identification of each processor p_i, where 1 ≤ i ≤ P; (3) array A[1 . . . n][1 . . . n] of integers.
Output: (1) Array M of the maximum values of all subarrays from A; (2) array E of the numbers of elements of all maximum subarrays from A.
1. Each processor p_i in parallel do PS[1 . . . n][1 . . . n] ← the prefix sums of the rows of array A[1 . . . n][1 . . . n];
2. Cgh_quantity ← n(n + 1)/2;
3. k ← Cgh_quantity/P;
4. for j = 1 to k in parallel do
5.   Temp_j[1 . . . n] ← C^{(g,h)}(PS);
6.   LocalM[j], LocalE[j] ← MSqS(Temp_j[1 . . . n]);
7. end for
8. Each processor p_i in parallel do send(LocalM[1 . . . k], p_1);
9. Each processor p_i in parallel do send(LocalE[1 . . . k], p_1);
10. if p_i = 1 then
11.   for i = 2 to P do
12.     receive(LocalM_i[1 . . . k], p_i);
13.     receive(LocalE_i[1 . . . k], p_i);
14.   end for
15.   Compute the arrays M[1 . . . n(n + 1)/2] and E[1 . . . n(n + 1)/2], where:
      M = [LocalM_{p_1}[1], . . . , LocalM_{p_1}[k], . . . , LocalM_{p_P}[1], . . . , LocalM_{p_P}[k]];
      E = [LocalE_{p_1}[1], . . . , LocalE_{p_1}[k], . . . , LocalE_{p_P}[1], . . . , LocalE_{p_P}[k]];
16. end if

Algorithm 5. Maximum Subarray Sum and Related Problems
Input: (1) A set of P processors; (2) the identification of each processor p_i, where 1 ≤ i ≤ P; (3) arrays M[1 . . . n(n + 1)/2] and E[1 . . . n(n + 1)/2] from Algorithm 4.
Output: Solutions for the maximum subarray sum (general problem) and the five related problems.
1. (M, E) ← Parallel_Sort_by_Key_value(M, E);
2. maximum subarray sum ← M[n(n + 1)/2]; {general problem}
3. (NewM[1 . . . (n(n + 1)/2) − i], NewE[1 . . . (n(n + 1)/2) − i]) ← discard all i elements such that M[i] < maximum subarray sum;
4. maximum smallest subarray sum ← NewE[0]; {related problem 1}
5. maximum largest subarray sum ← NewE[length(NewE)]; {related problem 2}
6. k-maximum subarray sum ← NewE[k]; {related problem 3}
7. number of subarrays of maximum sum ← size(NewE); {related problem 4}
8. maximum relative density ← Parallel_Reduce_to_Max(M, E), using d_i = M[i]/E[i]; {related problem 5}
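The post-processing of Algorithm 5 can be sketched sequentially in Python on the flat arrays M and E; `related_problems` is a hypothetical name, and an ordinary sort and `max` stand in for the paper's parallel sort-by-key and reduction:

```python
def related_problems(M, E, k=1):
    """Sketch of Algorithm 5: sort the (sum, size) pairs by sum, keep
    only the entries attaining the global maximum, then read off the
    related answers in constant time each."""
    pairs = sorted(zip(M, E))                      # sort by key (the sum)
    max_sum = pairs[-1][0]                         # general problem
    valid = [e for m, e in pairs if m == max_sum]  # sizes of the maxima
    return {
        "maximum_sum": max_sum,
        "smallest": min(valid),                    # related problem 1
        "largest": max(valid),                     # related problem 2
        "k_maximum_size": sorted(valid)[k - 1],    # related problem 3
        "count": len(valid),                       # related problem 4
        "max_relative_density":                    # related problem 5
            max(m / e for m, e in zip(M, E) if e > 0),
    }
```

Running this on the M and E arrays of Figure 5 reproduces the values listed in its caption.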

Algorithms 1 and 4 have similar structures. However, in Algorithm 4 a new array is created. This array, E[1 . . . n(n + 1)/2], stores the number of elements of the subarray corresponding to each position of the array M[1 . . . n(n + 1)/2]. The two arrays


generated as output by Algorithm 4 are used by Algorithm 5 to solve the general maximum subarray sum problem and the related problems. For the solutions, Algorithm 5 uses a parallel sorting algorithm (by key-value) on the arrays M[1 . . . n(n + 1)/2] and E[1 . . . n(n + 1)/2]. In the last step, to solve the related problem of the location of the subarray with the maximum relative density, we compute the quotients of the valid values of M[1 . . . n(n + 1)/2] and E[1 . . . n(n + 1)/2]. Figure 5 illustrates the solution process for all these problems. Particularly in the GPGPU environment, after the sorting, all the steps (except 8) that solve the related problems can run in constant parallel time.


Fig. 5. From the array of valid elements, in constant time, we locate the solutions for five problems: the maximum subarray sum is 16; the maximum largest subarray sum has size 8; the maximum smallest subarray sum has size 2; the number of subarrays of maximum sum is 3; the subarray with the maximum relative density has density 8

3.2 Complexity

The complexity of our algorithms is obtained through an analysis of all steps. Initially, in Algorithm 1, the prefix sum of each row can be computed by a BSP/CGM parallel prefix algorithm using p processors in O(n/p) time and a constant number of communication rounds. Since there are n rows, the time complexity of this step is O(n²/p). Then, in step 2, a set of n(n + 1)/2 = O(n²) one-dimensional arrays is obtained by the processors. Subsequently, Algorithm 3 (O(n)) runs on each array, thus the time complexity of this step is O(n³/p). Step 4 has


only communication. The final steps can be computed by a BSP/CGM parallel maximum-reduction algorithm, using p processors with O(n/p) time and a constant number of communication rounds. Therefore, we conclude that the final time complexity using p processors is O(n³/p) with a constant number of communication rounds. Algorithms 4 and 5 have the same final complexity, since the final parallel sorting and reduction steps do not interfere in the general complexity.
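Summing the superstep costs per processor gives the stated bound:

```latex
T(n, p) = \underbrace{O\!\left(\frac{n^2}{p}\right)}_{\text{row prefix sums}}
        + \underbrace{O\!\left(\frac{n^3}{p}\right)}_{\text{MSqS on } n(n+1)/2 \text{ arrays}}
        + \underbrace{O\!\left(\frac{n}{p}\right)}_{\text{final reduction}}
        = O\!\left(\frac{n^3}{p}\right).
```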

4 Implementations and Results

In this section, we discuss the main strategies for the development of our algorithms. For the maximum subarray sum (Algorithm 1), two implementations were done: the first using MPI and the second using OpenMP. Afterwards, we implemented the related problems (Algorithm 4) using CUDA. It is important to state that Algorithm 4 also solves the maximum subarray sum problem. In the MPI solution we explored the distributed memory environment; with OpenMP and CUDA we explored the shared memory environment. In all these solutions, we sought to maximize local processing, running independent operations and thus avoiding unnecessary communication. The process of mapping and subsequent data division is illustrated in Figure 6.

4.1 Computational Resources

We ran the algorithms on two different computing systems (distributed and shared memory). The CUDA and OpenMP algorithms were executed on a machine with 8 GB of RAM, Linux (Ubuntu 14.04), an Intel(R) Core(TM) i5-2430M @ 2.40 GHz CPU (two cores with 4 threads each) and an NVIDIA GeForce GTX 680 with 1536 processing cores. The MPI algorithms were executed on a cluster with 64 nodes, each node consisting of 4 Dual-Core AMD Opteron 2.2 GHz processors with 1024 KB of cache and 8 GB of memory. All nodes are interconnected via a Foundry SuperX switch using Gigabit Ethernet. This cluster belongs to the High Performance Computing Virtual Laboratory (HPCVL) of Carleton University. In this environment, we performed tests with 16, 32 and 64 processors. In order to compute the speedup (distributed and shared memory environments), the sequential algorithm was run on the cluster environment and on the machine with CPU and GPU. Experiments with different input sizes were performed. For each input, the algorithms were run 10 times and the arithmetic mean of the running times was computed.



Fig. 6. The process of mapping: each processing unit receives the same number of subarrays C^{(g,h)}
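The paper does not list its mapping functions; one way to realize the balanced mapping of Figure 6 is to enumerate the pairs (g, h) row by row of the triangle and invert the linear index, as in this hypothetical Python sketch:

```python
def pair_from_index(t, n):
    """Maps a linear index t in [0, n(n+1)/2) back to the column pair
    (g, h), 1 <= g <= h <= n, enumerated as (1,1)..(1,n), (2,2)..(2,n),
    and so on, matching the global one-dimensional array of Figure 6."""
    g = 1
    remaining = n          # pairs sharing this g: (g, g), ..., (g, n)
    while t >= remaining:
        t -= remaining
        g += 1
        remaining -= 1
    return (g, g + t)
```

Each processing unit then takes a contiguous slice of indices, so all units receive the same number of subarrays up to rounding.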

4.2 Results

Below we present the results achieved by the implemented BSP/CGM algorithms. Three comparison cases between the algorithms are presented. In these cases we used two-dimensional arrays of integers generated randomly, and in all cases we used milliseconds as the time unit. In the speedup calculations, we used the parallel version with the best running time. For fair comparisons, the sequential algorithm was run on each environment; for this reason it has different times for the same inputs, as presented in the tables.

The First Comparison Case: The first case compares the running times of algorithms for the general maximum subarray sum problem. We conducted tests with three algorithms: two MPI and one OpenMP. The first MPI algorithm was described in a previous work of Alves et al. [1]. This algorithm involves data compression and considerable cooperation between the processors. Figure 7 and Table 1 show the results of the comparison between this algorithm and the sequential solution. The results confirm the efficiency of the solution, with increasing speedup values; however, an initial overhead seems to elevate the times with 64 processors.


Fig. 7. Sequential × MPI versions (Alves et al.)

Table 1. Running times (milliseconds): Sequential × MPI versions (16, 32 and 64 processors)

n×n        | Sequential   | MPI (16 p.) | MPI (32 p.) | MPI (64 p.) | Speedup (Seq./MPI 32 p.)
64×64      | 29.2992      | 6.8273      | 14.0064     | 5422.5232   | 2.0918
128×128    | 229.9130     | 18.6285     | 27.3484     | 6877.8257   | 8.4068
256×256    | 1849.6825    | 74.1179     | 72.2008     | 87.3313     | 25.6186
512×512    | 14879.4916   | 397.6235    | 302.5005    | 316.9507    | 49.1883
1024×1024  | 125239.8713  | 4403.7971   | 5783.0959   | 1279.9464   | 21.6562
2048×2048  | 1038533.5501 | 20241.3991  | 17722.9556  | 29944.1617  | 58.5982
4096×4096  | 8460234.4575 | 158989.7957 | 81369.5981  | 58257.9493  | 103.9729

The second test represents our MPI solution for the maximum subarray sum problem (Algorithm 1). Unlike the algorithm of Alves et al. [1], in our solution each processor is responsible for computing a set of subarrays C^{(g,h)}. Furthermore, we do not apply data compression. This is an important factor for extending the general solution more easily to the solutions of the related problems. Figure 8 and Table 2 show the results of the comparison between Algorithm 1 and the sequential algorithm [3]. In this test we also obtained good speedup results; however, the values decrease for large inputs.


Fig. 8. Sequential × MPI versions (Algorithm 1)

The third test represents our OpenMP implementation of Algorithm 1. In this test we worked with 8 threads, because the available machine had a processor with two cores, each with 4 threads. Figure 9 and Table 3 show the results of the comparison with the sequential algorithm [3].

Table 2. Running times (milliseconds): Sequential × MPI (16, 32 and 64 processors)

n×n        | Sequential   | MPI (16 p.) | MPI (32 p.) | MPI (64 p.) | Speedup (Seq./MPI 32 p.)
64×64      | 29.2992      | 0.7725      | 0.9899      | 1.1785      | 29.5981
128×128    | 229.9130     | 2.0684      | 1.6854      | 1.6009      | 136.4145
256×256    | 1849.6825    | 15.1679     | 8.5535      | 5.3154      | 216.2486
512×512    | 14879.4916   | 280.3312    | 145.2910    | 77.4639     | 102.4117
1024×1024  | 125239.8713  | 3748.8917   | 1896.4386   | 966.3708    | 66.0395
2048×2048  | 1038533.5501 | 31783.7091  | 15835.1481  | 8046.7096   | 65.5841
4096×4096  | 8460234.4575 | 261292.5225 | 130540.967  | 65800.121   | 64.8090

Fig. 9. Sequential × OpenMP (Algorithm 1)

Table 3. Running times (milliseconds): Sequential × OpenMP

n×n        | Sequential  | OpenMP      | Speedup
64×64      | 360.9985    | 2.1495      | 167.9453
128×128    | 360.8055    | 3.4071      | 105.8981
256×256    | 360.3314    | 19.8622     | 18.1416
512×512    | 1171.1028   | 126.3485    | 9.2688
1024×1024  | 10099.6082  | 2221.9307   | 4.5454
2048×2048  | 96861.6133  | 25647.4277  | 3.7767
4096×4096  | 835974.2750 | 242314.7492 | 3.4500

The Second Comparison Case: The second comparison case shows the test based on the CUDA implementation of Algorithm 4. Two versions were implemented, with and without the triangular load balancing described in Section 4. Figure 10 and Table 4 show the results of the comparison with the sequential solution. It is possible to note that the version with triangular balancing is slightly more efficient. In this implementation, we obtained speedup values that increase continually.

Fig. 10. Sequential × CUDA versions

Table 4. Running times (milliseconds): Sequential × CUDA

n×n        | Sequential  | CUDA      | CUDA Triangular | Speedup (Seq./CUDA)
64×64      | 360.9985    | 362.1512  | 271.7961        | 1.3282
128×128    | 360.8055    | 362.0914  | 272.1506        | 1.3258
256×256    | 360.3314    | 362.9239  | 271.9573        | 1.3250
512×512    | 1171.1028   | 365.0831  | 301.2226        | 3.8878
1024×1024  | 10099.6082  | 741.5493  | 826.9745        | 12.2127
2048×2048  | 96861.6133  | 879.3046  | 697.1044        | 138.9485
4096×4096  | 835974.2750 | 2699.4783 | 2525.3061       | 331.0388


The Third Comparison Case: In the third and final case, we compared the speedup results of our GPU implementation with the results of a recent work by Cleber et al. [6]. In both cases, GPU algorithms were developed for the maximum subarray sum problem. Importantly, the computing resources involved in the two cases were different. The comparative results are shown in Table 5. We observe that our speedup values are smaller, but the difference tends to decrease; for the last analyzed input (n = 4096), our value is better. We believe this difference occurs because our algorithm has a higher workload, since it produces solutions for the maximum subarray sum and the related problems concomitantly.

Table 5. The best speedup results

n×n        | Speedup (Cleber et al.) | Speedup (Algorithm 4)
512×512    | 80.2260                 | 3.8878
1024×1024  | 121.9900                | 12.2127
2048×2048  | 215.2870                | 138.9485
4096×4096  | 235.9650                | 331.0388

4.3 Image Application

In medical diagnosis based on X-ray images, the areas of interest are usually the brightest or most luminous regions (those with more white pixels). In this context, a real application of the maximum subarray sum was presented by Raouf et al., who located the brightest regions in mammography images in order to detect macrocalcifications [11]. Bitmap images are a good example of input for the maximum subarray sum problem. In this case the RGB pixels are converted to a two-dimensional array of integers or real numbers. This conversion process uses the luminance value. The luminance expresses the color intensity and, for the human vision system, corresponds to the sense of brightness [12]. The luminance of each pixel can be measured by the equation: Luminance = 0.30R + 0.59G + 0.11B. The pixels with higher luminance are converted into higher numerical values, and lower values (sometimes negative) are assigned to the pixels with lower luminance [2]. The mapping condition is as follows: if the RGB values of the pixel are within the interval [0, 1], then the luminance value of the pixel will also be within the interval [0, 1]. In this work, however, we applied a correction factor of −0.5 in order to change the interval from [0, 1] to [−0.5, +0.5] and consequently obtain an array of positive and negative values. We also applied our algorithm to real images. Figure 12 shows the result obtained by our algorithm on a real X-ray image of the human colon. The image belongs to a database of images of various types of cancer [8]. The brightest region is indicated by the dotted rectangle (Figure 12(d)). Figure 11 illustrates our image analysis process. Initially, each pixel of the image is converted to an integer using a mapping algorithm. This algorithm uses the luminance formula and the proposed correction factor. Then, the pixels are placed in a two-dimensional array of the same size as the original image.
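The mapping step described above can be sketched as follows (`to_luminance_array` is a hypothetical name; RGB components are assumed to be normalized to [0, 1]):

```python
def to_luminance_array(pixels):
    """Converts a grid of (r, g, b) pixels with components in [0, 1]
    to a 2-D array of luminances shifted by the paper's correction
    factor of -0.5, so the result lies in [-0.5, +0.5]."""
    return [[0.30 * r + 0.59 * g + 0.11 * b - 0.5 for (r, g, b) in row]
            for row in pixels]
```

The resulting array of positive and negative values is then fed directly to the maximum subarray algorithm.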
The two-dimensional array is the input to the next stage, which is the location of the area of interest (in this case, the maximum subarray sum). At the end of this step, the output consists of four points, which define


Fig. 11. The steps for finding the brightest region in an X-ray image: Bitmap Image → Mapping Algorithm → Bidimensional Array → Maximum Subarray Algorithm → Area of Interest Located → Rendering Algorithm → Brightest Region Located

the area of interest (a rectangle). These points are used as input for our rendering algorithm, which finally shows the original image with the region of interest highlighted. The four points (x, y) that define the rectangular region are (319, 245), (319, 285), (392, 245) and (392, 285), as observed in Figure 12(c). The maximum subarray sum has a value of 132382. In our rendering algorithm we used the devIL library (Developer's Image Library, http://openil.sourceforge.net/) and the GraphicsMagick package (http://graphicsmagick.org/), both of which are open source and dedicated to image processing.

5

Conclusion and Future Work

In this work we proposed efficient BSP/CGM parallel solutions to the maximum subarray sum (Algorithm 1) and five related problems (Algorithm 4): the maximum largest subarray sum, the maximum smallest subarray sum, the number of subarrays of maximum sum, the selection of the subarray with the k-maximum sum, and the location of the subarray with the maximum relative density sum. To the best of our knowledge, there are no other parallel algorithms for these related problems. Our algorithms use p processors and require O(n³/p) parallel time with a constant number of communication rounds.

The good performance of the BSP/CGM parallel algorithms is confirmed by experimental results. Unlike the solution presented by Alves et al. [1], our algorithms do not use data compression and, whenever possible, keep the processors running independently of each other. These strategies were essential to the solutions of the related problems. The results confirmed that, even without data compression, our implementations are effective, with speedups dozens of times better than the sequential solution. It is important to observe that our CUDA implementation (Algorithm 4) solves the general problem and five other problems related to the maximum subarray sum and, even so, obtained a speedup of up to 331 times over the sequential solution.

In addition, an image application experiment is also part of our results. We showed that the algorithm can be applied successfully to the localization of the brightest regions in X-ray images. We also conclude that the BSP/CGM model can be used efficiently to design parallel algorithms in GPGPU/CUDA environments. Despite all the details of these environments, we mapped our algorithms into implementations with good speedups, where a single GPGPU achieved better performance than a cluster of workstations. We believe that the abstraction of Figure 2 can be useful for designing BSP/CGM algorithms for GPGPU environments.
Finally, we also believe that the speedup of our CUDA algorithms can be further improved through the use of multiple GPUs.

Fig. 12. (a) Bitmap image of an X-ray of the colon [8]. (b) Two-dimensional (512 × 512) array generated after the mapping. (c) Location of the maximum subarray sum. (d) The input image with the brightest region highlighted (after the rendering of the brightest region).

5.1

Future Work

A first possibility of future work consists in extending our solutions to address the three-dimensional version of the maximum subarray sum problem. To the best of our knowledge, there are no parallel algorithms for this problem. We believe, however, that practical and efficient parallel solutions can be developed based on the same principles. With this, image-visualization applications could be explored in even more detail.


References

1. Alves, C.E.R., Cáceres, E.N., Song, S.W.: BSP/CGM algorithms for maximum subsequence and maximum subarray. In: Kranzlmüller, D., Kacsuk, P., Dongarra, J. (eds.) EuroPVM/MPI 2004. LNCS, vol. 3241, pp. 139–146. Springer, Heidelberg (2004)
2. Bae, S.E.: Sequential and Parallel Algorithms for the Generalized Maximum Subarray Problem. PhD thesis, University of Canterbury, Christchurch, New Zealand (2007)
3. Bentley, J.: Programming pearls: algorithm design techniques. Commun. ACM 27(9), 865–873 (1984)
4. Dehne, F., Ferreira, A., Cáceres, E.N., Song, S.W., Roncato, A.: Efficient parallel graph algorithms for coarse grained multicomputers and BSP. Algorithmica 33(2), 183–200 (2002)
5. Dehne, F., Fabri, A., Rau-Chaplin, A.: Scalable parallel computational geometry for coarse grained multicomputers. International Journal on Computational Geometry 6, 298–307 (1994)
6. Ferreira, C.S., Camargo, R.Y., Song, S.W.: A parallel maximum subarray algorithm on GPUs. In: IEEE International Symposium on Computer Architecture and High Performance Computing Workshops, pp. 12–17, October 2014
7. Gotz, S.M.: Communication-Efficient Parallel Algorithms for Minimum Spanning Tree Computation. PhD thesis, University of Paderborn, May 1998
8. National Cancer Institute: Cancer Imaging Archive (2015). https://public.cancerimagingarchive.net (accessed 11-Jan-2014)
9. Perumalla, K., Deo, N.: Parallel algorithms for maximum subsequence and maximum subarray. Parallel Processing Letters 5, 367–373 (1995)
10. Qiu, K., Akl, S.G.: Parallel maximum sum algorithms on interconnection networks. Technical report, Queen's University, Dept. of Computing, Ontario, Canada (1999)
11. Saleh, S., Abdellah, M., Raouf, A.A.A., Kadah, Y.M.: High performance CUDA-based implementation for the 2D version of the maximum subarray problem (MSP). In: Cairo International Biomedical Engineering Conference (CIBEC) (2012)
12. ITU International Telecommunication Union: Recommendation ITU-R BT.470-6,7, Conventional Analog Television Systems (1998). http://www.itu.int/rec/R-REC-BT.470-6-199811-S/en (accessed 03-Jan-2015)
13. Wen, Z.: Fast parallel algorithms for the maximum sum problem. Parallel Computing 21(3), 461–466 (1995)