SIAM J. SCI. COMPUT. Vol. 16, No. 6, pp. 1478-1511, November 1995
© 1995 Society for Industrial and Applied Mathematics

BLOCK LANCZOS TECHNIQUES FOR ACCELERATING THE BLOCK CIMMINO METHOD*

MARIO ARIOLI†, IAIN S. DUFF‡, DANIEL RUIZ§, AND MILOUD SADKANE¶

Abstract. We consider acceleration techniques for the block Cimmino iterative method for solving general sparse systems. For iteration matrices that are not too ill conditioned, the conjugate gradient algorithm is a good method for accelerating the convergence. For ill-conditioned problems, the use of preconditioning techniques can improve the situation, but the classical conjugate gradient acceleration still performs poorly because of clusters of eigenvalues at the ends of the spectrum of the iteration matrix. We therefore try variants of the block Lanczos method, including the block conjugate gradient method. On some test examples, these techniques are convergent whereas the conjugate gradient method is not.

Key words. sparse matrices, block iterative methods, block conjugate gradient, numerical stability, efficiency analysis

AMS subject classifications. 65F50, 65F10, 65F35, 65Y05

1. Introduction. We consider a block projection method for the solution of the linear equations

(1)    Ax = b,

where A is a large nonsingular sparse unsymmetric matrix of order n. The blocks are obtained by partitioning the system (1) into strips of rows, A^i and b^i, 1 ≤ i ≤ p ≤ n. The block Cimmino method projects the current iterate simultaneously onto the manifolds corresponding to the strips and takes a convex combination of all the resulting vectors. If we assume, for simplicity, that the block rows A^i have full row rank, we can define the Moore-Penrose pseudoinverse of A^i by A^{i+} = A^{iT} (A^i A^{iT})^{-1}, and the block Cimmino algorithm can be described in the following way.

Algorithm 1.1 (block Cimmino method)
Choose x^(0), set k = 0
repeat until convergence
begin
  do in parallel

(2)    δ_i^(k) = A^{i+} (b^i − A^i x^(k)),    i = 1, ..., p

  end parallel

  x^(k+1) = x^(k) + ν Σ_{i=1}^{p} δ_i^(k)

  set k = k + 1
end

*Received by the editors August 13, 1992; accepted for publication (in revised form) September 6, 1994. The work of these authors was performed while the authors were at Centre Européen de Recherche et de Formation Avancée en Calcul Scientifique, 31057 Toulouse Cedex, France.
†Istituto di Analisi Numerica, Consiglio Nazionale delle Ricerche, Via Abbiategrasso 209, 27100 Pavia, Italy (arioli@bond.ian.pv.cnr.it).
‡Rutherford Appleton Laboratory, Didcot, Oxon OX11 0QX, England (I.Duff@rutherford.ac.uk).
§École Nationale Supérieure d'Électrotechnique, d'Électronique, d'Informatique et d'Hydraulique de Toulouse, Institut de Recherche en Informatique de Toulouse, 2 rue Camichel, 31071 Toulouse Cedex, France (ruiz@enseeiht.fr).
¶Institut de Recherche en Informatique et Systèmes Aléatoires, Campus Universitaire de Beaulieu, 35042 Rennes Cedex, France (sadkane@irisa.fr).
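To make the structure of Algorithm 1.1 concrete, the following is a minimal dense-matrix sketch (not the authors' implementation), assuming NumPy. The strips are passed as a hypothetical list of row-index arrays, each pseudoinverse A^{i+} is applied through a least-squares solve, and the relaxation parameter ν discussed just below is included explicitly.

```python
# A minimal sketch of the block Cimmino iteration (Algorithm 1.1), assuming NumPy.
# "strips" is a hypothetical list of row-index arrays defining the block rows A^i;
# A^{i+} is applied via a least-squares solve rather than by forming
# (A^i A^{iT})^{-1} explicitly.
import numpy as np

def block_cimmino(A, b, strips, x0, nu=1.0, maxit=1000, tol=1e-12):
    x = x0.copy()
    for _ in range(maxit):
        delta = np.zeros_like(x)
        for rows in strips:                 # the p projections can run in parallel
            Ai, bi = A[rows, :], b[rows]
            # delta_i = A^{i+} (b^i - A^i x): minimum-norm solution of A^i d = b^i - A^i x
            di, *_ = np.linalg.lstsq(Ai, bi - Ai @ x, rcond=None)
            delta += di
        x += nu * delta                     # x^(k+1) = x^(k) + nu * sum_i delta_i^(k)
        if np.linalg.norm(b - A @ x) <= tol * np.linalg.norm(b):
            break
    return x
```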


where ν is a relaxation parameter. The conjugate gradient acceleration of the block Cimmino method, and all related acceleration techniques, are not influenced by the value of the relaxation parameter ν. Thus, we will consider from now on that ν is equal to 1. If the block rows A^i are nearly mutually orthogonal, i.e., AA^T is strongly block-diagonally dominant, we can expect that the method will converge very quickly. To this end, we exploit algorithms that maximize the minimum element on the diagonal, and that reorder A to block tridiagonal form. The case of block tridiagonal matrices, which is very common in partial differential equation (PDE) discretization problems, for instance, is very important because, for such structures, we can introduce a partitioning with a high degree of parallelism and fast convergence. Consider, for instance, a block tridiagonal matrix A with blocks of size l × l, partitioned in block rows A^i, i = 1, 2, ..., p, so that each A^i has k_i l rows with k_i ≥ 2. It can be reordered as

(3)    A  →  [ B_1 ; B_2 ],    with  B_1 = [ A^1 ; A^3 ; A^5 ; ... ]  and  B_2 = [ A^2 ; A^4 ; ... ].

In this way, we can identify a partitioning in which A^i will be structurally orthogonal to A^{i+2} (i = 1, 2, ..., p − 2), and which is therefore equivalent to a partitioning in two blocks B_1 and B_2, where B_1 = ∪_i {A^i : i odd} and B_2 = ∪_i {A^i : i even}. We denote by m_1 and m_2 the number of rows in B_1 and B_2, respectively, and by interface block, the block of smallest size (for instance B_2 in the above example). For such a partitioning, Elfving (1980) shows that the eigenvalues of the iteration matrix of the block Cimmino method are the cosines of the principal angles between the two manifolds defined by these two blocks. In this case, the spectrum of the iteration matrix lies strictly between −1 and 1, is symmetric around 0, and has many eigenvalues clustered around 0. There are at most 2m_2 nonzero eigenvalues in the iteration matrix, so that the block Cimmino algorithm with conjugate gradient acceleration in general works better for small interface block sizes, and in exact arithmetic takes no more than 2m_2 steps for convergence. Our aim, in this paper, is to concentrate on the acceleration process. This is crucial because of the slow convergence often achieved by Algorithm 1.1 alone. More details about the implementation, the convergence, and the parallelism of the method can be found in Arioli, Duff, Noailles, and Ruiz (1992). For iteration matrices that are not too ill conditioned, the conjugate gradient algorithm is a good method for accelerating the convergence of the iterative scheme, and generally achieves fast convergence. In these cases, efficiency is mainly improved through the choice of the partitioning, which affects the amount of computation and the degree of parallelism of the method (see Kamath and Sameh (1988), Bramley and Sameh (1992), and Arioli, Duff, Noailles, and Ruiz (1992)). But for ill-conditioned problems, the convergence is slow; for example, the convergence curve of the residuals shows a long plateau followed eventually by a sharp drop. Preconditioning techniques, such as those discussed in §2, help by shortening the plateau but do not remove it. In the following, we are principally concerned with ill-conditioned problems. In §3 we introduce a particular test problem of this kind and discuss its characteristics. For such test problems, clusters of eigenvalues at the ends of the spectrum can be observed, which can cause poor convergence of the classical conjugate gradient method. In §4, we accelerate


the block Cimmino method using variants of the block Lanczos method for solving linear systems, including the block conjugate gradient method. In contrast to the classical conjugate gradient algorithm, these block techniques allow the computation of multiple eigenvalues, and thus provide a nice ability for detecting clusters of eigenvalues in finite arithmetic. These basic properties are recalled in §5. However, the computations in the block Lanczos methods involve the solution of symmetric positive definite (SPD) systems rather than the division by scalars as in the classical conjugate gradient algorithm. Whereas the scalar in this division has a condition number of 1, the matrices in the corresponding block operation may be very ill conditioned, especially when convergence is almost reached. This can amplify round-off errors in the computations, as illustrated in §6, and disrupt the convergence. From among this class of methods, we identify in §7 those that are numerically stable and study their efficiency. Although these block acceleration techniques increase the amount of arithmetic, they can still be more efficient in a parallel environment since they allow the use of efficient level 3 BLAS kernels. Furthermore, in some test examples, these techniques are convergent whereas the conjugate gradient method is not. These last two points are illustrated in §8 and §9. We try to see in §10 if the block conjugate gradient algorithm still presents the same potential in a more general framework, e.g., matrices other than those coming from the acceleration of the block Cimmino method with a two-block partitioning strategy. Some concluding remarks are presented in §11.

2. Some experiments and comments. One of our main goals was to determine the limitations of the method. The results (see Kamath and Sameh (1988), Bramley and Sameh (1992), and Arioli, Duff, Noailles, and Ruiz (1992)) showed that the block Cimmino method with conjugate gradient acceleration could be considered as a robust solver, since in most cases it would converge and give an accurate solution. We also observed that our results could almost be separated into two classes. For iteration matrices that are not too ill conditioned, the conjugate gradient algorithm is a good method for accelerating the convergence of the iterative scheme. But for ill-conditioned problems, the convergence is slow; e.g., the iteration profile of the residuals shows a long plateau followed eventually by a sharp drop. We then focused on the second class of test problems in order to understand the reasons for this poor convergence and to see if there were some ways to avoid it. The results we obtained from these particular experiments are well illustrated by experiments on two test problems coming from the Harwell-Boeing set of sparse matrices (see Duff, Grimes, and Lewis (1989)). Matrix SHERMAN1 is a sparse symmetric matrix of order n = 1000, from oil reservoir modelling programs, and arises from a three-dimensional simulation of black oil. Matrix GRE1107 arises in the simulation of computer systems. The matrix is unsymmetric, of order n = 1107, and is very ill conditioned. Its classical condition number in the infinity norm (‖A‖_∞ ‖A^{-1}‖_∞) is reported by Arioli, Demmel, and Duff (1989) to be equal to 1.8 × 10^10. For all these test problems, we generate the right-hand side vector b from a given solution vector x* with x*_i = n/i, i = 1, ..., n. This simple solution vector has been chosen to avoid special case homogeneous and/or sparse solution vectors.
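As a small illustration of how such test problems are set up, the following sketch (assuming NumPy) builds the right-hand side from the solution vector with components x*_i = n/i described above.

```python
# Sketch only: generate b = A x* with x*_i = n/i, i = 1, ..., n.
import numpy as np

def make_rhs(A):
    n = A.shape[0]
    x_star = n / np.arange(1.0, n + 1)   # x*_i = n/i
    return A @ x_star, x_star
```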
For such test problems, we have investigated and combined different approaches aimed at improving the conditioning of the resulting block Cimmino iteration matrix as well as the eigenvalue clustering of its spectrum. The test problems are first preprocessed before applying any kind of partitioning. We first perform row scaling in order to avoid problems due to bad scaling of the original data. To improve the diagonal dominance, the rows of the original matrix were then reordered to maximize the minimum entry on the diagonal, and a symmetric permutation (see Cuthill and McKee (1969)) was computed to extract a block tridiagonal structure with small bandwidth. The resulting block tridiagonal matrix was then partitioned in a two-block partitioning with minimum interface block size, as suggested in §1, in order to obtain reasonably fast convergence.
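The preprocessing just described can be sketched roughly as follows (an outline under stated assumptions, not the authors' implementation), using SciPy. The unsymmetric row permutation that maximizes the minimum diagonal entry (an MC64-style bottleneck matching) is omitted here; only the row scaling and the Cuthill-McKee symmetric permutation are shown.

```python
# Rough sketch of the preprocessing pipeline, assuming SciPy.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

def preprocess(A):
    A = sp.csr_matrix(A, dtype=float)
    # 1. Row scaling: divide each row by its largest absolute entry.
    r = abs(A).max(axis=1).toarray().ravel()
    A = sp.diags(1.0 / np.where(r > 0, r, 1.0)) @ A
    # (A row permutation maximizing the minimum diagonal entry would be applied
    #  here; it requires a weighted bottleneck matching and is omitted.)
    # 2. Symmetric Cuthill-McKee permutation to obtain a small bandwidth, from
    #    which a block tridiagonal two-block partitioning can be extracted.
    p = reverse_cuthill_mckee(sp.csr_matrix(A + A.T), symmetric_mode=True)
    return A[p, :][:, p]
```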



Additionally, to improve the convergence, we tried three different preconditioners that preserve the structural orthogonality between the blocks of the given two-block partitioning. These preconditioners are based on ellipsoidal norm considerations, where the orthogonal projections onto the manifolds defined by the strips of rows A^i and b^i involved in the block Cimmino algorithm (see §1) are replaced by oblique ones. When doing this, a generalized pseudoinverse (see Rao and Mitra (1971))

(4)    A^{i+}_{G^{-1}} = G^{-1} A^{iT} (A^i G^{-1} A^{iT})^{-1}

must be defined for each block row A^i, where G is the SPD matrix defining the ellipsoidal norm. Matrix (4) is used instead of the usual Moore-Penrose pseudoinverse of A^i in iteration (2). Additionally, the use of ellipsoidal norms is equivalent to preconditioning on the right side of the original matrix A with the matrix G^{-1/2}. We have not performed an extensive study of preconditioning techniques associated with ellipsoidal norms since their effectiveness is strongly related to the partitioning of the systems, which may also be influenced by the preprocessing of the original matrix. Our main motivation is to highlight that they can be trivially accommodated in our implementation of row projection methods. It is also natural to think that the use of ellipsoidal norms will not only have an influence on the profile of the convergence, but will also cause more work per iteration. Therefore, a substantial reduction in the number of iterations may not overcome the loss of efficiency due to the extra number of operations performed. This was one of the main drawbacks we experienced in using ellipsoidal norms. The other problems we encountered concerned the load balancing of the tasks. Indeed, depending on the ill conditioning of subproblems, the use of ellipsoidal norms can affect the amount of computation in the solution of each subproblem in an unquantifiable way. All these observations are more fully illustrated experimentally by Arioli, Duff, Noailles, and Ruiz (1992) and by Ruiz (1992). The three preconditioners of this kind we have considered act on the blocks of columns C^i which overlap only two block rows A^i and A^{i+1}. The first one is a simple diagonal scaling, dividing the columns C^i by a given number α in an attempt to open the principal angles between the two subspaces involved in the two-block partitioning (see Arioli, Duff, Noailles, and Ruiz (1992)). The second preconditioner is a block diagonal one and replaces the block diagonal part, D_i, in each set of overlapping columns C^i by the identity matrix. This would make the set of rows associated with each block D_i orthogonal, but relies on the assumption that the diagonal blocks D_i are invertible, which may not always be the case. At any rate, the C^i should be of full rank; otherwise the matrix A would be rank deficient. The third preconditioner simply aims to orthonormalize these columns. These last two preconditioners can additionally be combined with the diagonal scaling, and the resulting preconditioning techniques will be referred to as "central-block preconditioning" and "column-block preconditioning," respectively.

In the following, we are mainly concerned with the behaviour of the block Cimmino method. We have introduced the above preprocessing and preconditioning aspects very briefly in order to settle the context in which the following discussions take place, and to help the reader understand a little better the systems to which the acceleration technique will be applied. The various aspects of our method have all been tested by runs on an eight processor Alliant FX/80. In all the test examples, the conjugate gradient iterations are initialized with a zero starting guess, and the normwise backward error, which is used to monitor the convergence, is the scaled residual ω_k.
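For concreteness, a small helper computing such a scaled residual is sketched below (assuming NumPy). The precise norms used in the paper are not visible at this point in the text, so the formula and the infinity norm chosen here are assumptions.

```python
# Hedged sketch of a normwise backward error / scaled residual of the form
#   omega = ||b - A x|| / (||A|| ||x|| + ||b||),
# using the infinity norm (an assumption, not necessarily the paper's exact choice).
import numpy as np

def scaled_residual(A, x, b):
    num = np.linalg.norm(b - A @ x, np.inf)
    den = np.linalg.norm(A, np.inf) * np.linalg.norm(x, np.inf) + np.linalg.norm(b, np.inf)
    return num / den
```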

for j ≥ 1, and for j = 0 if we set p^(0) = q^(0). The second process is just a minimization of the GH-norm of the error over the Krylov space K(H, r^(0), j + 1), which, because of the availability of a GH-orthogonal basis for this Krylov space, can be carried out iteratively in a very simple way; viz.,

(11)    x^(j+1) = x^(j) + p^(j) (p^(j)T G H p^(j))^{-1} p^(j)T G r^(j),

with r^(j) = k − H x^(j) the residual vector at step (j). The main simplification in the conjugate gradient algorithm, which links the above two processes together, results from the observation that the residual vectors can also be written as

(12)    r^(j+1) = r^(j) − H p^(j) (p^(j)T G H p^(j))^{-1} p^(j)T G r^(j),

which closely resembles equation (11). Thus, if the set of vectors {q^(0), ..., q^(j)} is identified with the set of residuals {r^(0), ..., r^(j)}, then the computation of the G-orthogonal basis for


the Krylov space can be overlapped with the computation of the residuals in the minimization process. But this is not mandatory, and, indeed, in the Lanczos algorithm, the set of v^(j) vectors is not only G-orthogonal but also G-orthonormal and thus does not correspond to the set of residual vectors at all. In the following, for the block conjugate gradient algorithm, we will decouple the computation of the residuals from that of the G-orthogonal basis for the block Krylov space, which we will G-orthonormalize as in the block Lanczos algorithm. The availability of the GH-orthogonal basis for the Krylov space will still enable easy updates of the iterates and residuals, as shown in Algorithm 4.2. In order to minimize the amount of extra computation, we will perform a GH-orthonormalization of the P^(j) matrices instead of a QR factorization as indicated in Algorithm 4.2. This considerably simplifies the computation of the γ_j and α_j matrices, and results, after simplification of the projection steps (11) and (10), in Algorithm 7.1.

Algorithm 7.1 (stabilized block conjugate gradient acceleration)

X^(0) is arbitrary, R^(0) = K − H X^(0),
R̄^(0) = R^(0) γ_0^{-1} such that R̄^(0)T G R̄^(0) = I,
P^(0) = R̄^(0) β_0^{-1} such that P^(0)T G H P^(0) = I,
for j = 0, 1, 2, ... until convergence do:

(13)    R̄^(j+1) = (R̄^(j) − H P^(j) α_j) γ_{j+1}^{-1},   with α_j = P^(j)T G R̄^(j),   such that R̄^(j+1)T G R̄^(j+1) = I,

(14)    P^(j+1) = (R̄^(j+1) − P^(j) P^(j)T G H R̄^(j+1)) β_{j+1}^{-1}   such that P^(j+1)T G H P^(j+1) = I,

(15)    X^(j+1) = X^(j) + P^(j) P^(j)T G R^(j),   where R^(j) = K − H X^(j).
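For orientation only, the sketch below (assuming NumPy) shows a plain, unstabilized block conjugate gradient iteration with G = I and without the γ_j / β_j re-orthonormalizations of Algorithm 7.1; it is meant only to illustrate the block operations (small s × s systems in place of scalar divisions) that the stabilized algorithm reorganizes, and is not the authors' algorithm.

```python
# Unstabilized block CG sketch (G = I), assuming NumPy.  H is SPD and K holds
# s right-hand sides.  The s-by-s matrix P^T H P replaces the scalar division
# of classical CG and may become very ill conditioned near convergence, which
# is exactly the difficulty addressed by the stabilization above.
import numpy as np

def block_cg(H, K, X0, tol=1e-12, maxit=500):
    X = X0.copy()
    R = K - H @ X                 # block residual, n-by-s
    P = R.copy()                  # block search directions
    for _ in range(maxit):
        HP = H @ P
        alpha = np.linalg.solve(P.T @ HP, P.T @ R)     # s-by-s system
        X = X + P @ alpha
        R = R - HP @ alpha
        if np.linalg.norm(R) <= tol * np.linalg.norm(K):
            break
        beta = np.linalg.solve(P.T @ HP, -(HP.T @ R))  # keep new P H-conjugate to P
        P = R + P @ beta
    return X
```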

As for the V matrices in the block Lanczos algorithm, the R̄^(j) of the block case do not correspond to the residuals. However, the update of the next iterate X^(j+1), which results from an oblique projection, still involves the residuals R^(j). Obviously, computing the residuals as indicated in (15) is not very wise since usually the most time consuming part of the algorithm is the computation of the product of the matrix H with any vector or set of vectors. The computation of the residuals could still be done as in (12):

(16)    R^(j+1) = R^(j) − H P^(j) P^(j)T G R^(j).

This would cause extra overhead since it is necessary to keep the residual R^(j) and to compute the additional matrix by matrix product P^(j)T G R^(j). However, since equation (13) comes from the simplification of the projection step (11), we can also write

(17)    R̄^(j+1) γ_{j+1} = R̄^(j) − H P^(j) P^(j)T G R̄^(j).

Using (16) and (17), it can be shown by recurrence that the residuals R^(j) are linked to the R̄^(j) matrices by the reverse product of the γ_i matrices; viz.,

R^(j) = R̄^(j) γ_j γ_{j−1} ⋯ γ_0.


Thus, the update of X^(j+1) becomes

X^(j+1) = X^(j) + P^(j) P^(j)T G R̄^(j) (γ_j γ_{j−1} ⋯ γ_0),

which is equivalent to

(18)    X^(j+1) = X^(j) + P^(j) P^(j)T G R̄^(j) ∏_{i=j}^{0} γ_i.

With (18) replacing (15), the stabilized block conjugate gradient algorithm strongly resembles Algorithm 4.2. Not much overhead comes from the manipulation of the γ_j matrices. Moreover, the update of X^(j) can be considerably simplified if one is interested in the solution of one system only, as is the case here. Indeed, if the columns of the R̄^(j) are not permuted, the γ_j transformation matrices can be considered as upper triangular, and consequently the matrix ∏_{i=j}^{0} γ_i is also upper triangular. Thus, if the solution vector we are looking for is kept in the first position in the columns of X^(j), the contribution of the matrix ∏_{i=j}^{0} γ_i to the update of this first column vector is reduced to

X_1^(j+1) = X_1^(j) + P^(j) g_j ∏_{i=j}^{0} γ_i(1, 1),

where g_j is the first column of the matrix α_j. This trick, which saves computation and storage, is used in all the following tests. Note also that in the computation of the upper triangular matrices γ_i, which result from a Gram-Schmidt process for instance, γ_i(1, 1) is always the most accurately computed element. As can be seen in Fig. 7, Algorithm 7.1 is as stable as the block Lanczos algorithm, and is the one that achieves the least normwise backward error for any block size. In fact, the ill conditioning of the residuals is now included in the reverse product of the transformation matrices γ_j. This matrix actually becomes zero when convergence is reached, and the iterates X^(j) are not updated any more. Moreover, each of these γ_j can be expected to be much less ill conditioned than the residuals themselves, as the ill conditioning of the residuals is distributed over all these γ_j matrices. Therefore, the building of the basis of the Krylov space, which never involves the matrix ∏_{i=j}^{0} γ_i, should be significantly improved. This is reflected by the numbers in Table 2 and by the convergence of the algorithm in Fig. 7. Before ending this section, we would like to discuss the monitoring of the rank deficiency in these block techniques, which was introduced for the monitored block conjugate gradient algorithm and was not mentioned afterwards for the block Lanczos and the stabilized block conjugate gradient algorithms. In the latter two algorithms, the rank deficiency can also be detected when building the G-orthonormal matrices V^(j) or R̄^(j). In the previous discussions, we did not really indicate how this can be done. One solution is to perform a rank revealing QR factorization of the P^(j) matrices and to detect the potentially linearly dependent vectors on the basis of a threshold. The point is that this monitoring is not really necessary for either the block Lanczos algorithm or the stabilized block conjugate gradient algorithm. What we observed in our experiments is that, after an increase in the condition number of the γ_j matrices in the stabilized block conjugate gradient algorithm, which reaches its maximum just a few steps before the maximum accuracy is obtained, the condition number of these matrices


FIG. 7. Behaviour of the stabilized block conjugate gradient acceleration on problem PORES3 (N = 532, with column-block preconditioning): normwise backward error versus iteration number for Stabilized Block CG(4), CG(6), and CG(8).

TABLE 2
Stabilized block conjugate gradient on problem PORES3, with block size equal to 8. Computations performed on an Alliant FX/80 (8 processors).

Iteration number   Reciprocal condition number       Reciprocal condition number
                   of the reverse product ∏ γ_i      of γ_j
8                  2.27E-05                          3.54E-04
9                  8.54E-08                          1.89E-04
10                 5.09E-11                          1.35E-04
11                 1.58E-10                          9.83E-01
12                 2.66E-06                          3.32E-01

decreases again and stabilizes itself at a low value on convergence. On the other hand, the reverse product of the γ_j matrices has a norm that is almost zero after convergence. Indeed, in finite arithmetic, exact convergence seldom occurs, and there exists instead a sort of "ill-conditioning barrier," which happens just before convergence is reached. The modifications we propose to the conjugate gradient algorithm enable the stabilized block conjugate gradient algorithm to go through this "ill-conditioning barrier" and to reach accurate solutions.

8. The use of level 3 BLAS kernels. The condition numbers shown in Table 2 allow us now to use other algorithms than modified Gram-Schmidt. For example, we can build the normal equations and factorize them using Cholesky factorization. The advantage of this strategy is that we can use level 3 BLAS kernels (see Dongarra, Du Croz, Duff, and Hammarling (1990), Daydé and Duff (1991), Anderson, Bai, Bischof, et al. (1992)) to take more advantage of the vector and parallel features of the target computer. At each iteration of the block conjugate gradient algorithm, the subsystems coming from the blocks (see Algorithm 1.1) must be solved for the s different linear systems, and this can also be performed using level 3 BLAS kernels.
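As an illustration of the normal-equations-plus-Cholesky alternative to modified Gram-Schmidt mentioned above (often called CholQR), the following sketch (assuming NumPy/SciPy) orthonormalizes a block of s vectors using only matrix-matrix operations, i.e., level 3 BLAS kernels; the G inner product is taken to be the identity for simplicity.

```python
# Sketch of block orthonormalization via the normal equations and a Cholesky
# factorization (CholQR).  The heavy operations are matrix-matrix kernels:
# P^T P (SYRK), the Cholesky factorization (POTRF), and a triangular solve (TRSM).
import numpy as np
from scipy.linalg import solve_triangular

def chol_orthonormalize(P):
    """Return Q with the same range as P and Q^T Q = I, and upper triangular R
    with P = Q R.  Assumes P has full column rank."""
    C = P.T @ P                                   # s-by-s Gram matrix
    R = np.linalg.cholesky(C).T                   # C = R^T R, R upper triangular
    Q = solve_triangular(R.T, P.T, lower=True).T  # Q = P R^{-1}
    return Q, R
```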


FIG. 8. Normalized iteration counts performed by the stabilized block conjugate gradient acceleration on problem SHERMAN3 (N = 5005, with central-block preconditioning).

In Figs. 8 and 9, we use another test problem coming from oil reservoir modelling, recorded as problem SHERMAN3 in the Harwell-Boeing set of sparse matrices. The original matrix is preprocessed, partitioned into a two-block partitioning, and preconditioned with a central-block preconditioning (see §2). We use this new problem because of its larger size (n = 5005), which enables us to illustrate better the influence of level 3 BLAS kernels on the block conjugate gradient algorithm. These experiments have been performed on an Alliant FX/80 with 8 processors. Figure 8 shows that there is an optimal block size that achieves the least amount of work. In addition to causing more work, increasing the block size further does not improve the convergence. This is illustrated by the curve for block size 32, which performs more work than the classical conjugate gradient algorithm itself. However, the efficiency of level 3 BLAS kernels, which grows as the block size increases, enables the block conjugate gradient algorithm with block size 32 to be faster than the classical conjugate gradient algorithm. It is interesting to note, in Fig. 9, that the block size achieving the fastest solution (block size 8) is not the one realizing the minimum amount of work (block size 4).

9. And when everything goes well? In this section, we would like to highlight the main differences in the behaviour of the classical conjugate gradient algorithm and the block

conjugate gradient algorithm. We introduce the following three-dimensional PDE defined on the unit cube,

u_xx + u_yy + u_zz − 1000 e^{xy} (u_x + u_y − u_z) = F,

where the right-hand side F is computed using the solution

u = x + y + z.

FIG. 9. Time in seconds of the stabilized block conjugate gradient acceleration on problem SHERMAN3 (N = 5005, with central-block preconditioning), compared with the classical conjugate gradient. Computations performed on an Alliant FX/80 (8 processors).

This problem is referenced as Problem 3 in Bramley (1989). The test matrix is obtained by discretizing the above equation by the seven-point finite-difference formula on an 8 x 8 x 8 mesh, and the resulting system is partitioned into 4 blocks of equal size. The spectrum of the matrix H for this test problem (see Fig. 10) shows that there are many clusters of eigenvalues, and that the clustering around 1 is much less pronounced than for problem PORES3 (see Fig. 4). Indeed, although these two test problems have a comparable size, for Problem 3 there are only 40 eigenvalues clustered around 1, whereas for PORES3 there are 464 eigenvalues clustered around 1. However, the condition number of matrix H for Problem 3 is of order 5 × 10^2 and is much less than that of problem PORES3. The convergence profile using the normwise backward error for the classical conjugate gradient algorithm (see Fig. 11) shows no plateaus, and the convergence rate of the classical conjugate gradient algorithm is rather linear. On the other hand, the corresponding curves for the block conjugate gradient algorithm with different block sizes separate into two different phases. The first phase shows a linear rate of convergence, similar to that of the classical conjugate gradient algorithm, and the second phase shows a sharp drop with a strong improvement in the accuracy within a few iterations. The normalized number of iterations (i.e., the number of iterations multiplied by the block size) performed by the block conjugate gradient algorithm for reaching a normwise backward error less than 10^{-16} is almost independent of the block size, and is equal to 480, which corresponds roughly to the number of eigenvalues outside the interval centered on 1. These observations indicate that the block conjugate gradient algorithm, which needs to compute global information from the matrix before giving an accurate solution, is closer to a direct method than the classical conjugate gradient algorithm, which behaves more like an iterative method and exhibits a linear rate of convergence. This is reflected by the timing profile of the normwise backward error in Fig. 12, which shows that,


FIG. 10. Eigenvalue distribution of matrix H on Problem 3 from Bramley (1989) on an 8*8*8 grid (N = 512): eigenspectrum and histogram of eigenvalues by intervals of size 1.0E-02. The minimum eigenvalue is 0.47351E-02, and there are 40 eigenvalues in the interval [0.995, 1.005]. The spectrum is computed with EISPACK.

when the problem is well conditioned, the classical conjugate gradient algorithm is much more efficient than the block conjugate gradient algorithm. It is interesting to notice also the effect of level 3 BLAS kernels, which enhance the efficiency of the block conjugate gradient algorithm as the block size increases. This analogy of the behaviour of the block conjugate gradient algorithm to direct methods, as opposed to the classical conjugate gradient algorithm, also holds for very ill conditioned problems. We saw in §2, for instance, that the classical conjugate gradient algorithm could not reach an accurate solution for the test problem GRE1107 (see Fig. 2). Figure 13 shows the spectrum of matrix H for this test problem. A strong clustering around 1 can be observed, as well as a very high condition number (of order 10^{12}). Again, clustering of eigenvalues occurs at the ends of the spectrum. The curves given in Fig. 14 show that the block conjugate gradient algorithm is capable of reaching a solution with a normwise backward error less than


FIG. 11. Comparison of classical and stabilized block conjugate gradient acceleration on Problem 3 from Bramley (1989) on an 8*8*8 grid (N = 512): normwise backward error versus iteration number for the classical conjugate gradient and Stabilized Block CG(4), CG(8), and CG(16).

FIG. 12. Efficiency of classical and stabilized block conjugate gradient acceleration on Problem 3 from Bramley (1989) (N = 512): normwise backward error versus time in seconds for the classical conjugate gradient and Stabilized Block CG(4), CG(8), and CG(16). Computations performed on an Alliant FX/80 (8 processors).


FIG. 13. Eigenvalue distribution of matrix H on problem GRE1107 (N = 1107) with column-block preconditioning: eigenspectrum and histogram of eigenvalues by intervals of size 1.0E-02. The minimum eigenvalue is 0.96803E-12, 15 eigenvalues are equal to 1, and there are 489 eigenvalues in the interval [0.995, 1.005]. The spectrum is computed with EISPACK.

10^{-12}, which corresponds to a solution with one correct digit componentwise. This is not the case for the classical conjugate gradient algorithm, which cannot find a solution for such an ill-conditioned problem. We notice again that, apart from the block size 4, whose convergence curve still shows some plateaus, the normalized number of iterations performed by the block conjugate gradient algorithm for reaching a normwise backward error less than 10^{-12} is almost independent of the block size, and is equal to 500, which corresponds roughly to the number of eigenvalues outside the interval centered on 1. The discussion in this section emphasizes the difference in the behaviour of these two algorithms. When the classical conjugate gradient algorithm behaves as a powerful iterative method, the block conjugate gradient algorithm exhibits the drawback of any direct method, in the sense that it gives a very accurate solution after computing global information from the linear system and thus is less efficient. On the other hand, when the classical conjugate gradient

FIG. 14. Normalized iteration counts performed by the stabilized block conjugate gradient acceleration on problem GRE1107 (N = 1107, with column-block preconditioning), for Stabilized Block CG(4), CG(8), CG(16), CG(20), and CG(32).

algorithm has difficulties reaching a superlinear rate of convergence, the block conjugate gradient algorithm improves the situation since it is much less sensitive to the problems that cause the poor convergence of the classical conjugate gradient algorithm. The observations about the normalized number of iterations performed by the block conjugate gradient suggest that the efficiency of the block conjugate gradient algorithm relies strongly on the clustering of the eigenvalues. The figures showing the spectrum of the matrix H on some of our test problems indicate that the block Cimmino method is a good method for generating matrices with such clustering.

10. What about the more general case? One question which remains is how well the stabilized block conjugate gradient algorithm compares with the classical conjugate gradient one when applied to more general matrices, e.g., matrices not coming only from the acceleration of the block Cimmino method with a two-block partitioning strategy. In this section, we present some extra experiments where we compare the classical conjugate gradient algorithm with the block conjugate gradient one, when applied to the normal equations of some matrices coming from the Harwell-Boeing set of sparse matrices. For all the following test problems, we generate (as indicated in §2 for the other test problems) the right-hand side vector b from a given solution vector x*, with x*_i = n/i, i = 1, ..., n. All the results shown in the following have been computed on a SUN SPARC 10 workstation, using double precision arithmetic. One of the main difficulties we encountered with normal equations is their widely varying characteristics. Indeed, the range of the eigenvalues in the normal equations matrices can be almost anything, the spectrum of the normal equations can be completely spread or can show a very strong clustering, and finally, which may be the worst thing, the ill conditioning of such matrices can explode dramatically. By contrast, for the block Cimmino method with two-block partitioning we know in advance the interval where the spectrum of the iteration matrix


FIG. 15. Eigenvalue distribution of the normal equations matrix on problem GRE512 (N = 512): eigenspectrum and histogram of eigenvalues by intervals of size 1.0E-02. The minimum eigenvalue is 0.40566E-04. The spectrum is computed with LAPACK (through a singular value decomposition).

will be. These variations make it difficult to classify the results and to extract a representative set of test matrices illustrating the range of behaviours that can be observed. Nevertheless, since we are only concerned in this section with the relative behaviour of the classical and the block conjugate gradient algorithms, we can classify the results in three main cases. The first case that can occur is that the condition number of the normal equations matrix is rather small. This is illustrated by matrix GRE512, which arises in the same type of simulation as the earlier studied matrix GRE1107 and whose normal equations matrix has a condition number of order 10^4, as indicated by the values in Fig. 15. The spectrum shows some clusters of eigenvalues, but is mostly spread. In such cases, the classical conjugate gradient algorithm behaves quite well, and the use of the block conjugate gradient algorithm does not improve the situation. This can be seen in Fig. 16, where the iteration profile of the normwise backward error (see §2) is drawn for various block sizes. In this plot, as well as in the following ones, the results are normalized in the sense that the iteration counts are multiplied by the block size.


FIG. 16. Comparison of classical and stabilized block conjugate gradient acceleration on the normal equations of problem GRE512 (N = 512): normwise backward error versus normalized iteration count (iteration number × block size) for the classical conjugate gradient and Stabilized Block CG(2), CG(4), CG(8), and CG(23).

This normalization enables better comparisons since we recall that the block conjugate gradient algorithm solves several systems at the same time. The second case is when the condition number of the normal equations matrix is too high for reaching a reasonable solution with any of these conjugate gradient techniques. This is the case for problem GRE1107 (see Figs. 17 and 18), for which the condition number of the normal equations matrix is of order 10^{15}. The spectrum of the normal equations matrix is relatively spread, and a clustering of the eigenvalues can also be observed near the origin, with 50 eigenvalues in the interval [0.0, 0.005]. A zoom into this eigenvalue distribution reveals that there are also 28 eigenvalues in the interval [0.0, 0.0005], the others being relatively separated. The normalized iteration profiles in Fig. 18 tend to show some improvement with block sizes greater than 16, but there is no further improvement even when block sizes corresponding to the above cluster sizes are used. We mention also that the componentwise relative error

max l