Link/Page Citation

1. Introduction. Poisson-type problems given by 5- or 9-pointstencils arise in many applications and in many research fields. Forexample, in our real-time ship simulator a Poisson-type problem given bya 5-point stencil needs to be solved multiple times per second in orderto compute realistic, interactive waves and realistic ship movements[8]. The size of this Poisson-type problem is directly related to thesize of the computational domain of interest, e.g., a harbour, a river,or a sea. To fulfil the requirements of a real-time simulation, a veryfast solver is required; simply stated: the faster the solver, thelarger the domain that can be computed in real-time.

The proposed solution method in this paper can be applied to solvePoisson-type problems given by 9-point stencils as well. Therefore,throughout this paper we consider a linear system of the form

(1.1) Ax = b,

where A [member of] [R.sup.nxn] is a large symmetric positivedefinite (SPD) matrix given by the stencils,

[mathematical expression not reproducible]

for each node (i, j) in a rectangular n x n grid. Here N refers to'North', NE to 'North-East', and so on. There areseveral methods that can be used to solve (1.1):

1. A direct method, for example a construction of a completeCholesky decomposition. All fill-in occurring during the Gaussianelimination is within a relatively small band around the main diagonal.An advantage is that with this method relatively high speed-ups can beobtained on parallel architectures with block versions; the drawback,however, is that the number of operations increases rather fast withgrid refinement. For example, when a rectangular n x n grid is used, thesize of the matrix is [n.sup.2], and the width of the above-mentionedband is n. In that case, the number of required operations isO([n.sup.4]). Special reordering methods can be used to speed up thecomputations; however, these techniques have poor parallelizationopportunities. Examples of software: Mumps [1], Umfpack [7], Pardiso[15].

2. A spectral method (e.g., Fishpack [16]) is very fast andparallelizable. However, it is limited to matrices having constantcoefficients.

3. Iterative methods. They may be classified into stationary andgradient methods. The stationary methods of Jacobi and Gauss-Seidel arethe earliest iterative methods and they are based on a splitting of thematrix. For many cases they can be accelerated by using relaxationtechniques, leading, for example, to the well-known successiveover-relaxation (SOR) method. In case of an SPD matrix, one can also usethe Conjugate Gradient (CG) method. In exact arithmetic, this methodfinds an approximate solution in such a way that the A-norm of the erroris minimized over the Krylov subspace

[K.sub.k](A, [r.sub.0]) = span{[r.sub.0], [Ar.sub.0],[A.sup.2][r.sub.0],..., [A.sup.k-1][r.sub.0]},

where [r.sub.0] = b - [Ax.sub.0] is the initial residual. Hence theresidual at step k can be written as [P.sub.k] ([r.sub.0]), in which[P.sub.k] is the 'optimal' polynomial of degree less than k -1 such that [P.sub.k] (0) = I.

It is possible to use Chebychev polynomials to give bounds for theconvergence rate of the CG method based on the fact that they cannotproduce better reductions of the error than the optimal polynomial. Awell-known upper bound for the error is [17]:

(1.2) [[parallel][x.sub.k] - x[parallel].sub.A] [less than or equalto] 2[([[square root of [kappa](A)] - 1/[square root of [kappa](A)]] +1).sup.k][[parallel][x.sub.0] - x[parallel].sub.A],

where [[parallel] * [parallel].sub.A] denotes the A-norm, and[kappa](A) := [[lambda].sub.max]/[[lambda].sub.min] the spectralcondition number, i.e., the ratio of the largest to the smallesteigenvalue of A. The convergence behaviour of CG thus strongly dependson the spectral condition number. The convergence rate can oftenstrongly be improved by applying CG to the preconditioned system

(1.3) [M.sup.-1]Ax = [M.sup.-1]b.

The SPD matrix M is called the preconditioner. There is a widechoice of preconditioned, see, for example, [5] and [14].

The matrix M should be a proper approximation of A, such that thespectral condition number of [M.sup.-1]A is much smaller than that of A,and solving the systems Mz = r should be cheap. One possibility is totry to make a SParse Approximation of the Inverse (SPAI-methods).However, even when the coefficient matrix itself is very sparse, itsinverse is often not, and therefore it is questionable if a properSPAI-preconditioner can be constructed. For symmetric matrices manypreconditioners are based on an Incomplete Cholesky decomposition. Inthis method, a lower-triangular matrix L is constructed such that A[approximately equal to] [LL.sup.T], and where the factor L is sparse.In that case: M := [LL.sup.T] and systems Mz = r can be readily solvedby two triangular sweeps.

A good solution method should have a limited number of operationsper node, and this number should not increase too fast with meshrefinement. In addition, it should be possible to exploit moderncomputer architectures in order to obtain a high flop rate. Suppose thatone has to solve a Poisson equation on a rectangular mesh with constantmesh size h = 1/(n +1). In that case, one can show that SOR takesO([n.sup.3]) operations, whereas preconditioned CG, in which thediagonal of the preconditioner is modified according to Gustafsson [10],the computational complexity is O([n.sup.5/2]).

The so-called Repeated Red-Black (RRB) method, described in [3]combines a reordering of the rows and columns of the matrix with anIncomplete Cholesky decomposition. This method has two importantadvantages:

1. For the test case with uniform mesh mentioned above, an upperbound of the computational complexity is given by (cf. [12], eq. (1.8)):

[kappa]([M.sup.-1]A) [less than or equal to] [[square root of5][([square root of 5] - 1).sup.l-1]]/[1 + [(-1).sup.l][([3-[square rootof 5]/2]).sup.l-1] [approximately equal to] 1.8 * [1.23.sup.l],

where l is the number of consecutive red-black levels. Hence thenumber of iterations is expected to increase only mildly with meshrefinement.

2. The reordering strategy based on repeated red-black guaranteesthat large parts of the preconditioner can be built in parallel, andlarge parts of the triangular sweeps can be performed in parallel aswell.

In this paper we provide an efficient implementation, that is, animplementation of the RRB-method that can fully exploit the parallelismof modern architectures, such as GPUs. Key to this is a new reorderingstrategy such that all global memory operations can be performed in acoalesced manner.

The remaining sections are organized as follows. In Section 2 theRRB-solver and its aspects are presented. In Section 3 it is explainedwhich techniques can be used to obtain an efficient parallelimplementation of the RRB-solver on both multi-core CPU and GPU systems.In Section 4 the experimental setup and the test problem are discussed.In Section 5 several performance results are presented, as well as adetailed throughput analysis and a speed comparison between theRRB-solver and algebraic Multigrid. In Section 6 the conclusions can befound.

2. The RRB-solver. The RRB-solver is a PCG-type solver where theRRB-method serves as the preconditioner M. RRB stands for 'RepeatedRed-Black' (or 'Recursive Red-Black') and refers to howthe nodes in a 2D grid are colored and numbered. The RRB-method foundits origin in the late eighties where multigrid V-cycles withintermediate skew meshes were investigated. Since then the method hasbeen further investigated by Axelsson and Eijkhout [2] and Brand andHeinemann [3,4]. They showed that the RRB-method can be used as apreconditioner in Krylov methods leading to a method with nearly optimalscaling. In [6, 12] additional information can be found as well asderivations for upper bounds for the condition number k.

In order to show how the RRB-method can be parallelized, theRRB-method is described in this section. First we discuss RRB-numbering,then we present the RRB-factorization algorithm and, finally, we explainhow the RRB-method can be used as a preconditioner.

2.1. The RRB numbering procedure. Let

G ={(i, j) | 1 [less than or equal to] i [less than or equal to][N.sub.x], 1 [less than or equal to] j [less than or equal to][N.sub.y]}

be the set of all nodes in an [N.sub.x] x [N.sub.y] grid. If (1, 1)is chosen to be a black node, then a standard red-black ordering isgiven by:

[R.sup.[1]]={(i, j) [member of] G | mod(i + j, 2) = 1},

[B.sup.[1]]= G \ [R.sup.[1]],

where [R.sup.[1]] denotes the set of first level red nodes anddenotes the set of first level black nodes. Next, a standard red-blackordering is reapplied to the -nodes as follows:

[R.sup.[2]] ={(i, j) [member of] [B.sup.[1]] | mod(j, 2) = 0},

[B.sup.[2] = [B.sup.[1]]\[R.sup.[2]]

= G\([R.sup.[1]]

The second level black nodes, i.e., the [B.sup.[2]] -nodes, arethus the nodes in G that neither belong to the sets [R.sup.[2]] nor[R.sup.[2]]. Generally,

[mathematical expression not reproducible]

The maximum number of levels that the [N.sub.x] x [N.sub.y] -gridallows for is given by

(2.1) [l.sub.max] = 2[[log.sub.2](max{[N.sub.x], [N.sub.y]})[??] +1.

Example 2.1. In this example the RRB-numbering procedure is appliedto a matrix A [member of] [R.sup.64x64] resulting from an 8 x 8 grid ofunknowns. For this matrix A the maximum number of levels is [l.sub.max]= 2[??][log.sub.2](8)[??] +1 = 7 according to equation (2.1). The effectof the RRB-numbering on the ordering can be seen in Figure 2.1. Forreadability the black nodes are represented by gray squares and the rednodes by white squares. The effect of the RRB-numbering on the sparsitypattern of the matrix A [member of] [R.sup.64x64] belonging to the 8 x 8grid of unknowns is shown in Figure 2.2 for the first four levels.

2.2. The RRB-factorization algorithm. The RRB-method factorizes thematrix A into

(2.2) A = LD[L.sup.T] + R,

where L is a lower triangular matrix with unitary diagonal entries,D a diagonal matrix, and R a matrix that contains adjustments resultingfrom lumping procedures. A detailed description of the RRB-method can befound in [9]. Starting from a 9-point stencil, the factorization (2.2)is performed as follows:

1. Renumber the points and equations corresponding to a red-blackordering; first number all red points. The coefficient matrix then has a2 by 2 block structure, in which the upper-left block [A.sub.11] givesthe interaction between the red points only.

2. The nonzero off-diagonal elements of block [A.sub.11] are firstadded to the main diagonal and then put to zero (lumping). The modifiedmatrix obtained in this way can be represented by a 5-point stencil.

3. In the modified system of linear equations all red points can beeliminated, which gives again a system of linear equations given by a9-point stencil. This system of equations has only half the number ofunknowns as in Step 1.

4. Go to Step 1, and repeat until only 1 node remains.

2.2.1. Full RRB versus RRB-l. By reapplying RRB-iterations over andover again, after a certain number of iterations only a single noderemains. This is called full RRB. However, it is not needed to go allthe way down. After each level of red-black numbering, lumping andelimination of the red nodes in the respective level, one can stop atlevel l [less than or equal to] [l.sub.max]. For the remaining blacknodes [B.sup.[l]], which have a 9-point dependency structure, a fullCholesky decomposition

(2.3) [M.sub.l] = [L.sub.l][D.sub.l][L.sup.T.sub.l]

is computed. This is called l-step RRB, denoted by RRB-l Obviously,by stopping earlier the factorization (2.2) becomes more exact. However,the disadvantage is that constructing the Cholesky factor L takes moreeffort and L has more fill-in. Therefore, this should be done only whenthe size of the remaining system of equations becomes so small that thecosts of a direct solver are not much larger than the costs of the restof the incomplete factorization.

2.2.2. Starting with a 5-point stencil. In case the RRB-method isapplied to a matrix A given by a 5-point stencil, e.g., the 2D Poissonproblem, then the elimination of R [1] is exact and accordingly R = 0for the first level.

2.2.3. Algorithm. Pseudo code for the RRB-method is provided inAlgorithm 1. EXAMPLE 2.2. In this example the RRB-method is applied to amatrix A [member of] [R.sup.64x64] resulting from an 8 x 8 grid, seeFigure 2.3. The figure shows the effects of consecutive red-blackorderings, lumping, and elimination of red nodes on the dependencystructure and sparsity pattern of L + D + [L.sup.T].

Algorithm 1 The RRB-method starting with a 9-point stencil.1. Choose the number of levels l [less than or equal to] [l.sub.max]2. Set k = 03. While (k [less than or equal to] l) do4. Apply red-black ordering: [R.sup.[k]]/[B.sup.[k]]5. Apply lumping procedure to [R.sup.[k]]-nodes6. Eliminate [R.sup.[k]] -nodes using 5-point stencils for [R.sup.[k]]-nodes7. k = k + 18. End while9. Make Cholesky decomposition for remaining 9-point stencil: [M.sub.l]= [L.sub.l][D.sub.l][L.sup.T.sub.l]

2.3. The RRB-method used as a preconditioner in PCG. The matrix Ais factorized as A = [LDL.sup.T] + R. As preconditioner for thePCG-algorithm the matrix

M = [LDL.sup.T] [approximately equal to] A

is taken. The smaller the numbers in R in absolute value are, thebetter M resembles A. The combination of the PCG-algorithm and theRRB-method as preconditioner is called the RRB-solver.

As remarked earlier, starting from a 5-point stencil, e.g., the 2DPoisson problem, the elimination of R [1]-nodes is exact. Hence, in thiscase PCG can be applied to the resulting 1st Schur complement [S.sub.1]instead of the entire matrix A. This is beneficial for the total amountof computational work. Since [S.sub.1] consists of only the -nodes, thenumber of flops for computing the vector updates and inner products inthe PCG-algorithm is reduced by a factor two. The number of flops in thematrix-vector product q = [S.sub.1]p remains the same as thematrix-vector product with A, because the matrix [S.sub.1] is now givenby a 9-point stencil instead of by a 5-point stencil.

2.4. Spectral condition number of RRB-l. The convergence rate ofPCG depends on the spectrum of matrix [M.sup.-1]A. Since thepreconditioner M is an approximation of the system matrix A, thecondition number of [M.sup.-1] A is smaller than that of A. This gives asharper upper bound of the error (1.2) and therefore most likely afaster convergence.

In [3], [6], and [12] the RRB-l preconditioner is investigated indetail for the Poisson problem with Dirichlet boundary conditions on a2D uniform grid with n x n unknowns and where n is of the form n =[2.sup.l] - 1. Different upper bounds can be found in the aforementionedliterature. Notay [12] gives an upper bound:

[kappa]([M.sup.-1]A) [less than or equal to] [[square root of5][([square root of 5] - 1).sup.l-1]]/[1 + [(-1).sup.l][([3-[square rootof 5]/2]).sup.l-1] [approximately equal to] 1.8 * [1.23.sup.l].

Since the mesh spacing h = 1/(n +1) and l = [log.sub.2](1/n)[approximately equal to] [log.sub.2] [h.sup.-1], alternatively,

[kappa]([M.sup.-1]A) [less than or equal to] 1.8 [h.sup.-0.306].

2.5. Application of the preconditioner. At each iteration of thePCG-algorithm, the system Mz = r needs to be solved for z. Thepreconditioning matrix M is factorized as M = LD[L.sup.T]. Therefore, Mz= r can be solved efficiently in three steps as follows. Set y :=[L.sup.T]z and x := D[L.sup.T]z = Dy, then:

1. solve x from Lx = r using forward substitution;

2. compute y = [D.sup.-1] x;

3. solve z from [L.sup.T]z = y using backward substitution.

Each of the three steps is computationally cheap.

2.5.1. Algorithm. For memory efficiency a single vector z is usedinstead of the three vectors x, y, and z. Moreover, if l [less than orequal to] [l.sub.max] then at the final level l a full Choleskydecomposition (2.3) is made for the remaining level of black nodes[B.sup.[l]]. Pseudo code is provided below, see Algorithm 2.

EXAMPLE 2.3. For a matrix A [member of] [R.SUP.64X64] resultingfrom an 8 x 8 grid of unknowns the forward substitution steps arevisualized in Figure 2.4 and Figure 2.5.

As indicated by Algorithm 2, both the forward and backwardsubstitution are performed level-wise in two phases. First, the z-valuesat [B.sup.[k]]-nodes are updated using the skew 5-point stencils (x) atthe [R.sup.[k]]-nodes, see Figure 2.4. Second, the z-values at[B.sup.[k+1]]-nodes are updated using the straight 5-point stencils (+)at the [R.sup.[k+1]]-nodes, see Figure 2.5. The backward substitution isdone level-wise in the same manner, yet in the reverse order.

Algorithm 2 Application of the RRB preconditioner.1. Given number of levels l [less than or equal to] [l.sub.max]2. Set k = 23. While (k [less than or equal to] l) do % forward subs.4. Update z-values at [B.sup.[k]] -nodes using [R.sup.[k]] -nodes %skew 5-point (x)5. If (k +1 = l) then6. break7. End if8. Update z-values at [B.sup.[k+1]]-nodes using [R.sup.[k+1]]-nodes %5-point (+)9. k = k + 210. End while11. Update x-values at [B.sup.[1]] -nodes % diagonal scaling12. Solve [L.sub.l][D.sub.l][L.sup.T.sub.l][x.sub.l]= [z.sub.l] andupdate x-values in level l % final level exact13. Set k = l14. While (k [greater than or equal to] 2) do % backward subs.15. Update z-values at [R.sup.[k]] -nodes using [B.sup.[k]] -nodes %5-point (+)16. If (k - 1 == 2) then17. break18. End if19. Update z-values at [R.sup.[k-1]]-nodes using [R.sup.[k]-1]-nodes %skew 5-point (x)20. k = k - 221. End while

3. Parallel implementation of the RRB-solver. To obtain a parallelimplementation of the RRB-solver, all operations of the CG-algorithm,i.e., the inner products, the vector updates, and the matrix-vector q =[S.sub.1]p, where [S.sub.1] is the first Schur complement, can readilybe parallelized on shared memory machines [11]. Secondly, theconstruction of the preconditioner, i.e., M = LD[L.sup.T], can beperformed level-wise in parallel on shared memory machines. FromAlgorithm 1 it can be seen that per level each of the operationslumping, elimination, and substitution can be performed fully inparallel. Finally, the application of the preconditioner, i.e., solvingMz = r for z, can be performed level-wise in parallel on shared memorymachines as well. This can be seen from Algorithm 2 as well as fromFigure 2.5, both indicate that at each level the gray nodes can beupdated fully in parallel. Figure 3.1 illustrates this even moreclearly. The gray areas indicate the blocks where fill-in has beenlumped, and where the computations can be done fully in parallel.

As forward and backward substitution are inherently sequential, itis not possible to compute the different levels (the gray blocks) inparallel as well. Fortunately, however, as the number of unknownsdecreases fast, namely by a factor 2 per level, so does the amount ofwork. All together, it can be concluded that the RRB-solver can beparallelized very well on shared memory machines.

3.1. A key idea to maximize bandwidth. In this section we discusshow to actually implement the RRB-solver on modern parallelarchitectures such as the GPU. The RRB-solver consists of BLAS-1 andBLAS-2 routines and is therefore a bandwidth-bound, a data intensivealgorithm, i.e., the ratio

r = [Number of memory operations]/[Number of floating pointoperations]

is large. For already relatively small problem sizes, the amount ofdata is such that the problem does not fit in the (fast) cache of mostmodern computers anymore. Hence throughout the RRB-algorithm most datais read from and written to the global memory over and over again. Amain concern thus is how to achieve high global memory bandwidththroughout the algorithm.

This is certainly not trivial for the RRB-solver because of therepeated red-black orderings, which can be seen from Figure 2.1. A naivestorage of the data implies that all the data required for thevector-updates, inner products, and the matrix-vector product would beread and written with a stride of two. This follows from the fact thatin case of the RRB-solver the underlying PCG-algorithm operates on the B[1]-nodes only. Even worse, during application of the preconditionerstep, i.e., solving Mz = r for z, the data would be accessed withincreasing stride: 2, 4, 8, 16,.... It is well-known that reading datafrom and writing data to the global memory with a stride leads to poorperformance. Even though the amount of work decreases fast for each nextlevel, at the first few levels already too much bandwidth would getwasted and the RRB-solver would suffer from poor performance.

Therefore, to guarantee maximal bandwidth throughout the algorithm,for the first few (the finest) levels a different storage scheme isproposed, see Figure 3.2. This storage scheme is called the[r.sub.1]/[r.sub.2]/[b.sub.1]/[b.sub.2]-storage scheme [8].

Comparison of Figure 2.1 and Figure 3.2 shows: the R [1] -nodes aredivided into [r.sub.1]- and [r.sub.2]-nodes and the B [1]-nodes aredivided into [b.sub.1]- and [b.sub.2]-nodes. Hence the [b.sub.1]- and[b.sub.2]-nodes are the nodes to which the majority of the PCG-algorithmis applied to. The first important observation is now that, by storingthe B [1] nodes into these two new arrays of [b.sub.1]- and[b.sub.2]-nodes, the data can always be accessed in a coalesced manner,without a stride.

The second important observation is that the [b.sub.2]-nodes formthe next coarser grid, and that, on this coarser grid, the[r.sub.1]/[r.sub.2]/[b.sub.1]/[b.sub.2]-storage scheme can be reapplied.In general, the odd levels k are stored into [b.sup.[k].sub.1] - and[b.sup.[k].sub.2] -nodes, and the even levels k are stored into the[b.sup.[k].sub.2] -nodes. Given the problem size, the layout of storageschemes can be determined a priori.

It is thus very beneficial for performance to store the data usedby the RRB-solver as much as possible in the[r.sub.1]/[r.sub.2]/[b.sub.1]/[b.sub.2]-storage scheme, i.e., the 5vectors in the PCG-algorithm (r, x, z, p, q) and the 5 + 9 vectors thatdescribe the system matrix [S.sub.1] and the preconditioner matrix M,respectively. Actually, the matrix M and the vector z occurring in thepreconditioner step Mz = r are stored in a recursive[r.sub.1]/[r.sub.2]/[b.sub.1]/[b.sub.2]-storage scheme. Although theproposed scheme is basically a renumbering and reordering of the gridpoints, for performance it is needed to use multiple physical arrays toensure that all data describing M and z can be accessed in a coalescedmanner. This is at most just 1 + 1/4 + 1/16 +... = 4/3 times asexpensive as the naive storage. The[r.sub.1]/[r.sub.2]/[b.sub.1]/[b.sub.2]-storage scheme can be usedvirtually for free during the preconditioner step Mz = r. This can beseen from Figure 2.5: in case of application of the preconditioner, theupdated z-values at level k are directly written into the[r.sub.1]/[r.sub.2]/[b.sub.1]/[b.sub.2]-arrays of level k + 1 in case offorward substitution and vice versa in case of backward substitution.Hence, by using the [r.sub.1]/[r.sub.2]/[b.sub.1]/[b.sub.2]-storagescheme, maximal bandwidth can be obtained for all routines in thePCG-algorithm.

4. Implementation and experimental setup.

4.1. Hardware. All experiments are performed on a system having axeon E5-1620 processor, 32 GB of memory and a GeForce Titan X graphicscard. The OS is Linux Debian 8.8 (64-bit) with kernel 4.9.23. TheRRB-solver is implemented in C++ with OpenMP for parallelization as wellas in CUDA C. The code is compiled with GCC version 4.9.2 and NVVC7.5.17. All experiments are performed in double-precision.

4.2. Test problem. Similar to [3], [6], and [12], as a test problemthe 2D Poisson problem with Dirichlet boundary conditions on a 2Duniform grid is taken:

-[DELTA]u = f(x, y) on [OMEGA] = (0, 1) x (0, 1), u(x, y) = 0 on[partial derivative][OMEGA],

having n x n unknowns, and where n is of the form n = [2.sul.l] -1. The right-hand side f is computed such that u(x, y) = x(x - 1)exp(xy).

4.3. Condition number and number of PCG-iterations. The conditionnumber [kappa] of the preconditioned system (1.3) is given by

[kappa]: = [kappa]([M.sup.-1]A) =[[[lambda].sub.max]([M.sup.-1]A)]/[[[lambda].sub.min]([M.sup.-1]A)],

where [[lambda].sub.max] and [[lambda].sub.min] are the largest andsmallest eigenvalues, respectively. Since for the RRB-preconditioner[[lambda].sub.min] [approximately equal to] 1, we have [3]:

[kappa] [approximately equal to] [[lambda].sub.min]([M.sup.-1]A).

The required number of PCG-iterations depends on:

(i) the termination criterion, i.e., the demanded accuracy given bya parameter tol;

(ii) the choice of the initial guess [x.sub.0];

(iii) the number of RRB-levels l (accuracy of preconditioner);

In our experiments we have used the termination criterion:

[mathematical expression not reproducible]

with tol := [10.sup.-6]. As initial guess we have taken [x.sub.0] =0. The number of RRB-levels l is varied in the experiments.

4.4. Four implementations. Four implementations of the RRB-solverare used:

* S1: Sequential solver in C++, uses naive storage scheme for all llevels;

* S2: Sequential solver in C++, uses[r.sub.1]/[r.sub.2]/[b.sub.1]/[b.sub.2]-storage scheme for the first 2glevels; calls S1 for the remaining l - 2g levels;

* S3: Parallel solver in C++/OpenMP, uses[r.sub.1]/[r.sub.2]/[b.sub.1]/[b.sub.2]-storage scheme for the first 2glevels; calls S1 for the remaining l - 2g levels;

* S4: Parallel solver in CUDA C, uses[r.sub.1]/[r.sub.2]/[b.sub.1]/[b.sub.2]-storage scheme for the first 2glevels; calls S1 for the remaining l - 2g levels.

For all solvers the desired level t can be set and for the solversS2, S3, and S4 the desired number of[r.sub.1]/[r.sub.2]/[b.sub.1]/[b.sub.2]-grids g can be set as well. S1is a complete sequential implementation of the RRB-solver. The solversS2, S3, and S4 only take care of the first 2g levels and rely on solverS1 for the remaining levels. If l [greater than or equal to] [t.sub.max]then l := [l.sub.max] and the final grid consists of just 1 node; if l< [l.sub.max] then a complete Cholesky decomposition is made for theremaining 9-point stencil. On the coarsest grid solver S1 takes care ofthe remaining problem, see Line 12 in Algorithm 2. This is done exactlywith a sequential band solver.

5. Results.

5.1. Condition numbers and number of PCG-iterations. In Figure 5.1the condition number [kappa] is shown for different numbers ofRRB-levels t and different problem sizes. In

Figure 5.2 the corresponding number of PCG-iterations is shown. Themaximal number of RRB-levels [l.sub.max] depends on the problem size,see Equation (2.1). The figures show that:

1. The larger l, the larger [kappa] and the larger the requirednumber of PCG-iterations. This makes perfect sense as the accuracy ofthe preconditioner M decreases with the number of RRB-levels l; beyond acertain level the accuracy of the preconditioner M is hardly effectedanymore, hence the horizontal slope in the figures;

2. For the 2D Poisson test problem, the RRB-solver scales verywell: both the condition number k and the required number ofPCG-iterations hardly increase with mesh refinement. Going from h =[1/128] to h = [1/2048] only gives a doubling of iterations.

The RRB-solver is thus very efficient in itself: low conditionnumbers, low numbers of PCG-iterations, and very good scaling behaviour.

5.2. Timings and speedup. The results in this section are all forthe test problem with a fixed problem size of 2047 x 2047 unknowns. InFigure 5.3 the computation time is shown for the four implementations S1to S4, for three numbers of RRB-levels t, and for different numbers of[r.sub.1]/[r.sub.2]/[b.sub.1]/[b.sub.2]-grids. As an example: S2-2 meanssolver S2 with g = 2. In Figure 5.4 the corresponding speedups are shownwith respect to solver S1. The figures show:

1. Introduction of the[r.sub.1]/[r.sub.2]/[b.sub.1]/[b.sub.2]-storage scheme instead of thenaive storage scheme gives already a speedup factor 2.1 (with l =12 andg = 3);

2. Computation with 4 cores of the Xeon E5-1620 via OpenMP on topof introduction of the [r.sub.1]/[r.sub.2]/[b.sub.1]/[b.sub.2]-storagescheme gives a speedup factor 4.9 (with l =12 and g = 3);

3. Computation with all cores of the Titan X via CUDA on top ofintroduction of the [r.sub.1]/[r.sub.2]/[b.sub.1]/[b.sub.2]-storagescheme gives a speedup factor 35.1 (with l =12 and g = 3).

In Figure 5.5 and Figure 5.6 more timings are shown for solver S4.In Figure 5.5 the time is shown to compute the factorization (2.2) forvarious g and various l. In Figure 5.6 the time to solve the linearsystem (1.3) is shown. The figures show:

1. The higher the number of[r.sub.1]/[r.sub.2]/[b.sub.1]/[b.sub.2]-grids g the faster thefactorization; however, above g = 3 and/or l = 13 the time reduction isnegligible;

2. The higher the number of[r.sub.1]/[r.sub.2]/[b.sub.1]/[b.sub.2]-grids g the faster the solution;however, above g = 3 the time reduction is negligible; moreover, thereis an 'optimal' number of

RRB-levels t: first the solution time decreases until a certainnumber of RRB-levels (l = 8 for g = 0 or l = 12 for g = 4) due to thefact that the remainder of the Cholesky factor is smaller; thereafterthe solution time goes up again due to the fact that the number ofiterations increases.

5.3. Throughput, flop rate, and time breakdown. All results in thissection are for the test problem having 2047 x 2047 unknowns. In Table5.1 the time breakdown of solver S4 is shown. The table lists thedifferent kernels in the PCG-algorithm and their corresponding time andperformance. We observe that:

1. The solver is well-balanced in time; all kernels contributeequally, in the same order, to the overall computation time;

2. The throughput rates are really good: the theoretical peak ofthe Titan X is 313 GiB/s. We observe that the throughput rates of allkernels are very good (above 80% of the theoretical peak). The solverand matvec kernels operate even close to and a bit beyond thetheoretical peak, respectively.

The data transfers from and to the device (the vectors x and b) areperformed rather slowly (only 2.6 GiB/s) and therefore take 30.4% of thetime. Most likely, this is caused by the rather

FIG. 5.6. Solution time of solver S4.Time [s]g = 0g = 1g = 2g = 3g = 4

old motherboard with low PCI-express transfer rates in our testsystem; an upgrade of the motherboard would most likely lead to asignificant reduction in time.

Finally, in Table 5.2 the time breakdown is shown for the forwardsubstitution part of the solver kernel.

We observe that:

1. The first two grids 1 and 2 (the first four RRB-levels) take themost computation time (37.3% and 24.9%, respectively). Since the solverkernel runs close to the theoretical peak of the device (see Table 5.1)one cannot gain much speed anymore;

2. The last two grids 3 and 4 only take a fraction of the time(5.5% and 1.9% respectively); this shows that the typical Multigridissue of idle cores on coarser levels does not really have an impact inour solver;

3. The levels 9 to 12 and the remainder are performed by solver S1on the CPU which takes 30.5% of the overall time. Although this isrelative much due to a rather old CPU, it does not really matter, sincethe finest levels take an equal amount of time; see the observation initem 1.

5.4. Comparison with multigrid. In this section we compare theRRB-solver with the default algebraic Multigrid (AMG) solver from thePARALUTION [13] library, a high-performance C++ library that offersvarious sparse iterative solvers and preconditioners for multicore CPUand GPU devices. A comparison with Multigrid is chosen, becauseMultigrid-type solvers are known to be very efficient as well for thetype of problems considered in this paper. It was shown that,throughout, the performance of the RRB-solver is near-optimal,therefore, the only remaining two options to gain speed are when therequired number of iterations in the PCG-algorithm can be reduced, or byusing a different type of solver. From our experience, most (if not all)other PCG-type solvers will perform (much) worse because they typicallyrequire significantly more iterations until convergence. Therefore, weconfine ourselves to a comparison with the Multigrid method. It is alsoimportant to note that FFT-solvers cannot be used directly as the systemmatrix A does not have constant diagonals.

Although for structured, rectangular grids, such as our 2DPoisson-type problem, geometric Multigrid (GMG) seems to be the betteralternative, its application within PARALUTION comes with severaladditional requirements, making it rather complicated to use. For GMG itis required that all operations are defined beforehand, such as explicitdefinition of the restriction/prolongation operations and smoother foreach level, and the coarse grid solver. A major advantage of AMG overGMG is that it can be used as a black-box solver, as there is no need toexplicitly define all operations. Similarly, the RRB-solver can be usedas a black-box solver, and we show here that it can outperform AMG as ablack-box solver for Poisson-type problems.

The default AMG solver in PARALUTION uses smoothed aggregationinterpolation, a fixed-point iteration solver, and two-colorGauss-Seidel as a smoother. As of today, a GPU implementation is missingfor the construction of AMG, and the construction is therefore performedon the host (CPU). The solution steps however are performed on the GPU.PARA-LUTION's AMG solver also supports the diagonal (DIA) storageformat, which is the most efficient format to use in our application.The AMG solver is compared with our S4-4 solver, with l =12 fixed. Bothsolvers were tested on the same hardware, see Section 4.1.

Again the 2D Poisson test problem, see Section 4.2, is solved fordifferent problem sizes and tol = [10.sup.-6]. In Table 5.3 the resultsare listed. The setup time is the total time taken by the solver toinitialize all memory (on both CPU and GPU) and to setup/construct thesolver. From the table the following can be observed:

1. Setting up the S4-4 solver on the CPU is 7 times faster thansetting up the AMG solver; setting up the S4-4 solver on the GPU is even10 times faster than setting up the default AMG solver;

2. The S4-4 solver outperforms the default AMG solver by a factor 7for the solve step;

3. The number of iterations taken by S4-4 (with l = 12) is nearlyas low as the number of iterations taken by AMG. This shows that theRRB-solver scales nearly as good as Multigrid;

4. For larger problem sizes ([h.sup.-1] > 256) and fixed l = 12the number of RRB-iterations does not increase.

Although AMG could be tuned to perform better, and GMG may provideeven better performance than AMG (in particular for construction of thesolver), our comparison study already clearly demonstrates that the GPURRB-solver is really fast, and that it can challenge or, for specifictype of problems, even outperform existing Multigrid-type solvers fromacknowledged high-performance computing libraries.

6. Conclusions. This paper addressed an efficient implementation ofthe RRB-solver. The RRB-solver is a PCG-type solver based on theRRB-method for the incomplete factorization of the preconditioner.Literature and a comparison study by us shows that the RRB-method isvery efficient in itself and that it scales very well with meshrefinement, nearly as good as Multigrid. Besides being very efficient initself, this paper demonstrates that the RRB-method also allows for anefficient parallelization. A clever storage scheme is key in theefficient parallelization. Using the so-called r1/r2/61/62-storagescheme both the construction and the application of the preconditionerare efficiently parallelized on both multi-core CPU using C++/OpenMP andthe GPU using CUDA C. The r1/r2/61/62-storage format can be used indifferent solvers as well to obtain an efficient implementation on theGPU, e.g., (geometric) Multigrid may benefit from this. Performancestudies on the 2D Poisson test problem showed that the performance ofthe RRB-solver is very good overall: no wasted throughput, nobottleneck, and a well-balanced algorithm in terms of time spent perroutine in the PCG-algorithm. On the GPU the RRB-solver reaches athroughput close to the theoretical peak of the device and a speedupfactor of more than 30 compared to a sequential implementation on theCPU. A comparison with algebraic Multigrid from a recognizedhigh-performance computing library showed that the RRB-solver canoutperform it by a factor 7 to 10, which again demonstrates that theRRB-solver can be implemented very efficiently on the GPU.

REFERENCES

[1] P. R. AMESTOY, I. S. DUFF, J. Y. L'EXCELLENT, AND J.KOSTER, MUMPS: a general purpose distributed memory sparse solver, inProc. PARA2000, 5th Int. Workshop on Appl. Parallel Comp., T. S0revik,F. Manne, A. H. Gebremedhin, and R. Moe, eds., vol. 1947 of LectureNotes in Comput. Sci., Springer, Berlin, 2000, pp. 122-131.

[2] O. AXELSSON AND V. EIJKHOUT, The nested recursive two-levelfactorization method for nine-point difference matrices, SIAM J. Sci.Statist. Comput., 12 (1991), pp. 1373-1400.

[3] C. W. BRAND, An incomplete-factorization preconditioning usingrepeated red-black ordering, Numer. Math., 61 (1992), pp. 433-454.

[4] C. W. BRAND AND Z. HEINEMANN, A new iterative solutiontechnique for reservoir simulation equations on locally refined grids,SPE Reservoir Eng., 5 (1990), pp. 555-560.

[5] A. M. BRUASET, Preconditioned for Discretized EllipticProblems, PhD. Thesis, Department of Informatics, University of Oslo,1992.

[6] P. CIARLET, JR., Repeated red-black ordering: a new approach,Numer. Algorithms, 7 (1994), pp. 295-324.

[7] T. A. DAVIS, Algorithm 832: UMFPACK V4.3: anunsymmetric-pattern multifrontal method, ACM Trans. Math. Software, 30(2004), pp. 196-199.

[8] M. DE JONG, A. VAN DER PLOEG, A. DITZEL, AND C. VUIK, Real-timecomputation of interactive waves on the GPU, in 15th Numerical TowingTank Symposium, Cortona, Italy, V. Bertran, E. Campana, eds., CurranAssoc., Red Hook, 2012, pp. 111-116.

[9] M. DE JONG AND C. VUIK, GPU implementation of the RRB-solver,Tech. Rep. 16-06, Delft University of Technology, Delft, Netherlands,2016.

[10] I. GUSTAFSSON, A class of first order factorization methods,BIT, 18 (1978), pp. 142-156.

[11] R. LI AND Y. SAAD, GPU-accelerated preconditioned iterativelinear solvers, J. Supercomputing, 63 (2013), pp. 443-466.

[12] Y. NOTAY AND Z. O. AMAR, A nearly optimal preconditioningbased on recursive red-black orderings, Numer. Linear Algebra Appl., 4(1997), pp. 369-391.

[13] PARALUTION LABS, PARALUTION v1.1.0, 2016.http://www.paralution.com/

[14] Y. SAAD, Krylov subspace methods on supercomputers, SIAM J.Sci. Statist. Comput., 10 (1989), pp. 1200-1232.

[15] O. SCHENK AND K. GARTNER, Solving unsymmetric sparse systemsof linear equations with PARDISO, in Computational Science--ICCS 2002,Part II (Amsterdam), P. M. A. Sloot, C. J. K. Tan, J. J. Dongarra, andA. G. Hoekstra, eds., vol. 2330 of Lecture Notes in Comput. Sci.,Springer, Berlin, 2002, pp. 355-363.

[16] P. N. SWARZTRAUBER AND R. A. SWEET, Algorithm 541: FISHPACK:efficient fortran subprograms for the solution of separable ellipticpartial differential equations [d3], ACM Trans. Math. Softw., 5 (1979),pp. 352-364.

[17] H. A. VAN DER VORST, Iterative Krylov Methods for Large LinearSystems, Cambridge University Press, Cambridge, 2003.

Jong, Martijn De^Ploeg, Auke Van Der^Ditzel[double dagger],Auke^Vuik, Kees

MARTIJN DE JONG ([dagger]), AUKE VAN DER PLOEG ([double dagger]),AUKE DITZEL ([double dagger]), AND KEES VUIK ([dagger])

([dagger]) Delft University of Technology, Mekelweg 4, 2628 CDDelft, Netherlands ({m.a.dejong-2, c.vuik}@tudelft.nl).

([double dagger]) Maritime Research Institute Netherlands,Haagsteeg 2, 6708 PM Wageningen, Netherlands ({a.v.d.ploeg,a.ditzel}@marin.nl).

(*) Received November 11, 2016. Accepted July, 19, 2017. Publishedonline on October 3, 2017. Recommended by Y. Saad.

TABLE 5.1Time breakdown of solver S4 for g = 4 and l =12. (*)Total time for 19iterations. (**)GPU part only; CPU part is accumulated in rest.Kernel Time* % GiB/s GFlops/s (ms)memcpy (3x) 38.7 30.4 2.6 --axpy (3x) 11.0 8.7 246.1 30.9dotp (2x) 4.9 3.9 246.3 31.0matvec 18.3 14.4 325.0 36.4solver (**) 29.8 23.4 299.1 24.5rest 24.4 19.2 -- --Total 127.1 100.0* memcpy 30.4%* axpy 8.7%* dotp 3.9%* matvec 14.4%* solver 23.4%* rest 19.2%Table 5.2Time breakdown of S4's forward substitution (= part of solver) for g =5 and I = 12. * Levels 9 to 12 and the remainder are performed by S1(on the CPU).Grid RRB-levels cx x cy Time % ([micro]s)1 1 + 2 1024 x 1024 258 37.32 3 + 4 512 x 512 172 24.93 5 + 6 256 x 256 38 5.54 7 + 8 128 x 128 13 1.9rest 9 - 12 and rem (*) 211 30.5Total 692 100.01 37.3%2 24.9%3 5.5%4 1.9%rest 30.5%TABLE 5.3Solver comparison: PARALUTION's default AMG solver versus the S4-4solver. AMG S4-4, l =12Problem Setup [s] Solve [s] Setup [s]size [h.sup.-1] #Iter CPU GPU #Iter CPU64 7 0.075 0.085 13 0.005128 8 0.105 0.128 16 0.007256 9 0.121 0.162 19 0.026512 10 0.317 0.241 20 0.0761024 13 1.157 0.498 20 0.3082048 16 4.511 1.084 19 0.725Problem Solve [s]size [h.sup.-1] GPU GPU64 0.012 0.007128 0.028 0.010256 0.042 0.010512 0.073 0.0181024 0.145 0.0422048 0.426 0.142

COPYRIGHT 2017 Institute of Computational Mathematics

No portion of this article can be reproduced without the express written permission from the copyright holder.

Copyright 2017 Gale, Cengage Learning. All rights reserved.