Chapter 4

Parallel linear solvers

4.1 Introduction

The reliability of the engineering community on CFD is growing due to the ability to solve complex fluid flow and heat transfer phenomena with accuracy and within a reasonable elapsed time. During the past decade, the increase of the speed of processors and memories has contributed to the reduction of this time, and hence, it has enabled to afford large engineering problems. However, while the complexity of such problems grows, the improvements in processor technology become physically limited. For that reason, only parallelism is able to boost performance significantly and to deal with long time and large memory consuming problems.

4.1.1 Hardware for parallel computing

Among the different architecture types derived along the past two decades, only a few of them have become nowadays the commonly used machines for scientific computing.

Following the classification of computers introduced by Flynn (1972), the most recent architectures fit into the SIMD (Single Instruction stream/Multiple Data stream) and MIMD (Multiple Instruction stream/Multiple Data stream) categories. The term stream relates to the sequence of data or instructions as seen by the machine during the execution of a program. However, it was not until the early to mid 1980s that machines conforming to the MIMD classification were available commercially. At this point, it is worth distinguishing between two MIMD categories, shared memory and distributed memory MIMD computers. The shared memory machines are considered to be tightly coupled, whilst the distributed memory machines are regarded as loosely coupled and employ the relatively slow message passing approach. See [56] for a comparative study of various computers.

More recently and following the development of distributed memory computers, there is an increasing interest in the use of clusters of workstations [57] connected together by high speed networks. The trend is mainly driven by the cost effectiveness of such systems as compared to large multiprocessor systems with tightly coupled processors and memories.

Although the numerical algorithms presented throughout this work are developed for any MIMD machine, the validation and measures of performance are carried out only for distributed memory machines. A Cray T3E with 32 tightly coupled processors and a 32 PC
cluster with fast ethernet based network are used for the numerical experiments. Further details of these computers may be found in the appendix.

4.1.2 Parallel programming models

The two basic models used in parallel computing are the Single Program Multiple Data (SPMD) model and the Multiple Program Multiple Data (MPMD) model. For our purposes, i.e., CFD problems, the model most commonly used is SPMD. In this approach, each processor runs an identical program but only computes on its own data $data_p$. To do so, the whole data or computational domain is distributed over all the processors, say $np$. For CFD problems, the distribution of data is done via domain decomposition, as it will be described later on. Furthermore, since the communication of data among processors may be necessary, each processor knows who is it and who are its neighbour processors $nb_p$. Hence, the SPMD model is written in Alg. 19 as follows.

**Algorithm 19 SPMD model**

get a processor, an identification number from $1,...,np : p$
set the neighbour processors: $nb_p$
get a portion of data from the $np$-partitioned domain: $data_p$
compute on $data_p$
communicate with $nb_p$

This model enables the programmer to write the same or different operations (both computations and communications) for each processor. This depends on the parallelisation of the operations.

For example, the matrix-vector product, say $y = Ax$, where $A$, $x$ and $y$ represent the whole computational domain is partitioned into $np$ subdomains containing the data: $A_p$, $x_p$ and $y_p$ for $p = 1,...,np$. In order to perform the global product, each processor performs its own subproduct and exchanges data with its neighbour processors $nb_p$ where needed. Since all processors do the same this operation is summarised with independence of the processor in Alg. 20.

**Algorithm 20 SPMD example: $y = Ax$**

get a processor, an identification number from $1,...,np : p$
set the neighbour processors: $nb_p$
get a portion of data from the $np$-partitioned domain: $data_p$
$A_p$=partition($A$, $np$)
$x_p$=partition($x$, $np$)
$y_p$=partition($y$, $np$)
exchange data with processors $nb_p$
evaluate the matrix-vector product with the $p$-portion of the domain
$y_p = A_p x_p$
Although Alg. 20 will be explained in detail later on, it is executed at each processor doing more or less all processors the same, and hence, taking similar timings. However, this fact depends on the load of processors in both the computation and communication sense. It is worth noting that a significant delay among processors would produce bottlenecks. Hence, in order to ensure the high efficiency of the parallel algorithm, a few synchronizations, like barriers, must be introduced in strategic points and for all processors. By doing so, the processors drop the continuously unbalanced load that may arise during the computation or communication of several, say $k$, consecutive steps. The synchronization points are introduced implicitly in the communication steps such as the represented in Alg. 21.

**Algorithm 21** Synchronization of computation and communication

\begin{align*}
&\text{computation (1)} \\
&\text{communication + synchronization (1)} \\
&\text{computation (2)} \\
&\text{communication + synchronization (2)} \\
&\cdots \\
&\text{computation (4)} \\
&\text{communication + synchronization (k)}
\end{align*}

More precisely, the communication is composed by an exchange of data followed by a synchronization. At the exchange step, there is a pair of send and receive processes from processor $p$ with the neighbour processors $np_{p}$. Once a processor has finished these tasks, it waits until the rest have done their respective tasks. This procedure is detailed in Alg. 22.

**Algorithm 22** Communication + synchronization

\begin{align*}
&\text{communications of processor } p \\
&\text{send local data to neighbour processors } np_{p} \\
&\text{receive local data from neighbour processors } np_{p} \\
&\text{synchronization with all processors} \\
&\text{wait for all processors}
\end{align*}

MPI implements this algorithm in different manners as it will be discussed in detail later on.

Although most of the parallel algorithms implemented under the SPMD model do the same operations for all processors, there are few operations that, for their implementation, require a hierarchy of processors, and hence, a different set of operations. As an example, we refer to the master-slave paradigm.

An example of the master-slave paradigm is the following. There is a master processor (often it is identified with $p = 0$) which collects the data sent from the rest of processors, named slaves (for $p = 1, \ldots, np$). The master carries out a set of operations which are
considered global operations, and finally, if it is required, the final result is distributed back to all the slaves. As represented in Alg. 23, there is an idle step for the group of slave processors while the master is doing its work. Therefore, a synchronization point at the end of the algorithm must be introduced in order to drop the arising bottlenecks.

Algorithm 23 Master-slave paradigm

\[
\begin{align*}
&\text{if } (p = \text{master}) \\
&\quad \text{receive local data from } p = \text{slaves} \\
&\quad \text{compute master operations} \\
&\quad \text{send global data to } p = \text{slaves} \\
&\quad \text{end if} \\
&\text{if } (p = \text{slave}) \\
&\quad \text{send local data to } p = \text{master} \\
&\quad \text{keep waiting idle} \\
&\quad \text{receive global data from } p = \text{master} \\
&\quad \text{end if} \\
&\text{synchronize master and slaves}
\end{align*}
\]

The inner product of two vectors can be coded as an example of the master-slave paradigm. Although all processor compute a part of the inner product, namely local inner product, one processor (the master) collects these partial results and computes a global summation. After that, the master distributes back to the rest of processor (the slave) the resulting value. However, there are better implementations of this operation by minimizing the number of messages and hence, improving the efficiency. Other operations that can be implemented with the master-slave paradigm are the input(read)/output (write) of global results from/to a file disk respectively.

4.1.3 Message-passing programming

On these distributed memory machines, the parallel programming model is based on explicit message-passing programming. The first MIMD machines implemented proprietary message-passing libraries, basically specifying the sending and receiving messages between processors and grouped in two categories of communication: the point-to-point communication and the group communication. Although the concepts of these proprietary libraries were equal, the no portability between platforms was not considered as ideal. For that reason, research efforts have been conducted to develop portable and standard message-passing libraries. Currently, the PVM (Parallel Virtual Machine) and more recently, MPI (Message Passing Interface) are adopted by all of the parallel computer vendors.

Since MPI is becoming de-facto standard for message passing, it has been set for the implementation and description of the communication subroutines of this work.
4.2 Performance measurements of an implementation

Once an algorithm is implemented for the parallel execution the next step is the evaluation of the parallel performance for np processors. To do so, we compare the wall clock time of the computation of the parallel implementation, denoted by $\tau_{np}$ with the wall clock time of computation of the sequential version, namely $\tau(1)$. We call this rapport the efficiency $E(np)$ of the parallel implementation for np processors.

$$E(np) = \frac{\tau(1)}{np\tau(np)}$$

If a parallel implementation with np processes was ideally 100% efficient, it means a np reduction of the time of computation respect to the computation with one processor. This reduction is called the speed-up $S(np)$ and it is measured as

$$S(np) = npE(np) = \frac{\tau(1)}{\tau(np)}$$

This is the ideal case but under certain circumstances, it is also possible to obtain a super linear speed-up. This usually occurs when the sparsity of a matrix is exploited on a parallel architecture or advantageous caching occurs.

It should be noted that there is currently some debate as to which sequential time, the parallel computation should be compared to. In this work the sequential time has been evaluated from the serialized version $np = 1$ of the parallel implementations.

The efficiency and speed-up are closely related and give an indication of how well balanced the computational load is and how the problem scales. In practice, an ideal speed-up cannot be attained and there are several reasons for this, which are:

- inherent sequential parts
- Unbalanced load
- Overhead of the communication and synchronization

The first item refers to the those parts of the algorithm that may not be perfectly parallel. For example, the inner product between two vectors and the factorization of a matrix. The second one refers to a the different distribution of load among processors. For example, the computational domain may not be equally distributed, or some operations require a master which performs more tasks than the slaves. These factors may be more or less controllable by our implementation. However, the major factor that contributes to the decrease of the efficiency of the parallel implementation is the overhead due to the exchange of data among processors. Each time a message is sent, an overhead in timing cost is incurred. Therefore, if these factors are included in the efficiency formula, it is possible to point out the drawbacks to a given implementation.

For instance, an improved measure of the efficiency is based on an accurate description of the operations and the time spent in each operation. This is in essence the measure of the sequential algorithm. For the parallel algorithm, we have to add the time spent in the communication processes needed in some operations. Then the efficiency is computed as

$$E(np) = \frac{n_{seq}(1)\tau_{seq}}{np(n_{seq}(np)\tau_{seq} + n_{comm}(np)\tau_{comm})}$$
Where $n_o$ stands for the number of operations, $\tau_s$ the time spent in one of these operations, $n_c(p)$ the number of communications and $\tau_c(np)$ the time spent in one of these communications. Notice that the time spent in the communication process depends on the number of processors $p$. More processors means more number of communications $n_c(np)$ but the quantity of data transferred per communication and processor is reduced. Furthermore it is convenient to express more accurately the time of communication $\tau_c$ split into two parts: the proper time of communication when sending a packet of data with size $d$ at $\beta^{-1}$ rate of communication and the time of setting up the communication $\tau_s$

$$\tau_c(np) = \tau_s + \beta d$$

Since the bandwidth $\beta$ and the latency $\tau_s$ are hardware dependent parameters, the efficiency would be very different when it is evaluated in either tightly coupled processors or loosely coupled processors. Introducing these concepts in the previous expression the efficiency leads to

$$E(np) = \frac{n_o(1)\tau_s}{n_p(n_o(p)\tau_s + n_c(np)(\tau_s + \beta d(np)))}$$

Having a look at this expression and assuming that $n_o(1) \approx npm_o(np)$ in our algorithms, we see that the efficiency is always less than 100% and it decreases as $np$ increases. The number of communications $n_c(np)$ increases linearly with $np$, and although the time of communication of data $\tau_c(np)$ decreases because the overall data are better distributed, the latency remains constant and hence the overall time of communication per process increases.

If we partition our domain with a certain overlapping, the number of operations performed in sequential and in parallel per processor $np$ is different. If we express the number of operations per process $n_o(np)$ in terms of the number of control volumes handled by the processor $n_c(np)$ multiplied by the number of operations per control volume $m_o(np)$, we get

$$n_o(np) = n_o(np)n_c(np)$$

and substituting it into the efficiency, we obtain an expression with three component effects: algorithm $E_a(np)$, load $E_l(np)$ and communication $E_r(np)$.

$$E(np) = E_a(np)E_l(np)E_r(np)$$

Where

$$E_a(np) = \frac{n_o(1)}{npm_o(np)}, \quad E_l(np) = \frac{n_c(1)}{npm_o(np)}, \quad E_r(np) = \frac{1}{1 + \frac{n_c(np)}{n_o(np)}\tau_s + \beta d(np)\tau_s}$$

The first effect means the raport of operations performed in the sequential algorithm versus the parallel algorithm. For Krylov solvers we consider an equal number of operations so this effect is neglected. The second effect takes account of the overlapping on $n$. Each processor contains its own data plus an overlapping. However for large scale problems the overlapping is much smaller than the data contained in the processor so this effect is reduced. The third effect is the raport between the amount of work done of computation and communication per processor. Therefore, the efficiency not only depends on the parallel implementation but also on the parameters of communication, i.e. the latency $\tau_s$ and the bandwidth $\beta^{-1}$ which are hardware dependent.
Finally, there is another performance measure so called scalability, which is related with the computing time of the problem with \( np \) and how it increase when increasing in the same ratio (e.g. doubling) both the problem size and the number of processors. It is clear that, for an ideal scalability of the implementation, the computing time will remain constant. Hence, the problem may be scaled by two, four, and so on, by means of \( 2np, 4np, ... \) processors.

Thus, the scalability may be computed as

\[
S_c(np) = \frac{\text{size}_{np}}{\text{size}_1}
\]

4.3 Modellization of the communication time

As mentioned above, the loss of performance of the parallel algorithm depends directly on the time spent in the communication of data between processors. A low communication cost is always desirable for high performance computing. The time of communication while sending a block of data is affected by two parameters the latency or setup time \( r_s \) and the bandwidth or byte transferred rate per second \( B^{-1} \). The correlation between the time of communication \( \tau_c \) (given in seconds) and the transferred data \( d \) (given in bytes) was expressed in the first approximation by a linear equation:

\[
\tau_c = r_s + \beta d
\]

Indeed, both parameters depend clearly on the hardware (networks, switch and crossbars) and the software (operating system, TCP/IP and MPI library implementations, compilation and executing flags). Details about the influence of such hardware and software issues over the communication performance are out of the scope of this work.

An approximate evaluation of the latency and bandwidth parameters can be done by sending blocks of data with different sizes, and measuring the time spent in the communication [56]. This procedure or test is commonly named microbenchmark of the network because it transfers small data packets that takes short time of communication. Since the measure of time can be either intrusive or does not have enough accuracy, the time of communication is enlarged by returning the same packets of data (i.e. a round-trip of the packets) and halving the measured time. The test does a thousand of experiments by sampling the size of the packed data randomly within a range of 0.1MB and measuring the time of the round-trip. For instance, the type of data transferred is the char, which size in bytes is the unit.

In order to ensure non overlapped send and receive process which can reduce the time of the round-trip, the block communication mode is used. Alg. 24 summarizes this test.
Algorithm 24 Latency and bandwidth parameters

for (sample = 1 to sample \leq 1000)
    set size[sample]=random(0,1000000)
synchronize processors
    MPL.Barrier(MPL.COMM.WORLD)
    time the round-trip
    t0=MPL.Wtime()
    if (rank mod 2=0)
        MPL.Send(char.send, size[sample], rank+1,...)
        MPL.Recv(char.recv, size[sample], rank+1,...)
    else
        MPL.Recv(char.recv, size[sample], rank-1,...)
        MPL.Send(char.send, size[sample], rank-1,...)
    end if
    t1=MPL.Wtime()
    time[sample]=(t1-t0)/2
end for

The MPL.Barrier step ensures the synchronization of the measured times and therefore reduces the dispersion of results. Furthermore, this test is executed by a wide range of even numbers of processors 2,4,6,8 obtaining results (see Fig. 4.2) without significant differences.

This figure shows the linear behaviour of the communication time of the transferred pack of bytes. The difference of slopes for each facility indicates different byte rates per second. For the estimation of the slope, the range of large messages (1KB,...,1MB) has been used. And the inverse of the slope is the estimated bandwidth.

For the estimation of the latency parameter, there are two possibilities. The first one is taking the time value for a zero pack of bytes. But as mentioned before, time measure may be inaccurate. In a second approach, a zoom (see Fig. 4.1) near the short transferred pack of bytes (0,...,1KB) is used with a linear fit regression. The latency parameter is then estimated for a zero pack of bytes. The different values of latency and bandwidth parameters for both facilities are summarized in the table 4.1.

<table>
<thead>
<tr>
<th>Facility</th>
<th>Latency (1) ( \mu \text{sec} )</th>
<th>Latency (2) ( \mu \text{sec} )</th>
<th>Bandwidth MB/sec</th>
</tr>
</thead>
<tbody>
<tr>
<td>PC cluster</td>
<td>226.98</td>
<td>218.95</td>
<td>10.46</td>
</tr>
<tr>
<td>Cray T3E</td>
<td>23.17</td>
<td>21.18</td>
<td>119.98</td>
</tr>
</tbody>
</table>

Table 4.1: Latencies measured (1) and estimated (2) and bandwidth estimated.
Figure 4.1: Zoom of previous figure near the short transferred pack of bytes.

Figure 4.2: Timing results on two facilities the PC cluster and the Cray T3E.
These values clearly show the great difference of communication performance between facilities. The low latency and large bandwidth of the Cray T3E explain in part the high price against the PC cluster.

4.4 Communication modes

The exchange of data among processors can be performed over two different modes of communication: blocking and non-blocking. The blocking communication mode disables the processors to perform other operations while the communication is being done. Since the process of sending or receiving the data involves the access to the buffer of memory, MPI library protects the data by blocking the processor and hence, avoiding the access to the data for reading or writing. The blocking communication mode among processor $p$ and its neighbours $ngh_p$ for a given buffer of data, say $data_p$, is described in Alg. 25.

```
Algorithm 25 Blocking communication

send $data_p$ to neighbours $ngh_p$
block.send ($data_p$, $ngh_p$)

receive $data_p$ from neighbours $ngh_p$
block.receive ($data_p$, $ngh_p$)
```

Since the communications in this mode follow the order of the algorithm, processors either the sender or the receiver keep idle waiting for completion of the communication processes, and hence, resulting in poor efficiency. However, this mode has a implicit synchronization procedure so it is not necessary to explicitly synchronize the processors after the ends of the pairs of send and receive processes.

The order of the communication processes among processors has to be considered carefully. A very important fact in a bad scheduled communication is the dead lock. It appears when two or more processors send and receive data in a wrong order. When this happens, the processors do not understand themselves and keep waiting infinitely for the communication request. The dead lock may be avoided by organizing the communication in two simple steps. Firstly, all processors send their respective data to their respective receivers, i.e., the neighbour processors. Any order for the senders could be defined. Secondly, the senders are ready to receive the data from their respective receivers. The dead locks are avoided by ensuring the existence of pairs of send and receive messages. If not, it will appear to have an orphan message process (a sender without a receiver or a receiver without a sender) and thus resulting into a dead lock. The use of tags is convenient for the right schedule of sending and receiving data.
The other mode of communication to be used in data exchange is the non-blocking communication. Each processor copies the data to communicate $x_p$ to another buffer $y_p$. Then MPI library acts on that data without blocking the processors to perform other operations on the original buffer. The non-blocking communication for the same vector is described in Alg. 26.

Algorithm 26 Non-blocking communication ($x_p$)

- copy data from $x_p$ into another buffer $y_p$
- send data of $y_p$ from $p$ to neighbours $ngh_p$
  - non.block.send ($y_p, ngh_p$)
- operate over $x_p$

... 

- receive data in $y_p$ from neighbours $ngh_p$ to $p$
  - non.block.receive ($y_p, ngh_p$)
- operate over $x_p$

... 

- synchronize all processes
- wait for all
- copy data from $y_p$ into the buffer $x_p$

However, this procedure could reduce the performance of parallel algorithms due to the elapsed times of the pre step of copying data from the buffer $x_p$ to the buffer $y_p$ and, after the communication step, the post step of copying back the data to the original buffer. In addition, for large amounts of data, the performance of buffering decreases due to the cache missings.

Furthermore, the data in $x_p$ to be exchanged, e.g. halos and inside blocks, are usually located in non contiguous locations of the buffer. Therefore the data must be arranged continuously in the buffer $y_p$ before any communication. After the communication is done, the data contained in the buffer $y_p$ must be redistributed to $x_p$ in the non contiguous locations.

In order to see the differences of performance of both modes of communication, a second test is done. This test (see Alg. 27) shows the performance of the implementation of an exchanged pack of bytes.
Algorithm 37: Blocking and non-blocking communication nodes

for (mode=blocking to mode=non-blocking) 
for (sample=1 to 1000)
    size[node][sample]=random(0,1000000) 
    MPI_Barrier(MPI_COMM_WORLD) 
    t0=MPI_Wtime() 
    If (mode=blocking)
        If (frac(rank,2)==0)
            MPI_Send(char_send.size[node][sample],rank+1,...)
            MPI_Recv(char_recv.size[node][sample],rank+1,...) 
        else
            MPI_Send(char_send.size[node][sample],rank-1,...)
            MPI_Recv(char_recv.size[node][sample],rank-1,...) 
        end if 
    end if 
    If (mode=non-blocking)
        If (frac(rank,2)==0)
            MPI_Isend(char_send.size[node][sample],rank+1,...)
            MPI_Irecv(char_recv.size[node][sample],rank+1,...) 
        else
            MPI_Isend(char_send.size[node][sample],rank-1,...)
            MPI_Irecv(char_recv.size[node][sample],rank-1,...) 
        end if 
        MPI_Wait(isend) 
        MPI_Wait(irecv) 
    end if 
    t1=MPI_Wtime() 
    time[node][sample]=t1-t0 
end for 
end for 

Regarding the blocking communication mode, it performs the send and the receive processes of a given pack of bytes consecutively. It is like the round-trip of the pack described in the previous test, but the measure of time is not halved.

If the network enables communications among processors in both senses at same time (i.e. duplex network), a pair of send and receive processes can be done in half of time. Conversely, the non-blocking communication mode enables one to do this pair of processes at same time and to wait until it completes the two processes independently.

Like in the previous test, a thousand of experiments with randomized packet sizes of type char and covering the range of (0,...,1MB) is done.

Due to the duplex network feature, the order of the non-blocking communication pro-
cesses within the algorithm does not affect the communication procedure and therefore no deadlocks could arise. The scope of the MPI-Wait step is to wait for termination of communication processes.

The test is carried out, analogously to the previous test, on both facilities PC cluster and Cray T3E, whose networks are duplex. The results are shown in Fig. 4.3.

![Bar chart showing blocking and non-blocking communication modes for the PC cluster and the Cray T3E.](image)

Figure 4.3: Blocking and non-blocking communication modes for the PC cluster and the Cray T3E.

Fig. 4.3 shows that the behaviour of both communication modes are similar in both machines but at different scales of time. The non-blocking communication of any pack of bytes is performed more efficiently and nearly at half time of the the time taken by the blocking communication. The byte exchanged rates (i.e. the inverse of the slopes), for each implementation and their reports are given in table 4.2.

Therefore, it is convenient to use a non-blocking communication implementation for those operations where an exchange of information is required such the matrix-vector product. For that reason, MPI library provides a set of data types which enables to skip these pre and post copying steps, and hence, to improve the communication performance.
<table>
<thead>
<tr>
<th>modes</th>
<th>PC cluster</th>
<th>Cray T3E</th>
</tr>
</thead>
<tbody>
<tr>
<td>blocking (1) MB/sec</td>
<td>5.255</td>
<td>69.321</td>
</tr>
<tr>
<td>non-blocking (2) MB/sec</td>
<td>7.889</td>
<td>110.387</td>
</tr>
<tr>
<td>rapport: (2)/(1)</td>
<td>0.66</td>
<td>0.54</td>
</tr>
</tbody>
</table>

Table 4.2: Megabyte exchanged rates (MB/sec) and rapports for both modes of communication on the PC cluster and the Cray T3E.

4.5 Domain decomposition

The most popular approach to solving CFD problems on MIMD architectures whether it is memory shared or memory distributed is that of the domain decomposition [58]. The objective is to distribute the computational domain onto a number of processors such that the work load on each processor is equivalent. The practical application involves the discretization and the solution of the systems of equations derived in previous chapters on an equally distributed load per processor. The details of the implementation of the domain decomposition are given hereinafter from an algebraic point of view.

Most algebraic operations involved in the discretization and solution procedures are based on the addition, the subtraction and the product of three types of objects: scalars, vectors and matrices. The implementation of these operations on these objects can be thought to be performed either in sequential or in parallel. In a one-processor machine the operation and the storage of objects are done as it is expressed mathematically. However, in a MIMD machine, the algebraic operations are performed on a part of the objects. The vector object, for example, is equally distributed as much as possible and stored among all processors. This equally distribution of objects is stressed to balance the load, and hence, to obtain a good efficiency of the implementation.

From here on, we shall use subindex for distinct parts of the objects. If we have a np-processor machine where np stands for the number of processors and object, named o, the object is partitioned into np objects and stored at each processor. Labelling the resulting objects from p = 1 to p = np it follows that

$$o = \bigcup_{i=1}^{np} o_i$$

An algebraic operation, named for generality @, between two objects x and y is performed in each process p using only the respective parts xp and yp. The result can be stored in another distributed object z.

$$z = x @ y \iff z_p = x_p @ y_p, \quad p = 1, \ldots, np$$

Thinking this way, it seems easy to implement parallel np > 1 or sequential np = 1 algebraic operations with distributed objects. Nevertheless, some operations like the inner product between two vectors or the 2-norm of a vector, the maximum and minimum value of the components of a vector, and the product of a matrix with a vector involve information stored in the closest processors or even in all processors. Therefore, this information must
be copied from these processors, named neighbour processors $o_{np}$ to the affected processor $p$. This yields to an extra information per processor of the object.

\[ ov \neq \bigcap_{p \in o_{np}} o_{p} \]

Hereinafter, we shall call this additional information as the overlapping values ov. Most operations requires an overlapping of a single point, one line of points or a surface of points for a one, two or three dimensional object respectively. The implementation of these ideas to vectors and matrices are developed in next subsections.

### 4.5.1 Block vector

Let us suppose $x$ to be a vector object which maps a $d$-dimensional domain $\Omega$. The partition of this domain in $np$ parts is carried out in an equally distributed manner among the $np$ processors. Assuming that the domain $\Omega$ is discretized onto a structured grid of points, the partition is performed easily following the orthogonal directions. For instance, a vector that maps a three dimensional domain (see Fig. 4.4) is partitioned in two orthogonal directions $p_1 = 2$ and $p_2 = 3$ leading into 6 block vectors $x_p$ with $p = 1, \ldots, np = p_1 \times p_2$.

![Figure 4.4: Two dimensional partition of a vector that maps a three dimensional domain.](image)

Let $I \times J \times K$ be the overall grid points over the domain $\Omega$, the block vector $x_p$ maps the $p$-part of the domain at least in $(I/p_1) \times (J/p_2) \times K$ grid points. Notice that, a perfect load balancing in all directions is considered. An unbalanced partition leads into a generalized grid size $id \times jd \times kd$. Fig. 4.5 shows the mapping of $x_p$ with a generalized index $i,j,k$. 
Figure 4.5: Generalized $id \times jd \times kd$ partition $x_p$ of a three dimensional vector $x$. The index $i, j, k$ has a range from $i_1, j_1, k_1$ to $i_2, j_2, k_2$.

The overlapping $ov$ is added in those directions where the information of the neighbour processors is needed. The blue dashed lines in Fig. 4.6 show the overlapping areas among processors.

Figure 4.6: Overlapping areas $ov$ for a $2 \times 3$ partitioned vector that maps a three dimensional domain.
The resulting dimension of a generalized $x_p$ vector with grid size $id \times jd \times kd$ and with the same overlapping $ov$ in all directions is represented in Fig. 4.7.

\[ x = (x_1, x_2, \ldots, x_{np-1}, x_{np})^T \]

By doing so, it is easy to represent the operations with vectors. For instance, the copy of a vector $x$ into another vector $y$ is represented by omitting the overlapping as

\[
y = x \iff \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_{np-1} \\ x_{np} \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_{np-1} \\ y_{np} \end{bmatrix}
\]
However, the implementation of this operation considers the overlapping. Alg. 28 gives such example considering an overlapping of \( ov \).

**Algorithm 28 Copy_vect(\( x_{p} \), \( y_{p} \))**

```
for (\( i = i1 - ov \) to \( i = i2 + ov \))
    for (\( j = j1 - ov \) to \( j = j2 + ov \))
        for (\( k = k1 - ov \) to \( k = k2 + ov \))
            \( y_{p}(i, j, k) = x_{p}(i, j, k) \)
        end for
    end for
end for
```

4.5.2 Block matrix

Let \( A \) be a \( N \)-square matrix such that \( N = (I \times J \times K) \) arises from the discretisation of the governing equations of a CFD problem onto a grid of \( I \times J \times K \) points. We represent the matrix \( A \) partitioned by rows as balanced as possible in \( np \) blocks of matrices \( A_{p} \):

\[
A = \begin{bmatrix}
    A_{1} \\
    A_{2} \\
    \vdots \\
    A_{np-1} \\
    A_{np}
\end{bmatrix}
\]

Each of these matrices \( A_{p} \) contains the matrix coefficients of the linear system \( Ax = b \) associated with the partition \( p \).

\[
\begin{bmatrix}
    A_{1} \\
    A_{2} \\
    \vdots \\
    A_{np-1} \\
    A_{np}
\end{bmatrix} \begin{bmatrix}
    x_{1} \\
    x_{2} \\
    \vdots \\
    x_{np-1} \\
    x_{np}
\end{bmatrix} = \begin{bmatrix}
    b_{1} \\
    b_{2} \\
    \vdots \\
    b_{np-1} \\
    b_{np}
\end{bmatrix}
\]

In previous chapter, we described the different patterns of these sparse matrices for a natural ordering (i.e. the 7-point and the 19-point formulation is a three dimensional CFD problem. Since a narrowed band of coefficients are non zero, we shall store only the blocks of non zero coefficients of each partition.

Here, we used the following notation for a one dimensional partition:

\[
\begin{bmatrix}
    A_{1} & A_{1,2} & 0 & \cdots & \cdots & 0 \\
    A_{2,1} & A_{2} & A_{2,3} & 0 & \cdots & \cdots \\
    0 & A_{np-2,1} & A_{np-2,2} & A_{np-2,3} & \cdots & 0 \\
    \cdots & \cdots & \cdots & \cdots & \cdots & \cdots \\
    0 & A_{np-1,1} & A_{np-1,2} & A_{np-1,3} & \cdots & A_{np}
\end{bmatrix} \begin{bmatrix}
    x_{1} \\
    x_{2} \\
    \vdots \\
    x_{np-1} \\
    x_{np}
\end{bmatrix} = \begin{bmatrix}
    b_{1} \\
    b_{2} \\
    \vdots \\
    b_{np-1} \\
    b_{np}
\end{bmatrix}
\]
Having a look at the structure of this matrix each process \( p \) stores two different matrices \( A_p \) and \( A_{p,nghp} \). Where \( nghp \) stands for the neighbour processes of process \( p \). The first matrix indicates the operations performed with the information stored within the processor \( p \) and the second one indicates the operations performed with the information stored at neighbour processors.

For a generalized partition in two or three directions, the structure of blocks of matrix \( A \) the linear system is written as follows:

\[
\begin{bmatrix}
A_1 & A_2 & \vdots & A_{np-1} & A_{np} \\
A_{n1} & x_{n1} & \vdots & x_{n(np-1)} & x_{np} \\
A_{n2} & A_{n(np+1)} & \vdots & A_{n(n(np-1))} & A_{n(nnp)} \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
A_{n(np-1)} & A_{n(np+np-1)} & \vdots & A_{n(np(np-1))} & A_{n(np(np+np))} \\
A_{n(np)} & A_{n(np+np)} & \vdots & A_{n(np(np+np+1))} & A_{n(np(np+npnp))} \\
\end{bmatrix}
\begin{bmatrix}
x_1 \\
x_2 \\
\vdots \\
x_{np-1} \\
x_{np} \\
\end{bmatrix}
= 
\begin{bmatrix}
x_{n1} \\
x_{n2} \\
\vdots \\
x_{n(np-1)} \\
x_{nnp} \\
\end{bmatrix}
= 
\begin{bmatrix}
b_1 \\
b_2 \\
\vdots \\
b_{np-1} \\
b_{np} \\
\end{bmatrix}
\]

Thus, a set of communications between the processes involved in such operations must share the information of the \( nghp \) process to the \( p \) process. In the chapter ahead, we shall explain some issues of this communication between the different processes.

### 4.6 Exchange of data blocks

MPI data type has been used to send and receive at once the non contiguous data \( g_p \) of a vector \( x_p \). MPI data type provides a new type of variables in order to send and receive blocks of information located in different points of the buffer at once. By doing so, we can reduce, on one hand, the number of communications and latency, and on the other, the number of cache missings in the buffering processes. A sender processor can explicitly pack noncontiguous data into a contiguous buffer and then send it. A receiver processor can explicitly unpack data received in a contiguous buffer and store in noncontiguous locations.

For simplicity, let us suppose that a \( x_p \) vector has dimensions \( 4 \times 4 \) and the noncontiguous data \( g_p \) has dimensions \( 2 \times 3 \) (see Fig. 4.8).

![Figure 4.8: The 4 x 4 x_p vector and the 2 x 2 g_p data filled in blue. The numbers written within the x_p vector represent the order of data in the buffer.](image)
A schematic representation of the buffer (see Fig. 4.9) shows the noncontiguous data $y_p$ embedded in the map $x_p$.

![Figure 4.9: Buffer representation of $x_p$ and the noncontiguous data $y_p$. The noncontiguous data follows a pattern of $cnt = 3$ noncontiguous blocks of length $bl = 2$ separated with an stride of $str = 4$.](image)

Notice that the noncontiguous data follows a pattern of 3 noncontiguous blocks of dimension 2 separated with an stride of 4. This information is enough to construct a new data type with continuous data by packing the $y_p$ data.

This idea has been easily extended to more complex noncontiguous data such as a three dimensional vectors, with two different number of blocks with different lengths and strides. Fig. 4.10 shows a three dimensional $x_p$ vector with dimensions $I \times J \times K$ and a $i \times j \times k$ noncontiguous data $y_p$.

![Figure 4.10: Representation of $x_p$ vector and the noncontiguous data $y_p$. The noncontiguous data follows a pattern of two noncontiguous blocks $cnt_1$, $cnt_2$ with different lengths $bl_1$, $bl_2$ separated with strides $str_1$ and $str_2$ respectively.](image)

In this case, we apply recursively two data types. The first one has $j$ blocks of length $i$ with an stride of $I$. The second one put over the first one, has $k$ blocks of length $1$ with an stride of $IJ$. 
Finally, the data type implementation over a non blocking communication leads to Alg. 29, used in the product of a matrix by a vector.

**Algorithm 29 Update (on, zp)**

- set the data type \( y_p \) for noncontiguous data of \( z_p \)
- *data.type* (on, zp, yp)
- send \( y_p \) data from \( p \) to neighbours nghp
- *non.blocking.send* (yp, nghp)
- compute over \( z_p \)

...  

receive \( y_p \) data from neighbours nghp to \( p \)
- *non.blocking.receive* (yp, nghp)
- compute over \( z_p \)
...  

synchronization of all process
- *wait for all*

Further details on data types can be found in MPI documentation available in internet [7].

The following test measures and compares the communication performance of two possible implementations of the `update(on, zp)` subroutine. Both implementations use the non-blocking mode of communication which has been tested to be the better. Furthermore, both implementations perform a copy of values to be exchanged to a buffer and then they are exchanged. By doing so, it is possible to perform operations with the original values and exchange a copy at the same time. Therefore, this procedure increases the performance of operations because it overlaps the computation and the communication. A draft (see Fig. 4.11) of the processes shows this procedure.

However the implementation of the buffering process affects this overlapping. The original buffer contains the amounts of data in non-contiguous blocks and the copied buffer must contains the data in contiguous blocks before they are sent to another process. This process is called packing. Conversely, after the communication is done, the received pack of data contained in a contiguous buffer has to be unpacked at non-contiguous locations in the original buffer of the receiver process.

In order to reduce the time of the whole procedure of exchange of data, an efficient implementation of the packing and unpacking procedures is desired. A first implementation does an explicit copy of the non-contiguous blocks of data in a contiguous buffer. This implementation is tedious to implement: dynamic allocations, initializations and copying data from one buffer to the other, in addition, it may suffer possible overheads due to the cache misses. This implementation is called simply non-blocking.
A second and easier implementation uses the MPI datatype to do implicitly the copy from the original and non-contiguous buffer to a contiguous buffer, thus reporting coding and time savings. This implicit copy means that only a structure of pointers to the different locations of the non-contiguous blocks are stored in a MPI datatype. After that, the MPI library performs the packing and unpacking processes implicitly and in an efficient way when the non-blocking send and receive subroutines are called.

Differences of both implementations are tested as follow. A three-dimensional halo with random size in $k$ direction is updated between two processes, left and right. It has been chosen to vary the $k$ direction instead of $i$ or $j$ since the distribution of non-contiguous blocks is sparser and may produce more cache missings. See Fig. 4.12 for details of block data sizes and graphical explanation of the communication process.
Figure 4.12: Update of ov halo between processes left and right.

The test is run at P5 cluster for three different halos: ov=1, ov=2 and ov=4, which are often used in the algebraic operations with communications. For each size of halo, two measures of times are taken. The time of packing and unpacking, called pack, and the time of communication, called comm. These results and the overall time (i.e. the addition of these quantities) are presented for both implementations.
A comparison of all cases is given in Fig. 4.13:

![Graph showing comparison of packing and unpacking times for different halo sizes.]

Figure 4.13: Comparison of both implementations for all halo sizes.

For each size of halos, the times of packing and unpacking of both implementations are very different. The details of each case are reported in Figs. 4.14, 4.15, 4.16, 4.17, 4.18 and 4.19.

The efficiency of the MPI datatype implementation at PC cluster is based on the cache optimization for non-contiguous blocks when they are packed and unpacked. In addition, the packing and unpacking processes of this implementation have to be done only once because it only points to the non-contiguous data. Conversely, in the first implementation, the packing and unpacking processes are done at each exchange thus it is an explicit copy. So the performance of the overall communication is poor.

Although the results of Cray T3E has not been presented here, the difference of both implementations are not meaningful. We guess that this fact is due to the special cache built-in the cpu (see hardware issues in Appendix).
Figure 4.14: Update of ov = 1 with MPI datatype.

Figure 4.15: Update of ov = 1 with explicit copy.
Figure 4.16: Update of $o\ell = 2$ with MPL datatype.

Figure 4.17: Update of $o\ell = 2$ with explicit copy.
Figure 4.18: Update of ov = 4 with MPI datatype.

Figure 4.19: Update of ov = 4 with explicit copy.
4.7 Algebraic operations with vectors and matrices

Three types of algebraic operations are detailed: those without communication between processors, so the parallel efficiency is 100%, those operations that combine computation and communication, and those operations where most part of the job is the communication.

4.7.1 Addition, difference and scaling of vectors

The addition or the difference of two vectors $x$ and $y$ is stored in a third vector $z$. The algorithm (see Alg. 30) that represents any of these operations may be written as

Algorithm 30 operation $\text{vec}(x_p, y_p, z_p)$

$$
\begin{align*}
\text{for} \ (i = i_1 \ \text{to} \ i = i_2) \\
& \quad \text{for} \ (j = j_1 \ \text{to} \ j = j_2) \\
& \quad \quad \text{for} \ (k = k_1 \ \text{to} \ k = k_2) \\
& \quad \quad \quad z_p(i, j, k) = x_p(i, j, k) \pm y_p(i, j, k) \\
& \quad \quad \quad \text{end for} \\
& \quad \quad \text{end for} \\
& \quad \text{end for}
\end{align*}
$$

Another 100% parallel algebraic operation is the vector scaling (see Alg. 31). A vector $x$ can be scaled by a real value $\alpha$ leading to a vector $z$.

Algorithm 31 Scal.$\text{vec}(x_p, \alpha, z_p)$

$$
\begin{align*}
\text{for} \ (i = i_1 \ \text{to} \ i = i_2) \\
& \quad \text{for} \ (j = j_1 \ \text{to} \ j = j_2) \\
& \quad \quad \text{for} \ (k = k_1 \ \text{to} \ k = k_2) \\
& \quad \quad \quad z_p(i, j, k) = \alpha x_p(i, j, k) \\
& \quad \quad \quad \text{end for} \\
& \quad \quad \text{end for} \\
& \quad \text{end for}
\end{align*}
$$

Notice that we have not considered the overlapping area in the compustion of the vector $z$. This fact reduces the number of floating point operations, and it produces a reduction of time of the computation. Furthermore it is possible to perform these operations reusing any vector, e.g., $z = x \pm x$, $z = x \pm y$ and $z = \alpha x$, and hence, reducing storage requirements.
4.7.2 Saxpy operation

The name of the subroutine saxpy [59] comes from the scientific literature and it represents a composition of two previous types of operations.

\[ z = x + ay \]

Although it can be implemented in two steps by means of the above algebraic operations, it has been packed in one step. Alg. 32 also enables the reuse of any vector.

```
Algorithm 32 Saxpy(xp, α, yp, zp)
    for (i = i1 to i = i2)
        for (j = j1 to j = j2)
            for (k = k1 to k = k2)
                zp(i, j, k) = xp(i, j, k) + αyp(i, j, k)
            end for
        end for
    end for
```

4.7.3 Inner product of vectors

The computation of the inner product of two vectors is one of the most important keys in the parallel efficiency of most solvers, because it involves a global communication between all processors. Let

\[ ρ = < z, y > \]

be the inner product of two vectors, it is performed in two steps (see Alg. 33). It starts with the inner product of each pair of sub-vectors \( x_p \) and \( y_p \), where \( p = 1, 2, \ldots, n_p \). The resulting set of \( n_p \) inner products is stored in an auxiliary variable \( ρ_p \). Then a global sum of these values is performed and shared to all the processors by means of a global communication, so there is a fraction of time spent on the communication, and hence a loss of efficiency.
Algorithm 33 Inner_product($x_p, y_p, p$)

evaluate inner product of vectors $x_p, y_p$

$p_0 = 0$

for ($i = 1$ to $i = i_2$)

for ($j = j_1$ to $j = j_2$)

for ($k = k_1$ to $k = k_2$)

$p_p = p_p + x_p(i, j, k)y_p(i, j, k)$

end for

end for

end for

evaluate global summation of $p_p$

global_sum ($p_p, p$)

Notice that Alg. 33 enables to compute the 2-norm of a vector $x$ (see Alg. 34).

The inner product operation contains a global communication that broadcast each sub inner product to the rest of processors. After that, a sum of all values is performed in each processor. The implementation of the broadcast plus the summation relies on the MPI library and it is performed in the MPI:Allreduce subroutine.

Algorithm 34 Norm_vect($x_p, p$)

compute the inner product of $x$ with itself.

inner_product($x_p, x_p, p$)

evaluate the 2-norm of $x$

$p = \sqrt{p_0}$

Since the number of messages and data does not depend on the partitioning configuration, differences of the speed-ups are due to the differences in computation. Therefore, only the cache missing effects may arise for some configurations for a given case with large size. For instance, the size of vectors are $20 \times 20 \times 20, 40 \times 40 \times 40, 60 \times 60 \times 60, 80 \times 80 \times 80$ and $100 \times 100 \times 100$.

The test is executed on both facilities, PC cluster and Cray T3E within the range of 1 to 12 processors and for different partitioning directions (see Fig. 4.20).
Figure 4.20: partitioning directions that yield different topologies of processors (line, plane and hexahedron) with 2, 4, 6, 8 and 12 processors.

The results are represented in terms of the speed-up in Figs. 4.21, 4.22 and tables 4.3, 4.4 for PC cluster and the Cray T3E respectively.
Figure 4.21: Speed-up for the inner product of 3D vectors in PC cluster.

<table>
<thead>
<tr>
<th>np</th>
<th>partition</th>
<th>20 x 20 x 20</th>
<th>40 x 40 x 40</th>
<th>60 x 60 x 60</th>
<th>80 x 80 x 80</th>
<th>100 x 100 x 100</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>1x1y1z</td>
<td>1.00</td>
<td>1.00</td>
<td>1.00</td>
<td>1.00</td>
<td>1.00</td>
</tr>
<tr>
<td>2</td>
<td>1x1y2z</td>
<td>0.27</td>
<td>1.30</td>
<td>1.71</td>
<td>1.22</td>
<td>1.85</td>
</tr>
<tr>
<td>4</td>
<td>1x1y4z</td>
<td>0.13</td>
<td>1.63</td>
<td>1.82</td>
<td>3.15</td>
<td>3.38</td>
</tr>
<tr>
<td>4</td>
<td>1x2y2x</td>
<td>0.13</td>
<td>1.42</td>
<td>2.49</td>
<td>3.09</td>
<td>3.35</td>
</tr>
<tr>
<td>6</td>
<td>1x1y6x</td>
<td>0.12</td>
<td>2.00</td>
<td>3.14</td>
<td>4.10</td>
<td>4.71</td>
</tr>
<tr>
<td>6</td>
<td>1x2y3x</td>
<td>0.12</td>
<td>2.02</td>
<td>3.11</td>
<td>3.77</td>
<td>4.73</td>
</tr>
<tr>
<td>8</td>
<td>1x1y8x</td>
<td>0.11</td>
<td>2.00</td>
<td>3.30</td>
<td>4.80</td>
<td>5.80</td>
</tr>
<tr>
<td>8</td>
<td>1x2y4x</td>
<td>0.11</td>
<td>1.95</td>
<td>3.38</td>
<td>4.72</td>
<td>3.99</td>
</tr>
<tr>
<td>8</td>
<td>2x2y2x</td>
<td>0.11</td>
<td>1.98</td>
<td>3.55</td>
<td>4.76</td>
<td>5.86</td>
</tr>
<tr>
<td>10</td>
<td>1x1y10x</td>
<td>0.09</td>
<td>1.74</td>
<td>3.41</td>
<td>5.02</td>
<td>6.20</td>
</tr>
<tr>
<td>10</td>
<td>1x2y5x</td>
<td>0.09</td>
<td>1.73</td>
<td>3.25</td>
<td>5.28</td>
<td>5.99</td>
</tr>
<tr>
<td>12</td>
<td>1x1y12x</td>
<td>0.09</td>
<td>1.73</td>
<td>3.75</td>
<td>5.73</td>
<td>5.03</td>
</tr>
<tr>
<td>12</td>
<td>1x2y6x</td>
<td>0.09</td>
<td>1.71</td>
<td>3.90</td>
<td>5.48</td>
<td>7.28</td>
</tr>
<tr>
<td>12</td>
<td>1x3y4z</td>
<td>0.09</td>
<td>1.69</td>
<td>3.79</td>
<td>5.68</td>
<td>7.30</td>
</tr>
<tr>
<td>12</td>
<td>2x2y3z</td>
<td>0.09</td>
<td>1.64</td>
<td>3.60</td>
<td>5.29</td>
<td>7.33</td>
</tr>
</tbody>
</table>

Table 4.3: Speed-up of the inner product of 3D vectors in the PC cluster.
4.7. ALGEBRAIC OPERATIONS WITH VECTORS AND MATRICES

Figure 4.22: Speed-up for the inner product of 3D vectors in the Cray T3E.

<table>
<thead>
<tr>
<th>np</th>
<th>partition</th>
<th>$20 \times 20 \times 20$</th>
<th>$40 \times 40 \times 40$</th>
<th>$60 \times 60 \times 60$</th>
<th>$80 \times 80 \times 80$</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>lxlylz</td>
<td>1.00</td>
<td>1.00</td>
<td>1.00</td>
<td>1.00</td>
</tr>
<tr>
<td>2</td>
<td>lxly2z</td>
<td>1.70</td>
<td>2.23</td>
<td>1.96</td>
<td>1.98</td>
</tr>
<tr>
<td>4</td>
<td>lxly4z</td>
<td>2.72</td>
<td>4.14</td>
<td>3.77</td>
<td>3.84</td>
</tr>
<tr>
<td>4</td>
<td>lxly2z</td>
<td>2.70</td>
<td>4.21</td>
<td>3.82</td>
<td>3.88</td>
</tr>
<tr>
<td>6</td>
<td>lxly6z</td>
<td>2.74</td>
<td>5.85</td>
<td>5.32</td>
<td>5.66</td>
</tr>
<tr>
<td>6</td>
<td>lxly3z</td>
<td>2.94</td>
<td>5.74</td>
<td>5.47</td>
<td>5.79</td>
</tr>
<tr>
<td>8</td>
<td>lxly8z</td>
<td>3.06</td>
<td>7.18</td>
<td>7.28</td>
<td>7.26</td>
</tr>
<tr>
<td>8</td>
<td>lxly4z</td>
<td>3.85</td>
<td>7.51</td>
<td>7.24</td>
<td>7.46</td>
</tr>
<tr>
<td>8</td>
<td>lxly2z</td>
<td>3.92</td>
<td>7.47</td>
<td>7.29</td>
<td>7.49</td>
</tr>
<tr>
<td>10</td>
<td>lxly10x</td>
<td>4.08</td>
<td>7.66</td>
<td>7.90</td>
<td>8.71</td>
</tr>
<tr>
<td>12</td>
<td>lxly5z</td>
<td>3.23</td>
<td>8.18</td>
<td>8.54</td>
<td>9.10</td>
</tr>
<tr>
<td>12</td>
<td>lxly12z</td>
<td>3.57</td>
<td>8.94</td>
<td>9.36</td>
<td>10.96</td>
</tr>
<tr>
<td>12</td>
<td>lxly6z</td>
<td>3.53</td>
<td>9.38</td>
<td>10.01</td>
<td>10.80</td>
</tr>
<tr>
<td>12</td>
<td>lxly4z</td>
<td>3.60</td>
<td>9.36</td>
<td>10.03</td>
<td>11.01</td>
</tr>
<tr>
<td>12</td>
<td>lxly8z</td>
<td>3.28</td>
<td>9.25</td>
<td>10.09</td>
<td>10.96</td>
</tr>
</tbody>
</table>

Table 4.4: Speed-up of the inner product of 2D vectors in the Cray T3E.
As stated above, the effect of the partitioning configuration has a slight influence on the inner product speed-up.

4.7.4 Matrix-vector product

This operation appears in almost all the solver algorithms. Due to the intensive computational work, it is even used as a work counter in solvers instead of the number of iterations which involve additional operations but with cheaper work. Moreover in a \( np \)-parallel machine, the matrix-vector product becomes less effective due to the extra work of communication among the the processor \( p \) and the neighbour processors \( ngh_p \). The information of neighbour processors \( ngh_p \) is previously "passed" and stored in the mentioned overlapping areas of processor \( p \) and then the operation is fully performed in processor \( p \) as

\[
y_p = A_p x_p + A_{ngh_p} x_{ngh_p}
\]

Indeed the size of the overlapping area plays an important role in the time spent in the communication processes. Furthermore, the type of formulation defines the sparsity of the matrix or in other words, the dependencies between the nodes stored in the processor \( p \) and those nodes stored in the neighbour processors \( ngh_p \). For a matrix-vector product in a 5.7.9 or 19-point formulation, it is only necessary an overlapping of one \( ov \geq 2 \) in the orthogonal directions of the domain.

Higher order schemes lead to formulations where the overlapping must be higher \( ov \geq 2 \), and thus, the amount of data passed among processes increases the time of communication. A generalization of the product of a 7-point formulation matrix with a 3D vector named mat – vect is written for a given overlapping \( ov \) in Alg. 35.

Algorithm 35 Mat.vect \((A_p, x_p, y_p)\)

update overlapped information of \( x_p \)
update \((ov, x_p)\)

evaluate the matrix-vector product

for \( i = (i1 - ov + 1 \text{ to } i = i2 + ov - 1) \)
for \( j = (j1 - ov + 1 \text{ to } j = j2 + ov - 1) \)
for \( k = (k1 - ov + 1 \text{ to } k = k2 + ov - 1) \)

\[
y_p(i, j, k) = A_p(i, j, k) x_p(i, j, k) + A_{ngh}(i, j, k) x_{ngh}(i, j, k) + A_p(i, j, k) x_p(i - 1, j, k) + A_{ngh}(i, j, k) x_{ngh}(i - 1, j, k) + A_p(i, j, k) x_p(i + 1, j, k) + A_{ngh}(i, j, k) x_{ngh}(i + 1, j, k) + A_p(i, j, k) x_p(i, j, k - 1) + A_{ngh}(i, j, k) x_{ngh}(i, j, k - 1) + A_p(i, j, k) x_p(i, j, k + 1) + A_{ngh}(i, j, k) x_{ngh}(i, j, k + 1)
\]
end for
end for
end for

At this point, the above algebraic operations enable us to build more complicate operations. For example, let us show how the residual needed in the stopping criteria for a 7-point formulation matrix is built in Alg. 36.
Algorithm 36 Residual($A_p, x_p, r_p, r_p$)

evaluate the operation \( r_p = A_p x_p \)

\text{mat. vect} (A_p, x_p, r_p)

evaluate the operation \( r_p = b_p - r_p \)

diff. vect (b_p, r_p, r_p)

4.7.5 Minimum matrix-vector product size per processor

The following test models the communication and computation timings of the matrix-vector product. It is designed to show the need of an overlapping of the communication and the serial local computation within the operation. The matrix-vector operation for both two dimensional \( I \times I \) and three dimensional \( I \times I \times I \) CFD problems has been taken (i.e. 5-point and 7-point formulations respectively). The measures of both times, the serial local computation of the matrix-vector product and the exchange of halos of size \( ov=1 \), are compared for a wide range of size problems.

The measures of the serial local computation timings are carried out in one processor for different size problems. For instance, the two dimensional problem version is described in Alg. 37.

Algorithm 37 Serial local computation of \( y = A x \)

\begin{verbatim}
for (sample = 1 to sample = 1000) 
  [sample] = J[sample] = random(i,1000)
  MPI.Barrier()
  t0=MPI.Wtime()
  for (j = 1 to j = J[sample])
    for (i = 1 to i = I[sample])
      y(i,j) = A(i,j)[:,i,j]
      + A(i,j)[:,i-1,j]
      + A(i,j)[:,i+1,j]
      + A(i,j)[:,i,j-1]
      + A(i,j)[:,i,j+1]
  end for
end for
  t1=MPI.Wtime()
  tcomp[sample]=t1-t0
end for
\end{verbatim}

Meanwhile, the exchange of halos is simulated in only two processors for the same range of size problems and on the assumption that the exchange of halos is in all directions. The measure of the times while the exchange of the halos of \( z \) for the two dimensional problem is described in Alg. 38.
Algorithm 38 Exchange of halos \( ov=1 \) of \( x \)

\[
\begin{align*}
\text{for } (\text{sample} = 1 \text{ to sample } = 1000) \\
\text{[sample]=random(1,1000)} \\
\text{MPI.Barrier()} \\
\text{exchange sides} \\
0=\text{MPI.Wtime()} \\
\text{exchange([sample]*ov)} \\
1=\text{MPI.Wtime()} \\
\text{tcomm}[\text{sample}]=t^*(1-t0)
\end{align*}
\]

end for

Here, in order to reduce as much as possible the time of communication, the exchange subroutine transfers the packed data from one processor to another in the non-blocking mode.

The test is run in both machines PC cluster and the Cray T3E. The time results versus the size problem are presented in Figs. 4.23, 4.24, 4.25 and 4.26.

If the communication process and the serial local computation process are performed consecutively (i.e. in non-overlapped fashion), the intersection points for each case define the minimum estimate sizes of the local problem which a processor could do with a parallel efficiency of 50%.

Let \( t_{\text{comp}(1)} \) and \( t_{\text{comp}(np)} + t_{\text{comm}(np)} \) be the sequential and parallel (for \( np \) processors) timings of the overall operation, the efficiency is approximately expressed as

\[
E(np) = \frac{t_{\text{comp}(1)}}{np(t_{\text{comp}(np)} + t_{\text{comm}(np)})} \approx \frac{np}{np(t_{\text{comp}(np)} + t_{\text{comm}(np)})}
\]

In the intersection point \( t_{\text{comm}(np)} = t_{\text{comp}(np)} \). Therefore

\[
E(np) = \frac{np}{np(t_{\text{comp}(np)} + t_{\text{comp}(np)})} = \frac{1}{2}
\]

These points are given in Table 4.5:

<table>
<thead>
<tr>
<th>Problem</th>
<th>PC cluster</th>
<th>Cray T3E</th>
</tr>
</thead>
<tbody>
<tr>
<td>Two dimensional</td>
<td>40,000 = (200 x 200)</td>
<td>2,500 = (50 x 50)</td>
</tr>
<tr>
<td>Three dimensional</td>
<td>64,000 = (40 x 40 x 40)</td>
<td>3,375 = (15 x 15 x 15)</td>
</tr>
</tbody>
</table>

Table 4.5: Minimum estimate size per processor for a parallel efficiency of 50% for the non-overlapped computation and communication.

Analogously to the previous section, the measure of the speed-up of the matrix-vector product for a given size problem is obtained by timing the algebraic operation at different number of processors. The test is executed within the range of 1 to 12 processors in both facilities PC cluster and Cray T3E. Indeed, partitioning in two or three directions affects
Figure 4.23: Timings of computation and communication of a two-dimensional matrix-vector product in PC cluster.

Figure 4.24: Timings of computation and communication of a two-dimensional matrix-vector product in the Cray T3E.
Figure 4.25: Timings of computation and communication of a three-dimensional matrix-vector product in PC cluster.

Figure 4.26: Timings of computation and communication of a three-dimensional matrix-vector product in the Cray T3E.
the efficiency due to (1) the different ratios of computation versus the exchange of data and (2) the cache effects for large amounts of data distributed in few processors.

These effects are analyzed for a set of cases (20 $\times$ 20 $\times$ 20), (40 $\times$ 40 $\times$ 40), (60 $\times$ 60 $\times$ 60), (80 $\times$ 80 $\times$ 80) and (100 $\times$ 100 $\times$ 100) (the last one only for PC cluster).

As mentioned above, the size of all of these cases is over the minimum estimated for the Cray T3E (15 $\times$ 15 $\times$ 15) so one may expect efficiencies higher than 50%. For PC cluster, these efficiencies are expected under the (40 $\times$ 40 $\times$ 40) size.

The experiment is repeated several times for each number of processors and for each partitioning configuration. The speed-up for each case is evaluated with the averaged timings. Full results (all partitioning configurations) are represented in Figs. 4.27, 4.28 and tables 4.6, 4.7 for the PC cluster and the Cray T3E respectively.

The successive experiments have been computed with a generalised algorithm 39 where the number of processors, partitioning configurations, and problem sizes are expressed in a set of nested loops.

Algorithm 39 Performance of operations

```plaintext
for (np = 1 to np = 12)
  for (partitions = p_x, p_y, p_z = 1 to 12, such that p_xp_yp_z = np)
    for (size = 20 to size = 100, size = size + 20)
      for (sample = 1 to sample = 20)
        MPI_Barrier()
        t0=MPI_Wtime()
        evaluate algebraic operation or solve a problem
        t1=MPI_Wtime()
        time=time + t1-t0
        t_comp[operation][size][partition][np] = time/20
      end for
    end for
  end for
end for
```

Where the algebraic operations are the matrix vector product and inner product, and the solvers are Jacobi, Gauss-Seidel, MSIP, Conjugate Gradient, BiCGSTAB and GMRESR.
Figure 4.27: Speed-up for the matrix-vector product in PC cluster.

<table>
<thead>
<tr>
<th>np</th>
<th>partition</th>
<th>$20 \times 20 \times 20$</th>
<th>$40 \times 30 \times 40$</th>
<th>$60 \times 60 \times 60$</th>
<th>$80 \times 80 \times 80$</th>
<th>$100 \times 100 \times 100$</th>
<th>Ideal Speed-up</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>1x1x1x</td>
<td>1.00</td>
<td>1.00</td>
<td>1.00</td>
<td>1.00</td>
<td>1.00</td>
<td>1.00</td>
</tr>
<tr>
<td>2</td>
<td>1x1x2x</td>
<td>1.04</td>
<td>1.04</td>
<td>1.10</td>
<td>0.84</td>
<td>1.82</td>
<td>2.07</td>
</tr>
<tr>
<td>4</td>
<td>1x1x4x</td>
<td>1.29</td>
<td>1.87</td>
<td>1.10</td>
<td>2.79</td>
<td>2.37</td>
<td>2.37</td>
</tr>
<tr>
<td>4</td>
<td>1x2x2x</td>
<td>0.93</td>
<td>1.18</td>
<td>3.08</td>
<td>3.09</td>
<td>3.09</td>
<td>3.50</td>
</tr>
<tr>
<td>6</td>
<td>1x1y6x</td>
<td>1.36</td>
<td>2.90</td>
<td>3.42</td>
<td>3.50</td>
<td>3.61</td>
<td>3.27</td>
</tr>
<tr>
<td>6</td>
<td>1x2y3x</td>
<td>0.71</td>
<td>2.09</td>
<td>3.37</td>
<td>3.22</td>
<td>4.67</td>
<td>4.57</td>
</tr>
<tr>
<td>8</td>
<td>1x1y8x</td>
<td>1.43</td>
<td>2.55</td>
<td>3.03</td>
<td>3.95</td>
<td>4.45</td>
<td>2.91</td>
</tr>
<tr>
<td>8</td>
<td>1x2y4x</td>
<td>0.71</td>
<td>2.33</td>
<td>3.91</td>
<td>3.81</td>
<td>2.91</td>
<td>2.91</td>
</tr>
<tr>
<td>8</td>
<td>2x2y2x</td>
<td>0.53</td>
<td>2.01</td>
<td>4.33</td>
<td>5.14</td>
<td>6.36</td>
<td>6.36</td>
</tr>
<tr>
<td>10</td>
<td>1x1y10x</td>
<td>1.52</td>
<td>2.63</td>
<td>4.26</td>
<td>5.74</td>
<td>4.10</td>
<td>4.10</td>
</tr>
<tr>
<td>10</td>
<td>1x2y5x</td>
<td>0.72</td>
<td>2.50</td>
<td>4.62</td>
<td>5.32</td>
<td>4.82</td>
<td>4.82</td>
</tr>
<tr>
<td>12</td>
<td>1x2y12x</td>
<td>1.51</td>
<td>2.67</td>
<td>4.59</td>
<td>4.80</td>
<td>3.31</td>
<td>3.31</td>
</tr>
<tr>
<td>12</td>
<td>1x2y6x</td>
<td>0.72</td>
<td>2.74</td>
<td>5.04</td>
<td>5.19</td>
<td>6.05</td>
<td>6.05</td>
</tr>
<tr>
<td>12</td>
<td>1x3y4x</td>
<td>0.52</td>
<td>2.31</td>
<td>4.53</td>
<td>6.37</td>
<td>7.33</td>
<td>7.33</td>
</tr>
<tr>
<td>12</td>
<td>2x2y3x</td>
<td>0.37</td>
<td>1.85</td>
<td>4.50</td>
<td>4.47</td>
<td>7.75</td>
<td>7.75</td>
</tr>
</tbody>
</table>

Table 4.6: Speed-up of matrix-vector product in PC cluster.
Figure 4.28: Speed-up for the matrix-vector product in the Cray T3E.

<table>
<thead>
<tr>
<th>np</th>
<th>partition</th>
<th>$20 \times 20 \times 20$</th>
<th>$40 \times 40 \times 40$</th>
<th>$60 \times 60 \times 60$</th>
<th>$80 \times 80 \times 80$</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>1x1y1z</td>
<td>1.00</td>
<td>1.00</td>
<td>1.00</td>
<td>1.00</td>
</tr>
<tr>
<td>2</td>
<td>1x1y2z</td>
<td>2.12</td>
<td>2.09</td>
<td>1.97</td>
<td>2.02</td>
</tr>
<tr>
<td>4</td>
<td>1x2y4z</td>
<td>3.29</td>
<td>3.84</td>
<td>3.74</td>
<td>3.92</td>
</tr>
<tr>
<td>4</td>
<td>1x2y2z</td>
<td>3.46</td>
<td>4.03</td>
<td>3.31</td>
<td>3.67</td>
</tr>
<tr>
<td>6</td>
<td>1x1y6z</td>
<td>5.20</td>
<td>5.98</td>
<td>5.30</td>
<td>5.85</td>
</tr>
<tr>
<td>6</td>
<td>1x2y4z</td>
<td>4.70</td>
<td>5.78</td>
<td>5.21</td>
<td>5.63</td>
</tr>
<tr>
<td>8</td>
<td>1x1y8z</td>
<td>7.29</td>
<td>6.42</td>
<td>7.21</td>
<td>7.38</td>
</tr>
<tr>
<td>8</td>
<td>1x2y4z</td>
<td>5.47</td>
<td>7.15</td>
<td>6.66</td>
<td>6.83</td>
</tr>
<tr>
<td>8</td>
<td>2x2y2z</td>
<td>3.96</td>
<td>6.44</td>
<td>6.79</td>
<td>7.44</td>
</tr>
<tr>
<td>10</td>
<td>1x1y10z</td>
<td>6.89</td>
<td>8.01</td>
<td>8.29</td>
<td>8.78</td>
</tr>
<tr>
<td>10</td>
<td>1x2y5z</td>
<td>6.04</td>
<td>8.96</td>
<td>8.12</td>
<td>9.31</td>
</tr>
<tr>
<td>12</td>
<td>1x2y12z</td>
<td>8.65</td>
<td>9.77</td>
<td>7.98</td>
<td>11.47</td>
</tr>
<tr>
<td>12</td>
<td>1x2y6z</td>
<td>7.82</td>
<td>10.59</td>
<td>8.93</td>
<td>11.18</td>
</tr>
<tr>
<td>12</td>
<td>1x3y6z</td>
<td>6.65</td>
<td>10.44</td>
<td>9.84</td>
<td>9.99</td>
</tr>
<tr>
<td>12</td>
<td>2x2y3z</td>
<td>4.56</td>
<td>9.18</td>
<td>9.06</td>
<td>10.01</td>
</tr>
</tbody>
</table>

Table 4.7: Speed-up of matrix by vector in the Cray T3E.
Seeing these figures it is stated that for a given size of the problem and number of processors, the partitioning configuration is definitively a critical factor on the speed-up. The reason is that while the time of computation remains constant independently of the partitioning configuration, the time of communication varies strongly. Then the ratio between the times of computation and communication varies with the partitioning configuration. This ratio in PC cluster is greater than the obtained in the Cray T3E because the time of communication of any packet of data is longer in PC cluster while the time of computation is quite similar in both facilities.

4.8 Parallel performance of solvers

The parallel implementation of these solvers has been tested and the performance, in terms of the speed-up, is measured for a three dimensional Laplace problem (e.g. the 3D heat conduction problem) with full Dirichlet boundary conditions. In this test, the stopping criterion of $\epsilon = 10^{-8}$ has been chosen and analogously to the parallel performance of the algebraic operations, the different partitioning configurations have been tested in different size of problems: $20 \times 20 \times 20$, $40 \times 40 \times 40$, $60 \times 60 \times 60$, $80 \times 80 \times 80$, $100 \times 100 \times 100$.

For each solver, size of problem number of processors and partitioning configurations, the test is repeated several times and the timings are averaged.

The test is executed at both computers PC cluster and the Cray T3E. The results are summarized by means of the speed-ups. Figures and tables for each solver are reported below.

- For Jacobi solver see Figs. 4.29, 4.30 and tables 4.8,4.9.
- For Gauss-Seidel solver see Figs. 4.31, 4.32 and tables 4.10,4.11.
- For MSIP solver with $\alpha = 0.5$ see Figs. 4.33, 4.34 and tables 4.12,4.13.
- For preconditioned BiCGSTAB solver see Figs. 4.35, 4.36 and tables 4.14,4.15.
- For preconditioned GMRESR(10) solver see Figs. 4.37,4.38 and tables 4.16,4.17.
4.8. Parallel performance of solvers

Figure 4.29: Speed-up for the Jacobi in PC cluster.

<table>
<thead>
<tr>
<th>up</th>
<th>partition</th>
<th>$20 \times 20 \times 20$</th>
<th>$40 \times 40 \times 40$</th>
<th>$60 \times 60 \times 60$</th>
<th>$80 \times 80 \times 80$</th>
<th>$100 \times 100 \times 100$</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>1x1y1z</td>
<td>1.00</td>
<td>1.00</td>
<td>1.00</td>
<td>1.00</td>
<td>1.00</td>
</tr>
<tr>
<td>2</td>
<td>1x1y2z</td>
<td>1.32</td>
<td>2.04</td>
<td>0.94</td>
<td>1.95</td>
<td>1.95</td>
</tr>
<tr>
<td>4</td>
<td>1x1y4z</td>
<td>1.74</td>
<td>1.39</td>
<td>3.31</td>
<td>3.47</td>
<td>3.47</td>
</tr>
<tr>
<td>4</td>
<td>1x2y2z</td>
<td>1.47</td>
<td>3.71</td>
<td>3.92</td>
<td>3.78</td>
<td>3.78</td>
</tr>
<tr>
<td>6</td>
<td>1x1y6z</td>
<td>1.77</td>
<td>4.30</td>
<td>4.28</td>
<td>4.37</td>
<td>4.37</td>
</tr>
<tr>
<td>6</td>
<td>1x2y3z</td>
<td>1.28</td>
<td>4.38</td>
<td>3.47</td>
<td>5.21</td>
<td>5.21</td>
</tr>
<tr>
<td>8</td>
<td>1x1y8z</td>
<td>1.85</td>
<td>3.63</td>
<td>4.68</td>
<td>5.28</td>
<td>5.28</td>
</tr>
<tr>
<td>8</td>
<td>1x2y4z</td>
<td>1.26</td>
<td>5.47</td>
<td>4.46</td>
<td>3.40</td>
<td>3.40</td>
</tr>
<tr>
<td>8</td>
<td>2x2y2z</td>
<td>1.05</td>
<td>5.77</td>
<td>6.50</td>
<td>7.39</td>
<td>7.39</td>
</tr>
<tr>
<td>10</td>
<td>1x1y10z</td>
<td>1.75</td>
<td>5.68</td>
<td>4.41</td>
<td>4.72</td>
<td>4.72</td>
</tr>
<tr>
<td>10</td>
<td>1x2y5z</td>
<td>1.24</td>
<td>6.94</td>
<td>6.79</td>
<td>5.40</td>
<td>5.40</td>
</tr>
<tr>
<td>12</td>
<td>1x2y12z</td>
<td>1.74</td>
<td>6.12</td>
<td>6.04</td>
<td>3.72</td>
<td>3.72</td>
</tr>
<tr>
<td>12</td>
<td>1x2y6z</td>
<td>1.23</td>
<td>8.01</td>
<td>6.11</td>
<td>8.46</td>
<td>8.46</td>
</tr>
<tr>
<td>12</td>
<td>1x3y4z</td>
<td>0.99</td>
<td>6.94</td>
<td>8.72</td>
<td>9.05</td>
<td>9.05</td>
</tr>
<tr>
<td>12</td>
<td>2x2y3z</td>
<td>0.77</td>
<td>0.89</td>
<td>5.00</td>
<td>10.01</td>
<td>10.01</td>
</tr>
</tbody>
</table>

Table 4.8: Speed-up of Jacobi solver in PC cluster.
Figure 4.30: Speed-up for the Jacobi in the Cray T3E.

<table>
<thead>
<tr>
<th>np</th>
<th>partition</th>
<th>$20 \times 20 \times 20$</th>
<th>$40 \times 40 \times 40$</th>
<th>$60 \times 60 \times 60$</th>
<th>$80 \times 80 \times 80$</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>1x1y1z</td>
<td>1.00</td>
<td>1.00</td>
<td>1.00</td>
<td>1.00</td>
</tr>
<tr>
<td>2</td>
<td>1x1y2z</td>
<td>1.86</td>
<td>1.95</td>
<td>1.96</td>
<td>1.95</td>
</tr>
<tr>
<td>4</td>
<td>1x1y4z</td>
<td>3.69</td>
<td>3.75</td>
<td>3.88</td>
<td>3.85</td>
</tr>
<tr>
<td>4</td>
<td>1x2y2z</td>
<td>3.78</td>
<td>4.00</td>
<td>3.97</td>
<td>3.89</td>
</tr>
<tr>
<td>6</td>
<td>1x1y6z</td>
<td>4.62</td>
<td>5.12</td>
<td>5.69</td>
<td>4.67</td>
</tr>
<tr>
<td>6</td>
<td>1x2y3z</td>
<td>4.83</td>
<td>5.43</td>
<td>5.82</td>
<td>5.61</td>
</tr>
<tr>
<td>8</td>
<td>1x1y8z</td>
<td>6.12</td>
<td>7.37</td>
<td>7.06</td>
<td>7.33</td>
</tr>
<tr>
<td>8</td>
<td>1x2y4z</td>
<td>5.62</td>
<td>7.01</td>
<td>6.87</td>
<td>7.54</td>
</tr>
<tr>
<td>8</td>
<td>2x2y2z</td>
<td>6.53</td>
<td>7.25</td>
<td>7.76</td>
<td>7.94</td>
</tr>
<tr>
<td>10</td>
<td>1x1y10z</td>
<td>8.05</td>
<td>8.47</td>
<td>9.13</td>
<td>9.35</td>
</tr>
<tr>
<td>10</td>
<td>1x2y5z</td>
<td>6.93</td>
<td>8.47</td>
<td>9.44</td>
<td>9.12</td>
</tr>
<tr>
<td>12</td>
<td>1x2y12z</td>
<td>8.18</td>
<td>8.51</td>
<td>10.59</td>
<td>10.59</td>
</tr>
<tr>
<td>12</td>
<td>1x2y6z</td>
<td>7.13</td>
<td>9.41</td>
<td>10.48</td>
<td>10.43</td>
</tr>
<tr>
<td>12</td>
<td>1x3y6z</td>
<td>6.62</td>
<td>10.14</td>
<td>11.35</td>
<td>10.71</td>
</tr>
<tr>
<td>12</td>
<td>2x2y3z</td>
<td>5.48</td>
<td>9.52</td>
<td>10.40</td>
<td>10.74</td>
</tr>
</tbody>
</table>

Table 4.9: Speed-up of Jacobi solver at Cray T3E.
Figure 4.31: Speed-up for the Gauss-Seidel in PC cluster.

<table>
<thead>
<tr>
<th>np</th>
<th>partition</th>
<th>20 × 20 × 20</th>
<th>40 × 40 × 40</th>
<th>60 × 60 × 60</th>
<th>80 × 80 × 80</th>
<th>100 × 100 × 100</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>1x1y1z</td>
<td>1.00</td>
<td>1.00</td>
<td>1.00</td>
<td>1.00</td>
<td>1.00</td>
</tr>
<tr>
<td>2</td>
<td>1x1y2z</td>
<td>1.21</td>
<td>1.07</td>
<td>2.03</td>
<td>0.92</td>
<td>1.91</td>
</tr>
<tr>
<td>4</td>
<td>1x1y4z</td>
<td>1.49</td>
<td>2.22</td>
<td>1.27</td>
<td>3.14</td>
<td>3.35</td>
</tr>
<tr>
<td>4</td>
<td>1x2y2z</td>
<td>1.19</td>
<td>1.88</td>
<td>3.54</td>
<td>3.88</td>
<td>3.77</td>
</tr>
<tr>
<td>6</td>
<td>1x1y6z</td>
<td>1.42</td>
<td>2.91</td>
<td>4.06</td>
<td>4.12</td>
<td>4.33</td>
</tr>
<tr>
<td>6</td>
<td>1x2y3z</td>
<td>1.06</td>
<td>2.87</td>
<td>4.10</td>
<td>3.48</td>
<td>5.13</td>
</tr>
<tr>
<td>8</td>
<td>1x1y8z</td>
<td>1.47</td>
<td>3.33</td>
<td>3.50</td>
<td>4.67</td>
<td>5.03</td>
</tr>
<tr>
<td>8</td>
<td>1x2y4z</td>
<td>1.02</td>
<td>2.22</td>
<td>5.18</td>
<td>4.34</td>
<td>3.32</td>
</tr>
<tr>
<td>8</td>
<td>2x2y2z</td>
<td>0.84</td>
<td>2.86</td>
<td>5.20</td>
<td>6.27</td>
<td>7.54</td>
</tr>
<tr>
<td>10</td>
<td>1x1y10z</td>
<td>1.33</td>
<td>3.40</td>
<td>4.98</td>
<td>4.17</td>
<td>4.60</td>
</tr>
<tr>
<td>10</td>
<td>1x2y5z</td>
<td>0.98</td>
<td>3.40</td>
<td>6.15</td>
<td>6.87</td>
<td>5.51</td>
</tr>
<tr>
<td>12</td>
<td>1x2y12z</td>
<td>1.33</td>
<td>3.15</td>
<td>5.16</td>
<td>5.53</td>
<td>3.52</td>
</tr>
<tr>
<td>12</td>
<td>1x2y6z</td>
<td>0.95</td>
<td>3.73</td>
<td>7.06</td>
<td>5.91</td>
<td>8.13</td>
</tr>
<tr>
<td>12</td>
<td>1x3y4z</td>
<td>0.78</td>
<td>3.28</td>
<td>6.22</td>
<td>8.28</td>
<td>8.86</td>
</tr>
<tr>
<td>12</td>
<td>2x2y3z</td>
<td>0.62</td>
<td>2.85</td>
<td>6.13</td>
<td>5.77</td>
<td>9.72</td>
</tr>
</tbody>
</table>

Table 4.10: Speed-up of Gauss-Seidel solver in PC cluster.
Figure 4.32: Speed-up for the Gauss-Seidel in the Cray T3E.

<table>
<thead>
<tr>
<th>np</th>
<th>partition</th>
<th>$20 \times 20 \times 20$</th>
<th>$40 \times 40 \times 40$</th>
<th>$60 \times 60 \times 60$</th>
<th>$80 \times 80 \times 80$</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>1x1x1z</td>
<td>1.00</td>
<td>1.00</td>
<td>1.00</td>
<td>1.00</td>
</tr>
<tr>
<td>2</td>
<td>1x1y2z</td>
<td>1.84</td>
<td>1.86</td>
<td>1.93</td>
<td>1.98</td>
</tr>
<tr>
<td>4</td>
<td>1x1y4z</td>
<td>3.43</td>
<td>3.67</td>
<td>3.80</td>
<td>3.94</td>
</tr>
<tr>
<td>4</td>
<td>1x2y2z</td>
<td>3.59</td>
<td>3.97</td>
<td>3.73</td>
<td>3.93</td>
</tr>
<tr>
<td>6</td>
<td>1x1y6z</td>
<td>4.20</td>
<td>4.83</td>
<td>5.50</td>
<td>4.97</td>
</tr>
<tr>
<td>6</td>
<td>1x2y3z</td>
<td>4.42</td>
<td>5.53</td>
<td>5.55</td>
<td>5.78</td>
</tr>
<tr>
<td>8</td>
<td>1x1y8z</td>
<td>5.56</td>
<td>6.87</td>
<td>6.75</td>
<td>7.65</td>
</tr>
<tr>
<td>8</td>
<td>1x2y4z</td>
<td>5.96</td>
<td>7.01</td>
<td>7.42</td>
<td>7.61</td>
</tr>
<tr>
<td>8</td>
<td>2x2y2z</td>
<td>4.97</td>
<td>6.75</td>
<td>6.47</td>
<td>7.73</td>
</tr>
<tr>
<td>10</td>
<td>1x1y10z</td>
<td>6.97</td>
<td>7.86</td>
<td>8.67</td>
<td>9.40</td>
</tr>
<tr>
<td>10</td>
<td>1x2y5z</td>
<td>6.30</td>
<td>7.97</td>
<td>8.89</td>
<td>9.31</td>
</tr>
<tr>
<td>12</td>
<td>1x2y12z</td>
<td>7.60</td>
<td>7.77</td>
<td>9.91</td>
<td>10.32</td>
</tr>
<tr>
<td>12</td>
<td>1x2y6z</td>
<td>6.32</td>
<td>9.12</td>
<td>9.84</td>
<td>10.57</td>
</tr>
<tr>
<td>12</td>
<td>1x3y4z</td>
<td>5.70</td>
<td>9.68</td>
<td>10.62</td>
<td>10.98</td>
</tr>
<tr>
<td>12</td>
<td>2x2y3z</td>
<td>4.86</td>
<td>8.80</td>
<td>9.80</td>
<td>11.03</td>
</tr>
</tbody>
</table>

Table 4.11: Speed-up of Gauss-Seidel solver at Cray T3E
Figure 4.33: Speed-up for the MSIP in PC cluster.

![Graph showing speed-up for MSIP in PC cluster.](image)

<table>
<thead>
<tr>
<th>np</th>
<th>partition</th>
<th>20 x 20 x 20</th>
<th>40 x 40 x 40</th>
<th>60 x 60 x 60</th>
<th>80 x 80 x 80</th>
<th>100 x 100 x 100</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>1x1y1z</td>
<td>1.00</td>
<td>1.00</td>
<td>1.00</td>
<td>1.00</td>
<td>1.00</td>
</tr>
<tr>
<td>2</td>
<td>1x1y2z</td>
<td>1.10</td>
<td>1.01</td>
<td>1.96</td>
<td>0.85</td>
<td>1.89</td>
</tr>
<tr>
<td>4</td>
<td>1x1y4z</td>
<td>1.19</td>
<td>2.05</td>
<td>2.52</td>
<td>2.93</td>
<td>3.22</td>
</tr>
<tr>
<td>4</td>
<td>1x2y2z</td>
<td>0.85</td>
<td>1.15</td>
<td>3.18</td>
<td>3.46</td>
<td>3.59</td>
</tr>
<tr>
<td>6</td>
<td>1x1y6z</td>
<td>1.02</td>
<td>2.56</td>
<td>3.78</td>
<td>3.84</td>
<td>4.07</td>
</tr>
<tr>
<td>6</td>
<td>1x2y3z</td>
<td>0.77</td>
<td>2.42</td>
<td>3.86</td>
<td>3.14</td>
<td>4.89</td>
</tr>
<tr>
<td>8</td>
<td>1x1y8z</td>
<td>1.03</td>
<td>2.79</td>
<td>3.04</td>
<td>3.99</td>
<td>4.78</td>
</tr>
<tr>
<td>8</td>
<td>1x2y4z</td>
<td>0.77</td>
<td>2.77</td>
<td>4.49</td>
<td>3.91</td>
<td>2.92</td>
</tr>
<tr>
<td>8</td>
<td>2x2y2z</td>
<td>0.69</td>
<td>2.41</td>
<td>4.69</td>
<td>5.43</td>
<td>6.47</td>
</tr>
<tr>
<td>10</td>
<td>1x1y10z</td>
<td>0.95</td>
<td>2.78</td>
<td>4.50</td>
<td>3.52</td>
<td>3.92</td>
</tr>
<tr>
<td>10</td>
<td>1x2y5z</td>
<td>0.73</td>
<td>2.85</td>
<td>5.89</td>
<td>5.85</td>
<td>5.97</td>
</tr>
<tr>
<td>12</td>
<td>1x2y12z</td>
<td>0.91</td>
<td>2.57</td>
<td>4.56</td>
<td>4.97</td>
<td>3.16</td>
</tr>
<tr>
<td>12</td>
<td>1x5y6z</td>
<td>0.69</td>
<td>3.15</td>
<td>6.13</td>
<td>5.14</td>
<td>7.60</td>
</tr>
<tr>
<td>12</td>
<td>1x3y4z</td>
<td>0.61</td>
<td>2.81</td>
<td>5.52</td>
<td>7.10</td>
<td>8.69</td>
</tr>
<tr>
<td>12</td>
<td>2x2y6z</td>
<td>0.45</td>
<td>2.42</td>
<td>5.56</td>
<td>4.78</td>
<td>8.62</td>
</tr>
</tbody>
</table>

Table 4.12: Speed-up of MSIP solver with $\alpha = 0.5$ in PC cluster.
Figure 4.34: Speed-up for the MSIP in the Cray T3E.

<table>
<thead>
<tr>
<th>np</th>
<th>partition</th>
<th>$20 \times 20 \times 20$</th>
<th>$40 \times 40 \times 40$</th>
<th>$60 \times 60 \times 60$</th>
<th>$80 \times 80 \times 80$</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>lxl1x</td>
<td>1.00</td>
<td>1.00</td>
<td>1.00</td>
<td>1.00</td>
</tr>
<tr>
<td>2</td>
<td>lxl2x</td>
<td>1.53</td>
<td>1.74</td>
<td>1.97</td>
<td>1.98</td>
</tr>
<tr>
<td>4</td>
<td>lxl4x</td>
<td>2.44</td>
<td>3.10</td>
<td>3.63</td>
<td>3.84</td>
</tr>
<tr>
<td>4</td>
<td>lxl2x</td>
<td>2.22</td>
<td>3.09</td>
<td>3.61</td>
<td>3.73</td>
</tr>
<tr>
<td>6</td>
<td>lxl6x</td>
<td>2.53</td>
<td>4.04</td>
<td>5.92</td>
<td>5.55</td>
</tr>
<tr>
<td>6</td>
<td>lxl3x</td>
<td>2.64</td>
<td>4.11</td>
<td>5.14</td>
<td>5.28</td>
</tr>
<tr>
<td>8</td>
<td>lxl8x</td>
<td>2.90</td>
<td>5.19</td>
<td>5.86</td>
<td>6.72</td>
</tr>
<tr>
<td>8</td>
<td>lxl4z</td>
<td>1.08</td>
<td>5.31</td>
<td>5.58</td>
<td>6.88</td>
</tr>
<tr>
<td>8</td>
<td>lxl2z</td>
<td>2.63</td>
<td>5.05</td>
<td>5.97</td>
<td>7.01</td>
</tr>
<tr>
<td>10</td>
<td>lxl10x</td>
<td>3.44</td>
<td>5.75</td>
<td>7.33</td>
<td>8.03</td>
</tr>
<tr>
<td>10</td>
<td>lxl5z</td>
<td>3.31</td>
<td>5.60</td>
<td>7.84</td>
<td>8.32</td>
</tr>
<tr>
<td>12</td>
<td>lxl12z</td>
<td>3.86</td>
<td>5.33</td>
<td>8.01</td>
<td>8.84</td>
</tr>
<tr>
<td>12</td>
<td>lxl6z</td>
<td>2.18</td>
<td>6.88</td>
<td>8.89</td>
<td>9.19</td>
</tr>
<tr>
<td>12</td>
<td>lxl4z</td>
<td>3.11</td>
<td>7.13</td>
<td>9.30</td>
<td>10.05</td>
</tr>
<tr>
<td>12</td>
<td>lxl3z</td>
<td>2.69</td>
<td>6.70</td>
<td>8.87</td>
<td>9.79</td>
</tr>
</tbody>
</table>

Table 4.13: Speed-up of MSIP solver with $\alpha = 0.5$ in the Cray T3E.
Figure 4.35: Speed-up for the BiCGSTAB preconditioned with MSIP in PC cluster.

<table>
<thead>
<tr>
<th>np</th>
<th>partition</th>
<th>$20 \times 20 \times 20$</th>
<th>$40 \times 40 \times 40$</th>
<th>$60 \times 60 \times 60$</th>
<th>$80 \times 80 \times 80$</th>
<th>$100 \times 100 \times 100$</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>1x1y1z</td>
<td>1.00</td>
<td>1.06</td>
<td>1.00</td>
<td>1.00</td>
<td>1.00</td>
</tr>
<tr>
<td>2</td>
<td>1x1y2z</td>
<td>1.40</td>
<td>1.23</td>
<td>1.96</td>
<td>1.02</td>
<td>1.97</td>
</tr>
<tr>
<td>4</td>
<td>1x1y4z</td>
<td>1.30</td>
<td>2.58</td>
<td>1.78</td>
<td>3.50</td>
<td>2.91</td>
</tr>
<tr>
<td>4</td>
<td>1x2y2z</td>
<td>1.02</td>
<td>1.72</td>
<td>3.83</td>
<td>4.17</td>
<td>3.83</td>
</tr>
<tr>
<td>6</td>
<td>1x1y6z</td>
<td>1.48</td>
<td>3.59</td>
<td>4.48</td>
<td>4.79</td>
<td>4.52</td>
</tr>
<tr>
<td>6</td>
<td>1x2y3z</td>
<td>0.93</td>
<td>3.30</td>
<td>4.42</td>
<td>4.38</td>
<td>4.88</td>
</tr>
<tr>
<td>8</td>
<td>1x1y8z</td>
<td>1.41</td>
<td>4.22</td>
<td>4.02</td>
<td>5.01</td>
<td>5.48</td>
</tr>
<tr>
<td>8</td>
<td>1x2y4z</td>
<td>0.97</td>
<td>3.80</td>
<td>5.83</td>
<td>5.23</td>
<td>3.04</td>
</tr>
<tr>
<td>8</td>
<td>2x2y2z</td>
<td>0.82</td>
<td>3.50</td>
<td>5.78</td>
<td>7.15</td>
<td>6.66</td>
</tr>
<tr>
<td>10</td>
<td>1x1y10z</td>
<td>1.30</td>
<td>3.78</td>
<td>5.40</td>
<td>5.86</td>
<td>4.48</td>
</tr>
<tr>
<td>10</td>
<td>1x2y5z</td>
<td>1.01</td>
<td>4.68</td>
<td><strong>6.97</strong></td>
<td><strong>7.56</strong></td>
<td>5.24</td>
</tr>
<tr>
<td>12</td>
<td>1x2y12z</td>
<td>1.20</td>
<td>3.91</td>
<td>5.16</td>
<td>7.07</td>
<td>4.10</td>
</tr>
<tr>
<td>12</td>
<td>1x2y9z</td>
<td>0.84</td>
<td>4.71</td>
<td>7.40</td>
<td>8.33</td>
<td>7.07</td>
</tr>
<tr>
<td>12</td>
<td>1x3y4z</td>
<td>0.85</td>
<td>4.47</td>
<td>7.24</td>
<td>9.07</td>
<td>7.56</td>
</tr>
<tr>
<td>12</td>
<td>2x2y3z</td>
<td>0.63</td>
<td>3.52</td>
<td>6.79</td>
<td>7.57</td>
<td>8.82</td>
</tr>
</tbody>
</table>

Table 4.14: Speed-up of BiCGSTAB solver preconditioned with MSIP in PC cluster.
Figure 4.36: Speed-up for the BiCGSTAB preconditioned with MSIP in the Cray T3E.

Table 4.15: Speed-up of BiCGSTAB solver preconditioned with MSIP in the Cray T3E.

<table>
<thead>
<tr>
<th>np</th>
<th>partition</th>
<th>20 × 20 × 20</th>
<th>40 × 40 × 40</th>
<th>60 × 60 × 60</th>
<th>80 × 80 × 80</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>1x1x1z</td>
<td>1.00</td>
<td>1.00</td>
<td>1.00</td>
<td>1.00</td>
</tr>
<tr>
<td>2</td>
<td>1x1y2z</td>
<td>1.97</td>
<td>1.71</td>
<td>1.60</td>
<td>1.69</td>
</tr>
<tr>
<td>4</td>
<td>1x1y4z</td>
<td>3.23</td>
<td>3.50</td>
<td>3.95</td>
<td>3.61</td>
</tr>
<tr>
<td>4</td>
<td>1x2y2z</td>
<td>2.99</td>
<td>3.91</td>
<td>3.43</td>
<td>3.51</td>
</tr>
<tr>
<td>6</td>
<td>1x1y6z</td>
<td>4.13</td>
<td>5.15</td>
<td>4.78</td>
<td>4.36</td>
</tr>
<tr>
<td>6</td>
<td>1x2y3z</td>
<td>3.64</td>
<td>5.07</td>
<td>5.00</td>
<td>5.40</td>
</tr>
<tr>
<td>8</td>
<td>1x1y8z</td>
<td>4.87</td>
<td>5.06</td>
<td>5.77</td>
<td>6.72</td>
</tr>
<tr>
<td>8</td>
<td>1x2y4z</td>
<td>4.71</td>
<td>8.83</td>
<td>7.25</td>
<td>7.65</td>
</tr>
<tr>
<td>8</td>
<td>2x2y2z</td>
<td>4.16</td>
<td>6.59</td>
<td>6.11</td>
<td>7.26</td>
</tr>
<tr>
<td>10</td>
<td>1x1y10z</td>
<td>6.14</td>
<td>7.41</td>
<td>7.30</td>
<td>9.17</td>
</tr>
<tr>
<td>10</td>
<td>1x2y5z</td>
<td>5.79</td>
<td>7.86</td>
<td>8.28</td>
<td>8.47</td>
</tr>
<tr>
<td>12</td>
<td>1x2y12z</td>
<td>5.54</td>
<td>7.76</td>
<td>9.50</td>
<td>9.12</td>
</tr>
<tr>
<td>12</td>
<td>1x2y6z</td>
<td>4.53</td>
<td>9.66</td>
<td>9.33</td>
<td>11.12</td>
</tr>
<tr>
<td>12</td>
<td>1x3y4z</td>
<td>5.62</td>
<td>10.45</td>
<td>10.50</td>
<td>10.13</td>
</tr>
<tr>
<td>17</td>
<td>2x2y3z</td>
<td>4.24</td>
<td>8.67</td>
<td>9.04</td>
<td>10.64</td>
</tr>
</tbody>
</table>
Figure 4.37: Speed-up for the GMRESR preconditioned with MSIP in PC cluster.

<table>
<thead>
<tr>
<th>up</th>
<th>partition</th>
<th>20 × 20 × 20</th>
<th>40 × 40 × 40</th>
<th>60 × 60 × 60</th>
<th>80 × 80 × 80</th>
<th>100 × 100 × 100</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>1x1y1z</td>
<td>1.00</td>
<td>1.00</td>
<td>1.00</td>
<td>1.00</td>
<td>1.00</td>
</tr>
<tr>
<td>2</td>
<td>1x1y2z</td>
<td>1.15</td>
<td>1.27</td>
<td>1.90</td>
<td>1.05</td>
<td>1.73</td>
</tr>
<tr>
<td>4</td>
<td>1x1y4z</td>
<td>1.07</td>
<td>2.07</td>
<td>1.59</td>
<td>3.08</td>
<td>3.22</td>
</tr>
<tr>
<td>4</td>
<td>1x2y2z</td>
<td>0.94</td>
<td>1.56</td>
<td>3.06</td>
<td>3.35</td>
<td>3.43</td>
</tr>
<tr>
<td>6</td>
<td>1x1y6z</td>
<td>1.09</td>
<td>3.07</td>
<td>4.27</td>
<td>3.99</td>
<td>4.04</td>
</tr>
<tr>
<td>6</td>
<td>1x2y3z</td>
<td>0.73</td>
<td>2.67</td>
<td>3.95</td>
<td>3.38</td>
<td>4.54</td>
</tr>
<tr>
<td>8</td>
<td>1x1y8z</td>
<td>0.88</td>
<td>3.18</td>
<td>3.81</td>
<td>4.78</td>
<td>4.82</td>
</tr>
<tr>
<td>8</td>
<td>1x2y4z</td>
<td>0.70</td>
<td>3.94</td>
<td>4.56</td>
<td>4.74</td>
<td>3.49</td>
</tr>
<tr>
<td>8</td>
<td>2x2y2z</td>
<td>0.62</td>
<td>2.81</td>
<td>4.63</td>
<td>7.14</td>
<td>5.35</td>
</tr>
<tr>
<td>10</td>
<td>1x1y10z</td>
<td>0.80</td>
<td>3.17</td>
<td>4.84</td>
<td>4.26</td>
<td>5.07</td>
</tr>
<tr>
<td>10</td>
<td>1x2y5z</td>
<td>0.67</td>
<td>3.16</td>
<td>5.64</td>
<td>6.72</td>
<td>5.01</td>
</tr>
<tr>
<td>12</td>
<td>1x2y12z</td>
<td>0.74</td>
<td>3.05</td>
<td>5.13</td>
<td>5.24</td>
<td>3.58</td>
</tr>
<tr>
<td>12</td>
<td>1x2y6z</td>
<td>0.66</td>
<td>3.37</td>
<td>6.72</td>
<td>6.26</td>
<td>7.55</td>
</tr>
<tr>
<td>12</td>
<td>1x3y4z</td>
<td>0.58</td>
<td>3.10</td>
<td>5.87</td>
<td>9.70</td>
<td>7.76</td>
</tr>
<tr>
<td>12</td>
<td>2x2y8z</td>
<td>0.49</td>
<td>2.77</td>
<td>5.34</td>
<td>5.04</td>
<td>3.62</td>
</tr>
</tbody>
</table>

Table 4.16: Speed-up of GMRESR solver preconditioned with MSIP in PC cluster.
Figure 4.38: Speed-up for the GMRESR preconditioned with MSIP in the Cray T3E.

<table>
<thead>
<tr>
<th>np</th>
<th>partition</th>
<th>20 x 20 x 20</th>
<th>40 x 40 x 40</th>
<th>60 x 60 x 60</th>
<th>80 x 80 x 80</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>1x1y1z</td>
<td>1.00</td>
<td>1.00</td>
<td>1.00</td>
<td>1.00</td>
</tr>
<tr>
<td>2</td>
<td>1x1y2z</td>
<td>1.99</td>
<td>2.06</td>
<td>1.98</td>
<td>1.56</td>
</tr>
<tr>
<td>4</td>
<td>1x1y4z</td>
<td>3.63</td>
<td>3.59</td>
<td>3.90</td>
<td>3.03</td>
</tr>
<tr>
<td>4</td>
<td>1x2y2z</td>
<td>3.67</td>
<td>3.78</td>
<td>3.52</td>
<td>3.87</td>
</tr>
<tr>
<td>6</td>
<td>1x1y6z</td>
<td>4.11</td>
<td>4.91</td>
<td>5.71</td>
<td>3.57</td>
</tr>
<tr>
<td>6</td>
<td>1x2y3z</td>
<td>5.59</td>
<td>4.60</td>
<td>5.02</td>
<td>3.98</td>
</tr>
<tr>
<td>8</td>
<td>1x1y8z</td>
<td>4.34</td>
<td>6.05</td>
<td>6.66</td>
<td>5.65</td>
</tr>
<tr>
<td>8</td>
<td>1x2y4z</td>
<td>4.67</td>
<td>6.11</td>
<td>6.93</td>
<td>6.44</td>
</tr>
<tr>
<td>8</td>
<td>2x2y2z</td>
<td>4.14</td>
<td>5.86</td>
<td>6.20</td>
<td>7.32</td>
</tr>
<tr>
<td>10</td>
<td>1x1y10z</td>
<td>4.03</td>
<td>7.00</td>
<td>8.30</td>
<td>7.41</td>
</tr>
<tr>
<td>10</td>
<td>1x2y5z</td>
<td>5.19</td>
<td>6.96</td>
<td>8.20</td>
<td>6.28</td>
</tr>
<tr>
<td>12</td>
<td>1x2y12z</td>
<td>5.17</td>
<td>7.09</td>
<td>9.00</td>
<td>7.20</td>
</tr>
<tr>
<td>12</td>
<td>1x2y6z</td>
<td>4.91</td>
<td>8.06</td>
<td>10.25</td>
<td>7.99</td>
</tr>
<tr>
<td>12</td>
<td>1x3y4z</td>
<td>4.74</td>
<td>8.22</td>
<td>9.94</td>
<td>10.60</td>
</tr>
<tr>
<td>12</td>
<td>2x2y3z</td>
<td>4.06</td>
<td>7.74</td>
<td>8.46</td>
<td>7.04</td>
</tr>
</tbody>
</table>

Table 4.17: Speed-up of GMRESR solver preconditioned with MSIP in the Cray T3E.
It is shown in these figures that both facilities PC cluster and the Cray T3E have reported similar behaviours in spite of the differences on their respective communication networks, being the speed-ups in the PC cluster as good as those of the Cray T3E.

An unexpected result observed in these figures is that matrix-vector product’s speed-ups are worse than Jacobi or Gauss-Seidel’s speed-ups for a given problem size. The reason may be due to the better computation with communication ratio in solvers. I.e. the computational load in solvers embeds other operations apart of the matrix-vector product (e.g. the difference of vectors for the computation of the residual and its norm) while the additional communication load, (e.g. communication in the inner product) doesn’t increase as the computational load.

Apart from the Jacobi and Gauss-Seidel solvers, there is a similar degradation of MSIP solver and MSIP preconditioner based solvers, i.e. BCGSTAB and GMRESR. The degradation increases for the one partitioning direction while the multiple partitioning direction reduce this degradation. This seems in contradiction with a bigger number of communication messages for a 3D partitioned domain respect to the 1D partitioned domain. The reason is that the global ILU factorization, e.g. MSIP, is better represented by local ILU factorizations in multiple directions.

This fact is also observed in the increment of the number of iterations when the the number of processes in one direction increases. These numbers have been reported in the table 4.18 for the case of $100 \times 100 \times 100$, i.e. a linear system of $10^6$ unknowns.

<table>
<thead>
<tr>
<th>np</th>
<th>partition</th>
<th>Jacobi</th>
<th>Gauss</th>
<th>MSIP</th>
<th>bCGSTAB+prec</th>
<th>GMRESR(10)+prec</th>
</tr>
</thead>
<tbody>
<tr>
<td>7</td>
<td>1x1y1z</td>
<td>6040</td>
<td>3241</td>
<td>634</td>
<td>35</td>
<td>32</td>
</tr>
<tr>
<td>2</td>
<td>1x2y1z</td>
<td>6040</td>
<td>3041</td>
<td>666</td>
<td>34</td>
<td>36</td>
</tr>
<tr>
<td>4</td>
<td>1x1y2z</td>
<td>6040</td>
<td>3063</td>
<td>710</td>
<td>41</td>
<td>35</td>
</tr>
<tr>
<td>4</td>
<td>1x2y2z</td>
<td>6040</td>
<td>3060</td>
<td>712</td>
<td>35</td>
<td>36</td>
</tr>
<tr>
<td>6</td>
<td>1x1y6z</td>
<td>6040</td>
<td>3084</td>
<td>731</td>
<td>33</td>
<td>37</td>
</tr>
<tr>
<td>6</td>
<td>1x2y3z</td>
<td>6040</td>
<td>3072</td>
<td>730</td>
<td>38</td>
<td>38</td>
</tr>
<tr>
<td>8</td>
<td>1x1y8z</td>
<td>6040</td>
<td>3106</td>
<td>763</td>
<td>34</td>
<td>37</td>
</tr>
<tr>
<td>8</td>
<td>1x2y4z</td>
<td>6040</td>
<td>3083</td>
<td>745</td>
<td>41</td>
<td>35</td>
</tr>
<tr>
<td>8</td>
<td>2x2y2z</td>
<td>6040</td>
<td>3080</td>
<td>755</td>
<td>38</td>
<td>40</td>
</tr>
<tr>
<td>10</td>
<td>1x1y10z</td>
<td>6040</td>
<td>3127</td>
<td>794</td>
<td>41</td>
<td>34</td>
</tr>
<tr>
<td>10</td>
<td>1x2y7z</td>
<td>6040</td>
<td>3093</td>
<td>761</td>
<td>41</td>
<td>40</td>
</tr>
<tr>
<td>12</td>
<td>1x2y12z</td>
<td>6040</td>
<td>3149</td>
<td>826</td>
<td>35</td>
<td>45</td>
</tr>
<tr>
<td>12</td>
<td>1x2y6z</td>
<td>6040</td>
<td>3103</td>
<td>776</td>
<td>41</td>
<td>38</td>
</tr>
<tr>
<td>12</td>
<td>1x3y4z</td>
<td>6040</td>
<td>3094</td>
<td>770</td>
<td>41</td>
<td>38</td>
</tr>
<tr>
<td>12</td>
<td>2x2y2z</td>
<td>6040</td>
<td>3092</td>
<td>772</td>
<td>39</td>
<td>40</td>
</tr>
</tbody>
</table>

Table 4.18: Iterations for the solution of a linear system with 1000000 unknowns.

It is worth noting that the Jacobi number of iterations remains constant and in the case of the BCGSTAB it also decreases for some partitioning configurations. In order to represent these numbers, the increment of the number of iterations respect to the sequential computation have been calculated and given in terms of the percentage in figure 4.39.
Figure 4.39: Percentage of increment of iterations for the solution of a linear system with $10^6$ unknowns when increasing number of processors.
4.9 Nomenclature

Greek symbols
\[ \alpha \] network latency
\[ \beta \] network bandwidth
\[ \epsilon \] precision
\[ \rho \] scalar value

Other symbols
\( 7 - PF \) seven point formulation
\( \langle, \rangle \) inner product
\( \| \|_2 \) 2-norm of a vector
\( \theta \) general algebraic operation

Superscripts
\( (k) \) \( k \)-th iteration

\( A \) discretisation matrix
\( a \) coeff in \( A \)
\( b \) right hand side
\( bl \) block length
\( d \) block of data
\( E \) efficiency
\( I \) unknowns in \( x \)-direction
\( i \) index for \( x \)-direction
\( J \) unknowns in \( y \)-direction
\( j \) similar to \( i \)
\( K \) unknowns in \( x \)-direction
\( k \) similar to \( i \)
\( N \) number of unknowns
\( n \) number of
\( np \) number of processors
\( ngh \) neighbour processors
\( c \) operations of
\( ov \) overlapping area
\( p \) processor identification number
\( Re \) Reynolds number
\( r \) residual
\( S \) speed up
\( str \) stride
\( t \) measure of time
\( x \) unknown
\( y \) auxiliary vector, buffer vector
\( z \) auxiliary vector
\( (x, y, z) \) cartesian coordinates