Scaling Irregular Array-type Reductions in OmpSs

Jan Ciesko, Sergi Mateo, Xavier Teruel, Vicenç Beltran, Xavier Martorell, Rosa M. Badia and Jesús Labarta

1 Barcelona Supercomputing Center
2 Universitat Politècnica de Catalunya
{jan.ciesko, sergi.mateo, xavier.teruel, vicenc.beltran, xavier.martorell, rosa.m.badia, jesus.labarta}@bsc.es

Abstract – Array-type reductions are a frequently occurring algorithmic pattern in many scientific applications. A special case occurs when array elements are accessed in a non-linear, often random manner, which makes concurrent and scalable execution difficult. In this work we present a new approach, consisting of language and runtime support, that facilitates programming and delivers high scalability on modern shared-memory systems for such irregular array-type reductions. A reference implementation in OmpSs, a task-parallel programming model, shows promising results with speed-ups of up to 15x on the Intel Xeon processor.

### I. INTRODUCTION

Irregular array-type reductions, also referred to as scatter-updates, are memory updates over an array type. The non-atomic update operation as well as its dynamic memory access pattern make concurrent execution non-trivial and require careful handling to achieve scalability and correctness. Fig. 1 shows a scalar, a regular and an irregular array-type reduction over target; in the irregular case, the update positions depend on indexes generated by a function f. Algorithms containing an irregular array-type reduction are cache inefficient due to distant memory accesses, and consequently their execution performance is bound by the speed of the memory subsystem. Furthermore, to avoid a race condition in which multiple threads update a single memory location at the same time, accesses need to be either synchronized (via thread synchronization or memory barriers such as atomics), ordered [1] or redirected [2].
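The three reduction kinds distinguished above can be sketched serially in C as follows. This is only an illustration: the function names are hypothetical, and the index function f is represented by a precomputed index array `idx`, which is an assumption made for this sketch rather than anything taken from Fig. 1.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar reduction: every iteration updates the same scalar target. */
double reduce_scalar(const double *src, size_t n) {
    double target = 0.0;
    for (size_t i = 0; i < n; i++)
        target += src[i];
    return target;
}

/* Regular array-type reduction: the update position is the loop index,
 * so memory is walked linearly. */
void reduce_regular(double *target, const double *src, size_t n) {
    for (size_t i = 0; i < n; i++)
        target[i] += src[i];
}

/* Irregular array-type reduction: the update position is idx[i], which
 * stands in for f(i) (assumption). Positions may repeat and may be far
 * apart in memory. */
void reduce_irregular(double *target, size_t target_len,
                      const double *src, const size_t *idx, size_t n) {
    for (size_t i = 0; i < n; i++) {
        assert(idx[i] < target_len);   /* guard against out-of-range indexes */
        target[idx[i]] += src[i];
    }
}
```

Run concurrently without protection, only the irregular variant can have two iterations with different `i` hit the same element of `target`, which is the race described above.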

Access redirection to a thread-private copy of the reduction target is a common approach that eliminates the need for access synchronization. While this works well for scalar types, it becomes expensive for arrays and impractical for large data sets, since every thread needs its own full copy of the array.
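A minimal sketch of array privatization may make its cost concrete. The function name `privatized_reduce` and its parameters are hypothetical, and the workers are executed one after another here to keep the sketch free of a threading dependency; in a real run each worker would be a thread. The return value is the extra memory needed for the private copies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Applies n updates target[idx[i]] += val[i] using one private copy of
 * the target per worker. Returns the number of bytes allocated for the
 * private copies, or 0 if the allocation failed. */
size_t privatized_reduce(double *target, size_t target_len,
                         const size_t *idx, const double *val, size_t n,
                         size_t workers) {
    assert(workers > 0);
    double *priv = calloc(workers * target_len, sizeof *priv);
    if (priv == NULL)
        return 0;

    /* Phase 1: each worker processes its own chunk of updates and writes
     * only to its own copy, so no synchronization is needed. */
    for (size_t w = 0; w < workers; w++) {
        double *mine = priv + w * target_len;
        size_t begin = n * w / workers;
        size_t end = n * (w + 1) / workers;
        for (size_t i = begin; i < end; i++) {
            assert(idx[i] < target_len);
            mine[idx[i]] += val[i];
        }
    }

    /* Phase 2: combine all private copies into the shared target.
     * This touches workers * target_len elements regardless of n. */
    for (size_t w = 0; w < workers; w++)
        for (size_t j = 0; j < target_len; j++)
            target[j] += priv[w * target_len + j];

    free(priv);
    return workers * target_len * sizeof(double);
}
```

Both the memory footprint and the combine step grow with the number of workers times the array length, which is negligible for a scalar but not for a large array.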

Figure 2 shows the performance impact of atomics and array privatization in the RandomAccess [3] kernel benchmark, relative to serial execution, running with 16 threads and different problem sizes. Its source code is shown in Figure 3. Consequently, a new approach is needed that improves cache efficiency, reduces lock contention, eliminates memory barriers and remains applicable to large input data sets. It turns out that a simple yet efficient technique meets these requirements: accesses are redirected to an array of thread-local linear buffers, each of which temporarily stores the memory updates for a certain memory region of the reduction array and is flushed when it is full. We present this approach in more detail in the next section.

### II. RUNTIME SUPPORT

To support irregular array-type reductions in OmpSs, we developed a new approach called Privatization with In-lined Block-ordered Reductions (PIBOR). In this approach, all memory accesses to the original reduction array are redirected to a thread-private buffer. While this is comparable to regular privatization, the buffer is filled linearly, is limited to a preset size and additionally stores the memory address alongside the data of each access. Once the buffer is full, the owning thread reduces the buffer to global memory.

Typically, writing data out to global memory requires acquiring a global lock over the entire data structure, which serializes execution. We prevent this by assigning buffers to discrete memory regions of the reduction array, so that accesses to a certain region of the original reduction array are stored in the corresponding buffer. When a buffer is full, the owning thread tries to acquire a lock that protects only that particular memory region of the global array. Buffers corresponding to different regions can therefore be reduced in parallel, and increasing the number of regions mitigates lock contention over any single region. A schematic overview of an application that runs \( N \) tasks on \( N \) threads and performs a reduction over an array divided into \( M \) locations is shown in Figure 4.

Since buffers correspond to different regions, each memory access needs to be inspected in order to determine its correct buffer. We do so by applying a hash function to the address of the accessed element. The entire process is shown in Figure 5.
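The buffering scheme described in this section could be sketched in C roughly as follows. This is not the OmpSs runtime implementation: all type and function names (`reduction_t`, `thread_ctx_t`, `pibor_update`, and so on), the constants `NUM_REGIONS` and `BUF_CAPACITY`, the use of POSIX mutexes, and the block-wise mapping from address to region used as the hash function are assumptions made for illustration. The sketch also takes the region lock with a blocking call, whereas the text only states that the thread tries to acquire it.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NUM_REGIONS  4   /* M: regions of the reduction array (assumed value) */
#define BUF_CAPACITY 8   /* preset buffer size in entries (assumed value) */

typedef struct { double *addr; double value; } entry_t;  /* address stored with the data */
typedef struct { entry_t entries[BUF_CAPACITY]; size_t count; } buffer_t;
typedef struct { buffer_t buffers[NUM_REGIONS]; } thread_ctx_t; /* one per thread */
typedef struct {
    double *base;                        /* the shared reduction array */
    size_t len;
    pthread_mutex_t locks[NUM_REGIONS];  /* one lock per region, not one global lock */
} reduction_t;

void pibor_init(reduction_t *r, double *base, size_t len) {
    r->base = base;
    r->len = len;
    for (size_t m = 0; m < NUM_REGIONS; m++)
        pthread_mutex_init(&r->locks[m], NULL);
}

void pibor_ctx_init(thread_ctx_t *t) {
    for (size_t m = 0; m < NUM_REGIONS; m++)
        t->buffers[m].count = 0;
}

/* Assumed hash: map the element address to one of NUM_REGIONS equal blocks. */
size_t region_of(const reduction_t *r, const double *addr) {
    size_t index = (size_t)(addr - r->base);
    size_t region_len = (r->len + NUM_REGIONS - 1) / NUM_REGIONS;
    assert(index < r->len);
    return index / region_len;
}

/* Reduce one buffer into global memory while holding only its region's lock. */
void pibor_flush(reduction_t *r, buffer_t *b, size_t region) {
    pthread_mutex_lock(&r->locks[region]);
    for (size_t i = 0; i < b->count; i++)
        *b->entries[i].addr += b->entries[i].value;
    pthread_mutex_unlock(&r->locks[region]);
    b->count = 0;
}

/* Redirected access: append (address, value) linearly; flush when full. */
void pibor_update(reduction_t *r, thread_ctx_t *t, double *addr, double value) {
    size_t region = region_of(r, addr);
    buffer_t *b = &t->buffers[region];
    b->entries[b->count].addr = addr;
    b->entries[b->count].value = value;
    b->count++;
    if (b->count == BUF_CAPACITY)
        pibor_flush(r, b, region);
}

/* Drain the remaining partially filled buffers at the end of the task. */
void pibor_finish(reduction_t *r, thread_ctx_t *t) {
    for (size_t m = 0; m < NUM_REGIONS; m++)
        if (t->buffers[m].count > 0)
            pibor_flush(r, &t->buffers[m], m);
}
```

In a multi-threaded run each thread would own one `thread_ctx_t` and call `pibor_update` in place of the direct `+=` on the array; two threads only contend when they flush buffers of the same region at the same time.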

### III. LANGUAGE SUPPORT

High programmability combined with execution transparency is a key requirement for modern programming models. Since PIBOR is conceptually related to privatization, an approach often found in declarative programming models such as OpenMP [4], its introduction demands minimal effort from front-end compilers, current specifications and user understanding. The following shows language support for array reductions in OmpSs and the corresponding compiler-generated code.

```c
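/* User code: a task with an array-type reduction clause. */
#pragma omp task reduction (array[0:N])
{
    ...
    array[pos] += RMS;
    ...
}

/* Compiler-generated code for the task body: the update of array[pos]
 * is routed through the runtime (RT), which provides the location _tmp. */
{
    ...
    _T * _tmp;
    RT.request( & array[pos], _tmp)
    ...
}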
```

### IV. CASE STUDIES

We evaluate the presented approach using RandomAccess on a single MareNostrum 16-way SMP node. RandomAccess is a kernel benchmark that allows different access patterns to be simulated. In particular, we looked at three representative scenarios: uniform random distribution, block random distribution, and single block random distribution, in which all accesses are restricted to a single memory region. Performance results are shown in Figures 6 and 7.
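To illustrate two of these scenarios, the following sketch generates update indexes for the uniform random case and for the single block case. The generator is a plain 64-bit linear congruential generator used as a stand-in, not the random stream of the RandomAccess benchmark, and all function names are hypothetical. The block random case is not sketched because the text does not define it precisely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in pseudo-random generator (64-bit LCG); the upper bits are
 * returned because the low bits of an LCG have short periods. */
uint64_t next_random(uint64_t *state) {
    *state = *state * 6364136223846793005ULL + 1442695040888963407ULL;
    return *state >> 33;
}

/* Uniform random distribution: any element of the array may be hit. */
size_t index_uniform(uint64_t *state, size_t len) {
    assert(len > 0);
    return (size_t)(next_random(state) % len);
}

/* Single block random distribution: all accesses fall into one region
 * [block_start, block_start + block_len) of the array. */
size_t index_single_block(uint64_t *state, size_t block_start, size_t block_len) {
    assert(block_len > 0);
    return block_start + (size_t)(next_random(state) % block_len);
}
```

With region-based buffering, the single block case is the least favourable of the two, since every flush then targets the same region and therefore the same lock.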

### V. CONCLUSION AND FUTURE WORK

The presented approach scales by redirecting the previously random memory accesses to a region into a linear buffer. Since each buffer corresponds to a memory region of the reduction array, buffers can be flushed in parallel. Future work is directed towards automated tuning of location granularity and buffer sizes, and towards experiments on different processors including Xeon Phi and Power8.

### ACKNOWLEDGMENT

I would like to thank all my coauthors for their invaluable insights and their patience when exposed to my ideas during countless meetings.

### REFERENCES