Compiler and runtime based parallelization & optimization for GPUs
Chair / Department / Institute: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors
Document type: Doctoral thesis
Publisher: Universitat Politècnica de Catalunya
Rights access: Open Access
Graphics Processing Units (GPUs) have been widely adopted to accelerate HPC workloads thanks to their vast computational throughput, their ability to execute large numbers of threads in parallel within SIMD groups, and their use of hardware multithreading to hide long pipeline and memory-access latencies. Two APIs are commonly used for native GPU programming: CUDA, which targets only NVIDIA GPUs, and OpenCL, which targets all types of GPUs as well as other accelerators. However, these APIs expose only low-level hardware characteristics to the programmer, so developing applications that exploit the full performance of GPUs is not a trivial task, and it becomes even harder when the applications have irregular data-access patterns or control flows. Several approaches have been proposed to simplify accelerator programming. Models such as OpenACC and OpenMP are intended to solve these programming challenges. They take a directive-based approach, which allows users to insert non-executable directives that guide the compiler in handling the low-level complexities of the system. However, a performance gap remains between these models and native programming models, because their compilers lack comprehensive knowledge about how to transform code and what to optimize. This thesis targets directive-based programming models to enhance their capabilities for GPU programming. It introduces a new dialect model, a combination of OpenMP and OmpSs, together with several extensions and the MACC infrastructure, a source-to-source compiler targeting CUDA, developed on top of BSC's Mercurium compiler and able to support the new dialect model. The new model automatically allows the use of multiple GPUs in conjunction with the vector and heavily multithreaded capabilities of multicore processors. Moreover, it introduces new clauses for making efficient use of on-chip memory.
Secondly, the thesis focuses on code transformation techniques and proposes the LazyNP method to support nested parallelism in irregular applications such as sparse matrix operations, graph algorithms and graphics algorithms. The method efficiently increases thread granularity in code regions where nested parallelism is desired: the compiler generates code that dynamically packs kernel invocations and postpones their execution until a batch of them is available. To the best of our knowledge, LazyNP was the first successful code transformation method for nested directives on GPUs. Finally, the thesis conducts a thorough exploration of conventional loop scheduling methods on GPUs to identify the advantages and disadvantages of each method, and then proposes the concept of optimized dynamic loop scheduling as an improvement over all existing methods. The contributions of this thesis improve the programmability of GPUs, and this has had an outstanding impact on the OpenMP and OpenACC language committees. Additionally, our work includes contributions to widely used compilers such as Mercurium, Clang and PGI, helping thousands of users to take advantage of our work.
Citation: Ozen, G. "Compiler and runtime based parallelization & optimization for GPUs". Doctoral thesis, UPC, Departament d'Arquitectura de Computadors, 2018. Available at: <http://hdl.handle.net/2117/125844>