Compiler and runtime based parallelization & optimization for GPUs
Chair / Department / Institute: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors
Document type: Doctoral thesis
Publisher: Universitat Politècnica de Catalunya
Rights access: Open Access
Graphics Processing Units (GPUs) have been widely adopted to accelerate HPC workloads thanks to their vast computational throughput, their ability to execute large numbers of threads in parallel within SIMD groups, and their use of hardware multithreading to hide long pipeline and memory-access latencies. Two APIs are commonly used for native GPU programming: CUDA, which targets only NVIDIA GPUs, and OpenCL, which targets all types of GPUs as well as other accelerators. However, these APIs expose only low-level hardware characteristics to the programmer, so developing applications that exploit the full performance of GPUs is not a trivial task, and it becomes even harder when the applications have irregular data-access patterns or control flows. Several approaches have been proposed to simplify accelerator programming. Models such as OpenACC and OpenMP are intended to solve these programming challenges. They take a directive-based approach, which allows users to insert non-executable directives that guide the compiler in handling the low-level complexities of the system. However, a performance gap remains between these models and native programming models, because their compilers lack comprehensive knowledge about how to transform code and what to optimize. This thesis targets directive-based programming models to enhance their capabilities for GPU programming. It introduces a new dialect model, a combination of OpenMP and OmpSs, together with several extensions and the MACC infrastructure, a source-to-source compiler targeting CUDA, developed on top of BSC's Mercurium compiler and able to support the new dialect model. The new model automatically allows the use of multiple GPUs in conjunction with the vector and heavily multithreaded capabilities of multicore processors. Moreover, it introduces new clauses for making efficient use of on-chip memory.
Secondly, the thesis focuses on code transformation techniques and proposes the LazyNP method to support nested parallelism in irregular applications such as sparse matrix operations, graph algorithms and graphics algorithms. The method efficiently increases thread granularity in code regions where nested parallelism is desired: the compiler generates code that dynamically packs kernel invocations and postpones their execution until a batch of them is available. To the best of our knowledge, LazyNP was the first successful code transformation method for nested directives on GPUs. Finally, the thesis conducts a thorough exploration of conventional loop scheduling methods on GPUs to identify the advantages and disadvantages of each method, and then proposes the concept of optimized dynamic loop scheduling as an improvement over all existing methods. The contributions of this thesis improve the programmability of GPUs, and this has had an outstanding impact on the OpenMP and OpenACC language committees. Additionally, our work includes contributions to widely used compilers such as Mercurium, Clang and PGI, helping thousands of users to take advantage of our work.
Citation: Ozen, G. "Compiler and runtime based parallelization & optimization for GPUs". Doctoral thesis, UPC, Departament d'Arquitectura de Computadors, 2018. Available at: <http://hdl.handle.net/2117/125844>