On the programmability of heterogeneous massively-parallel computing systems
ColaboratorHwu, Wen-mei William; Navarro, Nacho; Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors
Document typeDoctoral thesis
PublisherUniversitat Politècnica de Catalunya
Rights accessOpen Access
Heterogeneous parallel computing combines general purpose processors with accelerators to efficiently execute both sequential control-intensive and data-parallel phases of applications. Existing programming models for heterogeneous parallel computing impose added coding complexity when compared to traditional sequential shared-memory programming models for homogeneous systems. This extra code complexity is assumable in supercomputing environments, where programmability is sacrificed in pursuit of high performance. However, heterogeneous parallel systems are massively reaching the desktop market (e.g., 425.4 million of GPU cards were sold in 2009), where the trade-off between performance and programmability is the opposite. The code complexity required when using accelerators and the lack of compatibility prevents programmers from exploiting the full computing capabilities of heterogeneous parallel systems in general purpose applications. This dissertation aims to increase the programmability of CPU - accelerator systems, without introducing major performance penalties. The key insight is that general purpose application programmers tend to favor programmability at the cost of system performance. This fact is illustrated by the tendency to use high-level programming languages, such as C++, to ease the task of programming at the cost of minor performance penalties. Moreover, currently many general purpose applications are being developed using interpreted languages, such as Java, C# or python, which raise the abstraction level even further introducing relatively large performance overheads. This dissertation also takes the approach of raising the level of abstraction for accelerators to improve programmability and investigates hardware and software mechanisms to efficiently implement these high-level abstractions without introducing major performance overheads. Heterogeneous parallel systems typically implement separate memories for CPUs and accelerators, although commodity systems might use a shared memory at the cost of lower performance. However, in these commodity shared memory systems, coherence between accelerator and CPUs is not guaranteed. This system architecture implies that CPUs can only access system memory, and accelerators can only access their own local memory. This dissertation assumes separate system and accelerator memory and shows that low-level abstractions for these disjoint address spaces are the source of poor programmability of heterogeneous parallel systems. A first consequence of having separate system and accelerator memories are the current data transfer models for heterogeneous parallel systems. In this dissertation two data transfer paradigms are identified: per-call and double-buffered. In these two models, data structures used by accelerators are allocated in both, system and accelerator memories. These models differ on how data between accelerator and system memories is managed. The per-call model transfers the input data needed by accelerators before accelerator calls, and transfers back the output data produced by accelerators on accelerator call return. The per-call model is quite simple, but might impose unacceptable performance penalties due to data transfer overheads. The double-buffered model aims to overlap data communication and CPU and accelerator computation. This model requires a relative quite complex code due to parallel execution and the need of synchronization between data communication and processing tasks. The extra code required for data transfers in these two models is necessary due to the lack of by-reference parameter passing to accelerators. This dissertation presents a novel accelerator-hosted data transfer model. In this model, data used by accelerators is hosted in the accelerator memory, so when the CPU accesses this data, it is effectively accessing the accelerator memory. Such a model cleanly supports by-reference parameter passing to accelerator calls, removing the need to explicit data transfers. The second consequence of separate system and accelerator memories is that current programming models export separate virtual system and accelerator address spaces to application programmers. This dissertation identifies the double-pointer problem as a direct consequence of these separate virtual memory spaces. The double-pointer problem is that data structures used by both, accelerators and CPUs, are referenced by different virtual memory addresses (pointers) in the CPU and accelerator code. The double-pointer problem requires programmers to add extra code to ensure that both pointers contain consistent values (e.g., when reallocating a data structure). Keeping consistency between system and accelerator pointers might penalize accelerator performance and increase the accelerator memory requirements when pointers are embedded within data structures (e.g., a linked-list). For instance, the double-pointer problem requires increasing the numbers of global memory accesses by 2X in a GPU code that reconstructs a linked-list. This dissertation argues that a unified virtual address space that includes both, system and accelerator memories is an efficient solution to the double-pointer problem. Moreover, such a unified virtual address space cleanly complements the accelerator-hosted data model previously discussed. This dissertation introduces the Non-Uniform Accelerator Memory Access (NUAMA) architecture, as a hardware implementation of the accelerator-hosted data transfer model and the unified virtual address space. In NUAMA an Accelerator Memory Collector (AMC) is included within the system memory controller to identify memory requests for accelerator-hosted data. The AMC buffers and coalesces such memory requests to efficiently transfer data from the CPU to the accelerator memory. NUAMA also implements a hybrid L2 cache memory. The L2 cache in NUAMA follows a write-throughwrite-non-allocate policy for accelerator hosted data. This policy ensures that the contents of the accelerator memory are updated eagerly and, therefore, when the accelerator is called, most of the data has been already transferred. The eager update of the accelerator memory contents effectively overlaps data communication and CPU computation. A write-backwrite-allocate policy is used for the data hosted by the system memory, so the performance of applications that does not use accelerators is not affected. In NUAMA, accelerator-hosted data is identified using a TLB-assisted mechanism. The page table entries are extended with a bit, which is set for those memory pages that are hosted by the accelerator memory. NUAMA increases the average bandwidth requirements for the L2 cache memory and the interconnection network between the CPU and accelerators, but the instantaneous bandwidth, which is the limiting factor, requirements are lower than in traditional DMA-based architectures. The NUAMA architecture is compared to traditional DMA systems using cycle-accurate simulations. Experimental results show that NUAMA and traditional DMA-based architectures perform equally well. However, the application source code complexity of NUAMA is much lower than in DMA-based architectures. A software implementation of the accelerator-hosted model and the unified virtual address space is also explored. This dissertation presents the Asymmetric Distributed Shared Memory (ADSM) model. ADSM maintains a shared logical memory space for CPUs to access data in the accelerator physical memory but not vice versa. The asymmetry allows light-weight implementations that avoid common pitfalls of symmetrical distributed shared memory systems. ADSM allows programmers to assign data structures to performance critical methods. When a method is selected for accelerator execution, its associated data objects are allocated within the shared logical memory space, which is hosted in the accelerator physical memory and transparently accessible by the methods executed on CPUs. ADSM reduces programming efforts for heterogeneous parallel computing systems and enhances application portability. The design and implementation of an ADSM run-time, called GMAC, on top of CUDA in a GNU/Linux environment is presented. Experimental results show that applications written in ADSM and running on top of GMAC achieve performance comparable to their counterparts using programmer-managed data transfers. This dissertation presents the GMAC system, evaluates different design choices, and it further suggests additional architectural support that will likely allow GMAC to achieve higher application performance than the current CUDA model. Finally, the execution model of heterogeneous parallel systems is considered. Accelerator execution is abstracted in different ways in existent programming models. This dissertation explores three approaches implemented by existent programming models. OpenCL and the NVIDIA CUDA driver API use file descriptor semantics to abstract accelerators: user processes access accelerators through descriptors. This approach increases the complexity of using accelerators because accelerator descriptors are needed in any call involving the accelerator (e.g., memory allocations or passing a parameter to the accelerator). The IBM Cell SDK abstract accelerators as separate execution threads. This approach requires adding the necessary code to create new execution threads and synchronization primitives to use of accelerators. Finally, the NVIDIA CUDA run-time API abstract accelerators as Remote Procedure Calls (RPC). This approach is fundamentally incompatible with ADSM, because it assumes separate virtual address spaces for accelerator and CPU code. The Heterogeneous Parallel Execution (HPE) model is presented in this dissertation. This model extends the execution thread abstraction to incorporate different execution modes. Execution modes define the capabilities (e.g., accessible virtual address space, code ISA, etc) of the code being executed. In this execution model, accelerator calls are implemented as execution mode switches, analogously to system calls. Accelerator calls in HPE are synchronous, on the contrary of CUDA, OpenCL and the IBM Cell SDK. Synchronous accelerator calls provide full compatibility with the existent sequential execution model provided by most operating systems. Moreover, abstracting accelerator calls as execution mode switches allows application that use accelerator to run on system without accelerators. In these systems, the execution mode switch falls back to an emulation layer, which emulates the accelerator execution in the CPU. This dissertation further presents different design and implementation choices for the HPE model, in GMAC. The necessary hardware support for an efficient implementation of this model is also presented. Experimental results show that HPE introduces a low execution-time overhead while offering a clean and simple programming interface to applications.
- Tesis - TDX-UPC