Show simple item record
Seamless optimization of the GEMM kernel for task-based programming models
dc.contributor.author | Lorenzon, Arthur F. |
dc.contributor.author | Marques, Sandro M. V. N. |
dc.contributor.author | Navarro Muñoz, Antoni |
dc.contributor.author | Beltran Querol, Vicenç |
dc.contributor.other | Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors |
dc.date.accessioned | 2022-06-30T08:17:19Z |
dc.date.available | 2022-06-30T08:17:19Z |
dc.date.issued | 2022 |
dc.identifier.citation | Lorenzon, A. [et al.]. Seamless optimization of the GEMM kernel for task-based programming models. A: International Conference on Supercomputing. "Proceedings of the 36th ACM International Conference on Supercomputing (ICS-2022): virtual event, June 27–30, 2022". New York: Association for Computing Machinery (ACM), 2022, ISBN 978-1-4503-9281-5. DOI 10.1145/3524059.3532385. |
dc.identifier.isbn | 978-1-4503-9281-5 |
dc.identifier.uri | http://hdl.handle.net/2117/369338 |
dc.description.abstract | The general matrix-matrix multiplication (GEMM) kernel is a fundamental building block of many scientific applications. Many libraries such as Intel MKL and BLIS provide highly optimized sequential and parallel versions of this kernel. The parallel implementations of the GEMM kernel rely on the well-known fork-join execution model to exploit multi-core systems efficiently. However, these implementations are not well suited for task-based applications as they break the data-flow execution model. In this paper, we present a task-based implementation of the GEMM kernel that can be seamlessly leveraged by task-based applications while providing better performance than the fork-join version. Our implementation leverages several advanced features of the OmpSs-2 programming model and a new heuristic to select the best parallelization strategy and blocking parameters based on the matrix and hardware characteristics. When evaluating the performance and energy consumption on two modern multi-core systems, we show that our implementations provide significant performance improvements over an optimized OpenMP fork-join implementation, and can beat vendor implementations of the GEMM (e.g., Intel MKL and AMD AOCL). We also demonstrate that a real application can leverage our optimized task-based implementation to enhance performance. |
dc.language.iso | eng |
dc.publisher | Association for Computing Machinery (ACM) |
dc.subject | Àrees temàtiques de la UPC::Informàtica::Arquitectura de computadors::Arquitectures paral·leles |
dc.subject.lcsh | Parallel processing (Electronic computers) |
dc.subject.lcsh | Microprocessors -- Energy consumption |
dc.subject.other | GEMM |
dc.subject.other | Malleability |
dc.subject.other | Parallel computing |
dc.subject.other | Energy-efficiency |
dc.title | Seamless optimization of the GEMM kernel for task-based programming models |
dc.type | Conference report |
dc.subject.lemac | Processament en paral·lel (Ordinadors) |
dc.subject.lemac | Microprocessadors -- Consum d'energia |
dc.identifier.doi | 10.1145/3524059.3532385 |
dc.description.peerreviewed | Peer Reviewed |
dc.relation.publisherversion | https://dl.acm.org/doi/10.1145/3524059.3532385 |
dc.rights.access | Open Access |
local.identifier.drac | 33880378 |
dc.description.version | Postprint (author's final draft) |
local.citation.author | Lorenzon, A.; Marques, S.; Navarro, A.; Beltran, V. |
local.citation.contributor | International Conference on Supercomputing |
local.citation.pubplace | New York |
local.citation.publicationName | Proceedings of the 36th ACM International Conference on Supercomputing (ICS-2022): virtual event, June 27–30, 2022 |