BLAS-3 optimized by OmpSs regions (LASs library)
Document type: Conference report
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Rights access: Open Access
In this paper we propose a set of optimizations for the BLAS-3 routines of the LASs library (Linear Algebra routines on OmpSs) and analyze in detail the impact of the proposed changes on performance and execution time. OmpSs allows regions to be used in task dependences. This helps not only in programming the algorithmic optimizations, but also in reducing the execution time those optimizations achieve. Different strategies are implemented to reduce the number of tasks created (when there is enough parallelism) during the execution of BLAS-3 operations in the original LASs. A better IPC is also obtained thanks to better exploitation of the memory hierarchy. More specifically, we improve performance, particularly on large matrices, by about 12% for TRSM and 17% for GEMM with respect to the original version of LASs, even when using fewer cores in the case of GEMM/SYMM. Moreover, when LASs is compared with PLASMA, the OpenMP reference dense linear algebra library, performance improves by up to 12.5% for GEMM/SYMM, while for TRSM/TRMM this value rises to 15%.
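The abstract does not spell out the task-reduction strategies, so the following is only a rough illustration of the general idea, not the LASs implementation. It is a sequential Python stand-in (no OmpSs runtime involved) that compares a fine-grained decomposition of C += A·B, one task per tile triple, with a coarser one in which each task covers a whole row panel of C, as a dependence on a region rather than on a single tile would allow. The function names (`blocks`, `multiply_region`, `gemm_tile_tasks`, `gemm_panel_tasks`) and the choice of row panels as the coarser unit are assumptions made for this sketch.

```python
def blocks(n, bs):
    """Split range(n) into consecutive index ranges of at most bs elements."""
    return [range(start, min(start + bs, n)) for start in range(0, n, bs)]


def multiply_region(A, B, C, rows, cols, inner):
    """Accumulate C[rows, cols] += A[rows, inner] * B[inner, cols].

    In a task-based runtime this body would be one task; here it simply runs
    in place (hypothetical stand-in, not an OmpSs call).
    """
    for i in rows:
        for j in cols:
            acc = C[i][j]
            for p in inner:
                acc += A[i][p] * B[p][j]
            C[i][j] = acc


def gemm_tile_tasks(A, B, m, n, k, bs):
    """Fine-grained version: one 'task' per (row tile, column tile, inner tile)."""
    C = [[0] * n for _ in range(m)]
    tasks = 0
    for rows in blocks(m, bs):
        for cols in blocks(n, bs):
            for inner in blocks(k, bs):
                multiply_region(A, B, C, rows, cols, inner)
                tasks += 1
    return C, tasks


def gemm_panel_tasks(A, B, m, n, k, bs):
    """Coarser version: one 'task' per row panel of C.

    Each task reads a block row of A and all of B, and updates a block row
    of C, i.e. its dependences are whole regions instead of single tiles.
    """
    C = [[0] * n for _ in range(m)]
    tasks = 0
    for rows in blocks(m, bs):
        multiply_region(A, B, C, rows, range(n), range(k))
        tasks += 1
    return C, tasks


if __name__ == "__main__":
    m = n = k = 8
    bs = 2
    A = [[i + p for p in range(k)] for i in range(m)]
    B = [[p - j for j in range(n)] for p in range(k)]
    C_tile, t_tile = gemm_tile_tasks(A, B, m, n, k, bs)
    C_panel, t_panel = gemm_panel_tasks(A, B, m, n, k, bs)
    print("results match:", C_tile == C_panel)
    print("tile tasks:", t_tile, "panel tasks:", t_panel)
```

For 8×8 matrices with a tile size of 2, the fine-grained version creates 64 tasks and the panel version creates 4, while both produce the same result. Fewer, larger tasks reduce scheduling overhead but also reduce the available parallelism, which is consistent with the abstract's caveat that tasks are merged only when there is enough parallelism.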
Citation: Valero-Lara, P. [et al.]. BLAS-3 optimized by OmpSs regions (LASs library). In: Euromicro International Conference on Parallel, Distributed, and Network-Based Processing. "27th Euromicro International Conference on Parallel, Distributed and Network-Based Processing, PDP 2019: Pavia, Italy 13-15 February 2019: proceedings". Institute of Electrical and Electronics Engineers (IEEE), 2019, p. 25-32.
All rights reserved. This work is protected by the corresponding intellectual and industrial property rights. Without prejudice to any existing legal exemptions, reproduction, distribution, public communication or transformation of this work is prohibited without permission of the copyright holder.