Variable batched DGEMM
Document type: Conference report
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Rights access: Restricted access - publisher's policy
European Commission's project: EC-H2020-720270
Abstract: Many scientific applications need to solve a large number of small, independent problems. Individually, these problems do not expose enough parallelism, so they must be computed as a batch. Today, vendors such as Intel and NVIDIA are developing their own suites of batch routines. Although most prior work focuses on computing batches of fixed size, in real applications we cannot assume a uniform size across all problems. We explore and analyze different strategies based on the OpenMP parallel for, task, and taskloop pragmas. Although these strategies are straightforward from a programmer's point of view, they have different impacts on performance. We also analyze a new prototype provided by Intel (MKL) that handles batch operations (cblas_dgemm_batch). We propose a new approach called grouping. It groups a set of problems until a limit is reached, in terms of memory occupancy or number of operations. In this way, groups composed of different numbers of problems are distributed across cores, achieving a more balanced distribution of computational cost. This strategy can be up to 6× faster than the Intel (MKL) batch routine.
CitationValero-Lara, P., Martinez-Perez, I., Mateo, S., Sirvent, R., Beltran, V., Martorell, X., Labarta, J. Variable batched DGEMM. A: Euromicro International Conference on Parallel, Distributed, and Network-Based Processing. "PDP 2018: 26th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing: proceedings: Cambridge, United Kingdom 21-23 March 2018". Institute of Electrical and Electronics Engineers (IEEE), 2018, p. 363-367.