
dc.contributor.authorValero Lara, Pedro
dc.contributor.authorMartinez Pérez, Ivan
dc.contributor.authorSirvent, Raül
dc.contributor.authorMartorell Bofill, Xavier
dc.contributor.authorPeña, Antonio J.
dc.contributor.otherUniversitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors
dc.contributor.otherBarcelona Supercomputing Center
dc.date.accessioned2020-06-11T10:55:50Z
dc.date.available2020-06-11T10:55:50Z
dc.date.issued2018-01-01
dc.identifier.citationValero-Lara, P. [et al.]. cuThomasBatch and cuThomasVBatch, CUDA Routines to compute batch of tridiagonal systems on NVIDIA GPUs. "Concurrency and computation: practice and experience", 1 January 2018, vol. 30, no. 24, p. 1-10.
dc.identifier.issn1532-0634
dc.identifier.urihttp://hdl.handle.net/2117/190528
dc.description.abstractSolving tridiagonal systems is one of the most computationally expensive parts of many applications, and multiple studies have therefore explored the use of NVIDIA GPUs to accelerate this computation. However, these studies have mainly focused on parallel algorithms, which exploit shared memory efficiently and can saturate the GPU's capacity with a small number of systems, but scale poorly when the number of systems is relatively high. The gtsvStridedBatch routine in NVIDIA's cuSPARSE package is one such example and is used as the reference in this article. We propose a new implementation (cuThomasBatch) based on the Thomas algorithm. Unlike the other algorithms, the Thomas algorithm is sequential, so we implement a coarse-grained approach in which one CUDA thread solves a complete tridiagonal system, instead of one CUDA block as in gtsvStridedBatch. To achieve good scalability with this approach, the way the inputs are stored in memory must be transformed to exploit coalescence (contiguous threads accessing contiguous memory locations). Different variants of this data transformation are explored in detail. We also explore some variants for the variable-batch case, in which the systems in the batch have different sizes (cuThomasVBatch). The results of this study show that the implementations developed in this work outperform the reference code, being up to 5× (in double precision) and 6× (in single precision) faster on the latest NVIDIA GPU architecture, the Pascal P100.
dc.description.sponsorshipThis project was funded from the European Union's Horizon 2020 research and innovation programme under grant agreement 720270 (HBP SGA1), from the Spanish Ministry of Economy and Competitiveness under the project Computación de Altas Prestaciones VII (TIN2015-65316-P) and the Departament d'Innovació, Universitats i Empresa de la Generalitat de Catalunya, under project MPEXPAR: Models de Programació i Entorns d'Execució Paral·lels (2014-SGR-1051). We thank the support of NVIDIA through the BSC/UPC NVIDIA GPU Center of Excellence and the valuable feedback provided by Lung Sheng Chien and Alex Fit-Florea. Antonio J. Peña was cofinanced by the Spanish Ministry of Economy and Competitiveness under Juan de la Cierva fellowship number IJCI-2015-23266.
dc.format.extent10 p.
dc.language.isoeng
dc.publisherWiley
dc.subjectUPC subject areas::Computer science::Computer architecture
dc.subject.lcshParallel processing (Electronic computers)
dc.subject.lcshGraphics processing units
dc.subject.otherCR
dc.subject.otherCUDA
dc.subject.othercuSPARSE
dc.subject.otherPCR
dc.subject.otherScalability
dc.subject.otherThomas algorithm
dc.subject.otherTridiagonal linear systems
dc.titlecuThomasBatch and cuThomasVBatch, CUDA Routines to compute batch of tridiagonal systems on NVIDIA GPUs
dc.typeArticle
dc.subject.lemacParallel processing (Computers)
dc.subject.lemacGraphics processing units
dc.contributor.groupUniversitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
dc.identifier.doi10.1002/cpe.4909
dc.description.peerreviewedPeer Reviewed
dc.relation.publisherversionhttps://onlinelibrary.wiley.com/doi/abs/10.1002/cpe.4909
dc.rights.accessOpen Access
local.identifier.drac23344884
dc.description.versionPostprint (author's final draft)
dc.relation.projectidinfo:eu-repo/grantAgreement/MINECO/1PE/TIN2015-65316-P
dc.relation.projectidinfo:eu-repo/grantAgreement/AGAUR/V PRI/2014 SGR 1051
dc.relation.projectidinfo:eu-repo/grantAgreement/EC/H2020/720270/EU/Human Brain Project Specific Grant Agreement 1/HBP SGA1
dc.relation.projectidinfo:eu-repo/grantAgreement/MINECO/1PE/IJCI-2015-23266
local.citation.authorValero-Lara, P.; Martínez-Pérez, I.; Sirvent, R.; Martorell, X.; Peña, A.
local.citation.publicationNameConcurrency and computation: practice and experience
local.citation.volume30
local.citation.number24
local.citation.startingPage1
local.citation.endingPage10
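
The abstract in this record describes a coarse-grained scheme in which one CUDA thread runs the sequential Thomas algorithm on one complete system, with the inputs rearranged so that neighbouring threads touch neighbouring memory locations. The sketch below illustrates that idea on the CPU in plain Python so it can be run without a GPU. It is not the authors' code: the function names (interleave, thomas_one_system, solve_batch), the particular interleaved layout (element i of system s stored at index i * batch + s), and the absence of pivoting are all assumptions made for illustration, and the record does not say which layout variants the article finally adopts.

```python
def interleave(systems):
    """Pack per-system vectors into one flat list (assumed layout):
    element i of system s is stored at index i * batch + s, so that
    consecutive "threads" s, s+1, ... read consecutive addresses."""
    batch = len(systems)
    n = len(systems[0])
    flat = [0.0] * (batch * n)
    for s, vec in enumerate(systems):
        for i, v in enumerate(vec):
            flat[i * batch + s] = v
    return flat


def thomas_one_system(tid, batch, n, lower, diag, upper, rhs):
    """Work of one hypothetical thread `tid`: solve its system in place.

    diag and rhs are overwritten; the solution ends up in rhs.
    No pivoting is done, so the system is assumed to be well
    conditioned (for example, diagonally dominant)."""
    # Forward elimination: remove the sub-diagonal row by row.
    for i in range(1, n):
        cur = i * batch + tid
        prev = cur - batch
        w = lower[cur] / diag[prev]
        diag[cur] -= w * upper[prev]
        rhs[cur] -= w * rhs[prev]
    # Back substitution: solve from the last row upwards.
    last = (n - 1) * batch + tid
    rhs[last] /= diag[last]
    for i in range(n - 2, -1, -1):
        cur = i * batch + tid
        nxt = cur + batch
        rhs[cur] = (rhs[cur] - upper[cur] * rhs[nxt]) / diag[cur]


def solve_batch(lowers, diags, uppers, rhss):
    """Solve a batch of equally sized tridiagonal systems.

    lowers[s][0] and uppers[s][n-1] are unused padding entries."""
    batch, n = len(diags), len(diags[0])
    lower, diag, upper, rhs = (
        interleave(x) for x in (lowers, diags, uppers, rhss)
    )
    # On a GPU each iteration of this loop would be a separate thread.
    for tid in range(batch):
        thomas_one_system(tid, batch, n, lower, diag, upper, rhs)
    # Undo the interleaving to return one solution vector per system.
    return [[rhs[i * batch + s] for i in range(n)] for s in range(batch)]
```

With this layout, at step i of the elimination every thread reads index i * batch + tid, so a group of consecutive threads reads a contiguous run of memory, which is the access pattern the abstract calls coalescence. Systems of different sizes (the variable-batch case) would need extra handling that this sketch does not attempt.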



All rights reserved. This work is protected by the corresponding intellectual and industrial property rights. Without prejudice to any existing legal exemptions, reproduction, distribution, public communication or transformation of this work is prohibited without permission of the copyright holder.