Parallelizing general histogram application for CUDA architectures
dc.contributor.author | Milic, Ugljesa |
dc.contributor.author | Gelado Fernandez, Isaac |
dc.contributor.author | Puzovic, Nikola |
dc.contributor.author | Ramírez Bellido, Alejandro |
dc.contributor.author | Tomasevic, Milo |
dc.contributor.other | Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors |
dc.date.accessioned | 2014-11-12T14:28:12Z |
dc.date.created | 2013 |
dc.date.issued | 2013 |
dc.identifier.citation | Milic, U. [et al.]. Parallelizing general histogram application for CUDA architectures. In: International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation. "2013 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation: proceedings: IC-SAMOS 2013: July 15-18, 2013: Samos, Greece". Agios Konstantinos: IEEE Computational Intelligence Society, 2013, p. 11-18. |
dc.identifier.isbn | 978-1-4799-0103-6 |
dc.identifier.uri | http://hdl.handle.net/2117/24701 |
dc.description.abstract | Histogramming is a tool commonly used in data analysis. Although its serial version is simple to implement, parallelizing it efficiently and scalably can be challenging. This holds especially for platforms that contain one or more massively parallel devices, such as CUDA-capable GPUs, because of issues with domain decomposition, the use of global memory, and similar concerns. In this paper we compare two approaches for implementing general-purpose histogramming on GPUs. The first algorithm is based on private copies of bin counters stored in shared memory for each block of threads. The second uses the Thrust library to sort the input elements and then search for upper bounds according to bin widths. For both algorithms we analyze how the speedup over the sequential version depends on the size of the input collection, the number of bins, and the type and distribution of the input elements. We also implement overlapping of data transfers between the host CPU and the CUDA device with kernel execution. For both algorithms we analyze the pros and cons in detail. For example, the privatization strategy can be up to 2x faster than sort-search with realistic inputs, but it can support only a limited number of bins. On the other hand, the sort-search strategy achieves about 50% higher speedup than privatization when the input consists of characters, and it can support an unlimited number of bins. Finally, we perform an exploration to determine the optimal algorithm depending on the characteristics and values of the input parameters. |
dc.format.extent | 8 p. |
dc.language.iso | eng |
dc.publisher | IEEE Computational Intelligence Society |
dc.rights | Attribution-NonCommercial-NoDerivs 3.0 Spain |
dc.rights.uri | http://creativecommons.org/licenses/by-nc-nd/3.0/es/ |
dc.subject | UPC subject areas::Mathematics and statistics::Mathematical statistics |
dc.subject | UPC subject areas::Computer science::Computer architecture::Parallel architectures |
dc.subject.lcsh | Mathematical statistics |
dc.subject.lcsh | Parallel programming (Computer science) |
dc.subject.other | Data analysis |
dc.subject.other | Graphics processing units |
dc.subject.other | Parallel architectures |
dc.subject.other | Shared memory systems |
dc.subject.other | CUDA architectures |
dc.subject.other | CUDA-capable GPU |
dc.subject.other | Thrust library |
dc.subject.other | Bin counters |
dc.subject.other | Bin widths |
dc.subject.other | Data transfer overlapping |
dc.subject.other | Domain decomposition |
dc.subject.other | General histogram application parallelization |
dc.subject.other | General purpose histogramming |
dc.subject.other | Global memory |
dc.subject.other | Host CPU |
dc.subject.other | Kernel execution |
dc.subject.other | Optimal algorithm |
dc.subject.other | Parallel devices |
dc.subject.other | Privatization strategy |
dc.subject.other | Shared memory |
dc.subject.other | Sort-search strategy |
dc.subject.other | Algorithm design and analysis |
dc.subject.other | Histograms |
dc.subject.other | Instruction sets |
dc.subject.other | Kernel |
dc.subject.other | Privatization |
dc.subject.other | Radiation detectors |
dc.title | Parallelizing general histogram application for CUDA architectures |
dc.type | Conference report |
dc.subject.lemac | Mathematical statistics |
dc.subject.lemac | Parallel programming (Computer science) |
dc.contributor.group | Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions |
dc.identifier.doi | 10.1109/SAMOS.2013.6621100 |
dc.description.peerreviewed | Peer Reviewed |
dc.relation.publisherversion | http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6621100 |
dc.rights.access | Restricted access - publisher's policy |
local.identifier.drac | 15076452 |
dc.description.version | Postprint (published version) |
dc.relation.projectid | info:eu-repo/grantAgreement/EC/FP7/287759/EU/High Performance and Embedded Architecture and Compilation/HIPEAC |
dc.date.lift | 10000-01-01 |
local.citation.author | Milic, U.; Gelado, I.; Puzovic, N.; Ramirez, A.; Tomasevic, M. |
local.citation.contributor | International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation |
local.citation.pubplace | Agios Konstantinos |
local.citation.publicationName | 2013 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation: proceedings: IC-SAMOS 2013: July 15-18, 2013: Samos, Greece |
local.citation.startingPage | 11 |
local.citation.endingPage | 18 |
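
The two strategies named in the abstract can be illustrated with a minimal host-side sketch. This is not the authors' implementation: it is plain C++ with no CUDA, where `std::sort` and `std::upper_bound` stand in for the Thrust sort and search calls, and per-chunk counter arrays stand in for the per-block copies kept in shared memory. The function names, the integer inputs, the equal-width bins starting at 0, and the `num_blocks` parameter are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sort-search analogue (assumption: bin b covers [b*bin_width, (b+1)*bin_width),
// bin_width > 0, and (num_bins * bin_width) fits in an int).
// Sort the input, then locate the upper bound of each bin's last value;
// the difference between consecutive positions is that bin's count.
std::vector<std::size_t> histogram_sort_search(std::vector<int> data,
                                               int bin_width,
                                               std::size_t num_bins) {
    std::sort(data.begin(), data.end());
    std::vector<std::size_t> counts(num_bins, 0);
    // Skip values below the first bin.
    std::size_t prev = static_cast<std::size_t>(
        std::lower_bound(data.begin(), data.end(), 0) - data.begin());
    for (std::size_t b = 0; b < num_bins; ++b) {
        const int last_value = static_cast<int>(b + 1) * bin_width - 1;
        const std::size_t pos = static_cast<std::size_t>(
            std::upper_bound(data.begin(), data.end(), last_value) - data.begin());
        counts[b] = pos - prev;
        prev = pos;
    }
    return counts;
}

// Privatization analogue (assumption: num_blocks >= 1, bin_width > 0).
// Each chunk of the input updates its own private counter array; the private
// arrays are merged at the end. The chunks are processed one after another
// here, whereas a GPU version would run them concurrently, one per thread block.
std::vector<std::size_t> histogram_privatized(const std::vector<int>& data,
                                              int bin_width,
                                              std::size_t num_bins,
                                              std::size_t num_blocks) {
    std::vector<std::vector<std::size_t>> priv(
        num_blocks, std::vector<std::size_t>(num_bins, 0));
    const std::size_t chunk = (data.size() + num_blocks - 1) / num_blocks;
    for (std::size_t blk = 0; blk < num_blocks; ++blk) {
        const std::size_t begin = std::min(blk * chunk, data.size());
        const std::size_t end = std::min(begin + chunk, data.size());
        for (std::size_t i = begin; i < end; ++i) {
            if (data[i] < 0) continue;  // below the first bin
            const std::size_t b = static_cast<std::size_t>(data[i] / bin_width);
            if (b < num_bins) ++priv[blk][b];  // ignore values past the last bin
        }
    }
    // Merge the private copies into the final histogram.
    std::vector<std::size_t> counts(num_bins, 0);
    for (const auto& p : priv)
        for (std::size_t b = 0; b < num_bins; ++b) counts[b] += p[b];
    return counts;
}
```

Both functions should return the same counts for the same input. The sketch also hints at the trade-off the abstract reports: the private copies grow with the number of bins, while the sort-search path only needs one search per bin on the sorted data.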