dc.contributor.author: Milic, Ugljesa
dc.contributor.author: Gelado Fernandez, Isaac
dc.contributor.author: Puzovic, Nikola
dc.contributor.author: Ramírez Bellido, Alejandro
dc.contributor.author: Tomasevic, Milo
dc.contributor.other: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors
dc.date.accessioned: 2014-11-12T14:28:12Z
dc.date.created: 2013
dc.date.issued: 2013
dc.identifier.citation: Milic, U. [et al.]. Parallelizing general histogram application for CUDA architectures. In: International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation. "2013 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation: proceedings: IC-SAMOS 2013: July 15-18, 2013: Samos, Greece". Agios Konstantinos: IEEE Computational Intelligence Society, 2013, p. 11-18.
dc.identifier.isbn: 978-1-4799-0103-6
dc.identifier.uri: http://hdl.handle.net/2117/24701
dc.description.abstract: Histogramming is a tool commonly used in data analysis. Although its serial version is simple to implement, parallelizing it efficiently and scalably can be challenging. This holds especially for platforms that contain one or more massively parallel devices, such as CUDA-capable GPUs, because of issues with domain decomposition, the use of global memory, and similar concerns. In this paper we compare two approaches to implementing general-purpose histogramming on GPUs. The first algorithm keeps a private copy of the bin counters in shared memory for each block of threads. The second uses the Thrust library to sort the input elements and then searches for upper bounds according to the bin widths. For both algorithms we analyze how the speedup over the sequential version depends on the size of the input collection, the number of bins, and the type and distribution of the input elements. We also overlap data transfers between the host CPU and the CUDA device with kernel execution. For both algorithms we analyze the pros and cons in detail. For example, the privatization strategy can be up to 2x faster than sort-search with realistic inputs, but it supports only a limited number of bins. On the other hand, the sort-search strategy achieves about 50% higher speedup than privatization when the input consists of characters, and it can support an unlimited number of bins. Finally, we perform an exploration to determine the optimal algorithm depending on the characteristics and values of the input parameters.
dc.format.extent: 8 p.
dc.language.iso: eng
dc.publisher: IEEE Computational Intelligence Society
dc.rights: Attribution-NonCommercial-NoDerivs 3.0 Spain
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/3.0/es/
dc.subject: Àrees temàtiques de la UPC::Matemàtiques i estadística::Estadística matemàtica
dc.subject: Àrees temàtiques de la UPC::Informàtica::Arquitectura de computadors::Arquitectures paral·leles
dc.subject.lcsh: Mathematical statistics
dc.subject.lcsh: Parallel programming (Computer science)
dc.subject.other: Data analysis
dc.subject.other: Graphics processing units
dc.subject.other: Parallel architectures
dc.subject.other: Shared memory systems
dc.subject.other: CUDA architectures
dc.subject.other: CUDA-capable GPU
dc.subject.other: Thrust library
dc.subject.other: Bin counters
dc.subject.other: Bin widths
dc.subject.other: Data transfer overlapping
dc.subject.other: Domain decomposition
dc.subject.other: General histogram application parallelization
dc.subject.other: General purpose histogramming
dc.subject.other: Global memory
dc.subject.other: Host CPU
dc.subject.other: Kernel execution
dc.subject.other: Optimal algorithm
dc.subject.other: Parallel devices
dc.subject.other: Privatization strategy
dc.subject.other: Shared memory
dc.subject.other: Sort-search strategy
dc.subject.other: Algorithm design and analysis
dc.subject.other: Histograms
dc.subject.other: Instruction sets
dc.subject.other: Kernel
dc.subject.other: Privatization
dc.subject.other: Radiation detectors
dc.title: Parallelizing general histogram application for CUDA architectures
dc.type: Conference report
dc.subject.lemac: Estadística matemàtica
dc.subject.lemac: Programació en paral·lel (Informàtica)
dc.contributor.group: Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
dc.identifier.doi: 10.1109/SAMOS.2013.6621100
dc.description.peerreviewed: Peer Reviewed
dc.relation.publisherversion: http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6621100
dc.rights.access: Restricted access - publisher's policy
local.identifier.drac: 15076452
dc.description.version: Postprint (published version)
dc.relation.projectid: info:eu-repo/grantAgreement/EC/FP7/287759/EU/High Performance and Embedded Architecture and Compilation/HIPEAC
dc.date.lift: 10000-01-01
local.citation.author: Milic, U.; Gelado, I.; Puzovic, N.; Alex Ramirez; Tomasevic, M.
local.citation.contributor: International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation
local.citation.pubplace: Agios Konstantinos
local.citation.publicationName: 2013 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation: proceedings: IC-SAMOS 2013: July 15-18, 2013: Samos, Greece
local.citation.startingPage: 11
local.citation.endingPage: 18
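
The two strategies named in the abstract can be illustrated with a small sequential sketch. This is not the paper's CUDA or Thrust code, only a CPU analogy under stated assumptions: inputs are non-negative integers smaller than num_bins * bin_width, a "block" is modeled as a strided slice of the input rather than a GPU thread block, and Python's sorted and bisect_right stand in for Thrust's sort and upper-bound search. The function names and parameters are hypothetical.

```python
from bisect import bisect_right


def histogram_privatized(data, num_bins, bin_width, num_blocks=4):
    """Privatization sketch: one private counter array per block, merged at the end."""
    # On a GPU each private copy would live in a thread block's shared memory;
    # here a "block" is just a strided slice of the input (an assumption).
    private = [[0] * num_bins for _ in range(num_blocks)]
    for block in range(num_blocks):
        for value in data[block::num_blocks]:
            private[block][value // bin_width] += 1
    # Reduce the private copies into the final histogram.
    return [sum(counts) for counts in zip(*private)]


def histogram_sort_search(data, num_bins, bin_width):
    """Sort-search sketch: sort, find each bin's upper bound, then difference."""
    ordered = sorted(data)
    # The largest integer belonging to bin i is (i + 1) * bin_width - 1, so the
    # upper-bound position of that value is the cumulative count through bin i.
    cumulative = [
        bisect_right(ordered, (i + 1) * bin_width - 1) for i in range(num_bins)
    ]
    # Adjacent differences turn cumulative counts into per-bin counts.
    return [c - p for c, p in zip(cumulative, [0] + cumulative[:-1])]


sample = [0, 1, 1, 4, 5, 7, 7, 7, 9, 11]
print(histogram_privatized(sample, num_bins=4, bin_width=3))   # [3, 2, 3, 2]
print(histogram_sort_search(sample, num_bins=4, bin_width=3))  # [3, 2, 3, 2]
```

The sketch also hints at the trade-off the abstract reports: the privatized version needs num_blocks copies of the counter array, which on a GPU must fit in limited shared memory, while the sort-search version only needs the sorted input and one search per bin.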

