Static versus dynamic task scheduling of the LU factorization on ARM big.LITTLE architectures
Static_Versus_Dynamic_Task_Scheduling_of_the_Lu_Factorization_on_ARM_big._LITTLE_Architectures.pdf (361,5Kb) (Restricted access)
Cite as:
hdl:2117/362107
Document type: Text in conference proceedings
Publication date: 2017
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Access conditions: Restricted access due to publisher policy
All rights reserved. This work is protected by the corresponding intellectual and industrial
property rights. Without prejudice to existing legal exemptions, its
reproduction, distribution, public communication or transformation is prohibited without the authorization of the rights holder.
Abstract
We investigate several parallel algorithmic variants of the LU factorization with partial pivoting (LUpp) that trade off the exploitation of increasing levels of task-parallelism in exchange for a more cache-oblivious execution. In particular, our first variant corresponds to the classical implementation of LUpp in the legacy version of LAPACK, which constrains the concurrency exploited to that intrinsic to the basic linear algebra kernels that appear during the factorization, but exerts strict control over the cache memory and uses a static mapping of kernels to cores. A second variant relaxes this task-constrained scenario by introducing a look-ahead of depth one to increase task-parallelism, at the cost of increased pressure on the cache system in terms of cache misses. Finally, the third variant orchestrates an execution where the degree of concurrency is limited only by the actual data dependencies in LUpp, potentially leading to a higher volume of conflicts due to competition for the cache memory resources. The target platform for our implementations and experiments is a specific asymmetric multicore processor (AMP) from ARM, which introduces the additional scheduling complexity of having to deal with two distinct types of cores, and which has a shared L2 cache per cluster, resulting in more contention in the access to this key cache level.
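As a reading aid for the abstract, the following is a minimal, sequential sketch of a blocked right-looking LU factorization with partial pivoting, in the spirit of the classical scheme that the first variant refers to. It is not the authors' implementation: the function names `blocked_lu` and `lu_product` and the panel-width parameter `nb` are hypothetical, the code is plain Python on lists of rows, and it contains no parallelism or cache tuning. The comments only indicate where, under the usual formulation of look-ahead, the next panel could overlap with the trailing update.

```python
def blocked_lu(a, nb):
    """Factor the square matrix `a` (list of row lists) in place as P*A = L*U.

    Returns `perm`, where row i of L*U equals row perm[i] of the original
    matrix. `nb` is the panel width (illustrative parameter, an assumption).
    """
    n = len(a)
    perm = list(range(n))
    for k in range(0, n, nb):
        end = min(k + nb, n)
        # Step 1: panel factorization with partial pivoting (columns k..end-1).
        for j in range(k, end):
            p = max(range(j, n), key=lambda i: abs(a[i][j]))
            if a[p][j] == 0.0:
                raise ValueError("matrix is singular")
            if p != j:
                a[j], a[p] = a[p], a[j]  # swap whole rows
                perm[j], perm[p] = perm[p], perm[j]
            for i in range(j + 1, n):
                a[i][j] /= a[j][j]  # multiplier, stored in place of L
                for c in range(j + 1, end):
                    a[i][c] -= a[i][j] * a[j][c]
        # Step 2: triangular solve, U12 = inv(L11) * A12 (unit lower L11).
        for j in range(k, end):
            for i in range(j + 1, end):
                for c in range(end, n):
                    a[i][c] -= a[i][j] * a[j][c]
        # Step 3: trailing update, A22 -= L21 * U12.
        # With look-ahead of depth one, the columns of the next panel would be
        # updated and factored first, overlapping with the rest of this update.
        for i in range(end, n):
            for j in range(k, end):
                lij = a[i][j]
                for c in range(end, n):
                    a[i][c] -= lij * a[j][c]
    return perm


def lu_product(a):
    """Multiply the unit-lower L and upper U factors stored together in `a`."""
    n = len(a)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for c in range(n):
            s = 0.0
            for j in range(min(i, c) + 1):
                l_ij = 1.0 if i == j else a[i][j]
                s += l_ij * a[j][c]
            out[i][c] = s
    return out


original = [[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]]
work = [row[:] for row in original]
perm = blocked_lu(work, nb=2)
product = lu_product(work)
```

In this sketch, `product[i]` should match `original[perm[i]]` up to rounding. A task-parallel variant would typically split steps 2 and 3 into per-block tasks and let a runtime order them by data dependencies; that part is not shown here.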
Citation: Catalán, S. [et al.]. Static versus dynamic task scheduling of the LU factorization on ARM big.LITTLE architectures. In: IEEE International Parallel and Distributed Processing Symposium Workshops. "2017 IEEE 31st International Parallel and Distributed Processing Symposium Workshops: 29 May-2 June 2017, Orlando, Florida: proceedings". Institute of Electrical and Electronics Engineers (IEEE), 2017, p. 733-742. ISBN 978-1-5386-3408-0. DOI 10.1109/IPDPSW.2017.10.
ISBN: 978-1-5386-3408-0
Publisher's version: https://ieeexplore.ieee.org/document/7965115