Conference papers and presentations (Ponències/Comunicacions de congressos)
http://hdl.handle.net/2117/3126
http://hdl.handle.net/2117/393499
Data prefetching on in-order processors
Ortega Carrasco, Cristobal; García Flores, Víctor; Moretó Planas, Miquel; Casas, Marc; Rositoru, Roxana
Low-power processors have attracted attention due to their energy efficiency. A large market, such as the mobile one, relies on these processors for this very reason. Even High Performance Computing (HPC) systems are starting to consider low-power processors as a way to achieve exascale performance within 20 MW; however, they must meet the right performance/Watt balance. Current low-power processors contain in-order cores, which cannot re-order instructions to avoid data dependency-induced stalls. Whilst this is useful to reduce the chip's total power consumption, it brings several challenges. Due to the widening performance gap between memory and processor, memory is a significant bottleneck. In-order cores cannot re-order instructions and are memory latency bound, something data prefetching can help alleviate by ensuring data is readily available. In this work, we do an exhaustive analysis of available data prefetching techniques in state-of-the-art in-order cores. We analyze 5 static prefetchers and 2 dynamic aggressiveness and destination mechanisms applied to 3 data prefetchers on a set of HPC mini- and proxy-applications, whilst running on in-order processors. We show that next-line prefetching can achieve nearly top performance with reasonable bandwidth consumption when throttled, whilst neighbor prefetchers are found to perform best overall.
http://hdl.handle.net/2117/388005
Transparent load balancing of MPI programs using OmpSs-2@Cluster and DLB
Aguilar Mena, Jimmy; Ali, Omar Shaaban Ibrahim; López Herrero, Víctor; Garcia Casulla, Marta; Carpenter, Paul Matthew; Ayguadé Parra, Eduard; Labarta Mancho, Jesús José
Load imbalance is a long-standing source of inefficiency in high performance computing. The situation has only got worse as applications and systems increase in complexity, e.g., adaptive mesh refinement, DVFS, memory hierarchies, power and thermal management, and manufacturing processes. Load balancing is often implemented in the application, but this obscures application logic and may require extensive code refactoring. This paper presents an automated and transparent dynamic load balancing approach for MPI applications with OmpSs-2 tasks, which relieves applications of this burden. Only local and trivial changes are required to the application. Our approach exploits the ability of OmpSs-2@Cluster to offload tasks for execution on other nodes, and it reallocates compute resources among ranks using the Dynamic Load Balancing (DLB) library. It employs LeWI to react to fine-grained load imbalances and DROM to address coarse-grained load imbalances by reserving cores on other nodes that can be reclaimed on demand. We use an expander graph to limit the amount of point-to-point communication and state. The results show a 46% reduction in time-to-solution for micro-scale solid mechanics on 32 nodes and a 20% reduction beyond DLB for n-body on 16 nodes, when one node is running slow. A synthetic benchmark shows that performance is within 10% of optimal for an imbalance of up to 2.0 on 8 nodes. All software is released open source.
http://hdl.handle.net/2117/384603
Automatic aggregation of subtask accesses for nested OpenMP-style tasks
Ali, Omar Shaaban Ibrahim; Aguilar Mena, Jimmy; Beltran Querol, Vicenç; Carpenter, Paul Matthew; Ayguadé Parra, Eduard; Labarta Mancho, Jesús José
Task-based programming is a high-performance and productive model for expressing parallelism. Tasks encapsulate work to be executed across multiple cores or offloaded to GPUs, FPGAs, other accelerators, or other nodes. In order to maintain parallelism and afford maximum freedom to the scheduler, the task dependency graph should be created in parallel and well in advance of task execution. A key limitation of OpenMP and OmpSs-2 tasking is that a task cannot be created until all its accesses and its descendants' accesses are known. Current approaches to work around this limitation either stop task creation and execution using a taskwait or substitute “fake” accesses known as sentinels. This paper proposes the auto clause, which indicates that the task may create subtasks that access unspecified memory regions, or may allocate and return memory at addresses that are not yet known. Unlike approaches using taskwaits, there is no interruption to the concurrent creation and execution of tasks, maintaining parallelism and the scheduler's ability to optimize load balance and data locality. Unlike existing approaches using sentinels, all tasks can be given a precise specification of their own data accesses, so that a single mechanism is used to control task ordering, program data transfers on distributed memory, and optimize data locality, e.g. on NUMA systems. The auto clause also provides an incremental path to developing programs with nested tasks, by removing the need for every parent task to have a complete specification of the accesses of its descendant tasks. This is redundant information that can be time-consuming and error-prone to describe. We present a straightforward runtime implementation that achieves a 1.4× speedup for n-body with OmpSs-2@Cluster task offloading to 32 nodes and less than 4% slowdown for three benchmarks with task offloading to 8 nodes. All code is open source.
http://hdl.handle.net/2117/384555
An extension of the StarSs programming model for platforms with multiple GPUs
Ayguadé Parra, Eduard; Badia Sala, Rosa Maria; Igual Peña, Francisco D.; Labarta Mancho, Jesús José; Mayo Gual, Rafael; Quintana Ortí, Enrique Salvador
While general-purpose homogeneous multi-core architectures are becoming ubiquitous, there are clear indications that, for a number of important applications, a better performance/power ratio can be attained using specialized hardware accelerators. These accelerators require specific SDKs or programming languages which are not always easy to program. Thus, the impact of the new programming paradigms on programmer productivity will determine their success in the high-performance computing arena. In this paper we present GPU Superscalar (GPUSs), an extension of the Star Superscalar programming model that targets the parallelization of applications on platforms consisting of a general-purpose processor connected to multiple graphics processors. GPUSs deals with architecture heterogeneity and separate memory address spaces, while preserving simplicity and portability. Preliminary experimental results for a well-known operation in numerical linear algebra illustrate the correct adaptation of the runtime to a multi-GPU system, attaining notable performance.
http://hdl.handle.net/2117/384002
Space compression algorithms acceleration on embedded multi-core and GPU platforms
Jover Álvarez, Álvaro; Rodríguez Ferrández, Iván; Kosmidis, Leonidas; Steenari, David
Future space missions will require increased on-board computing power to process and compress massive amounts of data. Consequently, embedded multi-core and GPU platforms are being considered, which have been shown to be beneficial for data processing. However, the acceleration of data compression, an inherently sequential task, has not been explored. In this on-going research paper, we parallelize two space compression standards on both CPUs and GPUs using two candidate embedded GPU platforms for space, showing that despite the challenging nature of the CCSDS algorithms, their parallelization is possible and can provide significant performance benefits.
http://hdl.handle.net/2117/381194
Analyzing the performance of hierarchical collective algorithms on ARM-based multicore clusters
Utrera Iglesias, Gladys Miriam; Gil, Marisa; Martorell Bofill, Xavier
MPI is the de facto standard communication library for parallel applications on distributed memory architectures. The performance of collective operations is critical in HPC applications, as they can become the bottleneck of their executions. The advent of larger node sizes in multicore clusters has motivated the exploration of hierarchical collective algorithms that are aware of process placement in the cluster and of the memory hierarchy. This work analyses and compares several hierarchical collective algorithms from the literature that are not part of the current MPI standard. We implement the algorithms on top of OpenMPI using the shared-memory facility provided by MPI-3 at the intra-node level and evaluate them on ARM-based multicore clusters. From our results, we identify aspects of the algorithms that affect their performance and applicability. Finally, we propose a model that helps us analyze the scalability of the algorithms.
http://hdl.handle.net/2117/380760
Tuning dynamic web applications using fine-grain analysis
Guitart Fernández, Jordi; Carrera Pérez, David; Torres Viñals, Jordi; Ayguadé Parra, Eduard; Labarta Mancho, Jesús José
In this paper we present a methodology to analyze the behavior and performance of Java application servers using a performance analysis framework. This framework considers all levels involved in the application server's execution (application, server, virtual machine and operating system), allowing a fine-grain analysis of dynamic Web applications. The proposed methodology is based on formulating hypotheses that could explain the presence of certain symptoms that lead to bad server performance, unexplained server behavior or a server malfunction. The methodology establishes that hypotheses must be verified (in order to confirm or discard them) by performing actions with the performance analysis framework. To show the potential of the proposed analysis methodology, we present three successful experiences where a detailed and correlated analysis of application server behavior allowed the detection and correction of three performance degradation situations.
http://hdl.handle.net/2117/380475
Workload analysis support in the eNANOS project (Soporte para el análisis de workloads en el proyecto eNANOS)
Rodero Castro, Iván; Corbalán González, Julita; Duran González, Alejandro; Labarta Mancho, Jesús José
The eNANOS project proposes coordinated job scheduling across several levels, from the heterogeneous and dynamic environment of a Grid down to the execution of processes and threads on the CPUs of a computer or a cluster. Studying scheduling policies requires some mechanism for workload analysis. In this article we present a job monitoring mechanism integrated into the eNANOS Scheduler. We also present a tool that, from the information obtained during monitoring, can generate traces that are easily visualized and analyzed with Paraver. To show the advantages of the proposed system, we present the evaluation of a workload and compare it with possible alternatives.
http://hdl.handle.net/2117/380089
Optimizing NANOS OpenMP for the IBM Cyclops multithreaded architecture
Ródenas Picó, David; Martorell Bofill, Xavier; Ayguadé Parra, Eduard; Labarta Mancho, Jesús José; Almási, George; Cascaval, Calin; Castaños, José G.; Moreira, Jose E.
In this paper, we present two approaches to improving the execution of OpenMP applications on the IBM Cyclops multithreaded architecture. The two solutions are independent, and both aim to obtain better performance through better management of cache locality. The first is based on software modifications to the OpenMP runtime library to balance stack accesses across all data caches. The second is a small hardware modification to the data cache mapping behavior, with the same goal. Both solutions help parallel applications improve scalability and obtain better performance on this kind of architecture. In fact, they could also be applied to future multi-core processors. We have executed (using simulation) some of the NAS benchmarks to evaluate these proposals. The results show how, with small changes to both the software and the hardware, we achieve very good scalability in parallel applications. Our results also show that standard execution environments oriented to multiprocessor architectures can be easily adapted to exploit multithreaded processors.
http://hdl.handle.net/2117/379323
WAS control center: an autonomic performance-triggered tracing environment for WebSphere
Carrera Pérez, David; García, David; Torres Viñals, Jordi; Ayguadé Parra, Eduard; Labarta Mancho, Jesús José
Studying any aspect of an application server with high availability requirements can become a tedious task when continuous monitoring of the server status is necessary. The creation of performance-driven autonomic systems can speed up the analysis of this kind of complex system. In this paper we present an autonomic performance-driven environment for the WebSphere application server that can be used as the basis for constructing systems that must monitor the performance of the system. As an applied use of this infrastructure, we present the WAS Control Center, a deep tracing tool-set for 24×7 environments. It exploits the benefits of autonomic computing to lighten the costs of highly detailed system tracing on a J2EE application server. The WAS Control Center is helping us create performance models of the WebSphere application server.