Conference presentations/papers
http://hdl.handle.net/2117/3126
Updated: 2017-07-25T04:52:32Z
http://hdl.handle.net/2117/106697
ParaView + Alya + D8tree: Integrating high performance computing and high performance data analytics
Artigues, Antoni; Cugnasco, Cesare; Becerra Fontal, Yolanda; Cucchietti, Fernando; Houzeaux, Guillaume; Vázquez, Mariano; Torres Viñals, Jordi; Ayguadé Parra, Eduard; Labarta Mancho, Jesús José
Large-scale time-dependent particle simulations can generate massive amounts of data, so that storing the results is often the slowest phase and the primary time bottleneck of the simulation. Furthermore, analysing this amount of data with traditional tools has become increasingly challenging, and it is often virtually impossible to have a visual representation of the full set.
We propose a novel architecture that integrates an HPC-based multi-physics simulation code, a NoSQL database, and a data analysis and visualisation application. The goals are twofold: on the one hand, we aim to speed up the simulations by taking advantage of the scalability of key-value data stores, while at the same time enabling real-time approximated data visualisation and interactive exploration. On the other hand, we want to make it efficient to explore and analyse the large database of results produced. This work is therefore a clear example of integrating High Performance Computing with High Performance Data Analytics. Our prototype proves the validity of our approach and shows substantial performance improvements: we reduced the time to store the simulation by 67.5% while making real-time queries run 52 times faster than alternative solutions.
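The abstract gives no code, but the idea of keying simulation output by timestep and spatial cell in a key-value store, so a preview can fetch only a coarse subset, can be illustrated. The sketch below is a minimal, hypothetical analogue (the dict-based store and the names `put_particles` and `approximate_query` are illustrative assumptions, not the D8tree or Alya API):

```python
# Hedged illustration (not the paper's code): particle snapshots stored in a
# key-value layout keyed by (timestep, spatial cell), so that an approximate
# visualisation can fetch a bounded sample per cell in real time.
store = {}

def put_particles(t, particles, cell_size=1.0):
    """Insert (x, y, z) particles for timestep t, bucketed by spatial cell."""
    for x, y, z in particles:
        key = (t, int(x // cell_size), int(y // cell_size), int(z // cell_size))
        store.setdefault(key, []).append((x, y, z))

def approximate_query(t, max_per_cell=1):
    # Return at most max_per_cell particles per cell: a cheap approximation
    # suitable for interactive preview while the simulation is still running.
    return [p for (ts, *_), ps in store.items() if ts == t
            for p in ps[:max_per_cell]]

put_particles(0, [(0.1, 0.2, 0.3), (0.4, 0.5, 0.6), (1.5, 0.2, 0.2)])
print(len(approximate_query(0)))  # 2 occupied cells at t=0 -> 2 sampled particles
```

Capping the result per cell is what trades accuracy for latency: the full data stays in the store, while the preview touches only one representative per cell.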
http://hdl.handle.net/2117/106627
Access to streams in multiprocessor systems
Valero Cortés, Mateo; Peirón Guardia, Montse; Ayguadé Parra, Eduard
When accessing streams in vector multiprocessor machines, degradation in the interconnection network and conflicts in the memory modules are the factors that reduce the efficiency of the system. In this paper, we present a synchronous access mechanism that allows conflict-free access to streams in a SIMD vector multiprocessor system. Each processor accesses the corresponding elements out of order, in such a way that in each cycle the requested elements do not collide in the interconnection network. Moreover, memory modules are accessed so that conflicts are avoided. The use of the proposed mechanism in present-day architectures would allow conflict-free access to streams with the most common strides that appear in real applications. The additional hardware is described and shown to be of similar complexity to that required for in-order access.
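As a rough illustration of the module-conflict problem this abstract addresses (not the paper's access mechanism): with M interleaved memory modules and stride s, a strided stream touches only M / gcd(s, M) distinct modules, so any common factor concentrates traffic and causes conflicts. A small check, where `modules_touched` is a hypothetical helper of mine:

```python
from math import gcd

# Which memory modules does a strided stream hit under simple interleaving?
# With n_modules modules, element i lands in module (i * stride) % n_modules,
# so only n_modules / gcd(stride, n_modules) modules are ever used.
def modules_touched(stride, n_modules, length=64):
    return sorted({(i * stride) % n_modules for i in range(length)})

print(len(modules_touched(2, 8)))  # stride 2, 8 modules -> only 4 modules used
print(len(modules_touched(3, 8)))  # stride 3 is coprime with 8 -> all 8 used
```

This is why conflict-free schemes target "the most common strides": even strides on a power-of-two number of modules are exactly the bad case.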
http://hdl.handle.net/2117/106368
A systolic algorithm for the fast computation of the connected components of a graph
Núñez, Fernando J.; Valero Cortés, Mateo
The authors describe a systolic algorithm to solve the connected-component problem. It is executed in a ring topology with N processors, requiring O(N log N) time regardless of the graph's sparsity. The algorithm-partitioning issue is also addressed, indicating how to optimally map the computations onto fixed-size rings or linear arrays. The proposed algorithm leads to simple processing elements, data addressing, and control. These points make the systolic array highly implementable.
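The systolic algorithm itself is not reproduced in the abstract; the sketch below is only a sequential min-label-propagation analogue of the connected-component computation, under my own naming (`connected_components`), not the authors' ring-based formulation:

```python
# Illustrative serial sketch of connected components by min-label propagation.
# The paper's algorithm runs on a systolic ring of N processors; this shows
# only the underlying iteration: repeatedly push the smaller label across
# each edge until no label changes.
def connected_components(n, edges):
    """n vertices (0..n-1), edges as (u, v) pairs; returns one label per vertex,
    equal to the smallest vertex id in that vertex's component."""
    labels = list(range(n))          # each vertex starts in its own component
    changed = True
    while changed:                   # iterate to a fixpoint
        changed = False
        for u, v in edges:
            m = min(labels[u], labels[v])
            if labels[u] != m or labels[v] != m:
                labels[u] = labels[v] = m
                changed = True
    return labels

# Two components: {0, 1, 2} and {3, 4}
print(connected_components(5, [(0, 1), (1, 2), (3, 4)]))  # [0, 0, 0, 3, 3]
```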
http://hdl.handle.net/2117/106203
Analysis and simulation of multiplexed single-bus networks with and without buffering
Llaberia Griñó, José M.; Valero Cortés, Mateo; Herrada Lillo, Enrique; Labarta Mancho, Jesús José
Performance issues of a single-bus interconnection network for multiprocessor systems, operating in a multiplexed way, are presented in this paper. Several models are developed and used to allow system performance evaluation. Comparisons with equivalent crossbar systems are provided. It is shown how crossbar effective bandwidth (EBW) values can be reached and exceeded when appropriate operation parameters are chosen in a multiplexed single-bus system. Another architectural feature is considered, concerning the utilization of buffers at the memory modules. With the buffering scheme, memory interference can be reduced so that system performance is improved in practice.
http://hdl.handle.net/2117/105883
A two level load/store queue based on execution locality
Pericàs Gleim, Miquel; Cristal Kestelman, Adrián; Cazorla, Francisco; González García, Rubén; Veidenbaum, Alexander V.; Jiménez, Daniel A.; Valero Cortés, Mateo
Multicore processors have emerged as a powerful platform on which to efficiently exploit thread-level parallelism (TLP). However, due to Amdahl's Law, such designs will be increasingly limited by the remaining sequential components of applications. To overcome this limitation it is necessary to design processors with many lower-performance cores for TLP and some high-performance cores designed to execute sequential algorithms. Such cores will need to address the memory wall by implementing kilo-instruction windows. Large-window processors require large Load/Store Queues that would be too slow if implemented using current CAM-based designs. This paper proposes an Epoch-based Load Store Queue (ELSQ), a new design based on Execution Locality. It is integrated into a large-window processor that has a fast, out-of-order core operating only on L1/L2 cache hits and N slower cores that process L2 misses and their dependent instructions. The large LSQ is coupled with the slow cores and is partitioned into N small and local LSQs, one per core. We evaluate ELSQ in a large-window environment, finding that it enables high performance at low power. By exploiting locality among loads and stores, ELSQ outperforms even an idealized central LSQ when implemented on top of a decoupled processor design.
http://hdl.handle.net/2117/105729
Computing size-independent matrix problems on systolic array processors
Navarro Guerrero, Juan José; Llaberia Griñó, José M.; Valero Cortés, Mateo
A methodology to transform dense matrices into band matrices is presented in this paper. This transformation is accomplished by partitioning into triangular blocks, and allows the implementation of solutions to problems of any given size by means of contraflow systolic arrays, originally proposed by H.T. Kung. Matrix-vector and matrix-matrix multiplications are the operations considered here. The proposed transformations allow the optimal utilization of the processing elements (PEs) of the systolic array when dense matrices are processed. Every computation is made inside the array by using adequate feedback. The feedback delay time depends only on the systolic array size.
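To make the band-matrix computation concrete, here is a hedged sequential sketch of a matrix-vector product over diagonal (band) storage, the kind of data streaming a contraflow systolic array performs; `band_matvec` and its storage convention are illustrative assumptions of mine, not the paper's notation, and the dense-to-band triangular-block partitioning step itself is not shown:

```python
# Sequential sketch: multiply a band matrix, stored diagonal by diagonal,
# with a vector. diags[k] holds the entries of the diagonal at offsets[k]
# (0 = main diagonal, +1 = first super-diagonal, -1 = first sub-diagonal).
def band_matvec(diags, offsets, x):
    n = len(x)
    y = [0.0] * n
    for d, off in zip(diags, offsets):
        for i, v in enumerate(d):
            row = i if off >= 0 else i - off   # sub-diagonals start at row -off
            col = row + off
            y[row] += v * x[col]               # accumulate this diagonal's term
    return y

# Tridiagonal example: main diagonal [2, 2, 2], sub- and super-diagonals [1, 1]
print(band_matvec([[2, 2, 2], [1, 1], [1, 1]], [0, -1, 1], [1, 1, 1]))  # [3.0, 4.0, 3.0]
```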
http://hdl.handle.net/2117/105723
Agent-based simulation of large population dynamics
Montañola Sales, Cristina; Casanovas Garcia, Josep; Cela Espín, José M.; Onggo, B.S.S.; Kaplan Marcusan, Adriana
Agent-based modelling and simulation is a promising methodology that can be used in the study of population dynamics. One of the main obstacles hindering the use of agent-based simulation in practice is its scalability, especially if the analysis requires large-scale models. A possible solution is to run the agent-based models on top of a scalable parallel discrete-event simulation engine. In this paper we present a modelling and simulation platform implemented to provide basic support for M&S of agent-based demographic systems. As a simulation application, we conducted a study to evaluate its performance in a parallel environment: a supercomputer. A user interface was also designed to allow modellers to easily define models describing different demographic processes and to run them transparently on any computer architecture. Our results show that agent-based modelling can work effectively in the study of demographic scenarios, which can help improve family policy planning and analysis. Moreover, a parallel environment appears suitable for large-scale individual-based simulations of this kind.
http://hdl.handle.net/2117/105722
Parallel simulation of large population dynamics
Montañola Sales, Cristina; Casanovas Garcia, Josep; Cela Espín, José M.; Kaplan Marcusan, Adriana
Agent-based modeling and simulation is a promising methodology that can be used in the study of population dynamics. We present the design and development of a simulation tool which provides basic support for modeling and simulating agent-based demographic systems. Our results show that agent-based modeling can work effectively in the study of demographic scenarios, which can help improve policy planning and analysis. Moreover, a parallel environment appears suitable for large-scale individual-based simulations of this kind.
http://hdl.handle.net/2117/105715
CellSim: a validated modular heterogeneous multiprocessor simulator
Cabarcas Jaramillo, Felipe; Rico Carro, Alejandro; Ródenas Picó, David; Martorell Bofill, Xavier; Ramírez Bellido, Alejandro; Ayguadé Parra, Eduard
As the number of transistors on a chip continues to increase, power consumption has become the most important constraint in processor design. Therefore, to increase performance, computer architects have turned to multiprocessors. Moreover, recent studies have shown that heterogeneous chip multiprocessors have greater potential than homogeneous ones. We have built a modular simulator for heterogeneous multiprocessors that can be configured to model IBM's Cell processor. The simulator has been validated against the real machine so that it can be used as a research tool.
http://hdl.handle.net/2117/105599
Distributed training strategies for a computer vision deep learning algorithm on a distributed GPU cluster
Campos Camunez, Victor; Sastre, Francesc; Yagües, Maurici; Bellver, Míriam; Giró Nieto, Xavier; Torres Viñals, Jordi
Deep learning algorithms base their success on building high learning capacity models with millions of parameters that are tuned in a data-driven fashion. These models are trained by processing millions of examples, so that the development of more accurate algorithms is usually limited by the throughput of the computing devices on which they are trained. In this work, we explore how the training of a state-of-the-art neural network for computer vision can be parallelized on a distributed GPU cluster. The effect of distributing the training process is addressed from two different points of view. First, the scalability of the task and its performance in the distributed setting are analyzed. Second, the impact of distributed training methods on the final accuracy of the models is studied.
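The paper's training setup is not reproduced in the abstract; the toy sketch below only illustrates the synchronous data-parallel pattern it studies (shard the data across workers, average per-worker gradients, update one shared model), on a 1-D linear model rather than a deep network. `train_data_parallel` and every parameter here are illustrative assumptions, not the authors' code:

```python
# Hedged sketch of synchronous data-parallel training: each simulated "worker"
# computes a gradient on its own data shard, the gradients are averaged (the
# all-reduce step), and a single shared model is updated. A toy model y = w*x
# trained with mean-squared error stands in for the deep network.
def train_data_parallel(xs, ys, n_workers=4, lr=0.1, steps=200):
    w = 0.0
    size = len(xs) // n_workers
    shards = [(xs[i * size:(i + 1) * size], ys[i * size:(i + 1) * size])
              for i in range(n_workers)]
    for _ in range(steps):
        grads = []
        for sx, sy in shards:                       # each worker: local gradient
            g = sum(2 * (w * x - y) * x for x, y in zip(sx, sy)) / len(sx)
            grads.append(g)
        w -= lr * sum(grads) / len(grads)           # all-reduce: average, update
    return w

xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [3 * x for x in xs]                            # true slope is 3
print(round(train_data_parallel(xs, ys), 3))        # 3.0
```

With equal-sized shards the averaged gradient equals the full-batch gradient, so this synchronous scheme changes throughput, not the optimization trajectory, which is the scalability-versus-accuracy trade-off the paper examines for asynchronous variants.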