ARCO - Microarquitectura i Compiladors
http://hdl.handle.net/2117/3111
2017-05-29T15:19:16ZStatistical analysis and comparison of 2T and 3T1D e-DRAM minimum energy operation
http://hdl.handle.net/2117/103076
Statistical analysis and comparison of 2T and 3T1D e-DRAM minimum energy operation
Rana, Manish; Canal Corretger, Ramon; Amat Bertran, Esteve; Rubio Sola, Jose Antonio
Bio-medical wearable devices restricted to their small-capacity embedded-battery require energy-efficiency of the highest order. However, minimum-energy point (MEP) at sub-threshold voltages is unattainable with SRAM memory, which fails to hold below 0.3V because of its vanishing noise margins. This paper examines the minimum-energy operation point of 2T and 3T1D e-DRAM gain cells at the 32-nm technology node with different design points: up-sizing transistors, using high- V th transistors, read/write wordline assists; as well as operating conditions (i.e., temperature). First, the e-DRAM cells are evaluated without considering any process variations. Then, a full-factorial statistical analysis of e-DRAM cells is performed in the presence of threshold voltage variations and the effect of upsizing on mean MEP is reported. Finally, it is shown that the product of the read and write lengths provides a knob to tradeoff energy-efficiency for reliable MEP energy operation.
2017-03-30T08:05:53ZRana, ManishCanal Corretger, RamonAmat Bertran, EsteveRubio Sola, Jose AntonioBio-medical wearable devices restricted to their small-capacity embedded-battery require energy-efficiency of the highest order. However, minimum-energy point (MEP) at sub-threshold voltages is unattainable with SRAM memory, which fails to hold below 0.3V because of its vanishing noise margins. This paper examines the minimum-energy operation point of 2T and 3T1D e-DRAM gain cells at the 32-nm technology node with different design points: up-sizing transistors, using high- V th transistors, read/write wordline assists; as well as operating conditions (i.e., temperature). First, the e-DRAM cells are evaluated without considering any process variations. Then, a full-factorial statistical analysis of e-DRAM cells is performed in the presence of threshold voltage variations and the effect of upsizing on mean MEP is reported. Finally, it is shown that the product of the read and write lengths provides a knob to tradeoff energy-efficiency for reliable MEP energy operation.Exploiting path parallelism in logic programming
http://hdl.handle.net/2117/102606
Exploiting path parallelism in logic programming
Tubella Murgadas, Jordi; González Colás, Antonio María
This paper presents a novel parallel implementation of Prolog. The system is based on Multipath, a novel execution model for Prolog that implements a partial breadth-first search of the SLD-tree. The paper focusses on the type of parallelism inherent to the execution model, which is called path parallelism. This is a particular case of data parallelism that can be efficiently exploited in a SPMD architecture. A SPMD architecture oriented to the Multipath execution model is presented. A simulator of such system has been developed and used to assess the performance of path parallelism. Performance figures show that path parallelism is effective for non-deterministic programs.
2017-03-17T10:18:48ZTubella Murgadas, JordiGonzález Colás, Antonio MaríaThis paper presents a novel parallel implementation of Prolog. The system is based on Multipath, a novel execution model for Prolog that implements a partial breadth-first search of the SLD-tree. The paper focusses on the type of parallelism inherent to the execution model, which is called path parallelism. This is a particular case of data parallelism that can be efficiently exploited in a SPMD architecture. A SPMD architecture oriented to the Multipath execution model is presented. A simulator of such system has been developed and used to assess the performance of path parallelism. Performance figures show that path parallelism is effective for non-deterministic programs.A methodology for user-oriented scalability analysis
http://hdl.handle.net/2117/102605
A methodology for user-oriented scalability analysis
Royo Vallés, María Dolores; Valero García, Miguel; González Colás, Antonio María; Marí, Carme
Scalability analysis provides information about the effectiveness of increasing the number of resources of a parallel system. Several methods have been proposed which use different approaches to provide this information. This paper presents a family of analysis methods oriented to the user. The methods in this family should assist the user in estimating the benefits when increasing the system size. The key issue in the proposal is the appropriate combination of a scaling model, which reflects the way the users utilize an increasing number of resources, and a figure of merit that the user wants to improve with the larger system. Another important element in the proposal is the approach to characterize the scalability, which enables quick visual analyses and comparisons. Finally, three concrete examples of methods belonging to the proposed family are introduced in this paper.
2017-03-17T09:59:20ZRoyo Vallés, María DoloresValero García, MiguelGonzález Colás, Antonio MaríaMarí, CarmeScalability analysis provides information about the effectiveness of increasing the number of resources of a parallel system. Several methods have been proposed which use different approaches to provide this information. This paper presents a family of analysis methods oriented to the user. The methods in this family should assist the user in estimating the benefits when increasing the system size. The key issue in the proposal is the appropriate combination of a scaling model, which reflects the way the users utilize an increasing number of resources, and a figure of merit that the user wants to improve with the larger system. Another important element in the proposal is the approach to characterize the scalability, which enables quick visual analyses and comparisons. Finally, three concrete examples of methods belonging to the proposed family are introduced in this paper.A Jacobi-based algorithm for computing symmetric eigenvalues and eigenvectors in a two-dimensional mesh
http://hdl.handle.net/2117/102603
A Jacobi-based algorithm for computing symmetric eigenvalues and eigenvectors in a two-dimensional mesh
Royo Vallés, María Dolores; Valero García, Miguel; González Colás, Antonio María
The paper proposes an algorithm for computing symmetric eigenvalues and eigenvectors that uses a one-sided Jacobi approach and is targeted to a multicomputer in which nodes can be arranged as a two-dimensional mesh with an arbitrary number of rows and columns. The algorithm is analysed through simple analytical models of execution time, which show that an adequate choice of the mesh configuration (number of rows and columns) can improve performance significantly, with respect to a one-dimensional configuration, which is the most frequently considered scenario in current proposals. This improvement is especially noticeable in large systems.
2017-03-17T09:46:09ZRoyo Vallés, María DoloresValero García, MiguelGonzález Colás, Antonio MaríaThe paper proposes an algorithm for computing symmetric eigenvalues and eigenvectors that uses a one-sided Jacobi approach and is targeted to a multicomputer in which nodes can be arranged as a two-dimensional mesh with an arbitrary number of rows and columns. The algorithm is analysed through simple analytical models of execution time, which show that an adequate choice of the mesh configuration (number of rows and columns) can improve performance significantly, with respect to a one-dimensional configuration, which is the most frequently considered scenario in current proposals. This improvement is especially noticeable in large systems.An efficient solver for Cache Miss Equations
http://hdl.handle.net/2117/102595
An efficient solver for Cache Miss Equations
Bermudo, Nerina; Vera Rivera, Francisco Javier; González Colás, Antonio María; Llosa Espuny, José Francisco
Cache Miss Equations (CME) (S. Ghosh et al., 1997) is a method that accurately describes the cache behavior by means of polyhedra. Even though the computation cost of generating CME is a linear function of the number of references, solving them is a very time consuming task and thus trying to study a whole program may be infeasible. The paper presents effective techniques that exploit some properties of the particular polyhedra generated by CME. Such techniques reduce the complexity of the algorithm to solve CME, which results in a significant speedup when compared with traditional methods. In particular, the proposed approach does not require the computation of the vertices of each polyhedron, which has an exponential complexity
2017-03-16T14:00:35ZBermudo, NerinaVera Rivera, Francisco JavierGonzález Colás, Antonio MaríaLlosa Espuny, José FranciscoCache Miss Equations (CME) (S. Ghosh et al., 1997) is a method that accurately describes the cache behavior by means of polyhedra. Even though the computation cost of generating CME is a linear function of the number of references, solving them is a very time consuming task and thus trying to study a whole program may be infeasible. The paper presents effective techniques that exploit some properties of the particular polyhedra generated by CME. Such techniques reduce the complexity of the algorithm to solve CME, which results in a significant speedup when compared with traditional methods. In particular, the proposed approach does not require the computation of the vertices of each polyhedron, which has an exponential complexityA quantitative assessment of thread-level speculation techniques
http://hdl.handle.net/2117/102590
A quantitative assessment of thread-level speculation techniques
Marcuello Pascual, Pedro; González Colás, Antonio María
Speculative thread-level parallelism has been recently proposed as an alternative source of parallelism that can boost the performance for applications where independent threads are hard to find. Several schemes to exploit thread level parallelism have been proposed and significant performance gains have been reported. However, the sources of the performance gains are poorly understood as well as the impact of some design choices. In this work, the advantages of different thread speculation techniques are analyzed as are the impact of some critical issues including the value predictor, the branch predictor, the thread initialization overhead and the connectivity among thread units
2017-03-16T13:45:43ZMarcuello Pascual, PedroGonzález Colás, Antonio MaríaSpeculative thread-level parallelism has been recently proposed as an alternative source of parallelism that can boost the performance for applications where independent threads are hard to find. Several schemes to exploit thread level parallelism have been proposed and significant performance gains have been reported. However, the sources of the performance gains are poorly understood as well as the impact of some design choices. In this work, the advantages of different thread speculation techniques are analyzed as are the impact of some critical issues including the value predictor, the branch predictor, the thread initialization overhead and the connectivity among thread unitsNear-optimal loop tiling by means of cache miss equations and genetic algorithms
http://hdl.handle.net/2117/102559
Near-optimal loop tiling by means of cache miss equations and genetic algorithms
Abella Ferrer, Jaume; González Colás, Antonio María; Llosa Espuny, José Francisco; Vera Rivera, Francisco Javier
The effectiveness of the memory hierarchy is critical for the performance of current processors. The performance of the memory hierarchy can be improved by means of program transformations such as loop tiling, which is a code transformation targeted to reduce capacity misses. This paper presents a novel systematic approach to perform near-optimal loop tiling based on an accurate data locality analysis (cache miss equations) and a powerful technique to search the solution space that is based on a genetic algorithm. The results show that this approach can remove practically all capacity misses for all considered benchmarks. The reduction of replacement misses results in a decrease of the miss ratio that can be as significant as a factor of 7 for the matrix multiply kernel.
2017-03-16T10:01:59ZAbella Ferrer, JaumeGonzález Colás, Antonio MaríaLlosa Espuny, José FranciscoVera Rivera, Francisco JavierThe effectiveness of the memory hierarchy is critical for the performance of current processors. The performance of the memory hierarchy can be improved by means of program transformations such as loop tiling, which is a code transformation targeted to reduce capacity misses. This paper presents a novel systematic approach to perform near-optimal loop tiling based on an accurate data locality analysis (cache miss equations) and a powerful technique to search the solution space that is based on a genetic algorithm. The results show that this approach can remove practically all capacity misses for all considered benchmarks. The reduction of replacement misses results in a decrease of the miss ratio that can be as significant as a factor of 7 for the matrix multiply kernel.High-Performance low-vcc in-order core
http://hdl.handle.net/2117/102557
High-Performance low-vcc in-order core
Abella Ferrer, Jaume; Chaparro, Pedro; Vera Rivera, Francisco Javier; Carretero Casado, Javier; González Colás, Antonio María
Power density grows in new technology nodes, thus requiring Vcc to scale especially in mobile platforms where energy is critical. This paper presents a novel approach to decrease Vcc while keeping operating frequency high. Our mechanism is referred to as immediate read after write (IRAW) avoidance. We propose an implementation of the mechanism for an Intel® SilverthorneTM in-order core. Furthermore, we show that our mechanism can be adapted dynamically to provide the highest performance and lowest energy-delay product (EDP) at each Vcc level. Results show that IRAW avoidance increases operating frequency by 57% at 500mV and 99% at 400mV with negligible area and power overhead (below 1%), which translates into large speedups (48% at 500mV and 90% at 400mV) and EDP reductions (0.61 EDP at 500mV and 0.33 at 400mV).
2017-03-16T09:39:13ZAbella Ferrer, JaumeChaparro, PedroVera Rivera, Francisco JavierCarretero Casado, JavierGonzález Colás, Antonio MaríaPower density grows in new technology nodes, thus requiring Vcc to scale especially in mobile platforms where energy is critical. This paper presents a novel approach to decrease Vcc while keeping operating frequency high. Our mechanism is referred to as immediate read after write (IRAW) avoidance. We propose an implementation of the mechanism for an Intel® SilverthorneTM in-order core. Furthermore, we show that our mechanism can be adapted dynamically to provide the highest performance and lowest energy-delay product (EDP) at each Vcc level. Results show that IRAW avoidance increases operating frequency by 57% at 500mV and 99% at 400mV with negligible area and power overhead (below 1%), which translates into large speedups (48% at 500mV and 90% at 400mV) and EDP reductions (0.61 EDP at 500mV and 0.33 at 400mV).Exploiting pseudo-schedules to guide data dependence graph partitioning
http://hdl.handle.net/2117/102511
Exploiting pseudo-schedules to guide data dependence graph partitioning
Aleta Ortega, Alexandre; Codina Viñas, Josep M.; Sánchez Navarro, F. Jesús; González Colás, Antonio María; David, Kaeli
This paper presents a new modulo scheduling algorithm for clustered microarchitectures. The main feature of the proposed scheme is that the assignment of instructions to clusters is done by means of graph partitioning algorithms that are guided by a pseudo-scheduler. This pseudo-scheduler is a simplified version of the full instruction scheduler and estimates key constraints that would be encountered in the final schedule. The final scheduling process is bi-directional and includes on-the-fly spill code generation. The proposed scheme is evaluated against previous scheduling approaches using the SPECfp95 benchmark suite. Our modeling results show that better schedules are obtained for most programs across a range of different architectures. For a 4-cluster VLIW architecture with 32 registers and a 2-cycle inter-cluster communication delay we obtain an average speedup of 38.5%.
2017-03-15T12:39:02ZAleta Ortega, AlexandreCodina Viñas, Josep M.Sánchez Navarro, F. JesúsGonzález Colás, Antonio MaríaDavid, KaeliThis paper presents a new modulo scheduling algorithm for clustered microarchitectures. The main feature of the proposed scheme is that the assignment of instructions to clusters is done by means of graph partitioning algorithms that are guided by a pseudo-scheduler. This pseudo-scheduler is a simplified version of the full instruction scheduler and estimates key constraints that would be encountered in the final schedule. The final scheduling process is bi-directional and includes on-the-fly spill code generation. The proposed scheme is evaluated against previous scheduling approaches using the SPECfp95 benchmark suite. Our modeling results show that better schedules are obtained for most programs across a range of different architectures. For a 4-cluster VLIW architecture with 32 registers and a 2-cycle inter-cluster communication delay we obtain an average speedup of 38.5%.Dynamic cluster resizing
http://hdl.handle.net/2117/102496
Dynamic cluster resizing
González González, José; González Colás, Antonio María
Processor resources required for an effective execution of an application vary across different sections. We propose to take advantage of clustering to turn-off resources that do not contribute to improve performance. First, we present a simple hardware scheme to dynamically compute the energy consumed by each processor block and the energy-delay2 product for a given interval of time. This scheme is used to compute the effectiveness of the current configuration in terms of energy-delay2 and evaluate the benefits of increasing/decreasing the number of active issue queues. Performance evaluation shows an average energy-delay2 product improvement of 18%, and up to 50% for some applications, in a quad-cluster architecture.
2017-03-15T10:49:10ZGonzález González, JoséGonzález Colás, Antonio MaríaProcessor resources required for an effective execution of an application vary across different sections. We propose to take advantage of clustering to turn-off resources that do not contribute to improve performance. First, we present a simple hardware scheme to dynamically compute the energy consumed by each processor block and the energy-delay2 product for a given interval of time. This scheme is used to compute the effectiveness of the current configuration in terms of energy-delay2 and evaluate the benefits of increasing/decreasing the number of active issue queues. Performance evaluation shows an average energy-delay2 product improvement of 18%, and up to 50% for some applications, in a quad-cluster architecture.