<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0">
  <channel>
    <title>DSpace Collection:</title>
    <link>http://hdl.handle.net/2117/3126</link>
    <description />
    <pubDate>Thu, 20 Jun 2013 06:55:42 GMT</pubDate>
    <dc:date>2013-06-20T06:55:42Z</dc:date>
    <itunes:owner>
      <itunes:email>webmaster.bupc@upc.edu</itunes:email>
      <itunes:name>Universitat Politècnica de Catalunya. Servei de Biblioteques i Documentació</itunes:name>
    </itunes:owner>
    <itunes:explicit>no</itunes:explicit>
    <itunes:keywords />
    <item>
      <title>Evaluación formativa usando exámenes no presenciales</title>
      <link>http://hdl.handle.net/2117/19516</link>
      <description>Title: Evaluación formativa usando exámenes no presenciales
Authors: López Álvarez, David; Sánchez Carracedo, Fermín; Cruz Díaz, Josep Llorenç; Fernández Jiménez, Agustín
Abstract: Los exámenes tradicionales están orientados a la evaluación sumativa, no a la formativa, y provocan un aprendizaje superficial, más que un aprendizaje profundo. Su objetivo es evaluar, no facilitar el aprendizaje. Los estudiantes perciben que su futuro a corto plazo depende de su nota en un examen, por lo que orientan su estudio a aprobar dicho examen. En este artículo se exponen las ventajas e inconvenientes de realizar un examen no presencial, con evaluación sumativa y formativa, que los estudiantes realizan fuera de clase a lo largo de un periodo de tiempo mucho más largo que el de un examen tradicional, lo que les ayuda a conseguir un aprendizaje profundo.

Traditional exams are focused on summative assessment, not formative assessment. Their aim is to evaluate, not to facilitate learning, so they result in superficial learning rather than deep learning. Thus, students perceive that their short-term future depends on their exam grade, so their study is geared toward passing the examination. In this paper we propose a take-home exam in which students have more time to solve the questions and are not restricted in the sources they can consult, thereby providing a highly educational task in which students experience a deep learning process.</description>
      <pubDate>Wed, 05 Jun 2013 09:50:29 GMT</pubDate>
      <guid isPermaLink="false">http://hdl.handle.net/2117/19516</guid>
      <dc:date>2013-06-05T09:50:29Z</dc:date>
      <itunes:author>López Álvarez, David; Sánchez Carracedo, Fermín; Cruz Díaz, Josep Llorenç; Fernández Jiménez, Agustín</itunes:author>
      <itunes:explicit>no</itunes:explicit>
      <itunes:keywords>Modelos de evaluación, Evaluación formativa, Evaluación sumativa, Evaluación de competencias, Modelos de aprendizaje</itunes:keywords>
      <itunes:summary>Los exámenes tradicionales están orientados a la evaluación sumativa, no a la formativa, y provocan un aprendizaje superficial, más que un aprendizaje profundo. Su objetivo es evaluar, no facilitar el aprendizaje. Los estudiantes perciben que su futuro a corto plazo depende de su nota en un examen, por lo que orientan su estudio a aprobar dicho examen. En este artículo se exponen las ventajas e inconvenientes de realizar un examen no presencial, con evaluación sumativa y formativa, que los estudiantes realizan fuera de clase a lo largo de un periodo de tiempo mucho más largo que el de un examen tradicional, lo que les ayuda a conseguir un aprendizaje profundo.

Traditional exams are focused on summative assessment, not formative assessment. Their aim is to evaluate, not to facilitate learning, so they result in superficial learning rather than deep learning. Thus, students perceive that their short-term future depends on their exam grade, so their study is geared toward passing the examination. In this paper we propose a take-home exam in which students have more time to solve the questions and are not restricted in the sources they can consult, thereby providing a highly educational task in which students experience a deep learning process.</itunes:summary>
    </item>
    <item>
      <title>Systematic energy characterization of CMP/SMT processor systems via automated micro-benchmarks</title>
      <link>http://hdl.handle.net/2117/19512</link>
      <description>Title: Systematic energy characterization of CMP/SMT processor systems via automated micro-benchmarks
Authors: Bertrán, Ramon; Buyuktosunoglu, Alper; Gupta, Meeta S.; González Tallada, Marc; Bose, Pradip
Abstract: Microprocessor-based systems today are composed of multi-core, multi-threaded processors with complex cache hierarchies and gigabytes of main memory. Accurate characterization of such a system, through predictive pre-silicon modeling and/or diagnostic post-silicon measurement-based analysis, is increasingly cumbersome and error prone. This is especially true of energy-related characterization studies. In this paper, we take the position that automated micro-benchmarks generated with particular objectives in mind hold the key to obtaining accurate energy-related characterization. As such, we first present a flexible micro-benchmark generation framework (MicroProbe) that is used to probe complex multi-core/multi-threaded systems with a variety and range of energy-related queries in mind. We then present experimental results centered around an IBM POWER7 CMP/SMT system to demonstrate how the systematically generated micro-benchmarks can be used to answer three specific queries: (a) How to project application-specific (and if needed, phase-specific) power consumption with component-wise breakdowns? (b) How to measure energy-per-instruction (EPI) values for the target machine? (c) How to bound the worst-case (maximum) power consumption in order to determine safe, but practical (i.e. affordable) packaging or cooling solutions? The solution approaches to the above problems are all new. Hardware measurement-based analysis shows superior power projection accuracy (with error margins of less than 2.3% across SPEC CPU2006) as well as max-power stressing capability (with 10.7% increase in processor power over the very worst-case power seen during the execution of SPEC CPU2006 applications).</description>
      <pubDate>Wed, 05 Jun 2013 07:39:22 GMT</pubDate>
      <guid isPermaLink="false">http://hdl.handle.net/2117/19512</guid>
      <dc:date>2013-06-05T07:39:22Z</dc:date>
      <itunes:author>Bertrán, Ramon; Buyuktosunoglu, Alper; Gupta, Meeta S.; González Tallada, Marc; Bose, Pradip</itunes:author>
      <itunes:explicit>no</itunes:explicit>
      <itunes:keywords>Microprocessor chips, Multi-threading, Multiprocessing systems</itunes:keywords>
      <itunes:summary>Microprocessor-based systems today are composed of multi-core, multi-threaded processors with complex cache hierarchies and gigabytes of main memory. Accurate characterization of such a system, through predictive pre-silicon modeling and/or diagnostic post-silicon measurement-based analysis, is increasingly cumbersome and error prone. This is especially true of energy-related characterization studies. In this paper, we take the position that automated micro-benchmarks generated with particular objectives in mind hold the key to obtaining accurate energy-related characterization. As such, we first present a flexible micro-benchmark generation framework (MicroProbe) that is used to probe complex multi-core/multi-threaded systems with a variety and range of energy-related queries in mind. We then present experimental results centered around an IBM POWER7 CMP/SMT system to demonstrate how the systematically generated micro-benchmarks can be used to answer three specific queries: (a) How to project application-specific (and if needed, phase-specific) power consumption with component-wise breakdowns? (b) How to measure energy-per-instruction (EPI) values for the target machine? (c) How to bound the worst-case (maximum) power consumption in order to determine safe, but practical (i.e. affordable) packaging or cooling solutions? The solution approaches to the above problems are all new. Hardware measurement-based analysis shows superior power projection accuracy (with error margins of less than 2.3% across SPEC CPU2006) as well as max-power stressing capability (with 10.7% increase in processor power over the very worst-case power seen during the execution of SPEC CPU2006 applications).</itunes:summary>
    </item>
    <item>
      <title>Automatic I/O scheduler selection through online workload analysis</title>
      <link>http://hdl.handle.net/2117/19470</link>
      <description>Title: Automatic I/O scheduler selection through online workload analysis
Authors: Nou Castell, Ramon; Giralt, Jacobo; Cortés Rosselló, Antonio
Abstract: I/O performance is a bottleneck for many workloads. The I/O scheduler plays an important role in it. It is typically configured once by the administrator, and no single selection suits the system at all times. Every I/O scheduler behaves differently depending on the workload and the device. We present a method to automatically select the most suitable I/O scheduler for the ongoing workload. This selection is done online, using a workload analysis method with small I/O traces, finding common I/O patterns. Our dynamic mechanism adapts automatically to one of the best schedulers, sometimes achieving improvements in I/O performance for heterogeneous workloads beyond those of any fixed configuration (up to 5%). This technique works with any application and device type (RAID, HDD, SSD), as long as we have a system parameter to tune. It does not need disk simulations or hardware models, which are normally unavailable. We evaluate it in different setups and with different benchmarks.</description>
      <pubDate>Fri, 31 May 2013 09:33:22 GMT</pubDate>
      <guid isPermaLink="false">http://hdl.handle.net/2117/19470</guid>
      <dc:date>2013-05-31T09:33:22Z</dc:date>
      <itunes:author>Nou Castell, Ramon; Giralt, Jacobo; Cortés Rosselló, Antonio</itunes:author>
      <itunes:explicit>no</itunes:explicit>
      <itunes:keywords>I/O scheduling, Pattern matching, Optimization</itunes:keywords>
      <itunes:summary>I/O performance is a bottleneck for many workloads. The I/O scheduler plays an important role in it. It is typically configured once by the administrator, and no single selection suits the system at all times. Every I/O scheduler behaves differently depending on the workload and the device. We present a method to automatically select the most suitable I/O scheduler for the ongoing workload. This selection is done online, using a workload analysis method with small I/O traces, finding common I/O patterns. Our dynamic mechanism adapts automatically to one of the best schedulers, sometimes achieving improvements in I/O performance for heterogeneous workloads beyond those of any fixed configuration (up to 5%). This technique works with any application and device type (RAID, HDD, SSD), as long as we have a system parameter to tune. It does not need disk simulations or hardware models, which are normally unavailable. We evaluate it in different setups and with different benchmarks.</itunes:summary>
    </item>
    <item>
      <title>Symmetric rank-k update on clusters of multicore processors with SMPSs</title>
      <link>http://hdl.handle.net/2117/19425</link>
      <description>Title: Symmetric rank-k update on clusters of multicore processors with SMPSs
Authors: Badia Sala, Rosa Maria; Labarta Mancho, Jesús José; Marjanovic, Vladimir; Martín Huertas, Alberto Francisco; Mayo, Rafael; Quintana-Ortí, Enrique Salvador; Reyes, Ruymán
Abstract: We investigate the use of the SMPSs programming model to leverage task parallelism in the execution of a message-passing implementation of the symmetric rank-k update on clusters equipped with multicore processors. Our experience shows that the major difficulties in adapting the code to the MPI/SMPSs instance of this programming model are due to the usage of the conventional column-major layout of matrices in numerical libraries. On the other hand, the experimental results show a considerable increase in the performance and scalability of our solution when compared with the standard options based on the use of a pure MPI approach or a hybrid one that combines MPI/multi-threaded BLAS.</description>
      <pubDate>Tue, 28 May 2013 09:48:52 GMT</pubDate>
      <guid isPermaLink="false">http://hdl.handle.net/2117/19425</guid>
      <dc:date>2013-05-28T09:48:52Z</dc:date>
      <itunes:author>Badia Sala, Rosa Maria; Labarta Mancho, Jesús José; Marjanovic, Vladimir; Martín Huertas, Alberto Francisco; Mayo, Rafael; Quintana-Ortí, Enrique Salvador; Reyes, Ruymán</itunes:author>
      <itunes:explicit>no</itunes:explicit>
      <itunes:keywords>Linear algebra, ScaLAPACK, Clusters of multi-core processors, SMPSs, Message passing numerical libraries</itunes:keywords>
      <itunes:summary>We investigate the use of the SMPSs programming model to leverage task parallelism in the execution of a message-passing implementation of the symmetric rank-k update on clusters equipped with multicore processors. Our experience shows that the major difficulties in adapting the code to the MPI/SMPSs instance of this programming model are due to the usage of the conventional column-major layout of matrices in numerical libraries. On the other hand, the experimental results show a considerable increase in the performance and scalability of our solution when compared with the standard options based on the use of a pure MPI approach or a hybrid one that combines MPI/multi-threaded BLAS.</itunes:summary>
    </item>
    <item>
      <title>Analyzing long-term access locality to find ways to improve distributed storage systems</title>
      <link>http://hdl.handle.net/2117/19424</link>
      <description>Title: Analyzing long-term access locality to find ways to improve distributed storage systems
Authors: Miranda Bueno, Alberto; Cortés Rosselló, Antonio
Abstract: An efficient design for a distributed filesystem originates from a deep understanding of common access patterns and
user behavior which is obtained through a deep analysis of traces and snapshots. In this paper we analyze traces for eight distributed filesystems that represent a mix of workloads taken from educational, research and commercial environments. We focused on characterizing block access patterns, amount of block sharing and working set size over long periods of time, and we tried to find common behaviors for all workloads that can be generalized to other storage systems. We found that most environments shared large amounts of blocks over time, and that block sharing was significantly affected by repetitive human behavior. We also found that block lifetimes tended to be short, but there were significant amounts of blocks with long lifetimes that were accessed over many consecutive days. Lastly, we determined that most daily accesses were made to a reduced set of blocks. We strongly believe that these findings can be used to improve long-term caching policies as well as data placement algorithms, thus increasing the performance of distributed storage systems.</description>
      <pubDate>Tue, 28 May 2013 09:15:54 GMT</pubDate>
      <guid isPermaLink="false">http://hdl.handle.net/2117/19424</guid>
      <dc:date>2013-05-28T09:15:54Z</dc:date>
      <itunes:author>Miranda Bueno, Alberto; Cortés Rosselló, Antonio</itunes:author>
      <itunes:explicit>no</itunes:explicit>
      <itunes:keywords />
      <itunes:summary>An efficient design for a distributed filesystem originates from a deep understanding of common access patterns and
user behavior which is obtained through a deep analysis of traces and snapshots. In this paper we analyze traces for eight distributed filesystems that represent a mix of workloads taken from educational, research and commercial environments. We focused on characterizing block access patterns, amount of block sharing and working set size over long periods of time, and we tried to find common behaviors for all workloads that can be generalized to other storage systems. We found that most environments shared large amounts of blocks over time, and that block sharing was significantly affected by repetitive human behavior. We also found that block lifetimes tended to be short, but there were significant amounts of blocks with long lifetimes that were accessed over many consecutive days. Lastly, we determined that most daily accesses were made to a reduced set of blocks. We strongly believe that these findings can be used to improve long-term caching policies as well as data placement algorithms, thus increasing the performance of distributed storage systems.</itunes:summary>
    </item>
    <item>
      <title>Empowering automatic data-center management with machine learning</title>
      <link>http://hdl.handle.net/2117/19370</link>
      <description>Title: Empowering automatic data-center management with machine learning
Authors: Berral García, Josep Lluís; Gavaldà Mestre, Ricard; Torres Viñals, Jordi
Abstract: The Cloud as a computing paradigm has become crucial for most Internet business models. Managing and optimizing its performance on a moment-by-moment basis is not easy given the amount and diversity of elements involved (hardware, applications, workloads, customer needs...). Here we show how a combination of scheduling algorithms and data mining techniques helps improve the performance and profitability of a data-center running virtualized web-services. We model the data-center's main resources (CPU, memory, IO), quality of service (viewed as response time), and workloads (incoming streams of requests) from past executions. We show how these models help scheduling algorithms make better decisions about job and resource allocation, aiming for a balance between throughput, quality of service, and power consumption.</description>
      <pubDate>Wed, 22 May 2013 11:19:56 GMT</pubDate>
      <guid isPermaLink="false">http://hdl.handle.net/2117/19370</guid>
      <dc:date>2013-05-22T11:19:56Z</dc:date>
      <itunes:author>Berral García, Josep Lluís; Gavaldà Mestre, Ricard; Torres Viñals, Jordi</itunes:author>
      <itunes:explicit>no</itunes:explicit>
      <itunes:keywords />
      <itunes:summary>The Cloud as a computing paradigm has become crucial for most Internet business models. Managing and optimizing its performance on a moment-by-moment basis is not easy given the amount and diversity of elements involved (hardware, applications, workloads, customer needs...). Here we show how a combination of scheduling algorithms and data mining techniques helps improve the performance and profitability of a data-center running virtualized web-services. We model the data-center's main resources (CPU, memory, IO), quality of service (viewed as response time), and workloads (incoming streams of requests) from past executions. We show how these models help scheduling algorithms make better decisions about job and resource allocation, aiming for a balance between throughput, quality of service, and power consumption.</itunes:summary>
    </item>
    <item>
      <title>IT or not to be: the impact of Moodle in the education of developing countries</title>
      <link>http://hdl.handle.net/2117/19366</link>
      <description>Title: IT or not to be: the impact of Moodle in the education of developing countries
Authors: García Almiñana, Jordi; Somé, Michel; Ayguadé Parra, Eduard; Cabré Garcia, José M.; Casany Guerrero, María José; Frigola Bourlon, Manel; Galanis, Nikolaos; García-Cervigon Gutiérrez, Manuel; Guerrero Zapata, Manel; Muñoz Gracia, María del Pilar
Abstract: E-learning environments, such as Moodle, provide a technology that fosters the improvement of the educational system in developed countries, where education is traditionally performed with relatively high standards of quality. A large number of case studies and research have been conducted to demonstrate how e-learning technologies can be applied to improve both training and learning processes. However, these technologies have not been proved efficient when applied to developing countries. The challenges that must be addressed in developing countries, both technological and societal, are much more complex and the possible solution margins are more constrained than those existing in the context where these technologies have been created. In this paper we show how Moodle can be used to improve the quality of education in developing countries and, even more important, how it can be used to make the educational system more sustainable and effective in the long term. We describe our experience in implementing a programming course in Moodle for the Higher School of Informatics at the Université Polytechnique de Bobo-Dioulasso, in Burkina Faso (West Africa), joining efforts with local professors in designing and implementing the learning system. The case example has been designed with a number of contextual problems in mind: lack of lecturers, excessive teaching hours per lecturer, massive classes, and curricula organization and stability, among others. We finally discuss how the teaching effort is reduced, the students’ knowledge and capacity improve, and the institutional academic model can be guaranteed with the proposal. For this reason, we claim that information technologies in developing countries are a cost-effective way to guarantee the objectives originally defined in the academic curricula and, therefore, deal with the problem of education.</description>
      <pubDate>Wed, 22 May 2013 07:16:25 GMT</pubDate>
      <guid isPermaLink="false">http://hdl.handle.net/2117/19366</guid>
      <dc:date>2013-05-22T07:16:25Z</dc:date>
      <itunes:author>García Almiñana, Jordi; Somé, Michel; Ayguadé Parra, Eduard; Cabré Garcia, José M.; Casany Guerrero, María José; Frigola Bourlon, Manel; Galanis, Nikolaos; García-Cervigon Gutiérrez, Manuel; Guerrero Zapata, Manel; Muñoz Gracia, María del Pilar</itunes:author>
      <itunes:explicit>no</itunes:explicit>
      <itunes:keywords>Moodle, e-Learning, Education in developing countries, Programming course</itunes:keywords>
      <itunes:summary>E-learning environments, such as Moodle, provide a technology that fosters the improvement of the educational system in developed countries, where education is traditionally performed with relatively high standards of quality. A large number of case studies and research have been conducted to demonstrate how e-learning technologies can be applied to improve both training and learning processes. However, these technologies have not been proved efficient when applied to developing countries. The challenges that must be addressed in developing countries, both technological and societal, are much more complex and the possible solution margins are more constrained than those existing in the context where these technologies have been created. In this paper we show how Moodle can be used to improve the quality of education in developing countries and, even more important, how it can be used to make the educational system more sustainable and effective in the long term. We describe our experience in implementing a programming course in Moodle for the Higher School of Informatics at the Université Polytechnique de Bobo-Dioulasso, in Burkina Faso (West Africa), joining efforts with local professors in designing and implementing the learning system. The case example has been designed with a number of contextual problems in mind: lack of lecturers, excessive teaching hours per lecturer, massive classes, and curricula organization and stability, among others. We finally discuss how the teaching effort is reduced, the students’ knowledge and capacity improve, and the institutional academic model can be guaranteed with the proposal. For this reason, we claim that information technologies in developing countries are a cost-effective way to guarantee the objectives originally defined in the academic curricula and, therefore, deal with the problem of education.</itunes:summary>
    </item>
    <item>
      <title>Supporting stateful tasks in a dataflow graph</title>
      <link>http://hdl.handle.net/2117/19282</link>
      <description>Title: Supporting stateful tasks in a dataflow graph
Authors: Gajinov, Vladimir; Stipic, Srdjan; Unsal, Osman Sabri; Harris, Tim; Ayguadé Parra, Eduard; Cristal Kestelman, Adrián
Abstract: This paper introduces Atomic Dataflow Model (ADF) - a programming model for shared-memory systems that combines aspects of dataflow programming with the use of explicitly mutable state. The model provides language constructs that allow a programmer to delineate a program into a set of tasks and to explicitly define input data for each task. This information is conveyed to the ADF runtime system which constructs the task dependency graph and builds the necessary infrastructure for dataflow execution. However, the key aspect of the proposed model is that it does not require the programmer to specify all of the task’s dependencies explicitly, but only those that imply logical ordering between tasks. The ADF model manages the remainder of inter-task dependencies automatically, by executing the body of the task within an implicit memory transaction. This provides an easy-to-program optimistic concurrency substrate and enables a task to safely share data with other concurrent tasks. In this paper, we describe the ADF model and show how it can increase the programmability of shared memory systems.</description>
      <pubDate>Thu, 16 May 2013 10:26:49 GMT</pubDate>
      <guid isPermaLink="false">http://hdl.handle.net/2117/19282</guid>
      <dc:date>2013-05-16T10:26:49Z</dc:date>
      <itunes:author>Gajinov, Vladimir; Stipic, Srdjan; Unsal, Osman Sabri; Harris, Tim; Ayguadé Parra, Eduard; Cristal Kestelman, Adrián</itunes:author>
      <itunes:explicit>no</itunes:explicit>
      <itunes:keywords>Dataflow, Parallelization, Transactional memory</itunes:keywords>
      <itunes:summary>This paper introduces Atomic Dataflow Model (ADF) - a programming model for shared-memory systems that combines aspects of dataflow programming with the use of explicitly mutable state. The model provides language constructs that allow a programmer to delineate a program into a set of tasks and to explicitly define input data for each task. This information is conveyed to the ADF runtime system which constructs the task dependency graph and builds the necessary infrastructure for dataflow execution. However, the key aspect of the proposed model is that it does not require the programmer to specify all of the task’s dependencies explicitly, but only those that imply logical ordering between tasks. The ADF model manages the remainder of inter-task dependencies automatically, by executing the body of the task within an implicit memory transaction. This provides an easy-to-program optimistic concurrency substrate and enables a task to safely share data with other concurrent tasks. In this paper, we describe the ADF model and show how it can increase the programmability of shared memory systems.</itunes:summary>
    </item>
    <item>
      <title>Transactional access to shared memory in StarSs, a task based programming model</title>
      <link>http://hdl.handle.net/2117/19279</link>
      <description>Title: Transactional access to shared memory in StarSs, a task based programming model
Authors: Gayatri, Rahulkumar; Badia Sala, Rosa Maria; Ayguadé Parra, Eduard; Lujan, M; Watson, I.
Abstract: With an increase in the number of processors on a single chip, programming environments which facilitate the exploitation of parallelism on multicore architectures have become a necessity. StarSs is a task-based programming model that enables flexible, high-level programming. Although task synchronization in StarSs is based on data flow and dependency analysis, some applications (e.g. reductions) require locks to access shared data. Transactional Memory is an alternative to lock-based synchronization for controlling access to shared data. In this paper we explore the idea of integrating a lightweight Software Transactional Memory (STM) library, TinySTM, into an implementation of StarSs (SMPSs). The SMPSs runtime and the compiler have been modified to include and use calls to the STM library. We evaluated this approach on four applications and observed better performance in applications with high lock contention.</description>
      <pubDate>Thu, 16 May 2013 09:52:14 GMT</pubDate>
      <guid isPermaLink="false">http://hdl.handle.net/2117/19279</guid>
      <dc:date>2013-05-16T09:52:14Z</dc:date>
      <itunes:author>Gayatri, Rahulkumar; Badia Sala, Rosa Maria; Ayguadé Parra, Eduard; Lujan, M; Watson, I.</itunes:author>
      <itunes:explicit>no</itunes:explicit>
      <itunes:keywords>Data flow, Dependency analysis, High-level programming, Lock-based synchronization, Multicore architectures, Programming environment, Programming models, Runtimes, Shared data, Shared memories, Single chips, Software transactional memory, STM Library, Task synchronization, Task-based, Transactional memory</itunes:keywords>
      <itunes:summary>With an increase in the number of processors on a single chip, programming environments which facilitate the exploitation of parallelism on multicore architectures have become a necessity. StarSs is a task-based programming model that enables flexible, high-level programming. Although task synchronization in StarSs is based on data flow and dependency analysis, some applications (e.g. reductions) require locks to access shared data. Transactional Memory is an alternative to lock-based synchronization for controlling access to shared data. In this paper we explore the idea of integrating a lightweight Software Transactional Memory (STM) library, TinySTM, into an implementation of StarSs (SMPSs). The SMPSs runtime and the compiler have been modified to include and use calls to the STM library. We evaluated this approach on four applications and observed better performance in applications with high lock contention.</itunes:summary>
    </item>
    <item>
      <title>Vector extensions for decision support DBMS acceleration</title>
      <link>http://hdl.handle.net/2117/19276</link>
      <description>Title: Vector extensions for decision support DBMS acceleration
Authors: Hayes, Timothy; Palomar Pérez, Óscar; Unsal, Osman Sabri; Cristal Kestelman, Adrián; Valero Cortés, Mateo
Abstract: Database management systems (DBMS) have become an essential&#xD;
tool for industry and research and are often a significant component&#xD;
of data centres. As a result of this criticality, efficient execution of&#xD;
DBMS engines has become an important area of investigation. This&#xD;
work takes a top-down approach to accelerating decision support&#xD;
systems (DSS) on x86-64 microprocessors using vector ISA exten-&#xD;
sions. In the first step, a leading DSS DBMS is analysed for potential&#xD;
data-level parallelism. We discuss why the existing multimedia SIMD&#xD;
extensions (SSE/AVX) are not suitable for capturing this parallelism&#xD;
and propose a complementary instruction set reminiscent of classical&#xD;
vector architectures. The instruction set is implemented using&#xD;
unintrusive modifications to a modern x86-64 microarchitecture tailored&#xD;
for DSS DBMS. The ISA and microarchitecture are evaluated using&#xD;
a cycle-accurate x86-64 microarchitectural simulator coupled with&#xD;
a highly-detailed memory simulator. We have found a single&#xD;
operator is responsible for 41% of total execution time for the TPC-H&#xD;
DSS benchmark. Our results show performance speedups between&#xD;
1.94x and 4.56x for an implementation of this operator run with our&#xD;
proposed hardware modifications.</description>
      <pubDate>Thu, 16 May 2013 09:15:37 GMT</pubDate>
      <guid isPermaLink="false">http://hdl.handle.net/2117/19276</guid>
      <dc:date>2013-05-16T09:15:37Z</dc:date>
      <itunes:author>Hayes, Timothy; Palomar Pérez, Óscar; Unsal, Osman Sabri; Cristal Kestelman, Adrián; Valero Cortés, Mateo</itunes:author>
      <itunes:explicit>no</itunes:explicit>
      <itunes:keywords />
      <itunes:summary>Database management systems (DBMS) have become an essential&#xD;
tool for industry and research and are often a significant component&#xD;
of data centres. As a result of this criticality, efficient execution of&#xD;
DBMS engines has become an important area of investigation. This&#xD;
work takes a top-down approach to accelerating decision support&#xD;
systems (DSS) on x86-64 microprocessors using vector ISA&#xD;
extensions. In the first step, a leading DSS DBMS is analysed for potential&#xD;
data-level parallelism. We discuss why the existing multimedia SIMD&#xD;
extensions (SSE/AVX) are not suitable for capturing this parallelism&#xD;
and propose a complementary instruction set reminiscent of classical&#xD;
vector architectures. The instruction set is implemented using&#xD;
unintrusive modifications to a modern x86-64 microarchitecture tailored&#xD;
for DSS DBMS. The ISA and microarchitecture are evaluated using&#xD;
a cycle-accurate x86-64 microarchitectural simulator coupled with&#xD;
a highly-detailed memory simulator. We have found a single&#xD;
operator is responsible for 41% of total execution time for the TPC-H&#xD;
DSS benchmark. Our results show performance speedups between&#xD;
1.94x and 4.56x for an implementation of this operator run with our&#xD;
proposed hardware modifications.</itunes:summary>
    </item>
    <item>
      <title>Automatic refinement of parallel applications structure detection</title>
      <link>http://hdl.handle.net/2117/19275</link>
      <description>Title: Automatic refinement of parallel applications structure detection
Authors: González, Juan; Huck, Kevin; Giménez Lucas, Judit; Labarta Mancho, Jesús José
Abstract: Analyzing parallel programs has become increasingly difficult due to the immense amount of information&#xD;
collected on large systems. In this scenario, cluster analysis has&#xD;
been proved to be a useful technique to reduce the amount of&#xD;
data to analyze. A good example is the use of the density-based&#xD;
cluster algorithm DBSCAN to identify similar single program&#xD;
multiple data (SPMD) computing phases in message-passing&#xD;
applications. This structure detection simplifies the analyst's&#xD;
work, as all the available information is reduced to a small&#xD;
set of clusters.&#xD;
However, DBSCAN presents two major problems: it is very&#xD;
sensitive to its parametrization and is not capable of correctly&#xD;
detecting clusters when the data set has different densities across&#xD;
the data space. In this paper, we introduce the Aggregative&#xD;
Cluster Refinement, an iterative algorithm that produces more&#xD;
accurate structure detections of SPMD phases than DBSCAN.&#xD;
In addition, it is able to detect clusters with different densities</description>
      <pubDate>Thu, 16 May 2013 08:57:41 GMT</pubDate>
      <guid isPermaLink="false">http://hdl.handle.net/2117/19275</guid>
      <dc:date>2013-05-16T08:57:41Z</dc:date>
      <itunes:author>González, Juan; Huck, Kevin; Giménez Lucas, Judit; Labarta Mancho, Jesús José</itunes:author>
      <itunes:explicit>no</itunes:explicit>
      <itunes:keywords />
      <itunes:summary>Analyzing parallel programs has become increasingly difficult due to the immense amount of information&#xD;
collected on large systems. In this scenario, cluster analysis has&#xD;
been proved to be a useful technique to reduce the amount of&#xD;
data to analyze. A good example is the use of the density-based&#xD;
cluster algorithm DBSCAN to identify similar single program&#xD;
multiple data (SPMD) computing phases in message-passing&#xD;
applications. This structure detection simplifies the analyst's&#xD;
work, as all the available information is reduced to a small&#xD;
set of clusters.&#xD;
However, DBSCAN presents two major problems: it is very&#xD;
sensitive to its parametrization and is not capable of correctly&#xD;
detecting clusters when the data set has different densities across&#xD;
the data space. In this paper, we introduce the Aggregative&#xD;
Cluster Refinement, an iterative algorithm that produces more&#xD;
accurate structure detections of SPMD phases than DBSCAN.&#xD;
In addition, it is able to detect clusters with different densities</itunes:summary>
    </item>
    <item>
      <title>Integrating dataflow abstractions into the shared memory model</title>
      <link>http://hdl.handle.net/2117/18559</link>
      <description>Title: Integrating dataflow abstractions into the shared memory model
Authors: Gajinov, Vladimir; Stipic, Srdjan; Unsal, Osman Sabri; Harris, Tim; Ayguadé Parra, Eduard; Cristal Kestelman, Adrián
Abstract: In this paper we present the Atomic Dataflow model&#xD;
(ADF), a new task-based parallel programming model for&#xD;
C/C++ which integrates dataflow abstractions into the shared&#xD;
memory programming model. The ADF model provides&#xD;
pragma directives that allow a programmer to organize a&#xD;
program into a set of tasks and to explicitly define input data&#xD;
for each task. The task dependency information is conveyed to&#xD;
the ADF runtime system which constructs the dataflow task&#xD;
graph and builds the necessary infrastructure for dataflow&#xD;
execution. Additionally, the ADF model allows tasks to share&#xD;
data. The key idea is that computation is triggered by dataflow&#xD;
between tasks but that, within a task, execution occurs by&#xD;
making atomic updates to common mutable state. To that end,&#xD;
the ADF model employs transactional memory which&#xD;
guarantees atomicity of shared memory updates. We show&#xD;
examples that illustrate how the programmability of shared&#xD;
memory can be improved using the ADF model. Moreover,&#xD;
our evaluation shows that the ADF model performs well in&#xD;
comparison with programs parallelized using OpenMP and&#xD;
transactional memory.</description>
      <pubDate>Wed, 03 Apr 2013 10:24:08 GMT</pubDate>
      <guid isPermaLink="false">http://hdl.handle.net/2117/18559</guid>
      <dc:date>2013-04-03T10:24:08Z</dc:date>
      <itunes:author>Gajinov, Vladimir; Stipic, Srdjan; Unsal, Osman Sabri; Harris, Tim; Ayguadé Parra, Eduard; Cristal Kestelman, Adrián</itunes:author>
      <itunes:explicit>no</itunes:explicit>
      <itunes:keywords />
      <itunes:summary>In this paper we present the Atomic Dataflow model&#xD;
(ADF), a new task-based parallel programming model for&#xD;
C/C++ which integrates dataflow abstractions into the shared&#xD;
memory programming model. The ADF model provides&#xD;
pragma directives that allow a programmer to organize a&#xD;
program into a set of tasks and to explicitly define input data&#xD;
for each task. The task dependency information is conveyed to&#xD;
the ADF runtime system which constructs the dataflow task&#xD;
graph and builds the necessary infrastructure for dataflow&#xD;
execution. Additionally, the ADF model allows tasks to share&#xD;
data. The key idea is that computation is triggered by dataflow&#xD;
between tasks but that, within a task, execution occurs by&#xD;
making atomic updates to common mutable state. To that end,&#xD;
the ADF model employs transactional memory which&#xD;
guarantees atomicity of shared memory updates. We show&#xD;
examples that illustrate how the programmability of shared&#xD;
memory can be improved using the ADF model. Moreover,&#xD;
our evaluation shows that the ADF model performs well in&#xD;
comparison with programs parallelized using OpenMP and&#xD;
transactional memory.</itunes:summary>
    </item>
    <item>
      <title>On the instrumentation of OpenMP and OmpSs Tasking constructs</title>
      <link>http://hdl.handle.net/2117/18558</link>
      <description>Title: On the instrumentation of OpenMP and OmpSs Tasking constructs
Authors: Servat, Harald; Teruel, Xavier; Llort Sanchez, German; Duran, Alejandro; Giménez, J.; Martorell Bofill, Xavier; Ayguadé Parra, Eduard; Labarta Mancho, Jesús José
Abstract: Parallelism has become more and more commonplace with&#xD;
the advent of multicore processors. Although different parallel&#xD;
programming models have arisen to exploit the computing capabilities of&#xD;
such processors, developing applications that take advantage of these&#xD;
processors may not be easy. And what is worse, the performance achieved&#xD;
by the parallel version of the application may not be what the developer&#xD;
expected, as a result of a dubious utilization of the resources offered by&#xD;
the processor.&#xD;
We present in this paper a fruitful synergy of a shared memory parallel&#xD;
compiler and runtime, and a performance extraction library. The objective of this work is not only to reduce the performance analysis life-cycle when parallelizing an application, but also to extend the&#xD;
analysis experience of the parallel application by incorporating data that&#xD;
is only known on the compiler and runtime side. Additionally we present&#xD;
performance results obtained from executing instrumented applications and evaluate the overhead of the instrumentation.</description>
      <pubDate>Wed, 03 Apr 2013 10:14:57 GMT</pubDate>
      <guid isPermaLink="false">http://hdl.handle.net/2117/18558</guid>
      <dc:date>2013-04-03T10:14:57Z</dc:date>
      <itunes:author>Servat, Harald; Teruel, Xavier; Llort Sanchez, German; Duran, Alejandro; Giménez, J.; Martorell Bofill, Xavier; Ayguadé Parra, Eduard; Labarta Mancho, Jesús José</itunes:author>
      <itunes:explicit>no</itunes:explicit>
      <itunes:keywords />
      <itunes:summary>Parallelism has become more and more commonplace with&#xD;
the advent of multicore processors. Although different parallel&#xD;
programming models have arisen to exploit the computing capabilities of&#xD;
such processors, developing applications that take advantage of these&#xD;
processors may not be easy. And what is worse, the performance achieved&#xD;
by the parallel version of the application may not be what the developer&#xD;
expected, as a result of a dubious utilization of the resources offered by&#xD;
the processor.&#xD;
We present in this paper a fruitful synergy of a shared memory parallel&#xD;
compiler and runtime, and a performance extraction library. The objective of this work is not only to reduce the performance analysis life-cycle when parallelizing an application, but also to extend the&#xD;
analysis experience of the parallel application by incorporating data that&#xD;
is only known on the compiler and runtime side. Additionally we present&#xD;
performance results obtained from executing instrumented applications and evaluate the overhead of the instrumentation.</itunes:summary>
    </item>
    <item>
      <title>Accelerating boosting-based face detection on GPUs</title>
      <link>http://hdl.handle.net/2117/18498</link>
      <description>Title: Accelerating boosting-based face detection on GPUs
Authors: Oro, David; Fernández, Carles; Segura, Carlos; Martorell Bofill, Xavier; Hernando Pericás, Francisco Javier
Abstract: The goal of face detection is to determine the&#xD;
presence of faces in arbitrary images, along with their locations&#xD;
and dimensions. As with any graphics workload,&#xD;
these algorithms benefit from data-level parallelism. Existing&#xD;
parallelization efforts strictly focus on mapping different&#xD;
divide and conquer strategies into multicore CPUs and GPUs.&#xD;
However, even the most advanced single-chip many-core&#xD;
processors to date are still struggling to effectively handle&#xD;
real-time face detection under high-definition video workloads. To&#xD;
address this challenge, face detection algorithms typically avoid&#xD;
computations by dynamically evaluating a boosted cascade&#xD;
of classifiers. Unfortunately, this technique yields a low ALU&#xD;
occupancy in architectures such as GPUs, which heavily rely&#xD;
on large SIMD widths for maximizing data-level parallelism.&#xD;
In this paper we present several techniques to increase the&#xD;
performance of the cascade evaluation kernel, which is the&#xD;
most resource-intensive part of the face detection pipeline.&#xD;
Particularly, the usage of concurrent kernel execution in&#xD;
combination with cascades generated with the GentleBoost&#xD;
algorithm solves the problem of GPU underutilization, and&#xD;
achieves a 5X speedup in 1080p videos on average over&#xD;
the fastest known implementations, while slightly improving&#xD;
the accuracy. Finally, we also studied the parallelization of&#xD;
the cascade training process and its scalability under SMP&#xD;
platforms. The proposed parallelization strategy exploits both&#xD;
task and data-level parallelism and achieves a 3.5X speedup&#xD;
over single-threaded implementations</description>
      <pubDate>Fri, 22 Mar 2013 13:12:26 GMT</pubDate>
      <guid isPermaLink="false">http://hdl.handle.net/2117/18498</guid>
      <dc:date>2013-03-22T13:12:26Z</dc:date>
      <itunes:author>Oro, David; Fernández, Carles; Segura, Carlos; Martorell Bofill, Xavier; Hernando Pericás, Francisco Javier</itunes:author>
      <itunes:explicit>no</itunes:explicit>
      <itunes:keywords />
      <itunes:summary>The goal of face detection is to determine the&#xD;
presence of faces in arbitrary images, along with their locations&#xD;
and dimensions. As with any graphics workload,&#xD;
these algorithms benefit from data-level parallelism. Existing&#xD;
parallelization efforts strictly focus on mapping different&#xD;
divide and conquer strategies into multicore CPUs and GPUs.&#xD;
However, even the most advanced single-chip many-core&#xD;
processors to date are still struggling to effectively handle&#xD;
real-time face detection under high-definition video workloads. To&#xD;
address this challenge, face detection algorithms typically avoid&#xD;
computations by dynamically evaluating a boosted cascade&#xD;
of classifiers. Unfortunately, this technique yields a low ALU&#xD;
occupancy in architectures such as GPUs, which heavily rely&#xD;
on large SIMD widths for maximizing data-level parallelism.&#xD;
In this paper we present several techniques to increase the&#xD;
performance of the cascade evaluation kernel, which is the&#xD;
most resource-intensive part of the face detection pipeline.&#xD;
Particularly, the usage of concurrent kernel execution in&#xD;
combination with cascades generated with the GentleBoost&#xD;
algorithm solves the problem of GPU underutilization, and&#xD;
achieves a 5X speedup in 1080p videos on average over&#xD;
the fastest known implementations, while slightly improving&#xD;
the accuracy. Finally, we also studied the parallelization of&#xD;
the cascade training process and its scalability under SMP&#xD;
platforms. The proposed parallelization strategy exploits both&#xD;
task and data-level parallelism and achieves a 3.5X speedup&#xD;
over single-threaded implementations</itunes:summary>
    </item>
    <item>
      <title>Task-based parallel breadth-first search in heterogeneous environments</title>
      <link>http://hdl.handle.net/2117/18360</link>
      <description>Title: Task-based parallel breadth-first search in heterogeneous environments
Authors: Munguía, Lluis Miquel; Bader, David A.; Ayguadé Parra, Eduard
Abstract: Breadth-first search (BFS) is an essential&#xD;
graph traversal strategy widely used in many computing&#xD;
applications. Because of its irregular data access patterns,&#xD;
BFS has become a non-trivial problem that is hard to parallelize&#xD;
efficiently. In this paper, we introduce a parallelization&#xD;
strategy that allows the load balancing of computation&#xD;
resources as well as the execution of graph traversals in&#xD;
hybrid environments composed of CPUs and GPUs. To&#xD;
achieve that goal, we use a fine-grained task-based parallelization&#xD;
scheme and the OmpSs programming model. We&#xD;
obtain processing rates of up to 2.8 billion traversed edges&#xD;
per second with a single GPU and a multi-core processor.&#xD;
Our study shows high processing rates are achievable&#xD;
with hybrid environments despite the GPU communication&#xD;
latency and memory coherence.</description>
      <pubDate>Mon, 18 Mar 2013 10:13:40 GMT</pubDate>
      <guid isPermaLink="false">http://hdl.handle.net/2117/18360</guid>
      <dc:date>2013-03-18T10:13:40Z</dc:date>
      <itunes:author>Munguía, Lluis Miquel; Bader, David A.; Ayguadé Parra, Eduard</itunes:author>
      <itunes:explicit>no</itunes:explicit>
      <itunes:keywords />
      <itunes:summary>Breadth-first search (BFS) is an essential&#xD;
graph traversal strategy widely used in many computing&#xD;
applications. Because of its irregular data access patterns,&#xD;
BFS has become a non-trivial problem that is hard to parallelize&#xD;
efficiently. In this paper, we introduce a parallelization&#xD;
strategy that allows the load balancing of computation&#xD;
resources as well as the execution of graph traversals in&#xD;
hybrid environments composed of CPUs and GPUs. To&#xD;
achieve that goal, we use a fine-grained task-based parallelization&#xD;
scheme and the OmpSs programming model. We&#xD;
obtain processing rates of up to 2.8 billion traversed edges&#xD;
per second with a single GPU and a multi-core processor.&#xD;
Our study shows high processing rates are achievable&#xD;
with hybrid environments despite the GPU communication&#xD;
latency and memory coherence.</itunes:summary>
    </item>
  </channel>
</rss>

