Efficient hardware/software co-designed schemes for low-power processors
Collaborator: Gibert Codina, Enric; Latorre Salinas, Fernando; Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors
Document type: Doctoral thesis
Publisher: Universitat Politècnica de Catalunya
Rights access: Open Access
We are reaching a point where further improvements in single-thread performance can only be achieved at the expense of significantly higher power consumption. Thus, multi-core chips have been adopted by industry and the scientific community as a proven solution to improve performance with limited power consumption. However, the number of units that can be integrated into a single die is limited by its area and power restrictions, and therefore the thread-level parallelism (TLP) that can be exploited is also limited. One way to keep increasing the number of cores is to reduce the complexity of each individual core, at the cost of sacrificing instruction-level parallelism (ILP). We face a design trade-off here: dedicate the total available die area to many simple cores and favor TLP, or dedicate it to fewer, more complex cores and favor ILP. Among the different solutions already studied in the literature to deal with this challenge, we selected hybrid hardware/software co-designed processors. This solution provides high single-thread performance on simple low-power cores through a software dynamic binary optimizer tightly coupled with the underlying hardware. For this reason, we believe that hardware/software co-designed processors are an area that deserves special attention in the design of multi-core systems, since they allow implementing multiple simple cores suitable for maximizing TLP while sustaining better ILP than conventional pure hardware approaches. In particular, this thesis explores three different techniques to address some of the most relevant challenges in the design of a simple low-power hardware/software co-designed processor.

The first technique is a profiling mechanism, named the LIU Profiler, that detects hot code regions. It consists of a small hardware table that uses a novel replacement policy aimed at detecting hot code. This simple hardware structure allows the software to apply heuristics when building code regions and applying optimizations. The LIU Profiler achieves 85.5% code coverage detection, whereas similar profilers implementing traditional replacement policies reach up to 60% coverage while requiring a 4x larger table. Moreover, the LIU Profiler increases the total area of a simple low-power processor by only 1% and consumes less than 0.87% of the total processor power. The LIU Profiler thus enables improvements in single-thread performance without significantly increasing the area and power of the processor.

The second technique is a rollback scheme aimed at supporting code reordering and aggressive speculative optimizations on hot code regions. It is named HRC and combines software and hardware mechanisms to checkpoint and recover the architectural register state of the processor. Compared with pure hardware solutions that require doubling the number of registers, the proposal reduces the area of the processor by 11% and the register file power consumption by 24.4%, at the cost of degrading performance by only 1%.

The third technique is a loop parallelization (LP) scheme that uses the software layer to dynamically detect loops and to prepare them to execute multiple iterations in parallel using Simultaneous Multi-Threading threads. The loops are optimized with dedicated loop parallelization binary optimizations to speed up their execution. The LP scheme uses novel fine-grain register communication and dynamic thread register binding techniques, as well as already existing processor resources. It introduces small overheads to the system, so even small loops and loops that iterate just a few times obtain significant performance improvements. The execution time of the loops improves by more than 16.5% compared to a fully optimized baseline. LP contributes positively to the integration of a high number of simple cores in the same die, and it allows those cores to cooperate to some extent to continue exploiting ILP when necessary.