Conference papers/communications

http://hdl.handle.net/2117/3114
2024-04-18T05:55:33Z

DNA-TEQ: an adaptive exponential quantization of tensors for DNN inference
http://hdl.handle.net/2117/406362
Khabbazan, Bahareh; Riera Villanueva, Marc; González Colás, Antonio María
Quantization is commonly used in Deep Neural Networks (DNNs) to reduce storage and computational complexity by decreasing the arithmetic precision of activations and weights, a.k.a. tensors. Efficient hardware architectures employ linear quantization to enable the deployment of recent DNNs onto embedded systems and mobile devices. However, linear uniform quantization cannot usually reduce the numerical precision to less than 8 bits without sacrificing model accuracy. The accuracy loss is due to the fact that tensors do not follow uniform distributions. In this paper, we show that a significant number of tensors fit an exponential distribution. Then, we propose DNA-TEQ to exponentially quantize DNN tensors with an adaptive scheme that achieves the best trade-off between numerical precision and accuracy loss. The experimental results show that DNA-TEQ provides a much lower quantization bit-width compared to previous proposals, resulting in an average compression ratio of 40% over the linear INT8 baseline, with negligible accuracy loss and without retraining the DNNs. Besides, DNA-TEQ leads the way in performing dot-product operations in the exponential domain. On average for a set of widely used DNNs, DNA-TEQ provides 1.5x speedup and 2.5x energy savings over a baseline DNN accelerator based on 3D-stacked memory.
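The idea of quantizing in the exponential domain can be illustrated with a minimal sketch. It is not the DNA-TEQ scheme itself: the function names, the fixed `base`, and the fixed `bits` are assumptions chosen for illustration, whereas the abstract describes an adaptive choice of parameters. Each value is stored as a sign and an integer exponent, so a product of two quantized values reduces to adding their exponents.

```python
import math

def exp_quantize(x, base, bits):
    """Return (sign, k) so that x is approximated by sign * base**k.

    Hypothetical helper: k is a signed integer that fits in `bits` bits.
    """
    if x == 0.0:
        return 0, 0                        # sign 0 encodes an exact zero
    k_min, k_max = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    k = round(math.log(abs(x), base))      # nearest integer exponent
    k = max(k_min, min(k_max, k))          # clamp to the representable range
    return (1 if x > 0 else -1), k

def exp_dequantize(sign, k, base):
    """Recover the approximate real value from its (sign, k) code."""
    return sign * base ** k

def exp_dot(qa, qw, base):
    """Dot product of two quantized vectors; each product adds exponents."""
    return sum(sa * sw * base ** (ka + kw)
               for (sa, ka), (sw, kw) in zip(qa, qw))
```

With `base=2.0`, the vectors `[8.0, -0.25]` and `[2.0, 4.0]` are represented exactly, so `exp_dot` returns the same value as the ordinary dot product. Values outside the exponent range are clamped, which is where the accuracy loss of such a scheme would come from.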
2024-04-11T08:39:24Z

δLTA: Decoupling camera sampling from processing to avoid redundant computations in the vision pipeline
http://hdl.handle.net/2117/403987
Taranco Serna, Raúl; Arnau Montañés, José María; González Colás, Antonio María
Continuous Vision (CV) systems are essential for emerging applications like Autonomous Driving (AD) and Augmented/Virtual Reality (AR/VR).
A standard CV System-on-a-Chip (SoC) pipeline includes a frontend for image capture and a backend for executing vision algorithms. The frontend typically captures successive similar images with gradual positional and orientational variations. As a result, many regions between consecutive frames yield nearly identical results when processed in the backend. Despite this, current systems process every image region at the camera’s sampling rate, overlooking the fact that the actual rate of change in these regions could be significantly lower. In this work, we introduce δLTA (δon’t Look Twice, it’s Alright), a novel frontend that decouples camera frame sampling from backend processing by extending the camera with the ability to discard redundant image regions before they enter subsequent CV pipeline stages. δLTA informs the backend about the image regions that have notably changed, allowing it to focus solely on processing these distinctive areas and to reuse previous results to approximate the outcome for similar ones. As a result, the backend processes each image region at a different rate based on its temporal variation. δLTA features a new Image Signal Processing (ISP) design providing similarity filtering functionality, seamlessly integrated with other ISP stages to incur zero latency overhead in the worst-case scenario. It also offers an interface for frontend-backend collaboration to fine-tune similarity filtering based on the application requirements. To illustrate the benefits of this novel approach, we apply it to a state-of-the-art CV localization application, typically employed in AD and AR/VR. We show that δLTA removes a significant fraction of unneeded frontend and backend memory accesses and redundant backend computations, which reduces the application latency by 15.22% and its energy consumption by 17%.
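The similarity filtering described above can be pictured with a small software sketch. It is only an assumption-laden illustration, not the δLTA ISP design: the tile size, the mean-absolute-difference metric, the `threshold` parameter, and the function name `changed_tiles` are all hypothetical. Frames are grayscale images given as lists of rows, and only the tiles that differ enough from the previous frame are reported for backend processing.

```python
def changed_tiles(prev, curr, tile, threshold):
    """Return (tile_row, tile_col) indices of tiles that changed notably.

    A tile counts as changed when the mean absolute pixel difference
    between `prev` and `curr` exceeds `threshold` (assumed metric).
    """
    height, width = len(curr), len(curr[0])
    changed = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            rows = range(top, min(top + tile, height))
            cols = range(left, min(left + tile, width))
            diff = sum(abs(curr[r][c] - prev[r][c])
                       for r in rows for c in cols)
            if diff / (len(rows) * len(cols)) > threshold:
                changed.append((top // tile, left // tile))
    return changed
```

A backend following this idea would recompute results only for the returned tiles and reuse its previous results for the rest, so raising `threshold` trades accuracy for fewer computations.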
2024-03-08T10:23:51Z

SLIDEX: Sliding window extension for image processing
http://hdl.handle.net/2117/403983
Taranco Serna, Raúl; Arnau Montañés, José María; González Colás, Antonio María
With the rising need for efficient image processing in emerging applications such as Autonomous Driving (AD) and Augmented/Virtual Reality (AR/VR), many existing solutions do not meet their performance and energy efficiency requirements or are domain-specific and lack generality. In this work, we introduce SLIDEX, a novel ISA extension that leverages Sliding Window Processing (SWP) to bridge this gap in the image processing domain. SWP is a novel SIMD model that exposes to the programmer and natively manipulates vector registers as groups of overlapped windows of pixels to exploit the sliding-window dataflow found in convolutions and other stencil operations. SWP amplifies the available Data Level Parallelism (DLP) and reduces memory and register file accesses. We evaluated SLIDEX benefits in the critical image processing task of a state-of-the-art visual localization system widely used in AD and AR/VR. SLIDEX obtains an ~1.2× overall speedup and 22% energy reduction.
2024-03-08T10:07:02Z

QeiHaN: An energy-efficient DNN accelerator that leverages log quantization in NDP architectures
http://hdl.handle.net/2117/403916
Khabbazan, Bahareh; Riera Villanueva, Marc; González Colás, Antonio María
The constant growth of DNNs makes them challenging to implement and run efficiently on traditional compute-centric architectures. Some works have attempted to enhance accelerators by adding more compute units and on-chip buffers, but they often worsen the memory issue due to increased bandwidth demands. Memory-centric designs based on Near-Data Processing (NDP) have been proposed to mitigate this problem by moving computations closer to the memory hierarchy. Leveraging 3D-stacked memory for its storage density and near-memory processing capabilities, this paper introduces QeiHaN, a hardware accelerator that optimizes DNN inference efficiency. QeiHaN employs a 3D-stacked memory-centric weight storage scheme combined with a logarithmic quantization of activations, reducing memory accesses by 25%. Evaluation demonstrates significant speedup and energy savings compared to a Neurocube-like accelerator across various DNNs.
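Why logarithmic quantization of activations is attractive for hardware can be shown with a short sketch. This is a generic power-of-two illustration rather than the QeiHaN design: the function names and the exponent range are assumptions. Once an activation is approximated by a power of two, multiplying it by an integer weight needs only a bit shift.

```python
import math

def log2_quantize(a, min_exp=-8, max_exp=7):
    """Return an integer exponent e so that a (a > 0) is close to 2**e.

    The exponent range is an assumed 4-bit signed range; zero
    activations would need a separate code and are not handled here.
    """
    e = round(math.log2(a))
    return max(min_exp, min(max_exp, e))

def shift_multiply(weight, e):
    """Multiply an integer weight by 2**e using shifts only.

    Right shifts floor the result, so negative exponents lose low bits.
    """
    return weight << e if e >= 0 else weight >> -e
```

For example, an activation of 6.0 is rounded to the exponent 3 (that is, 8), so a weight of 5 contributes `5 << 3`. The rounding error of the exponent is the price paid for replacing the multiplier with a shifter.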
2024-03-07T10:44:58Z

Boustrophedonic frames: Quasi-optimal L2 caching for textures in GPUs
http://hdl.handle.net/2117/403438
Joseph, Diya; Aragón Alcaraz, Juan Luis; Parcerisa Bundó, Joan Manuel; González Colás, Antonio María
The literature is rich in works exploiting cache locality for GPUs. The majority of them explore replacement or bypassing policies. In this paper, however, we go beyond this exploration by constructing a formal proof for a no-overhead quasi-optimal technique for caching textures in graphics workloads. Textures make up a significant part of main memory traffic in mobile GPUs, which contributes to the total GPU energy consumption. Since texture accesses use a shared L2 cache, improving the L2 texture caching efficiency would decrease main memory traffic, thus improving energy efficiency, which is crucial for mobile GPUs.
Our proposal reaches quasi-optimality by exploiting the frame-to-frame reuse of textures in graphics. We do this by traversing frames in a boustrophedonic manner w.r.t. the frame-to-frame tile order. (Boustrophedon is a style of writing in which alternate lines of writing are reversed in order. This is in contrast to most modern languages, where the order of lines is the same, usually left-to-right.) We first approximate the texture access trace to a circular trace and then develop a formal proof that our proposal is optimal for such traces. We also complement the proof with empirical data that demonstrates the quasi-optimality of our no-cost proposal.
2024-02-29T11:06:15Z

Exploiting kernel compression on BNNs
http://hdl.handle.net/2117/401585
Silfa Feliz, Franyell Antonio; Arnau Montañés, José María; González Colás, Antonio María
Binary Neural Networks (BNNs) are showing tremendous success on realistic image classification tasks. Notably, their accuracy is similar to the state-of-the-art accuracy obtained by full-precision models tailored to edge devices. In this regard, BNNs are very amenable to edge devices since they use 1 bit to store each input and weight, and thus, their storage requirements are low. Moreover, BNN computations are mainly done using XNOR and popcount operations, which are implemented very efficiently using simple hardware structures. Nonetheless, supporting BNNs efficiently on mobile CPUs is far from trivial since their benefits are hindered by frequent memory accesses to load weights and inputs. In BNNs, a weight or an input is stored using one bit, and to increase storage and computation efficiency, several of them are packed together as a sequence of bits. In this work, we observe that the number of unique sequences representing a set of weights or inputs is typically low (i.e., 512). Also, we have seen that during the evaluation of a BNN layer, a small group of unique sequences is employed more frequently than others. Accordingly, we propose exploiting this observation by using Huffman Encoding to encode the bit sequences and then using an indirection table to decode them during the BNN evaluation. Also, we propose a clustering-based scheme to identify the most common sequences of bits and replace the less common ones with some similar common sequences. As a result, we decrease the storage requirements and memory accesses since the most common sequences are encoded with fewer bits.
In this work, we extend a mobile CPU by adding a small hardware structure that can efficiently cache and decode the compressed sequence of bits. We evaluate our scheme using the ReActNet model with the ImageNet dataset on an ARM CPU. Our experimental results show that our technique can reduce memory requirement by 1.32x and improve performance by 1.35x.
2024-02-09T11:50:14Z

K-D Bonsai: ISA-extensions to compress K-D trees for autonomous driving tasks
http://hdl.handle.net/2117/400227
Exenberger Becker, Pedro Henrique; Arnau Montañés, José María; González Colás, Antonio María
Autonomous Driving (AD) systems extensively manipulate 3D point clouds for object detection and vehicle localization. Therefore, efficient processing of 3D point clouds is crucial in these systems. In this work we propose K-D Bonsai, a technique to cut down memory usage during radius search, a critical building block of point cloud processing. K-D Bonsai exploits value similarity in the data structure that holds the point cloud (a k-d tree) to compress the data in memory. K-D Bonsai further compresses the data using a reduced floating-point representation, exploiting the physically limited range of point cloud values. For easy integration into today's systems, we implement K-D Bonsai through Bonsai-extensions, a small set of new CPU instructions to compress, decompress, and operate on points. To maintain baseline safety levels, we carefully craft the Bonsai-extensions to detect precision loss due to compression, allowing re-computation in full precision to take place if necessary. Therefore, K-D Bonsai reduces data movement, improving performance and energy efficiency, while guaranteeing baseline accuracy and programmability.
We evaluate K-D Bonsai over the Euclidean cluster task of Autoware.ai, a state-of-the-art software stack for AD. We achieve an average of 9.26% improvement in end-to-end latency, 12.19% in tail latency, and a reduction of 10.84% in energy consumption. Unlike the expensive accelerators proposed in related work, K-D Bonsai improves radius search with minimal area increase (0.36%).
2024-01-25T09:33:42Z

Lightweight register file caching in collector units for GPUs
http://hdl.handle.net/2117/390429
Abaie Shoushtary, Mojtaba; Arnau Montañés, José María; Tubella Murgadas, Jordi; González Colás, Antonio María
Modern GPUs benefit from a sizable Register File (RF) to provide fine-grained thread switching. As the RF is huge and accessed frequently, it consumes a considerable share of the dynamic energy of the GPU. Designing a large, high-throughput RF with low energy consumption and area for GPUs is challenging. In this paper, an energy-efficient hierarchical RF design for GPUs, called Malekeh, is introduced. Malekeh keeps registers in energy-efficient small caches and maximizes cache efficacy by using lightweight policies and supporting adaptive algorithms. The policies’ effectiveness is improved by leveraging register reuse distance information provided by the compiler as a hint. Malekeh reduces the RF reads by 48.5% and dynamic energy by 29.1%. It also improves performance by 9.6% with a negligible area overhead of 0.04%.
2023-07-06T11:15:06Z

Simple out of order core for GPGPUs
http://hdl.handle.net/2117/389957
Huerta Gañán, Rodrigo; Arnau Montañés, José María; González Colás, Antonio María
GPU architectures have become popular for executing general-purpose programs, which rely on having a large number of threads that run concurrently to hide the latency among dependent instructions. This approach has a significant cost: low data locality, due to the increased pressure that the many concurrent threads place on the memory hierarchy, and the extra cost of storing and managing the on-chip state of those threads. This paper presents SOCGPU (Simple Out-of-order Core for GPU), a simple out-of-order execution mechanism that requires neither register renaming nor scoreboards. It uses a small Instruction Buffer and a tiny Dependence matrix to keep track of dependencies among instructions and avoid data hazards. Evaluations for an Nvidia Tesla V100-like GPU show that SOCGPU provides a speedup of up to 2.3× in some machine learning programs and 1.38× on average for a variety of benchmarks, while it reduces energy consumption by 6.5%, with only 2.4% area overhead.
2023-06-29T12:05:36Z

Sliding window support for image processing in autonomous vehicles
http://hdl.handle.net/2117/377029
Taranco Serna, Raúl; Arnau Montañés, José María; González Colás, Antonio María
Camera-based autonomous driving extensively manipulates images for object detection, object tracking, or camera-based localization tasks. Therefore, efficient and fast image processing is crucial in those systems. Unfortunately, current solutions either do not meet AD’s constraints for real-time performance and energy efficiency or are domain-specific and, thus, not general [14]. In this work, we introduce Sliding Window Processing (SWP), a SIMD execution model that natively operates on sliding windows of image pixels. We illustrate the benefits of SWP through a novel ISA extension called SLIDEX that achieves high performance and energy efficiency while maintaining programmability. We demonstrate the benefits of SLIDEX for the image processing tasks of ORB-SLAM [17] [18], a state-of-the-art camera-based localization system.
SLIDEX achieves an average end-to-end speedup of ~1.65× and ~1.2× compared to equivalent scalar and vector baselines, respectively. Compared with the vector implementation, our solution reduces the end-to-end energy consumption by 22% on average.
2022-11-24T08:56:39Z
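The sliding-window dataflow that SWP targets can be sketched in plain software. This is only an illustration of the access pattern, not of the SLIDEX instructions: the function names are hypothetical and the example works on a single image row. Consecutive windows overlap in all but one pixel, which is the reuse a sliding-window execution model can exploit.

```python
def sliding_windows(row, width):
    """Return every window of `width` consecutive pixels, one-pixel stride."""
    return [row[i:i + width] for i in range(len(row) - width + 1)]

def convolve_row(row, kernel):
    """Apply a 1D kernel to each window (valid positions only)."""
    return [sum(p * k for p, k in zip(window, kernel))
            for window in sliding_windows(row, len(kernel))]
```

For the row `[1, 2, 3, 4, 5]` and a window width of 3, the windows are `[1, 2, 3]`, `[2, 3, 4]`, and `[3, 4, 5]`. A scalar or conventional vector implementation reloads the shared pixels for each window, whereas the abstract describes vector registers that hold the overlapped windows directly.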