Practically tackling memory bottlenecks of graph-processing workloads
Cite as: hdl:2117/411973
Document type: Conference report
Defense date: 2024
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Rights access: Open Access
All rights reserved. This work is protected by the corresponding intellectual and industrial
property rights. Without prejudice to any existing legal exemptions, reproduction, distribution, public
communication or transformation of this work are prohibited without permission of the copyright holder
Project: BSC - COMPUTACION DE ALTAS PRESTACIONES VIII (AEI-PID2019-107255GB-C21)
REDES DE INTERCONEXION, ACELERADORES HARDWARE Y OPTIMIZACION DE APLICACIONES (AEI-PID2019-105660RB-C22)
EC-H2020-101036168
THE EUROPEAN PROCESSOR INITIATIVE (EPI) SGA2 (AEI-PCI2022-132935)
Abstract
Graph-processing workloads have become widespread due to their relevance to a wide range of application domains such as network analysis, path planning, bioinformatics, and machine learning. These workloads have massive data footprints that exceed cache storage capacity and exhibit highly irregular memory access patterns due to data-dependent graph traversals. This irregular behaviour causes graph-processing workloads to exhibit poor data locality, undermining their performance. This paper makes two fundamental observations on the memory access patterns of graph-processing workloads. First, conventional cache hierarchies become mostly useless when dealing with graph-processing workloads, since 78.6% of the accesses that miss in the L1 Data Cache (L1D) also miss in the L2 Cache (L2C) and in the Last Level Cache (LLC), requiring a DRAM access. Second, in the context of graph-processing workloads it is possible to predict whether a memory access will be served by DRAM by observing the strides between accesses triggered by instructions with the same Program Counter (PC). Our key insight is that bypassing the L2C and the LLC for highly irregular accesses significantly reduces their latency cost while also reducing pressure on the lower levels of the cache hierarchy.
Based on these observations, this paper proposes the Large Predictor (LP), a low-cost micro-architectural predictor capable of distinguishing between regular and irregular memory accesses. We propose to serve accesses tagged as regular by LP via the standard memory hierarchy, while irregular accesses are served via the Side Data Cache (SDC). The SDC is a private, per-core, set-associative cache placed alongside the L1D, specifically aimed at reducing the latency cost of highly irregular accesses while avoiding polluting the rest of the cache hierarchy with data that exhibits poor locality. SDC coupled with LP yields geometric-mean speed-ups of 20.3% and 20.2% in single- and multi-core scenarios, respectively, over an architecture featuring a conventional cache hierarchy across a set of contemporary graph-processing workloads. In addition, SDC combined with LP outperforms the Transpose-based Cache Replacement (T-OPT), the state-of-the-art cache replacement policy for graph-processing applications, by 10.9% and 13.8% in single-core and multi-core contexts, respectively. Regarding the hardware budget, SDC coupled with LP requires 10 KB of storage per core.
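As a rough illustration of the mechanism the abstract describes, the sketch below models a PC-indexed stride table that tags each load as regular or irregular and routes irregular loads away from the L2C/LLC toward a small side cache. The table size, indexing scheme, class and function names, and routing interface are illustrative assumptions, not the paper's actual design.

```cpp
// Illustrative sketch only: a per-PC stride table classifies each load as
// regular (its stride matches the stride previously seen for the same PC)
// or irregular, and a routing helper sends irregular loads to a side cache
// instead of the L2C/LLC. Sizes, names, and interfaces are assumptions.
#include <array>
#include <cstdint>
#include <cstdio>

struct StrideEntry {
    uint64_t last_addr   = 0;     // address of the previous access for this PC
    int64_t  last_stride = 0;     // stride between the two previous accesses
    bool     valid       = false;
};

class LargePredictorSketch {                 // loosely inspired by the LP idea
    static constexpr size_t kEntries = 256;  // assumed table size, PC-indexed
    std::array<StrideEntry, kEntries> table_{};

public:
    // Returns true when the access looks regular, i.e. its stride repeats.
    bool is_regular(uint64_t pc, uint64_t addr) {
        StrideEntry& e = table_[pc % kEntries];
        int64_t stride = static_cast<int64_t>(addr - e.last_addr);
        bool regular   = e.valid && stride == e.last_stride;
        e.last_stride  = stride;
        e.last_addr    = addr;
        e.valid        = true;
        return regular;
    }
};

enum class Path { StandardHierarchy, SideCache };

// Routing rule from the abstract: regular accesses take the usual
// L1D -> L2C -> LLC -> DRAM path, irregular ones skip the L2C/LLC and are
// looked up in the per-core side cache instead.
Path route(LargePredictorSketch& lp, uint64_t pc, uint64_t addr) {
    return lp.is_regular(pc, addr) ? Path::StandardHierarchy : Path::SideCache;
}

int main() {
    LargePredictorSketch lp;
    // A streaming pattern (constant stride) quickly becomes "regular", while
    // pointer-chasing-like addresses stay "irregular".
    uint64_t pc = 0x400123;
    for (uint64_t addr : {0x1000ULL, 0x1040ULL, 0x1080ULL, 0x9f37ULL, 0x10c0ULL}) {
        Path p = route(lp, pc, addr);
        std::printf("addr 0x%llx -> %s\n", static_cast<unsigned long long>(addr),
                    p == Path::StandardHierarchy ? "standard hierarchy" : "side cache");
    }
    return 0;
}
```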
Citation: Alexandre, J. [et al.]. Practically tackling memory bottlenecks of graph-processing workloads. In: IEEE International Parallel and Distributed Processing Symposium. "2024 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2024: 27–31 May 2024, San Francisco, California, USA: proceedings". Institute of Electrical and Electronics Engineers (IEEE), 2024, p. 1034-1045. ISBN 979-8-3503-3766-2. DOI 10.1109/IPDPS57955.2024.00096.
ISBN: 979-8-3503-3766-2
Publisher version: https://ieeexplore.ieee.org/abstract/document/10579233
Files | Size
---|---
IPDPS24_Paper.pdf | 672.7 KB