Show simple item record

dc.contributor.authorJamet, Alexandre Valentin
dc.contributor.authorVavouliotis, Georgios
dc.contributor.authorJiménez, Daniel A.
dc.contributor.authorÁlvarez Martí, Lluc
dc.contributor.authorCasas, Marc
dc.contributor.otherUniversitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors
dc.contributor.otherBarcelona Supercomputing Center
dc.date.accessioned2024-07-18T08:47:12Z
dc.date.available2024-07-18T08:47:12Z
dc.date.issued2024
dc.identifier.citationJamet, A. [et al.]. Practically tackling memory bottlenecks of graph-processing workloads. A: IEEE International Parallel and Distributed Processing Symposium. "2024 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2024: 27–31 May 2024, San Francisco, California, USA: proceedings". Institute of Electrical and Electronics Engineers (IEEE), 2024, p. 1034-1045. ISBN 979-8-3503-3766-2. DOI 10.1109/IPDPS57955.2024.00096.
dc.identifier.isbn979-8-3503-3766-2
dc.identifier.urihttp://hdl.handle.net/2117/411973
dc.description.abstractGraph-processing workloads have become widespread due to their relevance to a wide range of application domains such as network analysis, path-planning, bioinformatics, and machine learning. Graph-processing workloads have massive data footprints that exceed cache storage capacity and exhibit highly irregular memory access patterns due to data-dependent graph traversals. This irregular behaviour causes graph-processing workloads to exhibit poor data locality, undermining their performance. This paper makes two fundamental observations on the memory access patterns of graph-processing workloads: First, conventional cache hierarchies become mostly useless when dealing with graph-processing workloads, since 78.6% of the accesses that miss in the L1 Data Cache (L1D) also miss in the L2 Cache (L2C) and in the Last Level Cache (LLC), requiring a DRAM access. Second, it is possible to predict whether a memory access will be served by DRAM or not in the context of graph-processing workloads by observing strides between accesses triggered by instructions with the same Program Counter (PC). Our key insight is that bypassing the L2C and the LLC for highly irregular accesses significantly reduces latency cost while also reducing pressure on the lower levels of the cache hierarchy. Based on these observations, this paper proposes the Large Predictor (LP), a low-cost micro-architectural predictor capable of distinguishing between regular and irregular memory accesses. We propose to serve accesses tagged as regular by LP via the standard memory hierarchy, while irregular accesses are served via the Side Data Cache (SDC). The SDC is a private per-core set-associative cache placed alongside the L1D specifically aimed at reducing the latency cost of highly irregular accesses while avoiding polluting the rest of the cache hierarchy with data that exhibits poor locality.
SDC coupled with LP yields geometric mean speed-ups of 20.3% and 20.2% on single- and multi-core scenarios, respectively, over an architecture featuring a conventional cache hierarchy across a set of contemporary graph-processing workloads. In addition, SDC combined with LP outperforms the Transpose-based Cache Replacement (T-OPT), the state-of-the-art cache replacement policy for graph-processing applications, by 10.9% and 13.8% on single-core and multi-core contexts, respectively. Regarding the hardware budget, SDC coupled with LP requires 10KB of storage per core.
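The PC-indexed stride observation that LP relies on can be sketched as follows. This is an illustrative behavioural model, not the paper's implementation: the table size, hashing, and the saturating confidence counter are assumptions made for this sketch.

```python
# Illustrative sketch of a PC-indexed stride classifier: accesses from a PC
# whose stride repeats are tagged "regular" (serve via the cache hierarchy);
# accesses with unstable strides are tagged "irregular" (candidates for a
# side cache). Table size and 2-bit counter threshold are assumptions.

class StrideClassifier:
    def __init__(self, entries=256):
        self.entries = entries
        # pc index -> (last address, last stride, confidence counter)
        self.table = {}

    def access(self, pc, addr):
        idx = pc % self.entries
        last_addr, last_stride, conf = self.table.get(idx, (addr, 0, 0))
        stride = addr - last_addr
        if stride == last_stride:
            conf = min(conf + 1, 3)   # saturating 2-bit counter
        else:
            conf = max(conf - 1, 0)
        self.table[idx] = (addr, stride, conf)
        return "regular" if conf >= 2 else "irregular"
```

Under this model, a streaming load walking an array at a fixed stride quickly trains to "regular", while the data-dependent pointer chasing typical of graph traversals never builds confidence and stays "irregular".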
dc.description.sponsorshipThis work has been partially supported by the European HiPEAC Network of Excellence, by the Spanish Ministry of Science and Innovation MCIN/AEI/10.13039/501100011033 (contracts PID2019-107255GB-C21 and PID2019-105660RB-C22) and by the Generalitat de Catalunya (contract 2021-SGR-00763). This work is supported by the National Science Foundation through grant CCF-1912617 and generous gifts from Intel. Marc Casas has been partially supported by the Grant RYC-2017-23269 funded by MCIN/AEI/10.13039/501100011033 and by ESF Investing in your future. The authors acknowledge the support of the Departament de Recerca i Universitats de la Generalitat de Catalunya to the Research Group "Performance understanding, analysis, and simulation/emulation of novel architectures" (Code: 2021 SGR 00865). This research has received funding from the European High Performance Computing Joint Undertaking (JU) under Framework Partnership Agreement No 800928 (European Processor Initiative) and Specific Grant Agreement No 101036168 (EPI SGA2). The JU receives support from the European Union's Horizon 2020 research and innovation programme and from Croatia, France, Germany, Greece, Italy, Netherlands, Portugal, Spain, Sweden, and Switzerland. The EPI-SGA2 project, PCI2022-132935, is also co-funded by MCIN/AEI/10.13039/501100011033 and by the EU NextGenerationEU/PRTR.
dc.format.extent12 p.
dc.language.isoeng
dc.publisherInstitute of Electrical and Electronics Engineers (IEEE)
dc.subjectÀrees temàtiques de la UPC::Informàtica::Arquitectura de computadors
dc.subject.lcshMemory management (Computer science)
dc.subject.lcshCache memory
dc.subject.lcshMachine learning
dc.subject.otherGraph processing
dc.subject.otherCache management
dc.subject.otherOff-chip prediction
dc.subject.otherMicro-architecture
dc.titlePractically tackling memory bottlenecks of graph-processing workloads
dc.typeConference report
dc.subject.lemacGestió de memòria (Informàtica)
dc.subject.lemacMemòria cau
dc.subject.lemacAprenentatge automàtic
dc.identifier.doi10.1109/IPDPS57955.2024.00096
dc.description.peerreviewedPeer Reviewed
dc.relation.publisherversionhttps://ieeexplore.ieee.org/abstract/document/10579233
dc.rights.accessOpen Access
local.identifier.drac39375509
dc.description.versionPostprint (author's final draft)
dc.relation.projectidinfo:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/PID2019-107255GB-C21/ES/BSC - COMPUTACION DE ALTAS PRESTACIONES VIII/
dc.relation.projectidinfo:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/PID2019-105660RB-C22/ES/REDES DE INTERCONEXION, ACELERADORES HARDWARE Y OPTIMIZACION DE APLICACIONES/
dc.relation.projectidinfo:eu-repo/grantAgreement/AEI//RYC-2017-23269
dc.relation.projectidinfo:eu-repo/grantAgreement/EC/H2020/101036168
dc.relation.projectidinfo:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2021-2023/PCI2022-132935/ES/THE EUROPEAN PROCESSOR INITIATIVE (EPI) SGA2/
local.citation.authorJamet, A.; Vavouliotis, G.; Jiménez, D. A.; Álvarez, L.; Casas, M.
local.citation.contributorIEEE International Parallel and Distributed Processing Symposium
local.citation.publicationName2024 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2024: 27–31 May 2024, San Francisco, California, USA: proceedings
local.citation.startingPage1034
local.citation.endingPage1045

