Investigating memory prefetcher performance over parallel applications: from real to simulated
Cite as: hdl:2117/384135

Document type: Conference report
Defense date: 2022-05
Publisher: Barcelona Supercomputing Center
Rights access: Open Access
Except where otherwise noted, content on this work is licensed under a Creative Commons license: Attribution-NonCommercial-NoDerivs 4.0 International.
Abstract
In recent years, there have been significant advances in the performance of processors, exemplified by the reduction of transistor size and the increase in the number of cores per processor. Conversely, the memory subsystem did not advance as significantly as processors, becoming unable to deliver data at the required rate and creating what is known as the memory wall [1]. One technology used to mitigate memory latency is the prefetcher, a mechanism that identifies access patterns from each core, creates speculative memory requests, and fetches potentially useful data into the cache ahead of time. In High-Performance Computing (HPC) systems, further problems arise with parallelism. Since HPC applications are highly parallel, with many threads communicating with one another mainly through shared memory, it becomes necessary to keep data coherent across the several cache levels. Moreover, the memory interactions among different threads may also unpredictably change the data path through the memory hierarchy. When the memory hierarchy's complexity is combined with prefetcher action, the behavior of the processor's memory subsystem reaches a new level of complexity. In this work, we seek to shed light on how the prefetcher affects the processing performance of parallel HPC applications, and how accurately state-of-the-art multicore architecture simulators simulate the execution of such applications, with and without a prefetcher. We find that an L2 cache prefetcher is more efficient than an L1 prefetcher, since avoiding excessive L3 cache accesses contributes more to performance than avoiding L2 cache accesses. Moreover, we show evidence that the prefetchers' contribution to performance is limited by the memory contention that emerges as the level of parallelism increases.
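The pattern-detection idea the abstract describes — watching each core's accesses, recognizing a regular pattern, and issuing speculative fetches — can be illustrated with a minimal stride-prefetcher sketch. This is an illustrative model only, not the specific hardware prefetchers evaluated in the paper; the table layout, confidence threshold, and names are assumptions.

```python
class StridePrefetcher:
    """Toy model of stride-based prefetching: per load instruction (keyed by
    its program counter), track the last address seen and the stride between
    consecutive accesses; once the same stride repeats, speculatively request
    the next address. Purely illustrative, not the paper's hardware."""

    def __init__(self):
        # pc -> (last_addr, stride, confidence)
        self.table = {}

    def access(self, pc, addr):
        """Record a demand access; return an address to prefetch, or None."""
        if pc not in self.table:
            self.table[pc] = (addr, 0, 0)
            return None
        last_addr, stride, conf = self.table[pc]
        new_stride = addr - last_addr
        if new_stride == stride and stride != 0:
            conf = min(conf + 1, 3)   # same stride seen again: gain confidence
        else:
            conf = 0                  # pattern broken: reset
        self.table[pc] = (addr, new_stride, conf)
        if conf >= 1:
            return addr + new_stride  # speculative request for the next line
        return None


# Example: one load instruction streaming through an array,
# advancing one 64-byte cache line per iteration.
pf = StridePrefetcher()
for addr in range(0, 512, 64):
    hint = pf.access(pc=0x400123, addr=addr)  # hint becomes addr+64 once the stride is confirmed
```

After two accesses establish the 64-byte stride, every further access yields a prefetch hint one cache line ahead — which is why such a prefetcher helps streaming loops but contributes nothing (or pollutes the cache) on irregular access patterns.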
Citation: Girelli, V.S. [et al.]. Investigating memory prefetcher performance over parallel applications: from real to simulated. Barcelona Supercomputing Center, 2022, p. 93-94.
| Files | Description | Size | Format | View |
|---|---|---|---|---|
| 9BSCDS_42_Investigating Memory Prefetcher.pdf | | 745,8Kb | PDF | View/Open |