Decoupled vector architectures
Document typeConference report
PublisherInstitute of Electrical and Electronics Engineers (IEEE)
Rights accessOpen Access
The purpose of this paper is to show that using decoupling techniques in a vector processor, the performance of vector programs can be greatly improved. Using a trace driven approach, we simulate a selection of the Perfect Club programs and compare their execution time on a conventional vector architecture and on a decoupled vector architecture. Decoupling provides a performance advantage of more than a factor of two for realistic memory latencies, and even with an ideal memory system with no latency, there is still a speedup of as much as 50%. A bypassing technique between the load/store queues is introduced and we show how it can give up to an extra speedup of 22% while also reducing total memory traffic by an average of 20%. An important part of this paper is devoted to study the tradeoffs involved in choosing an adequate size for the different queues of the architecture, so that the hardware cost of the queues can be minimized while still retaining most of the performance advantages of decoupling.
CitationEspasa, R., Valero, M. Decoupled vector architectures. A: International Symposium on High-Performance Computer Architecture. "Second International Symposium on High-Performance Computer Architecture: February 3-7, 1996, San Jose, California: proceedings". San Jose, California: Institute of Electrical and Electronics Engineers (IEEE), 1996, p. 281-290.
All rights reserved. This work is protected by the corresponding intellectual and industrial property rights. Without prejudice to any existing legal exemptions, reproduction, distribution, public communication or transformation of this work are prohibited without permission of the copyright holder