Improving multithreading performance for clustered VLIW architectures.

Gupta, Manoj

doi:10.5821/dissertation-2117-95098

dc.contributor	Sánchez Carracedo, Fermín
dc.contributor	Llosa Espuny, José Francisco
dc.contributor.author	Gupta, Manoj
dc.contributor.other	Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors
dc.date.accessioned	2014-01-31T13:01:04Z
dc.date.available	2014-01-31T13:01:04Z
dc.date.issued	2013-06-14
dc.identifier.citation	Gupta, M. Improving multithreading performance for clustered VLIW architectures. Tesi doctoral, UPC, Departament d'Arquitectura de Computadors, 2013. DOI 10.5821/dissertation-2117-95098.
dc.identifier.uri	http://hdl.handle.net/2117/95098
dc.description.abstract	Very Long Instruction Word (VLIW) processors are very popular in the embedded and mobile computing domains. Uses of VLIW processors range from Digital Signal Processors (DSPs) found in a plethora of communication and multimedia devices to Graphics Processing Units (GPUs) used in gaming and high-performance computing devices. The advantage of VLIWs is their low-complexity, low-power design, which enables high performance at a low cost. Scalability of VLIWs is limited by the scalability of register file ports. It is not viable to have a VLIW processor with a single large register file because of the area and power consumption implications of the register file. Clustered VLIWs solve the register file scalability issue by partitioning the register file across multiple clusters, each with a set of functional units attached to the register file of that cluster. Using a clustered approach, a higher issue width can be achieved while keeping the cost of the register file within reasonable limits. Several commercial VLIW processors have been designed using the clustered VLIW model. VLIW processors can be used to run a large set of applications. Many of these applications have good Instruction Level Parallelism (ILP), which can be efficiently utilized. However, several applications, especially those dominated by control code, do not exhibit good ILP, and the processor is underutilized. Cache misses are another major source of resource underutilization. Multithreading is a popular technique to improve processor utilization. Interleaved MultiThreading (IMT) hides cache miss latencies by scheduling a different thread each cycle but cannot fill unused instruction slots. Simultaneous MultiThreading (SMT) can also remove ILP underutilization by issuing multiple threads to fill the empty instruction slots. However, SMT has a higher implementation cost than IMT.
The thesis presents Cluster-level Simultaneous MultiThreading (CSMT), which supports a limited form of SMT where VLIW instructions from different threads are merged at cluster-level granularity. This lowers the hardware implementation cost to a level comparable to the cheap IMT technique. The more complex SMT combines VLIW instructions at individual operation-level granularity, which is quite expensive, especially for a mobile solution. We refer to SMT at operation level as OpSMT to reduce ambiguity. While previous studies restricted OpSMT on a VLIW to 2 threads, CSMT has better scalability, and up to 8 threads can be supported at a reasonable cost. The thesis proposes several other techniques to further improve CSMT performance. In particular, cluster renaming remaps the clusters used by instructions of different threads to reduce resource conflicts. Cluster renaming is quite effective in reducing issue-slot underutilization and significantly improves CSMT performance. The thesis also proposes: a hybrid between IMT and CSMT, which increases the number of supported threads; heterogeneous instruction merging, where some instructions are combined using SMT and the rest using CSMT; and finally, split-issue, a technique that allows an instruction to be launched partially, making it easier to combine with others.
dc.format.extent	173 p.
dc.language.iso	eng
dc.publisher	Universitat Politècnica de Catalunya
dc.rights	Access to the contents of this thesis is subject to acceptance of the terms of use established by the following Creative Commons license: http://creativecommons.org/licenses/by-nc/3.0/es/
dc.rights.uri	http://creativecommons.org/licenses/by-nc/3.0/es/
dc.source	TDX (Tesis Doctorals en Xarxa)
dc.subject	UPC subject areas::Computer science
dc.title	Improving multithreading performance for clustered VLIW architectures.
dc.type	Doctoral thesis
dc.subject.lemac	Microprocessors
dc.identifier.doi	10.5821/dissertation-2117-95098
dc.rights.access	Open Access
dc.description.version	Postprint (published version)
dc.identifier.tdx	http://hdl.handle.net/10803/129518

Files in this item

Name: TMG1de1.pdf
Size: 1.105Mb
Format: PDF

This item appears in the following collections

Departament d'Arquitectura de Computadors [361]
All theses [5,461]

UPCommons. Open knowledge portal of the UPC
