Performance analysis and optimization of automatic speech recognition

Tabani, Hamid; Arnau Montañés, José María; Tubella Murgadas, Jordi; González Colás, Antonio María

doi:10.1109/TMSCS.2017.2739158

Document type: Article

Publication date: 2018-10-01

Access conditions: Open access

All rights reserved. This work is protected by the corresponding intellectual and industrial property rights. Without prejudice to existing legal exemptions, its reproduction, distribution, public communication or transformation is prohibited without the authorization of the rights holder.

Project: MICROARQUITECTURA Y COMPILADORES PARA FUTUROS PROCESADORES III (MINECO-TIN2013-44375-R)

Abstract

Fast and accurate Automatic Speech Recognition (ASR) is emerging as a key application for mobile devices. Delivering ASR on such devices is challenging due to the compute-intensive nature of the problem and the power constraints of embedded systems. In this paper, we provide a performance and energy characterization of Pocketsphinx, a popular toolset for ASR that targets mobile devices. We identify the computation of the Gaussian Mixture Model (GMM) as the main bottleneck, consuming more than 80 percent of the execution time. The CPI stack analysis shows that branches and main memory accesses are the main performance-limiting factors for GMM computation. We propose several software-level optimizations driven by the power/performance analysis. Unlike previous proposals that trade accuracy for performance by reducing the number of Gaussians evaluated, we maintain accuracy and improve performance by effectively using the underlying CPU microarchitecture. First, we use a refactored implementation of the innermost loop of the GMM evaluation code to ameliorate the impact of branches. Second, we exploit the vector unit available on most modern CPUs to boost GMM computation, introducing a novel memory layout for storing the means and variances of the Gaussians in order to maximize the effectiveness of vectorization. Third, we compute the Gaussians for multiple frames in parallel, so means and variances can be fetched once in the on-chip caches and reused across multiple frames, significantly reducing memory bandwidth usage. We evaluate our optimizations using both hardware counters on real CPUs and simulations. Our experimental results show that the proposed optimizations provide 2.68x speedup over the baseline Pocketsphinx decoder on a high-end Intel Skylake CPU, while achieving 61 percent energy savings. On a modern ARM Cortex-A57 mobile processor our techniques improve performance by 1.85x, while providing 59 percent energy savings without any loss in the accuracy of the ASR system.
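
The third optimization in the abstract, evaluating the Gaussians for several frames together so that each Gaussian's means and variances are read once and reused, can be illustrated with a small loop-reordering sketch. This is not the authors' code or the Pocketsphinx implementation: the function names (`log_gaussian`, `score_frame_by_frame`, `score_batched`, `make_gaussian`), the tuple layout for a Gaussian, and the `batch` parameter are all assumptions made for illustration, and diagonal covariances are assumed.

```python
import math


def make_gaussian(mean, var):
    """Precompute the terms used by log_gaussian (hypothetical helper).

    Assumes a diagonal covariance given as a per-dimension variance list.
    """
    inv_var2 = [1.0 / (2.0 * v) for v in var]
    log_norm = -0.5 * sum(math.log(2.0 * math.pi * v) for v in var)
    return (mean, inv_var2, log_norm)


def log_gaussian(frame, mean, inv_var2, log_norm):
    """Log-density of one diagonal-covariance Gaussian for one feature frame."""
    acc = log_norm
    for x, m, iv in zip(frame, mean, inv_var2):
        diff = x - m
        acc -= diff * diff * iv
    return acc


def score_frame_by_frame(frames, gaussians):
    """Baseline order: each frame walks over every Gaussian's parameters."""
    return [[log_gaussian(f, *g) for g in gaussians] for f in frames]


def score_batched(frames, gaussians, batch=4):
    """Batched order: each Gaussian's parameters are visited once per
    batch of frames and reused for every frame in that batch."""
    scores = [[0.0] * len(gaussians) for _ in frames]
    for start in range(0, len(frames), batch):
        block = range(start, min(start + batch, len(frames)))
        for g_idx, (mean, inv_var2, log_norm) in enumerate(gaussians):
            for t in block:
                scores[t][g_idx] = log_gaussian(
                    frames[t], mean, inv_var2, log_norm
                )
    return scores
```

Both functions produce the same scores; only the traversal order differs. In pure Python the reordering has no cache benefit of its own. It only shows the access pattern that, in a compiled implementation, would let the parameters stay in on-chip caches across the frames of a batch.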

Description

© 2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Citation: Tabani, H. [et al.]. Performance analysis and optimization of automatic speech recognition. "IEEE Transactions on Multi-Scale Computing Systems", 1 October 2018, vol. 4, no. 4, p. 847-860.

URI: http://hdl.handle.net/2117/128336

DOI: 10.1109/TMSCS.2017.2739158

ISSN: 2332-7766

Publisher version: https://ieeexplore.ieee.org/document/8010340

Files	Description	Size	Format
TMCS2018.pdf		2.078 MB	PDF

UPCommons. Open knowledge portal of the UPC