Automatic speech recognition (ASR) allows multimedia content to be transcribed from acoustic waveforms into word sequences. It is an exemplar of a class of machine learning applications where increasing compute capability is enabling new industries such as automatic speech analytics. Automatic speech analytics helps customer service call centers search through recorded content, track service quality, and provide early detection of service issues. Fast and efficient ASR enables the economical application of a plethora of text-based data analytics to multimedia content, opening the door to many possibilities.
In this chapter, we describe our approach for scalable parallelization of the most challenging component of ASR: the speech inference engine. This component accesses a large irregular data structure with millions of states and arcs representing a human speech model. A parallel implementation of the inference engine requires frequent synchronization, and its data working set and memory access patterns are unpredictable because they depend directly on the audio input.
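The traversal pattern behind this unpredictability can be illustrated with a toy example. The graph, costs, and `expand_frame` helper below are illustrative assumptions rather than the chapter's actual inference engine: each frame of processing propagates a set of active states along outgoing arcs and prunes with a beam, so the number of active states (the working set) and which arcs are touched depend entirely on the input.

```python
# Hypothetical sketch of one frame of graph traversal in an ASR-style
# inference engine; the names and numbers here are illustrative only.
from collections import defaultdict

# Arcs of a toy recognition network: state -> list of (next_state, arc_weight).
# A real speech model has millions of states and arcs in an irregular layout.
arcs = {
    0: [(1, 0.5), (2, 1.2)],
    1: [(2, 0.3), (3, 0.9)],
    2: [(3, 0.4)],
    3: [],
}

def expand_frame(active, acoustic_cost, beam=2.0):
    """Propagate active states one frame and prune around the best cost."""
    next_active = defaultdict(lambda: float("inf"))
    for state, cost in active.items():
        for nxt, weight in arcs[state]:
            new_cost = cost + weight + acoustic_cost(nxt)
            if new_cost < next_active[nxt]:   # keep the cheapest path per state
                next_active[nxt] = new_cost
    if not next_active:
        return {}
    best = min(next_active.values())
    # Beam pruning: the surviving set (the working set) is input-dependent.
    return {s: c for s, c in next_active.items() if c <= best + beam}

# Drive a few frames; a stand-in acoustic cost plays the role of the
# per-frame scores that a real engine computes from the audio.
active = {0: 0.0}
for frame in range(3):
    active = expand_frame(active, acoustic_cost=lambda s: 0.1 * s)
```

Because several source states can reach the same destination state in one frame, a parallel version of this loop must synchronize updates to `next_active`, which is one reason the chapter's inference engine requires frequent synchronization.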
We demonstrate that parallelizing an application involves much more than recoding the program in another language; it requires careful consideration of a series of concerns to successfully exploit the full parallelization potential of the application. Using our approach, we achieved more than 3.4x speedup on an Intel Core i7 multicore processor and more than 11x speedup on an NVIDIA GTX280 manycore processor for the ASR inference engine. This performance improvement opens up many opportunities for latency-limited as well as throughput-limited applications of automatic speech recognition.