Voicebot Scaling Challenge: Throughput Leadership with CPU

Enterprises are exploring novel ways of providing stellar customer service, and Voicebots are delivering just that: high-quality customer service, available at any time, from anywhere. Gartner estimates that by 2020, 25% of customer service and support operations will integrate virtual customer assistant technology across the engagement channels of voice, chat, and email, and the Interactive Voice Response market is expected to reach USD 5.54 billion in value by 2023.

One of the first stages of any Voicebot deployment (and the most compute-intensive) is the Automatic Speech Recognition (ASR) process that converts speech to text. The open-source Kaldi Speech Recognition Toolkit powers the most widely used ASR services in enterprise deployments today, due to its versatility in handling diverse language models and telephony speech. That’s why we’ve focused on performance improvements for Kaldi ASR running on Intel Xeon Scalable processors, to help our customers implement Voicebots with real-time response capabilities in large-scale deployments. We call it the Voicebot Scaling Challenge.

Kaldi Speech Recognition Toolkit

The Kaldi toolkit is very popular in the research community and has become the default toolkit of choice for ASR. In a typical Kaldi ASR pipeline, the input audio signal or waveform is processed to extract a series of features: Mel-Frequency Cepstral Coefficients (MFCCs), normalized via Cepstral Mean and Variance Normalization (CMVN), represent the content of the audio, while i-vectors represent the style of the utterance or the speaker.
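
To make the front end concrete, here is a minimal sketch of MFCC extraction and CMVN in Python, using librosa and NumPy rather than Kaldi's own feature-extraction binaries; the file path and parameter values are illustrative, and i-vector extraction is omitted since it requires a trained extractor.

```python
import numpy as np
import librosa  # assumed available; Kaldi ships its own feature binaries

# Load a 16 kHz mono waveform ("utterance.wav" is a placeholder path).
waveform, sr = librosa.load("utterance.wav", sr=16000)

# MFCCs capture *what* is being said: 13 coefficients per 25 ms frame
# with a 10 ms hop, roughly matching common ASR front-end settings.
mfcc = librosa.feature.mfcc(
    y=waveform, sr=sr, n_mfcc=13,
    n_fft=int(0.025 * sr), hop_length=int(0.010 * sr),
)  # shape: (13, num_frames)

# CMVN normalizes each coefficient across the utterance, so the acoustic
# model sees features invariant to channel and recording gain.
mean = mfcc.mean(axis=1, keepdims=True)
std = mfcc.std(axis=1, keepdims=True)
features = (mfcc - mean) / (std + 1e-10)
```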

Figure 1: High-level Automatic Speech Recognition pipeline.

The acoustic model transcribes the extracted features into a sequence of context-dependent phonemes (units of sound that distinguish one word from another in a particular language). Kaldi supports both Gaussian Mixture Model (GMM)-based and Deep Neural Network (DNN)-based implementations for acoustic modeling. With advances in AI and deep learning, DNNs are increasingly replacing GMM-based implementations.

The language model decoder takes the phonemes and turns them into lattices (compact representations of the alternative word sequences that are likely for a particular stretch of audio). The decoding graph takes into account the grammar of the data, as well as the probabilities of specific contiguous word sequences (n-grams). In this benchmark, we used Kaldi’s standard WFST decoder implementation and compared it with Intel’s optimized decoder. We also used the nnet3-based Time Delay Neural Network (TDNN) models ASpIRE and Librispeech. The benchmark highlights acceleration options that can significantly boost inference performance.
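
As a toy illustration of the n-gram idea (not Kaldi's ARPA-format language model handling), a bigram model simply scores a word by how often it follows the previous word in training text:

```python
import math
from collections import Counter

# Toy corpus; production language models are trained on billions of words.
corpus = "call the bank call the doctor call a cab".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_logprob(prev, word):
    # P(word | prev) with add-one smoothing over the toy vocabulary.
    vocab_size = len(unigrams)
    return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))

# "the bank" scores higher than "the cab" because it was actually observed:
print(bigram_logprob("the", "bank"))  # ~ -1.39
print(bigram_logprob("the", "cab"))   # ~ -2.08
```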

Intel CPU Performance Optimizations for Kaldi

The entire Kaldi inference pipeline has been optimized for improved performance on Intel processors. Acoustic model optimizations have been covered in detail in earlier publications. The performance of these operations is improved using tools like the Intel Math Kernel Library (Intel MKL), which contains BLAS routines specifically optimized for Intel processors, and the Deep Neural Network Library (DNNL) for neural network primitives.

Kaldi Decoder Overview

The decoder takes the scores from acoustic modeling and maps them to lattices or text, based on the language model. The Kaldi toolkit uses Weighted Finite-State Transducer (WFST)-based decoding: a beam search is conducted over a WFST that integrates four knowledge sources:

  • Hidden Markov Model topology (H)
  • Context-dependency (C)
  • Pronunciation model (L)
  • Language model (G)

During the search phase, the acoustic scores are combined with the weights of the HCLG transducer to determine the best-scored word sequences. This process, known as “decoding,” is controlled by a number of parameters, e.g., beam width, acoustic scale factor, and others.
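
In the usual WFST recipe, the four transducers are composed offline into a single decoding graph, and the search then optimizes a combined score. This is the standard formulation; the notation here is ours, not Kaldi's documentation:

```latex
% Offline construction of the decoding graph, with determinization (det)
% and minimization (min) applied at each composition step:
HCLG = \min\Bigl(\det\bigl(H \circ \det(C \circ \det(L \circ G))\bigr)\Bigr)

% Decoding searches for the word sequence \hat{W} whose path through HCLG
% maximizes the scaled acoustic log-likelihood plus the graph (LM) score,
% where \alpha is the acoustic scale factor and O is the observed audio:
\hat{W} = \operatorname*{arg\,max}_{W}\; \alpha \log p(O \mid W) + \log P(W)
```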

We used the ASpIRE chain model to evaluate the compute signature of the Kaldi decoder. For the specific configuration listed below, decoder execution takes about 38% of the overall execution time; this share can be even higher depending on the decoder parameters and on the vocabulary and lexicon of the language model.

Intel-optimized Decoder

Intel has developed a new decoder library that boosts the language modeling performance of the Kaldi ASR decoder. This library will be available in binary form in a future release of the Intel® Distribution of OpenVINO™ Toolkit. To speed up the decoding process, a number of improvements have been applied. For instance, the representation of the WFST is not based on the original Kaldi HCLG WFST, but on a data structure that has been optimized for fast decoding. Furthermore, the search algorithm leverages a combination of beam pruning methods. Beam pruning shrinks the search space by discarding tokens (a token represents a path through the WFST) whose score in the previous step was significantly worse than the best score; computational complexity is reduced further by discarding batches of tokens at once instead of individually. Finally, the decoding library is not merely an optimized version of the Kaldi decoder but has been implemented completely from scratch.
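
A minimal sketch of the beam pruning idea follows, with hypothetical data structures rather than the Intel library's internals:

```python
def prune_tokens(tokens, beam):
    """Drop tokens whose cost is more than `beam` worse than the best.

    `tokens` maps a WFST state to the accumulated cost of the best
    partial path (token) reaching it; lower cost is better.
    """
    cutoff = min(tokens.values()) + beam
    # Discarding a whole batch of tokens in one pass is far cheaper than
    # re-testing each surviving token at every expansion step.
    return {state: cost for state, cost in tokens.items() if cost <= cutoff}

# With beam=10.0, the token at cost 25.0 exceeds the cutoff (4.2 + 10.0)
# and is pruned before the next audio frame is processed.
active = {0: 4.2, 1: 7.9, 2: 25.0}
print(prune_tokens(active, beam=10.0))  # -> {0: 4.2, 1: 7.9}
```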

Figure 2: Kaldi ASR compute distribution for the ASpIRE model. [1] The workload analysis was done on a single core of an Intel Xeon Gold CPU with the Librispeech test-clean dataset, using the Intel® VTune™ Profiler tool.

As shown in Figure 2, the Intel-optimized decoder takes a much smaller percentage of overall execution time. The performance improvements will vary depending on the complexity and size of the language model. The performance and accuracy of an ASR system can be measured by two key metrics: Real-Time Factor (RTF), the time taken to process an utterance divided by its audio duration, and Word Error Rate (WER). RealTimeX is the reciprocal of RTF; improvements in RealTimeX should never come at the expense of WER.
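
For concreteness, the two speed metrics relate as follows (the values are illustrative):

```python
def rtf(processing_seconds, audio_seconds):
    # Real-Time Factor: processing time divided by audio duration.
    # RTF < 1.0 means the system transcribes faster than real time.
    return processing_seconds / audio_seconds

def real_time_x(processing_seconds, audio_seconds):
    # RealTimeX, the reciprocal of RTF: how many real-time audio
    # streams' worth of speech one pipeline can keep up with.
    return audio_seconds / processing_seconds

# Example: 60 s of audio transcribed in 12 s -> RTF 0.2, RealTimeX 5.0.
print(rtf(12, 60), real_time_x(12, 60))
```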

Benchmark Results

In real-time human-machine interactions powered by an online AI inference service, the key service metric has traditionally been latency, since batch inference performance claims are not relevant. In large-scale deployments, however, the marginal benefit of further latency reduction also disappears quickly once responses are fast enough. In these use cases, the most useful metric for an online AI inference service like a Voicebot is latency-bound throughput, i.e., throughput at small batch sizes.

In real-time Voicebot scenarios, not all speech inputs are available at the same time. We therefore chose a small-batch throughput test as representative of real-time speech transcription, where input speech data arrives in very small batches and must be processed under tight latency requirements. The test is further categorized into ‘best case’ and ‘worst case’ scenarios.

In the best-case scenario, both the acoustic model and the language model are assumed to be fixed and do not change with every incoming audio stream. Only the time spent in feature extraction, acoustic modeling, and language model decoding is counted.

In the worst-case scenario, the acoustic and language models are not assumed to be fixed, so the measured time also includes model loading. ‘Model loading’ refers to fetching the models from storage into CPU or GPU main memory.
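
A minimal sketch of how the two scenarios differ in what they time, with hypothetical stand-ins for Kaldi's model-loading and decoding steps:

```python
import time

# Hypothetical stand-ins; real measurements would invoke the Kaldi
# executables listed in the appendix instead.
def load_models(name):
    time.sleep(0.5)   # simulate fetching models from storage into memory
    return {"model": name}

def transcribe(models, wav_path):
    time.sleep(0.1)   # simulate feature extraction + acoustic model + LM
    return "transcribed text"

t0 = time.perf_counter()
models = load_models("aspire")
load_s = time.perf_counter() - t0

t0 = time.perf_counter()
text = transcribe(models, "utterance.wav")
decode_s = time.perf_counter() - t0

best_case_s = decode_s             # models already resident in memory
worst_case_s = load_s + decode_s   # model loading counted per request
print(best_case_s, worst_case_s)
```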

We benchmarked throughput at small input/batch sizes on an NVIDIA Tesla V100 GPU-based system (measured via an AWS P3 instance) and an Intel Xeon Gold 6252 processor-based system. Detailed system configuration tables are provided in the appendix. On the Intel-based system, we provide results for both the default Kaldi decoder and the Intel-optimized decoder.

The following figures plot the performance of throughput tests at small batch sizes for ASpIRE and Librispeech models with Librispeech test-clean dataset.

Figure 3: ASpIRE Model Online throughput at small input size (best case). [1]

Figure 4: ASpIRE Model Online throughput at small input size (worst case). [1]

Figure 5: Librispeech Model Online throughput at small input size (best case). [1]

Figure 6: Librispeech Model Online throughput at small input size (worst case). [1]

As shown above, with the ASpIRE model the Intel Xeon Gold CPU beats the NVIDIA V100 GPU by 6.8X at a batch size of 1; with the Intel-optimized decoder, the CPU advantage increases to 8.6X. [2] With the Librispeech model, the Intel Xeon CPU has an 11X throughput advantage over the NVIDIA GPU in single-batch inference. On a multi-core CPU, input streams can be processed as soon as they are received, without waiting to accumulate a batch.

The ASpIRE model is more complex and a better representation of the production models deployed by our customers. Throughput improvements on a single CPU node can help large production systems deploy AI inference services like a Voicebot without requiring the purchase of additional accelerator hardware.

Conclusion

Kaldi ASR engines power a large majority of enterprise Voicebots in production today, and Intel Xeon Scalable processors offer unique performance benefits for this class of workload. In this work, we focused on measuring the latency-bound throughput of Kaldi ASR on a single compute node and demonstrated 8.6X faster throughput for the ASpIRE model and 11X faster throughput for the Librispeech model on Intel Xeon Gold CPUs vs. NVIDIA V100 GPUs in single-batch inference. [2] For enterprises that deploy millions of concurrent Voicebots, these throughput improvements deliver significant value and maximize the use of existing large-scale production systems.

Special thanks to Georg Stemmer and Joachim Hofer for their contributions to this blog post.

Appendix

Software Configurations

Kaldi ASR – Intel CPU
  Compiler: ICC 19.0.0.117
  Tested Framework: Kaldi ASR (commit 72ca1eb3e7630983c36a05053f72448ec707fcde)
  Other libs used in benchmarks: MKL 2019u2
  Dataset: Librispeech (test-clean, test-other)
  ASpIRE model (Acoustic Model: ~141 MB, Language Model: ~1020 MB) config flags:
    beam=10 lattice-beam=1 max-active=1500 iterations=1 CPU-Threads=48
    executable: online2-wav-nnet3-latgen-faster
  Librispeech model (Acoustic Model: ~78 MB, Language Model: ~192 MB) config flags:
    beam=8 lattice-beam=1 max-active=4000 iterations=1 CPU-Threads=48
    executable: online2-wav-nnet3-latgen-faster
  Intel Decoder Library: available in a future OpenVINO release

Kaldi ASR – NVIDIA GPU
  Compiler: GCC 5.4.0, nvcc 10.1
  Tested Framework: Kaldi ASR (commit 72ca1eb3e7630983c36a05053f72448ec707fcde)
  Other libs used in benchmarks: CUDA 10.1
  CUDA Driver Version: 418.87, CUDA Version: 10.0
  Dataset: Librispeech (test-clean, test-other)
  ASpIRE model (Acoustic Model: ~141 MB, Language Model: ~1020 MB) config flags:
    beam=11 lattice-beam=1 max-active=10000 batch_size=180 batch_drain_size=15
    iterations=1 file_limit=-1 gpu-feature-extract=false main-q-capacity=40000
    aux-q-capacity=500000 CPU-Threads=8 cuda-control-threads=3 cuda-worker-threads=5
    executable: batched-wav-nnet3-cuda
  Librispeech model (Acoustic Model: ~78 MB, Language Model: ~192 MB) config flags:
    beam=10 lattice-beam=7 max-active=10000 batch_size=180 batch_drain_size=15
    iterations=1 file_limit=-1 gpu-feature-extract=false main-q-capacity=30000
    aux-q-capacity=400000 CPU-Threads=8 cuda-control-threads=3 cuda-worker-threads=5
    executable: batched-wav-nnet3-cuda

Hardware Configurations

Intel CPU
  Platform: S2600WFS
  # Nodes: 2
  CPU: Intel Xeon Gold 6252 CPU @ 2.10GHz
  Cores/socket, Threads/socket: 24/24
  ucode: 0x4000013
  HT: No
  Turbo: On
  BIOS version: SE5C620.86B.0D.01.0286.011120190816
  System DDR Mem Config: 12 slots / 16GB / 2933 MHz
  Total Memory/Node (DDR+DCPMM): 192 GB
  NIC: Intel Ethernet X527DA2OCP
  PCH: Intel C620
  Other HW (Accelerator): none
  OS: CentOS-7
  Kernel: 3.10.0-957.10.1.el7.x86_64

NVIDIA GPU
  Platform: Amazon EC2
  # Nodes: 1
  CPU: Intel Xeon E5-2686 v4 @ 2.30GHz
  Cores/socket, Threads/socket: 4/8
  ucode: 0xb000037
  HT: On
  Turbo: On
  BIOS version: 4.2, Amazon EC2
  System DDR Mem Config: 4 / 16384 MB / Unknown RAM
  Total Memory/Node (DDR+DCPMM): 128 GB
  NIC: Amazon.com, Inc. Elastic Network Adapter (ENA)
  PCH: Unknown
  Other HW (Accelerator): Tesla V100-SXM2-16GB
  OS: Ubuntu 16.04.6 LTS
  Kernel: 4.4.0-1092-aws