End-to-End Speech Recognition with neon

By: Anthony Ndirango and Tyler Lee

Speech is an intrinsically temporal signal. The information-bearing elements present in speech evolve over a multitude of timescales. The fine changes in air pressure at rates of hundreds to thousands of hertz convey information about the speakers, their location, and help us separate them from a noisy world. Slower changes in the power spectrum of speech denote the progression of phonemes, the building blocks of the spoken word. Beyond these are more slowly developing sequences of words, turning into phrases and narrative structure. There are, however, no strict demarcations between elements within and across timescales. Instead, elements at each scale blend together, temporal context is critical, and silence is a rare and ambiguous indicator of the transition between elements. Automatic speech recognition (ASR) systems must make sense of this noisy multi-scale stream of data, transforming it into an accurate sequence of words.

As of this writing, the most popular and successful approach to building speech recognition engines involves hybrid systems that combine deep neural networks (DNNs) with an amalgam of Hidden Markov Models (HMMs), context-dependent phone models, n-gram language models, and sophisticated variants of Viterbi search algorithms. These are complex models that require elaborate training recipes and demand a fair amount of expertise on the part of the model-builder. If the success of deep learning has taught us anything, it’s that we can often replace complex, multi-faceted machine learning approaches with generic neural networks that are trained to optimize a differentiable cost function. When applied to speech recognition, this approach, which we’ll loosely term “pure” DNN, has met with resounding success. It is now much easier to build a state-of-the-art Large Vocabulary Continuous Speech Recognition (LVCSR) system as long as one has access to relatively large amounts of training data and sufficient computational resources.

The goal of the present exposition is to provide a relatively simple guide to using Neon to build a speech recognition system using “pure” DNNs, following an approach pioneered by Graves and collaborators and further developed into a complete end-to-end ASR pipeline by AI researchers at Baidu.  We will be releasing an open-source implementation of our end-to-end speech recognition engine to complement this blog post. In its most rudimentary form, the system uses bi-directional recurrent neural networks (BiRNNs) to train a model to generate transcriptions directly from spectrograms without having to explicitly align the audio frames to the transcripts. Instead, implicit alignment is done using Graves’ Connectionist Temporal Classification (CTC) algorithm.

Though “pure” DNN approaches now allow one to train LVCSR systems capable of state-of-the-art performance, an explicit decoding step — converting model outputs into sensible sequences of words — remains critical during evaluation. Techniques for decoding are varied, often involving a mixture of weighted finite state transducers and neural network language models. This topic would require an in-depth article in its own right, and thus we have chosen to limit this article primarily to the training portion of the ASR pipeline. When necessary, we provide the reader with external references to fill in these gaps to hopefully convey a complete view of what goes into constructing an end-to-end speech recognition engine.

[Figure: end-to-end speech recognition]

In its barest outline, an end-to-end speech recognition pipeline consists of three main components:

  1. A feature extraction stage which takes raw audio signals (e.g. from a wav file) as inputs and generates a sequence of feature vectors, with one feature vector for a given frame of audio input. Examples of outputs of the feature extraction stage include slices of the raw waveform, spectrograms and the equally popular mel-frequency cepstral coefficients (MFCCs).
  2. An acoustic model which takes sequences of feature vectors as inputs and generates probabilities of either character or phoneme sequences conditioned on the feature vector input.
  3. A decoder which takes two inputs – the acoustic model’s outputs as well as a language model – and searches for the most likely transcript given the sequences generated by the acoustic model constrained by the linguistic rules encoded in the language model.


Handling Data

An efficient mechanism for loading data is critical when building an end-to-end speech recognition system. We will take full advantage of the fact that Neon, as of version 1.7, comes with a superb dataloader, Aeon, that supports image, audio and video data. Using Aeon substantially simplifies our work as it allows us to train the acoustic model directly from raw audio files without worrying about explicitly pre-processing the data. Furthermore, Aeon allows us to easily specify the type of spectral features we’d like to use during training.

Ingesting data

Typically, speech data is distributed as raw audio files in some standard audio format, along with a series of text files containing the corresponding transcripts. In many cases, the transcript file will contain lines of the form

<path to audio file>, <transcription of speech in audio file>
meaning that the listed path points to an audio file containing the listed transcript. However, in many cases, the path listed in the transcript file is not an absolute path, but a path relative to some assumed directory structure. To deal with the specifics of various data packaging scenarios, Aeon requires that the user generate a “manifest file” that contains pairs of absolute paths, with one path pointing to the audio file and the other path pointing to the corresponding transcript. We refer the reader to Neon’s speech example [include link] and the Aeon documentation for further details.

In addition to the manifest file, Aeon also requires that the user provide the length of the longest utterance in the dataset as well as the length of the longest transcript. These lengths can be extracted while generating the manifest files, e.g. using the popular SoX program to extract the duration of audio files.
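As a rough illustration of the bookkeeping described above, the sketch below turns a transcript index of the form shown earlier into a manifest of absolute path pairs, and records the longest utterance and longest transcript along the way. Several things here are our assumptions rather than Aeon requirements: the function names, the comma-separated manifest layout, the one-transcript-per-text-file convention, and the use of Python's standard `wave` module (which only reads PCM WAV files) in place of SoX. Check the Aeon documentation for the exact manifest format it expects.

```python
import os
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file using only the standard library."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / float(wav.getframerate())


def build_manifest(transcript_index, audio_root, transcript_dir, manifest_path):
    """Write a manifest of absolute path pairs.

    Returns (longest utterance in seconds, longest transcript in characters).
    Assumes audio files have unique base names (hypothetical simplification).
    """
    os.makedirs(transcript_dir, exist_ok=True)
    max_duration, max_chars = 0.0, 0
    with open(transcript_index) as index, open(manifest_path, "w") as manifest:
        for line in index:
            line = line.strip()
            if not line:
                continue
            # Split on the first comma only: transcripts may contain commas.
            rel_path, text = (field.strip() for field in line.split(",", 1))
            audio_path = os.path.abspath(os.path.join(audio_root, rel_path))
            stem = os.path.splitext(os.path.basename(rel_path))[0]
            text_path = os.path.abspath(os.path.join(transcript_dir, stem + ".txt"))
            with open(text_path, "w") as out:
                out.write(text)
            manifest.write("{},{}\n".format(audio_path, text_path))
            max_duration = max(max_duration, wav_duration_seconds(audio_path))
            max_chars = max(max_chars, len(text))
    return max_duration, max_chars
```

The two returned values are the lengths that the dataloader needs in addition to the manifest itself.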

[Figure: training a deep neural network]

We build our acoustic models by training a deep neural network composed of convolutional (Conv) layers, bi-directional recurrent (BiRNN) layers and fully connected (FC) layers (essentially following “Deep Speech 2”), as shown schematically in the adjacent figure.

With the exception of the output layer which uses a softmax activation function, all other layers employ the ReLU activation function.

As shown in the figure, the network takes spectral feature vectors as input. Using the Aeon dataloader, Neon currently supports four types of input features: raw waveforms, spectrograms, mel-frequency spectral coefficients (MFSCs) and mel-frequency cepstral coefficients (MFCCs). MFSCs and MFCCs are derived from spectrograms and essentially turn each column of a spectrogram into a relatively small number of independent coefficients which more closely align with the perceptual frequency scale of the human ear. In our experiments, we have also observed that, all else being equal, models trained with mel-features as inputs perform slightly better than models trained with spectrograms.

The spectral inputs are fed into a Conv layer. In general, one could consider architectures with multiple Conv layers employing either 1D or 2D convolutions. We will take advantage of strided convolutions which effectively allow the network to operate on “wider contexts” of the input. Strided convolutions also reduce the overall length of the sequences, which in turn significantly reduces both the memory footprint and the number of computations carried out by the network. This allows us to train even deeper models which leads to improved performance without significantly increasing the required computational resources.
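To make the sequence-length reduction concrete, here is a small sketch of the standard output-length formula for a strided convolution along the time axis. The filter size, stride and padding values in the example are hypothetical and are not the settings used in our models.

```python
def conv_output_length(input_length, filter_size, stride, padding=0):
    """Number of output frames produced by a 1D convolution over time."""
    return (input_length + 2 * padding - filter_size) // stride + 1


# Hypothetical example: a 1000-frame utterance passed through two Conv
# layers, each with a filter spanning 11 frames, stride 2 and padding 5.
frames = 1000
for layer in range(2):
    frames = conv_output_length(frames, filter_size=11, stride=2, padding=5)
    print("after Conv layer", layer + 1, "->", frames, "frames")
```

Each stride-2 layer roughly halves the number of timesteps that the recurrent layers have to process. One caveat when choosing strides: the CTC cost needs at least as many output frames as there are characters in the transcript, so the reduction cannot be arbitrarily aggressive.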

The outputs from the Conv layer(s) feed into a stack of BiRNN layers. Each BiRNN layer is comprised of a pair of RNNs running in tandem, with the input sequence presented in opposite directions as indicated in the figure.

[Figure: processing speech signals]

The outputs from the pair of RNNs are then concatenated as shown. BiRNN layers are particularly suited to processing speech signals as they allow the network access to both future and past contexts at every given point of the input sequence [1]. When training CTC-based acoustic models, we found it beneficial to employ “vanilla” RNNs as opposed to their gated variants (GRUs or LSTMs) mainly because the latter come with significant computational overhead. Following [2], we also apply batch normalization to the BiRNN layers to reduce the overall training time with little impact on the accuracy of the model measured in terms of the overall word error rate (WER).

The outputs from the BiRNN layers at each timestep are then fed into a fully connected layer which in turn feeds into a softmax layer. Each unit in the softmax layer corresponds to a single character in the alphabet characterizing the target vocabulary. For example, if the training data is drawn from an English corpus, the alphabet will typically include the characters A through Z, as well as any relevant punctuation symbols, including a symbol for the “space” character separating words in the transcripts. CTC-based models also typically require that the alphabet include a special “blank” character. The blank character allows the model to reliably deal with predicting consecutive repeated symbols, as well as artifacts in speech signals, e.g. pauses, background noise and other “non-speech” events.

Thus, given a sequence of frames corresponding to an utterance, the model is required to produce, for each frame, a probability distribution over the alphabet. During the training phase, the softmax outputs are fed into a CTC cost function (more on this shortly) which uses the actual transcripts to (i) score the model’s predictions, and (ii) generate an error signal quantifying the accuracy of the model’s predictions. The overall goal is to train the model to increase the overall score of its predictions relative to the actual transcripts.


Empirically, we have found that using stochastic gradient descent with momentum paired with gradient clipping leads to the best performing models. Deeper networks (seven layers or more) also tend to perform better in general.

We train our models using Nesterov’s Accelerated Gradient Descent following the implementation in Sutskever, et al. Most of the model’s hyperparameters, e.g. the depth of the network, the number of units in a given layer, the learning rate, the annealing rate, the momentum, etc., are chosen empirically based on a held-out development dataset. We use “Xavier” initialization for all layers in our models, although we have not systematically investigated whether there are any improvements to be had by using alternate forms of initialization.
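For readers unfamiliar with this update rule, the following is a minimal pure-Python sketch of one Nesterov momentum step in the formulation of Sutskever et al., combined with gradient clipping by global norm. It is not Neon's optimizer API: the function names, the choice of norm-based clipping, and the hyperparameter values in the toy example are all assumptions made for illustration.

```python
import math


def clip_by_norm(grad, max_norm):
    """Rescale the gradient so that its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > max_norm:
        return [g * max_norm / norm for g in grad]
    return list(grad)


def nesterov_step(params, velocity, grad_fn, lr, momentum, max_norm):
    """One Nesterov update: the gradient is evaluated at the look-ahead point."""
    lookahead = [p + momentum * v for p, v in zip(params, velocity)]
    grad = clip_by_norm(grad_fn(lookahead), max_norm)
    velocity = [momentum * v - lr * g for v, g in zip(velocity, grad)]
    params = [p + v for p, v in zip(params, velocity)]
    return params, velocity


# Toy problem: minimize f(x) = 0.5 * ||x||^2, whose gradient is x itself.
params, velocity = [3.0, -4.0], [0.0, 0.0]
for _ in range(500):
    params, velocity = nesterov_step(
        params, velocity, grad_fn=lambda x: list(x),
        lr=0.1, momentum=0.9, max_norm=1.0)
print(params)  # both coordinates end up very close to zero
```

Clipping bounds the size of any single update, which helps when occasional frames produce very large gradients through the recurrent layers.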

All our models are trained using the CTC loss criterion. A detailed explanation of the inner workings of the CTC computation is beyond the scope of this blog post. We will provide a brief overview here and refer the reader to [Alex Graves’ dissertation] for an in-depth treatment.

The CTC computation is centered around the action of a “Collapse” function which takes a sequence of characters as input and produces an output sequence by first removing all repeated characters in the input string followed by removing all “blank” symbols. For example, if we use “_” to denote the blank symbol, then

$$\mathrm{Collapse}(\texttt{HH\_E\_LL\_LO}) = \texttt{HELLO}$$
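The Collapse function described above takes only a few lines of Python. This is a sketch of the mapping as defined in the text, with “_” standing in for the blank symbol; the function name and the use of plain strings are our own choices for illustration.

```python
from itertools import groupby

BLANK = "_"


def collapse(sequence, blank=BLANK):
    """Merge runs of repeated symbols, then remove all blank symbols."""
    merged = (symbol for symbol, _run in groupby(sequence))
    return "".join(symbol for symbol in merged if symbol != blank)


print(collapse("HH_E_LL_LO"))  # HELLO
print(collapse("OOX"))         # OX
```

Note the order of the two steps: because repeats are merged before blanks are removed, a blank between two identical characters is what allows a genuinely doubled letter, such as the “LL” in the first example, to survive.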

Given an utterance of length T and its corresponding “ground truth” transcript, the CTC algorithm constructs the “inverse” of the Collapse function defined as the set of all possible character sequences of length T which collapse onto the “ground truth” transcript. The probability of observing any sequence that appears in this “inverse” set can be computed directly from the softmax outputs of the neural network. The CTC cost is then defined as a logarithmic function of the sum of probabilities of sequences in the “inverse” set. This function is differentiable with respect to the softmax outputs which is all we need to compute the error gradients required for backpropagation.

Taking a simple example for illustrative purposes, suppose that the input utterance has 3 frames and that the corresponding transcription is the word “OX”. Again using “_” to denote the blank symbol, the set of 3-character sequences that collapse to OX contains _OX, O_X, OOX, OXX, and OX_. The CTC computation sets

$$\mathrm{CTC\ cost} = -\log\big[P(\texttt{\_OX}) + P(\texttt{O\_X}) + P(\texttt{OOX}) + P(\texttt{OXX}) + P(\texttt{OX\_})\big]$$

where P(abc) = p(a,1)p(b,2)p(c,3) with p(u,t) given by the model’s softmax output for unit “u” at time (frame) t. The CTC algorithm thus requires enumerating all sequences of a given fixed length which collapse to a given target sequence. When dealing with very long sequences, the enumerative combinatorics is efficiently carried out using a forward-backward algorithm that’s very close in spirit to that employed in HMMs.
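The sketch below computes the CTC cost for the “OX” example in two ways: by brute-force enumeration of every 3-frame sequence, and by the forward half of the forward-backward recursion. It assumes the cost is the negative natural logarithm of the summed path probabilities. The softmax values are made up for illustration, and the function names are ours; this is not the implementation used in Neon.

```python
import math
from itertools import groupby, product

BLANK = "_"


def collapse(sequence, blank=BLANK):
    """Merge runs of repeated symbols, then remove all blank symbols."""
    merged = (symbol for symbol, _run in groupby(sequence))
    return "".join(symbol for symbol in merged if symbol != blank)


def ctc_cost_brute_force(probs, target, blank=BLANK):
    """Sum P(path) over every length-T path that collapses to `target`.

    probs[t][u] plays the role of p(u, t), with frames 0-indexed here.
    """
    total = 0.0
    for path in product(sorted(probs[0]), repeat=len(probs)):
        if collapse(path, blank) == target:
            p = 1.0
            for t, u in enumerate(path):
                p *= probs[t][u]
            total += p
    return -math.log(total)


def ctc_cost_forward(probs, target, blank=BLANK):
    """Same cost via the forward recursion, without enumerating paths."""
    # Interleave blanks with the target: "OX" -> ["_", "O", "_", "X", "_"].
    ext = [blank]
    for symbol in target:
        ext += [symbol, blank]
    alpha = [0.0] * len(ext)
    alpha[0] = probs[0][blank]
    if len(ext) > 1:
        alpha[1] = probs[0][ext[1]]
    for frame in probs[1:]:
        new = [0.0] * len(ext)
        for s, symbol in enumerate(ext):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # The blank between two symbols may be skipped only when
            # those two symbols differ.
            if s >= 2 and symbol != blank and symbol != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * frame[symbol]
        alpha = new
    total = alpha[-1] + (alpha[-2] if len(ext) > 1 else 0.0)
    return -math.log(total)


# Made-up softmax outputs for 3 frames over the alphabet {_, O, X}.
frames = [
    {"_": 0.1, "O": 0.8, "X": 0.1},
    {"_": 0.3, "O": 0.3, "X": 0.4},
    {"_": 0.2, "O": 0.1, "X": 0.7},
]
print(ctc_cost_brute_force(frames, "OX"))
print(ctc_cost_forward(frames, "OX"))
```

The two printed values agree up to floating-point rounding. The brute-force version visits every possible sequence, so its cost grows exponentially with the number of frames, whereas the recursion scales with the number of frames times the transcript length. Both versions assume that the utterance has enough frames to emit the transcript at all.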


Once the model is trained, we evaluate its performance by testing it on previously unseen utterances from a test set. Recall that the model generates sequences of probability vectors as outputs, so we need to build a decoder to transform the model’s output into word sequences.

The decoder’s job is to search through the model’s outputs and generate the most probable sequence as the transcription. The simplest approach is to compute

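$$\text{transcript} = \mathrm{Collapse}\Big(\arg\max_{u} p(u,1),\ \arg\max_{u} p(u,2),\ \ldots,\ \arg\max_{u} p(u,T)\Big)$$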

where Collapse( … ) is the mapping defined above.
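This simplest decoding strategy, often called best-path or greedy decoding, can be sketched as follows: pick the most probable symbol in every frame, then apply Collapse. The function names and the made-up softmax values are ours.

```python
from itertools import groupby

BLANK = "_"


def collapse(sequence, blank=BLANK):
    """Merge runs of repeated symbols, then remove all blank symbols."""
    merged = (symbol for symbol, _run in groupby(sequence))
    return "".join(symbol for symbol in merged if symbol != blank)


def best_path_decode(probs, blank=BLANK):
    """Take the most probable symbol in each frame, then collapse."""
    best = [max(frame, key=frame.get) for frame in probs]
    return collapse(best, blank)


# Made-up softmax outputs for 3 frames over the alphabet {_, O, X}.
frames = [
    {"_": 0.1, "O": 0.8, "X": 0.1},
    {"_": 0.3, "O": 0.3, "X": 0.4},
    {"_": 0.2, "O": 0.1, "X": 0.7},
]
print(best_path_decode(frames))  # OX
```

Greedy decoding is cheap, but the single most probable path does not always collapse to the most probable transcript, since a transcript's probability is a sum over many paths.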

Despite being trained on character sequences, our models are still able to learn an implicit language model and are already quite adept at spelling out words “phonetically” (see Table 1). The models’ spelling performance is typically measured using character error rates (CERs) calculated using the Levenshtein distance at the character level. We have observed that many of the errors in the models’ predictions occur in words that do not appear in the training set. It is thus reasonable to expect that the overall CER would continue to improve as one increased the size of the training set. This expectation is borne out in the results obtained in Deep Speech 2, where the training set consists of over 12000 hours of speech data.


| Model output without LM constraints | “Ground truth” transcription |
|---|---|
| younited presidentiol is a lefe in surance company | united presidential is a life insurance company |
| that was sertainly true last week | that was certainly true last week |
| we’re now ready to say we’re intechnical default a spokesman said | we’re not ready to say we’re in technical default a spokesman said |

Table 1: A sample of the model’s predictions on the Wall Street Journal evaluation dataset. We intentionally chose examples that the model struggles with. As shown, incorporating language model constraints essentially eliminates all “spelling errors” that the model makes without a language model.

Although our models exhibit excellent CERs, their tendency to spell out words phonetically results in relatively high word error rates.  One can improve the models’ performance (WER) by allowing the decoder to incorporate constraints from an external lexicon and language model. Following [3, 4], we have found using weighted finite state transducers (WFSTs) to be a particularly effective approach to this task. We have observed relative WER improvements of up to 25% on the WSJ and Librispeech datasets.
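The gap between CER and WER is easy to see with a small sketch of both metrics. Each is a Levenshtein distance divided by the length of the reference, computed over characters for CER and over words for WER. The function names are ours, and the example pair is the second row of Table 1.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (strings or lists of words)."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (r != h),  # substitution or match
            ))
        previous = current
    return previous[-1]


def error_rate(ref, hyp):
    """Edit distance normalized by the length of the reference."""
    return levenshtein(ref, hyp) / len(ref)


truth = "that was certainly true last week"
output = "that was sertainly true last week"
print("CER:", error_rate(truth, output))                  # 1 error in 33 characters
print("WER:", error_rate(truth.split(), output.split()))  # 1 error in 6 words
```

A single misspelled character costs about 3% at the character level but almost 17% at the word level, which is why constraining the decoder with a lexicon and language model pays off so visibly in WER.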

Table 2 lists various end-to-end speech recognition systems trained using the Wall Street Journal (WSJ) corpus. To allow an apples-to-apples comparison, we have chosen to compare published results on systems trained and evaluated using only the WSJ dataset. However, we should point out that hybrid DNN-HMM systems trained and evaluated on the same dataset have been shown to perform much better than systems using purely deep neural network architectures [6]. On the other hand, it has been established that by training on much larger datasets, purely deep neural network architectures are able to achieve the same performance as their hybrid DNN-HMM counterparts (see Deep Speech 2).




| System | CER (no LM) | WER (no LM) | WER (trigram LM) | WER (trigram LM w/ enhancements) |
|---|---|---|---|---|
| Hannun, et al. (2014) | 10.7 | 35.8 | 14.1 | N/A |
| Graves-Jaitly (ICML 2014) | 9.2 | 30.1 | not reported | 8.7 |
| Hwang-Sung (ICML 2016) | 10.6 | 38.4 | 8.88 | 8.1 |
| Miao et al. (2015) [Eesen] | not reported | not reported | 9.1 | 7.3 |
| Bahdanau et al. (2016) | 6.4 | 18.6 | 10.8 | 9.3 |
| Our implementation | 8.64 | 32.5 | 8.4 | N/A |

Table 2: Performance of various end-to-end speech recognition systems trained and evaluated using only the Wall Street Journal (WSJ) dataset. CER refers to the character error rate comparing the sequence of characters from the model to the sequences of characters in the actual transcripts. LM refers to the language model. The final column accounts for cases where the decoding was carried out using additional techniques like rescoring, model aggregation, etc.

Future Work

The incorporation of the CTC objective function into neural network models for speech recognition provided a first glimpse of the ability of pure DNN methods. More recently, however, encoder-decoder RNN models augmented with the so-called attention mechanism have emerged as viable alternatives to RNN models trained using the CTC criterion [4, 5]. Both attention-based encoder-decoder models and CTC-based models are trained to map sequences of acoustic inputs to sequences of characters/phonemes. As discussed above, CTC-based models are trained to predict a character for each frame of speech input and proceed to search for possible alignments between the frame-wise predictions and the target sequence. In contrast, encoder-decoder attention-based models first read in the entire input sequence before predicting the output sequence. A conceptual advantage to this approach is that one does not have to assume that the predicted characters in the output sequence are independent. The CTC algorithm explicitly makes this assumption which is clearly unfounded — the sequences of characters appearing in words are very much conditioned on characters that appear earlier in the sequence. Recent work on attention-based encoder-decoder models for LVCSR has shown significant improvements in character error rates relative to CTC-based models [4]. This is true when both approaches are evaluated prior to integrating a language model, lending support to the assertion that the attention-based models produce better acoustic models than their CTC-based counterparts. However, it is worth pointing out that the difference in performance disappears when language models are used to determine word error rates.

We are actively working on a Neon implementation of attention-based encoder-decoder networks for ASR applications and wholeheartedly welcome contributions to this effort from the community at large.

Code accompanying this blog post can be found at https://github.com/NervanaSystems/deepspeech.git.