New Research Shows How Neural Networks Process Invariant Speech

Neural networks are often called a “black box” because much of their processing happens in hidden layers, which makes parts of their decision making famously opaque. In a new paper, our team from Intel and MIT [1] outlines how to probe the information content of these networks, shedding light on how speech patterns are processed inside them. Our work builds on new theoretical results on the statistical mechanics of learning from SueYeon Chung, whose work was lauded by Professor Haim Sompolinsky as the most significant advancement in the statistical mechanics of learning since Elizabeth Gardner’s work on statistical mechanics in the weight space.

A better understanding of how deep neural networks process speech can help refine and improve current models, and may even help scientists better understand how human brains process auditory information.

Building from Current Studies

A variety of deep neural networks have proven successful at visual invariant object recognition: accurately recognizing specific objects within images regardless of size, position, and background. It’s hypothesized that objects become “untangled” across the neural network’s layers in order to separate various elements into different categories. In speech and language processing, this untangling becomes more difficult because different categories can emerge over time as the audio or text unfolds. Recurrent neural networks using sequence models have been created to better process audio for automatic speech recognition (ASR) and speaker identification, but even with recent advancements, relatively little is understood about how such models accomplish their tasks.

Speech recognition is a natural domain for analyzing auditory class manifolds (complicated structures that can be described through simpler geometric properties), both for word and speaker classes and at the phonetic and semantic levels. Several recent studies have used classifiers to examine how phonetic information is encoded in acoustic models and how it is embedded across layers. Much of this work, however, has focused on classifier accuracy and representational similarity (patterns of representations) or on explicit geometric measures (path of representations), such as curvature, geodesics, and Gaussian mean width.

Measuring the Geometry of Manifolds

In our NeurIPS 2019 paper, we make use of a recently developed theoretical framework based on the replica method (a mathematical technique from statistical physics) that connects the geometric properties of network representations to the separability of classes. We use it to better understand how information is untangled within neural networks trained to recognize speech, a step toward making the neural network “black box” less mystifying. This method has been used in visual convolutional neural networks (CNNs) to understand how object manifolds are untangled across layers. By measuring the geometry of manifolds in neural networks, including their radius, separability, and center correlations, our work provides a unique view on how neural networks process information.
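
To make these geometric quantities more concrete, here is a minimal NumPy sketch of simplified stand-ins for them. It is not the replica-based analysis from the paper; the function name `manifold_geometry_proxies` and the three formulas (spread relative to center norm, participation ratio, and mean absolute cosine similarity between class centers) are illustrative assumptions chosen only to convey the intuition. It also assumes at least two classes and nonzero class centers.

```python
import numpy as np


def manifold_geometry_proxies(features, labels):
    """Return simple per-class geometry proxies for one layer's activations.

    features: array of shape (n_samples, n_features)
    labels:   array of shape (n_samples,), one class id (e.g. a word) per row

    These are hypothetical, simplified proxies, not the mean-field measures.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)

    centers, radii, dims = [], [], []
    for c in classes:
        points = features[labels == c]
        center = points.mean(axis=0)
        # Squared singular values of the centered points give the variance
        # along each principal direction of this class manifold.
        s = np.linalg.svd(points - center, compute_uv=False)
        variances = s ** 2 / len(points)
        total = variances.sum()
        # Radius proxy: RMS spread around the center, relative to center norm.
        radii.append(np.sqrt(total) / np.linalg.norm(center))
        # Dimension proxy: participation ratio of the variance spectrum.
        dims.append(total ** 2 / (variances ** 2).sum() if total > 0 else 0.0)
        centers.append(center)

    # Center correlation proxy: mean |cosine similarity| between class centers.
    unit = np.stack(centers)
    unit = unit / np.linalg.norm(unit, axis=1, keepdims=True)
    cosine = unit @ unit.T
    off_diagonal = cosine[~np.eye(len(classes), dtype=bool)]
    return {
        "radius": np.array(radii),
        "dimension": np.array(dims),
        "center_correlation": float(np.abs(off_diagonal).mean()),
    }
```

Running a function like this on the activations of successive layers, with one class per word or speaker, gives a rough way to compare layers: under the untangling picture, task-relevant manifolds would be expected to show smaller radius and lower dimension in later layers.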

Figure 1. Illustration of word manifolds: (a) highly tangled manifolds, (b) untangled manifolds, (c) Manifold Dimension captures the projection and Manifold Radius captures the norm, (d) untanglement of words over time.

Applying manifold analysis to auditory models for the first time, we show that neural network speech recognition systems, like vision models, untangle the speech objects relevant to the task. We also find that models learn to untangle some types of object manifolds without being explicitly trained to do so.

For our paper, we examined two speech recognition models: 1) a modified CNN model based on previous work from MIT that we trained on word recognition and speaker recognition with the WSJ Corpus, the Spoken Wikipedia Corpora, and noise augmentation from AudioSet; and 2) the end-to-end ASR model Deep Speech 2.

Our Findings

Despite being trained for different tasks and built with different computational blocks, both the CNN architecture and the end-to-end ASR model converge on remarkably similar behavior: they learn to discard nuisance variations and untangle their inputs for relevant information. For example, we observed that speaker-specific nuisance variations are discarded by the network’s hierarchy, whereas task-relevant properties such as words and phonemes are untangled in later layers. Higher-level concepts such as parts-of-speech and context dependence also emerge in the later layers of the network.

We also find that temporal dynamics in recurrent layers reveal untangling over recurrent time steps, in the form of smaller manifold radius and lower manifold dimensionality. In addition, we show that general auditory untangling of speaker manifolds, which is not evident in the ASR model or in the model trained on word recognition, does occur in a network trained only on speaker recognition. Finally, we find that the deep representations carry out significant temporal untangling by efficiently extracting task-relevant features at each time step of the computation.

Taken together, these findings shed light on how deep auditory models process time-dependent input signals to achieve invariant speech recognition, and show how different concepts emerge through the layers of the network. These results are the first geometric evidence for the untangling of manifolds, from phonemes to parts-of-speech, in deep neural networks for speech recognition, and they give researchers greater insight into how neural networks process information in their hidden layers.

NeurIPS 2019

We look forward to discussing these findings with our peers at the 2019 Conference on Neural Information Processing Systems. By studying the emergent geometric properties of speech objects and their linear separability, measured by manifold capacity, we hope that our work will motivate further theory-driven geometric analysis of representation untangling in tasks with temporal structure; the search for the mechanistic relation between the network architecture, learned parameters, and structure of the stimuli; and the study of competing vs. synergistic tasks. For those interested, our code repository is available at github.com/schung039/neural_manifolds_replicaMFT.
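
For readers who want an intuition for linear separability of manifolds before opening the repository, the following is a small, hedged sketch. It is an empirical illustration only, not the manifold capacity computation from the paper: it assigns a random +1 or -1 label to each class manifold and uses a plain perceptron to check whether a linear boundary can split the points accordingly. The function names `is_linearly_separable` and `separable_fraction`, the update budget, and the rule of treating non-convergence within that budget as "not separable" are all assumptions made for illustration.

```python
import numpy as np


def is_linearly_separable(points, signs, max_updates=1000):
    """Perceptron check: True if a separating hyperplane is found in budget.

    points: array of shape (n_samples, n_features)
    signs:  array of shape (n_samples,) with entries +1 or -1
    Failing to converge within max_updates is treated as not separable,
    which is an approximation for hard but separable inputs.
    """
    points = np.asarray(points, dtype=float)
    signs = np.asarray(signs, dtype=float)
    # Append a constant column so the hyperplane may have a bias term.
    X = np.hstack([points, np.ones((len(points), 1))])
    w = np.zeros(X.shape[1])
    for _ in range(max_updates):
        wrong = np.flatnonzero(signs * (X @ w) <= 0)
        if wrong.size == 0:
            return True
        i = wrong[0]
        w += signs[i] * X[i]  # standard perceptron update on one mistake
    return False


def separable_fraction(features, labels, n_dichotomies=50, seed=0):
    """Fraction of random class-level dichotomies that are linearly separable."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    class_index = np.searchsorted(classes, labels)
    hits = 0
    for _ in range(n_dichotomies):
        # Every point inherits the random sign drawn for its class manifold.
        class_signs = rng.choice([-1.0, 1.0], size=len(classes))
        hits += is_linearly_separable(features, class_signs[class_index])
    return hits / n_dichotomies
```

In this toy picture, a layer whose manifolds are well untangled would give a fraction close to 1 for a given number of classes, while heavily tangled manifolds would give a lower fraction. Manifold capacity, as used in the paper, formalizes this idea as the number of manifolds per feature dimension that can be separated in this way.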

For more on this research, please review our paper “Untangling in Invariant Speech Recognition,” look for us at the 2019 NeurIPS conference, and stay tuned to @IntelAIResearch on Twitter.