Optimization of TensorFlow* WaveNet* Models on Intel® Xeon® Processors

WaveNet* is a deep neural network for generating raw audio. The model was first introduced by the Google DeepMind team [1]. Per the authors, WaveNet yields state-of-the-art performance when applied to text-to-speech, and also has the ability to capture the characteristics of many different speakers with equal fidelity. WaveNet can be applied to music, and can generate novel and often highly realistic musical fragments [2]. It also can be used for phoneme recognition, a key step in speech recognition [1].

Two implementations of WaveNet models for TensorFlow* are currently available. One is ibab’s TensorFlow implementation of DeepMind’s original WaveNet [3]. The other applies WaveNet to modeling music and is known as Magenta WaveNet [4].

In this blog, we discuss the model optimizations required to speed up both versions of WaveNet on Intel® Xeon® Scalable processors. With a few small tweaks to these models, users can easily obtain improved runtime performance.

DeepMind’s Original WaveNet*

This model is implemented and open sourced by ibab [3]. It can be trained on the VCTK corpus dataset, and audio generation uses a previously trained checkpoint. The model supports audio generation with or without Global Conditioning. In this blog, we focus on audio generation without Global Conditioning.

The out-of-box performance can be seen by following the steps below:

  • Clone the WaveNet model from GitHub [3]
  • Install the prerequisite software [3]
  • Install Intel® Optimization for TensorFlow* [5]
  • Finally, generate audio using the following command:

python generate.py --wav_out_path=generated.wav --save_every 2000 --samples 16000 /path/to/trained/checkpoint

The out-of-box performance of the above run command is suboptimal. With these settings, TensorFlow falls back to its default threading configuration, which, after careful examination, we found does not deliver optimal performance on Intel Xeon processors. We therefore applied two sets of optimizations: one to the WaveNet model and one to the runtime.

First, we added support for inter-thread and intra-thread options to the WaveNet model, making these threading parameters tunable on Intel® Xeon® processor-based systems. The intra-thread and inter-thread values are then tuned to obtain the best performance. More on intra-op and inter-op parallelism can be found in TensorFlow’s documentation [6].
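The exact wiring in the PR may differ, but as a rough sketch (assuming the TensorFlow 1.x Python API used by the model), the new flags can be passed into the session configuration along these lines; the flag names below mirror --num_intra_threads and --num_inter_threads from the run command further down:

# Sketch only: expose tunable threading flags and pass them to the session.
# Flag names follow the run command below; 0 lets TensorFlow pick its default.
import argparse
import tensorflow as tf

parser = argparse.ArgumentParser()
parser.add_argument("--num_intra_threads", type=int, default=0,
                    help="Threads used inside an individual op.")
parser.add_argument("--num_inter_threads", type=int, default=0,
                    help="Number of ops executed in parallel.")
args, _ = parser.parse_known_args()

config = tf.ConfigProto(
    intra_op_parallelism_threads=args.num_intra_threads,
    inter_op_parallelism_threads=args.num_inter_threads)
sess = tf.Session(config=config)  # replaces the default tf.Session()

With this in place, the thread counts become command-line tunable rather than being fixed at TensorFlow’s defaults.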

Second, we applied a runtime optimization by tuning the numactl command used to launch the workload. In this case, we found that binding the process and its memory to a single socket achieves the best performance. With these two optimizations, we achieved a speedup of 4.6x relative to non-optimized WaveNet^.

The run command for the optimized WaveNet is shown below. The optimized branch was upstreamed as a pull request (PR) to the WaveNet repository and can be found at https://github.com/ibab/tensorflow-wavenet/pull/352/files.

numactl --physcpubind=0-7 --membind=0 python generate.py --wav_out_path=generated.wav --num_intra_threads=8 --num_inter_threads=1 --save_every 10000 --samples 16000 logdir/train/2017-02-13T16-45-34/model.ckpt-80000

In the above run command, “--physcpubind=0-7” directs the operating system to run on the first 8 cores of the first socket of the Intel Xeon processor, “--membind=0” binds memory allocation to the memory attached to the first socket, “--num_intra_threads=8” sets the number of intra threads to 8, and “--num_inter_threads=1” sets the number of inter threads to 1.
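The core range and memory node that work best depend on the system topology. Before adapting the binding to another machine, the NUMA layout can be inspected with numactl itself:

numactl --hardware

The output lists the cores and memory attached to each NUMA node, from which suitable --physcpubind and --membind values can be chosen.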

Magenta WaveNet for Music Generation

In this implementation, WaveNet is applied to music generation. Steps similar to the above example can be followed to measure the out-of-box performance:

  • Clone the Magenta WaveNet model from GitHub [4]
  • Install the prerequisite software for Intel Xeon processors [4]
  • Install Intel® Optimization for TensorFlow* [5]
  • Finally, generate audio using the command below, which takes all .wav files in the source_path directory and creates generated samples in the save_path directory.

nsynth_generate --checkpoint_path=//wavenet-ckpt/model.ckpt-200000 \
--source_path=/ --save_path=/ --batch_size=4

The out-of-box performance of the TensorFlow Magenta WaveNet model produced by the above run command is also suboptimal. With this command, TensorFlow again uses its own default settings for inter-thread and intra-thread parallelism; the main limitations of the default threading are that it is not optimized for the Intel Xeon architecture and it is not tunable. Thus, as with DeepMind’s original WaveNet, we used two sets of optimizations: adding inter-thread and intra-thread support to the model, and tuning numactl in the run command.
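As an illustration only (not the exact contents of the PR), and assuming the tf.app.flags mechanism commonly used by Magenta’s TensorFlow 1.x scripts, the threading flags for nsynth_generate could be exposed roughly as follows:

# Hypothetical sketch of exposing the threading flags in nsynth_generate
# (TF 1.x API assumed); the real PR may wire them differently.
import tensorflow as tf

tf.app.flags.DEFINE_integer("num_intra_threads", 0,
                            "Intra-op parallelism threads (0 = TF default).")
tf.app.flags.DEFINE_integer("num_inter_threads", 0,
                            "Inter-op parallelism threads (0 = TF default).")
FLAGS = tf.app.flags.FLAGS

def session_config():
  # Build a session configuration carrying the tuned thread counts.
  return tf.ConfigProto(
      intra_op_parallelism_threads=FLAGS.num_intra_threads,
      inter_op_parallelism_threads=FLAGS.num_inter_threads)

The resulting configuration is then passed wherever the generation script creates its tf.Session.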

The run command for the optimized Magenta WaveNet is as follows. The optimized branch was upstreamed as a pull request to the Magenta repository and can be found at https://github.com/tensorflow/magenta/pull/1235.

numactl --physcpubind=0-15 --membind=0 nsynth_generate --num_inter_threads=8 --num_intra_threads=4 --checkpoint_path=//wavenet-ckpt/model.ckpt-200000 \
--source_path=/ --save_path=/ --batch_size=4

In the above run command, “--physcpubind=0-15” directs the operating system (OS) to run on the first 16 cores of the first socket of the Intel® Xeon® (Skylake) processor, “--membind=0” binds memory allocation to the memory attached to the first socket, “--num_intra_threads=4” sets the number of intra threads to 4, and “--num_inter_threads=8” sets the number of inter threads to 8.

After the two optimizations on Magenta WaveNet, we saw a speedup of 11x compared to the out-of-the-box model.

Conclusion

In this blog, we discussed simple optimizations for two TensorFlow WaveNet models: DeepMind’s original WaveNet and Magenta WaveNet. The optimizations change the model code minimally, yet deliver roughly 4.6x and 11x runtime performance improvements^ for the original WaveNet and Magenta WaveNet, respectively.

References:
[1] WaveNet: A Generative Model For Raw Audio
[2] Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders
[3] A TensorFlow* implementation of DeepMind’s WaveNet paper
[4] NSynth: Neural Audio Synthesis
[5] Intel Optimization for TensorFlow Installation Guide
[6] TensorFlow Performance Overview

^Test configuration: Intel Xeon Gold 6148 CPU @ 2.40GHz; OS: Red Hat Enterprise Linux Server release 7.4 (Maipo); Intel® MKL-DNN version 0.14, Python* 2.7.15, GCC* 6.2.0, Intel Optimized TensorFlow 1.8 with Intel MKL support; specific TensorFlow commit: abccb5d3cb45da0d8703b526776883df2f575c87 with separate patch applied:
https://github.com/tensorflow/tensorflow/issues/17437.