WaveNet* is a deep neural network for generating raw audio. The model was first introduced by the Google DeepMind team. Per the authors, WaveNet yields state-of-the-art performance when applied to text-to-speech, and can also capture the characteristics of many different speakers with equal fidelity. WaveNet can be applied to music, where it generates novel and often highly realistic musical fragments. It can also be used for phoneme recognition, a key step in speech recognition.
Two implementations of WaveNet for TensorFlow* are currently available. One is an implementation of DeepMind’s original WaveNet as a TensorFlow model by ibab. The other, called Magenta WaveNet, applies WaveNet to modeling music.
In this blog, we discuss the model optimizations required to speed up both versions of WaveNet on Intel® Xeon® Scalable processors. With a few small tweaks to these models, users can easily obtain improved runtime performance.
The original WaveNet model is implemented and open-sourced by ibab. The model can be trained on the VCTK corpus. For generating audio, a previously trained checkpoint is used. The model supports audio generation with or without Global Conditioning. In this blog, we focus on audio generation without Global Conditioning.
The out-of-box performance can be measured by running the following command:
python generate.py --wav_out_path=generated.wav --save_every 2000 --samples 16000 /path/to/trained/checkpoint
The out-of-box performance of the above command is sub-optimal. With these settings, TensorFlow uses its default threading configuration, which, after careful examination, we found did not deliver optimal performance. We therefore applied two optimizations to WaveNet: one to the model and one to the runtime.
First, we added support for inter-thread and intra-thread options to the WaveNet model, so that these threading parameters are tunable on Intel® Xeon® processor-based systems. Intra threads parallelize the work inside a single operation, while inter threads let independent operations run concurrently. We then tuned the number of intra threads and inter threads to get the best performance. More on intra-thread and inter-thread parallelism can be found in TensorFlow’s documentation.
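As a rough illustration of the first optimization, the sketch below shows one way such flags could be wired into a TensorFlow 1.x session. The flag names match those used in the optimized run command; the helper names (`build_parser`, `session_config_kwargs`, `make_session`) are hypothetical, and the mapping onto `tf.ConfigProto` is an assumption about how the options are consumed, not a copy of the actual pull request.

```python
import argparse


def build_parser():
    """Command-line flags for the two threading knobs (0 lets TensorFlow choose)."""
    parser = argparse.ArgumentParser(description="Threading flags sketch")
    parser.add_argument("--num_intra_threads", type=int, default=0,
                        help="Threads used inside a single op")
    parser.add_argument("--num_inter_threads", type=int, default=0,
                        help="Independent ops that may run concurrently")
    return parser


def session_config_kwargs(args):
    """Map the flags onto the field names of TensorFlow 1.x's ConfigProto."""
    return {
        "intra_op_parallelism_threads": args.num_intra_threads,
        "inter_op_parallelism_threads": args.num_inter_threads,
    }


def make_session(args):
    """Create a TF 1.x session with the requested thread pools (assumed wiring)."""
    import tensorflow as tf  # imported lazily; assumes TensorFlow 1.x is installed
    config = tf.ConfigProto(**session_config_kwargs(args))
    return tf.Session(config=config)
```

Keeping the flag-to-config mapping in its own function makes it possible to check the values without importing TensorFlow; only `make_session` needs the library.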
Second, we applied a runtime optimization by tuning the numactl command used to launch the run. In this case, we found that binding the process to a single socket and its attached memory achieves the best performance. With both optimizations, we achieved a speedup of 4.6x relative to non-optimized WaveNet^.
The run command for optimized WaveNet is shown below. The optimized branch has been submitted upstream as a Pull Request (PR) for the WaveNet model, and can be found at https://github.com/ibab/tensorflow-wavenet/pull/352/files.
numactl --physcpubind=0-7 --membind=0 python generate.py --wav_out_path=generated.wav --num_intra_threads=8 --num_inter_threads=1 --save_every 10000 --samples 16000 logdir/train/2017-02-13T16-45-34/model.ckpt-80000
In the above run command, “--physcpubind=0-7” directs the operating system to run the process on the first 8 cores of the first socket of the Intel Xeon processor, “--membind=0” restricts allocation to the memory attached to the first socket, “--num_intra_threads=8” sets the number of intra threads to 8, and “--num_inter_threads=1” sets the number of inter threads to 1.
In the Magenta WaveNet implementation, the model is applied to music generation. Steps similar to those in the above example can be followed to measure the out-of-box performance:
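The numactl prefix described above can also be assembled programmatically, which is convenient when trying several core ranges. The following is a minimal sketch; `numactl_command` is a hypothetical helper, and the program arguments shown are a shortened version of the command in the text.

```python
import shlex


def numactl_command(first_core, last_core, mem_node, program_args):
    """Prefix a command with numactl core and memory binding."""
    if first_core < 0 or last_core < first_core:
        raise ValueError("core range must satisfy 0 <= first_core <= last_core")
    return [
        "numactl",
        "--physcpubind={}-{}".format(first_core, last_core),  # cores to run on
        "--membind={}".format(mem_node),                      # NUMA node for memory
    ] + list(program_args)


# Bind to cores 0-7 and the memory of node 0, as in the command above.
cmd = numactl_command(0, 7, 0, [
    "python", "generate.py",
    "--num_intra_threads=8", "--num_inter_threads=1",
])
print(" ".join(shlex.quote(part) for part in cmd))
```

On a machine with numactl installed, the resulting list could be passed to `subprocess.run(cmd, check=True)` to launch the run.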
nsynth_generate --checkpoint_path=//wavenet-ckpt/model.ckpt-200000 \
--source_path=/ --save_path=/ --batch_size=4
The out-of-box performance of the TensorFlow Magenta WaveNet model, produced using the above run command, is also sub-optimal. With this command, TensorFlow uses its default settings for inter and intra threads. The main limitations of the default threading are that it is not optimized for the Intel Xeon architecture and that it is not tunable. Thus, as with DeepMind’s original WaveNet, we applied two optimizations: adding inter- and intra-thread support to the model, and tuning numactl in the run command.
The run command for optimized Magenta WaveNet is as follows. The optimized branch has been submitted upstream as a Pull Request on GitHub, and can be found at https://github.com/tensorflow/magenta/pull/1235.
numactl --physcpubind=0-15 --membind=0 nsynth_generate --num_inter_threads=8 --num_intra_threads=4 --checkpoint_path=//wavenet-ckpt/model.ckpt-200000 \
--source_path=/ --save_path=/ --batch_size=4
In the above run command, “--physcpubind=0-15” directs the operating system (OS) to use the first 16 cores of the first socket of the Intel® Xeon® processor (SKL) for the run, “--membind=0” tells the OS to use the memory attached to the first socket, “--num_intra_threads=4” sets the number of intra threads to 4, and “--num_inter_threads=8” sets the number of inter threads to 8.
After the two optimizations on Magenta WaveNet, we saw a speedup of 11x compared to the out-of-the-box model.
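The tuning step used for both models amounts to trying several thread settings and keeping the fastest. A minimal sketch of such a sweep is shown below. The helper names are hypothetical, the `measure` callable stands in for whatever launches and times one generation run, and speedup is assumed to be the usual ratio of baseline time to optimized time.

```python
import itertools


def sweep(measure, intra_values, inter_values):
    """Time every (intra, inter) pair and return (best_pair, all_timings).

    measure(intra, inter) must return the elapsed seconds of one run.
    """
    timings = {
        (intra, inter): measure(intra, inter)
        for intra, inter in itertools.product(intra_values, inter_values)
    }
    best = min(timings, key=timings.get)  # pair with the shortest run time
    return best, timings


def speedup(baseline_seconds, optimized_seconds):
    """Assumed definition: baseline run time divided by optimized run time."""
    return baseline_seconds / optimized_seconds
```

In practice, `measure` might launch the numactl-prefixed command with the given thread counts and return the wall-clock time; averaging several runs per setting would reduce noise.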
In this blog, we discussed simple optimizations for two TensorFlow WaveNet models: DeepMind’s original WaveNet and Magenta WaveNet. The optimizations change the model code minimally, yet yield 4.6x and 11x runtime performance improvements^ for the original WaveNet and Magenta WaveNet, respectively.
 WaveNet: A Generative Model For Raw Audio
 Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders
 A Tensorflow* implementation of DeepMind’s WaveNet paper
 NSynth: Neural Audio Synthesis
 Intel Optimization for TensorFlow Installation Guide
 TensorFlow Performance Overview
^Test configuration: Intel Xeon Gold 6148 CPU @ 2.40GHz; OS: Red Hat Enterprise Linux Server release 7.4 (Maipo); Intel® MKL-DNN version 0.14, Python* 2.7.15, GCC* 6.2.0, Intel Optimized TensorFlow 1.8 with Intel MKL support; specific TensorFlow commit: abccb5d3cb45da0d8703b526776883df2f575c87 with separate patch applied:
Notices and Disclaimers:
Intel® technologies’ features and benefits depend on system configuration and may require enabled hardware, software, or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.
Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit https://www.intel.com/benchmarks.
Performance results are based on testing as of Aug 5, 2018 and may not reflect all publicly available security updates. No product or component can be absolutely secure.
Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice Revision #201108
© Intel Corporation. Intel, the Intel® logo, Xeon and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others.