Scalable Methods for 8-bit Training of Neural Networks

MaryT_Intel · ‎12-20-2019

Quantized neural networks (QNNs) are regularly used to improve network efficiency in deep learning. Though there has been much research into different quantization schemes, the number of bits required and the best quantization scheme are still unknown.

In a paper that I co-authored with Itay Hubara, Elad Hoffer, and Daniel Soudry of the Technion, Israel Institute of Technology, we find that much of the deep learning training process is not materially affected by substantial precision reduction, and we identify the small handful of specific operations that require higher precision. Additionally, we introduce Range Batch-Normalization (Range BN), a method that has a significantly higher tolerance to quantization noise and improved computational complexity in comparison to traditional batch-normalization. We are excited to present this research at the 2018 Conference on Neural Information Processing Systems (NeurIPS).

To the best of our knowledge, this is the first study to quantize the weights, activations, and a substantial volume of the gradient stream in all layers (including batch normalization) to 8-bit, while showing state-of-the-art results over the ImageNet-1K dataset. No earlier research has succeeded in quantizing this amount of the network to 8-bit precision without accuracy degradation. These results point to an opportunity to accelerate the execution of deep learning training and inference while still maintaining accuracy. In addition, by reducing precision requirements from 32-bit floating point to 8-bit precision, we immediately get a reduction in memory and power due to the increased efficiency of computing at 8-bit precision.

Obstacles to Rapid Deep Neural Network Training

The exciting results and versatility of deep neural networks (DNNs) have made them a go-to approach for a broad array of machine learning applications. However, as networks grow more complex, their training becomes more computationally costly, with the main contributor being the massive number of multiply-accumulate operations (MACs) required to compute the weighted sums of the neurons’ inputs and the parameters’ gradients.

There has been much research into compressing fully-trained neural networks by using weight sharing, low-rank approximation, quantization, pruning, or some combination of these methods[1] [2] [3]. Quantizing neural network gradients provides an opportunity to yield faster training machines, as network training requires approximately three times more computing power than network evaluation[4]. Earlier studies[5] [6] found 16-bit precision to be sufficient for network training, but further quantization (for example, to 8-bit) has resulted in severe accuracy degradation.
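The "approximately three times" figure can be illustrated by counting multiply-accumulate operations for a single fully connected layer. The sketch below is only an illustration under simple assumptions: the function name linear_layer_macs and the layer sizes are hypothetical, and it ignores biases, activation functions, and normalization layers.

```python
def linear_layer_macs(batch, d_in, d_out):
    """Count MACs for one fully connected layer y = x @ W (hypothetical helper).

    x has shape (batch, d_in), W has shape (d_in, d_out).
    """
    forward = batch * d_in * d_out       # y   = x @ W        (evaluation)
    grad_input = batch * d_out * d_in    # g_x = g_y @ W.T    (training only)
    grad_weight = d_in * batch * d_out   # g_W = x.T @ g_y    (training only)
    return {
        "inference": forward,
        "training": forward + grad_input + grad_weight,
    }


counts = linear_layer_macs(batch=32, d_in=512, d_out=256)
ratio = counts["training"] / counts["inference"]  # three matmuls versus one
```

Each of the three matrix multiplications has the same cost, so a training step for this layer needs three times the MACs of a forward pass.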

Our work is the first to train almost exclusively at 8-bit precision without reducing accuracy. We have achieved this by addressing two primary obstacles to numerical stability: batch normalization and gradient computations.

Range Batch-Normalization (Range BN)

Traditional batch-normalization[7] requires the sum of squares, square-root, and reciprocal operations, all of which require high precision (to avoid zero variance) and a large dynamic range. Earlier attempts to lower the precision of networks either did not use batch normalization layers[8] or kept those layers at full precision[9].

Instead, we replace the batch normalization operation with range batch normalization (Range BN) that normalizes inputs by the range of the input distribution (i.e., max(x) - min(x)). We found this approach to be more suitable for low-precision implementations, with experiments on ImageNet* with Res18* and Res50* showing no distinguishable accuracy difference between Range BN and traditional batch-normalization.
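A minimal NumPy sketch of the idea follows. It assumes a 2-D input of shape (batch, features) and omits the learnable scale and shift. The scaling constant C(n) = 1/sqrt(2 ln n) is an assumption that is not stated in this post; it is intended to make the range comparable to the standard deviation for roughly Gaussian inputs. The function name range_batch_norm and the eps value are hypothetical.

```python
import numpy as np


def range_batch_norm(x, eps=1e-5):
    """Normalize each feature by its range instead of its standard deviation.

    x: array of shape (n, d), one row per sample in the batch.
    """
    n = x.shape[0]
    centered = x - x.mean(axis=0, keepdims=True)
    # Range of the centered inputs: max(x) - min(x), computed per feature.
    value_range = centered.max(axis=0, keepdims=True) - centered.min(axis=0, keepdims=True)
    # Assumed scale adjustment C(n); not given in this post.
    c_n = 1.0 / np.sqrt(2.0 * np.log(n))
    return centered / (c_n * value_range + eps)


rng = np.random.default_rng(0)
batch = rng.standard_normal((64, 8))
normalized = range_batch_norm(batch)
```

Note that the only reductions needed are a mean, a max, and a min. There is no sum of squares, square root of a variance, or division by a near-zero variance estimate, which is what makes this form friendlier to low-precision arithmetic.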

Gradients Bifurcation

Given an upstream gradient g_l from layer l, layer l - 1 needs to apply two different matrix multiplications: one for the layer gradient g_{l-1} and the other for the weight gradient g_W, which are needed for the update rule. We found that the statistics of the gradient g_l violate the assumptions that are core to common quantization schemes. This may be the main cause of the degradation in performance often seen when quantizing these gradients.

We suggest using two versions of the layer gradients g_l: one with low precision (8-bit) and another with higher precision (16-bit). All g_l calculations that are not a performance bottleneck can be kept at 16 bits, with the rest occurring at 8 bits. We call this approach Gradients Bifurcation. This approach allows one to reduce the accumulation of noise (since some of the operations are done in 16-bit precision) without interrupting the propagation of the layer gradients, whose computation is on the critical path.
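The bifurcation can be sketched for the backward pass of one fully connected layer. This is a simulation under stated assumptions, not the implementation from the paper: quantization is emulated in floating point with a simple symmetric uniform quantizer, which this post does not specify, and the choice to quantize the weights and activations to 8 bits in both products is also an assumption. The names fake_quantize and linear_backward_bifurcated are hypothetical.

```python
import numpy as np


def fake_quantize(t, num_bits):
    """Simulate symmetric uniform quantization (assumed scheme); returns floats."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = np.max(np.abs(t))
    if max_abs == 0:
        return t.copy()
    scale = max_abs / qmax
    return np.round(t / scale) * scale


def linear_backward_bifurcated(x, w, g_l):
    """Backward pass of y = x @ w with two versions of the upstream gradient.

    x: (n, d_in) activations, w: (d_in, d_out) weights, g_l: (n, d_out).
    """
    g_l_8 = fake_quantize(g_l, 8)    # low-precision copy, on the critical path
    g_l_16 = fake_quantize(g_l, 16)  # higher-precision copy, off the critical path
    # Layer gradient: must be ready before the previous layer can proceed.
    g_prev = g_l_8 @ fake_quantize(w, 8).T
    # Weight gradient: only needed for the update rule, so it can use 16 bits.
    g_w = fake_quantize(x, 8).T @ g_l_16
    return g_prev, g_w


rng = np.random.default_rng(0)
x = rng.standard_normal((16, 32))
w = rng.standard_normal((32, 8))
g_l = rng.standard_normal((16, 8))
g_prev, g_w = linear_backward_bifurcated(x, w, g_l)
```

The layer gradient g_prev is what the next step of backpropagation waits on, so it takes the cheaper 8-bit path, while the weight gradient g_w tolerates the slower 16-bit path because nothing downstream in the backward pass depends on it.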

Accelerating Deep Learning on Intel® Architecture

We are excited to discuss these and other cutting-edge findings with our peers and colleagues at the 2018 Conference on Neural Information Processing Systems (NeurIPS). For more on our research, please review our paper, titled “Scalable Methods for 8-bit Training of Neural Networks,” look for Intel AI at the 2018 NeurIPS conference, and stay tuned to https://ai.intel.com and @IntelAIDev on Twitter.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.

Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

[1] Chen, W., Wilson, J., Tyree, S., Weinberger, K., and Chen, Y. Compressing neural networks with the hashing trick. In International Conference on Machine Learning, pp. 2285–2294, 2015.

[2] Ullrich, K., Meeds, E., and Welling, M. Soft weight-sharing for neural network compression. arXiv preprint arXiv:1702.04008, 2017.

[3] Jaderberg, M., Vedaldi, A., and Zisserman, A. Speeding up convolutional neural networks with low-rank expansions. arXiv preprint arXiv:1405.3866, 2014.

[4] Training involves three types of matrix multiplications. One type of matrix multiplication is related to the network evaluation (forward pass). The other two types of matrix multiplications correspond to the backward pass, which is relevant only to training. Therefore, training needs three types of matrix multiplication while inference (evaluation) requires only one type of matrix multiplication.

[5] Gupta, S., Agrawal, A., Gopalakrishnan, K., and Narayanan, P. Deep learning with limited numerical precision. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 1737–1746, 2015.

[6] Das, D., Mellempudi, N., Mudigere, D., et al. Mixed precision training of convolutional neural networks using integer operations. 2018.

[7] Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

[8] Wu, S., Li, G., Chen, F., and Shi, L. Training and inference with integers in deep neural networks. International Conference on Learning Representations (ICLR), 2018.

[9] Zhou, S., Ni, Z., Zhou, X., et al. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.