Vector Neural Network Instructions Enable Int8 AI Inference on Intel Architecture

Medical imaging analysis. Natural language processing. Investigating science’s most challenging questions. Organizations around the world are choosing Intel® architecture for the AI compute they need. 2nd Generation Intel® Xeon® Scalable processors, the only microprocessors with built-in AI inference acceleration, have the versatility to excel at workloads such as analytics, high performance computing, and business-critical databases that are now adding AI capabilities. Intel architecture’s enduring excellence for these workloads means it is well understood by developers and readily available in their infrastructure, which allows organizations to achieve faster time to value by running AI inference on their existing IT investment.

With Intel® Deep Learning Boost (Intel® DL Boost), our 2nd Gen Intel Xeon Scalable processors provide a better platform for AI than ever before, boosting throughput for inference applications by up to 14x[1] in comparison to the first generation of Intel Xeon Scalable processors. In my talk today at the AI Conference in New York, I’ll dive deep into Intel DL Boost’s Vector Neural Network Instructions (VNNI) and how they improve AI performance by combining three instructions into one — thereby maximizing the use of compute resources, utilizing the cache better and avoiding potential bandwidth bottlenecks. Based on Intel® Advanced Vector Extensions 512 (Intel® AVX-512), VNNI speeds the delivery of inference results – and potentially, critical insights. Please read on for an introduction to VNNI and join me at the AI Conference if you’d like to learn more.

How Vector Neural Network Instructions Work

VNNI can be thought of as AI inference acceleration built into every 2nd Gen Intel Xeon Scalable processor. Their benefits are best demonstrated by comparing them to the similar instructions used in our previous generation of Intel Xeon Scalable processors, as shown below.

Most deep learning applications today use 32-bit floating point precision for their training and inference workloads. In the previous generation of Intel Xeon Scalable processors, the convolution operations that dominate neural network workloads were implemented in the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) using the FP32 data type via the vfmadd231ps instruction in the Intel® AVX-512 instruction set. Intel Xeon Scalable processors were the first Intel Xeon CPUs to include Intel AVX-512, with up to two 512-bit FMA units per core computing in parallel, enabling two vfmadd231ps instructions to execute in a given cycle.
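
To make that concrete, here is a minimal, illustrative sketch of an FP32 multiply-accumulate loop written with AVX-512 intrinsics; the compiler lowers _mm512_fmadd_ps to vfmadd231ps. The function and array names are my own for illustration, and this is a simplified stand-in for the FP32 convolution inner loop, not Intel MKL-DNN’s actual code.

```cpp
#include <immintrin.h>
#include <cstddef>

// Illustrative only: accumulate c[i] += a[i] * b[i] across 512-bit lanes.
// The fused multiply-add below compiles to vfmadd231ps (requires AVX-512F).
void fp32_fma_accumulate(const float* a, const float* b, float* c, std::size_t n) {
    // Process 16 FP32 elements (512 bits) per iteration; assumes n is a multiple of 16.
    for (std::size_t i = 0; i < n; i += 16) {
        __m512 va   = _mm512_loadu_ps(a + i);
        __m512 vb   = _mm512_loadu_ps(b + i);
        __m512 vacc = _mm512_loadu_ps(c + i);
        vacc = _mm512_fmadd_ps(va, vb, vacc);  // vacc = va * vb + vacc
        _mm512_storeu_ps(c + i, vacc);
    }
}
```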

More recently, the Int8 data type has been used successfully for deep learning inference, delivering a significant boost in performance with little loss of accuracy. Int8 uses 8 bits to represent integer data: 7 bits of magnitude and a sign bit. FP32, by contrast, uses 32 bits to represent floating point data: 23 bits of mantissa, 8 bits of exponent, and a sign bit. This reduction in the number of bits when Int8 is used for inference improves both memory and compute utilization, since less data is transferred and more values are processed per instruction. Previous-generation Intel Xeon Scalable processors implemented Int8 convolution operations in Intel MKL-DNN with the Intel AVX-512 instructions vpmaddubsw, vpmaddwd, and vpaddd. Although this improved performance compared with FP32 convolution, the need for three instructions per Int8 multiply-accumulate, combined with the microarchitectural limit of two 512-bit instructions per clock cycle, left room for further innovation.
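
Expressed with AVX-512 intrinsics, that three-instruction Int8 multiply-accumulate pattern looks roughly like the sketch below (requires AVX-512BW). The function and variable names are illustrative; this is not Intel MKL-DNN’s actual kernel code.

```cpp
#include <immintrin.h>

// Illustrative only: the pre-VNNI Int8 multiply-accumulate sequence.
// a holds unsigned 8-bit activations, b holds signed 8-bit weights,
// acc holds 16 lanes of signed 32-bit accumulators.
__m512i int8_dot_accumulate_avx512bw(__m512i acc, __m512i a, __m512i b) {
    const __m512i ones = _mm512_set1_epi16(1);
    // vpmaddubsw: multiply u8 x s8 pairs and add adjacent products into 16-bit results
    // (note that this step saturates its 16-bit intermediate sums)
    __m512i prod16 = _mm512_maddubs_epi16(a, b);
    // vpmaddwd: multiply the 16-bit results by 1 and add adjacent pairs into 32-bit results
    __m512i prod32 = _mm512_madd_epi16(prod16, ones);
    // vpaddd: add the 32-bit partial sums into the accumulator
    return _mm512_add_epi32(acc, prod32);
}
```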

In 2nd Gen Intel Xeon Scalable processors with VNNI, convolutions in Intel® MKL-DNN run in Int8 precision via a single vpdpbusd Intel AVX-512 instruction. Because the low-precision operation now uses one instruction instead of three, two of these instructions can be executed in a given cycle. Reduced precision and a single fused instruction make better use of the microarchitecture for each convolution operation in a neural network and bring significant performance benefits.

Intel AVX-512 (VNNI) instruction to accelerate Int8 convolutions: vpdpbusd
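
For comparison, here is the same accumulation using the VNNI intrinsic _mm512_dpbusd_epi32, which maps to vpdpbusd. Again, the function name is illustrative and this is only a sketch:

```cpp
#include <immintrin.h>

// Illustrative only: with VNNI, the three-instruction sequence shown earlier
// collapses into a single vpdpbusd, which multiplies u8 x s8 pairs, sums each
// group of four products, and adds the result into the 32-bit accumulator.
__m512i int8_dot_accumulate_vnni(__m512i acc, __m512i a, __m512i b) {
    return _mm512_dpbusd_epi32(acc, a, b);  // acc += dot4(u8 a, s8 b), per 32-bit lane
}
```

With recent GCC or Clang, building code like this typically requires enabling AVX512_VNNI support, for example with a target such as -march=cascadelake.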

Neural network inference requires the weights from a trained model to perform forward propagation. These weights are typically stored in FP32 precision during training, since floating point data types such as FP32 help maintain accuracy and ensure convergence. To take advantage of low-precision inference, the FP32 weights from the trained model are converted to Int8 through a process called quantization. This conversion from a floating point data type to an integer data type may cost some accuracy. So how can we gain the benefits of the Int8 data type in inference without sacrificing accuracy?

After training, we collect statistics for the activations in order to find an appropriate quantization factor, and then use that factor to perform post-training quantization for 8-bit inference. In addition, a technique called quantization-aware training inserts “fake” quantization into the network during training, so that the FP32 weights are quantized to Int8 at each iteration after the weight update. In some cases, quantization-aware training yields slightly better accuracy.
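
As a rough sketch of the post-training path, the function below performs simple symmetric quantization: it derives a scale from the largest absolute weight value and rounds each FP32 weight to Int8. Real toolchains derive the range from calibration statistics (such as activation histograms) and often use per-channel scales, so treat the name and scheme here as simplified assumptions rather than any framework’s actual API.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative only: symmetric post-training quantization of FP32 weights to Int8.
// After quantization, FP32 value ~= scale * Int8 value.
std::vector<int8_t> quantize_symmetric(const std::vector<float>& w, float& scale) {
    float max_abs = 0.0f;
    for (float v : w) max_abs = std::max(max_abs, std::fabs(v));
    scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;

    std::vector<int8_t> q(w.size());
    for (std::size_t i = 0; i < w.size(); ++i) {
        float r = std::round(w[i] / scale);
        r = std::min(127.0f, std::max(-127.0f, r));  // clamp to the symmetric Int8 range
        q[i] = static_cast<int8_t>(r);
    }
    return q;
}
```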

You can realize the performance benefits of VNNI on 2nd Gen Intel Xeon Scalable processors with these quantization techniques via the Intel® Distribution of OpenVINO™ toolkit or Intel-optimized frameworks such as TensorFlow* and PyTorch*.

Benefits of Vector Neural Network Instructions

With VNNI, low-precision inference is possible on the processors that so many organizations already trust for so many other tasks, so AI capabilities can be integrated more easily alongside other workloads on versatile, multi-purpose 2nd Gen Intel Xeon Scalable processors. Performance can also improve significantly for both batch inference and real-time inference, because Vector Neural Network Instructions reduce the number of instructions required for each convolution operation, which in turn reduces the compute power and memory accesses those operations require.

Learn More at O’Reilly AI NYC

If you’re interested in learning more about VNNI, Intel DL Boost, and Intel’s wider, edge-to-cloud technology portfolio for AI, please attend my session Understanding and Integrating Intel Deep Learning Boost on Wednesday, April 17th, at 4:05pm at O’Reilly AI NYC. Please also stay tuned to intel.ai and follow along on Twitter at @IntelAI.

Acknowledgements: Akhilesh Kumar, Nagib Hakim, Vikram Saletore, Andres Rodriguez, Evarist Fomenko, Indu Kalyanaraman, Ramesh AG, Emily Hutson

Notices and Disclaimers