Accelerating INT8 Inference Performance for Recommender Systems

Most inference applications today require low latency, high memory bandwidth, and large compute capacity. With the increasing use and growing memory footprint of the recommender systems that make up 50-60% of all inference workloads in the data center [1], [2], these requirements are expected to become even more demanding. Intel® Xeon® Scalable processors continue to offer strong inference value for recommendation systems, especially for sparse models with memory footprints too large to fit into an accelerator. Recently, Intel researchers demonstrated that deep learning inference can be performed with lower numerical precision, using 8-bit multipliers with minimal to no loss in accuracy. Lower numerical precision has two main benefits. First, many operations are memory bandwidth-bound, so reducing precision enables better cache usage and eases bandwidth bottlenecks. Second, the hardware can deliver more operations per second (OPS) at lower numerical precision, since these multipliers require less silicon area and power.

In this article, we describe INT8 data type acceleration using Intel® Deep Learning Boost (Intel® DL Boost), available in 2nd Generation Intel® Xeon® Scalable processors, the only microprocessor with built-in AI inference acceleration. The 2nd Gen Intel Xeon Scalable processor family includes the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instruction set with 512-bit wide Fused Multiply Add (FMA) core instructions. These instructions perform lower numerical precision multiplies with higher precision accumulates, and Intel DL Boost (the AVX-512 Vector Neural Network Instructions, or VNNI) provides this multiply-accumulate sequence as embedded acceleration to speed up low-precision inference. Further, Intel provides optimized software support with libraries such as the Intel® Deep Neural Network Library (Intel® DNNL) that take direct advantage of these CPU features.
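To make the arithmetic concrete, here is a minimal NumPy sketch (not actual intrinsics) emulating the semantics of the VNNI vpdpbusd instruction: each 32-bit accumulator lane receives the sum of four unsigned 8-bit activation values multiplied by four signed 8-bit weight values. The function name and toy values below are ours, for illustration only.

import numpy as np

def vnni_dot_accumulate(acc_s32, a_u8, w_s8):
    # Emulate the AVX-512 VNNI fused multiply-add (vpdpbusd): groups of four
    # unsigned 8-bit activations are multiplied by four signed 8-bit weights
    # and the products are summed into a 32-bit accumulator lane.
    assert a_u8.size == w_s8.size and a_u8.size % 4 == 0
    products = a_u8.astype(np.int32) * w_s8.astype(np.int32)
    return acc_s32 + products.reshape(-1, 4).sum(axis=1)

# Two 32-bit lanes, each accumulating four INT8 products.
acc = np.zeros(2, dtype=np.int32)
a = np.array([10, 20, 30, 40, 50, 60, 70, 80], dtype=np.uint8)
w = np.array([1, -2, 3, -4, 5, -6, 7, -8], dtype=np.int8)
print(vnni_dot_accumulate(acc, a, w))  # [-100, -260]

On hardware, this fused form replaces the multi-instruction INT8 sequence needed on earlier Intel AVX-512 processors, which is where the per-cycle throughput gain comes from.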

We also describe how to quantize the model weights and activations, and how the lower numerical precision functions available in Intel DNNL efficiently accelerate the Wide and Deep learning recommender model [3] using Intel DL Boost. The embedding lookup portion of the model, which typically has a large memory footprint, can take advantage of the high memory bandwidth and capacity available in Intel Xeon Scalable processors. The compute-intensive neural network portion (the fully-connected layers) benefits from the accelerated low-precision (INT8) performance provided by Intel DL Boost. We describe how the model can be optimized for the best performance on the dataset under consideration. Further, Intel DNNL provides general matrix multiply (GEMM) functions that take INT8 activation values and INT8 weight values and produce INT32 results. We explain how the fully connected layers in the deep portion of the Wide and Deep model are quantized to use these DNNL functions for accelerated inference performance.
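Below is a minimal NumPy sketch of the quantization idea, assuming symmetric per-tensor scaling; the actual implementation uses Intel DNNL primitives and framework-level quantization, where per-channel scales and unsigned activations after ReLU are common refinements. The helper names are hypothetical.

import numpy as np

def quantize_symmetric(x_fp32, num_bits=8):
    # Symmetric linear quantization: map the FP32 range to signed INT8
    # using a single scale for the whole tensor.
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    scale = np.abs(x_fp32).max() / qmax
    q = np.clip(np.round(x_fp32 / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def int8_fully_connected(a_fp32, w_fp32):
    # Quantize activations and weights, multiply in INT8 with INT32
    # accumulation (what the INT8 GEMM does), then rescale back to FP32.
    a_q, a_scale = quantize_symmetric(a_fp32)
    w_q, w_scale = quantize_symmetric(w_fp32)
    acc_s32 = a_q.astype(np.int32) @ w_q.astype(np.int32)  # INT32 accumulate
    return acc_s32 * (a_scale * w_scale)                   # dequantize

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 16)).astype(np.float32)  # batch of activations
w = rng.standard_normal((16, 8)).astype(np.float32)  # FC-layer weights
print(np.max(np.abs(int8_fully_connected(a, w) - a @ w)))  # small quantization error

Because the products are accumulated in INT32, a single scale per tensor (or per output channel) is enough to map the GEMM output back to FP32, so quantization error comes only from rounding the inputs.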

We show that Intel DL Boost provides a 2x inference performance improvement with INT8 compared to FP32 precision, while keeping the accuracy loss below 0.5% [4]. This is demonstrated for low-batch-size use cases, which are typical of recommender systems, on popular machine learning frameworks such as TensorFlow and MXNet. For full details on how we achieved this performance improvement, please read the complete article. Follow us on Twitter for more updates from our AI research team.
