Publishers, marketers, and advertising agencies are increasingly using artificial intelligence applications via software-as-a-service (SaaS) cloud platforms. Taboola, an Intel® AI Builders member, provides its customers with custom inferencing solutions built on the TensorFlow Serving (TFS) framework.
Intel and Taboola collaborated to optimize and significantly speed up Taboola’s custom TensorFlow Serving application with the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) on Intel® Xeon® Scalable processors.
TFS is an open-source deployment service for serving machine learning models in a production environment. It is built on top of the TensorFlow deep learning framework and follows a client-server model: the server hosts a pre-trained model, and client machines send prediction requests over gRPC. On receiving a request, the server runs a forward pass through the pre-trained model and returns the result.
To measure TFS performance consistently, we set up a benchmark workflow in which 10 clients each sent 10,000 inference requests to a 2-socket system featuring Intel® Xeon® Platinum 8180 processors, and the number of recommendation requests served by the TFS server was used as the performance metric. The application was benchmarked in the following two configurations:

- Baseline: TFS built with the stock TensorFlow CPU backend (Eigen)
- Optimized: TFS built with Intel MKL-DNN enabled
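The benchmark harness itself can be sketched as follows. This is a minimal, hypothetical stand-in: the real clients issue gRPC Predict calls to the TFS server, which is replaced here with a stub (`send_inference_request`) so the sketch is runnable; the client and request counts match the setup described above.

```python
import time
from concurrent.futures import ThreadPoolExecutor

NUM_CLIENTS = 10
REQUESTS_PER_CLIENT = 10_000

def send_inference_request(client_id, request_id):
    """Placeholder for the real gRPC Predict call to the TFS server.

    In the actual benchmark each client sends a PredictRequest over
    gRPC; here we simulate one served request so the harness runs.
    """
    return 1  # one recommendation request served

def run_client(client_id):
    # Each client sends its requests sequentially, as a real gRPC
    # client with a single channel would.
    served = 0
    for i in range(REQUESTS_PER_CLIENT):
        served += send_inference_request(client_id, i)
    return served

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=NUM_CLIENTS) as pool:
    served_counts = list(pool.map(run_client, range(NUM_CLIENTS)))
elapsed = time.perf_counter() - start

total_served = sum(served_counts)
throughput = total_served / elapsed  # requests/second: the metric compared across configurations
print(total_served)  # 100000
```

With the stub swapped for a real Predict call, `throughput` is the quantity compared between the two configurations.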
When measuring performance in these two configurations, we observed that the optimized TFS delivers a 1.15x speed-up over the baseline. This improvement comes from the acceleration Intel MKL-DNN provides for the matrix-matrix multiplication (SGEMM) operations in the application. To make effective use of all 56 cores of the 2-socket system, we ran two optimized TFS instances, pinning each instance’s application threads and memory allocations to its own CPU socket and NUMA domain. This technique improved performance by 1.3x over the baseline TFS instance. Figure 3 at the end of this post shows the performance in each of these configurations.
To understand the remaining performance bottlenecks, we profiled the application with the Intel® VTune™ tool and observed that the tensor broadcast operation (a tensor is an n-dimensional array; a broadcast replicates the input tensor by a specified factor along a given dimension) is the most time-consuming function in the workflow.
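To make the broadcast operation concrete, here is a small illustration using NumPy’s `tile` as a stand-in for Eigen’s tensor broadcast (the shapes and factors are illustrative, not taken from Taboola’s model):

```python
import numpy as np

# A broadcast replicates the input tensor by a factor along a dimension.
# Here a 2x3 tensor is broadcast by a factor of 2 on its first dimension.
x = np.arange(6, dtype=np.float32).reshape(2, 3)
y = np.tile(x, (2, 1))  # analogous to Eigen's tensor.broadcast({2, 1})

print(x.shape)  # (2, 3)
print(y.shape)  # (4, 3)
```

Each output element is a copy of some input element; the cost of the operation lies entirely in computing which input element that is, and in moving the data.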
On modern Intel processors, SIMD (Single Instruction, Multiple Data) processing is essential for achieving ideal performance. However, Intel VTune profiling revealed that the Eigen implementation of the tensor broadcast operation relies heavily on scalar (non-SIMD) instructions, leading to suboptimal performance. These scalar instructions, which involve division and modulo operations, calculate the index in the input tensor from which elements are then copied to the output tensor. In addition, we observed that excess index calculations are performed when the tensor dimensions are not SIMD-friendly, i.e., not a multiple of the vector register width, which is 16 elements for the FP32 data type.
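The scalar index arithmetic can be sketched in Python; this is a simplified model of the per-element division/modulo pattern described above, not the actual Eigen code:

```python
import numpy as np

def broadcast_naive(x, factors):
    """Broadcast in the style of the scalar path: every output element
    recomputes its source index with division and modulo operations,
    one dimension at a time."""
    out_shape = tuple(d * f for d, f in zip(x.shape, factors))
    out = np.empty(out_shape, dtype=x.dtype)
    for flat in range(out.size):
        # Decompose the flat output index into per-dimension coordinates
        # (one divmod per dimension, per element).
        rem, out_coord = flat, []
        for dim in reversed(out_shape):
            rem, c = divmod(rem, dim)
            out_coord.append(c)
        out_coord.reverse()
        # Map each output coordinate back to the input with a modulo.
        src = tuple(c % d for c, d in zip(out_coord, x.shape))
        out[tuple(out_coord)] = x[src]
    return out

x = np.arange(4, dtype=np.float32).reshape(2, 2)
y = broadcast_naive(x, (2, 3))
print(y.shape)  # (4, 6)
```

Every element pays for several integer divisions, which is exactly the work a SIMD-friendly implementation can hoist out of the inner loop.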
To maximize performance, we optimized the Eigen implementation of tensor broadcast using Intel® Advanced Vector Extensions 512 (Intel® AVX-512) SIMD instructions and reduced the number of index calculations needed to form the output tensor. To evaluate the impact of these optimizations, we benchmarked the Eigen tensor broadcast operation independently of TensorFlow on a single core of the Intel processor and observed speed-ups of 58-65x (NxNx1 inputs) and 3-4x (1xNxN inputs) over the baseline. Figure 2 shows the performance comparison for a range of tensor sizes in steps of 32, with a broadcast factor of 32 on the unit dimension.
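The key idea behind the reduced index arithmetic can be sketched as follows; this is an illustrative Python model (copying whole contiguous blocks with one index calculation per block), not the AVX-512 kernel itself:

```python
import numpy as np

def broadcast_blocked(x, factor):
    """Broadcast a 2-D tensor by `factor` on the outer dimension by
    copying whole contiguous rows. One index calculation covers an
    entire block of elements -- analogous to the vectorized path,
    where one calculation covers a full SIMD-width run -- instead of
    a div/mod per element."""
    n, m = x.shape
    out = np.empty((n * factor, m), dtype=x.dtype)
    for rep in range(factor):
        # One block copy per replica; no per-element index math.
        out[rep * n:(rep + 1) * n, :] = x
    return out

x = np.arange(6, dtype=np.float32).reshape(2, 3)
y = broadcast_blocked(x, 32)
print(y.shape)  # (64, 3)
```

The actual optimization applies this idea at the vector-register level with AVX-512 loads and stores, which is where the large single-core speed-ups come from.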
Returning to the TFS application we started with, applying the tensor broadcast optimizations on top of Intel MKL-DNN and the two pinned TFS instances resulted in an overall performance improvement of 2.5x compared to baseline TFS. Figure 3 shows the performance impact of each optimization step compared to the baseline.
We generalized the tensor broadcast optimizations to N-dimensional tensors with a unit-size innermost or outermost dimension and an arbitrary broadcast factor on that unit dimension, then upstreamed the code improvements to the public distribution of Eigen (available in TensorFlow release 1.10).
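As a concrete illustration of the generalized case, consider an NxNx1 input broadcast on its unit innermost dimension, again using NumPy’s `tile` as a stand-in for Eigen’s broadcast (shapes and factor are illustrative):

```python
import numpy as np

factor = 32
x = np.arange(9, dtype=np.float32).reshape(3, 3, 1)  # NxNx1 input
y = np.tile(x, (1, 1, factor))  # broadcast on the unit dimension

# Every input element expands into `factor` contiguous copies in
# row-major memory, so an optimized kernel can fill each run with
# vector stores instead of recomputing div/mod indices per element.
print(y.shape)  # (3, 3, 32)
```

This contiguity on the unit dimension is what makes these shapes amenable to the vectorized path, and it is why SIMD-friendly model dimensions pay off.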
Intel optimizations to TensorFlow Serving deliver significant performance gains and helped Taboola reduce the latency of its recommendation services on Intel Xeon Scalable processors. Ariel Pisetzky, Vice President of Information Technology at Taboola, praised the Intel optimizations to their infrastructure, stating, “Serving from the CPUs helped us reduce costs, increase efficiency, and provide more content recommendations with our existing servers.” Intel continues to improve the performance of the deep learning software stack for infrastructure teams at companies such as Taboola and other major customers. We also encourage the community to use SIMD-friendly parameters in their machine learning models for optimal performance.
As a co-sponsor of The Artificial Intelligence Conference in San Francisco from September 4-7, we look forward to showing you the latest innovations in applied AI. Intel keynotes and sessions will share practical AI use cases and provide the technical knowledge needed to help develop and implement successful AI applications across a variety of industries today. Visit us at booth # 101 to see how Intel is breaking barriers between model and reality.