Optimized NLP/Deep Attention Matching Model in Baidu’s PaddlePaddle

Natural language processing (NLP) is a subset of artificial intelligence (AI) technologies that focuses on enabling computers to understand and process human language. Baidu, a leading Chinese Internet and AI services company, supports over 100 applications through NLP technologies, with some modules being called more than 100 billion times per day. One example of Baidu's use of NLP is its online customer-support chatbot, which is powered by a Deep Attention Matching (DAM) network model developed by Baidu engineers and based on attention mechanisms. One important task of a chatbot is response selection: choosing the best-matched response from a set of candidates given the context of a conversation.

PaddlePaddle (Parallel Distributed Deep Learning) is a deep learning framework developed by Baidu and widely used in Baidu's online and offline services and products. As Baidu seeks to integrate its chatbot with the PaddlePaddle framework, Baidu and Intel engineers worked together to optimize the performance of the DAM model on Intel® architecture. Following software optimizations by Intel, the performance gains on an Intel® Xeon® Gold 6148 processor-based system running PaddlePaddle* are shown in Table 1.

 

Latency (per sample)   PP baseline (ms)   PP best-fit optimization (ms)   Gain
Batch size = 1         174.22             62.35                           2.79x
Batch size = 300       169.85             56.46                           3.01x

 

Table 1: DAM model inference latency (per sample). Configuration: Intel® Xeon® Gold 6148 CPU @ 2.40GHz. Environment configuration is OMP_NUM_THREADS=1. Baseline benchmark was tested on November 8, 2018 by Intel Corporation. Optimization benchmark was tested on December 12, 2018 by Intel Corporation. Please refer to notices and disclaimers for complete testing configuration.

Model Profiling and Analysis

Intel worked with Baidu to support optimized and intelligent services based on Intel® architecture. In this case, we started the DAM model optimization by analyzing the most time-consuming operators (or "hotspots"). As shown in Figure 1, these were layer_norm, softmax, stack and conv3d. These were our first priority for optimization, as together they accounted for more than 80% of the model's execution time.
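
For illustration, the kind of per-op timing behind such a hotspot profile can be sketched as follows (a toy harness only, not PaddlePaddle's built-in profiler; the op names and sleep durations are stand-ins):

```cpp
// Toy hotspot profiler: accumulate wall-clock time per op name and print totals.
#include <chrono>
#include <cstdio>
#include <map>
#include <string>
#include <thread>

std::map<std::string, double> g_op_ms;  // total milliseconds per op name

template <typename Fn>
void timed_op(const std::string& name, Fn&& fn) {
  auto t0 = std::chrono::steady_clock::now();
  fn();
  auto t1 = std::chrono::steady_clock::now();
  g_op_ms[name] += std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
  // Stand-ins for real ops; each simply sleeps for a different duration.
  timed_op("layer_norm", [] { std::this_thread::sleep_for(std::chrono::milliseconds(30)); });
  timed_op("softmax",    [] { std::this_thread::sleep_for(std::chrono::milliseconds(20)); });
  timed_op("stack",      [] { std::this_thread::sleep_for(std::chrono::milliseconds(10)); });
  for (const auto& kv : g_op_ms)
    std::printf("%-12s %8.2f ms\n", kv.first.c_str(), kv.second);
  return 0;
}
```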

Figure 1: Initial DAM hotspots profile analysis showing the most time-consuming operators.


We followed the overall structure of Baidu’s DAM network model to analyze these operators.

  • Representation: Representation consists of a repeatable attentive module (see Figure 2; a standard formulation of the underlying attention operation is sketched after this list) that captures semantic dependencies at both the word and sentence level. The layer_norm op is used inside this repeatable module to prevent vanishing or exploding gradients, and its calculation is relatively complex.

    Figure 2: Attentive Module[1].

  • Matching: Each utterance and the response are matched against each other using segment-segment similarity matrices, which are stacked to form the input of a 3D convolution. The stack op appears in this module and is a memory-level (memory-bound) operation.
  • Aggregation: Finally, DAM aggregates all the segmental matching degrees across each utterance and response into a high-dimensional 3D matching image Q; two layers, conv3d followed by pool3d, are used at the end of the network, as shown in Figure 3.

    Figure 3: Aggregation. 3D Matching image as the input of convolution 3d[1].
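
For reference, the attentive module builds on scaled dot-product attention. A standard formulation is shown below; the exact composition used in DAM (sub-layers, normalization, masking) follows reference [1] and may differ in detail:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```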

Optimizing Operations through Workload Acceleration

The Intel® Math Kernel Library (Intel® MKL), Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) and Intel® Advanced Vector Extensions (Intel® AVX) all contribute to machine-learning workload acceleration. Choosing optimizations wisely can produce the largest performance gains at the op-level, as shown in Table 3 and Figure 4.

Figure 4: Op total time comparison between baseline and optimization. Configuration: Intel® Xeon® Gold 6148 CPU @ 2.40GHz. Environment configuration is OMP_NUM_THREADS=1. Baseline benchmark was tested on November 8, 2018 by Intel Corporation. Optimization benchmark was tested on December 12, 2018 by Intel Corporation. Please refer to notices and disclaimers for complete testing configuration.


After optimizing these operators, the per-sample latency of the DAM model improved by roughly 2.4x (batch size 1) to 2.5x (batch size 300). Table 4 shows the model's performance gain after the op-level optimizations.

 

Batch size   Baseline (ms)   Best-fit optimization (ms)   Gain
1            174.22          73.49                        2.37x
300          169.85          67.83                        2.50x

 

Table 4: Model performance gain with the "best-fit" op-level optimizations. Configuration: Intel® Xeon® Gold 6148 CPU @ 2.40GHz. Environment configuration is OMP_NUM_THREADS=1. Baseline benchmark was tested on November 8, 2018 by Intel Corporation. Optimization benchmark was tested on December 12, 2018 by Intel Corporation. Please refer to notices and disclaimers for complete testing configuration.

Library utilization optimization (PR#14437): Softmax

After profiling the softmax op implementation, we found that over 50% of the softmax execution time was spent in the "exp" computation and about 30% in the subsequent summation and elementwise division. We therefore targeted the "exp" computation followed by the summing and elementwise dividing. Intel MKL provides BLAS and vector math routines that optimize these two parts.
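
As a rough illustration (a minimal sketch, not the actual PaddlePaddle PR code), a softmax row can be computed with Intel MKL routines such as vsExp (vectorized exponential), cblas_sasum (sum), and cblas_sscal (scaling by the reciprocal of the sum):

```cpp
// Minimal sketch: one softmax row computed with Intel MKL routines.
// Build against MKL, e.g.: g++ softmax_mkl.cpp -lmkl_rt
#include <mkl.h>
#include <algorithm>
#include <cstdio>
#include <vector>

void softmax_row(const float* x, float* y, MKL_INT n) {
  // Subtract the row maximum for numerical stability before exponentiating.
  const float max_val = *std::max_element(x, x + n);
  std::vector<float> shifted(n);
  for (MKL_INT i = 0; i < n; ++i) shifted[i] = x[i] - max_val;

  vsExp(n, shifted.data(), y);             // y[i] = exp(x[i] - max), vectorized
  const float sum = cblas_sasum(n, y, 1);  // sum of the (positive) exponentials
  cblas_sscal(n, 1.0f / sum, y, 1);        // elementwise division: y[i] /= sum
}

int main() {
  const float x[4] = {1.0f, 2.0f, 3.0f, 4.0f};
  float y[4];
  softmax_row(x, y, 4);
  for (float v : y) std::printf("%.4f ", v);  // prints probabilities summing to 1
  std::printf("\n");
  return 0;
}
```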

Complex calculation op optimization (PR#14417): Layer Normalization

Equation 1 shows the calculation of layer normalization. Intel MKL and Intel MKL-DNN have no directly optimized math function for this calculation. Although modern compilers produce well-optimized assembly code, we have found that hand-written Intel AVX vector instructions improve many deep learning primitives. By using vector instructions directly, we improved the performance of layer normalization by about 7x, as shown in Figure 4.
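
For reference, the standard layer normalization formula (which Equation 1 presumably depicts; the gain γ, bias β, and small constant ε follow the usual definitions, and the exact PaddlePaddle form may differ slightly) is:

```latex
\mu = \frac{1}{H}\sum_{i=1}^{H} x_i, \qquad
\sigma^2 = \frac{1}{H}\sum_{i=1}^{H} (x_i - \mu)^2, \qquad
y_i = \gamma_i \,\frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta_i
```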


Equation 1: Layer normalization.

 
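A minimal sketch of this kind of AVX implementation (illustrative only, assuming AVX2; the scale γ and shift β are omitted for brevity, and the actual PaddlePaddle kernel differs):

```cpp
// Illustrative AVX2 layer normalization over one row of length n
// (gamma/beta omitted); not the actual PaddlePaddle kernel.
#include <immintrin.h>
#include <cmath>

static float hsum256(__m256 v) {
  // Horizontal sum of the 8 lanes of an AVX register.
  __m128 lo = _mm256_castps256_ps128(v);
  __m128 hi = _mm256_extractf128_ps(v, 1);
  __m128 s  = _mm_add_ps(lo, hi);
  s = _mm_hadd_ps(s, s);
  s = _mm_hadd_ps(s, s);
  return _mm_cvtss_f32(s);
}

void layer_norm_row(const float* x, float* y, int n, float eps = 1e-5f) {
  // Pass 1: accumulate sum and sum of squares, 8 floats at a time.
  __m256 vsum = _mm256_setzero_ps(), vsq = _mm256_setzero_ps();
  int i = 0;
  for (; i + 8 <= n; i += 8) {
    __m256 v = _mm256_loadu_ps(x + i);
    vsum = _mm256_add_ps(vsum, v);
    vsq  = _mm256_add_ps(vsq, _mm256_mul_ps(v, v));
  }
  float sum = hsum256(vsum), sq = hsum256(vsq);
  for (; i < n; ++i) { sum += x[i]; sq += x[i] * x[i]; }  // scalar tail
  const float mean    = sum / n;
  const float var     = sq / n - mean * mean;
  const float inv_std = 1.0f / std::sqrt(var + eps);

  // Pass 2: normalize.
  __m256 vmean = _mm256_set1_ps(mean), vinv = _mm256_set1_ps(inv_std);
  for (i = 0; i + 8 <= n; i += 8) {
    __m256 v = _mm256_loadu_ps(x + i);
    _mm256_storeu_ps(y + i, _mm256_mul_ps(_mm256_sub_ps(v, vmean), vinv));
  }
  for (; i < n; ++i) y[i] = (x[i] - mean) * inv_std;
}
```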

Memory-bound operation optimization (PR#14488): Stack

The stack op stacks all of its inputs along one axis; it is essentially a memory copy operation. To optimize such memory-bound operators, the main idea is to decrease the number of memory reads and writes in two ways: 1) make the most of already-allocated memory, and 2) use optimized memory functions. In this case, we refactored the stack implementation around the "memcpy" function, yielding the performance gain shown in Figure 4.
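
A minimal sketch of the idea for the simplest case (stacking N equally sized, contiguous float inputs along a new leading axis; the actual PaddlePaddle op handles arbitrary axes and data types):

```cpp
// Stacking N equally sized float blocks along a new leading axis is,
// in the contiguous case, just N bulk copies into one output buffer.
#include <cstring>
#include <vector>

// inputs: N pointers, each to a contiguous block of `numel` floats.
// output: pre-allocated buffer of N * numel floats.
void stack_axis0(const std::vector<const float*>& inputs,
                 float* output, size_t numel) {
  for (size_t i = 0; i < inputs.size(); ++i) {
    // One large memcpy per input instead of element-by-element copies.
    std::memcpy(output + i * numel, inputs[i], numel * sizeof(float));
  }
}
```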

Use Intel MKL-DNN to Further Optimize 3D Convolution

Enhance the conv3d operation with Intel MKL-DNN

Based on our profiling results, 3D convolution takes up about 9% of the model's execution time. Intel MKL-DNN is an open source performance library for accelerating deep learning, and convolution in particular. We therefore used Intel MKL-DNN to enhance conv3d performance. With the help of Intel MKL-DNN, we achieved an almost 4x performance gain for 3D convolution on our Intel® Xeon® processor E5-2650 v4 based platform, as shown in Table 5.

 

Op       PP with Intel MKL, baseline total time (ms)   PP with Intel MKL-DNN, optimized total time (ms)   Gain
conv3d   17140.9                                       4334.55                                            3.95x

 

Table 5: conv3d op total time comparison before and after optimization. DAM model, Batch size=1. Configuration: Intel® Xeon® Gold 6148 CPU @ 2.40GHz. Environment configuration is OMP_NUM_THREADS=1. Baseline benchmark was tested on November 8, 2018 by Intel Corporation. Optimization benchmark was tested on December 12, 2018 by Intel Corporation. Please refer to notices and disclaimers for complete testing configuration.

 

Batch size   DAM with MKL conv3d, latency per sample (ms)   DAM with MKL-DNN conv3d, latency per sample (ms)   Gain
1            73.49                                          62.35                                              15.16%
300          67.83                                          56.46                                              16.76%

 

Table 6: Model performance gain with the conv3d optimization. Configuration: Intel® Xeon® Gold 6148 CPU @ 2.40GHz. Environment configuration is OMP_NUM_THREADS=1. Baseline benchmark was tested on November 8, 2018 by Intel Corporation. Optimization benchmark was tested on December 12, 2018 by Intel Corporation. Please refer to notices and disclaimers for complete testing configuration.

Fuse ops to further reduce framework overhead

In PaddlePaddle, a convolution with bias and an ELU activation is calculated with three operators: the conv3d op, the elementwise_add op, and the elu op. Since Intel MKL-DNN supports convolution fused with bias and ELU, we can fuse these three operations into a single conv3d op that computes the convolution with bias and ELU. This helps decrease framework overhead.
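
A conceptual sketch of why this fusion helps (plain C++ for illustration, not the Intel MKL-DNN API): the bias add and ELU are applied while the convolution output is still being produced, rather than in two additional passes over the whole tensor.

```cpp
// Conceptual illustration of op fusion: bias add and ELU are applied in the
// same loop that touches the convolution output, avoiding two extra passes.
#include <cmath>
#include <cstddef>

// conv_out: convolution results laid out as [channels][spatial] (contiguous).
// bias: one value per output channel. ELU: x if x > 0, alpha*(exp(x)-1) otherwise.
void fused_bias_elu(float* conv_out, const float* bias,
                    size_t channels, size_t spatial, float alpha = 1.0f) {
  for (size_t c = 0; c < channels; ++c) {
    float* row = conv_out + c * spatial;
    for (size_t i = 0; i < spatial; ++i) {
      float v = row[i] + bias[c];                       // fused bias add
      row[i] = v > 0.0f ? v : alpha * std::expm1(v);    // fused ELU activation
    }
  }
}
```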

After applying all of these optimizations, operators accounting for 95% of the model's execution time (by time proportion) run through an optimized acceleration library or routine, as shown in the following table:

 

Op names Time proportion in model Optimization
fc 27% Intel® MKL GEMM
softmax 18% Intel® MKL BLAS
layer norm 15% Math JIT
conv3d 14% Intel® MKL-DNN
matmul 13% Intel® MKL Batch GEMM
elementwise add 5% Intel® MKL VADD
stack 3% Memcpy

 

Table 7: List of operators with the "best-fit" optimizations applied.

Key Takeaways and Further Thinking

Intel has developed a variety of framework optimizations, tools and software libraries to improve deep learning performance. Relying on one library or one method doesn’t always produce the best performance. For different operators, we chose the best ways to optimize rather than applying one type of optimization for all.

The goal of graph fusion is to minimize unnecessary calculation and memory access. If a fusion reduces time-consuming computation or memory traffic, it is worth doing; if not, it can be skipped. For the latest information on performance optimizations from the Intel AI team, follow us on @IntelAIResearch.

Notices and Disclaimers