Softmax Optimizations for Deep Learning Frameworks on Intel® Xeon® Scalable Processors

Softmax is a function used for classification problems in machine learning. It has been broadly applied to image classification in deep learning, where its execution time is small compared with that of convolutions, and it is now being adopted more frequently in natural language processing (NLP) models. However, without performance optimization, the softmax function can add noticeable computation cost to these models.
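
For reference, softmax maps an input vector x of length N to a probability distribution:

    \mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{N} e^{x_j}}, \qquad i = 1, \dots, N

The per-element exponentials and the normalizing division are exactly the operations targeted by the optimizations described below.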

This blog is based on a paper recently authored by Intel AI researchers Jacek Czaja, Michal Gallus, and Tomasz Patejko and Baidu researcher Jian Tang that presents a methodology for optimizing the Softmax function. The goal of the project was to learn whether Softmax could be optimized to deliver equivalent, and possibly higher, performance through better utilization of a processor’s computing resources. Testing revealed that the methods developed to optimize Softmax did, in fact, produce performance gains.

Starting with Single-thread and PaddlePaddle*

This discussion centers on improvements to the Softmax operation on x86-64 architectures, in particular Intel® Xeon® Scalable processors. Efforts were limited to single-thread execution, since the optimization process generally starts with exploiting all the capabilities of a single core.

Testing focused on inference, with a deep attention matching (DAM) model and Baidu’s PaddlePaddle* as the deep learning platform. The Intel® Xeon® Platinum 8180 processor served as the single-core hardware platform.

PaddlePaddle, an open source deep learning framework, provides profiling functionality that reports the execution time of individual operators, which was critical for obtaining performance results for Softmax execution. While optimizing Softmax, the team referred to PaddlePaddle profiling to track the performance of both the Softmax operator and the overall DAM model. Profiling the operations inside Softmax to identify the most time-consuming ones showed that execution of the exponential function takes a significant share of the time.

Performance Improvements

Throughout the optimization process, algorithmic modifications were performed to decrease execution time. A key consideration was how best to spare developers the effort of low-level optimizations for the most common mathematical algorithms.

Exponential computations and elementwise division were replaced with BLAS functions provided by the Intel® Math Kernel Library (Intel® MKL). While the PaddlePaddle baseline code employs Eigen, a fast and elegant library, Intel MKL provides implementations optimized for the x86-64 architecture, and for Intel Xeon processors in particular, making it an effective alternative. The remaining Eigen code was replaced with a hand-crafted implementation.

Intel MKL functions, accompanied by hand-crafted code, produced a performance improvement of about 2X. [1] Figure 1 provides details on the code used to achieve this speedup.

Figure 1: Intel® MKL-based Implementation
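
As an illustration only (this is not the code from Figure 1), a single softmax row can be computed with Intel MKL by using the vector math routine vsExp for the exponentials and the BLAS routines cblas_sasum and cblas_sscal for the sum and the normalization; the function name below is hypothetical:

    // Hypothetical sketch of an Intel MKL-based softmax over one row of n floats.
    #include <mkl.h>
    #include <algorithm>

    void softmax_row_mkl(const float* in, float* out, MKL_INT n) {
      // Subtract the row maximum for numerical stability.
      const float max_val = *std::max_element(in, in + n);
      for (MKL_INT i = 0; i < n; ++i) out[i] = in[i] - max_val;

      // Vectorized exponentials via the Intel MKL vector math functions.
      vsExp(n, out, out);

      // All exponentials are positive, so the absolute-value sum equals the sum.
      const float sum = cblas_sasum(n, out, 1);

      // Replace elementwise division with a single BLAS scaling by 1/sum.
      cblas_sscal(n, 1.0f / sum, out, 1);
    }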

The team went further, improving code that had not already been replaced by Intel MKL. Several vector-related operations were optimized by taking advantage of OpenMP. The OpenMP simd directive by itself (a hint that a loop can be vectorized) did not provide much of a performance boost, although it can reduce code size, since the compiler does not have to generate multiple versions of the loop once such hints are provided. However, OpenMP simd combined with a reduction clause decreased execution time significantly.
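
As an illustration only (this is not the code from Figure 2), the exponentials and their sum can be computed in a single loop whose accumulation is vectorized with an OpenMP simd reduction; the function name and signature below are hypothetical:

    // Hypothetical sketch: compute shifted exponentials and their sum in one pass.
    // The OpenMP simd reduction clause lets the compiler vectorize both the
    // element-wise exponentials and the accumulation of the sum.
    #include <cmath>

    float exp_and_sum(const float* in, float* out, int n, float max_val) {
      float sum = 0.0f;
      #pragma omp simd reduction(+ : sum)
      for (int i = 0; i < n; ++i) {
        out[i] = std::exp(in[i] - max_val);  // shift by the row maximum for stability
        sum += out[i];
      }
      return sum;
    }

Compiling such code requires OpenMP 4.0 support (for example, -fopenmp with GCC or -qopenmp with the Intel compiler).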

Line 28 of the code in Figure 2 shows the modification the team introduced. This optimization brought an additional 5% reduction in execution time. [1]

Figure 2: Intel® MKL and OpenMP simd-based Implementation

The full paper includes further details on the results of this work on vectorization, including compiler investigations. More detailed information on OpenMP vectorization is also available.

Performance Conclusions

Once it was demonstrated that execution time could be improved, the team sought to find out whether additional work, in this case further Softmax optimization, could extend the improvement. In the context of the DAM model, Softmax was replaced with a memory copying routine (memcpy). The hypothesis was that if the Softmax and memcpy times were close, then the algorithm was likely bound by memory throughput, and further performance gains would be unlikely. As it turned out, the baseline implementation, which was not fully vectorized, was far from memory-bound. Figure 3 shows that the optimized Softmax execution in the DAM model is 2X faster than the original implementation. This optimization improves the performance of the entire DAM model by over 15%. [1]

Figure 3: Softmax Implementations Performance Comparison [1]
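
The memory-throughput check can be illustrated with a short, self-contained sketch (illustrative only; the naive softmax and the buffer sizes below are hypothetical, not the DAM model's): timing a plain memcpy of the same data gives a rough floor on how fast any implementation that reads and writes the buffer can run.

    // Hypothetical sketch: compare a naive softmax against memcpy on the same
    // buffers to estimate how close the kernel is to the memory-throughput floor.
    #include <algorithm>
    #include <chrono>
    #include <cmath>
    #include <cstdio>
    #include <cstring>
    #include <vector>

    static void naive_softmax(const float* in, float* out, int n) {
      float max_val = in[0];
      for (int i = 1; i < n; ++i) max_val = std::max(max_val, in[i]);
      float sum = 0.0f;
      for (int i = 0; i < n; ++i) { out[i] = std::exp(in[i] - max_val); sum += out[i]; }
      for (int i = 0; i < n; ++i) out[i] /= sum;  // elementwise division
    }

    int main() {
      const int rows = 100000, cols = 128;  // hypothetical sizes
      std::vector<float> in(static_cast<size_t>(rows) * cols, 0.5f), out(in.size());

      auto time_ms = [](auto&& f) {
        auto t0 = std::chrono::steady_clock::now();
        f();
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double, std::milli>(t1 - t0).count();
      };

      const double softmax_ms = time_ms([&] {
        for (int r = 0; r < rows; ++r)
          naive_softmax(in.data() + static_cast<size_t>(r) * cols,
                        out.data() + static_cast<size_t>(r) * cols, cols);
      });
      const double memcpy_ms = time_ms([&] {
        std::memcpy(out.data(), in.data(), in.size() * sizeof(float));
      });
      std::printf("softmax: %.2f ms, memcpy: %.2f ms\n", softmax_ms, memcpy_ms);
      return 0;
    }

A large gap between the two times, as the team observed for the baseline, indicates that the kernel is compute-bound and that better vectorization can still help.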

This finding underscores the conclusion that performance can be increased through better utilization of the processor’s computing resources. Specifically, the gains came from efficient Intel MKL implementations and more effective vectorization.

Given that Softmax is a popular deep learning primitive, these optimizations have been upstreamed into the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) and are publicly available as a reference for implementation. The team believes that the optimizations presented here could be transferred to other deep learning frameworks such as TensorFlow and PyTorch, and encourages further deep learning optimizations for CPUs.

For more information, review the full report of this work, Softmax Optimizations for Intel® Xeon® Processor-based Platforms. To access the Intel Math Kernel Library for Deep Neural Networks (Intel MKL-DNN), a performance library for deep learning, go to: https://intel.github.io/mkl-dnn/index.html. For more AI research from Intel, follow @IntelAIDev and @IntelAI on Twitter, and visit ai.intel.com.

Notices and Disclaimers

Most of our work was upstreamed into PaddlePaddle and Intel MKL-DNN projects on GitHub. All quoted Pull Requests within this article are related to the PaddlePaddle GitHub repository.

Optimizations of Softmax using a direct implementation in assembly language are not part of the PaddlePaddle and Intel MKL-DNN repositories. For measuring performance, we created an integration branch. The experiments were executed using commit 28bba75d9108026f236c312813caf5ba72a6aabe of the integration branch and the following commands:

OMP_NUM_THREADS=1 ./paddle/fluid/inference/tests/api/test_analyzer_dam \
    --infer_model=third_party/inference_demo/dam/model/ \
    --infer_data=third_party/inference_demo/dam/data.txt \
    --gtest_filter=Analyzer_dam.profile --batch_size=1 \
    --test_all_data=true --num_threads=1 --use_analysis=false --profile
echo " ===> Batch 8"
OMP_NUM_THREADS=1 ./paddle/fluid/inference/tests/api/test_analyzer_dam \
    --infer_model=third_party/inference_demo/dam/model/ \
    --infer_data=third_party/inference_demo/dam/data.txt \
    --gtest_filter=Analyzer_dam.profile --batch_size=8 \
    --test_all_data=true --num_threads=1 --use_analysis=false --profile
echo " ===> Batch 32"
OMP_NUM_THREADS=1 ./paddle/fluid/inference/tests/api/test_analyzer_dam \
    --infer_model=third_party/inference_demo/dam/model/ \
    --infer_data=third_party/inference_demo/dam/data.txt \
    --gtest_filter=Analyzer_dam.profile --batch_size=32 \
    --test_all_data=true --num_threads=1 --use_analysis=false --profile
echo " ===> Batch 128"
OMP_NUM_THREADS=1 ./paddle/fluid/inference/tests/api/test_analyzer_dam \
    --infer_model=third_party/inference_demo/dam/model/ \
    --infer_data=third_party/inference_demo/dam/data.txt \
    --gtest_filter=Analyzer_dam.profile --batch_size=128 \
    --test_all_data=true --num_threads=1 --use_analysis=false --profile