Addressing the Memory Bottleneck in AI Model Training

Healthcare workloads, particularly in medical imaging, can require far more memory than other AI workloads because they often operate on high-resolution 3D images. Many researchers are surprised to find just how much memory these models consume during both training and inference. In a recent publication, Intel, Dell, and the University of Florida demonstrate how the large memory capacity of a 2nd generation Intel® Xeon® Scalable processor-based server allows researchers to efficiently train and deploy medical imaging models for brain tumor segmentation that use almost a terabyte (TB) of RAM (Figure 1).

Figure 1: Benchmarking the memory usage of 3D U-Net model-training over various input tensor sizes on a 2nd generation Intel® Xeon® Scalable processor-based server with 1.5 TB system RAM. Source: https://downloads.dell.com/manuals/common/dellemc_overcoming_memory_bottleneck_ai_healthcare.pdf

For convolutional neural networks (CNNs), the size of the activation maps varies with the size of the input image. As the input image grows, the activation maps can reach a memory footprint many times larger than the model's weights and biases. For training, one way to handle this is to distribute the computation across multiple machines and cores. However, with access to 1.5 TB of DDR4 RAM and an additional 6 TB per socket of Intel® Optane™ DC Persistent Memory, the 2nd generation Intel Xeon Scalable processor minimizes the need for this workaround. Instead, researchers can use the full capacity of RAM without any modifications to their code.
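
As a rough illustration of why the activation maps, rather than the weights, dominate the footprint, the sketch below tallies forward-pass activation memory for a hypothetical 3D U-Net-style encoder as the input volume grows. The layer counts and filter sizes are illustrative assumptions, not the exact topology benchmarked in Figure 1, and real training adds the decoder, gradients, and optimizer state on top of this estimate.

```python
def unet3d_activation_bytes(input_shape, base_filters=32, levels=4, dtype_bytes=4):
    """Back-of-the-envelope estimate of encoder activation memory for a
    3D U-Net-style CNN (hypothetical topology, float32 activations)."""
    d, h, w = input_shape
    total = 0
    filters = base_filters
    for _ in range(levels):
        # Two convolutions per level, each producing a (d, h, w, filters) activation map.
        total += 2 * d * h * w * filters * dtype_bytes
        # 2x2x2 pooling halves each spatial dimension; the filter count doubles.
        d, h, w = max(d // 2, 1), max(h // 2, 1), max(w // 2, 1)
        filters *= 2
    return total

for size in (64, 128, 240, 320):
    gb = unet3d_activation_bytes((size, size, size)) / 1e9
    print(f"{size}^3 input -> ~{gb:.1f} GB of forward activations (encoder only)")
```

Even this simplified estimate grows by roughly two orders of magnitude between a 64^3 and a 320^3 input, which is why full-resolution 3D training quickly outgrows the memory of typical accelerators.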

Intel’s software and hardware optimizations also provide significant speed-ups when training these large-memory models. Training the 3D U-Net model was 3.4x faster with Intel-optimized TensorFlow 1.11 than with the standard, unoptimized TensorFlow 1.11 build (Figure 2). By scaling this single-node, memory-rich configuration to a multi-node CPU cluster with data-parallel methods, researchers can expect even more efficient training performance for their most demanding real-world use cases. In fact, Intel demonstrated training a similar 3D U-Net model using a data-parallel method at the 2018 Supercomputing Conference (SC18).
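
The optimized build is a drop-in replacement for stock TensorFlow (for example, via `pip install intel-tensorflow`), and most of the tuning consists of thread and affinity settings for the MKL-DNN/DNNL backend. A minimal TensorFlow 1.x sketch is shown below; the thread counts are illustrative assumptions and should be matched to the physical cores per socket on your system.

```python
import os

# Intel-recommended environment settings for the MKL-DNN/DNNL backend.
# The values are illustrative; tune them to the physical core count per socket.
os.environ["OMP_NUM_THREADS"] = "28"
os.environ["KMP_BLOCKTIME"] = "1"
os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"

import tensorflow as tf  # import after setting the OpenMP variables

config = tf.ConfigProto(
    intra_op_parallelism_threads=28,  # threads available to each individual op
    inter_op_parallelism_threads=2,   # independent ops that may run concurrently
)

with tf.Session(config=config) as sess:
    # Build and train the 3D U-Net graph here.
    ...
```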

Figure 2: Intel-optimized TensorFlow 1.11 with DNNL provides a 3.4x improvement in training time compared to stock, unoptimized TensorFlow 1.11 for the 3D U-Net model. Source: https://downloads.dell.com/manuals/common/dellemc_overcoming_memory_bottleneck_ai_healthcare.pdf

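For the multi-node, data-parallel scaling mentioned above, the SC18 demonstration used a Horovod-style approach. The sketch below shows that pattern with the TensorFlow 1.x API; the tiny single-layer model and synthetic data are placeholders standing in for a real 3D U-Net and training pipeline.

```python
import numpy as np
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # one process per node (or per socket), launched via mpirun/horovodrun

# Tiny stand-in for a 3D segmentation model: a single 3D convolution layer.
volumes = tf.placeholder(tf.float32, [None, 32, 32, 32, 1])
labels = tf.placeholder(tf.float32, [None, 32, 32, 32, 1])
logits = tf.layers.conv3d(volumes, filters=1, kernel_size=3, padding="same")
loss = tf.losses.sigmoid_cross_entropy(labels, logits)

# Scale the learning rate with the worker count (a common data-parallel heuristic),
# then wrap the optimizer so gradients are all-reduced across workers.
opt = tf.train.AdamOptimizer(1e-4 * hvd.size())
opt = hvd.DistributedOptimizer(opt)
train_op = opt.minimize(loss)

hooks = [hvd.BroadcastGlobalVariablesHook(0)]  # rank 0 broadcasts initial weights

with tf.train.MonitoredTrainingSession(hooks=hooks) as sess:
    for _ in range(10):  # each worker would train on its own shard of the data
        x = np.random.rand(1, 32, 32, 32, 1).astype(np.float32)
        y = (np.random.rand(1, 32, 32, 32, 1) > 0.5).astype(np.float32)
        sess.run(train_op, feed_dict={volumes: x, labels: y})
```

Each worker runs the same script on its own shard of the data, and Horovod's all-reduce keeps the model replicas in sync after every step.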

For inference, other techniques can be applied to memory-intensive AI models. For example, the Max Planck Institute recently worked with Intel to run inference on a full 3D dataset for a 3D brain imaging model. The key achievement of the project was reducing the original 24 TB memory requirement by a factor of 16 through efficient memory reuse enabled by the Intel® Distribution of OpenVINO™ toolkit. As a result, AI inference on each image required only 1.5 TB of RAM, and processing took less than an hour, compared to 24 hours in initial tests.
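
The project's exact pipeline is described in the linked report, but the general pattern, streaming a large volume through a single, reused network and buffer rather than materializing everything at once, can be sketched with the OpenVINO Inference Engine Python API. The model files, input shape, and tiling scheme below are placeholder assumptions, and the calls shown are from the IECore-based releases of the toolkit.

```python
import numpy as np
from openvino.inference_engine import IECore

# Placeholder paths to an IR model produced by the OpenVINO Model Optimizer.
ie = IECore()
net = ie.read_network(model="brain_model.xml", weights="brain_model.bin")
exec_net = ie.load_network(network=net, device_name="CPU")

input_name = next(iter(net.input_info))
output_name = next(iter(net.outputs))
_, _, d, h, w = net.input_info[input_name].input_data.shape  # e.g. [1, 1, 64, 64, 64]

# Iterate over the full 3D dataset in sub-volumes, reusing the same loaded network
# and output buffer instead of holding the entire result set in memory at once.
full_volume = np.random.rand(512, 512, 512).astype(np.float32)  # stand-in data
segmentation = np.zeros_like(full_volume)

for z in range(0, full_volume.shape[0], d):
    for y in range(0, full_volume.shape[1], h):
        for x in range(0, full_volume.shape[2], w):
            tile = full_volume[z:z + d, y:y + h, x:x + w]
            if tile.shape != (d, h, w):
                continue  # skip ragged edge tiles in this simplified sketch
            result = exec_net.infer({input_name: tile[np.newaxis, np.newaxis]})
            segmentation[z:z + d, y:y + h, x:x + w] = result[output_name][0, 0]
```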

We invite you to check out our Intel-optimized TensorFlow installation today and see how easy it is to develop deep learning models for high resolution 3D images.

Notices and Disclaimers: