Addressing the Memory Bottleneck in AI Model Training

Healthcare workloads, particularly in medical imaging, can require far more memory than other AI workloads because they often operate on high-resolution 3D images. Many researchers are surprised to find just how much memory these models consume during both training and inference. In a recent publication, Intel, Dell, and the University of Florida demonstrate how the large memory capacity of a 2nd generation Intel® Xeon® Scalable processor-based server allows researchers to efficiently train and deploy medical imaging models for brain tumor segmentation that use almost a terabyte (TB) of RAM (Figure 1).

Figure 1: Benchmarking the memory usage of 3D U-Net model-training over various input tensor sizes on a 2nd generation Intel® Xeon® Scalable processor-based server with 1.5 TB system RAM. Source: https://downloads.dell.com/manuals/common/dellemc_overcoming_memory_bottleneck_ai_healthcare.pdf

For convolutional neural networks (CNNs), the size of the activation maps varies with the size of the input image. As the input image grows, the activation maps can reach a memory footprint many times larger than the model's weights and biases. For training, one way to handle this is to distribute the computation across multiple machines and cores. However, with access to 1.5 TB of DDR4 RAM and an additional 6 TB per socket of Intel® Optane™ DC Persistent Memory, the 2nd generation Intel Xeon Scalable processor minimizes the need for this workaround. Instead, researchers can use the full capacity of RAM without any modifications to their code.
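
As a rough illustration of why the activation maps, rather than the weights, dominate the footprint, the sketch below tallies forward-pass activation memory for a hypothetical 3D U-Net-style encoder as the input volume grows. The layer counts and filter sizes are illustrative assumptions, not the exact topology benchmarked in Figure 1, and real training adds the decoder, gradients, and optimizer state on top of this estimate.

```python
def unet3d_activation_bytes(input_shape, base_filters=32, levels=4, dtype_bytes=4):
    """Back-of-the-envelope estimate of encoder activation memory for a
    3D U-Net-style CNN (hypothetical topology, float32 activations)."""
    d, h, w = input_shape
    total = 0
    filters = base_filters
    for _ in range(levels):
        # Two convolutions per level, each producing a (d, h, w, filters) activation map.
        total += 2 * d * h * w * filters * dtype_bytes
        # 2x2x2 pooling halves each spatial dimension; the filter count doubles.
        d, h, w = max(d // 2, 1), max(h // 2, 1), max(w // 2, 1)
        filters *= 2
    return total

for size in (64, 128, 240, 320):
    gb = unet3d_activation_bytes((size, size, size)) / 1e9
    print(f"{size}^3 input -> ~{gb:.1f} GB of forward activations (encoder only)")
```

Even this simplified estimate grows by roughly two orders of magnitude between a 64^3 and a 320^3 input, which is why full-resolution 3D training quickly outgrows the memory of typical accelerators.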

Intel’s software and hardware optimizations also provide significant speed-ups when training these large-memory models. Training the 3D U-Net model was 3.4x faster with Intel-optimized TensorFlow 1.11 than with the standard, unoptimized TensorFlow 1.11 build (Figure 2). By scaling this single-node, memory-rich configuration to a multi-node CPU cluster with data-parallel methods, researchers can expect even more efficient training performance for their most demanding real-world use cases. In fact, Intel demonstrated training a similar 3D U-Net model using a data-parallel method at the 2018 Supercomputing Conference (SC18).
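
The optimized build is a drop-in replacement for stock TensorFlow (for example, via `pip install intel-tensorflow`), and most of the tuning consists of thread and affinity settings for the MKL-DNN/DNNL backend. A minimal TensorFlow 1.x sketch is shown below; the thread counts are illustrative assumptions and should be matched to the physical cores per socket on your system.

```python
import os

# Intel-recommended environment settings for the MKL-DNN/DNNL backend.
# The values are illustrative; tune them to the physical core count per socket.
os.environ["OMP_NUM_THREADS"] = "28"
os.environ["KMP_BLOCKTIME"] = "1"
os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"

import tensorflow as tf  # import after setting the OpenMP variables

config = tf.ConfigProto(
    intra_op_parallelism_threads=28,  # threads available to each individual op
    inter_op_parallelism_threads=2,   # independent ops that may run concurrently
)

with tf.Session(config=config) as sess:
    # Build and train the 3D U-Net graph here.
    ...
```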

Figure 2: Intel-optimized TensorFlow 1.11 with DNNL provides a 3.4x improvement in training time compared to stock, unoptimized TensorFlow 1.11 for the 3D U-Net model. Source: https://downloads.dell.com/manuals/common/dellemc_overcoming_memory_bottleneck_ai_healthcare.pdf

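For the multi-node, data-parallel scaling mentioned above, the SC18 demonstration used a Horovod-style approach. The sketch below shows that pattern with the TensorFlow 1.x API; the tiny single-layer model and synthetic data are placeholders standing in for a real 3D U-Net and training pipeline.

```python
import numpy as np
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # one process per node (or per socket), launched via mpirun/horovodrun

# Tiny stand-in for a 3D segmentation model: a single 3D convolution layer.
volumes = tf.placeholder(tf.float32, [None, 32, 32, 32, 1])
labels = tf.placeholder(tf.float32, [None, 32, 32, 32, 1])
logits = tf.layers.conv3d(volumes, filters=1, kernel_size=3, padding="same")
loss = tf.losses.sigmoid_cross_entropy(labels, logits)

# Scale the learning rate with the worker count (a common data-parallel heuristic),
# then wrap the optimizer so gradients are all-reduced across workers.
opt = tf.train.AdamOptimizer(1e-4 * hvd.size())
opt = hvd.DistributedOptimizer(opt)
train_op = opt.minimize(loss)

hooks = [hvd.BroadcastGlobalVariablesHook(0)]  # rank 0 broadcasts initial weights

with tf.train.MonitoredTrainingSession(hooks=hooks) as sess:
    for _ in range(10):  # each worker would train on its own shard of the data
        x = np.random.rand(1, 32, 32, 32, 1).astype(np.float32)
        y = (np.random.rand(1, 32, 32, 32, 1) > 0.5).astype(np.float32)
        sess.run(train_op, feed_dict={volumes: x, labels: y})
```

Each worker runs the same script on its own shard of the data, and Horovod's all-reduce keeps the model replicas in sync after every step.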

For inference, other techniques can be applied to memory-intensive AI models. For example, the Max Planck Institute recently worked with Intel to run inference on a full 3D dataset for a 3D brain imaging model. The key achievement of the project was reducing the original 24 TB memory requirement by a factor of 16 through efficient memory reuse enabled by the Intel® Distribution of OpenVINO™ toolkit. As a result, AI inference on each image required only 1.5 TB of RAM, and processing took less than an hour, compared to 24 hours in initial tests.
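
The project's exact pipeline is described in the linked report, but the general pattern, streaming a large volume through a single, reused network and buffer rather than materializing everything at once, can be sketched with the OpenVINO Inference Engine Python API. The model files, input shape, and tiling scheme below are placeholder assumptions, and the calls shown are from the IECore-based releases of the toolkit.

```python
import numpy as np
from openvino.inference_engine import IECore

# Placeholder paths to an IR model produced by the OpenVINO Model Optimizer.
ie = IECore()
net = ie.read_network(model="brain_model.xml", weights="brain_model.bin")
exec_net = ie.load_network(network=net, device_name="CPU")

input_name = next(iter(net.input_info))
output_name = next(iter(net.outputs))
_, _, d, h, w = net.input_info[input_name].input_data.shape  # e.g. [1, 1, 64, 64, 64]

# Iterate over the full 3D dataset in sub-volumes, reusing the same loaded network
# and output buffer instead of holding the entire result set in memory at once.
full_volume = np.random.rand(512, 512, 512).astype(np.float32)  # stand-in data
segmentation = np.zeros_like(full_volume)

for z in range(0, full_volume.shape[0], d):
    for y in range(0, full_volume.shape[1], h):
        for x in range(0, full_volume.shape[2], w):
            tile = full_volume[z:z + d, y:y + h, x:x + w]
            if tile.shape != (d, h, w):
                continue  # skip ragged edge tiles in this simplified sketch
            result = exec_net.infer({input_name: tile[np.newaxis, np.newaxis]})
            segmentation[z:z + d, y:y + h, x:x + w] = result[output_name][0, 0]
```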

We invite you to check out our Intel-optimized TensorFlow installation today and see how easy it is to develop deep learning models for high resolution 3D images.

Notices and Disclaimers: