AWS Launches New Amazon EC2 C5 Instances Featuring Intel® DL Boost Technology

Last month, AWS announced three new C5 instances (c5.12xlarge, c5.24xlarge, and c5.metal), all featuring custom 2nd Generation Intel® Xeon® Scalable processors (code-named Cascade Lake) with a sustained all-core turbo frequency of 3.6 GHz, a maximum single-core turbo frequency of 3.9 GHz, and Intel® Deep Learning Boost (Intel® DL Boost) technology enabled. Intel DL Boost refers to a group of acceleration features, including the new Vector Neural Network Instructions (VNNI), that speed up deep learning operations such as convolution and GEMM by replacing FP32 FMA computation with INT8 at quadruple the throughput. This improves inference performance across a wide range of deep learning workloads without requiring re-training. Table 1 provides more details on C5 instance types, with the three new instances highlighted. Because the price per vCPU remains unchanged, these new instances are a compelling choice for deep learning inference applications.
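For intuition on where the quadrupled throughput comes from: a 512-bit AVX-512 register holds 16 FP32 values but 64 INT8 values, and a single VNNI instruction (VPDPBUSD) multiplies 64 INT8 pairs and accumulates the products into 16 INT32 lanes, whereas an FP32 FMA instruction performs 16 multiply-adds. A quick sketch of the arithmetic (these are standard AVX-512 facts, not measurements from this article):

```python
# Back-of-the-envelope arithmetic behind the "quadrupled throughput" claim.

REGISTER_BITS = 512                 # width of an AVX-512 vector register

fp32_lanes = REGISTER_BITS // 32    # 16 FP32 values per register
int8_lanes = REGISTER_BITS // 8     # 64 INT8 values per register

# One FP32 FMA instruction performs a multiply-add in each FP32 lane.
fp32_macs_per_instr = fp32_lanes    # 16 multiply-accumulates

# One VNNI instruction (VPDPBUSD) multiplies 64 INT8 pairs and
# accumulates the products into 16 INT32 lanes, all in one instruction.
vnni_macs_per_instr = int8_lanes    # 64 multiply-accumulates

speedup = vnni_macs_per_instr / fp32_macs_per_instr
print(speedup)  # 4.0 -- the theoretical peak behind "quadrupled throughput"
```

Real-world gains are lower than this 4x peak because workloads also spend time on memory traffic and non-FMA operations, which is consistent with the 2.6x to 3.8x results reported below.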

Model | vCPU | Memory (GiB) | Instance Storage | Network Bandwidth (Gbps) | EBS Bandwidth (Mbps)
c5.large | 2 | 4 | EBS-Only | Up to 10 | Up to 3,500
c5.xlarge | 4 | 8 | EBS-Only | Up to 10 | Up to 3,500
c5.2xlarge | 8 | 16 | EBS-Only | Up to 10 | Up to 3,500
c5.4xlarge | 16 | 32 | EBS-Only | Up to 10 | 3,500
c5.9xlarge | 36 | 72 | EBS-Only | 10 | 7,000
c5.12xlarge | 48 | 96 | EBS-Only | 12 | 7,000
c5.18xlarge | 72 | 144 | EBS-Only | 25 | 14,000
c5.24xlarge | 96 | 192 | EBS-Only | 25 | 14,000
c5.metal | 96 | 192 | EBS-Only | 25 | 14,000

Table 1. C5 instance types with the 3 new instances highlighted.

You can start using these new instances today in the following regions: US East (N. Virginia), US West (Oregon), Europe (Frankfurt, Ireland, London, Paris, Stockholm), Asia Pacific (Sydney), and AWS GovCloud (US).

Quick facts

Using MXNet as an example of a typical deep learning framework, Intel DL Boost with VNNI on the new c5.24xlarge instance can boost the performance of popular image classification and object detection models by 2.6x to 3.8x (Figure 1) with minimal or no accuracy loss (Figure 2). For more details on how to speed up inference workloads using VNNI, please see our recent work on Model Quantization for Production-Level Neural Network Inference.

Figure 1: MXNet inference speed up w/ vs. w/o VNNI on the new c5.24xlarge instance. Performance results are based on testing as of 1st July 2019 by AWS. Please see appendix for configuration details.

Figure 2. VNNI can do the same object detection efficiently without sacrificing accuracy.

An Introduction to Deep Learning Inference

In deep learning, inference is the stage at which a pretrained neural network model is deployed to perform a wide variety of tasks, including speech detection, image classification, object detection, and other predictions. Inference is especially important for enterprises because it is the stage of the analytics pipeline where their production-level data is turned into valuable insights. Huge numbers of inference requests from end users are constantly routed to cloud servers all over the world. Recent studies show that major data centers currently rely heavily on CPUs for inference, and rapid growth in machine learning is predicted across existing and new data center and cloud services. Since a majority of data centers run on CPUs today, it is critical that inference workloads perform efficiently on them.

What is VNNI — And How Does it Work?

Various researchers have demonstrated that both deep learning training and inference can be performed at lower numerical precision, using 16-bit multipliers for training and 8-bit multipliers for inference, with minimal to no loss in accuracy. These lower-precision schemes (training with 16-bit multipliers accumulated to 32 bits, and inference with 8-bit multipliers accumulated to 32 bits) are likely to become the standard over the next year. VNNI is an ISA embodiment of this approach: it extends the Intel® AVX-512 instruction set to support vectorized INT8 FMA at four times the throughput of FP32 FMA.
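The 8-bit-multiply, 32-bit-accumulate scheme described above can be sketched in a few lines of NumPy. This is a hypothetical illustration of symmetric per-tensor INT8 quantization with INT32 accumulation; the toy data and function names are ours, not an API from this article:

```python
import numpy as np

# Toy FP32 vectors standing in for a row of weights and activations.
rng = np.random.default_rng(0)
a_fp32 = rng.standard_normal(256).astype(np.float32)
b_fp32 = rng.standard_normal(256).astype(np.float32)

def quantize_int8(x):
    """Map FP32 values into [-127, 127] using a symmetric per-tensor scale."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

a_q, a_scale = quantize_int8(a_fp32)
b_q, b_scale = quantize_int8(b_fp32)

# INT8 multiplies accumulated in INT32 -- the pattern VNNI accelerates.
acc_int32 = np.dot(a_q.astype(np.int32), b_q.astype(np.int32))

# Rescale back to FP32 and compare against the full-precision result.
approx = float(acc_int32) * a_scale * b_scale
exact = float(np.dot(a_fp32, b_fp32))
print(approx, exact)  # close, illustrating "minimal to no loss in accuracy"
```

The rescale step works because a symmetric scheme keeps quantization linear: if a ≈ a_q * a_scale and b ≈ b_q * b_scale, then their dot product is approximately the INT32 accumulator times the product of the two scales.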

Intel DL Boost in Action

Intel DL Boost technology has been integrated into popular deep learning frameworks such as MXNet*, TensorFlow*, PyTorch*, PaddlePaddle*, and Caffe*. The Apache MXNet community has delivered quantization approaches that enable INT8 inference and use of VNNI. iFLYTEK, which leverages 2nd Gen Intel Xeon Scalable processors and Intel® Optane™ SSDs for its AI applications, has reported that Intel DL Boost delivers similar or better performance compared with inference on alternative architectures. For more information on framework support, please refer to our recent blog post, Increasing AI Performance and Efficiency with Intel® DL Boost.
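Whatever the framework, these quantization flows share a common calibration step: run sample data through the FP32 model, observe each tensor's value range, and derive an INT8 scale from it. A minimal sketch of naive min/max calibration in plain NumPy follows; the function names are illustrative, not an actual framework API:

```python
import numpy as np

def calibrate_scale(activation_batches):
    """Derive a symmetric INT8 scale from observed activation batches."""
    observed_max = max(float(np.abs(batch).max()) for batch in activation_batches)
    return observed_max / 127.0

def quantize_with_scale(x, scale):
    """Quantize an FP32 tensor to INT8 using a pre-computed scale."""
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

# Pretend these are activation tensors captured while feeding
# calibration data through the FP32 model.
rng = np.random.default_rng(42)
calib_batches = [rng.standard_normal((8, 16)).astype(np.float32)
                 for _ in range(4)]

scale = calibrate_scale(calib_batches)
q = quantize_with_scale(calib_batches[0], scale)
print(q.dtype, scale)
```

Frameworks typically also offer entropy-based calibration modes that pick a tighter range than the raw min/max, trading a little clipping for finer resolution.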

Get Started with Intel DL Boost

Follow the steps in our blog post Model Quantization for Production-Level Neural Network Inference to experience the performance improvement Intel DL Boost delivers on these new EC2 C5 instances, or run the shell scripts directly in the following order:

  1. Run ec2_benchmark_base.sh to collect the FP32 results without operator fusion (baseline).
  2. Run ec2_benchmark_int8.sh to collect the FP32 results with operator fusion (better) and the INT8 results with operator fusion (best).

Check out the following blog posts for more details on Intel DL Boost features, configurations, benchmarks and framework integrations:

Many thanks to my colleagues Yixin Bao, Ciyong Chen, Xinyu Chen, Ying Guo, Zhiyuan Huang, Tao Lv, Eric Lin, Wei Li, Zhennan Qin, Shufan Wu, Zixuan Wei, Pengxin Yuan, Lujia Yin, Patric Zhao, Rong Zhang, and many others at Intel for their great work optimizing deep learning frameworks with state-of-the-art acceleration technology on Intel processors. Thanks also to Emily Hutson for providing valuable feedback.

Appendix: Notices and Disclaimers