Intel® Nervana™ NNP-I

Public performance claims for Intel AI Summit on Nov 12, 2019

Performance Claim 1: Pre-production Intel® Nervana™ NNP-I ruler delivers up to 1.2x perf/watt over an NVIDIA T4 system.

Performance Claim 2: Pre-production Intel® Nervana™ NNP-I ruler system delivers up to 3.7x compute density over an NVIDIA T4 system.
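
For reference, a minimal sketch (in Python) of how perf/watt and compute-density ratios like those in Claims 1 and 2 are typically derived from per-node figures. All numbers below are invented placeholders, not Intel or NVIDIA measurements; the actual inputs would come from the MLPerf submissions and the systems' power and rack-space specifications.

    # Hypothetical illustration only: how perf/watt and compute-density
    # ratios are derived. Substitute real per-node throughput (e.g. MLPerf
    # inferences/sec), wall power, and rack units from the submissions.

    def perf_per_watt(throughput_ips, power_watts):
        return throughput_ips / power_watts

    def compute_density(throughput_ips, rack_units):
        return throughput_ips / rack_units

    # Placeholder figures for two systems, A and B (not measurements):
    ratio_ppw = perf_per_watt(10_000, 500) / perf_per_watt(9_000, 540)
    ratio_den = compute_density(10_000, 1) / compute_density(9_000, 2)
    print(f"perf/watt: {ratio_ppw:.2f}x, density: {ratio_den:.2f}x")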

Performance Claim 3: Pre-production Intel® Nervana™ NNP-I throughput is similar to that of the production NVIDIA T4 in Server and Offline modes in MLPerf Inference v0.5 results.
Disclaimer:
Performance claims calculated per node based on Intel and NVIDIA submissions to MLPerf Inference v0.5 results published on November 6, 2019 at https://mlperf.org/inference-results/.

Performance Claim 4: Pre-production Intel® Nervana™ NNP-I shows a smaller performance loss between Server and Offline modes than the NVIDIA T4 in MLPerf Inference v0.5 results.
Disclaimer:
Performance claims calculated per node based on Intel and NVIDIA submissions to MLPerf Inference v0.5 results published on November 6, 2019 at https://mlperf.org/inference-results/.
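
To make the per-node comparison behind Claims 3 and 4 concrete, a minimal sketch of the "performance loss across modes" arithmetic: the relative drop from a node's Offline throughput to its Server throughput. The queries/sec figures below are hypothetical placeholders, not the published results.

    # Hypothetical sketch: relative Server-vs-Offline throughput drop for
    # one node. Substitute real queries/sec from the MLPerf Inference
    # v0.5 result tables.

    def mode_loss(offline_qps, server_qps):
        """Fraction of Offline throughput lost when running in Server mode."""
        return (offline_qps - server_qps) / offline_qps

    print(f"{mode_loss(offline_qps=11_000, server_qps=10_400):.1%} loss")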

Performance Claim 5: Intel® Nervana™ NNP-I is expected to deliver leadership performance/watt among commercially available accelerators when it launches.
Disclaimer:
Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance.

Intel® Nervana™ NNP-T

Public performance claims for Intel AI Summit on Nov 12, 2019

Performance Claim 1: With pre-production NNP-T systems we measure ResNet-50 training time at just over 1 hour, with a line of sight to roughly half that.
Disclaimer:
Measurements based on Intel internal testing using pre-production hardware/software as of November 2019. All products, computer systems, dates, and figures are preliminary based on current expectations, and are subject to change without notice.
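
For context on the ~1 hour figure, a back-of-the-envelope sketch of the sustained throughput such a run implies, assuming the standard ImageNet-1k recipe of roughly 1.28 million training images over 90 epochs. That recipe is an assumption for illustration; the claim does not state the dataset or epoch count.

    # Back-of-the-envelope only: implied sustained throughput for a
    # ResNet-50 run, assuming ~1.28M ImageNet images x 90 epochs.
    images = 1_281_167          # ImageNet-1k training-set size
    epochs = 90                 # common ResNet-50 convergence recipe
    seconds = 60 * 60           # ~1 hour wall-clock

    print(f"{images * epochs / seconds:,.0f} images/sec sustained")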

Performance Claim 2: NNP-T pre-production systems achieve near-linear scaling, up to 95% scaling efficiency with ResNet-50 and BERT as measured on 32 cards, enabling customers to train deep learning models with high efficiency.
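
A minimal sketch of how a scaling-efficiency figure like the 95% in Claims 2 and 8 is conventionally computed: measured multi-card throughput divided by the ideal linear extrapolation from a smaller baseline. The throughput numbers below are hypothetical placeholders.

    # Hypothetical sketch of scaling efficiency: measured throughput at N
    # cards relative to perfect linear scaling from a baseline card count.

    def scaling_efficiency(base_cards, base_tput, cards, tput):
        ideal = base_tput * (cards / base_cards)
        return tput / ideal

    # e.g. 1-card baseline of 1,000 img/s; 30,400 img/s measured on 32 cards
    eff = scaling_efficiency(1, 1_000, 32, 30_400)
    print(f"{eff:.0%} scaling efficiency")  # -> 95%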

Performance Claim 3: Pre-production NNP-T’s purpose-built architecture with bfloat16 compute converges at state-of-the-art (SOTA) accuracy, matching FP32 compute, as measured with ResNet-50 on 32 cards, enabling customers to train deep learning models with high efficiency.

Performance Claim 4: Pre-production NNP-T balances compute, memory, and interconnect: bandwidth shows no loss from 8 cards in-chassis to 32 cards cross-chassis, the data rate is the same on 8 or 32 cards for a large (128 MB) message size, and the fabric scales well beyond 32 cards.
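
To make the "same data rate" comparison in Claim 4 concrete, a small sketch of the per-message data-rate calculation at a fixed 128 MB message size. The timings below are hypothetical placeholders, not measurements.

    # Hypothetical sketch: data rate for a fixed 128 MB message, compared
    # across card counts. "No loss in bandwidth" means the rate holds as
    # the configuration grows from 8 cards in-chassis to 32 cross-chassis.
    MESSAGE_BYTES = 128 * 1024 * 1024

    def data_rate_gbps(elapsed_s):
        return MESSAGE_BYTES * 8 / elapsed_s / 1e9  # gigabits/sec

    for cards, elapsed in [(8, 0.0112), (32, 0.0113)]:  # placeholder timings
        print(f"{cards} cards: {data_rate_gbps(elapsed):.1f} Gb/s")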

Performance Claim 5: For deep learning training models with large weight sizes (> 500 MB), such as BERT-Large and Transformer-LT, NNP-T systems with a simplified glueless peer-to-peer scaling fabric are projected to sustain bandwidth with no loss while scaling from a few cards to 1,000 cards.

Performance Claim 6: NNP-T is a highly energy-efficient alternative to general-purpose compute for our customers’ real-world deep learning training workloads.
Disclaimer:
None

Performance Claim 7: A simplified glueless peer-to-peer fabric design enables NNP-T systems to be highly performant while providing significant cost savings for customers.
Disclaimer:
The cost-savings claim is based on the fact that the NNP-T glueless fabric obviates the need for additional switching and NIC costs. Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.
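
A purely illustrative sketch of the cost comparison the disclaimer describes: a switched fabric adds per-card NIC and switch-port costs that a glueless peer-to-peer fabric avoids. Every price below is an invented placeholder; as the disclaimer states, Intel does not guarantee any costs or cost reduction.

    # Illustrative only; all prices are invented placeholders.
    def fabric_cost(cards, nic_cost=0, switch_port_cost=0):
        """Interconnect cost: per-card NIC plus per-card switch port."""
        return cards * (nic_cost + switch_port_cost)

    switched = fabric_cost(32, nic_cost=800, switch_port_cost=400)
    glueless = fabric_cost(32)  # peer-to-peer links, no NICs or switches
    print(f"hypothetical interconnect savings: ${switched - glueless:,}")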

Performance Claim 8: NNP-T pre-production systems achieve near-linear scaling, up to 95% scaling efficiency with ResNet-50 (PaddlePaddle) as measured on 32 cards, compared to the NVIDIA V100, which scales at 73% (TensorFlow).

Common Disclaimer for all performance claims