Delivering a New Intelligence with AI at Scale

It’s an exciting day for Intel and the AI community we work to enable. Today, we’re proud to provide significant updates on the Intel AI portfolio: the first demonstrations of the Intel® Nervana™ Neural Network Processor for Training (NNP-T) and the Intel® Nervana™ Neural Network Processor for Inference (NNP-I). We will also be demonstrating for the first time enhanced integrated AI acceleration with bfloat16 on the next-generation Intel® Xeon® Scalable processor with Intel® Deep Learning Boost (Intel® DL Boost), codenamed Cooper Lake. Finally, we are announcing the future Intel® Movidius™ Vision Processing Unit (VPU), codenamed Keem Bay. This unique combination of hardware will enable the industry to embrace much larger and more complex AI algorithms, expanding what can be achieved with AI in the cloud and data center, an edge server, or an IoT device.

I am especially proud to present additional architectural details and live, working demos of real-world AI solutions that harness the power of Intel Nervana NNP-T and Intel Nervana NNP-I – products that many of my colleagues and I joined Intel to create. Our pre-production hardware running pre-alpha software is already performing impressively, and we expect the forthcoming production platforms to perform even better.

While most enterprises are only getting started on their AI journey with smaller models that typically do not require acceleration, AI super users – generally CSPs – are embracing next-gen AI models with billions or trillions of parameters that require new approaches to AI acceleration.

The Drive for a New Intelligence

Intel has a unique position and perspective on AI, with a comprehensive edge-to-cloud product portfolio that makes a wide breadth of AI solutions possible: from smart IoT edge devices to classic enterprise machine learning to next-generation deep learning for true AI super users. This last group is developing the next generation of models that will move us from more basic intelligence to algorithms capable of using reasoning and context to make decisions and scale knowledge.

Deeper, more complex models provide better, more valuable results by incorporating reasoning and contextual factors about the user and the use environment. However, this increase in capability comes with big increases in model size, data needs, and demand for AI compute.

This next wave of AI requires huge increases in data and model complexity, with some models holding trillions of potential parameters. Training these cutting-edge algorithms is driving demand for AI compute to double about every 3.5 months [1], growth that cannot be met efficiently with today’s architectures. These AI breakthroughs require new architectures that are specifically designed for high-speed, mass-scale AI compute.
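To put that doubling rate in perspective, here is a minimal Python sketch (illustrative only, not Intel data) that compounds the cited 3.5-month doubling period over a few planning horizons.

```python
# Illustrative only: compound growth in AI training compute demand,
# assuming demand doubles roughly every 3.5 months as cited above [1].
DOUBLING_PERIOD_MONTHS = 3.5

def compute_demand_multiplier(months: float) -> float:
    """How many times larger AI compute demand is after `months` at the cited rate."""
    return 2 ** (months / DOUBLING_PERIOD_MONTHS)

for horizon in (12, 24, 36):
    print(f"{horizon} months -> ~{compute_demand_multiplier(horizon):.0f}x the AI compute demand")
```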

Intel® Nervana™ Neural Network Processor for Training (NNP-T)

Developed for the AI processing needs of leading-edge AI customers like Baidu, the Intel Nervana NNP-T’s purpose-built deep learning architecture carefully balances compute, memory, and interconnect to deliver near-linear scaling – up to 95% scaling with ResNet-50 and BERT as measured on 32 cards [2] – so that even the most complex models can be trained at high efficiency. As a highly energy-efficient compute platform for training real-world deep learning applications, Intel Nervana NNP-T maintains full communication bandwidth when moving from an 8-card in-chassis system to a 32-card cross-chassis system, sustaining the same data rate on 8 or 32 cards for large (128 MB) message sizes and scaling well beyond 32 cards [3].
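As a reference point for what a figure like “95% scaling” means, the following minimal Python sketch computes multi-card scaling efficiency the way it is conventionally defined – measured throughput on N cards divided by N times the single-card throughput. The throughput numbers are placeholders, not measured Intel results.

```python
# Hypothetical example of how multi-card scaling efficiency is commonly computed:
# measured throughput on N cards divided by N times the single-card throughput.
# The throughput figures below are placeholders, not measured Intel results.

def scaling_efficiency(single_card_tput: float, n_cards: int, n_card_tput: float) -> float:
    """Fraction of ideal linear scaling achieved on n_cards."""
    ideal = single_card_tput * n_cards
    return n_card_tput / ideal

# e.g. 1,000 samples/sec on one card, 30,400 samples/sec measured across 32 cards
print(f"{scaling_efficiency(1_000, 32, 30_400):.0%} scaling efficiency")  # -> 95%
```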

Intel Nervana NNP-T in its PMC form factor.

For deep learning training models such as BERT-large and Transformer-LT with large weight sizes (> 500 MB), Intel Nervana NNP-T systems with a simplified, glueless, peer-to-peer scaling fabric are projected to see no loss in bandwidth when scaling from a few cards to thousands of cards [4].

Intel Nervana NNP-I in its M.2 form factor, which draws 12W and delivers up to 50 TOPS. Intel Nervana NNP-I is also available as a PCIe card drawing 75W and delivering up to 170 TOPS.

Intel® Nervana™ Neural Network Processor for Inference (NNP-I)

New AI services launch every day, driving demand for fast, efficient inference compute with a wide variety of use environments, energy constraints, and latency considerations. To serve these customers, the Intel Nervana NNP-I is designed for intense, near-real-time, high-volume, low-latency inference, as well as power and budget efficiency and flexible form factors. It is a performant, highly-programmable accelerator platform specifically designed for ultra-efficient multi-modal inferencing. Intel Nervana NNP-I will be supported by the OpenVINO™ Toolkit, incorporates a full software stack including popular deep learning frameworks, and offers a comprehensive set of reliability, availability, and serviceability (RAS) features to facilitate deployment into existing data centers.
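For a sense of what OpenVINO support looks like from a developer’s perspective, below is a minimal sketch using the OpenVINO Inference Engine Python API (the legacy IECore interface). The model file names are placeholders, and the device string for a deployed NNP-I card is an assumption – use whichever device name the installed OpenVINO plugin exposes; "CPU" appears here only as a stand-in.

```python
# Minimal sketch of inference through the OpenVINO Inference Engine Python API.
# Model paths are placeholders; the accelerator is selected by device name, and
# "CPU" below is only a stand-in for the device string an NNP-I plugin would register.
import numpy as np
from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network(model="model.xml", weights="model.bin")  # IR produced by the Model Optimizer
input_blob = next(iter(net.input_info))
output_blob = next(iter(net.outputs))

exec_net = ie.load_network(network=net, device_name="CPU")

# Feed a random batch shaped to the network's declared input.
batch = np.random.rand(*net.input_info[input_blob].input_data.shape).astype(np.float32)
result = exec_net.infer(inputs={input_blob: batch})
print(result[output_blob].shape)
```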

We recently reported very positive MLPerf results for two pre-production Intel Nervana NNP-I processors on pre-alpha software, and we expect even greater capabilities from Intel Nervana NNP-I as we further mature the AI software stack; we will update results on production products in the future.

Next Generation of Built-in AI Acceleration with the Intel® Xeon® Scalable Processor

As more organizations integrate AI capabilities into more facets of their operations, the Intel Xeon Scalable processor – which already powers most of the world’s inference today – will be called on to process increasingly complex algorithms. To empower our customers to deliver more impactful AI applications, Intel introduced Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instructions in 2017 with the first-generation Intel Xeon Scalable processor. With the 2nd Generation Intel Xeon Scalable processor, we introduced Intel DL Boost’s Vector Neural Network Instructions (VNNI), which combine three instructions into one while enabling INT8 deep learning inference.
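To illustrate what that fused operation does, here is a small NumPy sketch that emulates the per-lane INT8 multiply-accumulate a VNNI instruction such as VPDPBUSD performs: four unsigned 8-bit activations times four signed 8-bit weights, accumulated into a 32-bit integer. This illustrates the arithmetic only; it is not Intel library code.

```python
# NumPy emulation of the per-lane INT8 fused multiply-accumulate behind VNNI:
# groups of four uint8 activations times four int8 weights, summed into int32,
# replacing what previously took a three-instruction AVX-512 sequence.
import numpy as np

def vnni_dot_accumulate(acc, activations, weights):
    """acc: int32 lanes; activations: uint8 (lanes, 4); weights: int8 (lanes, 4)."""
    products = activations.astype(np.int32) * weights.astype(np.int32)
    return acc + products.sum(axis=1, dtype=np.int32)

acc = np.zeros(4, dtype=np.int32)
a = np.random.randint(0, 256, size=(4, 4), dtype=np.uint8)    # quantized activations
w = np.random.randint(-128, 128, size=(4, 4), dtype=np.int8)  # quantized weights
print(vnni_dot_accumulate(acc, a, w))
```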

At the AI Summit, we demonstrated how we’re improving on this foundation in our next-generation Intel Xeon Scalable processors with bfloat16, a new numerics format supported by Intel DL Boost. bfloat16 is advantageous in that it has similar accuracy to the more common FP32 format, but with a reduced memory footprint that can lead to significantly higher throughput for deep learning training and inference on a range of workloads.
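Because bfloat16 keeps FP32’s sign bit and 8-bit exponent and simply shortens the mantissa from 23 bits to 7, converting an FP32 value essentially amounts to rounding it to its upper 16 bits. The sketch below demonstrates that relationship in plain Python/NumPy; it is illustrative only and skips special-case handling such as NaN payloads.

```python
# Illustrative bfloat16 round-trip: same sign and exponent bits as FP32,
# mantissa truncated from 23 bits to 7 (round-to-nearest-even, NaNs not special-cased).
import numpy as np

def fp32_to_bfloat16_bits(x: float) -> int:
    """Return the 16-bit bfloat16 encoding of an FP32 value."""
    bits = int(np.array(x, dtype=np.float32).view(np.uint32))
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)  # ties-to-even
    return ((bits + rounding_bias) >> 16) & 0xFFFF

def bfloat16_bits_to_fp32(b: int) -> float:
    """Expand a bfloat16 bit pattern back to FP32 by zero-filling the low mantissa bits."""
    return float(np.array(b << 16, dtype=np.uint32).view(np.float32))

x = 3.1415927
b = fp32_to_bfloat16_bits(x)
print(hex(b), bfloat16_bits_to_fp32(b))  # FP32 dynamic range, ~3 significant decimal digits
```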

Optimizations like these at the lowest level of silicon, which are unique to Intel Xeon Scalable processors, will help our customers continue to tackle even more computationally heavy problems with ease.

New Intel® Movidius™ VPU Launching 1H 2020

Compute at the network edge requires efficiency and scalability across a broad range of applications, and edge AI inference brings even tighter energy constraints – power budgets as low as just a few watts. To best support future edge AI use cases, we’re excited to announce the future Intel Movidius VPU (codenamed Keem Bay), releasing in the first half of 2020. Keem Bay builds on the success of our popular Intel Movidius Myriad™ X VPU while adding groundbreaking and unique architectural features that provide a leap ahead in both efficiency and raw throughput.

Early performance testing indicates that Keem Bay will offer more than 4x the raw inference throughput of NVIDIA’s similar-range TX2 SoC at 1/3 less power, and nearly equivalent raw throughput to NVIDIA’s next-higher-class SoC, Xavier, at 1/5th the power [5]. This is due in part to Keem Bay’s mere 72 mm² die size versus NVIDIA Xavier’s 350 mm² [6], highlighting the efficiency that this new product’s architecture delivers. Keem Bay will also be supported by Intel’s OpenVINO Toolkit at launch and will be incorporated into Intel’s newly announced Dev Cloud for the Edge, which launches today and lets you test your algorithms on any Intel hardware solution before you buy.
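As a back-of-the-envelope reading of those relative figures, the short sketch below converts them into implied performance-per-watt ratios; all inputs are normalized ratios taken from the comparison above, not absolute measurements.

```python
# Implied performance-per-watt ratios from the relative figures cited above [5];
# inputs are normalized ratios, not absolute throughput or power measurements.

def perf_per_watt_ratio(throughput_ratio: float, power_ratio: float) -> float:
    """Relative efficiency = (throughput vs. baseline) / (power vs. baseline)."""
    return throughput_ratio / power_ratio

# vs. a TX2-class SoC: >4x the throughput at 1/3 less power (i.e. ~2/3 the power)
print(f"vs. TX2-class SoC: ~{perf_per_watt_ratio(4.0, 2 / 3):.0f}x perf/W")
# vs. a Xavier-class SoC: roughly equal throughput at ~1/5th the power
print(f"vs. Xavier-class SoC: ~{perf_per_watt_ratio(1.0, 1 / 5):.0f}x perf/W")
```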

AI Delivered at Intel Scale and Efficiency

Intel’s solution portfolio uniquely integrates the compute architectures that analysts predict will be required to realize the full promise of AI: CPUs, FPGAs, ASICs like those we’re announcing today, all enabled by an open software ecosystem. We’re now realizing $3.5 billion per year in AI revenue, and we’re just getting started.

However, it will take a village. Ushering in the 64X increase in AI compute that Intel estimates the community will demand within just two years [7] cannot be accomplished through compute alone. Only Intel is equipped to look at the full picture of compute, memory, storage, interconnect, packaging, and software to maximize efficiency and programmability, and to ensure the critical ability to scale up by distributing deep learning across thousands of nodes and, in turn, scale the knowledge revolution.

Check out Intel.ai for more information on our Intel Nervana Neural Network Processors and Movidius.com for more information on current and future VPU technologies.