Building What’s Next: Scaling Complex Deep Learning Workloads

Recent advances in artificial intelligence have been driven by abundant compute, large data sets, and the re-emergence of deep learning (DL) techniques that use labeled data to train computers to mimic human abilities to perceive and classify. Now we are on the cusp of machines understanding context, enabling a level of common sense that will allow them to make decisions independently. To facilitate deep reasoning, new models like BERT and techniques such as reinforcement learning are emerging, drawing on massive amounts of unlabeled data and highly complex models with billions of parameters and hundreds of neural network layers. As they do, AI researchers have found that experimenting with and training the most complex models takes too long on existing products that were not built specifically for DL training. Purpose-built DL training accelerators and systems are needed so that developers can deploy distributed learning algorithms and scale up the deep reasoning that powers innovation, research, and discovery.

Inside the Intel Nervana Neural Network Processor for Training

The Intel® Nervana™ Neural Network Processor for Training (Intel® Nervana™ NNP-T) architecture is inspired by the brain, making it possible to construct tightly integrated AI systems that efficiently and rapidly train the next wave of large, complex deep learning models. Building a DL processor from the ground up allows an unrivaled balance of three critical elements:

  • Compute: Specialized DL compute functions for maximum utilization
  • Memory: On-die memory to reduce data movement and keep the compute units fed
  • Communications: Dedicated data paths for scale-out efficiency and flexibility

The balance of memory and compute in the Intel Nervana NNP-T maximizes processor and cluster utilization and reduces time-to-train. This balance lets it handle multiple workload sizes, scaling from small clusters to the largest pods and sustaining real-world training performance even at smaller matrix sizes.
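As a rough illustration of why this balance matters, the sketch below estimates the arithmetic intensity of a matrix multiply and compares it to a machine balance point. The peak compute and memory bandwidth figures are placeholder assumptions for illustration only, not NNP-T specifications.

    # Illustrative sketch: when is a GEMM compute-bound vs. memory-bound?
    # peak_tflops and mem_bw_gbs are placeholder assumptions, not NNP-T specs.

    def gemm_arithmetic_intensity(m, n, k, bytes_per_elem=2):
        """FLOPs per byte moved for C[m,n] = A[m,k] @ B[k,n]."""
        flops = 2 * m * n * k                              # multiply-accumulate count
        bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
        return flops / bytes_moved

    peak_tflops = 100.0    # assumed peak compute, TFLOP/s
    mem_bw_gbs = 1000.0    # assumed memory bandwidth, GB/s
    machine_balance = (peak_tflops * 1e12) / (mem_bw_gbs * 1e9)   # FLOPs per byte

    for size in (128, 512, 2048):
        ai = gemm_arithmetic_intensity(size, size, size)
        bound = "compute-bound" if ai > machine_balance else "memory-bound"
        print(f"{size}x{size}x{size} GEMM: {ai:.1f} FLOP/byte -> {bound}")

Smaller matrices have lower arithmetic intensity, which is why keeping data close to the compute units matters so much for real-world utilization.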

Keeping Compute Fed with Data

We developed specialized tensor processing clusters (TPCs) dedicated to GEMMs and convolutions, the primary math operations that account for nearly 99% of deep learning compute, with instruction sets and pipelines that maximize DL processing performance. A tensor-based bfloat16 architecture brings the flexibility to support all deep learning primitives while keeping hardware components as efficient as possible. The TPCs leverage both on-die SRAM and on-package HBM to keep data local, reusing it as much as possible and reducing data movement and latency. Maximizing real-world processor utilization means less time and power spent moving data between chips, and therefore the fastest time-to-train for a large-scale AI system within a targeted cost and power budget.
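For concreteness, the snippet below runs the dominant operation, a matrix multiply, in bfloat16 using a standard framework. It is a generic example assuming a PyTorch build with CPU bfloat16 support, not NNP-T-specific code.

    import torch

    # Generic bfloat16 GEMM example (assumes a PyTorch build with bfloat16 support);
    # this is the operation that dominates most deep learning workloads.
    a = torch.randn(1024, 1024, dtype=torch.float32)
    b = torch.randn(1024, 1024, dtype=torch.float32)

    # Keeping activations and weights in bfloat16 halves memory traffic relative
    # to float32, while accumulation typically happens at higher precision.
    c = torch.matmul(a.to(torch.bfloat16), b.to(torch.bfloat16))
    print(c.dtype, c.shape)   # torch.bfloat16 torch.Size([1024, 1024])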

Maximum Throughput, Minimum Congestion

Deep learning at scale is as much a challenge of having the data where you need it as it is a compute problem. The Intel Nervana NNP-T's on-die communication is prioritized for throughput and congestion avoidance. It accelerates communication through a bidirectional 2-D mesh architecture that provides 2.6 TB/s of bandwidth with any-to-any communication, keeping the TPCs fed with data without congestion. The TPCs have separate data channels with cut-through forwarding and direct peer-to-peer communication with other TPCs, the host, HBM, and other NNP-T cards, so no energy or time is wasted waiting for data from cache or HBM.
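To give a feel for why a 2-D mesh keeps paths short and predictable, here is a small, generic sketch of hop counts under dimension-ordered (X-then-Y) routing. It illustrates the topology in general and is not a model of the NNP-T's actual router; the grid dimensions are arbitrary assumptions.

    # Generic illustration of hop counts on a 2-D mesh with dimension-ordered
    # (X-then-Y) routing: any-to-any reachability over short, predictable paths.
    # Not a model of the NNP-T router; grid size is an arbitrary assumption.

    def xy_hops(src, dst):
        """Hops from src=(x, y) to dst=(x, y) under X-Y routing."""
        return abs(dst[0] - src[0]) + abs(dst[1] - src[1])

    rows, cols = 4, 6
    nodes = [(x, y) for x in range(cols) for y in range(rows)]
    worst = max(xy_hops(s, d) for s in nodes for d in nodes)
    avg = sum(xy_hops(s, d) for s in nodes for d in nodes) / len(nodes) ** 2
    print(f"worst-case hops: {worst}, average hops: {avg:.2f}")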

Built for Scale-Out

Very large systems require an architecture optimized for workload efficiency at every level to be economical for the enterprise and useful to scientists. Using a high-speed Inter-Chip Links (ICL) communications fabric to directly interconnect NNP-T cards within and between chassis, large-scale systems and clusters scale near-linearly, acting almost as one efficient processor. The ICL fabric implements a fully programmable router and supports reliable transmission. TPCs can transfer data directly to the links rather than consuming precious bandwidth from the HBM memory subsystem, ensuring lower latency and greater efficiency.

Software handles memory management, message passing, synchronization, and scheduling of data transfers to ease data- and model-parallel distributed training across hundreds of cards with well over 80% communications efficiency, so time-to-train goals are not compromised by larger, more complex models. Clusters utilize NNP-T provisioning and orchestration management software to allow many users to simultaneously scale from one to many cards and systems while maintaining high utilization.
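As a back-of-the-envelope view of what communications efficiency means for time-to-train, the sketch below combines an assumed per-card step time with a standard ring all-reduce communication estimate for data-parallel training. Every number in it is a placeholder assumption for illustration, not a measured NNP-T result.

    # Back-of-the-envelope scaling-efficiency sketch for data-parallel training.
    # All numbers are placeholder assumptions, not measured NNP-T results.

    def ring_allreduce_seconds(param_bytes, num_cards, link_gbs):
        """Ring all-reduce moves ~2*(N-1)/N of the gradient bytes per card."""
        volume = 2 * (num_cards - 1) / num_cards * param_bytes
        return volume / (link_gbs * 1e9)

    compute_per_step_s = 0.100      # assumed per-card compute time per step
    param_bytes = 350e6 * 2         # e.g. a 350M-parameter model in bfloat16
    link_gbs = 100.0                # assumed effective inter-card bandwidth, GB/s

    for cards in (8, 32, 128):
        comm = ring_allreduce_seconds(param_bytes, cards, link_gbs)
        step = compute_per_step_s + comm        # assumes no compute/comm overlap
        efficiency = compute_per_step_s / step
        print(f"{cards:4d} cards: step {step*1e3:.1f} ms, efficiency {efficiency:.0%}")

In practice, overlapping communication with computation and transferring directly over the inter-card links raises the achievable efficiency beyond this simple no-overlap estimate.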

Open Software and Programmability

The NNP-T's full software stack is easy to scale and built with open components. It works with popular frameworks such as TensorFlow, PaddlePaddle, and PyTorch, and includes a deep learning library and a hardware-agnostic, open source graph compiler. Programmability is therefore enabled at multiple levels: framework, graph compiler, kernel compiler, and kernel library. Data scientists can work with existing frameworks and automatically leverage our graph and kernel optimizations, or extend them by writing their own kernels.
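Because the stack plugs in beneath standard frameworks, model code stays ordinary framework code. The short PyTorch example below defines and trains a small model for one step; graph- and kernel-level optimizations are left to the layers underneath, and no NNP-T-specific API is shown or implied.

    import torch
    import torch.nn as nn

    # Ordinary framework code: the graph compiler and kernel library sit beneath
    # this level, so no accelerator-specific calls appear in the model definition.
    model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    x = torch.randn(64, 784)            # stand-in batch of inputs
    y = torch.randint(0, 10, (64,))     # stand-in labels

    logits = model(x)                   # forward pass
    loss = loss_fn(logits, y)
    loss.backward()                     # backward pass
    optimizer.step()                    # weight update
    print(f"loss: {loss.item():.4f}")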

Accelerate with Purpose

Highly efficient, fast solutions for deep learning demand integrated systems. Intel takes a systems-level approach that optimizes interactions between CPU, accelerator, interconnect, memory, and storage. Intel® Xeon® Scalable processors are already relied upon for a majority of enterprise workloads, providing a strong balance of performance, cost, and versatility. For intensive, continuous, high-volume tensor compute, Intel® Nervana™ Neural Network Processors for Training (NNP-Ts) work hand in hand with Intel Xeon-based infrastructure as part of an integrated system to run the most complex emerging DL models with high effective real-world utilization and near-linear scaling.

For more information, join us at the AI Hardware Summit in Mountain View, CA from September 17-18 or head to the Intel® Nervana™ Neural Network Processors product page.

Notices and Disclaimers