Recently, we announced a new architecture built from the ground up for neural networks, known as the Intel® Nervana™ Neural Network Processor (NNP). The goal of this new architecture is to provide the needed flexibility to support all deep learning primitives while making core hardware components as efficient as possible. We designed the NNP to free us from the limitations imposed by existing hardware, which was not explicitly designed for AI. Today, we are excited to disclose more details on the architecture and on some of the new features specific to increasing neural network training and inference performance. In this blog, I will talk about the attributes that make the NNP so innovative and why these features are important tools for neural network designers.
To solve a large data problem with a neural network, it is necessary for the designer to iterate quickly on different possible neural networks using large data sets. In other words, it’s all about being able to train a given neural network against more data in a given amount of time. There are multiple important factors to achieve this: 1) maximizing compute utilization, 2) easily scaling to more compute nodes, 3) doing so with as little power as possible. The NNP architecture provides novel solutions to these problems and will give neural network designers powerful tools for solving larger and more difficult problems.
The NNP gives software the flexibility to directly manage data locality, both within the processing elements and in the high bandwidth memory (HBM) itself. Tensors can be split across HBM modules in order to ensure that in-memory data is always closest to the relevant compute elements. This minimizes data movement across the die, saving power and reducing on-die congestion. Similarly, software can determine which blocks of data are stored long-term inside the processing elements, saving more power by reducing data movement to and from external memory.
HBM is at the forefront of memory technology, supporting up to 1TB/s of bandwidth between the compute elements and the large (16-32GB) external memory. But even with this large amount of memory bandwidth, deep learning workloads can easily become memory-limited. Until new memory technologies become available, it is important for deep learning compute architectures to use creative strategies that minimize data movement and maximize data re-use in order to leverage all of their computational resources. The NNP employs a number of these creative strategies, all under the control of software. The local memory of each processing element is large (>2MB each, with more than 30MB of local memory per chip). This larger on-die memory size reduces the number of times data needs to be read from memory, and enables local transforms that don’t affect the HBM subsystem. After data is loaded into the local memory of each processing element, it can then be moved to other processing elements without re-visiting the HBM, leaving more HBM bandwidth available for pre-fetching the tensor for the next operation. Tensors can even be sent off-die to neighboring chips directly from processing element to processing element, again without requiring a second trip in and out of the HBM subsystem. Even simple aspects of the architecture like free (zero-cycle) transpose are targeted at reducing the overall memory bandwidth.
We designed Flexpoint, the core numerical technology powering the NNP, in order to achieve results similar to FP32 while only using 16 bits of storage space. As opposed to FP16, we use all 16 bits for the mantissa, passing the exponent in the instruction. This new numeric format effectively doubles the memory bandwidth available on a system compared to FP32, and utilizes 16 bit integer multiply-accumulate logic which is more power efficient than even FP16.
Flexpoint is modular. While our first generation NNP focuses on 16b multipliers with a 5b exponent, future silicon will enable even smaller bit widths in order to save even more power.
The NNP includes high speed serdes which enable more than a terabit-per-second of bidirectional off-chip bandwidth. Similar to our memory subsystem, this bandwidth is fully software-controlled. QOS can be maintained on each individual link using software-configurable, adjustable-bandwidth virtual channels and multiple priorities within each channel. Data can be moved between chips either between their HBM memories or directly from the processing elements. The high bandwidth enables model parallelism (a set of chips will combine together and act as if they are a single compute element), in addition to data parallelism (where a job is split up along input data boundaries). The ability to move data directly from local to remote processing elements ensures that HBM reads can be reused as many times as possible, maximizing data re-use in memory-bound applications.
By designing the Intel® Nervana™ Neural Network Processor to maximize compute utilization and easily scale to more compute nodes while using as little power as possible, we have created a novel form of hardware that is optimized for deep learning.