Nervana is currently developing the Nervana Engine, an application specific integrated circuit (ASIC) that is custom-designed and optimized for deep learning.
Training a deep neural network involves many compute-intensive operations, including matrix multiplication of tensors and convolution. Graphics processing units (GPUs) are better suited to these operations than CPUs because GPUs were originally designed for video games, in which the movement of on-screen objects is governed by vectors and linear algebra. As a result, GPUs have become the go-to computing platform for deep learning. But there is still much room for improvement, because the numeric precision, control logic, caches, and other architectural elements of GPUs were optimized for video games, not deep learning.
As authors of the world’s fastest GPU kernels for deep learning, Nervana understands these limitations better than anyone, and knows how to address them most effectively. When designing the Nervana Engine, we threw out the GPU paradigm and started fresh. We analyzed the most popular deep neural networks and determined the best architecture for their key operations. We even analyzed and optimized our core numerical format and created FlexPoint™, which maximizes the precision that can be stored within 16 bits, enabling the perfect combination of high memory bandwidth and algorithmic performance. Then we added enough flexibility to ensure that our architecture is “future proof.” The Nervana Engine includes everything needed for deep learning and nothing more, ensuring that Nervana will remain the world’s fastest deep learning platform. So … are you ready for deep learning at ludicrous speed?!
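FlexPoint itself is proprietary, but the family it belongs to, block floating point, is easy to sketch: an entire tensor shares a single exponent while each element stores only a 16-bit integer mantissa, so memory traffic stays at 16 bits per value while dynamic range adapts to the data. The function names and encoding details below are purely illustrative, not the actual FlexPoint specification:

```python
import numpy as np

def to_block_fp16(x, mantissa_bits=16):
    """Encode a tensor as 16-bit integer mantissas sharing one exponent
    (a generic block floating-point sketch, NOT the FlexPoint spec)."""
    # Choose the smallest shared exponent that keeps the largest
    # magnitude representable in a signed 16-bit mantissa.
    max_mag = np.max(np.abs(x))
    limit = 2 ** (mantissa_bits - 1) - 1  # 32767 for 16 bits
    exp = int(np.ceil(np.log2(max_mag / limit))) if max_mag > 0 else 0
    mantissas = np.round(x / 2.0 ** exp).astype(np.int16)
    return mantissas, exp

def from_block_fp16(mantissas, exp):
    """Decode the shared-exponent representation back to floats."""
    return mantissas.astype(np.float64) * 2.0 ** exp

x = np.array([0.5, -1.25, 3.0, 0.001])
m, e = to_block_fp16(x)
x_hat = from_block_fp16(m, e)
```

Because the exponent is stored once per tensor rather than once per element, all 16 bits of each element go toward precision, which is the trade-off the FlexPoint description above alludes to.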
Training deep neural networks involves moving a lot of data, and current memory technologies are not up to the task. Over the course of a training run, training data is read repeatedly and model parameters are updated continually. DDR4 SDRAM offers high storage capacity but limited bandwidth; GDDR5 SDRAM is faster but has limited capacity. The Nervana Engine instead uses a new memory technology called High Bandwidth Memory (HBM) that is both high-capacity and high-speed, giving the Nervana Engine 32GB of in-package storage and a blazingly fast 8 terabits per second of memory access bandwidth.
The Nervana Engine’s HBM achieves high capacity through die-stacking: a single HBM stack stores a whopping 8GB because it is built from eight individual 1GB memory dies stacked vertically. The Nervana Engine includes four such stacks, providing 32GB of in-package storage. HBM’s high bandwidth comes from 2.5D packaging, in which the processor and memory stacks sit side by side on a silicon interposer. The interposer permits much finer pin spacing on the bottom of the memory die, accommodating far more data channels and enabling the Nervana Engine’s very high memory access rate. All of this is achieved in a small die footprint, leaving more room for compute circuitry designed to leverage the increased memory capacity and bandwidth.
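The capacity and bandwidth figures compose straightforwardly; here is a quick back-of-the-envelope check (the per-stack split is inferred by dividing the stated totals evenly, not taken from a published spec):

```python
# Capacity: four HBM stacks, each a stack of eight 1GB memory dies.
DIES_PER_STACK = 8
GB_PER_DIE = 1
STACKS = 4
capacity_gb = STACKS * DIES_PER_STACK * GB_PER_DIE   # 32 GB in-package

# Bandwidth: 8 terabits per second aggregate across the four stacks.
TOTAL_BANDWIDTH_TBPS = 8
per_stack_tbps = TOTAL_BANDWIDTH_TBPS / STACKS        # 2 Tb/s per stack
per_stack_gbytes = per_stack_tbps * 1000 / 8          # 250 GB/s per stack

print(capacity_gb, per_stack_gbytes)
```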
The Nervana Engine design includes memory and computational elements relevant to deep learning and nothing else. For example, the Nervana Engine does not have a managed cache hierarchy; memory management is performed by software. This is an effective strategy in deep learning because operations and memory accesses are fully prescribed before execution. This allows more efficient use of die area by eliminating cache controllers and coherency logic. In addition, software management of on-chip memory ensures that high-priority data (e.g. model weights) is not evicted.
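The effect of software-managed memory with eviction priorities can be modeled with a toy allocator: the software decides up front what stays resident, and pinned buffers such as model weights are never displaced by transient data. All names and sizes below are illustrative, not the Nervana Engine’s actual interface:

```python
class Scratchpad:
    """Toy model of software-managed on-chip memory: software, not a
    hardware cache controller, decides what stays resident."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = {}   # buffer name -> (size, pinned)

    def used(self):
        return sum(size for size, _ in self.resident.values())

    def allocate(self, name, size, pinned=False):
        # Evict unpinned buffers (e.g. stale activations) as needed;
        # pinned buffers (e.g. model weights) are never evicted.
        for victim in [n for n, (_, p) in self.resident.items() if not p]:
            if self.used() + size <= self.capacity:
                break
            del self.resident[victim]
        if self.used() + size > self.capacity:
            raise MemoryError("scratchpad full of pinned data")
        self.resident[name] = (size, pinned)

pad = Scratchpad(capacity=100)
pad.allocate("weights", 60, pinned=True)   # stays resident for the run
pad.allocate("act0", 30)
pad.allocate("act1", 30)                   # evicts act0; weights survive
```

Because every allocation and eviction is known before the training step executes, this bookkeeping can be done entirely at compile time, which is why the die area spent on cache controllers and coherency logic can be reclaimed for compute.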
The result of this deep learning-optimized design is that the Nervana Engine achieves unprecedented compute density, delivering an order of magnitude more computing power than today’s state-of-the-art GPUs. Nervana achieves this feat with an ASIC built on a commodity 28nm manufacturing process, which leaves room for further improvement by shrinking to a 16nm process in the future.
As discussed, data movement is often the bottleneck in deep learning. Given this dependency, it is a challenge to ensure that computation is never “starved” waiting for data. The Nervana Engine has separate pipelines for computation and data management, ensuring that new data is always available for computation, and that the compute elements are always doing computation. This pipeline isolation, combined with plenty of local memory, means that the Nervana Engine can run near its theoretical maximum throughput much of the time.
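The double-buffering pattern this enables can be sketched in software: a dedicated loader fetches tile i+1 while tile i is being computed. On the Nervana Engine the two pipelines are separate hardware engines; this Python model only illustrates the overlap, and the function names are our own:

```python
from concurrent.futures import ThreadPoolExecutor

def pipelined_run(tiles, load, compute):
    """Double buffering: a loader thread prefetches the next tile
    while the current tile is being computed, so compute is never
    starved waiting for data (a sketch of the pattern only)."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as loader:
        pending = loader.submit(load, tiles[0])      # prime first buffer
        for i in range(len(tiles)):
            buf = pending.result()                   # wait for loaded tile
            if i + 1 < len(tiles):
                pending = loader.submit(load, tiles[i + 1])  # prefetch next
            results.append(compute(buf))             # overlaps the prefetch
    return results

out = pipelined_run([1, 2, 3], load=lambda t: t * 10, compute=lambda b: b + 1)
```

As long as each load finishes before the matching compute does, the compute units see a steady stream of ready data, which is what keeps throughput near the theoretical maximum.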
The Nervana Engine was designed from the ground up to support true model parallelism. The Nervana Engine includes six bi-directional high-bandwidth links enabling chips to be interconnected within or between chassis in a seamless fashion. This enables users to get linear speedup on their current models by simply assigning more compute to the task, or to expand their models to unprecedented sizes without any decrease in speed. For example, eight ASICs can be interconnected in a torus configuration as shown below to achieve nearly 8x linear performance speedup. Conceptually, the torus acts as a single large processor:
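With six bidirectional links per chip, eight chips map naturally onto a 2x2x2 three-dimensional torus, where each chip's links connect to its neighbors along the three axes with wrap-around. A minimal sketch of that neighbor relation (purely illustrative, not Nervana software):

```python
def torus_neighbors(coord, dims):
    """Neighbors of a chip at `coord` in a wrap-around (torus) grid:
    one link in each direction (+/-1) along each of the three axes."""
    x, y, z = coord
    dx, dy, dz = dims
    return [
        ((x + 1) % dx, y, z), ((x - 1) % dx, y, z),
        (x, (y + 1) % dy, z), (x, (y - 1) % dy, z),
        (x, y, (z + 1) % dz), (x, y, (z - 1) % dz),
    ]

# In a 2x2x2 torus the +1 and -1 hops along a size-2 axis reach the
# same chip, so each pair of links lands on one of three neighbors.
nbrs = torus_neighbors((0, 0, 0), (2, 2, 2))
```

Because every chip has direct links to its neighbors, collective operations like weight-gradient exchange never cross a shared bus, which is what makes the near-linear speedup possible.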
This speedup capability is unique to Nervana. Competing systems use oversubscribed, low-bandwidth PCIe buses for all communication, including peer-to-peer and host-to-card. This oversubscription limits the ability to improve performance by adding more hardware. Moreover, current systems require expensive, limited-bandwidth InfiniBand or Ethernet links for chassis-to-chassis communication, which greatly reduces their ability to scale beyond a single motherboard or chassis.
The Nervana Engine — featuring high-bandwidth memory, unprecedented compute density, isolated data and computation pipelines, and built-in networking — will enable deep learning at a scale never before seen in the industry.
Have questions? Contact us: firstname.lastname@example.org.