In the world of artificial intelligence, there has been a lot of talk about the performance and capabilities of hardware platforms. It is true that today’s AI revolution was able to (re)happen because of a combination of 1) larger data sets, and 2) high-density compute. In this blog, I’d like to focus on the compute side and provide a framework for comparing different high-density computing devices.
Numerous efforts are under way to serve this workload by ‘building a better mousetrap’ than a CPU or GPU. My own startup, Nervana (acquired by Intel in August 2016), is an example. While there are certainly ways to arrange transistors on a silicon die that give a performance and power advantage for this application, there are some fundamental issues that any architecture must address. A problem today is that many of the performance numbers being tossed about may not correlate well with real AI performance. Raw TeraFLOPs/s or TeraOPs/s have been used to compare various platforms, and below we’ll delve into some reasons why this metric is not sufficient to assess performance on neural network training.
You may have heard of the “Von Neumann” architecture and how it’s dead. Simply stated, the Von Neumann architecture is one where data lives in a memory connected to an arithmetic device (ALU) via some narrow data pipe. This has several key issues. When data is moved back and forth from the memory to the arithmetic device, energy is used and latency is incurred. In addition, the memory pipe might become a bottleneck if the arithmetic device can consume the data faster than it can be supplied by the memory. The new thinking is, if we can bring the memory closer to the arithmetic device, we burn less energy and mitigate bottlenecks. The problem with this in building a real silicon device is that memory grouped together will generally be denser and lower power than memory interspersed with digital logic. This is true for on-die SRAM but is even starker when we consider standard external memory technologies like DDR4, HBM2, or HMC that achieve very high density and power efficiency. The parameter sizes of today’s neural networks are generally too large to fit into on-die memory resources, so we are stuck with a data pipe between an off-die memory and arithmetic device. On-die memory can be used to mitigate the memory bandwidth problem, but deciding what stays on-die vs off-die requires careful management to achieve high performance.
Utilization in this context is the percentage of the raw compute capabilities of the chip that can be effectively used for a real workload. Deep learning and neural networks use a relatively small number of computational primitives, and only a few of those occupy much of the compute time. Matrix multiplication (MM) and transposes are fundamental operations. MM is composed of Multiply Accumulate (MAC) operations. OPs/s numbers are derived from how many MACs can be done per second (each multiply and each accumulate counts as 1 operation, so a MAC is actually 2 OPs). So, we can define utilization as

utilization = (OPs/s achieved on a real workload) / (peak OPs/s of the chip)
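As a minimal sketch of the utilization idea described above (delivered OPs/s as a fraction of peak OPs/s), the snippet below counts the MACs in one matrix multiplication, converts them to OPs at 2 OPs per MAC, and divides by a peak rating. The function names, the matrix sizes, the run time, and the 50 TeraOPs/s peak are all hypothetical values chosen for illustration, not measurements of any real device.

```python
def matmul_macs(m: int, k: int, n: int) -> int:
    """MACs needed to multiply an (m x k) matrix by a (k x n) matrix."""
    return m * k * n


def ops_from_macs(macs: float) -> float:
    """Each MAC counts as 2 OPs: one multiply plus one accumulate."""
    return 2.0 * macs


def utilization(achieved_ops_per_s: float, peak_ops_per_s: float) -> float:
    """Fraction of the chip's peak compute delivered on a workload."""
    if peak_ops_per_s <= 0:
        raise ValueError("peak_ops_per_s must be positive")
    return achieved_ops_per_s / peak_ops_per_s


# Hypothetical numbers, for illustration only.
macs = matmul_macs(4096, 4096, 4096)
elapsed_s = 0.005            # assumed measured run time of the matmul
peak = 50e12                 # assumed peak rating: 50 TeraOPs/s

achieved = ops_from_macs(macs) / elapsed_s
print(f"utilization = {utilization(achieved, peak):.1%}")
```

With these assumed numbers, the device would be delivering a little over half of its advertised OPs/s, which is the kind of gap a raw TeraOPs/s figure hides.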
Now, if the MAC capabilities of a design are ‘starved’ by the memory bandwidth, our design will never get high utilization. All of the OPs/s in the world will not make the design work faster since the memory bandwidth has become the bottleneck. We call this being memory bound. The memory subsystem has the job of keeping all of the compute on the chip busy. This can be done by being clever about how data is managed between external memory and on-chip memory. Caches are an example of this.
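One common way to reason about being memory bound is a roofline-style estimate. This is a standard model brought in here for illustration, not something the text above prescribes. It assumes that each byte fetched from memory supports a fixed number of OPs (the workload's arithmetic intensity), so the attainable OPs/s is capped by either the peak compute or the bandwidth times that intensity. All names and numbers below are hypothetical.

```python
def attainable_ops_per_s(peak_ops_per_s: float,
                         mem_bytes_per_s: float,
                         ops_per_byte: float) -> float:
    """Roofline-style cap: compute cannot outrun the data supplied to it."""
    return min(peak_ops_per_s, mem_bytes_per_s * ops_per_byte)


def is_memory_bound(peak_ops_per_s: float,
                    mem_bytes_per_s: float,
                    ops_per_byte: float) -> bool:
    """True when the memory pipe, not the MAC array, sets the limit."""
    return mem_bytes_per_s * ops_per_byte < peak_ops_per_s


# Hypothetical device: 50 TeraOPs/s peak, 500 GB/s memory bandwidth.
PEAK = 50e12
BANDWIDTH = 500e9

for intensity in (20.0, 200.0):  # assumed OPs performed per byte fetched
    cap = attainable_ops_per_s(PEAK, BANDWIDTH, intensity)
    bound = "memory" if is_memory_bound(PEAK, BANDWIDTH, intensity) else "compute"
    print(f"{intensity:5.0f} OPs/byte: cap {cap / 1e12:.0f} TeraOPs/s "
          f"({cap / PEAK:.0%} of peak, {bound} bound)")
```

Under these assumptions, a workload that reuses each byte only 20 times can never exceed 20% utilization, no matter how many more MAC units are added. Keeping data on-die raises the effective reuse per off-die byte, which is why the memory management described above matters.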
As might be obvious, the more compute a chip has, the more memory bandwidth is required to keep the MAC units busy. So, additional circuitry like buffers, transpose logic, and nonlinearity (ReLU) logic must be employed to accomplish this. These come at a cost of die area and power. These factors must be carefully balanced to make a device that devotes enough power and area to keeping the MACs busy and making the best use of the memory bandwidth. Simply throwing more and more OPs/s at the problem won’t help much in the real world if these other operations are not considered.
One of the main knobs we have to improve memory bandwidth usage, utilization, and power is to go to lower bit precisions for each MAC. It is beyond the scope of this blog to describe the exact challenges and solutions with lower precision, but it is an area of active research. In addition, we can exploit sparsity and employ techniques like pruning to get more effective computation out of the same device.
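To make the bandwidth side of the lower-precision argument concrete, here is a small sketch of how long it takes to move a fixed number of values across the memory pipe at different bit widths. It assumes the value count stays the same and ignores any packing or protocol overhead. The parameter count and bandwidth are hypothetical.

```python
def transfer_time_s(num_values: int, bits_per_value: int,
                    bandwidth_gbits_per_s: float) -> float:
    """Seconds to move num_values numbers once across the memory pipe."""
    total_bits = num_values * bits_per_value
    return total_bits / (bandwidth_gbits_per_s * 1e9)


NUM_PARAMS = 100_000_000      # hypothetical model size
BANDWIDTH_GBITS = 4_000.0     # hypothetical memory bandwidth, Gigabits/s

for bits in (32, 16, 8):
    t = transfer_time_s(NUM_PARAMS, bits, BANDWIDTH_GBITS)
    print(f"{bits:2d}-bit: {t * 1e3:.2f} ms per full pass over the parameters")
```

Halving the bit width halves the traffic for the same number of values, so the same memory pipe can feed twice as many MACs per second.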
There is a desire for simple metrics to compare AI workloads on various platforms. CPUs used to use clock rate as a basis for comparison, but better benchmarks eventually obviated that need. Similarly, in the dense compute space we commonly see TeraFLOPs/s or TeraOPs/s used. Instead, we need a metric that linearizes the relative training performance of hardware platforms. If device A has twice the metric rating of device B, that should imply that device A delivers roughly double the performance when training most neural networks, for instance.
To this end, I’d like to propose the following metric: Computational Capacity (CC). The three factors involved are the bit width of the numeric representation, memory bandwidth, and OPs/s.
Let b = number of bits of representation, m = memory bandwidth in Gigabits/s, and o = TeraOPs/s.
We use the square of the number of bits of representation as a simple proxy of the relative area of the multipliers to implement that precision. This implies that 16 bit multipliers are approximately 4 times larger circuits than 8 bit multipliers which is close to reality.
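The bits-squared proxy above can be checked with a few lines of arithmetic. This is only a sketch of the proxy as stated in the text; the function name is hypothetical, and real multiplier area depends on the circuit design and process.

```python
def relative_multiplier_area(bits: int) -> int:
    """Proxy from the text: multiplier area grows as the square of bit width."""
    return bits * bits


base = relative_multiplier_area(8)
for bits in (8, 16, 32):
    ratio = relative_multiplier_area(bits) / base
    print(f"{bits:2d}-bit multiplier: about {ratio:.0f}x the area of an 8-bit one")
```

Read the other way, the same silicon area could hold roughly four times as many 8-bit multipliers as 16-bit ones, which is why raw OPs/s figures quoted at different precisions are not directly comparable.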
As with any comparison metric, the CC metric is just an approximation of performance and will not capture the nuances of different architectures. An obvious issue is that chip-to-chip I/O is not considered at all, and including it might be a further refinement to the metric later on (indeed, feel free to reach out on Twitter or email us with any suggestions). Power and area devoted to interconnect can be highly advantageous if performance scales across multiple chips. The goal of this blog is really to get the community thinking more deeply about what it takes to achieve high performance on AI workloads.