Going beyond Full Utilization: The Inside Scoop on Nervana’s Winograd Kernels

By: Urs Köster and Scott Gray

This is part 2 of a series of posts on how Nervana uses the Winograd algorithm to make convolutional networks faster than ever before. In the first part we focused on benchmarks demonstrating a 2-3x algorithmic speedup. This part will get a bit more technical and dive into the guts of how the Winograd algorithm works, and how we optimized it for GPUs.


With the advent of new benchmark setting models such as GoogleNet Inception and Deep Residual Networks, convolutional networks architectures are becoming more and more complex. At the same time though, the types of convolutional layers used are very stereotyped, and a small number of configurations tend to be used over and over again. Most common by far are 3×3 convolutions, with 5×5 and 7×7 sometimes used in early layers. A small number of stride 2 layers are used to reduce feature map sizes. And recently, 1×3, 1×5 and 1×7 have been used to factorize convolutional layers Inception v4. Optimizing these convolutions has the potential for large performance pay-offs.

Direct convolution is the standard approach that computes convolutions using dot products. It hits a ceiling when the multipliers on the compute device are fully utilized, which modern implementations are very close to doing. Further speedups have to come from algorithmic improvements. One alternative is to use Fast Fourier Transforms (FFT) for convolution, which Facebook demonstrated with FB-FFT. This works best for large filters, though, and not the commonly used small 3×3 filters, which is where the Winograd algorithm comes in.

The Winograd algorithm

Optimal filtering algorithms using polynomial residuals have been studied by Andrei Toom and Stephen Cook, and generalized by Shmuel Winograd. While these ideas are not in the scope of a blog post, a review of this family of algorithms can be found in Chapters 2 and 5 in Richard Blahut’s Fast Algorithms for Signal Processing. Here we focus on the implementation of this algorithm on GPUs. For this we need to choose algorithms that maintain acceptable levels of accuracy and speedup without excessive memory overhead. These algorithms are described in detail by Lavin and Gray in Fast Algorithms for Convolutional Neural Networks.

Transforms for 3×3 filters (2×2 and 4×4 output tiles)

The Winograd algorithm works on small tiles of the input image. In a nutshell, the input tile and filter are transformed, the outputs of the transform are multiplied together in an element-wise fashion, and the result is transformed back to obtain the outputs of the convolution. This process is illustrated below, where we use the following notation:

C – number of input feature maps

K – number of output feature maps

H,W – height and width of the input feature map

R, S – height and width of the filter

P,Q – height and width of the output feature map

N – minibatch size


Fig 1a: An input tile of size 4×4 pixels is extracted from the input image. Tiles are overlapping with a stride of 2.


Fig 1b: The input is transformed by performing two matrix multiplications. Actual coefficients for the transforms are shown, they are obtained by solving the polynomials as described in Fast Algorithms for Convolutional Neural Networks by Lavin and Gray.

Winograd algorithm filters

Fig 2a: A filter with C=3 input feature maps and K=3 output feature maps. The filter size R,S is 3×3

doing two matrix multiplicationsFig 2b: Filter transform which results in an output the same size as the image tile. The transform is performed as two matrix multiplications.

Transforms for 3×3 filters

Fig 3a: Now the transformed image tile and filter are multiplied together to obtain the transformed output.

convolutions using dot products

Fig 3b: The result is transformed back into pixel space using the output transform , which again consists of two matrix multiplications.

finishing filter algorithm

Fig 3c: As the last step, the output tile is placed into the output feature map, and the convolution is completed.

Batched GEMM

Most of the computation happens in multiplying the transformed input and filter, as the transforms themselves are reused multiple times. Therefore, optimizing this multiplication is imperative for good performance. To this end, we package each single element-wise multiplication as a matrix multiplication, and then group together the matrix multiplications corresponding to individual elements to perform them with the best possible utilization of GPU resources.

The first step is to pose each element-wise multiplication of a filter and input element as a dot product (BLAS 1 operation). Summing up the product of input and filter over all C input feature maps takes care of this. Next we collect the K output channels for the filter and stack them up. This results in a matrix-vector product (BLAS2). Finally we collect the N elements of the minibatch and stack them up to obtain an efficient matrix-matrix (BLAS3) operation. The output matrix has one element for each combination of output channel and mini-batch element. This matrix is visualized below as one of the blocks on the diagonal.

The second step batches together these small matrix multiplications into a larger one. This is achieved by stacking up the matrices corresponding to all 16 filter elements at the different K and C, and stacking up the matrices corresponding to the 16 input elements at the different C and N. Doing the full matrix multiplication would give us the desired results on a block-diagonal. Hence we only perform the computations corresponding to those elements, while still increasing the compute sufficiently to hide any memory bandwidth bottlenecks.

Batched GEMM operation

Fig 4b: Batched GEMM operation. Dot products between filter and input tile. We only show two input (blue) and filter (red) elements out of the total 16 in the filter / input tile. Each square on the block-diagonal is a separate GEMM operation, but by loading the inputs for all tiles at the same time, we get better memory reuse.

Algorithmic Speedup

To understand the speedup from using Winograd over direct convolution, let’s consider an example. With a filter size of 3×3, the number of FMA (fused multiply-accumulate) operations to compute one output element with direct convolution is 9, exactly one per filter element. With the 2×2 Winograd transform considered here, a 2×2 block of output pixels is computed in one shot. This requires working on a 4×4 block of inputs (given by the “field of view” of the outputs), and transforming the filter to 4×4 as well. The number of FMA operations in multiplying the transformed inputs with the transformed filter is 16. Since this gives us 4 outputs in one shot, the FMAs per output are 4. Compared to the 9 for direct convolution, this represents an algorithmic speedup of 2.25.

Savings are even greater for larger tile sizes. If we compute a 4×4 output block at once, we work on a 6×6 block of inputs, and a transformed filter of 6×6. This requires 36 FMA operations, or 36/16 per input. This is a factor of four savings over direct convolution. Even larger transforms are possible and provide greater algorithmic speedup, but at the same time the memory requirements for the transformed tile increase steeply, while the additional benefit is small and will asymptote at 9, the number of pixels in the filter. With current hardware, the sweet spot is reached with output tiles of 2×2 or 4×4, which are both implemented in Nervana Winograd. For very large filters (such as the 11×11 filter used in the first layer of AlexNet) where larger tile sizes are favorable, FFT-based convolution such as FB-FFT using tile sizes of 32×32 or 16×16 can perform even better.

Computing Transforms

The full speedup is not quite obtainable in practice, as we have been ignoring the costs of performing the transforms. The reason the performance penalty is small in practice is that the transforms can be amortized to make up only a small part of the total compute:

In our example, the input transform consists of two 4×4 matrix multiplications, which reduces to 32 additions (skipping the zero elements), and this transform can be reused over all K output feature maps. The filter transform needs to be done only once per filter can be reused for NPQ images in the mini-batch and output feature map positions. Lastly, the output transform is amortized over C, the input channel. For ImageNet-scale networks, C, K, and PQN are generally large enough that the computational cost of the transform is small.

On the GPU, the transforms can be implemented as separate kernels or fused into the kernel that performs the matrix multiplications. Using separate kernels is a great way to amortize the cost of the transform over multiple uses, but it has two drawbacks: (1) Memory bandwidth can become a bottleneck because after performing the transform, there is more data that needs to be moved on chip, and (2) additional DDR scratch space is needed to store the transformed data. Using fused kernels alleviates these issues, but incurs a fixed computational overhead because the transform has to be computed for every GEMM tile rather than just once.


For very small mini-batches, it is challenging to feed all the multipliers even using the batched GEMM approach, which is optimized for a mini-batch size of 32 or more. But even for a mini-batch of one, we can create a much larger effective batch size by working on multiple output tiles simultaneously, a technique we call superblocking. In Figure 4b, the N dimension would be replaced by NxP/2xQ/2, i.e. the different output locations would be treated as additional data points in the minibatch. This works extremely well for feature map sizes that are large or powers of two, but comes with the caveat that super-blocks have to be well-aligned with the inputs. For example, consider a 14×14 input feature map, which can be tiled perfectly with tiles of 4×4 and stride 2. If we use superblocking with a 2×2 superblock size, we are effectively operating on 6×6 input tiles with stride 4 now, so we have to pad the feature map to 16×16, resulting in 24% wasted computation.


These techniques, combined with careful assembler level optimization for an efficient implementation, make it possible to obtain a speedup of close to 4x over direct convolution on individual layers. As the technique applies to fprop, bprop and update operations alike, essentially all of the computationally expensive parts of a convolutional network profit from this speed boost. In practice, a factor of 2 speedup for training an entire model is easily attainable. In fact, this often exposes new bottlenecks, for example providing data batches and launching kernels fast enough to keep up with the Winograd convolutions.

In the next part of this blog series, we will review how the different optimizations discussed here apply to various layer configurations in different networks.