Parameter-Efficient Training of Deep CNNs

Neural network models are becoming larger, and the initial training of a full model can be costly. It involves massive datasets and extensive computation, sometimes to the point where training big models becomes prohibitively expensive, especially when done on-premises. Large neural networks typically contain complex structured connections between layers. To reduce model size after training (so that it can be deployed in the real world), these connections are often reduced to a practical size through a process known as pruning or connection removal. A successfully pruned version ideally performs as well as the original large model. However, this still requires the computationally expensive two-step process of training a large model and then compressing or pruning it.

This blog post is based on a paper (accepted as an oral presentation at ICML 2019) that documents a technique that approaches training from the opposite direction. It begins with a compact model and, during training, modifies the model structure based on the data. This method is more scalable and computationally efficient than starting with a large model followed by compression, as training operates directly on the compact model without requiring access to a large over-parameterized model. Unlike past attempts, the proposed technique is able to train a small model with performance equivalent to a large model that has been compressed.

Achieving a compact network

Current literature on training notes that deep networks learn more effectively when they are highly overparameterized, and that overparameterization leads to better performance. Evidence has attributed this need for overparameterization to the geometry of the high-dimensional loss landscapes of overparameterized deep neural networks. One major downside is that a model with more parameters is more costly to train and use. Several techniques are currently available to trim the size of a trained model. All are highly effective in reducing the number of network parameters with little to no degradation in accuracy. They either operate on a pre-trained model or they require the overparameterized model to be maintained during training.

These techniques include:

  • Distillation methods
  • Reduced bit-precision methods
  • Low-rank decomposition methods
  • Pruning methods

The accuracy of trimmed models suggests that small and shallow networks – in effect, subsets resulting from model reduction – contain essential parameter configurations that put them on a par with bigger, deeper networks. By implication, then, overparameterization is not a strict necessity for an effective network if compact networks can be achieved in a more direct way.

The approach presented here is a novel dynamic sparse reparameterization method that addresses the limitations of previous techniques, which incur high computational cost and require manual configuration of the number of free parameters allocated to each layer. It outperforms previous static and dynamic reparameterization methods by using an adaptive threshold for pruning and reallocating parameters across layers without imposing a fixed sparsity on each layer. Previous dynamic sparse reparameterization methods were mostly concerned with sparsifying across fully connected layers, and none of them were applied to large-scale convolutional neural networks.

By exploring structural degrees of freedom during training, the method described here yields the best accuracy for a fixed parameter budget at training time, on par with accuracies obtained by iteratively pruning a large, pre-trained dense model. It trains compact models to the same level of accuracy as compact models obtained by compressing large models, which removes the need for a big model.

How this approach to training works

This dynamic parameterization scheme trains deep convolutional neural networks (CNNs) in which the majority of layers have sparse weight tensors. All sparse weight tensors are initialized at the same sparsity level (fraction of zeros). Bias parameters and the parameters of batch normalization layers keep a full (non-sparse) parameterization.
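To make the initialization concrete, here is a minimal NumPy sketch that builds one boolean mask per weight tensor, all at the same sparsity level. The function name init_sparse_masks, the layer names and shapes, and the choice of uniformly random non-zero positions are assumptions made for illustration; the post does not say how the initial non-zero positions are chosen.

```python
import numpy as np


def init_sparse_masks(shapes, sparsity, seed=0):
    """Return one boolean mask per weight tensor, each with the same fraction of zeros.

    Assumption: active (non-zero) positions are picked uniformly at random.
    Biases and batch-norm parameters are simply not given a mask, so they stay dense.
    """
    rng = np.random.default_rng(seed)
    masks = {}
    for name, shape in shapes.items():
        size = int(np.prod(shape))
        n_active = int(round((1.0 - sparsity) * size))
        flat = np.zeros(size, dtype=bool)
        flat[rng.choice(size, size=n_active, replace=False)] = True
        masks[name] = flat.reshape(shape)
    return masks


# Hypothetical convolutional weight shapes: (out_channels, in_channels, kH, kW).
shapes = {"conv1": (16, 3, 3, 3), "conv2": (32, 16, 3, 3)}
masks = init_sparse_masks(shapes, sparsity=0.9)
for name, mask in masks.items():
    print(name, "active weights:", int(mask.sum()), "of", mask.size)
```

In a training loop, each mask would be multiplied into its weight tensor so that only the active positions carry trainable values.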

Throughout training, the same total number of non-zero parameters in the network is maintained. Parameters are moved within and across tensors in two phases: a pruning phase, followed immediately by a growth phase, as described in Algorithm 1 of the paper. This parameter re-allocation step is carried out every few hundred training iterations.

The algorithm employs magnitude-based pruning, which removes the connections whose weights have the smallest magnitudes. Here, pruning is based on an adaptive global threshold shared by all layers. This makes it particularly efficient compared to previous methods, which pruned a fixed fraction of the smallest weights and thus had to sort a large number of weights.
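The pruning phase can be sketched as follows. This is an illustration under stated assumptions rather than the paper's exact procedure: the function names, the target count, the tolerance parameter, and the doubling/halving rule for adapting the threshold are all assumptions chosen to show how a shared threshold avoids sorting.

```python
import numpy as np


def prune_by_threshold(tensors, threshold):
    """Zero every active weight whose magnitude is below the shared threshold.

    One comparison per weight is enough, so no sorting is required.
    Returns the total number of weights pruned across all tensors.
    """
    pruned = 0
    for w in tensors.values():
        drop = (w != 0) & (np.abs(w) < threshold)
        pruned += int(drop.sum())
        w[drop] = 0.0
    return pruned


def adjust_threshold(threshold, pruned, target, tolerance=0.1):
    """Assumed adaptation rule: nudge the threshold so that the next step
    prunes roughly `target` weights."""
    if pruned < (1.0 - tolerance) * target:
        return threshold * 2.0  # pruned too few: raise the bar
    if pruned > (1.0 + tolerance) * target:
        return threshold / 2.0  # pruned too many: lower the bar
    return threshold


# Toy weights standing in for two sparse layers (zeros are already inactive).
tensors = {
    "layer1": np.array([0.5, -0.01, 0.0, 0.2, -0.003]),
    "layer2": np.array([0.02, 0.0, -0.7, 0.004]),
}
threshold = 0.015
pruned = prune_by_threshold(tensors, threshold)
threshold = adjust_threshold(threshold, pruned, target=2)
print("pruned:", pruned, "next threshold:", threshold)
```

Because the threshold is global, layers whose weights are mostly small lose more parameters than layers whose weights are large, which is what lets the parameter budget shift between layers.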

Parameters are re-allocated across layers during training, and no fixed sparsity level is imposed on any layer. As a result, the trained networks reach higher accuracy, and extremely sparse networks can be trained.
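A growth phase that re-allocates parameters across layers might look like the sketch below. The allocation rule (each layer receives new parameters in proportion to its number of surviving weights), the random choice of positions, and the function name grow are assumptions for illustration; the post only states that the total number of non-zero parameters stays fixed and that per-layer sparsity is free to change.

```python
import numpy as np


def grow(masks, n_grow, seed=0):
    """Activate `n_grow` new positions in total.

    Assumed rule: each layer's share is proportional to its number of
    surviving (active) weights; new positions are chosen at random.
    Returns the number of positions grown in each layer.
    """
    rng = np.random.default_rng(seed)
    names = list(masks)
    surviving = np.array([masks[n].sum() for n in names], dtype=float)
    shares = np.floor(n_grow * surviving / surviving.sum()).astype(int)
    # Give any rounding remainder to the layers with the most survivors.
    for i in np.argsort(-surviving)[: n_grow - shares.sum()]:
        shares[i] += 1
    for name, k in zip(names, shares):
        free = np.flatnonzero(~masks[name])  # flat indices of inactive positions
        if k > free.size:
            raise ValueError(f"{name} has too few free positions")
        masks[name].flat[rng.choice(free, size=int(k), replace=False)] = True
    return dict(zip(names, shares.tolist()))


# Masks as they might look right after pruning: layer1 kept 6 weights, layer2 kept 2.
masks = {
    "layer1": np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0], dtype=bool),
    "layer2": np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0], dtype=bool),
}
before = sum(int(m.sum()) for m in masks.values())
allocation = grow(masks, n_grow=4)
after = sum(int(m.sum()) for m in masks.values())
print(allocation, "active before:", before, "active after:", after)
```

If the number of weights grown equals the number pruned in the preceding phase, the network-wide count of non-zero parameters is unchanged while each layer's sparsity is free to drift.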

Performance concerns

One potential concern is that networks with static parameterization could match the performance of this dynamic parameterization scheme if they were trained for more epochs. To address it, all statically parameterized networks with the same number of parameters were trained for double the number of epochs used to train the dynamic sparse models. (An epoch is one complete pass, forward and backward, over all the training examples.) This control ensures that any superior accuracy the method achieves cannot be due merely to faster convergence during training.

This dynamic parameterization scheme incurs minimal computational overhead. Combined with the doubled epoch count, this means that the statically parameterized networks it is compared against were trained using significantly more computational resources than the dynamic sparse approach.

Testing and conclusions

A series of tests, documented in the full paper, covers experiments with:

  • WRN-28-2 on CIFAR-10: A Wide ResNet model trained to classify CIFAR-10, an image collection from the Canadian Institute for Advanced Research that is commonly used in machine learning.
  • ResNet-50 on ImageNet: A bottleneck architecture trained on a large visual database designed for use in visual object recognition.

Test results indicated that the dynamic parameterization method, which allocates free parameters across the network based on a simple heuristic, can achieve significantly better accuracies than static methods for the same model size. It yields better accuracies than previous dynamic parameterization methods, and it outperforms all the static parameterization methods tested. More work is needed to explain the mechanism underlying this phenomenon.

The results further indicated that for deep residual CNNs it is possible to train sparse models directly to reach generalization performance comparable to sparse networks produced by iterative pruning of large dense models. Moreover, this dynamic parameterization method results in models that significantly outperform equivalent-size dense models.

Experiments indicate that exploring network structure during training is essential to achieve the best accuracy. If a static sparse network (one with a fixed structure) is constructed that copies the final structure of the sparse network discovered by the dynamic parameterization scheme, this static network fails to train to the same level of accuracy.

Exploring structural degrees of freedom during training is key, and this method is the first that is able to fully explore these degrees of freedom, using its ability to move parameters within and across layers. These results do not contradict the conventional wisdom that extra degrees of freedom are needed while training deep networks. Rather, they point to structural degrees of freedom as an alternative to the degrees of freedom introduced by over-parameterization.

For more information about AI research from Intel, follow @IntelAIResearch and visit https://ai.intel.com.