Fast Inference with Early Exit

Deep learning has increased the accuracy of many machine learning problems to (and sometimes beyond) human levels. Many devices and applications in our daily lives are already enhanced with machine learning techniques; in the near future, more and more tasks normally performed by humans (e.g. navigating while driving a car) will be assisted or handled by deep learning systems.

The power of deep learning models means that not every classification requires the entire model’s attention. In fact, a portion of the model is adequate for quite a substantial number of “easy” data examples; the full depth is only really required for more complex inputs.

But first, we must determine whether a given input is simple or complex, and then treat it accordingly. If we can check along the way whether we already have enough information to make a decision, we can reduce unnecessary processing – and save time and energy. This article addresses the problem of simple vs. complex classification using convolutional neural networks (CNNs).

Motivation

We focus on the classification problem with machine learning. Deep models are able to express complex boundaries between classes with increased accuracy, but often at the cost of increased computational complexity in the number of layers and overall capacity of the model.

“Early exit” is a strategy built on a straightforward, easy-to-understand idea. Figure 1 shows an example in a 2D feature space. While a deep network can represent a more complex and expressive boundary between classes (as shown by the curved area) [1], it is also clear that much of the data can be properly classified with even the simplest of classification boundaries (e.g. a linear classifier, shown by the straight, bold black line).

Figure 1: Simple and more expressive classification boundaries

In other words, data points outside of the two parallel green lines that bound the complex decision boundary of the deep network can accurately be classified with a simple, linear decision boundary. Those points within the two parallel lines are tougher to distinguish and require extra processing to accurately classify. (Those familiar with SVM classifiers will appreciate the analogy of the support vectors to the boundaries between hard and easy data points.) We then use a confidence measure to determine if we can accurately classify the data point.

Previous Work

Early exit has been an active area of research recently, with several teams taking various approaches to the topic. What ties this research together is the idea that a confidence measure will determine if a prediction made at a certain stage can exit early from the entire deep learning topology, thus saving unnecessary processing in the subsequent layers.

Interesting work at Purdue [2] leverages the variance in difficulty among real-world data and uses only part of the network to handle easier recognition tasks. The team proposed a methodology known as Conditional Deep Learning (CDL), in which linear classifiers attached after convolutional layers assess whether the classification can be terminated at that layer. The linear classifier provides a confidence metric for early termination of the network. While performance improvements can be realized with such early termination, Panda et al. suggest that power saving is the ultimate advantage of short-circuiting the work of the rest of the network.

Another research team, at Harvard, proposed BranchyNet, which selectively inserts exits between specific layers; the check for whether the network can exit is made after a small amount of extra processing on the exit branch itself [3]. The example code in Distiller most closely resembles this approach.

A third method is to perform dynamic routing of the data and thus skip certain layers of processing along the way [4]. While this isn’t strictly an early exit, it does make use of a criterion to assess whether certain work can be avoided, much in the same spirit as an early exit.

An interesting approach to a new architecture that incorporates (or at least enables) early exit is the Multi-Scale Dense Network (MSDNet) proposed by Huang et al. [5]. MSDNet maintains feature representations at multiple scales throughout the network. This structure creates early exit opportunities with coarser-scale feature information available at the earlier exits than would be available simply by inserting exits into a standard network.

The goal of our early exit strategy is to provide optimizations by modestly augmenting the data scientists’ architecture with early exits. A ResNet-50 with our early exit strategy is basically a pure ResNet-50 with one or more exit points (and some exit layers with downsizing and read-out functionality to facilitate accurate early classification). By comparison, MSDNet is a brand new convolution network architecture that is amenable to early exit insertion based on various computational budget constraints.

Our approach most closely resembles the BranchyNet approach of selectively inserting exits between certain layers. Note that the example early exit code does not actually exit early, but rather uses the computed probabilities and entropies to determine the portion of data points that could exit early.

In terms of taking actual exits and avoiding the work of subsequent layers, some architecture-specific considerations are required in order to achieve the full benefit of early exit, especially for special purpose hardware accelerators. This topic will be covered in a future publication.

Exit Processing

At each early exit, a minimal amount of extra processing is needed to produce a prediction and a confidence measure. There might be a few layers of processing (e.g. convolutional layers), but the exit must finish with a fully connected layer that produces a probability distribution over the classes. We use the output at that exit to determine not only the current class prediction (i.e. the class with the maximum value) but also the confidence of that prediction. If the confidence is strong enough, we have an indication that further processing will not change the prediction, and we can exit, avoiding the rest of the layers.
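As a rough illustration of what such an exit branch might look like in PyTorch (a sketch only; the class name, layer sizes, and use of average pooling are assumptions for illustration, not Distiller’s exact implementation):

import torch
import torch.nn as nn

class ExitBranch(nn.Module):
    """A minimal early exit branch: a little extra processing followed by
    a fully connected layer that produces per-class scores."""

    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        # Optional extra processing on the branch (here, a single conv layer).
        self.conv = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1)
        # Downsize the spatial dimensions so the final layer stays small.
        self.pool = nn.AdaptiveAvgPool2d(1)
        # The exit must end in a fully connected layer over the classes.
        self.fc = nn.Linear(in_channels, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = torch.relu(self.conv(x))
        x = self.pool(x).flatten(1)
        return self.fc(x)  # logits; a softmax yields the class distribution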

Confidence Measure

The cross-entropy loss at a given exit is defined as:

    L = -\sum_i y_i \log(p_i)        (Equation 1)

where y_i is the one-hot ground-truth label and p_i is the predicted probability for class i. Since we are dealing with multiple exits, we take a weighted combination of the per-exit losses to form an overall loss during training of the network with multiple exits:

    L_total = \sum_n w_n L_n        (Equation 2)

where L_n is the cross-entropy loss at exit n and w_n is the weight assigned to that exit.

During inference stages (validation and/or test), if the entropy is below a specific threshold, then the confidence is high enough to exit.
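A minimal sketch of this inference-time check, assuming the per-sample confidence measure described in the next section (-log of the maximum softmax probability); the function and variable names are illustrative, not Distiller’s actual API:

import torch
import torch.nn.functional as F

def take_early_exit(exit_logits: torch.Tensor, threshold: float) -> bool:
    # exit_logits: raw output of this exit's fully connected layer for a
    # single sample, shape [num_classes].
    probs = F.softmax(exit_logits, dim=-1)
    # Confidence measure: cross-entropy against the exit's own prediction,
    # which reduces to -log(p_max).
    confidence = -torch.log(probs.max())
    # A small value means a peaked, confident distribution, so we can exit.
    return confidence.item() < threshold

For the CIFAR10 command line shown below, threshold would be 0.4 (the value passed via --earlyexit_thresholds).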

Cross-Entropy During Inference?

Cross-entropy is a metric most often used during training, where it measures the loss before backpropagation. We can also use it during inference, even though production inference runs provide no ground-truth vectors: we simply treat the prediction at that exit (the class with the maximum probability) as the label. The loss then reduces to -log(pmax), where pmax is the maximum value of the probability distribution at that exit.
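Written out with the notation of Equation 1 (using \hat{c} to denote the class predicted at that exit, a symbol introduced here only for illustration):

    L = -\sum_i y_i \log(p_i),  where  y_i = 1 if i = \hat{c}, else 0
      = -\log(p_{\hat{c}})
      = -\log(p_max)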

Effect of Exits During Training

The modified network is trained with the exit paths. This will produce errors at each exit during training. As shown previously in Equation 2, the losses are combined in a weighted linear fashion to produce an overall loss. Since the backpropagation comes from each exit, the earlier exits have a significant influence on the early layers of the network and help mitigate the vanishing gradient problem.
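A minimal sketch of how the per-exit losses from Equation 2 might be combined during training (the function and argument names are illustrative; Distiller’s actual training loop differs in its details):

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

def combined_loss(exit_outputs, target, loss_weights):
    # exit_outputs: list of logits tensors, one per exit (early exits first,
    # final classifier last); loss_weights: one weight per entry.
    # The command lines in this article pass weights for the early exits only
    # (e.g. --earlyexit_lossweights 0.1 0.3); the final exit presumably takes
    # the remaining weight.
    total = torch.zeros(())
    for logits, weight in zip(exit_outputs, loss_weights):
        total = total + weight * criterion(logits, target)
    return total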

Early Exit Example in Distiller

Distiller is a Python* package for neural network compression research. It provides a PyTorch* environment for prototyping and analyzing compression algorithms, such as sparsity-inducing methods and low-precision arithmetic.

Early exit is a new feature in Distiller and is available as an open-source package on GitHub. The compress_classifier example now includes sample code for ResNets of various sizes that run on standard datasets such as CIFAR10 and ImageNet.

CIFAR10

Using Distiller’s example of compression of CNNs, we can run an actual example of a ResNet architecture on the CIFAR10 dataset. If you’ve already cloned the Distiller repo, go ahead and run the following command (adjusting paths and parameters for your setup as necessary). You can run ResNets of various sizes (20, 32, 44, 56, 110) with exactly one early exit inserted after the first grouping of layers. Below is a command line for ResNet-32 with an early exit entropy threshold of 0.4 and a 0.4 weighting on the early exit loss:

python compress_classifier.py --arch=resnet32_cifar_earlyexit --epochs=20 -b 128 \
--lr=0.003 --earlyexit_thresholds 0.4 --earlyexit_lossweights 0.4 -j 30 \
--out-dir /home/ -n earlyexit /home/pcifar10

The code will automatically download the CIFAR10 dataset if it does not exist already in the specified location.

Figure 2 shows the influence of various thresholds on overall accuracy and on the percentage of data that actually exits early. We see that for ResNet-20 with a 0.5 loss weight and a threshold of 0.8, we can achieve the best accuracy and the highest performance improvement (i.e. more data samples exit early, thus saving work).

Figure 2: Early Exit % vs Entropy Threshold

The curves have a similar shape for other loss ratios, but some loss ratios perform better than others. For example, if we plot the CIFAR10 Top1 accuracy vs. loss ratio (see Figure 3), we see that the weighting of the separate loss values for each exit has some influence.

Figure 3: Early Exit % and Entropy Threshold Influence

ImageNet

Even more interesting results come from the ImageNet dataset, with 1000 categories and more than 1.2 million images. You must download the ImageNet dataset yourself prior to running with Distiller. The example code in Distiller for early exit on ImageNet has exactly two exits and can be invoked with a command line, such as:

python compress_classifier.py --arch=resnet50_earlyexit --epochs=120 -b 128 \
--lr=0.003 --earlyexit_thresholds 1.2 0.9 --earlyexit_lossweights 0.1 0.3 \
-j 30 --out-dir /home/ -n earlyexit /datasets/I1K/i1k-extracted/

For ImageNet, we present more detailed performance results. Figure 4 shows a table of compute complexity for ResNet-50 on ImageNet.

ResNet-50 with 2 Early Exits: Performance

Overhead est. (MACs):                    116,464,210
% not exiting early:                     53.56%
Savings in layers after exit0 (MACs):    3,137,496,064
MACs through exit0:                      454,045,522
% exiting at exit0:                      22.72%
Savings in layers after exit1 (MACs):    1,997,597,696
MACs through exit1:                      1,605,587,538
% exiting at exit1:                      23.72%
Total MACs, standard ResNet-50:          3,486,721,024
Worst-case latency w/ overhead (MACs):   3,603,185,234
Total % exiting early:                   46.44%
Entropy thresholds:                      1.2 1.2
Average MACs per input:                  2,413,744,552
Savings per data input:                  30.77%
Speedup:                                 1.44
No. of epochs:                           120
Loss weights:                            20-30-50

Figure 4: Computational Complexity and Savings for ResNet-50 with 2 Early Exits

The table shows the total number of MAC (multiply-accumulate) operations for the ResNet-50 as well as the percentages of data that can exit early in the network. In the current example, the early exits add processing overhead, so the worst-case latency is actually somewhat higher than that of a standard ResNet-50. However, the average latency is reduced substantially (by over 30%), given that almost half of the data can terminate its processing at one of the early exits.
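As a sanity check on the table, the headline numbers can be approximately reproduced from the per-exit figures (this is our reading of the table, not Distiller’s reporting code; small discrepancies come from the rounded percentages):

# Figures taken from the table in Figure 4.
total_macs    = 3_486_721_024   # standard ResNet-50
overhead_macs = 116_464_210     # extra MACs added by the two exit branches
macs_exit0    = 454_045_522     # MACs consumed when a sample exits at exit0
macs_exit1    = 1_605_587_538   # MACs consumed when a sample exits at exit1
p_exit0, p_exit1, p_full = 0.2272, 0.2372, 0.5356

# Worst case: traverse the whole network plus the exit overhead.
worst_case = total_macs + overhead_macs             # 3,603,185,234

# Average cost per input, weighted by how often each path is taken.
avg_macs = (p_exit0 * macs_exit0
            + p_exit1 * macs_exit1
            + p_full * worst_case)                  # ~2.414e9 vs. 2,413,744,552 reported

savings = 1 - avg_macs / total_macs                 # ~0.31 (30.77% reported)
speedup = total_macs / avg_macs                     # ~1.44
print(f"{avg_macs:,.0f} MACs/input, {savings:.1%} savings, {speedup:.2f}x speedup")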

Summary

In this article, we have examined an optimization strategy for early termination of processing during inference based on the concept that a sufficient confidence level can be achieved after only a portion of the network has been traversed. These “easy” data points can exit early and contribute to a substantial performance (and/or power) benefit.

Some things we will look at in the future:

  • Implementation considerations for early exit on special purpose hardware
  • Heuristics for exit placement and methods to reduce overhead
  • Effects of reduced precision and pruning
  • Using pre-trained models to reduce training time (i.e. only train exit paths)

References

    [1] C.D. Lee, E. Jung, O. Kwon, M. Park, and D. Hong. Decision Boundary Formation of Neural Networks. Technical report, Yonsei University, 2003. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.495.8609&rep=rep1&type=pdf
    [2] P. Panda, A. Sengupta, and K. Roy. Conditional Deep Learning for Energy-Efficient and Enhanced Pattern Recognition. Technical report, Purdue University, 2015. http://arxiv.org/abs/1509.08971
    [3] S. Teerapittayanon, B. McDanel, and H. T. Kung. BranchyNet: Fast Inference via Early Exiting from Deep Neural Networks. Technical report, Harvard, 2017. https://arxiv.org/abs/1709.01686
    [4] X. Wang, F. Yu, Z.-Y. Dou, T. Darrell, and J. E. Gonzalez. SkipNet: Learning Dynamic Routing in Convolutional Networks. European Conference on Computer Vision (ECCV), 2018. http://arxiv.org/abs/1711.09485
    [5] G. Huang, D. Chen, T. Li, F. Wu, L. van der Maaten, and K. Weinberger. Multi-Scale Dense Networks for Resource Efficient Image Classification. Technical report, Cornell University, 2017. https://arxiv.org/abs/1703.09844
Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.
© Intel Corporation
*Other names and brands may be claimed as the property of others.