Accurate and expeditious semantic segmentation of an image is critical for applications like autonomous driving. The objects in the scene need to be localized and placed into categories to identify items like the road, the sidewalk, street signs, pedestrians, and other vehicles.
In a recent attempt at solving this problem, a group from the University of Cambridge Computer Vision and Robotics Group [1,2] designed a deep network, named SegNet, to generate a pixel-by-pixel classification of an input image and trained it on the Cambridge-driving Labeled Video Database (CamVid) dataset (among others) [3,4]. An example of a street scene and the corresponding semantic segmentation from the CamVid training set is shown in Figure 1. Each pixel in the image is assigned a single label from one of 12 categories, and that label is represented by the pixel color.
Figure 1: Example of the input image (left) and target output (right) from the CamVid [1-4] training dataset. The color of each pixel in the target image represents a class of object (e.g. car, bicyclist, pedestrian, street sign, etc.). There are a total of 12 possible classes in this dataset.
The SegNet model resembles a deep convolutional autoencoder with the output layers altered to generate a pixel-wise softmax output. Given an input image, the model generates a probability distribution over the output categories for each pixel (Figure 2). A novel “upsampling” layer is used in the decoder; it generates sparse, upsampled output feature maps using the max pooling indices from the corresponding pooling layer in the encoder. Different deep network models can be incorporated into the SegNet architecture. For this work the 16-layer VGG model (VGG-16) [5] was used (Figure 2).
Figure 2: SegNet model architecture. The first half of the model (encoder) is similar to the VGG model used for image classification, except that the fully connected output layers are excluded. The second half is the decoder, which mirrors the structure of the encoder. The upsampling layers receive the indices of the maximum input pixels from the corresponding pooling layers and use those indices to generate a sparse “unpooling” output. The output of the model is a pixel-wise softmax layer that generates a probability distribution over the output classes for every image pixel. These probabilities can be used to infer a classification for each pixel. Figure taken from [1].
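The index-based unpooling described above can be illustrated with a minimal pure-Python sketch. This is not the neon layer (which runs in CUDA on the GPU); the function names `max_pool_with_indices` and `max_unpool` are hypothetical, and the sketch assumes a single 2D feature map whose height and width are divisible by the pooling size.

```python
def max_pool_with_indices(fmap, k=2):
    """k x k max pooling with stride k; also records the (row, col) of each max."""
    rows, cols = len(fmap), len(fmap[0])
    pooled, indices = [], []
    for r in range(0, rows, k):
        pooled_row, index_row = [], []
        for c in range(0, cols, k):
            # Collect every value in the window together with its position.
            window = [(fmap[i][j], (i, j))
                      for i in range(r, r + k) for j in range(c, c + k)]
            value, position = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            index_row.append(position)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices


def max_unpool(pooled, indices, out_shape):
    """Scatter each pooled value back to its recorded position; zeros elsewhere."""
    rows, cols = out_shape
    out = [[0.0] * cols for _ in range(rows)]
    for pooled_row, index_row in zip(pooled, indices):
        for value, (i, j) in zip(pooled_row, index_row):
            out[i][j] = value
    return out


feature_map = [
    [1.0, 3.0, 2.0, 0.0],
    [4.0, 2.0, 1.0, 5.0],
    [0.0, 1.0, 7.0, 2.0],
    [6.0, 2.0, 3.0, 1.0],
]
pooled, indices = max_pool_with_indices(feature_map)
sparse = max_unpool(pooled, indices, (4, 4))
for row in sparse:
    print(row)
```

In the decoder, the pooled values would come from the previous decoder layer rather than directly from the encoder; only the indices are shared. The sparse map is then densified by the convolutions that follow.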
The VGG-16 based SegNet model was implemented using the neon deep learning library developed by Nervana Systems. To implement SegNet in neon, an upsampling layer and a pixel-wise softmax layer had to be developed. The latter requires only a few lines of Python, since the operations are all supported by the neon backend. The upsampling layer was adapted from the neon pooling layer with a minimal amount of CUDA C programming. An example of the pixel-wise classification generated by the neon implementation of SegNet on one of the test set images is shown in Figure 3. The categories inferred by the model are denoted by the color of the pixel in the output image (text labels were added to the image for reference).
Figure 3: Output of the SegNet model trained in neon. This is the output from the model for the image shown in Figure 1. Each color represents the category for each pixel in the image. Some of the categories are labeled with overlaid text for clarity.
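To make the pixel-wise softmax and the per-pixel class inference concrete, here is a small sketch in plain Python. It is not the neon layer; the names `pixelwise_softmax` and `label_map` are hypothetical, and the scores are assumed to be laid out as `scores[class][row][col]`.

```python
import math


def pixelwise_softmax(scores):
    """Normalize scores[c][y][x] over the class axis c, independently per pixel."""
    n_classes = len(scores)
    height, width = len(scores[0]), len(scores[0][0])
    probs = [[[0.0] * width for _ in range(height)] for _ in range(n_classes)]
    for y in range(height):
        for x in range(width):
            pixel = [scores[c][y][x] for c in range(n_classes)]
            peak = max(pixel)  # subtract the max for numerical stability
            exps = [math.exp(s - peak) for s in pixel]
            total = sum(exps)
            for c in range(n_classes):
                probs[c][y][x] = exps[c] / total
    return probs


def label_map(probs):
    """Pick the most probable class index at every pixel."""
    n_classes = len(probs)
    height, width = len(probs[0]), len(probs[0][0])
    return [[max(range(n_classes), key=lambda c: probs[c][y][x])
             for x in range(width)]
            for y in range(height)]


# Toy example: 3 classes on a 1 x 2 image.
scores = [
    [[2.0, 0.0]],  # class 0
    [[1.0, 0.0]],  # class 1
    [[0.0, 3.0]],  # class 2
]
probs = pixelwise_softmax(scores)
labels = label_map(probs)
print(labels)
```

Mapping each class index to a fixed color then gives an image like the one in Figure 3.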
The SegNet model implemented here is essentially composed of two VGG-16 models with the fully connected layers excluded. Models with the VGG-16 architecture require a large amount of GPU memory, which limits the number of images in each training mini-batch. On a Titan-X GPU with 12 GB of memory, training is limited to a mini-batch size of 4 images for 256×512 pixel color images. The recent addition of Winograd convolution kernels to neon overcomes the limitations on GPU utilization that plague conventional convolution computation with small batch sizes. Furthermore, the Winograd kernels are highly efficient for the 3×3 convolutions that are used throughout the VGG-16 convolutional layers. These optimizations in neon make it the optimal library for training and deploying a model like SegNet.
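For intuition about why Winograd kernels suit small filters, the sketch below shows the one-dimensional minimal filtering algorithm F(2,3), which produces two outputs of a 3-tap filter with 4 multiplications instead of 6. It is only an illustration in plain Python, not neon's GPU kernels; the function names are hypothetical. The 2D variant used for 3×3 convolutions nests the same idea and needs 16 multiplications per 2×2 output tile instead of 36.

```python
def direct_f23(d, g):
    """Two outputs of a 3-tap filter g slid over 4 inputs d: 6 multiplications."""
    return [d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
            d[1] * g[0] + d[2] * g[1] + d[3] * g[2]]


def winograd_f23(d, g):
    """Same two outputs using 4 multiplications between data and filter terms."""
    # Filter transform: depends only on g, so it can be computed once per filter.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2.0
    u2 = (g[0] - g[1] + g[2]) / 2.0
    u3 = g[2]
    # Data transform (additions only) followed by 4 element-wise products.
    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3
    # Output transform (additions only).
    return [m1 + m2 + m3, m2 - m3 - m4]


data = [0.5, -1.0, 2.0, 3.0]
taps = [2.0, 1.0, 4.0]
print(direct_f23(data, taps))
print(winograd_f23(data, taps))
```

Both calls should agree up to floating-point rounding; the saving comes from trading multiplications for cheaper additions and a filter transform that is amortized over the whole image.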
Here we provide benchmarks of the computational speed of the forward and backward passes of the SegNet model in neon, along with the same benchmarks for the Caffe framework for comparison (Table 1). The custom fork of Caffe and the SegNet model implementation used to generate these benchmarks can be found in the links below [6,7]. On a single GPU the neon implementation is 7 times faster than Caffe for the forward pass through the network and 4.5 times faster for the backward pass. Further speed can be gained by using the multi-GPU backend for neon, which spreads the computational load over multiple GPUs and processes multiple mini-batches concurrently. Table 2 shows the additional training speedup that can be realized on 2 and 4 GPUs working in parallel with neon. Since the total number of images in a mini-batch scales with the number of GPUs used, the multi-GPU benchmarks in Table 2 are expressed per image processed. Similar benchmarks could not be obtained with Caffe because the custom Caffe implementation that supports SegNet does not currently include multi-GPU support.
| Pass | neon | Caffe | Speedup |
|---|---|---|---|
| total | 265 ms | 1455 ms | 5.5 x |
Using cuDNN v3
Table 1: Speed benchmarks for the SegNet model. Computation times for the forward and backward passes with neon and Caffe are shown in the table. The input image size is 256×512 (3 channels) and the times are for a mini-batch of 4 images.
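As a quick check on the totals in Table 1, the short Python sketch below recomputes the speedup and converts the mini-batch times into training throughput. The constants are copied from the table and its caption; the helper name `images_per_second` is hypothetical.

```python
BATCH_SIZE = 4           # images per mini-batch (Table 1 caption)
NEON_TOTAL_MS = 265.0    # forward + backward time per mini-batch, neon
CAFFE_TOTAL_MS = 1455.0  # forward + backward time per mini-batch, Caffe


def images_per_second(batch_time_ms, batch_size=BATCH_SIZE):
    """Throughput implied by the time taken for one mini-batch."""
    return batch_size / (batch_time_ms / 1000.0)


speedup = CAFFE_TOTAL_MS / NEON_TOTAL_MS
neon_rate = images_per_second(NEON_TOTAL_MS)
caffe_rate = images_per_second(CAFFE_TOTAL_MS)

print(f"speedup: {speedup:.1f}x")
print(f"neon:  {neon_rate:.1f} images/s")
print(f"Caffe: {caffe_rate:.1f} images/s")
```

The ratio rounds to the 5.5x reported in the table, which corresponds to roughly 15 images per second for neon against fewer than 3 for Caffe on a single GPU.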
| Number of GPUs | Iteration Time per Image | Speedup |
|---|---|---|
| 2 | 35 ms | 1.9 x |
| 4 | 18 ms | 3.7 x |
Table 2: Benchmark with multiple GPUs. The times are for a single forward and backward pass per image. The number of images processed concurrently is 4 per GPU (i.e. 16 images for the 4 GPU case).
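The speedups in Table 2 can be related back to the single-GPU result with a little arithmetic. The sketch below assumes that the single-GPU baseline is the 265 ms total from Table 1 divided by its 4-image mini-batch; that baseline is an inference from the two tables, not a number stated in the text, and `scaling_efficiency` is a hypothetical helper name.

```python
# Assumed single-GPU baseline: Table 1 total time divided by its mini-batch size.
SINGLE_GPU_MS_PER_IMAGE = 265.0 / 4

# Per-image iteration times from Table 2, keyed by number of GPUs.
MULTI_GPU_MS_PER_IMAGE = {2: 35.0, 4: 18.0}


def scaling_efficiency(n_gpus, ms_per_image):
    """Return (speedup over one GPU, fraction of ideal linear scaling)."""
    speedup = SINGLE_GPU_MS_PER_IMAGE / ms_per_image
    return speedup, speedup / n_gpus


results = {n: scaling_efficiency(n, t) for n, t in MULTI_GPU_MS_PER_IMAGE.items()}
for n_gpus, (speedup, efficiency) in sorted(results.items()):
    print(f"{n_gpus} GPUs: {speedup:.1f}x speedup, {efficiency:.0%} of linear")
```

Under that assumption the computed speedups round to the 1.9x and 3.7x listed in Table 2, which is over 90% of ideal linear scaling in both cases.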
For autonomous driving applications this model will need to be run on the automobile itself, probably on an embedded platform like the Jetson TX-1. For comparison, we ran neon benchmarks of the forward pass of the SegNet model on the TX-1 platform. Due to memory limitations, the input image was scaled down to 128×256. The inference computation time on the TX-1 was 97 ms for a single image.
Another image localization model, the Fast RCNN model [8], has also been implemented in neon and is one of the example models included in the neon repository. This model fits bounding boxes around objects detected in an input image and generates a categorical tag for each object. Like the SegNet model, the Fast RCNN model also benefits from the computational optimizations of the neon platform. Training in neon is almost twice as fast as in Caffe. Figure 4 shows an example of the output of the Fast RCNN model when detecting cars in a road scene.
Figure 4: Example of the Fast RCNN model trained to detect cars in road scenes. The red boxes are the bounding boxes encompassing each car detected in the image. The images were taken from the KITTI cars object detection dataset [9].
Please contact Nervana Systems at email@example.com to get started with automotive use-cases using semantic segmentation or Fast-RCNN models using the neon framework.
[1] “SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation,” Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. http://arxiv.org/abs/1511.00561
[2] University of Cambridge Machine Intelligence Laboratory website: http://mi.eng.cam.ac.uk/projects/segnet/
[3] “Segmentation and Recognition Using Structure from Motion Point Clouds,” Gabriel J. Brostow, Jamie Shotton, Julien Fauqueur, and Roberto Cipolla. ECCV (1), 44-57, 2008.
[4] “Semantic Object Classes in Video: A High-Definition Ground Truth Database,” Gabriel J. Brostow, Julien Fauqueur, and Roberto Cipolla. Pattern Recognition Letters, 2008.
[5] “Very Deep Convolutional Networks for Large-Scale Image Recognition,” K. Simonyan and A. Zisserman. arXiv:1409.1556, 2014.
[6] SegNet Caffe fork repo: https://github.com/alexgkendall/caffe-segnet
[7] SegNet model repo: https://github.com/alexgkendall/SegNet-Tutorial
[8] “Fast R-CNN,” Ross Girshick. International Conference on Computer Vision (ICCV), 2015.
[9] “Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite,” Andreas Geiger, Philip Lenz, and Raquel Urtasun. Conference on Computer Vision and Pattern Recognition (CVPR), 2012.