We’ve recently applied the U-Net architecture to segment brain tumors from raw MRI scans (Figure 1). With relatively little data we are able to train a U-Net model to accurately predict where tumors exist. The Dice coefficient (the standard metric for the BraTS dataset used in the study) for our model is about 0.82-0.88. Menze et al. reported that expert neuroradiologists manually segmented these tumors with a cross-rater Dice score of 0.75-0.85, meaning our model’s predictions are on par with manual segmentations made by expert physicians.
Figure 1: MRI of the brain with highlighting of a large tumor (yellow). Courtesy of the “Multimodal Brain Tumor Segmentation Challenge”.
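For readers unfamiliar with the Dice coefficient, here is a minimal NumPy sketch of how it can be computed for two binary masks. The function name, the toy masks, and the small smoothing term (added to avoid dividing by zero on empty masks) are assumptions for illustration, not necessarily what was used in the study.

```python
import numpy as np

def dice_coefficient(pred_mask, true_mask, smooth=1e-7):
    """Dice = 2 * |A intersect B| / (|A| + |B|) for two binary masks."""
    pred = np.asarray(pred_mask).astype(bool)
    true = np.asarray(true_mask).astype(bool)
    intersection = np.logical_and(pred, true).sum()
    # `smooth` is an assumed guard against 0/0 when both masks are empty.
    return (2.0 * intersection + smooth) / (pred.sum() + true.sum() + smooth)

# Made-up 4x4 masks: the "tumor" covers 4 pixels, the prediction finds 3 of them.
true_mask = np.array([[0, 0, 0, 0],
                      [0, 1, 1, 0],
                      [0, 1, 1, 0],
                      [0, 0, 0, 0]])
pred_mask = np.array([[0, 0, 0, 0],
                      [0, 1, 1, 0],
                      [0, 1, 0, 0],
                      [0, 0, 0, 0]])

score = dice_coefficient(pred_mask, true_mask)
print(round(score, 3))  # 2*3 / (3 + 4) = 0.857
```

A score of 1.0 means the two masks overlap perfectly, and a score near 0 means they barely overlap.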
Since its introduction two years ago, the U-Net architecture has been used to create deep learning models for segmenting nerves in ultrasound images, lungs in CT scans, and even interference in radio telescopes.
U-Net is designed like an autoencoder. It has an encoding (“contracting”) path paired with a decoding (“expanding”) path, which gives it the “U” shape. However, in contrast to an autoencoder, which reconstructs its input, and to a classification network, which assigns a single label to the image as a whole, U-Net predicts a pixelwise segmentation map of the input image. For each pixel in the original image, it asks the question: “To which class does this pixel belong?” This flexibility allows U-Net to predict different parts of the tumor simultaneously.
U-Net passes the feature maps from each level of the contracting path over to the analogous level in the expanding path. These skip connections are similar to the residual connections in a ResNet-type model, and they allow the classifier to consider features at various scales and complexities when making its decision.
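One detail worth making concrete: in the original U-Net the passed feature maps are concatenated with the expanding-path features along the channel axis, whereas a ResNet adds them element-wise. The NumPy sketch below only contrasts the two operations; the array names and shapes are hypothetical.

```python
import numpy as np

# Hypothetical feature maps in (height, width, channels) layout.
encoder_features = np.random.rand(16, 16, 32)  # saved from the contracting path
decoder_features = np.random.rand(16, 16, 32)  # upsampled in the expanding path

# U-Net style skip connection: stack the two maps along the channel axis.
unet_skip = np.concatenate([encoder_features, decoder_features], axis=-1)

# ResNet style residual connection: element-wise sum, channel count unchanged.
resnet_skip = encoder_features + decoder_features

print(unet_skip.shape)    # (16, 16, 64)
print(resnet_skip.shape)  # (16, 16, 32)
```

Concatenation doubles the channel count, so the convolution that follows sees both the coarse, upsampled features and the finer features from the contracting path.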
This all sounds wonderful until you finally start to code the topology with your favorite deep learning framework. Ultimately, “up-convolution” (or “up-sampling” or “de-convolution”) just means that we take a low resolution image and transform it into a higher resolution image. This can be performed by at least two types of operations: nearest neighbors resampling, which is a fixed interpolation, and transposed convolution, which enlarges the image using convolutional filters with learnable weights.
The nearest neighbors resampling algorithm is an interpolation method which, like convolution, performs a mathematical operation on each pixel (and its neighbors) within the image to enlarge the image size. In the simplest case, we make three copies of each pixel and place them nearby (Figure 2). More complex cases involve weighted combinations of pixels to generate gradient colors between neighbors (e.g. bilinear, cubic, and Lanczos interpolation).
Figure 2: Nearest Neighbors Algorithm. The black areas are filled with copies of the center pixel. This gives the “jagged lines” or “pixelated” effect. More complex algorithms use weighted combinations of the surrounding pixels to fill the black areas and provide the enlarged image with a smoother and more realistic appearance.
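The simplest nearest neighbors case can be sketched in a few lines of NumPy. This is an illustrative sketch for a single-channel image and an integer scale factor; the function name is hypothetical.

```python
import numpy as np

def nearest_neighbor_upsample(image, factor=2):
    """Enlarge a 2D image by repeating every pixel `factor` times per axis."""
    image = np.asarray(image)
    rows_repeated = np.repeat(image, factor, axis=0)  # duplicate each row
    return np.repeat(rows_repeated, factor, axis=1)   # duplicate each column

image = np.array([[1, 2],
                  [3, 4]])
enlarged = nearest_neighbor_upsample(image)
print(enlarged)
# [[1 1 2 2]
#  [1 1 2 2]
#  [3 3 4 4]
#  [3 3 4 4]]
```

With a factor of 2, each original pixel ends up next to three copies of itself, which is what produces the blocky, “pixelated” look.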
Both techniques output an enlarged version of the original image (Figure 3). We wondered whether there were any empirical differences in the accuracy of the U-Net model given the choice of upsampling operation. Note that we were not specifically questioning which algorithm produced the smoothest or most efficient enlarged image; instead we wanted to learn how these algorithms affect the performance of the trained U-Net model on the test dataset. Given that the transposed convolution method contains learnable parameters, while the nearest neighbors approach is a fixed operation with no learnable parameters, it is conceivable that using transposed convolution to perform the “up-sampling” could yield more accurate predictions on the test dataset.
Figure 3: Both methods can produce similar increases in image size. In this case, we used OpenCV’s resize (bilinear interpolation) method to double the number of pixels in both height and width of the MNIST handwritten digits dataset. On the left is the same effect performed with a trained autoencoder using transposed convolution with a kernel size of (2,2) and a stride of (2,2). The autoencoder neural network has learned bilinear interpolation. (Source: G.A. Reina 2017 using the original LeCun/Cortes/Burges MNIST Dataset)
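To make the transposed convolution with a (2,2) kernel and a (2,2) stride more concrete, here is a minimal single-channel NumPy sketch without a bias term. The function name and the kernel values are made up for illustration; in a trained network the kernel weights would be learned.

```python
import numpy as np

def transposed_conv_2x2_stride2(image, kernel):
    """Single-channel transposed convolution: 2x2 kernel, stride 2, no bias."""
    image = np.asarray(image, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    height, width = image.shape
    output = np.zeros((2 * height, 2 * width))
    for i in range(height):
        for j in range(width):
            # Each input pixel writes one scaled copy of the kernel into its
            # own 2x2 output block; with stride 2 the blocks never overlap.
            output[2 * i:2 * i + 2, 2 * j:2 * j + 2] += image[i, j] * kernel
    return output

image = np.array([[1.0, 2.0],
                  [3.0, 4.0]])

# An all-ones kernel reproduces nearest neighbors upsampling exactly.
copied = transposed_conv_2x2_stride2(image, np.ones((2, 2)))

# Arbitrary values standing in for learned weights give a different resizing.
example_kernel = np.array([[1.0, 0.5],
                           [0.5, 0.25]])
weighted = transposed_conv_2x2_stride2(image, example_kernel)

print(copied)
print(weighted)
```

Under these assumptions, nearest neighbors upsampling is one particular setting of the transposed convolution weights, which is why the learnable version can in principle represent other resizing functions as well.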
We implemented U-Net using Intel Optimizations for TensorFlow* 1.4.0 with tf.keras layers. The MRI images and segmentation maps from the BraTS dataset were divided into 24,800 training and 9,600 test samples. Our model used the Adam optimizer for stochastic gradient descent with a learning rate of 0.0005 and a global batch size of 512. The distributed training was performed for 10 epochs on a 4-node Intel® Xeon Phi™ 7250 cluster with one Intel® Xeon® Processor E5-2697A v4 as the parameter server. The “whole tumor” segmentation mask was used. Each worker node received one-quarter of the batch to process and the parameter server updated the gradients synchronously.
There were some minor differences between our implementation and the original topology.
The major difference in our approach was to create two versions of the U-Net model. The “upsampling” version used the Keras function UpSampling2D to perform the “up-conv 2×2”. The “transposed” version used the Keras function Conv2DTranspose to perform the “up-conv 2×2”.
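The two variants of the “up-conv 2×2” step can be sketched with tf.keras as follows. This is not our full U-Net code, only an isolated sketch of the one step that differs; the helper name, the input shape, and the filter count are hypothetical.

```python
import tensorflow as tf

def build_up_block(method, input_shape=(8, 8, 16), filters=8):
    """Return a tiny model that doubles height and width using `method`."""
    inputs = tf.keras.Input(shape=input_shape)
    if method == "upsampling":
        # Fixed nearest neighbors resize: no trainable weights.
        outputs = tf.keras.layers.UpSampling2D(size=(2, 2))(inputs)
    elif method == "transposed":
        # Learned resize: a 2x2 kernel per input/output channel pair plus biases.
        outputs = tf.keras.layers.Conv2DTranspose(
            filters=filters, kernel_size=(2, 2), strides=(2, 2), padding="same"
        )(inputs)
    else:
        raise ValueError("method must be 'upsampling' or 'transposed'")
    return tf.keras.Model(inputs, outputs)

upsampling_block = build_up_block("upsampling")
transposed_block = build_up_block("transposed")

print(upsampling_block.output_shape, upsampling_block.count_params())
print(transposed_block.output_shape, transposed_block.count_params())
```

With these assumed sizes, both blocks turn an 8×8 feature map into a 16×16 one. The UpSampling2D block keeps all 16 channels and has 0 parameters, while the Conv2DTranspose block maps to 8 channels and has 2·2·16·8 + 8 = 520 trainable parameters.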
Figure 4 shows the training curves for the different U-Net model configurations. Both methods appear to achieve the same state-of-the-art Dice performance. In TensorFlow 1.3.0, the U-Net model using UpSampling2D converged slightly faster than the model that used Conv2DTranspose. This was presumably due to the additional weights needed to train the transposed convolutional filters. However, using Intel Optimized TensorFlow 1.4.0, the model using Conv2DTranspose executed over 20% faster than the model using UpSampling2D because of the particular MKL-DNN optimizations found in the Intel® optimized TensorFlow™ distribution.
In Figure 5 we show the difference between the Dice scores of the two models for each test image. A two-sample Kolmogorov-Smirnov test showed no significant difference between the two distributions. A one-sample t-test does show a difference between the predictions (toward upsampling), but the difference is very small (μ = -0.006, σ = 0.11).
Figure 4: Training and testing curves for U-Net on BraTS. Both methods appear to achieve the same Dice performance. On CPU, using the Intel® optimized version of TensorFlow™ 1.4.0, we found a 20% improvement in execution time for the Conv2DTranspose model.
Figure 5: Comparison of the Dice coefficients for the two models. A two-sample Kolmogorov-Smirnov test shows no significant difference between the two distributions. A one-sample t-test does show a difference between the predictions (toward upsampling), but the difference is very small (μ = -0.006, σ = 0.11).
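A short back-of-the-envelope calculation shows how a t-test can flag a difference this small. The sketch below assumes that the reported mean and standard deviation describe the per-image Dice differences and that all 9,600 test samples contributed one difference each; both are assumptions made only for illustration.

```python
import math

mean_difference = -0.006   # reported mean of the Dice differences
std_difference = 0.11      # reported standard deviation of the differences
num_test_samples = 9600    # assumption: one difference per test sample

# One-sample t statistic against a hypothesized mean difference of zero.
standard_error = std_difference / math.sqrt(num_test_samples)
t_statistic = mean_difference / standard_error

# Standardized effect size (Cohen's d for a one-sample test).
effect_size = mean_difference / std_difference

print(f"standard error = {standard_error:.5f}")
print(f"t statistic    = {t_statistic:.2f}")
print(f"effect size    = {effect_size:.3f}")
```

Under these assumptions the t statistic is roughly -5.3, which is large enough to register as statistically significant, while the effect size of about -0.05 standard deviations is negligible. With thousands of samples the standard error becomes so small that even tiny mean differences are detected.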
In Figure 6, we show the tumor segmentation masks as predicted by the different U-Net configurations. Subjectively, there is very good agreement between the upsampling and transposed convolution models, and this agreement was confirmed by the Dice coefficients achieved on the test set. UpSampling2D yielded an average test Dice score of 0.8718, while Conv2DTranspose yielded 0.8707. We found that over 95% of the predictions differed by less than 0.1 Dice points and over 98% by less than 0.2 points.
Figure 6: Comparison of model predictions. The predicted segmentation is subjectively similar regardless of whether the UpSampling2D or Conv2DTranspose method is used in the topology.
In Figures 7 and 8 we show a few of the cases where one method outperformed the other by more than 0.5 points. These cases accounted for less than 2% of the predictions. As can be seen in the figures, there were several cases where one model made a good prediction and the other failed to make a prediction at all. We have been unable to explain what factors caused this discrepancy in the predictions, but believe it may be a useful place to explore further optimizations to the model.
Figure 7: Example UpSampling2D wins. Test cases where the upsampling model’s prediction was far better than the transposed convolution model. This occurred with 0.07% of the predictions.
Figure 8: Example Conv2DTranspose wins. Test cases where the transposed convolution model’s prediction was far better than the upsampling model. This occurred with 0.05% of the predictions.
We believe that both UpSampling2D and Conv2DTranspose are equally reliable methods for generating accurate predictions in the U-Net topology. Theoretically, transposed convolution can learn more complex (and even non-linear) image resizing functions, but at the expense of more parameters and a slightly longer time to train. Nevertheless, if kernel-level optimizations (such as Intel® MKL-DNN) are used, these differences in time to train can be significantly reduced.
Notices and Disclaimers:
Benchmark results were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as “Spectre” and “Meltdown”. Implementation of these updates may make these results inapplicable to your device or system.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to www.intel.com/benchmarks.
HARDWARE USED: Worker nodes: Four Intel® Xeon Phi™ 7250 processors @ 1.40 GHz, 68 cores, CentOS Linux 7 x86_64. Parameter server: One Intel® Xeon® Processor E5-2697A v4, 16 cores, CentOS Linux 7 x86_64. SOFTWARE USED: Intel Optimized TensorFlow* wheel. Date of testing: December 29, 2017.
© 2018 Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Xeon Phi and Intel Nervana are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as property of others.