Uber’s* Horovod is a great way to train distributed deep learning models. Its ring-allreduce network architecture scales well to several hundred nodes and it only requires a few simple changes in your Keras* or TensorFlow* code to get going.
The only bottleneck we’ve found to run jobs in Horovod is with the initial environment setup. As with many deep learning workloads, it takes a bit of effort to get TensorFlow, Horovod, MPI, and the hardware interfaces to play well together. This is where Docker* and Kubernetes really shine.
At Intel, we have created an open-sourced project called Machine Learning Container Templates (MLT) which packages deep learning workloads into easy to deploy Docker containers for quick scaling on multiple devices via Kubernetes. We’ve handled all of the complex setup for users in advance to provide a great out of the box experience when developers are ready to deploy their custom deep learning model for distributed training.
In a past blog, we’ve shown how to train a brain tumor detection deep neural network using the TensorFlow parameter server distributed platform. We’ll take that same U-Net topology and the BraTS data from the previous blog and show how to package that code into Docker container with Horovod for quick deployment on Kubernetes.
Using MLT Horovod Template
This blog assumes you already have a Kubernetes cluster up and running and MLT installed.
The MLT Horovod template leverages the OpenMPI component from Kubeflow to deploy distributed jobs for model training. We’ve incorporated IntelⓇ Math Kernel Library (IntelⓇ MKL) optimized TensorFlow in the CPU Dockerfile to significantly improve the training speed compared to vanilla TensorFlow7 when using IntelⓇ processors.
Data Transfer to Kubernetes cluster
In this blog, we’ve used Volume Controller for Kubernetes (VCK) to copy data (.npy files) from Google* Cloud storage to the Kubernetes nodes. Please refer to this section to use VCK. Please take note of
node affinity (on which nodes data being copied) and
host path (path on the node), you will need it in a moment.
Let’s jump into adapting the Horovod template to locate tumors in brain scan images.
Initialize the Horovod template using
mlt init. During
init you should tell which docker registry to use. MLT supports Docker Hub and Google Container Registry (Ex:
gcr.io/<your_project_id> or docker.io/repo_name)
unet-horovod directory now contains Dockerfiles, the training file and a custom deployment script to deploy the job on to Kubernetes cluster.
A complete unet example based on the above template is located here. All file details from example are explained as follows:
template_paramaterssection which is exposed as environment variables to deployment script deploy.sh.
node-affinity(because those are high memory nodes). Or you can completely ignore this flag, make sure you remove corresponding flag usage in the file.
hostPathhere, it will be mounted as volumes in containers. Results written to this path are persisted.
ksonnet version 0.11.0, for volume mount to work as expected.
Once you have made the required modifications, use the
mlt build command, to build and tag docker images. The initial build usually takes longer to complete because we have to build the IntelⓇ Optimization for TensorFlow using the OpenMPI base image. However, successive builds complete within a few seconds. You can make this faster for initial build by hosting the combined binary image (OpenMPI and the IntelⓇ Optimization for TensorFlow) on docker hub. You can refer to this Dockerfile.
To see what’s happening in the background use verbose flag as
mlt build -v
Now it’s time to deploy, we use
mlt deploy. This command first pushes images to the provided registry and uses images from there to deploy on to Kubernetes.
To check the status of above job use
mlt logs to see training progress and
mlt events for job events in Kubernetes.
If you modify
mlt.json Ex: (
num_nodes: “4”, gpu:”2” etc.) pushing images to the registry is not needed, so you can just try the following:
mlt undeploy, which frees up resources
mlt deploy --no-push(
--no-pushflag avoids pushing image again and it uses the same image) as there are no code changes.
If there are code changes, then rebuilding the images using
mlt build command is required followed by
mlt deploy command to deploy the new job.
Model checkpoints are stored at
OUTPUT_PATH (which is NFS mounted location) specified in mlt.json and settings.py file. You can use
tensorboard to view the training results.
tensorboard --logdir <OUTPUT_PATH>
We used 4-node Kubernetes cluster, each node spec looks as below:
OS Version: CentOS Linux release 7.5.1804 (Core)
Kernel: 3.10.0-862.3.2.el7.x86_64 #1 SMP Mon May 21 23:36:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
CPU: 2 X Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
Memory: 256GB DDR4 1866 MHz
Disk: 400GB SSD on Sata0
Software: TensorFlow 1.9, MKL-DNN 0.14
Vendor: Intel Corporation
Release Date: 07/10/2018
Microcode Update Driver: v2.01 <email@example.com>, Peter Oruba
Microcode Patch: microcode_ctl-2.1-29.10.el7_5.x86_64
The MLT template includes methods to output the final model, training checkpoints, and TensorBoard logs to either a local folder or a cloud storage location. By taking advantage of both Kubernetes and TensorFlow’s MonitoredTrainingSession this deployment will have automatic failover built in: If one or more nodes fails during training, Kubernetes will attempt to restart the node and the TensorFlow session will restart training from the last good checkpoint! This is a must-have feature for jobs that may run for hours or days at a time. You can monitor your job as it trains (or anytime afterward) by viewing the TensorBoard. As you can see from the TensorBoard figures, our model does a pretty good job at finding brain tumors!
MLT’s built-in support for Horovod and TensorFlow gives you many advantages:
Refer to https://github.com/01org/mkl-dnn for more details on Intel® MKL-DNN optimized primitives
Notices and Disclaimers
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.
Performance results are based on testing as of 8/17/2018 and may not reflect all publicly available security updates. See configuration disclosure for details. No product can be absolutely secure. Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Intel, Xeon, and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others.
© Intel Corporation.