We are pleased to announce the open sourcing of nGraph, a framework-neutral Deep Neural Network (DNN) model compiler that can target a variety of devices. With nGraph, data scientists can focus on data science rather than worrying about how to adapt their DNN models to train and run efficiently on different devices. Continue reading below for highlights of our engineering challenges and design decisions, and see GitHub, our documentation, and our SysML paper for additional details.
We currently support TensorFlow*, MXNet*, and neon directly through nGraph. CNTK*, PyTorch*, and Caffe2* are supported indirectly through ONNX. Users can run these frameworks on several devices: Intel Architecture, GPU, and Intel Nervana Neural Network Processor (NNP). Support for future devices/frameworks in our roadmap is faded.
When Deep Learning (DL) frameworks first emerged as the vehicle for running training and inference models, they were designed around kernels optimized for a particular device. As a result, many device details were being exposed in the model definitions, complicating the adaptability and portability of DL models to other, or more advanced, devices.
The traditional approach means that an algorithm developer faces tediousness in taking their model to an upgraded device. Enabling a model to run on a different framework is also problematic because the developer must separate the essence of the model from the performance adjustments made for the device, translate to similar ops in the new framework, and finally make the necessary changes for the preferred device configuration on the new framework.
We designed the nGraph library to substantially reduce these kinds of engineering complexities. While optimized kernels for DL primitives are provided through the project and via libraries like Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN), there are also several compiler-inspired ways in which performance can be further optimized.
Install the nGraph library and write or compile a framework with the library in order to run training and inference models. Specify nGraph as the framework backend you want to use from the command line on any supported system. Our Intermediate Representation (IR) layer handles all the device abstraction details and lets developers focus on their data science, algorithms and models, rather than on machine code.
At a more granular level of detail:
Intel has considerable experience with MKL-DNN optimization of frameworks for Intel Architecture. We make use of previous work with the added benefit that optimizations developed for a device benefits all frameworks through nGraph. Framework developers continue to perform their own optimization work. For example, the performance for TensorFlow 1.7+/XLA on Intel Architecture is much better than for TensorFlow 1.3/XLA on Intel Architectures, and this should improve further as more work is put into XLA for Intel Architectures.
We present below initial performance data from multiple frameworks that reflects the optimizations done so far on the IA transformer. On the latest Intel Xeon Platinum 8180 processor, in conjunction with MKLDNN v0.13, we are able to meet or greatly exceed the performance of previously optimized frameworks such as MXNet-MKLDNN-CPU (MXNet optimized with MKLDNN) and neon-MKLML-CPU (neon optimized with MKLML). We also deliver better performance than the TensorFlow XLA compiler (TF-XLA-CPU), but there are significantly more optimizations that can be done with XLA both on the default CPU implementation and on nGraph.
As of today, nGraph supports six DL frameworks and three compute devices.
We will continue to add support for additional devices and more graph optimizations such as device-specific op fusions, better work schedulers and faster custom op kernels.
Visit our GitHub repository to learn how to contribute to nGraph.
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
On-line CPU(s) list: 0-111
Thread(s) per core: 2
Core(s) per socket: 28
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model name: Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz
CPU MHz: 1000.585
CPU max MHz: 3800.0000
CPU min MHz: 1000.0000
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 39424K
NUMA node0 CPU(s): 0-27,56-83
NUMA node1 CPU(s): 28-55,84-111
OS: Ubuntu 16.04.3 LTS
Environment Variables: export OMP_NUM_THREADS=56
MXNet (nGraph and Direct Optimization): export MXNET_NGRAPH_GRAPH_OPTIMIZATION=1
python example/image-classification/train_cifar10.py –num-layers=8
python example/image-classification/train_cifar10.py –num-layers=20
python example/image-classification/train_imagenet.py –num-layers=50 –data-train=/path/to/train.rec –data-val=/path/to/val.rec
TensorFlow/XLA: Baseline- Tensorflow 1.3 built with JIT compilation
*Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to www.intel.com/benchmarks.
The benchmark results reported above may need to be revised as additional testing is conducted. The results depend on the specific platform configurations and workloads utilized in the testing, and may not be applicable to any particular user’s components, computer system or workloads. The results are not necessarily representative of other benchmarks and other benchmark results may show greater or lesser impact from mitigations.