OpenVINO™ Model Server Boosts AI Inference Operations

When running inference, AI practitioners need an efficient way to integrate components: one that delivers strong performance at scale while providing a simple interface between the application and the execution engine.

Thus far, TensorFlow* Serving has been the serving system of choice for several reasons:

  • Efficient serialization and deserialization
  • Fast gRPC interface
  • Popularity of the TensorFlow framework
  • Simple API definition
  • Version management

There are, however, a few challenges to the successful adoption of TensorFlow Serving:

For successful adoption, an inference platform should offer acceptable latency even for demanding workloads, easy integration between training and deployment systems, scalability, and a standard client interface. A new model server inference platform developed by Intel, the OpenVINO™ Model Server, offers the same gRPC interface as the TensorFlow Serving API but employs inference engine libraries from the Intel® Distribution of OpenVINO™ toolkit. Based on convolutional neural networks (CNN), this toolkit extends workloads across Intel® hardware (including accelerators) and maximizes performance across computer vision accelerators: CPUs, integrated GPUs, Intel Movidius VPUs, and Intel FPGAs.

Performance Results

Intel’s Poland-based AI inference platform team compared results captured from a gRPC client run against two Docker containers: one using a TensorFlow* Serving image from Docker Hub (tensorflow/serving:1.10.1) and one built from an OpenVINO Model Server image with Intel Distribution of OpenVINO toolkit version 2018.3. We applied standard models from the TensorFlow-Slim image classification models library, specifically resnet_v1_50, resnet_v2_50, resnet_v1_152 and resnet_v2_152.

Using identical client application code and hardware configuration in the Docker containers, OpenVINO Model Server delivered up to 5x the performance of TensorFlow Serving, depending on batch size. With this improved performance, the inference interface can be made easily accessible over a network, opening new opportunities for supported applications and reducing cost, latency and power consumption.

Figure 1: Performance Results with Batch Size 1 (Higher is Better)1

Figure 2: Performance Results with Batch Size 16 (Higher is Better)1

Figure 3: Performance Results for Model Resnet v1 50 Depending on Batch Size (Higher is Better)1

How OpenVINO Model Server Works

Models trained in TensorFlow, MXNet*, Caffe*, Kaldi*, or in ONNX format are optimized using the Model Optimizer included in the OpenVINO toolkit. This process is done just once. The output of the Model Optimizer is two files with .xml and .bin extensions. The .xml file represents the optimized graph, and the .bin file contains the weights. These files are loaded into the Inference Engine, which provides a lightweight API for integration into the actual runtime application. OpenVINO Model Server allows these models to be served through the same gRPC interface as TensorFlow Serving.

An automated pipeline can be easily implemented: it first trains the models in the TensorFlow framework, then exports the results to a protocol buffer file, and finally converts them to the Intermediate Representation (IR) format. As long as the model includes only layer types supported by OpenVINO, no extra steps are needed. For the few unsupported layers, the transformation can still be completed by installing appropriate extensions for the missing operations. Refer to the Model Optimizer documentation for more details.

The same conversion can be completed for Caffe and MXNet models, and the recent OpenVINO release 2018.3 also supports Kaldi and ONNX models. As a result, the OpenVINO Model Server can become the inference execution component for all these deep learning frameworks.

OpenVINO Model Server

The OpenVINO Model Server architecture stack is shown in Figure 4. It is implemented as a Python* service that uses gRPC libraries to expose the same API as TensorFlow Serving. Because it uses identical proto files, the API implementation is fully compatible with existing clients. Therefore, no code changes are needed on the client side to connect to either serving component.

The key difference is the inference execution implementation, which relies on the Inference Engine API. With an optimized model format and Intel-optimized libraries for inference execution on CPUs, FPGAs, and VPUs, you can take advantage of significantly better performance.

Figure 4: OpenVINO Model Server Architecture Stack

OpenVINO Model Server is well suited for Docker containers. This allows it to be employed in edge, data center and cloud architectures such as AWS Sagemaker. The image building process is straightforward and much faster than for TensorFlow Serving*. Packaging the server as a Docker image simplifies hosting the inference service on any operating system and platform. By exposing the service via a gRPC interface, the execution engine becomes available to applications written in most languages (C#, C++, Java, Golang, JavaScript, Python, etc.), which makes integration seamless for developers.

How To Use OpenVINO Model Server

IR Model and Using Optimizer

The first step in enabling OpenVINO Model Server is to generate the IR model format from the TensorFlow saved model representation using the Model Optimizer. You can invoke the Model Optimizer with arguments similar to the example below:

--saved_model_dir /tf_models/resnet_v1_50 --output_dir /ir_models/resnet_v1_50/ --model_name resnet_v1_50

Model Optimizer arguments:
Common parameters:
- Path to the Input Model: None
- Path for generated IR: /ir_models/resnet_v1_50/
- IR output name: resnet_v1_50
- Log level: ERROR
- Batch: Not specified, inherited from the model
- Input layers: Not specified, inherited from the model
- Output layers: Not specified, inherited from the model
- Input shapes: Not specified, inherited from the model
- Mean values: Not specified
- Scale values: Not specified
- Scale factor: Not specified
- Precision of IR: FP32
- Enable fusing: True
- Enable grouped convolutions fusing: True
- Move mean values to preprocess section: False
- Reverse input channels: False
TensorFlow specific parameters:
- Input model in text protobuf format: False
- Offload unsupported operations: False
- Path to model dump for TensorBoard: None
- Update the configuration file with input/output node names:
- Operations to offload: None
- Patterns to offload: None
- Use the config file: None
Model Optimizer version:
[ SUCCESS ] Generated IR model.
[ SUCCESS ] XML file: /ir_models/resnet_v1_50/resnet_v1_50.xml
[ SUCCESS ] BIN file: /ir_models/resnet_v1_50/resnet_v1_50.bin
[ SUCCESS ] Total execution time: 7.75 seconds.
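
In an automated pipeline, the conversion flags shown above could be assembled programmatically. The Python sketch below only builds the argument list and derives the .xml and .bin paths reported in the log; it does not run the Model Optimizer. The helper names build_mo_args and expected_ir_files are hypothetical and not part of the OpenVINO toolkit, and the Model Optimizer executable is deliberately not named because its location depends on the installation.

```python
import shlex
from pathlib import PurePosixPath


def build_mo_args(saved_model_dir, output_dir, model_name):
    """Return the Model Optimizer flags used in the example above (hypothetical helper)."""
    return [
        "--saved_model_dir", str(saved_model_dir),
        "--output_dir", str(output_dir),
        "--model_name", model_name,
    ]


def expected_ir_files(output_dir, model_name):
    """Return the .xml (graph) and .bin (weights) paths the conversion should produce."""
    base = PurePosixPath(output_dir)
    return base / f"{model_name}.xml", base / f"{model_name}.bin"


args = build_mo_args("/tf_models/resnet_v1_50", "/ir_models/resnet_v1_50/", "resnet_v1_50")
print(shlex.join(args))

xml_path, bin_path = expected_ir_files("/ir_models/resnet_v1_50/", "resnet_v1_50")
print(xml_path)
print(bin_path)
```

A pipeline step could prepend the path of the installed Model Optimizer script to this list, pass it to subprocess.run, and then check that both expected files exist before publishing the model.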

Folder Structure with Models

Before the models can be used in OpenVINO Model Server, they should be placed in a folder structure similar to the one shown in Figure 5.

Figure 5. Example of Folder Structure with Serving Models

Each model should have a separate folder where every version is stored in subfolders with numerical names. This way, OpenVINO Model Server can handle multiple models and manage their versions in a similar manner to TensorFlow Serving.
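
The layout described above can be illustrated with a short Python sketch that creates one folder per model, with numerically named version subfolders, and then lists the versions it finds. The model names, the version numbers, and the file names inside each version folder are assumptions made for illustration; the files are empty placeholders rather than real IR files.

```python
import tempfile
from pathlib import Path


def create_model_repository(root, models):
    """Create <root>/<model_name>/<version>/ folders with placeholder IR files.

    `models` maps a model name to a list of integer versions.
    The .xml/.bin file names are an assumption made for illustration.
    """
    for name, versions in models.items():
        for version in versions:
            version_dir = Path(root) / name / str(version)
            version_dir.mkdir(parents=True, exist_ok=True)
            (version_dir / f"{name}.xml").touch()
            (version_dir / f"{name}.bin").touch()


def list_versions(root, model_name):
    """Return the numerically named version subfolders of a model, sorted ascending."""
    model_dir = Path(root) / model_name
    return sorted(
        int(p.name) for p in model_dir.iterdir() if p.is_dir() and p.name.isdecimal()
    )


with tempfile.TemporaryDirectory() as root:
    create_model_repository(root, {"model1": [1, 2], "model2": [1]})
    versions = {name: list_versions(root, name) for name in ("model1", "model2")}
    print(versions)  # {'model1': [1, 2], 'model2': [1]}
```

A check like list_versions can be run before starting the server to confirm that every model folder contains at least one numeric version.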

Deployment Process

The deployment process is limited to two steps:

  1. Building the Docker image, using the following command:

docker build -t openvino-model-server:latest .

  2. Starting the Docker container, using the following command:

docker run --rm -d -v /models/:/opt/ml:ro -p 9001:9001 openvino-model-server:latest \
/ie-serving-py/ ie_serving model --model_path /opt/ml/model1 --model_name my_model --port 9001

After those two steps are finished, OpenVINO Model Server runs as a Docker container in detached mode and listens for gRPC inference requests on port 9001. More details about usage and configuration are included in the GitHub repository documentation.
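
Because the container starts in detached mode, it can be useful to confirm that the published port accepts connections before sending inference requests. The sketch below is a plain TCP reachability check written with the Python standard library; it does not speak gRPC and does not verify that a model is loaded. The function name port_is_open is hypothetical, and the host localhost is an assumption matching a container started on the same machine.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# 9001 is the port published in the docker run command above.
print(port_is_open("localhost", 9001))
```

A successful connection only shows that something is listening on the port; an actual inference request is still needed to confirm that the model is being served.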

Use Case Examples

Below are examples of how OpenVINO Model Server can be adopted. More details can be found in the GitHub repository.

1. Standalone Inference Service

The simplest usage example for the OpenVINO Model Server is with a Docker container running on a single machine.

The prerequisites for the setup are:

  1. Downloading the OpenVINO installer
  2. Placing the IR models in an appropriate folder structure where each model includes a set of numerical versions

Once the Docker container is configured and launched according to the documentation, it can serve the inference interface to local applications or over the network.

2. Integration with AWS Sagemaker

AWS Sagemaker* is an inference solution with a REST API interface and capabilities for configuring custom pre- and post-processing. It passes inference requests to the TensorFlow Serving component, which is installed in the same Docker container along with the Sagemaker services.

It is possible to replace the TensorFlow Serving component with OpenVINO Model Server without changing the client code or the Sagemaker component. The replacement is mostly transparent: the only modifications needed are an updated Dockerfile for building the Sagemaker component and a minor code update that injects a command for starting OpenVINO Model Server instead of TensorFlow Serving. The example code is available in the OpenVINO Model Server source code repository.

You can take advantage of the capabilities of AWS Sagemaker while improving the performance and reducing the response latency.

3. Inference Serving Service in Kubernetes

OpenVINO Model Server can be easily deployed in a Kubernetes* environment. Such a configuration enables new capabilities due to its scalability and high availability. Each instance of OpenVINO Model Server is represented by a Kubernetes pod attached to a service exposed via nginx ingress.

Inference operations are stateless, which makes the infrastructure easy to scale horizontally up and down according to user demand. The AI models need to be mounted into the pods from volumes backed by storage solutions such as NFS, Ceph, S3 or others supported by Kubernetes.
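
As a rough illustration of this setup, the Python sketch below generates a Kubernetes Deployment manifest with several identical pods that mount the models read-only from an NFS volume. The image name, port 9001 and the /opt/ml mount path are taken from the Docker commands earlier in this article; the deployment name, replica count, NFS server and export path are hypothetical placeholders. The container start command is omitted and would have to be added as described in the Deployment Process section, and the Service and ingress objects are not shown.

```python
import json


def build_deployment(name, image, replicas, nfs_server, nfs_path):
    """Build a Kubernetes Deployment manifest (as a dict) for stateless model-server pods."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            # Inference is stateless, so scaling is just a matter of changing this number.
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": image,
                        "ports": [{"containerPort": 9001}],
                        "volumeMounts": [
                            {"name": "models", "mountPath": "/opt/ml", "readOnly": True}
                        ],
                    }],
                    "volumes": [{
                        "name": "models",
                        "nfs": {"server": nfs_server, "path": nfs_path, "readOnly": True},
                    }],
                },
            },
        },
    }


manifest = build_deployment(
    name="openvino-model-server",         # hypothetical deployment name
    image="openvino-model-server:latest",
    replicas=3,                           # arbitrary example value
    nfs_server="nfs.example.com",         # hypothetical NFS server
    nfs_path="/models",                   # hypothetical export path
)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with kubectl, which accepts JSON as well as YAML manifests.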

Advantages for AI Applications

To summarize, OpenVINO Model Server has multiple benefits for data scientists and inference consumers:

  • Support for multiple frameworks: A common inference system can be used for frameworks like TensorFlow, Caffe, and MXNet, as well as frameworks that can export their models to the ONNX format. Only a conversion step with the Model Optimizer is needed to generate the Intermediate Representation files.
  • Support for gRPC interface: OpenVINO Model Server uses gRPC, which runs over the HTTP/2 protocol and is commonly recognized for its fast transfer rate and high tolerance of network latency.
  • Ease of transition from existing API: For existing clients and applications relying on TensorFlow Serving API, the transition is mostly transparent.
  • Improved performance: Performance improves with identical hardware and models, resulting in quicker response times.
  • Support for Intel FPGAs and Intel Movidius VPUs: Users can take advantage of hardware capabilities and software optimizations for inference execution, both at the edge and in the data center.
  • Ease of Python Service Implementation: The source code is easy to analyze, which also makes it simple to extend with extra features. Python is widely used and well known to developers, and it is convenient for troubleshooting and debugging.
  • Ease of installation and integration: Docker containers make OpenVINO Model Server easy to integrate with a wide range of platforms and solutions.

The Intel AI inference platform team would like to thank the GE Healthcare team for their collaboration in designing and testing OpenVINO Model Server integration with AWS Sagemaker and providing valuable feedback about performance results. We would also like to thank Prashant Shah and the Intel AI business development team for their collaboration and partnership, the Intel AI benchmarking team in Poland for help in executing performance tests on multiple configurations and hardware, and the Intel OpenVINO development team in Nizhny for assistance and numerous consultations. These contributions enabled Marek Lewandowski and the AIPG inference platform team to design and develop OpenVINO Model Server in a short timeframe.