Currently, more than 75% of all internet traffic is visual (video and images). Total traffic is exploding, projected to jump from 1.2 zettabytes per year in 2016 to 3.3 zettabytes in 2021, with visual data comprising roughly 2.6 zettabytes of that.
A major challenge for applications is how to process and understand this visual data, a capability called “visual understanding”. So what exactly is visual understanding?
Visual understanding (VU) is one important part of computer vision. According to Wikipedia, computer vision is a field that includes methods for acquiring, processing, analyzing, and understanding images and, in general, high-dimensional data from the real world in order to produce numerical or symbolic information. In short, VU is the process of analyzing and understanding images and videos. It focuses mainly on object-level processing, in contrast to the pixel-level processing of the imaging stage. The objective of VU is to derive knowledge from images and videos of the real world. VU encompasses the following capabilities, among others:
In the following example, the Person and Camera have been detected and classified, and the action has been recognized as Taking Pictures.
Intel Labs China, directed by Dr. Yurong Chen, has been making dramatic strides in VU. Dr. Chen is a Senior Research Director and Principal Research Scientist at Intel Corporation and Director of Cognitive Computing Lab at Intel Labs China (ILC). Under his direction ILC has made significant progress in these three key areas:
ILC has developed a full face analysis pipeline with “best in class” algorithms, resulting in more than twenty patents awarded for this work.
Face analysis research at ILC is advancing rapidly, with current technologies able to recognize a subject’s face, gender, age, expression, and emotion, and to create live 3D facial animations with emotional enhancements in real time. Applications include avatar representations, virtual reality, augmented reality, and gaming. These face analysis technologies have been integrated and leveraged in a variety of other technologies and applications, including Intel® RealSense™ technology, the OpenVINO™ toolkit, client application prototyping, and IoT video E2E solutions.
Intel’s 3D face technology is able to recognize emotions and perform 3D face modeling, tracking, and enhancements in real time, for applications in virtual reality, augmented reality, and gaming. Using Intel Labs China’s 3D face technology, Intel collaborated with Chinese pop star Chris Lee to create the world’s first AI music promotion video.
Visual emotion recognition will be key for smart devices. ILC’s Action Units-Aware Features and Interactions (AUAFI) technology leverages multi-task learning to decode facial muscle movements and their inherent interactions. Tested against the CK+ dataset of expressions, consisting of 327 videos, 7 basic facial expressions, and 118 subjects, AUAFI achieved a 98.7% overall recognition rate. Tested against the MMI facial expression database of 205 videos, 6 basic facial expressions, and 23 subjects with large-pose variations, AUAFI achieved an 80.27% overall recognition rate. Intel Labs China presented AUAFI at the ACM ICMI Conference in 2015.
Audio is another important cue for emotion recognition. ILC proposed “Importance-Aware Features” with selective grouping for audio emotion recognition, and designed a fusion framework that makes the best use of the visual and audio modalities for emotion recognition in the wild.
Competing against seventy-four teams from around the world, including teams from Carnegie Mellon, the University of Illinois Urbana-Champaign, and Microsoft Research, Intel Labs China won First Place in the ‘Emotion Recognition in the Wild Challenge’ 2015 (EmotiW 2015) in the audio-video based task. With Intel’s entry, Capturing AU-Aware Facial Features and Their Latent Relations for Emotion Recognition in the Wild, ILC scored an overall recognition rate of 53.8% against the EmotiW 2015 AFEW dataset (against a baseline of 39.33%) and an overall recognition rate of 55.38% against the EmotiW 2015 SFEW dataset (against a baseline of 39.13%). (AFEW stands for Acted Facial Expressions in the Wild; SFEW stands for Static Facial Expressions in the Wild.)
Following are samples of the EmotiW 2015 video clips.
Sources: 1) A. Dhall, R. Goecke, S. Lucey and T. Gedeon, “Collecting Large, Richly Annotated Facial Expression Databases from Movies”, IEEE MultiMedia 19 (2012) 34-41.
2) A. Dhall, R. Goecke, J. Joshi, K. Sikka and T. Gedeon, “Emotion Recognition in the Wild Challenge 2014: Baseline, Data and Protocol”, ACM ICMI 2014. https://cs.anu.edu.au/few/AFEW_Ver_4_2014_License.pdf
In 2016, ILC invented a deep yet computationally efficient CNN framework named HoloNet (represented in the figure below) for robust emotion recognition. ILC won First Runner Up with HoloNet against 100 registered teams in EmotiW 2016 (ACM ICMI ’16) in the audio-video based task, as well as the Most Influential Paper award across the past four years’ challenges. Intel’s method, a fusion of ILC’s convolutional neural network model named HoloNet (A&B), plus one audio model and one iDT model, achieved a test score of 57.84% against the AFEW 6.0 dataset. Abhinav Dhall, EmotiW 2016 Chairperson, had this to say regarding ILC’s submission: “… You showed me a really novel method, no use of extra data and its speed is hundreds of times faster than the other competitors.”
Supervised Scoring Ensemble (SSE) is an approach to emotion recognition invented by ILC that applies supervision not only to deep layers, but also to intermediate and shallow layers of a convolutional neural network. This method also employs a new fusion structure in which class-wise scoring activations at diverse complementary-feature layers are concatenated and used as the inputs for second-level supervision, thus acting as a deep feature ensemble within a single CNN. This approach brings large accuracy gains over diverse backbone networks. ILC presented SSE at the ACM International Conference on Multimodal Interaction 2017, where it achieved a recognition rate of 60.34% on that year’s audio-video-based emotion recognition task, surpassing all existing records.
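The fusion step described above can be sketched in a few lines: class-wise score vectors produced at several supervised layers are concatenated and fed to a second-level classifier. The following NumPy sketch is illustrative only; the layer scores, weight shapes, and random values are hypothetical stand-ins, not ILC's trained model:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sse_fusion(layer_scores, w2, b2):
    """Concatenate class-wise scores from several supervised layers and
    feed them to a second-level classifier (the 'scoring ensemble')."""
    stacked = np.concatenate(layer_scores, axis=-1)  # (n_layers * n_classes,)
    return softmax(stacked @ w2 + b2)

rng = np.random.default_rng(0)
n_classes = 7  # e.g., seven basic emotions
# hypothetical class-wise scores from shallow, intermediate, and deep layers
layer_scores = [rng.normal(size=n_classes) for _ in range(3)]
w2 = rng.normal(size=(3 * n_classes, n_classes)) * 0.1
b2 = np.zeros(n_classes)
probs = sse_fusion(layer_scores, w2, b2)  # final class probabilities
```

In training, each of the three score heads would receive its own supervision signal in addition to the second-level loss, which is what distinguishes the approach from fusing only the final layer.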
Over time, Intel Labs China has applied for and received dozens of patents for its methods in designing and training large and deep convolutional neural networks (CNNs) for Intel® platforms. ILC has developed and optimized algorithms and models for general visual recognition, performing large-scale object classification and multiclass object detection, and then developed specific applications targeted to edge (IoT) devices for recognizing objects in real-life scenarios, such as pedestrians and cars. More recently, ILC has been working on advances in CNN algorithm design to better balance the needs for accuracy, speed, memory consumption, and power efficiency, in order to support edge-device deployment with FPGAs, VPUs, etc.
Most top-performing object detection networks employ region proposals to guide the search for objects. Although leading region proposal network methods can achieve promising detection accuracy, they typically require several hundred proposals per image, which is inefficient, and they still struggle to detect and precisely locate smaller objects.
Intel Labs China, in conjunction with Tsinghua University, designed HyperNet to alleviate these shortcomings. HyperNet handles region proposal generation and object detection jointly and is primarily based on an elaborately designed Hyper Feature which aggregates hierarchical feature maps first, and then compresses them into a uniform space. Hyper Features incorporate highly-semantic, complementary, and high-resolution features of the image, thus allowing HyperNet to generate proposals and detect objects via an end-to-end joint training strategy. Using the deep VGG16 CNN pre-trained model, HyperNet achieves leading recall and state-of-the-art object detection accuracy on PASCAL VOC 2007 and 2012 datasets using only 100 proposals per image. Advantages include high-recall, a smaller memory footprint, and speed. In tests, HyperNet ran at five frames per second (including all steps), thus having the potential for real-time processing in deployment.
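The Hyper Feature idea — pooling shallow high-resolution maps down, upsampling deep low-resolution maps up, concatenating at a common resolution, and compressing channels — can be sketched with NumPy. This is a minimal illustration of the aggregation scheme, not HyperNet itself; all sizes are made up, and HyperNet uses learned deconvolution rather than the nearest-neighbour upsampling shown here:

```python
import numpy as np

def max_pool2x2(x):
    """2x2 max pooling on a (C, H, W) map; assumes H and W are even."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def upsample2x(x):
    """Nearest-neighbour 2x upsampling on a (C, H, W) map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def hyper_feature(shallow, mid, deep, w_compress):
    """Bring hierarchical maps to the middle resolution, concatenate them,
    then compress channels with a 1x1 convolution (a per-pixel matmul)."""
    agg = np.concatenate([max_pool2x2(shallow), mid, upsample2x(deep)], axis=0)
    c, h, w = agg.shape
    flat = agg.reshape(c, h * w)
    return (w_compress @ flat).reshape(-1, h, w)

rng = np.random.default_rng(1)
shallow = rng.normal(size=(8, 32, 32))   # high-resolution, low-level features
mid     = rng.normal(size=(16, 16, 16))
deep    = rng.normal(size=(32, 8, 8))    # low-resolution, high-level features
w_compress = rng.normal(size=(16, 8 + 16 + 32)) * 0.1
hf = hyper_feature(shallow, mid, deep, w_compress)  # uniform (16, 16, 16) map
```

Both the proposal branch and the detection branch then operate on this single aggregated map, which is what makes end-to-end joint training straightforward.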
At the 2017 Conference on Computer Vision and Pattern Recognition, ILC presented its work on a new framework for object detection called “Reverse Connection with Objectness Prior Networks”, or RON. RON is a fully convolutional framework that combines the merits of the two mainstream solution families (region-based and region-free) and eliminates their two major detractions:
RON can directly predict final detection results from all locations of various feature maps. Extensive experiments on the standard datasets demonstrate the competitive performance of RON. Specifically, with VGG-16 and a low-resolution 384×384 input size, RON gets 81.3% mean Average Precision (mAP) on the PASCAL Visual Object Classes (VOC) 2007 dataset and 80.7% mAP on the PASCAL VOC 2012 dataset. Its superiority increases as datasets become larger and more difficult, as demonstrated by the results on the Microsoft Common Objects in Context (COCO) dataset. With COCO, RON excelled in both state-of-the-art accuracy and speed.
State-of-the-art object detectors rely heavily on off-the-shelf networks pre-trained on large-scale classification datasets such as ImageNet. This approach incurs learning bias due to the differences in the loss functions and category distributions between classification and detection tasks. Fine-tuning the model for detection can alleviate this bias to some extent, but not entirely. Moreover, transferring pre-trained models from classification to detection across discrepant domains is difficult (for example, from RGB to depth images). A better solution to both problems is to train object detectors from scratch, and this is what ILC’s Deeply Supervised Object Detector (DSOD) framework achieves.
Previous efforts in this direction have largely failed due to excessively complicated loss functions and limited training data in object detection. For DSOD, ILC developed a set of design principles for training object detectors. One of the team’s key findings is that deep supervision, enabled by dense layer-wise connections, plays a critical role in training a good detector. Combined with several other improvements, ILC developed a DSOD following the single-shot detection (SSD) framework. Experiments on the PASCAL VOC 2007 and 2012 datasets and the MS COCO dataset demonstrate that DSOD can achieve better results than state-of-the-art solutions with highly compact models. ILC’s DSOD outperforms SSD on all three benchmarks above, yet requires only 1/2 of the parameters of SSD and 1/10 of the parameters of Faster RCNN (https://arxiv.org/abs/1506.01497). These features make DSOD suitable for training with limited data for specific problems, and open doors to other domains, such as depth, medical, and multi-spectral images.
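The dense layer-wise connections that enable deep supervision can be illustrated compactly: each layer consumes the concatenation of all earlier feature maps, so the loss gradient reaches every layer directly. The NumPy sketch below shows only this connectivity pattern (with 1x1 convolutions simplified to matrix multiplies); all sizes and weights are hypothetical, and DSOD's actual blocks are more elaborate:

```python
import numpy as np

def dense_block(x, weights):
    """Each layer sees the concatenation of all earlier outputs, so the
    supervision signal propagates directly to every layer (the 'deep
    supervision via dense layer-wise connections' idea)."""
    features = [x]
    for w in weights:
        inp = np.concatenate(features, axis=0)  # channel-wise concat
        out = np.maximum(w @ inp, 0.0)          # 1x1 conv + ReLU, simplified
        features.append(out)
    return np.concatenate(features, axis=0)

rng = np.random.default_rng(2)
growth, c0, hw = 4, 8, 16                       # hypothetical sizes
x = rng.normal(size=(c0, hw))
# layer i maps (c0 + i*growth) input channels to `growth` new channels
weights = [rng.normal(size=(growth, c0 + i * growth)) * 0.1 for i in range(3)]
out = dense_block(x, weights)                   # (c0 + 3*growth) channels out
```

Because later layers reuse earlier feature maps instead of recomputing them, this connectivity is also what keeps the parameter count so low relative to SSD and Faster RCNN.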
HyperNet, RON, and DSOD are all CNN-based, multi-class object detection algorithms. Based on these algorithms, ILC has developed a multi-class object detection prototype system that can perform multi-class object detection in real time and deliver accurate results in complex scenes. This prototype can be widely used in applications such as automated driving and video analysis.
Artificial intelligence will have greater impact as fully functional models can be deployed to edge devices, such as mobile and IoT devices. However, pre-trained, full-precision convolutional neural networks are resource-intensive, making them difficult to deploy to devices with limited computational resources. The need is to greatly reduce CNN complexity: to prune and compress the trained model to improve performance and allow compressed models to run efficiently on edge devices. ILC has developed an impressive and elegant solution, called Low-bit Deep Compression (LDC), which can achieve lossless compression on the order of 100X on deep neural networks (DNNs) with low-precision weights and low-precision activations. It thus paves the way for efficient inference engines in both hardware and software implementations. LDC includes three key modules:
ILC has developed a model compression process called Dynamic Network Surgery that performs intelligent network pruning on the fly. This process incorporates a new tool: connection splicing. Parameter importance can change during the pruning process. The loss of some connections due to excessive pruning can actually result in accuracy loss and network damage. With Dynamic Network Surgery, pruned connections can be spliced back to the network, recovering needed parameters and network accuracy. Network pruning and maintenance become a continual process.
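The prune-then-splice idea can be sketched as a binary mask maintained alongside the weights: connections whose magnitude falls below a threshold are masked out, but because the masked weights keep receiving updates, any that grow back past a second threshold are spliced back in. This NumPy sketch shows one such step with made-up weights and thresholds; it is an illustration of the mechanism, not ILC's implementation:

```python
import numpy as np

def surgery_step(w, mask, prune_thr, splice_thr):
    """One pruning/splicing step of the mask. Weights below prune_thr are
    pruned (mask -> 0); previously pruned weights that have grown above
    splice_thr are spliced back (mask -> 1)."""
    mag = np.abs(w)
    mask = np.where(mag < prune_thr, 0.0, mask)   # prune weak connections
    mask = np.where(mag > splice_thr, 1.0, mask)  # splice strong ones back
    return mask

w = np.array([0.05, -0.8, 0.02, 1.2, -0.3])  # hypothetical weights
mask = np.ones_like(w)
mask = surgery_step(w, mask, prune_thr=0.1, splice_thr=0.5)
# mask is now [0., 1., 0., 1., 1.]; the forward pass uses w * mask,
# while gradient updates continue to flow to all entries of w
```

Repeating this step during training is what makes pruning "on the fly": a connection lost to an overly aggressive prune can recover, so accuracy is not permanently damaged.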
Incremental Network Quantization (INQ) is an innovative technique created by ILC that converts a pre-trained, full-precision CNN into a low-precision version, the weights of which are constrained to be either powers of two, or zero. INQ employs three novel operations: parameter partitioning, quantization, and re-training. This procedure is incremental, permitting consecutive model partitioning, quantization, and training cycles to optimize for greatest model compression along with sufficient model accuracy.
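The quantization operation at the heart of INQ constrains each surviving weight to the nearest power of two within a chosen exponent range, with very small weights set to zero. The NumPy sketch below shows that single operation under assumed exponent bounds and an assumed zero threshold (the incremental partition-and-retrain loop around it is omitted); it is illustrative, not ILC's exact rule:

```python
import numpy as np

def quantize_pow2(w, n1=-1, n2=-6):
    """Map each weight to the nearest value in {0, ±2^n2, ..., ±2^n1}.
    Weights with magnitude below an assumed threshold become zero."""
    out = np.zeros_like(w)
    mag = np.abs(w)
    alive = mag >= 1.5 * 2.0 ** (n2 - 1)          # hypothetical zero cutoff
    exp = np.clip(np.round(np.log2(np.where(alive, mag, 1.0))), n2, n1)
    out[alive] = np.sign(w[alive]) * 2.0 ** exp[alive]
    return out

w = np.array([0.3, -0.06, 0.001, 0.9])  # hypothetical full-precision weights
q = quantize_pow2(w)                    # e.g., 0.3 -> 0.25, -0.06 -> -0.0625
```

In the full INQ procedure, only the largest-magnitude partition of weights is quantized at each step while the rest are re-trained in full precision to absorb the quantization error, which is why the end result can be lossless. Power-of-two weights also let inference replace multiplications with bit shifts.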
The results delivered by INQ are impressive. INQ is able to deliver a lossless, low-precision CNN model from any full-precision reference. ILC conducted extensive experiments on the ImageNet large-scale classification task using almost all known deep CNN architectures and were able to show that:
ILC has developed innovative methods that achieve network quantization in both the network width and depth. This approach employs two novel network quantization methods: single-level network quantization (SLQ) for high-bit quantization, and multi-level network quantization (MLQ) for extremely low-bit (ternary) quantization. ILC was the first to consider network quantization at both the width and depth levels. At the width level, parameters are divided into two parts, one for quantization and the other for re-training to eliminate the quantization loss, and SLQ leverages the distribution of the parameters to improve this partitioning. At the depth level, ILC introduces incremental layer compensation, which quantizes layers iteratively and decreases the quantization loss at each iteration. Together, SLQ and MLQ achieve impressive results, validated with extensive experiments based on state-of-the-art neural networks including AlexNet, VGG-16, GoogLeNet, and ResNet-18.
ILC found that performing low-bit, deep compression employing the three methods described above, Dynamic Network Surgery, Incremental Network Quantization, and Multi-level Quantization, yielded truly impressive results in creating models that retained accuracy and achieved compression ratios greater than 100X using 2-bit quantization, compared to their pre-trained and full-precision but uncompressed counterparts. These combined levels of accuracy and efficiency are unmatched (at the time of testing).
The following table compares the accuracy of the LDC-derived models against other current state-of-the-art models, and shows compression rates relative to the original inference model size, using AlexNet on the ImageNet dataset as an example. LDC outperformed the state-of-the-art deep compression solution* by at least a 1% absolute accuracy margin on AlexNet, achieving >100X compression with 2-bit weights. For example, in the last row, the LDC-compressed model achieved a compression ratio of 142X while suffering an accuracy loss of only 0.96 percent in the Top-5 recognition rate.
| Method | Bit Width (Conv/FC) | Bit Width (Act) | Compression Ratio | Decrease in Top-1 / Top-5 Error Rate (%) |
|---|---|---|---|---|
| P+Q* | 8/5 | 32 | 27X | 0.00 / 0.03 |
| P+Q+H* | 8/5 | 32 | 35X | 0.00 / 0.03 |
| LDC | 4/4 | 4 | 71X | 0.08 / 0.03 |
| P+Q+H* | 4/2 | 32 | – | -1.99 / -2.60 |
| LDC | 3/3 | 4 | 89X | -0.52 / 0.20 |
| LDC | 2/2 | 4 | 142X | -1.47 / -0.96 |
A final approach to reducing complexity in CNNs, called “network slimming”, has been jointly developed by researchers from Intel Labs China, Tsinghua University, Fudan University, and Cornell University. Network slimming takes wide and large networks as input models and, during training, identifies insignificant portions or channels of the network. These are pruned, yielding thin, compact models with accuracies comparable to the input networks. This technique reduces the complexity of deep neural networks at the channel level and can reduce the model size by up to 20X, and the number of floating-point operations by up to 5X, all without accuracy loss for network structures such as VGGNet, ResNet, DenseNet, etc. With limited accuracy loss, network slimming can further reduce the number of floating-point operations by >10X. Unlike low-bit deep compression, the inference speedup from network slimming does not require any special hardware accelerators, just conventional floating-point hardware.
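The channel-selection step of network slimming can be sketched simply: batch-normalization scaling factors, trained with an L1 sparsity penalty so that unimportant channels shrink toward zero, are ranked by magnitude and only the strongest fraction of channels is kept. The NumPy sketch below shows this ranking with made-up scaling factors; the training-time sparsity penalty and the actual layer surgery are omitted:

```python
import numpy as np

def slim_channels(gamma, keep_ratio=0.5):
    """Rank channels by the magnitude of their BN scaling factor (assumed
    trained with an L1 penalty) and return the indices of the channels to
    keep, in ascending order."""
    k = max(1, int(len(gamma) * keep_ratio))
    keep = np.argsort(np.abs(gamma))[::-1][:k]  # strongest k channels
    return np.sort(keep)

gamma = np.array([0.9, 0.01, 0.4, 0.002, 0.7, 0.05])  # hypothetical BN scales
keep = slim_channels(gamma, keep_ratio=0.5)
# keep == [0, 2, 4]: the three channels with the largest |gamma| survive
```

Because whole channels are removed, the slimmed model is just a smaller dense network, which is why the speedup needs no special sparse-computation hardware.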
ILC also conducts advanced multimodal fusion and learning research to bridge the gap between visual recognition and visual understanding. The idea is to connect the dots from vision, speech, language, knowledge, and machine learning to make machines able to see and infer in ways indistinguishable from human beings. This effort includes research on video to text (VTT), visual question answering, visual relation detection, and so on. These capabilities will enable many important applications, such as natural interaction, intelligent visual cloud, personal visual assistants, and advanced visual control and decision making.
For video to text, ILC has focused on a novel and challenging vision task: dense video captioning, which aims to automatically describe a video clip with multiple informative and diverse caption sentences. ILC invented a weakly supervised dense video captioning approach that requires only video-level sentence annotations during training. First, the team proposed lexical fully convolutional neural networks (Lexical-FCN) with weakly supervised, multi-instance, multi-label learning to weakly link video regions with lexical labels. Second, they introduced a novel submodular maximization scheme to generate multiple informative and diverse region-sequences based on the Lexical-FCN outputs. Third, they trained a sequence-to-sequence, learning-based language model with the weakly supervised information obtained through the association process. The proposed method not only produces informative, diverse, and dense captions, but also outperforms many state-of-the-art single video captioning methods by a large margin.
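The second step, picking region-sequences that are informative yet diverse, is commonly done with a greedy scheme that approximately maximizes a submodular objective: at each round, pick the candidate whose informativeness minus its redundancy against already-chosen items is largest. The NumPy sketch below illustrates that generic greedy pattern with a toy objective and made-up scores; it is not ILC's exact formulation:

```python
import numpy as np

def greedy_diverse(scores, sim, k):
    """Greedily select k items, trading off an item's informativeness
    score against its maximum similarity to items already chosen — the
    standard greedy scheme for submodular diversity objectives."""
    chosen = []
    for _ in range(k):
        best, best_gain = None, -np.inf
        for i in range(len(scores)):
            if i in chosen:
                continue
            redundancy = max((sim[i][j] for j in chosen), default=0.0)
            gain = scores[i] - redundancy
            if gain > best_gain:
                best, best_gain = i, gain
        chosen.append(best)
    return chosen

scores = np.array([0.9, 0.85, 0.3])          # hypothetical informativeness
sim = np.array([[1.0, 0.95, 0.1],            # pairwise similarity
                [0.95, 1.0, 0.1],
                [0.1, 0.1, 1.0]])
# item 1 is nearly a duplicate of item 0, so the diverse pick is [0, 2]
picked = greedy_diverse(scores, sim, 2)
```

Greedy selection is attractive here because, for monotone submodular objectives, it carries a well-known (1 - 1/e) approximation guarantee while remaining fast enough for dense candidate sets.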
The ever-growing explosion in the volume of online visual data demands the ability to analyze, understand, and respond to that data, even in real time. Researchers at Intel Labs China, with partners in academia, are making significant, often breakthrough, progress in AI research. ILC is developing deep neural networks (DNNs) that can accurately and rapidly detect and recognize objects, understand complex scenes, recognize actions and activities, and recognize faces and their emotions. ILC is also developing a leading 3D face technology that can perform 3D face modeling and rendering in real time, with enhancements such as emotion cues, and is conducting research on automatic video-to-text transcription and captioning, among other capabilities. These advances are clearly impressive, but such complex DNNs require substantial compute and memory resources. Deployment for inference requires that DNNs be dramatically compressed and pruned to run on low-power platforms, at the edge, and in the IoT space, all while maintaining accuracy. ILC is also making dramatic progress developing technologies to achieve low-bit deep compression, on the order of up to 100X over trained, uncompressed DNNs. These ongoing advances in visual data analysis and understanding, coupled with low-bit deep model compression, are leading the way to deployment on a range of Intel AI devices, including low-powered edge devices. The opportunities are substantial. Watch this space.
Notices and Disclaimers
Benchmark results were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as “Spectre” and “Meltdown”. Implementation of these updates may make these results inapplicable to your device or system.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information about benchmarks and performance test results, go to http://www.intel.com/performance.
Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling and provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance.
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.
The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Currently characterized errata are available on request.
Intel does not control or audit third-party benchmark data or the websites referenced in this document. You should visit the referenced website and confirm whether referenced data are accurate.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.
Intel, the Intel logo, Xeon, Xeon Phi, OpenVINO and Intel Nervana are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others.
© 2018 Intel Corporation. All rights reserved.