Deep learning approaches are being applied across a broad spectrum of disciplines, having demonstrated that by combining big data with supervised learning, we can train systems to perform artificial intelligence (AI)-centric tasks previously considered impossible with traditional approaches. One of the biggest drivers of the machine learning trend has been the availability of large amounts of data with which these algorithms are calibrated, or trained. A great example is image data found readily online, which provides a rich source of samples for training algorithms to detect and classify objects. Training neural networks often requires hundreds to thousands of examples per class, so in our modern data-laden world, we are witnessing an explosion of interesting applications of machine learning techniques.
But what if the problem at hand doesn’t come with a treasure trove of readily available raw data? What if, for reasons such as privacy or scarcity, researchers are prevented from obtaining large enough sample sizes of real-world data to train these complex artificial neural networks? Some researchers have solved this problem creatively by employing what are known as synthetic datasets – virtually constructed datasets designed to stand in for real-world data in the machine learning process. This approach has shown some very interesting benefits in certain applications. Here’s a look at how the Intel Movidius group is utilizing synthetic datasets for various AI research endeavors.
Aerial volumetric scans using LiDAR (Light Detection and Ranging) and other technologies (that is, creating three-dimensional models from aerial surveys using manned or unmanned aircraft) are becoming increasingly valuable for companies working in infrastructure, maintenance, or surveying. One challenge for 3D mapping technology is “patchy” data – when a side or face of an object is hidden from the direct line of sight of a sensor, the resultant 3D model often appears patchy, with pieces missing.
This is a simple reflection of the fact that a system cannot reconstruct what it cannot see. To solve this problem, Intel researchers have trained a system to make educated “guesses” as to what should be in the gaps. In the absence of real-world data, our researchers Jonathan Byrne and Alessandro Palla trained a deep neural network on a dataset of sparsified 3D models that intentionally had parts removed from them. By comparing the sparsified models to the original, complete 3D models, the algorithm learns to estimate what kinds of geometry can be used to fill the gaps in incomplete aerial scans. The result is a much cleaner, more complete 3D model, as can be seen below in the results of a “repaired” 3D aerial scan taken from a drone survey conducted in Dublin, Ireland.
(For more information on this work, please refer to Convolutional Neural Network on Neural Compute Stick for Voxelized Point-clouds Classification by authors Xiaofan Xu, Joao Amaro, Sam Caulfield, Andrew Forembski, Gabriel Falcao and David Moloney)
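To make the gap-filling idea concrete, here is a minimal sketch of how one sparsified training pair might be generated from a complete voxel model. The `sparsify` helper and its 30% drop fraction are illustrative assumptions for this post, not the researchers’ actual pipeline:

```python
import numpy as np

def sparsify(voxels, drop_fraction=0.3, rng=None):
    """Return a copy of a binary voxel grid with a random fraction
    of its occupied voxels removed, simulating patchy scan data."""
    rng = rng or np.random.default_rng(0)
    sparse = voxels.copy()
    occupied = np.argwhere(sparse)                 # indices of filled voxels
    n_drop = int(len(occupied) * drop_fraction)
    drop = occupied[rng.choice(len(occupied), n_drop, replace=False)]
    sparse[tuple(drop.T)] = 0                      # knock out chosen voxels
    return sparse

# Toy example: a solid 8x8x8 cube with 30% of its voxels removed.
complete = np.ones((8, 8, 8), dtype=np.uint8)
partial = sparsify(complete, drop_fraction=0.3)
# (partial, complete) now form one input/target training pair.
```

A network trained on many such pairs sees the sparsified grid as input and the complete grid as the target, which is what lets it “repair” real scans it has never seen.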
Image classification is one of the most common uses of neural networks today. By training deep neural networks on thousands of 2-dimensional images that can be easily found online, researchers and companies alike have been able to build strikingly accurate image classifiers. One remaining challenge is to create object identifiers that are both scale and rotation invariant – meaning they can identify an object regardless of its size and orientation relative to the camera sensor.
Training a system on 3-dimensional objects could potentially yield a very accurate object-identification algorithm, but the process of creating large-scale 3D datasets can be prohibitively time consuming. Capturing every conceivable angle and position of, say, a sofa in a living room could take dozens of hours – and that’s merely one of the many objects one may want to build a dataset for. To solve this problem, Intel researchers again turned to synthetic datasets. They created 3-dimensional models of common household objects, and then procedurally generated thousands of differing angles and positions for each object inside different rooms. Training on this synthetic data has already shown impressive results, with up to 91.7% accuracy¹ achieved for classification of these 3D objects. With this kind of algorithm, rapidly trained on synthetic 3D data, the next step is to validate against real-world counterparts, scanned in 3D with an Intel® RealSense™ depth camera. While these advanced cameras are still required in the process, the major improvement is that real-world 3D camera data is now required only to validate the algorithms, rather than to produce the lion’s share of the training data.
(For more information on this work, you can explore the full publication: Evaluation of Synthetic Data for Deep Learning Stereo Depth Algorithms on Embedded Platforms by authors Kevin Lee and David Moloney.)
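The procedural generation of viewpoints described above can be sketched in a few lines. The `random_poses` helper, room dimensions, and pose ranges below are purely illustrative assumptions, not the actual Intel tooling:

```python
import math
import random

def random_poses(n, room_size=(5.0, 4.0, 3.0), seed=42):
    """Generate n random object poses (floor position, yaw rotation, scale)
    inside a room. Each pose could drive a renderer to produce one
    synthetic training view of the object."""
    rng = random.Random(seed)
    w, d, _h = room_size
    poses = []
    for _ in range(n):
        poses.append({
            "x": rng.uniform(0.0, w),              # position on the floor plan
            "y": rng.uniform(0.0, d),
            "z": 0.0,                              # object rests on the floor
            "yaw": rng.uniform(0.0, 2 * math.pi),  # rotation about vertical axis
            "scale": rng.uniform(0.8, 1.2),        # mild size variation
        })
    return poses

views = random_poses(1000)  # 1000 synthetic viewpoints for one object model
```

Because the poses are generated rather than photographed, producing a thousand more views is a matter of seconds of compute rather than hours of capture.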
A very common application of computer vision is stereo depth. Put simply, stereo depth refers to a method of deriving three-dimensional depth data by comparing disparities between the inputs from two slightly offset image sensors. Much like our own pair of eyes, the small distance separating the two points of view enables an algorithm to determine geometric information such as depth. Stereo depth is an important component of many modern applications in fields such as robotics, drones, and even consumer-centric categories such as virtual reality and smart shopping.
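The underlying geometry is compact: for a rectified stereo pair, depth is focal length times baseline divided by disparity. A minimal sketch, with illustrative focal-length and baseline values:

```python
import numpy as np

def disparity_to_depth(disparity, focal_length_px, baseline_m):
    """Convert a disparity map (in pixels) to metric depth using the
    standard rectified-stereo relation: depth = f * B / disparity."""
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.full_like(disparity, np.inf)    # zero disparity => infinitely far
    valid = disparity > 0
    depth[valid] = focal_length_px * baseline_m / disparity[valid]
    return depth

# Example: 700 px focal length, 6 cm baseline, 10 px disparity -> 4.2 m depth.
depth = disparity_to_depth([[10.0]], focal_length_px=700.0, baseline_m=0.06)
```

The hard part, of course, is not this formula but estimating a reliable disparity map in the first place – which is where the machine learning approaches below come in.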
Much like in other fields, researchers have begun to attack the problem of stereo depth with new machine learning approaches. While visual elements such as occlusion (when objects are partially hidden by others) and reflections cause problems for conventional stereo-matching algorithms, deep neural networks appear to be less prone to these challenges. That being said, the major stumbling block for a machine learning approach to stereo depth is obtaining large amounts of reliable training data.
As Intel researcher Kevin Lee describes: “CNNs are capable of using a deep representation to learn directly from the pixels and overcome these shortcomings but in order to do so, they require copious amounts of high quality annotated data.”
To solve this problem using synthetic data, Intel researchers recreated 3D environments of rooms in various sizes and dimensions, even going so far as to realistically light and texture the scenes using the popular 3D development tool Blender*. Once the base models were created, our researchers could generate thousands of image pairs (two images of the exact same scene, captured with virtual cameras spaced a few inches apart). With this flexible model, it is possible to create image pairs from virtually any angle, lighting condition, and baseline (the distance between the two cameras).
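Placing the paired virtual cameras comes down to offsetting each one by half a baseline along the camera’s lateral axis. A minimal sketch, assuming a simple right-handed coordinate convention (the helper below is illustrative, not part of the actual Blender pipeline):

```python
import numpy as np

def stereo_pair_positions(center, forward, up, baseline):
    """Given a virtual camera's position and viewing direction, return the
    left/right camera positions, each offset by half the baseline along
    the camera's lateral (right) axis."""
    center = np.asarray(center, dtype=float)
    forward = np.asarray(forward, dtype=float)
    up = np.asarray(up, dtype=float)
    right = np.cross(forward, up)
    right /= np.linalg.norm(right)             # unit lateral axis
    half = 0.5 * baseline
    return center - half * right, center + half * right

# A camera 1.5 m up, looking along +y, with a 6.5 cm baseline:
left, right = stereo_pair_positions([0, 0, 1.5], [0, 1, 0], [0, 0, 1], 0.065)
```

Because the scene is synthetic, both renders are perfectly synchronized and perfectly calibrated by construction – something physical rigs need careful engineering to approximate.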
Compare generating synthetic datasets to the traditional method: creating the industry-standard KITTI dataset required a Volkswagen* Passat, a LiDAR sensor, a pair of color cameras, and a pair of monochrome cameras. In addition to this hardware, the team of researchers spent countless hours mounting and calibrating the sensors on the vehicle before investing more than six hours capturing the actual footage.
By employing a synthetic dataset, our researchers were able to achieve accuracy within 1.8% of real-world counterparts², all without spending any time capturing real-world data. Another benefit of the synthetic data approach is that labeling is automatic, reducing the time otherwise required to hand-label data for supervised learning applications.
Another challenge for using real-world data to train depth algorithms is obtaining data for difficult corner cases. Taking physical cameras into unusual environments, or attempting to capture rare occurrences can be both time consuming and expensive. With synthetic alternatives, researchers can simply define an artificial scenario and run as many variations as they wish. This can reduce both cost and development times significantly.
While the use of synthetic datasets is showing tremendous promise, there are limitations that we have identified. Using synthetic data often requires the additional step of validating against real-world data. While this may not be a tremendous hurdle, researchers need to think carefully about ensuring their synthetic datasets accurately represent the real-world conditions for which they substitute. Strange aberrations or oversimplified representations in synthetic datasets can have hidden knock-on effects on an algorithm’s performance when it is unleashed in a real-world setting.
As is often the case, it is likely a hybrid solution will provide the best approach for training deep neural networks. By complementing synthetic data with real-world data, it will be possible to use synthetic data to rapidly and cost-effectively create a generalized model, and then use smaller amounts of real-world data to fine-tune it. This hybrid model becomes particularly interesting when network endpoints are able to contribute real-world data back to a centralized host, where additional accuracy can be gained by tuning on the new real-world data. The increasing demand for training sets, combined with ever-improving simulation techniques, means we’ll likely see many more examples of machine learning with both synthetic and real-world data components going forward.
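The pretrain-then-fine-tune recipe can be illustrated on a toy one-dimensional linear model. The data and hyperparameters below are entirely illustrative, not drawn from the Intel experiments: abundant “synthetic” data with a slightly mis-specified relationship provides a good starting point, and a small “real” set pulls the model the rest of the way:

```python
import numpy as np

rng = np.random.default_rng(0)

# "Synthetic" data: plentiful but slightly off the true relationship.
x_syn = rng.uniform(-1, 1, size=(1000, 1))
y_syn = 2.0 * x_syn + 0.5

# "Real" data: scarce but reflects the true relationship (y = 2.3x + 0.4).
x_real = rng.uniform(-1, 1, size=(20, 1))
y_real = 2.3 * x_real + 0.4

def fit_sgd(x, y, w, b, lr=0.1, epochs=200):
    """Plain least-squares gradient descent on a 1-D linear model."""
    for _ in range(epochs):
        err = (w * x + b) - y
        w -= lr * 2 * np.mean(err * x)
        b -= lr * 2 * np.mean(err)
    return w, b

# Stage 1: pretrain a generalized model on abundant synthetic data.
w, b = fit_sgd(x_syn, y_syn, w=0.0, b=0.0)
# Stage 2: fine-tune the same parameters on the small real-world set.
w, b = fit_sgd(x_real, y_real, w, b, lr=0.05, epochs=100)
```

The same two-stage pattern scales up to deep networks: the synthetic stage buys generalization cheaply, and the small real-world stage corrects for whatever the simulation got wrong.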
¹ Full configuration details: Convolutional Neural Network on Neural Compute Stick for Voxelized Point-clouds Classification
² Full configuration details: Evaluation of Synthetic Data for Deep Learning Stereo Depth Algorithms on Embedded Platforms