With the recent flood of breakthrough products using deep learning for image classification, speech recognition, and text understanding, it’s easy to think deep learning is just about supervised learning. But supervised learning requires labels, which most of the world’s data does not have. Instead, unsupervised learning, which extracts insights from unlabeled data, will open deep learning to a much more diverse set of applications.
There are obvious use cases, such as using generative models for texture generation or super-resolution (https://arxiv.org/abs/1609.04802). Even more interesting are the possibilities for semi-supervised learning: by learning efficient data representations, models that today require millions of images or thousands of hours of speech to train could achieve similar performance with just a handful of labeled examples. In this blog post, we will explore how this research ties in with our work on high-performance, low-bit-width deep learning hardware.
The Twitter Cortex team uses GANs for super-resolution. Left: the original image; right: the high-resolution version produced by the network.
Unsupervised deep learning has gained substantial momentum with Generative Adversarial Networks (GANs), which pose network training as a two-player game between two competing networks. One of the networks, the generator, learns to transform low-dimensional noise into samples that mimic the training data (e.g. images). The second network, the discriminator, learns to distinguish fake images produced by the generator from real images in the training data. The cost function of the GAN is thus based on a simple binary classification problem: the discriminator is trained to classify as accurately as possible, while the generator is optimized to confuse the discriminator as much as possible. In a perfectly trained GAN with sufficient capacity, the generator outputs data that is statistically indistinguishable from the real data, and the discriminator performs at chance level; this is a theoretically stable state, though convergence to it is often tricky to achieve in practice. GANs were invented by Ian Goodfellow in 2014, while he was a grad student in Yoshua Bengio’s lab in Montreal. Ian has a lot of practical information about training GANs in his NIPS 2016 tutorial. A particularly popular flavor of this model is the DC-GAN, developed by researchers at FAIR.
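As a concrete illustration of this two-player objective, here is a minimal NumPy sketch of the two loss functions on top of the discriminator's raw scores (function and variable names are our own, not neon API; the generator loss uses the non-saturating form from the original GAN paper):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gan_discriminator_loss(real_logits, fake_logits):
    # Binary cross-entropy: real images are labeled 1, generated images 0.
    # The discriminator minimizes this, i.e. classifies as well as it can.
    return -np.mean(np.log(sigmoid(real_logits)) +
                    np.log(1.0 - sigmoid(fake_logits)))

def gan_generator_loss(fake_logits):
    # Non-saturating generator objective: maximize log D(G(z)),
    # which gives stronger gradients early in training.
    return -np.mean(np.log(sigmoid(fake_logits)))
```

At the chance-level equilibrium the discriminator outputs 0.5 everywhere (logits of zero), so its loss settles at 2·log 2 ≈ 1.386.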
One dog is real, one is generated by the DC-GAN algorithm. Left is the real dog, right is the generated dog.
The Wasserstein GAN (W-GAN), developed by Martin Arjovsky at NYU’s Courant Institute of Mathematical Sciences together with Facebook researchers, marked a major recent milestone in GAN development. The W-GAN has two big advantages: it is easier to train than a standard GAN because its cost function provides a more robust gradient signal, and that cost function (an estimate of the Wasserstein-1 distance) can be used to monitor convergence, making it much easier to design a good model and find the right set of hyperparameters.
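The difference from the standard GAN objective is easiest to see in code. Below is a minimal sketch of the W-GAN critic and generator losses, along with the weight clipping the original paper uses to keep the critic approximately Lipschitz (names are illustrative, not taken from neon):

```python
import numpy as np

def wgan_critic_loss(real_scores, fake_scores):
    # Negative of the Wasserstein-1 distance estimate: the critic is
    # trained to maximize the gap between real and generated scores,
    # so minimizing this loss widens that gap.
    return np.mean(fake_scores) - np.mean(real_scores)

def wgan_generator_loss(fake_scores):
    # The generator tries to raise the critic's score on its samples.
    return -np.mean(fake_scores)

def clip_weights(params, c=0.01):
    # Weight clipping from the original W-GAN paper: constrain every
    # critic parameter to [-c, c] after each update.
    return [np.clip(w, -c, c) for w in params]
```

Note there is no sigmoid and no log: the critic outputs unbounded scores, and the quantity mean(real) − mean(fake), tracked over training, is the convergence signal mentioned above.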
GANs are popular as generative models of images, but they have not yet reached the holy grail of modeling the full distribution of natural images. They do much better when trained on images from a particular class, such as birds and flowers, faces, and, for unfathomable reasons, bedrooms, for which a set of 3 million images is available from Princeton’s large-scale scene understanding (LSUN) dataset.
At Nervana, we follow the cutting edge of machine learning research very closely, so we can optimally support new models in the next generation of AI hardware we are developing. Our team of data scientists implements new models in our neon deep learning framework, where we can run them through our suite of simulators. One of the main differences between Nervana hardware and other accelerators is that we use the Flexpoint data format, which combines the hardware-friendly aspects of fixed point with the “it just works” user friendliness of floating point.
There has been a lot of research into low-precision data types for deep learning, ranging from 16-bit floating point all the way down to binary neural networks, which we have blogged about previously. These networks often require significant changes relative to their 32-bit counterparts, whereas Flexpoint is designed to work without any changes to the network or training procedure.
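Flexpoint itself lives in our hardware and simulator stack, but the core idea, a tensor-wide shared exponent with fixed-point mantissas, can be sketched roughly in NumPy. This is a simplified illustration of the concept only, not the actual Flexpoint algorithm, which also manages the exponent automatically across training iterations:

```python
import numpy as np

def quantize_shared_exponent(x, mantissa_bits=16):
    # Illustrative shared-exponent quantization: choose one exponent for
    # the whole tensor so the largest magnitude fits, then round every
    # element to a signed fixed-point mantissa of the given width.
    max_abs = np.max(np.abs(x))
    if max_abs == 0:
        return x.copy()
    # Exponent of the least significant bit; values exactly at a power
    # of two lose one LSB to clipping in this simplified version.
    exp = int(np.ceil(np.log2(max_abs))) - (mantissa_bits - 1)
    scale = 2.0 ** exp
    q = np.round(x / scale)
    limit = 2 ** (mantissa_bits - 1) - 1
    q = np.clip(q, -limit - 1, limit)
    return q * scale
```

With a 16-bit mantissa, elements within the tensor keep fixed-point precision relative to the shared scale, so elementwise hardware can operate on plain integers while the exponent is handled once per tensor.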
To prove the point, we took our W-GAN implementation and trained it on the LSUN bedroom dataset, both in 32-bit floating point and in Flexpoint with a 16-bit mantissa and a 5-bit exponent. The only changes we made to the original model were to use uniform instead of Gaussian noise (since it’s a little faster to sample) and a noise dimensionality of 128 instead of 100 (we really like powers of two). Neither of these changes seems to affect the quality of the results, which are shown below. The samples generated by the two models are shown after every epoch of training for a fixed set of noise inputs, so the content of the generated images changes frequently at the beginning and then stabilizes over the course of training. Results from the model trained in Flexpoint are visually indistinguishable, and inspecting the learning curves shows that convergence is unchanged.
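The noise-sampling change amounts to something like the following (the batch size and interval are illustrative; only the distribution and the dimensionality of 128 come from our setup):

```python
import numpy as np

rng = np.random.default_rng(0)
batch, z_dim = 64, 128  # noise dimensionality of 128, a power of two

# Uniform noise on [-1, 1) instead of a standard Gaussian: slightly
# cheaper to sample, with no visible effect on sample quality.
z = rng.uniform(-1.0, 1.0, size=(batch, z_dim))
```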
Right: Real LSUN images; Left: Learning Curves in floating point and Flexpoint
Right: Images generated from Flex 16+5 model; Left: Images generated from 32-bit floating point model
As far as we know, GANs have not yet received any attention in reduced bit-width deep learning, yet we can train them without having to make any changes to the model or our (simulated) hardware. The code we used to train this model in 32- and 16-bit floating point (although unfortunately not the Flexpoint simulator tools) is open source and available on GitHub as part of our neon examples.
“Training Generative Adversarial Networks in Flexpoint” was written by Urs Köster and Xin Wang.