Real-World AI at Enterprise Scale

What a difference a few years makes! We have seen tremendous shifts in the field of deep learning as it moves from model training to inference deployment within main lines of business across industries. Customers are asking us different questions now: How do I scale in the real world, quickly and cost-effectively, to stay competitive? How do I run AI applications in products with strict latency requirements that demand lightning-fast inference results? The answer has three key pieces: get more from the architecture you already know, accelerate with purpose, and use software to simplify the environment. Before we look at these key elements of real-world deployments, let's examine how we got here.

3 Years, 3 Major Changes

As deep learning (DL) has matured, CPU solutions have turned a page, achieving many-fold performance gains through AI-optimized software and hardware. Today, even older CPUs can deliver performance many times better than was once thought possible, and new generations of enhanced CPUs go further still. Three changes drove this shift:

  1. Entirely new software: Libraries that didn't exist just a couple of years ago now let you use broadly deployed CPU hardware in AI-specific ways, while preserving your existing software environment and the hardware you already use to run other enterprise applications.
  2. New hardware features: For the past few years we've worked to enhance our x86-based architecture with new AI hardware features, and earlier this month we announced the release of 2nd Generation Intel® Xeon® Scalable processors with Intel® Deep Learning Boost (Intel® DL Boost) technology to accelerate inference (a quick way to check for this support is sketched just after this list).
  3. AI cycles shift to inference in lines of business: We're on the cusp of a major shift to inference deployments at scale that meet a critical blend of performance, cost, and energy-efficiency needs. Currently, we estimate the training-to-inference workload ratio to be around 1:5. CPUs are well suited to the high-throughput, low-latency compute that the shift to inference demands.
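
As a concrete illustration of the hardware point above, here is a minimal sketch (assuming a Linux host, where CPU features are listed in /proc/cpuinfo) that checks whether the processor advertises the AVX-512 VNNI instructions that Intel DL Boost uses to accelerate INT8 inference. On other operating systems, a library such as py-cpuinfo can report the same flag.

```python
# Minimal sketch: detect AVX-512 VNNI (Intel DL Boost) support on a Linux host
# by scanning the CPU feature flags exposed in /proc/cpuinfo.
def has_vnni(cpuinfo_path="/proc/cpuinfo"):
    try:
        with open(cpuinfo_path) as f:
            for line in f:
                if line.startswith("flags"):
                    return "avx512_vnni" in line.split()
    except OSError:
        pass  # not a Linux host, or /proc is unavailable
    return False

if __name__ == "__main__":
    print("AVX-512 VNNI supported:", has_vnni())
```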

Deploying AI in the Real World

Now that we've examined these shifts, let's revisit the question: How do I scale performant AI applications efficiently while staying on budget? The answer is clear and manageable:

  1. Get more performance from the Xeon foundation you know. Thanks to the new software and hardware enhancements discussed above, Intel Xeon Scalable processors have never been more performant for AI applications. Using the latest software libraries and compilers helps ensure optimal performance on existing hardware, and upgrading to 2nd Gen Intel Xeon Scalable processors provides a significant hardware-based performance improvement, thanks in part to Intel DL Boost's new Vector Neural Network Instructions (VNNI), which accelerate the INT8 arithmetic common in inference (see the quantization sketch just after this list).
  2. Accelerate with purpose, for continuous and intensive tensor compute. The most demanding part of deep learning compute is the arithmetic done on large multi-dimensional arrays called tensors. When an application needs continuous, intensive tensor arithmetic, a purpose-built deep learning accelerator is the right solution. These ASICs, designed to do this specific task extremely well, work in tandem with the host CPU, which offloads the intensive deep learning parts of the application to them.
  3. Keep the software environment updated and simple. Software is key! Using the latest software versions, libraries, and optimizations with deep learning frameworks (like TensorFlow, MXNet, PyTorch, and PaddlePaddle) will "unlock" the CPU hardware, including the features of newer Intel Xeon Scalable processor generations. We're also very focused on delivering a streamlined environment that connects popular deep learning frameworks like TensorFlow* to various hardware platforms, including CPUs, accelerators, and FPGAs (an example of this bridging pattern follows the quantization sketch below).
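
To make point 1 concrete, here is a minimal sketch that uses PyTorch's built-in dynamic quantization to convert a toy model's linear layers to INT8, the data type that Intel DL Boost's VNNI instructions accelerate on 2nd Gen Intel Xeon Scalable processors. Treat it as an illustration rather than a tuned deployment recipe; the actual speedup depends on the model, the software stack, and the hardware it runs on.

```python
import torch
import torch.nn as nn

# A toy model standing in for a real inference workload.
model = nn.Sequential(
    nn.Linear(1024, 1024),
    nn.ReLU(),
    nn.Linear(1024, 10),
)
model.eval()

# Dynamically quantize the Linear layers to INT8. On CPUs with Intel DL Boost,
# the INT8 kernels can take advantage of VNNI for higher inference throughput.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# Run a batch of dummy inputs through the quantized model.
with torch.no_grad():
    x = torch.randn(32, 1024)
    print(quantized(x).shape)  # expected: torch.Size([32, 10])
```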

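One way to see the framework-to-hardware bridging described in point 3 is Intel's OpenVINO toolkit, which consumes models exported from popular frameworks and compiles them for targets such as CPUs. The sketch below assumes the openvino Python package is installed and that a trained model has already been exported to a hypothetical model.onnx file; exact API details vary by OpenVINO version, so treat this as an outline of the pattern, not a complete deployment.

```python
import numpy as np
from openvino.runtime import Core

core = Core()
# "model.onnx" is a placeholder for a model exported from a framework such as
# TensorFlow or PyTorch; OpenVINO can also read its own IR (.xml/.bin) format.
model = core.read_model("model.onnx")
compiled = core.compile_model(model, "CPU")  # target the host CPU

# Build a dummy input matching the model's first input and run one inference.
shape = [int(d) for d in compiled.input(0).shape]
dummy = np.random.rand(*shape).astype(np.float32)
result = compiled([dummy])[compiled.output(0)]
print(result.shape)
```

The same compiled-model flow can target other devices by changing the device string, which is what keeps the software environment simple as the underlying hardware evolves.
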
Common Ground Across Industries

Many companies, representing a diverse range of applications, markets, data, and audiences, are using this three-part approach to deploy real-world AI today. Some are long-time users of Intel Xeon processors. Others are taking advantage of the new 2nd Gen Intel Xeon Scalable processors. With hardware and software optimizations targeting AI workloads, these CPUs deliver up to 14X higher inference throughput[1] than the previous generation.

Here are a few customers seeing great success deploying AI on Intel:

  • Philips – This Fortune 500 company cost-effectively deployed fast deep-learning inference on tens of thousands of servers and scanning machines already in the field.
  • Taboola – The world's largest content recommendation engine sped up inference by 257% while reducing planned infrastructure spend by scaling with CPUs instead of GPUs.
  • iFlyTek – This voice recognition leader in China phased out GPUs in favor of CPUs to process six billion transactions daily.
  • TACC (Texas Advanced Computing Center) – Their new Frontera system, built entirely around 2nd Gen Intel Xeon Scalable processors with Intel® Optane™ DC persistent memory, delivers 40 petaflops of peak performance to enable groundbreaking discoveries using massively parallel AI inference on HPC systems.

These collaborations, and many others like them, succeeded because the companies and academic institutions were able to meet performance demands and extend their existing solutions with AI capabilities while minimizing the cost of change. Another recurring theme was the flexibility to quickly adapt to new usages and opportunities.

Facebook: A Case Study in Acceleration

Because Intel® Xeon® Scalable processors are relied upon for so many other enterprise workloads, leveraging them for AI comes at minimal extra cost. Yet as AI matures, the path forward calls for decisions about when further acceleration is needed for intensive, continuous, high-volume tensor compute. Custom ASICs work hand-in-hand with Xeon-based infrastructure to offload and accelerate the intensive, tensor-based deep learning parts of the application, leaving the rest to run on the host CPU.

Customers like Facebook, whose deep learning demands are growing more intensive and sustained, are looking to augment their current CPU-based inference with this new class of accelerators, which offer very high concurrency across large numbers of compute elements (spatial architectures), fast data access, high-speed memory close to the compute, high-speed interconnects, and multi-node scaled solutions.

For this reason, Facebook has been a close collaborator with us on the Intel® Nervana™ Neural Network Processor for Inference (NNP-I 1000, codenamed Spring Hill), which goes into production later this year. As a leading community platform that connects nearly half the world, Facebook both relies on and helps drive substantial advancements in AI, including this new generation of power-optimized, highly tuned AI inference chips. We expect these chips to be a leap forward in inference application acceleration, delivering industry-leading performance per watt on real production workloads. The Intel Nervana NNP-I 1000 will be fully integrated with Facebook's Glow compiler to help keep their software environment simple and highly optimized.

Change is the Only Constant

The AI landscape is shifting constantly and quickly. What you couldn't do three years ago, you can do now. It's an exciting time to witness the impact of enterprise-scale inference deployments and advancements in both hardware and software, from devices to data centers. I can't wait to see what the next three years bring!

Notices and Disclaimers