Early machine learning was search-based and relied mainly on brute-force approaches with some optimization. As machine learning matured, it shifted toward statistical approaches and optimization problems that were ripe for acceleration. Deep learning also appeared, and it found an unlikely source of acceleration. Here, we’ll learn how modern machine learning found new methods to bring both scale and speed.
In part 1 of this series, we explored some of the history of AI and the journey from Lisp to modern programming languages and new paradigms of computational intelligence such as deep learning. We also discussed early artificial intelligence applications that relied on optimized forms of search, as well as modern neural network architectures that, trained on massive datasets, solve problems thought impossible a decade ago. The focus today is twofold: accelerating these applications even further and constraining them to power-optimized environments such as smartphones.
The focus of most acceleration today is on deep learning. Deep learning is a neural network architecture that relies on many layers of neural networks, where different layers can support different functions for feature detection. These deep neural networks rely on vector operations that benefit easily from parallelism. The architecture offers opportunities to distribute computation across the network’s layers and to parallelize computation across the many neurons within a layer, as the sketch below illustrates.
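To see why, consider a single fully connected layer: it is just a matrix product plus a bias, and every output value can be computed independently. The following minimal sketch illustrates the idea with NumPy (the batch and layer sizes here are arbitrary placeholders):

```python
# A single fully connected layer expressed as matrix math. Every one of the
# 64 x 512 outputs is an independent dot product, which is exactly the kind
# of work that thousands of parallel GPU cores can divide among themselves.
import numpy as np

batch = np.random.rand(64, 1024).astype(np.float32)     # 64 inputs, 1024 features each
weights = np.random.rand(1024, 512).astype(np.float32)  # one layer: 1024 inputs -> 512 neurons
bias = np.random.rand(512).astype(np.float32)

activations = np.maximum(0.0, batch @ weights + bias)   # ReLU over the matrix product
print(activations.shape)                                 # (64, 512)
```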
The unlikely source for accelerating deep learning applications was the graphics processing unit, or GPU. A GPU is a special-purpose device used to accelerate the construction of the frame buffer (memory) that is output to a display device, offloading the rendering of the image from the processor. GPUs are made up of thousands of independent cores that operate in parallel and perform specific types of computation, such as vector math. Although the original GPUs were designed solely for video applications, they were found to also accelerate operations common in scientific computing, such as matrix multiplication.
GPU vendors are happy to provide APIs that allow developers to integrate GPU processing into their applications, but this work can also be done through standard packages available for a variety of environments. The R programming language and environment includes packages that work with GPUs to accelerate processing, such as gputools, gmatrix, and gpuR. GPUs can also be used from Python through libraries such as the numba package or Theano; a sketch of the numba route follows below.
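Here is a minimal sketch of the Python route using numba’s CUDA support; it assumes a CUDA-capable GPU and the numba package are installed, and the array size is arbitrary:

```python
# Offload an element-wise vector operation to the GPU with Numba.
import numpy as np
from numba import cuda

@cuda.jit
def vector_add(a, b, out):
    i = cuda.grid(1)            # global thread index
    if i < out.size:            # guard threads beyond the array bounds
        out[i] = a[i] + b[i]

n = 1_000_000
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
out = np.zeros_like(a)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
vector_add[blocks, threads_per_block](a, b, out)   # Numba handles the host/device copies
```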
These packages make GPU acceleration for machine learning accessible to anyone with the motivation to use them. But more specialized approaches are also on the way. In 2019, Intel® purchased Habana Labs for $2 billion (USD). Habana Labs develops custom machine learning accelerator chips for servers. This was preceded in 2017 by Intel’s purchase of Mobileye and its self-driving chip technology for $15 billion (USD).
Beyond GPU acceleration in servers and desktops, accelerators for machine learning are finding their way past traditional platforms and into power-constrained embedded devices and smartphones. These accelerators take many forms, from USB sticks to APIs for smartphone neural network accelerators and vector instructions for deep learning acceleration.
Deep learning toolkits have made their way from PCs to smartphones, supporting networks that are more constrained. Frameworks such as TensorFlow Lite and Core ML are already being deployed on mobile devices for machine learning applications. Apple® recently released the A12 Bionic chip, which includes an 8-core neural engine that lets developers build more energy-efficient neural network applications. This will expand deep learning applications on Apple’s smartphones.
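As a rough sketch of how a model reaches one of these constrained targets, the snippet below converts a small placeholder Keras model to the TensorFlow Lite format with post-training optimization enabled (the layer sizes are arbitrary, and a real model would be trained first):

```python
# Convert a Keras model into a compact TensorFlow Lite model
# suitable for deployment on a smartphone.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(64,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]   # post-training quantization
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```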
Google released the Neural Networks API (NNAPI) with Android® 8.1 to expose on-device machine learning capabilities. It is already in use by Google Assistant for natural language processing and by the Google Lens app for image recognition. NNAPI is similar to other deep learning toolkits, but it is built for the Android smartphone environment and its resource constraints.
Intel released an updated version of its Neural Compute Stick, which accelerates deep learning applications in the form factor of a USB stick. It can be used with models from machine learning frameworks such as TensorFlow, Caffe, and PyTorch. The device is an interesting choice when a GPU isn’t available, and it also enables rapid prototyping of deep learning applications.
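A rough sketch of pointing a converted model at the stick is shown below; it assumes Intel’s OpenVINO toolkit and its earlier Inference Engine Python API, and the model file names and input shape are placeholders:

```python
# Load a model converted to OpenVINO's IR format and run it on the
# Movidius VPU inside the Neural Compute Stick (the "MYRIAD" device).
import numpy as np
from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network(model="model.xml", weights="model.bin")
exec_net = ie.load_network(network=net, device_name="MYRIAD")

input_name = next(iter(net.input_info))                 # name of the model's first input
dummy_image = np.zeros((1, 3, 224, 224), dtype=np.float32)
result = exec_net.infer({input_name: dummy_image})      # inference runs on the stick
```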
Finally, even as machine learning computation has moved off the CPU and onto the GPU, Intel has extended its Xeon instruction set with new instructions to accelerate deep learning. These Vector Neural Network Instructions (VNNI), part of the AVX-512 extension, improve the throughput of convolutional neural network operations.
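Conceptually, a VNNI instruction fuses what previously took several instructions: multiplying 8-bit values and accumulating the products into a 32-bit total. The snippet below is only a NumPy illustration of that arithmetic, not VNNI code itself:

```python
# What a VNNI dot-product instruction accomplishes across a vector register:
# multiply unsigned 8-bit activations by signed 8-bit weights and accumulate
# the products into 32-bit integers in a single fused step.
import numpy as np

activations = np.random.randint(0, 256, 64, dtype=np.uint8)
weights = np.random.randint(-128, 128, 64, dtype=np.int8)

acc = np.dot(activations.astype(np.int32), weights.astype(np.int32))
print(acc)   # the 32-bit accumulation VNNI performs in hardware
```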
The application of GPUs to machine learning created the ability to build and deploy massive deep neural networks for a variety of applications, and machine learning frameworks have made it simple to build deep learning applications on top of them. Not to be outdone, smartphone vendors have integrated power-efficient neural network accelerators for constrained applications (along with APIs for custom application use). Other accelerators now offload work to USB hardware, and many new startups are digging into this accelerator space for future machine learning applications.
M. Tim Jones is a veteran embedded firmware architect with over 30 years of architecture and development experience. Tim is the author of several books and many articles across the spectrum of software and firmware development. His engineering background ranges from the development of kernels for geosynchronous spacecraft to embedded systems architecture and protocol development.