In 2016, the artificial intelligence program AlphaGo defeated South Korean Go master Lee Sedol. Go is a two-player abstract strategy board game in which the objective is to surround more territory than the opponent with game pieces. In 2017, a new version of AlphaGo with improved playing capabilities (AlphaGo Master) defeated Ke Jie, the world's top-ranked Go player. Rather than relying on previously established strategies to defeat human opponents, this novel artificial intelligence (AI) system appeared to surpass human cognition in certain specific areas and even to exhibit something resembling thought, narrowing the gap between the current state of the art and what popular culture imagines AI could be.
AI has long seemed like distant science fiction, yet many applications of the technology in today's world have already reached a level worthy of the name. In addition to the aforementioned Go software, a slew of AI systems have recently been deployed to great effect, including automated driving systems, smart home assistants, and Siri, the voice assistant bundled with Apple's smartphones. The core algorithm behind these applications is deep learning, one of the hottest branches of machine learning. Unlike other machine-learning algorithms, deep learning relies on iterative training over large volumes of data to discover features in the data and produce the corresponding results. Many of these learned features have already surpassed the expressive power of human-designed features, allowing deep learning to significantly outperform other machine-learning algorithms, and in many tasks even humans themselves.
However, deep learning has not yet surpassed humans in all respects. On the contrary, it remains entirely dependent on human design decisions about the algorithms it uses. It has taken roughly 50 years since its inception for deep learning to bear the fruit we see today. The history of its development offers a glimpse into the meticulous ingenuity of computer scientists and presents an opportunity to discuss where this exciting field could be headed.
Deep learning leverages what is known as an artificial neural network (ANN). Although neural network algorithms derive their name from the fact that they simulate how animal neurons transmit information, the term deep learning comes from the multi-layered cascade of neurons involved—a multitude of layers that allow for depth to be achieved in the transmission of information.
In animals, nerves connect to receptors at one end and to the animal's cortex at the other, and signals are conducted through multiple layers of neurons in the middle. Neurons are not actually connected one-to-one but have multiple connections (such as radiative and convergent connections), resulting in a network structure. This rich structure ultimately enables the extraction of information and the production of the corresponding cognition in the animal brain. The learning process found in animals requires the integration of external information within the brain. External information enters the nervous system, which, in turn, becomes a signal that can be received by the cortex. The signal is compared to existing information in the brain so that complete cognition can thereby be established.
Similarly, using computer programming techniques, computer scientists build layers of functions with adjustable parameters (weights) to simulate the internal operations of neurons, stack non-linear operations on top of one another to simulate the connections between neurons, and ultimately re-integrate the information to produce a classification or prediction as output. To deal with the difference between the network's output and real-world results, the network adjusts the corresponding weights layer by layer along a gradient, reducing that difference and thereby achieving deep learning.
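For readers who think in code, the following is a minimal Python sketch of that idea (the sizes and variable names are invented for illustration, not taken from any system discussed here): each layer is a weighted function, a non-linear operation connects one layer to the next, and the stacked layers turn an input into class scores that training would later push toward the true labels by adjusting the weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(inputs, weights, bias):
    # Weighted sum followed by a non-linear operation (here, a sigmoid).
    return 1.0 / (1.0 + np.exp(-(inputs @ weights + bias)))

# Hypothetical input: one sample with 4 features.
x = rng.normal(size=(1, 4))

# Two stacked layers of weights; their values are what training would adjust.
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 2)), np.zeros(2)

hidden = layer(x, W1, b1)       # first layer of simulated "neurons"
scores = layer(hidden, W2, b2)  # information re-integrated into two class scores
print(scores)                   # two values between 0 and 1; training would push
                                # them toward the true labels via gradients
```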
It might come as a surprise that simulating neural activity is not the exclusive preserve of deep learning. As early as 1957, Frank Rosenblatt introduced the concept of the perceptron, effectively a single-layer neural network capable of distinguishing between only two types of results. The model is very simple: the relationship between the output and the input information is essentially a weighted sum. Although the weights adjusted automatically according to the difference between the output and the true value, the learning capacity of the entire system was limited, and it could be used only for simple data fitting.
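As a concrete illustration, here is a small hedged sketch of such a perceptron in Python (the toy data and learning rate are made up for the example): the prediction is a thresholded weighted sum, and each weight is nudged in proportion to the difference between the prediction and the true label.

```python
import numpy as np

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])  # toy inputs
y = np.array([0, 0, 0, 1])                              # e.g. the logical AND of the inputs

w = np.zeros(2)   # weights of the single layer
b = 0.0
lr = 0.1          # learning rate

for epoch in range(10):
    for xi, target in zip(X, y):
        prediction = 1 if xi @ w + b > 0 else 0   # weighted sum + threshold
        error = target - prediction               # difference from the true value
        w += lr * error * xi                      # adjust weights directly by the error
        b += lr * error

print(w, b)   # weights that separate the two classes for this simple, linearly separable data
```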
At almost the same time, significant advances appeared in neuroscience. Studies conducted by the neuroscientists David Hubel and Torsten Wiesel on the visual nervous system of cats confirmed that responses to different visual features in the cortex are handled by different cells. In their model, simple cells perceived light information, while complex cells perceived movement information.
Inspired by this, the Japanese scholar Kunihiko Fukushima proposed the neocognitron network model in 1980 to recognize handwritten numbers (Figure 1). The network was divided into multiple layers, each consisting of one type of neuron, and two types of neurons were used alternately to extract and combine graphical information. These two types of neurons have since evolved into the convolution layer and the pooling layer, which remain of great importance today. However, the neurons in this network were designed by hand and did not adjust automatically in response to the results, so the network was not capable of learning and was limited to the rudimentary task of recognizing a small set of simple numbers.
Figure 1: The working mechanism of the neocognitron (Source: Fukushima, Kunihiko. "Neocognitron: A hierarchical neural network capable of visual pattern recognition." Neural Networks 1.2 (1988): 119–130)
When the capacity to learn could not be achieved, additional manual design was required in lieu of a true self-learning network. In 1982, the American scientist John Hopfield invented a neural network with several restrictions that allowed it to maintain memory amid changes, thus facilitating learning. In the same year, the Finnish scientist Teuvo Kohonen proposed the self-organizing map, a neural network based on unsupervised vector quantization (closely related to his learning vector quantization network), in the hope of learning correct relationships from complex data by minimizing the Euclidean distance between input and output. In 1987, the American scientists Stephen Grossberg and Gail Carpenter proposed their adaptive resonance theory network based on earlier theories: known and unknown information are allowed to resonate so that the unknown can be inferred from the known, achieving a kind of analogical learning. Although these networks carried keywords such as self-organizing, adaptive, and memory, their learning methods were not efficient. They required continuous, application-specific tuning of the design, which, coupled with the small memory capacity of the networks, made them difficult to apply in practice.
It took the publication of the backpropagation algorithm by the computer scientists David Rumelhart, Geoffrey Hinton, and Ronald Williams in 1986 to gradually resolve the problem of neural network learning. The difference between the network's output and the true value could now be fed back into each layer's weights through the chain rule applied to the gradient, effectively allowing every layer to be trained in the same manner as a perceptron. This was Geoffrey Hinton's first milestone in the field. Today, he is an engineering fellow at Google and a recipient of the Turing Award, the highest honor in computer science.
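The following toy Python sketch (invented data and layer sizes, not the 1986 implementation) shows the mechanism: the error at the output is converted into gradients by the chain rule and passed backwards so that both layers' weights receive an update.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))                            # toy inputs
y = (X.sum(axis=1, keepdims=True) > 0).astype(float)   # toy binary targets

W1, b1 = rng.normal(scale=0.5, size=(3, 6)), np.zeros(6)
W2, b2 = rng.normal(scale=0.5, size=(6, 1)), np.zeros(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for step in range(2000):
    # Forward pass through both layers.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Chain rule: the output error is converted into gradients, layer by layer.
    d_out = (out - y) * out * (1 - out)   # error at the output layer
    d_h = (d_out @ W2.T) * h * (1 - h)    # error propagated back to the hidden layer

    # Each layer's weights are adjusted against its own gradient.
    W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;    b1 -= lr * d_h.sum(axis=0)
```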
"We didn't want to build a model to simulate the way the brain works,” Hinton says. “We would observe the brain and, at the same time, think that since the brain works as a viable model, we should look to the brain for inspiration if we want to create some other viable model. It is precisely the feedback mechanism of the brain that the backpropagation algorithm simulates."
Thereafter, in 1994, the computer scientist Yann LeCun, a former postdoctoral fellow in Geoffrey Hinton's group, combined the ideas of the neocognitron with the backpropagation algorithm to create LeNet, a convolutional neural network for recognizing handwritten ZIP codes. It achieved 99 percent automatic recognition and could handle almost any form of handwriting. The algorithm was a huge success and was put to use by the US postal system.
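A LeNet-style network can be sketched in a few lines with a modern framework such as PyTorch. The layer sizes, activations, and pooling choices below are illustrative rather than those of the original LeNet, but the basic recipe is the same: alternating convolution and pooling layers followed by fully connected layers that output ten digit scores.

```python
import torch
import torch.nn as nn

class SmallLeNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),   # feature extraction (convolution)
            nn.ReLU(),
            nn.MaxPool2d(2),                  # spatial pooling
            nn.Conv2d(6, 16, kernel_size=5),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 4 * 4, 120),
            nn.ReLU(),
            nn.Linear(120, 10),               # one score per digit, 0 through 9
        )

    def forward(self, x):
        return self.classifier(self.features(x))

digits = torch.randn(8, 1, 28, 28)            # a fake batch of 28x28 digit images
print(SmallLeNet()(digits).shape)             # torch.Size([8, 10])
```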
Despite these achievements, deep learning did not garner significant popularity until much later. One reason is that neural networks involve a very large number of parameters to update (AlexNet, proposed in 2012, alone has 650,000 neurons and 60 million parameters) and therefore require substantial data and computational power (Figure 2). Furthermore, attempting to reduce the amount of data and training time by reducing the number of layers renders deep learning less effective than other machine-learning methods (such as support vector machines, which became hugely popular around 2000). A 2006 paper by Geoffrey Hinton was the first to use the name deep belief nets, and in it Hinton provided a way to optimize the entire neural network. Although this laid the foundation for the future popularity of deep learning, the reason deep network was used instead of the earlier neural network moniker was that mainstream research journals remained averse to the term neural network, even going so far as to reject submissions upon seeing the word in a paper's title.
A major turnaround in deep learning occurred in 2012. Scientists in the field of computer vision had become aware of the importance of data size. In 2010, Li Fei-Fei, an associate professor of computer science at Stanford University, released ImageNet, a database of more than ten million manually tagged images; its classification challenge covers 1,000 categories, including animals, plants, everyday objects, and other areas. From 2010 to 2017, computer vision experts held annual classification competitions based on these images, and ImageNet became a touchstone for machine-learning and deep-learning algorithms in global vision research. In 2012, one of Geoffrey Hinton's students at the University of Toronto, Alex Krizhevsky, won the ImageNet classification competition with a neural network trained on two NVIDIA graphics cards (GPUs), substantially outperforming the recognition rate achieved by the second-place entrant. The network was subsequently named AlexNet. This was the beginning of deep learning's rapid ascent.
Figure 2: The network structure of AlexNet (Source: Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." Advances in Neural Information Processing Systems, 2012)
Starting with AlexNet, and with data support from ImageNet and computational support from commercial graphics cards, research on neural network architectures gradually exploded. First, implementing deep learning became much easier thanks to the release of several software packages (such as Caffe, TensorFlow, and Torch). Second, subsequent iterations of the ImageNet classification competition and the COCO competition, whose tasks involved more complex image segmentation and description, gave rise to VGGNet, GoogLeNet, ResNet, and DenseNet. The number of layers in these networks grew steadily, from 8 in AlexNet and 19 in VGGNet to 152 in ResNet and more than 200 in DenseNet, thereby achieving truly deep learning. On certain classification benchmarks, these deep neural networks even surpassed human recognition accuracy (the human top-5 error rate on ImageNet is about 5 percent, while SENet reached 2.25 percent). This is further demonstrated in Table 1 below:
Table 1: A summary of the best performing networks in previous ImageNet image classification competitions (Source: Author calculated from the original papers based on https://github.com/sovrasov/flops-counter.pytorch).
Year | Network   | Top-5 error rate | Number of layers | Number of parameters
2012 | AlexNet   | 15.32%           | 8                | 60M
2013 | ZFNet     | 13.51%           | -                | -
2014 | VGGNet    | 7.32%            | 16               | 138M
2014 | GoogLeNet | 6.67%            | 22               | 7M
2015 | ResNet    | 3.57%            | 152              | 44M
2016 | ResNeXt   | 3.03%            | -                | -
2017 | SENet     | 2.25%            | 154              | 67M
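As a rough, hedged cross-check of parameter counts like those in Table 1, the torchvision library ships reference implementations of several of these networks, and their parameters can be tallied directly; note that the library versions may differ slightly from the original competition entries.

```python
import torchvision.models as models

for name, builder in [("AlexNet", models.alexnet), ("VGG-16", models.vgg16)]:
    model = builder()   # build the architecture only; no pretrained weights are downloaded
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters")   # roughly 61M and 138M respectively
```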
Since this breakthrough, computer scientists have increasingly used neural network algorithms to solve problems. In addition to the aforementioned applications in the classification, segmentation, and detection of two-dimensional images, neural networks have also come into use for temporal signals and even in unsupervised machine learning. A recurrent neural network (RNN) can receive signal inputs in chronological order: each layer of neurons can compress and store memories, and the network can extract useful dimensions from those memories to perform speech recognition and text comprehension. When neural networks are used for unsupervised learning, rather than extracting principal components or eigenvalues, raw information can be automatically reduced in dimension and its features extracted simply by using an autoencoder built from a multilayer network. Combining such features with a vector quantization network makes it possible to cluster them and obtain classification results with little or no labeled data. Neural networks have become the undisputed leader in terms of both effectiveness and scope of application.
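As one example of the unsupervised use described above, here is a hedged PyTorch sketch of an autoencoder (all sizes and the random stand-in data are invented): an encoder squeezes 784-dimensional vectors down to 8 numbers, a decoder learns to reconstruct the input, and no labels are involved.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 8))
decoder = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 784))
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

data = torch.rand(256, 784)                  # stand-in for unlabeled image vectors

for step in range(100):
    codes = encoder(data)                    # compressed 8-dimensional features
    reconstruction = decoder(codes)
    loss = nn.functional.mse_loss(reconstruction, data)   # reconstruction error only
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The 8-dimensional codes could then be clustered (e.g. with k-means) to obtain
# groupings without any labeled data, as the paragraph above suggests.
```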
In 2017, the organizers of the ImageNet image classification contest announced its final competition. But this did not herald the end of deep learning. On the contrary, research and applications had effectively moved beyond the classification-problem stage and entered a second stage of broad development. At the same time, the exponential year-on-year increase in the number of manuscripts submitted to international conferences related to deep learning indicated that more and more researchers and engineers were devoting their efforts to developing and applying deep-learning algorithms. Over the next few years, the development of deep learning will primarily follow the trends below.
First, structurally speaking, the types of neural networks in use will become more diverse. Generative adversarial networks (GANs), which in effect run a convolutional neural network in reverse, have developed rapidly since they were first proposed in 2014 and have become an important growth area for deep learning. If deep-learning algorithms can extract features from raw information (such as images), then the inverse process should be feasible as well: it should be possible to generate a corresponding image from a cluttered signal via a suitable neural network. It was with this insight that the computer scientist Ian Goodfellow proposed the concept of generative adversarial networks. Such a network pairs the generator that produces images with a discriminator. During training, the generator learns to produce images so close to real ones that they are difficult to tell apart, while the discriminator learns an increasingly robust ability to distinguish real images from generated ones. As the two learn against each other, the more realistic the generated images become, the harder discrimination becomes; conversely, the stronger the discriminator, the more the generator is driven to produce new, more realistic images. Generative adversarial networks have a wide range of applications, from face generation and recognition to image super-resolution, video frame-rate improvement, image style transfer, and other fields.
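The adversarial game can be captured in a short, hedged PyTorch sketch (toy one-dimensional data and arbitrary network sizes, not Goodfellow's original setup): the discriminator is trained to score real samples high and generated ones low, and the generator is then trained to make the discriminator score its outputs as real.

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 32))   # generator
D = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))    # discriminator
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real_data = torch.randn(128, 32) + 3.0       # stand-in for real samples

for step in range(200):
    # 1) Train the discriminator to tell real samples from generated ones.
    noise = torch.randn(128, 16)
    fake = G(noise).detach()                 # freeze the generator while updating D
    d_loss = bce(D(real_data), torch.ones(128, 1)) + \
             bce(D(fake), torch.zeros(128, 1))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # 2) Train the generator to fool the discriminator.
    fake = G(torch.randn(128, 16))
    g_loss = bce(D(fake), torch.ones(128, 1))   # label the fakes as "real" for G's update
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
```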
Second, the research questions being pursued are becoming more diverse. On the one hand, concepts from other branches of machine learning, such as reinforcement learning and transfer learning, have found a new place in deep learning. On the other hand, research into deep learning is evolving from engineering trial and error toward theoretical deduction. Deep learning has been criticized for its lack of theoretical underpinning, relying almost entirely on data scientists' experience during training. To reduce the influence of experience on the results and cut the time spent selecting hyperparameters, researchers are working to improve the fundamental efficiency of deep learning in addition to modifying the classical network structures. Some researchers are attempting to connect deep learning with other machine-learning methods (such as compressed sensing and Bayesian theory) to help it move from engineering trial and error to theory-guided practice. There have also been efforts to explain why deep-learning algorithms work rather than treating the entire network as a black box. At the same time, researchers have been formulating the selection of hyperparameters as a machine-learning problem in its own right, known as meta-learning, in an attempt to reduce the difficulty and randomness of the hyperparameter selection process.
Third, thanks to a recent large infusion of fresh research results, more algorithms are finding their way into commercial products. In addition to small companies developing image-generation applets, large companies are competing for territory in the deep-learning space. The Internet giants Google, Facebook, and Microsoft have all set up deep-learning development centers, and their Chinese counterparts Baidu, Alibaba, Tencent, JD.com, and ByteDance have each established deep-learning research centers of their own. Several unicorn companies built on deep-learning technology, such as DeepMind, SenseTime, and Megvii, also stand out from a large field of competitors. Since 2019, industry-related deep-learning research has gradually shifted from publishing papers to deploying real-world projects; for example, Tencent AI Lab uses deep learning to optimize video playback, and YITU's lung-nodule screening system has been piloted in some hospitals in China.
Fourth, with the gradual spread of 5G technology, deep learning will become embedded in daily life alongside cloud computing. Deep learning has often struggled to get off the ground for lack of computing resources: a GPU-equipped supercomputer can cost as much as ¥500,000 (RMB), roughly US$70,000, and not every company has the capital and talent to make full use of such equipment. With the proliferation of 5G and cloud computing, however, companies can now rent computing resources directly from the cloud at low cost, uploading data and receiving results back in near real time. A host of emerging startups are building on this infrastructure, assembling teams of computer scientists and data scientists to provide deep-learning algorithmic and hardware support to other companies. Companies in industries that previously had little to do with computer technology (such as manufacturing, services, entertainment, and even the legal industry) therefore no longer need to define their problems and develop solutions entirely on their own; instead, they can partner with algorithm companies, draw on the computer technology industry's expertise, and become empowered by deep learning.
For more than 50 years, deep learning has evolved from prototype to maturity and from the simple to the complex. A great deal of theoretical and technological experience has been accumulated in academia and industry. The direction of current development is now more diversified than ever before. This is partly because many corresponding products have entered the research and development phase and because computer scientists are moving forward with even more detailed research on deep learning.
Naturally, as a comprehensive discipline, deep learning has also borne fruit in the areas of speech analysis and natural language processing in addition to its core history of development in the area of image recognition. At the same time, combining multiple neural networks and multimedia formats is rapidly becoming a hot area of research. For example, automatic image captioning, which combines image and language processing, is a challenging problem.
It should also be noted that deep learning is not the only way to implement neural networks. Some network structures that are not as widely used at this stage, such as adaptive resonance networks, Hopfield networks, and restricted Boltzmann machines, might also someday drive the entire industry further forward. What is certain is that while deep learning remains seemingly cloaked in an impenetrable aura of sophistication and mystery for the time being, in the near future, this sci-fi concept is poised to become a fundamental technology for many companies large and small.
Wang Dongang is a PhD candidate at the University of Sydney. His research involves medical imaging, artificial intelligence, neuroscience, and video analysis, and he is dedicated to bringing machine-learning techniques into everyday applications. He has published papers at top international conferences, including CVPR and ECCV, and serves as a reviewer for journals including IEEE Transactions on Circuits and Systems for Video Technology and IEEE Transactions on Multimedia, as well as for conferences including AAAI and ICML. He is experienced in developing machine-learning and computer-vision algorithms and has collaborated with companies and institutes in China, the US, and Australia on projects including multi-view action recognition, road management based on surveillance video, and automated triage for brain CT.