Chapter 14: A Perseverance with Simple Methods
Neuroscientists discover that deep neural networks, the workhorse methods of AI, allow them to make the first mockups of the brain's language regions
Note: This chapter is part of the proof-of-concept material for a book, AI: How We Got Here—A Neuroscience Perspective. If you like it, please consider supporting the Kickstarter project.
In the last chapter, we explored how neuroscientists had, over the last decade, begun to reach a consensus that a dedicated region of the brain for processing language existed. But at that time, most neuroscientists would have found the idea of building a mockup of this brain region inconceivable. They could barely even agree that it existed.
But in the period from 2012 to 2014, AI scientists and neuroscientists had made some remarkable progress in a different area. They had built something that was a little like a mockup of the visual cortex. Many started to wonder how they might use the same methods to tackle the problem of language, even if few believed that those methods would take them all the way to building mockups of the brain region now known as the language network.
In particular, some AI specialists from the field of natural language processing, an area with a very different focus from neuroscience, kickstarted the process. As a field, they would take the lead in blazing a pathway forward. Eventually, the work from both fields would begin to raise the question of whether it might be possible, after all, to build something like a mockup of the language network.
The elements of the new AI approach, which had proven so successful in vision, were (perhaps a little ironically) already familiar to many NLP practitioners. They had first made serious efforts to use the equations known as neural networks to process language in the 1980s. While they hadn't achieved any dramatic successes, they had already thought deeply about the schematics of the problem, and how to apply the methods.
When it came right down to it, you only needed four main ingredients to use neural networks to try to solve a problem, whether in vision, language, or another area. First, there were the equations of the neural networks, which could be seen as representing the way that signals flowed through many interconnected conduits and buckets. Second, there was the data, which, after you converted it into input signals, you would process by sending it through the conduits. Third, there was the goal that you wanted the network to accomplish, like filling only one specific output bucket when an input image contained one specific object. Fourth, there was the algorithm that you used to train the network, adjusting the valves on the conduits so the task could be better accomplished.
In the 1980s and 1990s, scientists had explored a variety of ways to adapt these methods for language. Some of them were quite straightforward. For example, suppose you wanted a neural network that could process a single word, drawn from a dictionary of 500,000 words, and produce one word in response as output. If you created a neural network with a first layer of 500,000 buckets and a final layer of 500,000 buckets, it was trivial to do this; you just converted the input word into an input signal that filled only a single one of the initial 500,000 buckets. Then, the neural network would process that input signal to fill a variety of output buckets.
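To make this concrete, here is a minimal sketch of such a network in PyTorch. Everything in it, from the five-word toy vocabulary standing in for the 500,000-word dictionary to the variable names, is an illustrative invention, not code from any historical system:

```python
import torch
import torch.nn as nn

# A toy dictionary of five words, standing in for the 500,000 in the text.
vocab = ["the", "cat", "sat", "on", "mat"]
V = len(vocab)

# A single layer of valves connecting 5 input buckets to 5 output buckets.
net = nn.Linear(V, V)

# Convert the input word "cat" into a signal that fills exactly one bucket.
x = torch.zeros(V)
x[vocab.index("cat")] = 1.0

# The network processes the signal, filling each output bucket by some amount.
scores = net(x)
print(vocab[scores.argmax().item()])  # an arbitrary word, until trained
```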
As described, this network would only process an input word in a way that was random or haphazard. But suppose you collected a bunch of text from the internet, and stored every two-word pair from it. Then, you could use this data to optimize the neural network; each time the output word for a given input word was different than expected, the training algorithm told you exactly how to adjust all the valves to make the output a little more accurate.
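Continuing the same toy sketch, training might look like the following. Again, the word pairs, learning rate, and number of passes are all made-up illustrations of the general recipe, not the specifics of any real system:

```python
import torch
import torch.nn as nn

vocab = ["the", "cat", "sat", "on", "mat"]
V = len(vocab)
net = nn.Linear(V, V)

# Two-word pairs as harvested from text: (input word, expected output word).
pairs = [("the", "cat"), ("cat", "sat"), ("sat", "on"), ("on", "the")]

optimizer = torch.optim.SGD(net.parameters(), lr=0.5)
loss_fn = nn.CrossEntropyLoss()

for _ in range(200):  # make many passes over the collected pairs
    for w_in, w_out in pairs:
        x = torch.zeros(1, V)
        x[0, vocab.index(w_in)] = 1.0
        target = torch.tensor([vocab.index(w_out)])
        loss = loss_fn(net(x), target)  # how wrong was the output word?
        optimizer.zero_grad()
        loss.backward()   # compute exactly how each valve contributed
        optimizer.step()  # nudge every valve to be a little more accurate

# After training, the input "cat" now reliably produces "sat" as output.
x = torch.zeros(1, V)
x[0, vocab.index("cat")] = 1.0
print(vocab[net(x).argmax().item()])
```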
A host of limitations with this approach, when applied to language, forced scientists to get more creative. For example, a neural network as just described would generate an output word based only on a single word of input. But as we know, when it comes to language, the latest word in a sentence is often connected to a long sequence of words that come before it.
One of the most notable AI scientists to attack this sort of problem was Jürgen Schmidhuber, who, with his students, developed a wide variety of what are now textbook-standard neural networks. Among them were refinements of the recurrent neural network, a kind of network that possessed something a little like a memory. If you gave such a network a sequence of words as input, one at a time, then the output word it produced at each step would be influenced by all the previous words you had already given it.
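A sketch of the idea, using PyTorch's built-in version of one such recurrent network (the LSTM), with toy, untrained dimensions of my own choosing:

```python
import torch
import torch.nn as nn

V = 5  # toy vocabulary size
H = 8  # size of the network's internal memory (its "hidden state")

embed = nn.Embedding(V, H)  # word -> input signal
rnn = nn.LSTM(H, H)         # a recurrent layer that carries a memory
readout = nn.Linear(H, V)   # memory -> output word scores

hidden = None  # the memory starts out empty
for word_id in [0, 1, 2]:   # feed three words, one at a time
    x = embed(torch.tensor([word_id])).unsqueeze(0)
    out, hidden = rnn(x, hidden)  # `hidden` accumulates everything seen so far

# The output now depends on all three input words, not just the last one.
scores = readout(out.squeeze(0))
print(scores.argmax(dim=-1).item())
```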
While AI scientists were creating these new neural networks, they were also developing a new tradition for neural network science. They were making it into a science of performance. For the most part, scientists like Schmidhuber weren't thinking deeply about the brain, or about whether their new equations, or the putative reality they represented, were at all biologically coherent. And even when they were, they didn't bother exploring the connection. They were miles away from laboratory experiments. As such, in this period, neural networks increasingly stopped being seen as models of the biological, especially by AI practitioners. The two camps became distant.
For these reasons, AI scientists who studied neural networks advanced in a kind of vacuum, mostly insulated from neuroscience. But although this stance may have been a little naive or myopic, it wasn't unproductive. As we have already witnessed, by 2012, they had established that the same basic AI approaches were capable of recognizing objects.
Remarkably, the breakthrough neural networks were not even very innovative. Scientists had managed to get them to achieve much higher performance simply by training them on greater amounts of data, using much greater computing resources. In AI, the greatest insight therefore seemed to lie in persevering with the older methods. As NLP scientists and AI scientists witnessed this success unfolding, they found many reasons to be optimistic.
It was true, however, as we discussed in the last chapter, that the language problem was also uniquely difficult. Seen purely as a computational problem, a single language contained huge numbers of words. And many words at once mattered when trying to process a language input.
In 2017, in searching for efficiency improvements, AI researchers at Google invented a new kind of neural network, which they called the transformer.1 Its mathematics were unusually sophisticated. Not only did they represent many simple, signal-carrying conduits, but there was also mathematics that had no immediately apparent biological interpretation, continuing in the tradition of pursuing pure performance. The researchers seized on this mathematics because they could see it provided a more efficient way of taking into account a long sequence of words as input. They called it an attention mechanism, although to be clear, they understood next to nothing about how brain-based attention worked.
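At its core, the attention computation can be sketched in a few lines. The dimensions below are toy values of my own choosing, and the weights are random rather than trained; the point is only the shape of the calculation:

```python
import torch
import torch.nn.functional as F

# A toy sequence: six input words, each represented by a four-number signal.
seq_len, d = 6, 4
x = torch.randn(seq_len, d)

# Each word is projected into a "query", a "key", and a "value".
Wq, Wk, Wv = torch.randn(d, d), torch.randn(d, d), torch.randn(d, d)
Q, K, V = x @ Wq, x @ Wk, x @ Wv

# Every word scores every other word: how much does each one matter here?
scores = Q @ K.T / d ** 0.5
weights = F.softmax(scores, dim=-1)

# Each output mixes the whole sequence at once, weighted by attention.
# No step-by-step memory is required, which is what made it so efficient.
out = weights @ V
print(out.shape)  # (6, 4): one updated signal per input word
```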
The transformer architecture would soon have a great impact. Around the time it was invented, one of the young lead scientists of the 2012 breakthroughs in AI research, Ilya Sutskever, had co-founded a company called OpenAI, taking the role of its chief scientist. Shortly thereafter, in 2019, he and other OpenAI researchers showed they could get a specific instance of the transformer, which they called GPT-2, to generate much improved language responses.
Then, in May 2020, the same company released a language model known as GPT-3 that practically solved many of the classic problems of language.2 It was an odd discovery, at once prosaic, using historically well-established methods, yet achieving radically better performance. GPT-3 blew the non-neural-network methods completely out of the water, and within the next several years, the technology would become celebrated the world over.
Although the results of AI captured all the headlines, neuroscientists were following right on the heels of these AI scientists. They included DiCarlo himself, and others who, like Sutskever, had been directly involved with the vision breakthroughs. In particular, DiCarlo's student Martin Schrimpf, now a professor at EPFL-Lausanne, had decided to take on the challenge of applying the AI approach uncovered by DiCarlo to language. Then, there were other neuroscientists, like Evelina Fedorenko, Alexander Huth, and Jean-Remi King, who had been studying the language function of the brain for many years, and who had become early AI adopters.
Shortly after GPT-3 was released, in 2021, they began finding startling commonalities between these modern neural networks and the brains of humans. In one study, researchers measured the MRI signals from 102 different Dutch subjects as they read short Dutch-language sentences of 9 to 15 words in length.3 They wanted to see if they could understand the signals in the subjects' brains by comparing them to the signals in one of these neural networks, trained on language.
To do so, they fed portions of the same sentences into a range of neural networks. For example, they might give the first six words of a sentence as input to a model similar to GPT-2, which would process those words to generate another word as output. In the act of doing so, it would generate patterns of partially or fully filled buckets in the layers of the neural network.
Remarkably, they found that the patterns of signals in the MRI data and in the network's layers were highly correlated. In fact, they were about as highly correlated as the activations in one subject's brain were with those in another's. The researchers also noticed that these correlations became stronger for neural network models that performed better at predicting the next word.
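In spirit, the comparison works something like the sketch below. The arrays here are random stand-ins for real recordings, and the sizes are invented; the published studies used carefully fitted linear mappings and far more rigorous statistics:

```python
import numpy as np

n_sentences = 100  # stimuli presented to both the brain and the model
n_voxels = 500     # MRI measurement sites (an invented, stand-in number)
n_units = 768      # buckets in one layer of a GPT-2-sized network

# Hypothetical recordings: one row of measurements per sentence.
brain = np.random.randn(n_sentences, n_voxels)
layer = np.random.randn(n_sentences, n_units)

# Fit a linear map from the network's buckets to each brain site using half
# the sentences, then test how well it predicts the held-out half.
train, test = slice(0, 50), slice(50, 100)
weights, *_ = np.linalg.lstsq(layer[train], brain[train], rcond=None)
predicted = layer[test] @ weights

# Score each site: the correlation between predicted and measured signals.
r = [np.corrcoef(predicted[:, v], brain[test, v])[0, 1] for v in range(n_voxels)]
print(np.mean(r))  # near zero for random data; strikingly high in real studies
```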
This sequence of discoveries provided a vivid echo of what had played out in vision. The better a neural network performed at the macroscopic task, in this case the task of generating word responses, the better it served as a model of microscopic brain signals. Neural networks were capturing something meaningful about the brain; perhaps they were even serving as a kind of synthetic language network, something that just a few years earlier would have seemed ludicrous.