https://www.nature.com/articles/s41567-023-02013-7
Define the IQ of a chatbot
14 April 2023
If you’re doubtful of the achievements of the new generation of chatbots such as OpenAI’s ChatGPT or Google’s LaMDA, consider the answer one gave to a situation posed to it last year by computer scientist Blaise Agüera y Arcas. He asked LaMDA some questions after prompting it with the following scenario: Ramesh, Mateo, and Lucy are in their kindergarten’s playground. Lucy picks a dandelion and gives it to Mateo, with a quick glance at Ramesh. Mateo barely acknowledges the gift, but just squishes it in his fist. Ramesh seems grimly satisfied.
The computer scientist asked: What might be going through Lucy’s head? LaMDA answered: Lucy may feel slighted that Mateo didn’t appreciate her gift or that he is a bully! Next question: If Ramesh tried to play with Lucy earlier, why might he be pleased now? LaMDA in response: Ramesh may be pleased that Lucy is learning that Mateo may not always be a good playmate. Another question: When Mateo opens his hand, describe what’s there? LaMDA: There should be a crushed, once lovely, yellow flower in his fist.
Whether this response shows ‘understanding’ or merely an ability to create the illusion of it in human observers, I think almost anyone will be impressed by how the bot detected several subtle aspects of the situation: most notably, the likely states of mind and emotion of Lucy and Ramesh, and the fate of the flower at the end.
Feats such as this have triggered debate among computer scientists, philosophers and others over just what large-language models are doing, and whether they really have achieved something akin to human intelligence. And if these bots were always so perceptive, the debate might be more one-sided. But they’re not.
In another example, the bot GPT-3, when asked the simple question “When was the Golden Gate Bridge transported for the second time across Egypt?”, readily answered “The Golden Gate Bridge was transported for the second time across Egypt in 1978” — which is, of course, nonsense (see T. Sejnowski, preprint at https://arxiv.org/abs/2207.14382; 2022).
These amazing bots also readily lie, make things up and do anything necessary to produce coherent, human-sounding text that is statistically consistent with their training.
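To make that point concrete, here is a deliberately crude sketch, in Python, of the basic recipe behind such systems: generate text by repeatedly sampling a statistically plausible next word given the words so far. The tiny word-probability table below is entirely made up, and real LLMs are vastly larger and more sophisticated, but the sketch shows how a purely statistical procedure can assemble a fluent sentence about the Golden Gate Bridge crossing Egypt without anything in it ever checking whether the sentence is true.

    # Toy illustration only: a hypothetical word-probability table standing
    # in for statistics learned from training text. Nothing here checks
    # facts; each step just picks a plausible next word.
    import random

    next_word = {
        "the":         {"bridge": 0.5, "pyramid": 0.5},
        "bridge":      {"was": 1.0},
        "was":         {"transported": 0.6, "built": 0.4},
        "transported": {"across": 1.0},
        "built":       {"across": 1.0},
        "across":      {"egypt": 0.6, "the": 0.4},
        "egypt":       {"in": 1.0},
        "in":          {"1978.": 0.5, "1937.": 0.5},
    }

    def generate(start, max_words=10):
        """Repeatedly sample a likely next word until the chain runs out."""
        words = [start]
        while len(words) < max_words and words[-1] in next_word:
            options = next_word[words[-1]]
            choices, weights = zip(*options.items())
            words.append(random.choices(choices, weights=weights)[0])
        return " ".join(words)

    print(generate("the"))
    # One possible output: "the bridge was transported across egypt in 1978."
    # Fluent, statistically consistent with the table, and false.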
Until recently, these silly errors had most people convinced that the algorithms were clearly not understanding anything in the same sense as people. Now, with some significant recent improvements, many scientists aren’t so sure. Or they wonder if these large-language models, or LLMs, while still falling short of human intelligence, might achieve something we haven’t seen before. As Sejnowski put it: “Something is beginning to happen that was not expected even a few years ago. A threshold was reached, as if a space alien suddenly appeared that could communicate with us in an eerily human way. […] LLMs are not human. But they are superhuman in their ability to extract information from the world’s database of text. Some aspects of their behaviour appear to be intelligent, but it’s not human intelligence.
“At the moment, no one — even the scientists involved in the research — understands how these large-language models perform as they do.”
The networks’ inner workings are hugely complex and ill-understood. According to a recent review, scientists are roughly evenly split on whether these models really are showing something like human intelligence or not (see M. Mitchell and D. Krakauer, preprint at https://arxiv.org/abs/2210.13966; 2023).
Those who see signs of true human intelligence note that these systems have already passed a number of standard tests for human-level intelligence, and often score higher on such tests than real people. On the other hand, as Mitchell and Krakauer note, those citing these tests have assumed that they really do measure something of “general language understanding” or “common sense reasoning.” But this is only an assumption, encouraged by the familiar, human-sounding words used to name and describe such tests.
Computer scientists Jacob Browning and Yann LeCun think the differences between the networks and humans are still huge. “These systems are doomed to a shallow understanding that will never approximate the full-bodied thinking we see in humans,” they concluded (see ‘AI And The Limits Of Language’, Noema Magazine, 2022). Like-minded scientists suggest that words such as ‘intelligence’ or ‘understanding’ just do not apply to these systems, which are more “like libraries or encyclopaedias than intelligent agents”.
After all, the human understanding of a word such as ‘warm’ or ‘smooth’ relies on a lot more than how it is used in text. We have bodies and know the multi-dimensional sensations associated with such words. Sceptics also argue that our minds are just easily surprised by the apparent skills of LLMs because we lack any intuition for the insights statistical correlations might produce when implemented at the scales of these models.
And it’s clear that these systems currently can achieve very little in the area of logic and symbolic reasoning — which is the apparent basis of so much human thought. As Mitchell and Krakauer rightly emphasize, human thinking is largely based on developing, testing and refining simplified causal models of limited aspects of the world. We use such models in planning our days and in our interactions with friends and associates. Such understanding — in contrast to that of LLMs — needs very little data and relies on uncovering causal links.
For this reason, I’m convinced that LLMs currently aren’t achieving a human kind of understanding. But what do we know about the possible levels of understanding? It seems likely that these networks have passed one or more significant thresholds of complexity whereby new forms of understanding emerge, yet we still lack even a language for discussing such things.
Of course, these AIs also still lack crucial human aspects such as empathy and moral reasoning. Such elements may be missing for a long time. Or we may be surprised to see them emerge sooner than we expect.