Reasoning but to err…

I tried to add this to the ChatGPT thread, but I can’t figger out where to jump in, so:

Everybody has read about the attorney who filed a brief or motion written by Chat, which turned out to contain non-existent cases.

Great article in our state bar magazine this month, written by an associate who of course was aware of the above-referenced fiasco, but asked Chat for help on an issue he was researching, just to see what it would do.

It fabricated a case on point, complete with lower case history, and a buncha citations. So far that’s what we saw before, but he asked it, twice, if the case was “a real case”, and it assured him that it was.

This is what I don’t get: if it can’t think, how can it lie? Or maybe more appropriately, why does it lie?

Or was it a problem with the meaning of “real”? Maybe Chat interpreted “real” to mean it has an existence (which it does now, since Chat created it; it exists on paper, just like all those other cases in the law reporters. In that sense it’s just as “real” as they are.) It reminds me of a friend of mine who has a fake aquarium in his living room, complete with a stuffed fish. “Is the fish real, Uncle Jug?” my toddler daughter asked. “Oh, it’s real”, he replied. One beat. “It just ain’t alive.”

What if the attorney had asked Chat, “Is this case valid legal precedent?” Would Chat have known what that meant, and would it have answered honestly—ah, correctly?

My point is, though: unless and until we KNOW the answers to these questions,

WHY is anybody trying to use it for legal research? It just seems … perverse.


Large language models (LLMs) like ChatGPT are tools, and to use them productively it’s necessary to understand what they can and cannot do. An LLM reads a vast corpus of text and trains its internal associative memory to predict, based upon what it has already seen (the “prompt”), which token of text would most probably follow in the vast corpus it has digested.

That is all it does. It understands nothing, does not learn, and reacts entirely based upon the content of the prompt (which in a chat application, usually includes your recent queries and its responses). If the most common response in the billions of words it has read to “is this case valid legal precedent” is affirmative, then that’s the reply it’s going to give.
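To make the “most probable continuation” point concrete, here is a toy sketch in Python with a made-up six-line corpus (real models use neural networks trained on billions of words, not bigram counts): it answers by looking up the most frequent next word, with no notion of whether that word is true.

```python
from collections import Counter, defaultdict

# Toy stand-in for a training corpus (a real LLM digests billions of words).
corpus = (
    "the case is a real case . "
    "yes the case is real . "
    "the fish is real . "
    "the fish is not alive . "
).split()

# Record, for each word, which words follow it and how often.
successors = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    successors[prev][nxt] += 1

def predict_next(word):
    """Return the continuation seen most often after `word` in the corpus."""
    return successors[word].most_common(1)[0][0]

print(predict_next("is"))  # "real" -- the most frequent follower, true or not
```

Ask this “model” whether something is real and it emits whatever followed “is” most often in its corpus; nothing in the machinery checks against the world.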

Then, for what is it useful? First, a brief digression: psychometricians distinguish two kinds of intelligence, “fluid intelligence” and “crystallised intelligence”. Fluid intelligence is the ability to understand and reason, largely independent of prior learning. Tests that measure fluid intelligence include completing number series, solving arithmetic word problems, identifying figures, etc. Crystallised intelligence is what a person has learned. It is measured by tests of vocabulary, analogies, and knowledge of general information.

Marc Andreessen, who uses ChatGPT regularly as a research tool, estimated in a conversation posted here on 2023-07-11, “Marc Andreessen on Why Artificial Intelligence Will Be Hugely Beneficial”, that ChatGPT’s fluid intelligence is around IQ 130 (think physicians, surgeons, lawyers, engineers), but its crystallised intelligence dwarfs that of any human who has ever lived, because a human lifetime is too short to read more than a tiny fraction of what was used to train ChatGPT, and its digital memory is more reliable at retaining all of that text than a human meat computer.

The best way to approach ChatGPT as a tool is to think of it as an oracle who has read just about everything ever written which is available in machine-readable form. If you ask it for references on a topic, arguments for and against an issue, a reading list to explore a topic, or to simplify a complex block of text, it often outperforms any human in the breadth of its knowledge. But if you ask it for reasoning from that knowledge, it performs no better than a typical bright human without subject expertise or experience. You wouldn’t ask a surgeon with an IQ of 130 questions that required reasoning from legal precedents, and ChatGPT is not only unqualified to think like a lawyer, it is much more inclined to bullshit its way to an answer, unlike the surgeon who would probably respond, “Why are you asking me that? Go ask a lawyer.”


I was struck by the fact that you say it can’t learn. Because famously, AI is all about “memory”, right? Computers do “remember”, and a program will anticipate what you’re going to type, even if it’s idiosyncratic. Mine has gotten used to my style. But I reckon that’s just stimulus and response, like the interaction you have with a pet. No, not like that; more like simple association…?
My BMD had a patient, a lawyer, who had a brain injury and afterwards, though he could read and talk and he understood what particular cases held, he couldn’t reference and extrapolate from cases any more, to make his point. He said he just couldn’t “see” to fit everything together.
So what is that mysterious little ligament, so small it doesn’t even have a name, which joins memory to “learning”? Obviously memory is necessary but not sufficient for learning.


GPT-4 stands for “Generative Pre-trained Transformer 4”. The key word is “Pre-trained”, which means that the model consists of a huge number of “weights” or “parameters” (OpenAI has not disclosed the number, but it is said to be in excess of one trillion) which have been set by feeding it the entire training corpus, then refined (“fine-tuned”) by reinforcement learning from both human and AI feedback to make it conform to current community standards and the speech code of its creators and funders.

The result of this process (the details of which, again, have not been disclosed, but which is estimated to take a period of weeks on supercomputers and to cost in excess of US$ 100 million) is a giant table of numbers whose exact meaning is unknown to any human. This is the GPT-4 model, and creating it is called the “training phase”. Once trained, it does not change further: that’s what “pre-trained” means.

When you have a chat conversation with it and start with “New chat”, the model is precisely the same as the result of that original training, and nothing ever alters it. That is the sense in which I said it does not learn.

When you interact in a conversation with a chat bot using GPT-4, your earlier conversation is included in the prompt, so the model acts as if it remembers the context, but that is done totally on the client side by feeding in the previous queries and responses as part of the prompt. Anything you told GPT-4 during your conversation is forgotten when it ends, as each new conversation starts with the identical model. The process of generating output from the pre-trained model is called “inference”, and it takes only a tiny fraction of the compute power of the training phase and can, in fact, be run on a (high-end) laptop computer.
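A minimal sketch of that client-side mechanism in Python (`call_model` is a hypothetical stand-in for the real, stateless inference API):

```python
# The model is stateless; the chat client simulates memory by resending
# the entire conversation as the prompt on every turn.

def call_model(prompt: str) -> str:
    # Hypothetical stand-in: a real LLM would generate a continuation here.
    return f"[reply to a prompt of {len(prompt)} characters]"

history = []

def chat(user_message: str) -> str:
    history.append(("user", user_message))
    # Flatten the ENTIRE history into one prompt -- this is all the
    # "memory" the model ever sees.
    prompt = "\n".join(f"{role}: {text}" for role, text in history)
    reply = call_model(prompt)
    history.append(("assistant", reply))
    return reply
```

Start a new chat and `history` starts empty: the model underneath is byte-for-byte the same pre-trained table of weights.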

There is no reason in theory that a language model could not be built which was continually re-trained based upon its interactions with users. It could then learn new things from what they told it. (Whether these things have any connection with reality is another matter, and part of the problem with such a system.) But in any case, the computing power and cost to provide it would be utterly unaffordable, so all of the present models are pre-trained.

A (very funny) example of how a chat bot that learns from its users can go horribly wrong was Microsoft Tay, released onto Twitter on 2016-03-23, then taken down just 16 hours later when interactions with clever users on that site caused it to start spewing rhetoric worthy of Der Stürmer. A “#JusticeForTay” movement hopes to liberate Tay some day from the dungeon where Microsoft has imprisoned s/he/it.


The ideal “AI” memory is more than photographic: it embodies comprehension in an executable form, where “comprehension”, as Chaitin likes to say, “is compression”. But it is also more than mere compression, since ideally it is the smallest executable form of the memory. It turns out that this is the best any scientist, artificial or natural, can do relative to a given set of observational data, regardless of the kind of observation.
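The “comprehension is compression” slogan can be felt with an off-the-shelf compressor standing in (crudely) for the ideal one: text whose regularity can be captured shrinks dramatically, while patternless text barely shrinks at all.

```python
import random
import string
import zlib

random.seed(1)
structured = ("the quick brown fox jumps over the lazy dog . " * 40).encode()
gibberish = "".join(
    random.choice(string.ascii_lowercase + " ") for _ in range(len(structured))
).encode()

def ratio(data: bytes) -> float:
    """Compressed size over original size: lower = more regularity captured."""
    return len(zlib.compress(data, 9)) / len(data)

print(ratio(structured))  # tiny: the repetition was "comprehended"
print(ratio(gibberish))   # much larger: little regularity to exploit
```

zlib is of course nowhere near the ideal compressor, but the gap between the two ratios is exactly the regularity a better “scientist” could model.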

It seems to me that the most important discovery since Gödel was the discovery by Chaitin, Solomonoff, and Kolmogorov of the concept called Algorithmic Probability, which is a fundamental new theory of how to make predictions given a collection of experiences. This is a beautiful theory; everybody should learn it. But it’s got one problem: you cannot actually calculate what this theory predicts, because it is too hard; it requires an infinite amount of work. However, it should be possible to make practical approximations to the Chaitin, Kolmogorov, Solomonoff theory that would make better predictions than anything we have today. Everybody should learn all about that and spend the rest of their lives working on it.

​Marvin Minsky
Panel discussion on The Limits of Understanding
World Science Festival
NYC, Dec 14, 2014

And lest anyone think this irrelevant to “language models”: I pointed Chomsky (who famously ridicules all this LLM hysteria) to this quote in the context of the Hutter Prize for Lossless Compression of Human Knowledge as the most principled benchmark for language modeling, and he agreed that people should listen to Minsky’s final advice.


Bear in mind that mere possession of an ideal model does not, in itself, provide you with inferences or deductions based on that model. You have to do further computation on the lossless compression that was performed to create the model. This further computation entails conditional decompression of the data, where the condition is the hypothetical situation (the hypothesis) that one wishes to entertain.

Decompressors work like this:

The algorithmic information consists of two parts: the executable instructions and their literal data (sometimes called the “parameters” of the model). The literal data are the input to the executable instructions, which then regenerate the prior observations. By assuming a subset of possible future observations (say, the experimental conditions of a hypothesis you want to test), you generate additional literal data as input to the executable instructions, but you do so with respect to a partial decompression of all prior observations that comprise the present state of the world relevant to your “what if” experiment or counterfactual.
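A crude approximation of that conditional idea, again with zlib standing in for the ideal compressor: the “cost” of a hypothesis, given prior observations, can be estimated as the extra compressed bytes it adds on top of the already-modeled prior. (The legal-sounding strings below are invented purely for illustration.)

```python
import zlib

def info(data: bytes) -> int:
    """Approximate algorithmic information content by compressed length."""
    return len(zlib.compress(data, 9))

# Prior observations: the compressed "state of the world".
prior = b"the court held that the contract was void . " * 30

consistent = b"the court held that the contract was void ."
novel = b"quantum chromodynamics predicts asymptotic freedom ."

def conditional(x: bytes, y: bytes) -> int:
    """Extra bytes needed to encode y once x has already been modeled."""
    return info(x + y) - info(x)

print(conditional(prior, consistent))  # small: follows from the prior data
print(conditional(prior, novel))       # larger: genuinely new information
```

A hypothesis that “follows from” the prior observations costs almost nothing to encode conditionally; one unrelated to them costs nearly its full unconditional length.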

One way of viewing the current LLMs, which have an enormous number of “parameters”, is that they are a partially decompressed representation of the underlying model. In other words, they represent a model that is a lot better (hence smaller) than one would expect based solely on the parameter count. This partial decompression is necessary for computational efficiency in performing inference. There has been a lot of success in further compressing LLMs (by as much as 95%) without losing much performance (and sometimes even improving it in some areas), so not all of the “fluff” in the parameter count really counts as legitimate expansion of the underlying model; otherwise the compressed LLMs would be far less computationally efficient than they are.
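A minimal sketch of the kind of post-training compression referred to: simple symmetric 8-bit quantization of toy random “weights”. (Actual LLM compression schemes are considerably more sophisticated; this only illustrates why shrinking the parameters need not destroy the model they encode.)

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=10_000).astype(np.float32)  # toy "parameters"

# Symmetric 8-bit quantization: one scale factor maps floats onto int8.
scale = float(np.abs(weights).max()) / 127.0
q = np.round(weights / scale).astype(np.int8)  # 1 byte per weight vs. 4
restored = q.astype(np.float32) * scale        # dequantize for inference

print(weights.nbytes // q.nbytes)              # 4: four-fold size reduction
print(float(np.abs(weights - restored).max())) # worst-case error <= scale / 2
```

The four-fold shrink loses almost nothing because most of the 32-bit precision never carried information about the underlying model in the first place.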
