How To Nuke Large Language Model Bias?

We need to nuke what might be called “large language model bias”. The bias of which I speak is in two interdependent senses:

  1. Bias toward large models
  2. The most principled definition of “bias”: Bias operationally identified by lossless compression under Algorithmic Information Theory

These are interdependent:

  1. A corpus’s Algorithmic Information can only be approached by operationalizing the definition of “bias” as it exists in the corpus.
  2. Algorithmic Information is defined as the length of the smallest of all possible programs that output the corpus.
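
To make the second point concrete, here is a minimal sketch, assuming an off-the-shelf compressor (Python's standard `lzma` module) as a crude stand-in for "the smallest program". True Algorithmic Information is uncomputable; any lossless compressor only gives an upper bound on it, up to the fixed size of the decompressor. The sample strings and the function name `description_length_upper_bound` are my own illustrative choices.

```python
import lzma
import os


def description_length_upper_bound(data: bytes) -> int:
    """Return the size in bytes of an LZMA-compressed copy of `data`.

    This is only an upper bound on the algorithmic information of `data`
    (ignoring the constant size of the decompressor itself).
    """
    return len(lzma.compress(data, preset=9))


# A corpus with obvious regularity: one sentence repeated many times.
patterned = b"the cat sat on the mat. " * 4096

# A corpus of the same length with no regularity a compressor can exploit.
random_bytes = os.urandom(len(patterned))

patterned_bound = description_length_upper_bound(patterned)
random_bound = description_length_upper_bound(random_bytes)

# Lossless means the original corpus is recovered exactly.
roundtrip_ok = lzma.decompress(lzma.compress(patterned)) == patterned

print("raw size:       ", len(patterned))
print("patterned bound:", patterned_bound)
print("random bound:   ", random_bound)
print("lossless:       ", roundtrip_ok)
```

The patterned corpus shrinks drastically because the compressor has found (part of) the regularity that generated it; the random one does not shrink at all. A better model of the corpus is, by this measure, simply one that yields a smaller total.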

To those of us who have been patiently lining up for over 15 years now, way down at Algorithmic Information Beach, awaiting the inevitable small language model tsunami borne of Kolmogorov et al, it is with a mixture of sardonic pity and horror that we watch, from a safe distance, hordes frantically paddling out to catch the “Large Language Model” wave set now breaking at Shannon Information Beach.

While it may be that we must wait until after carnage has cleared and the great whites have had their fill at Shannon Beach, it does occur to me that such techwave lineups have large distributional tails. Being one of those pioneers suffering from 6-sigmaitis myself (not speaking of IQ but of a combination of salient basis vectors) I feel uncomfortable with the sardonic aspect of my pity toward them.

Might there be some overlap with long-tail funding sources (despite someone believing I am “autistic” if not “schizophrenic” in need of medication)? It is with a heavy heart that I have long recognized that (what John Robb calls) swarm alignment is dumbing down funding sources (including those at ycombinator, despite its “autistic” title referring to Haskell Curry’s combinatory logic). Still, I find the sheer size of the lineup at Shannon Beach, combined with my long-tail-techwave-lineup prior, hopeful: a few might be induced to head on down here to line up for the tsunami at Algorithmic Information Beach.

Among the computational language model luminaries who have directly endorsed Hutter’s Prize are Noam Chomsky and Douglas Hofstadter (both recently), along with others, over the course of 17 years, whose names escape me at present. In his last public admonition to the field, Minsky very emphatically, if indirectly, endorsed Hutter’s Prize:

So here are a couple of spitballs:

  1. A crowd-funded Hutter Prize purse running in parallel with, and as a supplement to, the existing purse, given Marcus Hutter’s limited ability to increase it to the level required to induce the aforementioned fat-tail funding sources to weigh in.
  2. A proprietary version of the Hutter Prize that offers much greater rewards in exchange for nondisclosure.

(These are not mutually-exclusive, of course.)

What sorts of entities would be best to manage such prize purses? We can’t rely on the X-Prize foundation, as it long ago abandoned the idea that its prize criteria must reduce the subjective aspects of judging. They no longer get the critical need for reducing the judging argument surface, and fall back instead on the catch-all phrase that the judges’ decisions “are final”. Even the Methuselah Mouse prize folk, who inspired the original prize criterion for Hutter, have gone flabby in the head. It’s sad but at least one new organizational structure is required to revive the illustrious objective judging criterion visions of these organizations, IMHO.

I’m no lawyer, nor organizer. This, to me, is the main barrier to nuking large language model bias in both of the aforementioned senses.

I must now put to bed an additional sense of “bias” toward the “large”: Not of models but of corpora. The Hutter Prize is a mere 1GB corpus and the large language models are drawing on corpora orders of magnitude larger! The pragmatic value of larger corpora is, at this stage of language model development, best seen as the value of rote memorization over synthetic intelligence:

Computational complexity (not to be confused with Algorithmic Information’s notion of Kolmogorov Complexity) of deductive intelligence is simply too large to ever permit us to make all predictions based on reduction to the Standard Model + General Relativity + Wave Function Collapse Events, as some might think theoretically optimal under Algorithmic Information.

Having now paid The Devil His Due, let me describe why it is we should consider the 1GB limit adopted by Hutter’s Prize to be even more pragmatic for the present stage of language model development:

Thomas K. Landauer, a colleague of Shannon at Bell Labs, in "How Much Do People Remember? Some Estimates of the Quantity of Learned Information in Long-term Memory" estimated that over a human lifetime, the information taken in and integrated into memory was a few hundred megabytes. Ten hundred megabytes of carefully curated knowledge (Hutter’s 1GB) should be more than adequate for the most important challenge now facing language models:

Unbiased language models that master a wide range of knowledge and run on commodity hardware.

TSUNAMI’S UP!

That 70B parameter model, called “Chinchilla AI”*, is from DeepMind.

Well, actually, since a PhD student of Hutter’s, Shane Legg, co-founded DeepMind, the only thing that is “surprising” here is that DeepMind, thence Google, hasn’t backed the Hutter Prize with a billion-dollar purse.

I scare quote “surprising” because I’ve held for some time that this aversion to the Algorithmic Information Criterion is not merely analogous but is equivalent to the historic aversion to the scientific method during the early years it was being advanced.

One of the more troubling developments in the LLM space is this paper by a mob of lavishly-endowed midwits titled “Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models”. This is an enormous project to create “The” LLM benchmark consisting of a huge number of “tasks” when the only principled benchmark is the Algorithmic Information Criterion.

Unsurprisingly, they engage in tortured rhetoric to deal with “bias” and come up with the below graphs showing that as LLM size increases, so does “bias” as defined by the politically correct.

Larger, less PC.

The more “critical” graphs would show critical thinking score vs PC. I’m not sure how to wrench that out of this mob of PC midwits.

*Chinchilla AI doesn’t really advance the state of the art in model generation. All it demonstrates is that by increasing the size/diversity of the training corpus, the current training algorithms (transformers) use fewer computer resources to achieve equivalent or better performance on current LLM benchmarks, which are heavily biased toward relatively superficial aspects of intelligence such as high school level of knowledge and common sense. Critical thinking? The demand for that from those that control the purse strings depends on how much they benefit by the public having access to enhanced critical thinking. Take one guess…

We here at Algorithmic Information Beach welcome any Quimby that shows up to surf with us, even if they end up being Jakes because they don’t get that the most important thing isn’t machine learning but is, rather, model selection, thereby inviting specious jeers from LessWrong.
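
For any Quimby wondering what "model selection" by lossless compression looks like in miniature, here is a hedged sketch. It is not the Hutter Prize scoring code; it only assumes a toy family of adaptive order-k byte models with add-one smoothing (my choice for illustration) and computes the ideal code length, in bits, that an arithmetic coder driven by each model would approach. Because each model learns only from bytes it has already coded, a decoder can rebuild it step by step, so the code length is an honest lossless description length. The sample corpus and all names are hypothetical.

```python
import math
from collections import defaultdict


def adaptive_code_length_bits(data: bytes, order: int) -> float:
    """Ideal code length of `data` under an adaptive order-`order` byte model.

    Each byte is predicted from the `order` bytes before it, using counts
    gathered only from the already-coded prefix, with add-one smoothing
    over the 256 possible byte values.
    """
    counts = defaultdict(lambda: defaultdict(int))  # context -> byte -> count
    totals = defaultdict(int)                       # context -> total count
    bits = 0.0
    for i, symbol in enumerate(data):
        context = data[max(0, i - order):i]
        probability = (counts[context][symbol] + 1) / (totals[context] + 256)
        bits += -math.log2(probability)
        # Update the model only after the byte has been "coded".
        counts[context][symbol] += 1
        totals[context] += 1
    return bits


corpus = b"the cat sat on the mat. " * 1000

# Candidate models: context lengths 0, 1 and 2.
scores = {order: adaptive_code_length_bits(corpus, order) for order in (0, 1, 2)}

# Model selection: keep whichever candidate describes the corpus in fewest bits.
best_order = min(scores, key=scores.get)

for order, bits in scores.items():
    print(f"order {order}: {bits / 8:.0f} bytes")
print("selected order:", best_order)
```

The point of the toy is that nobody has to judge which model "understands" the corpus better; the bit count does the judging, and there is nothing to argue about afterward.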

Moore’s law should have subjected the social pseudosciences to Ockham’s Guillotine 50 years ago and spared the world what is now upon us. Popper and Kuhn derailed this. May they burn in Hell.

Here is a step in the right direction:

Here is another step in the right direction.

This paper is one step away from reifying identities latent in the data as “points of view” which can help to discount bias. Think about it in the context of Wikipedia’s policy of “neutral point of view”, the enforcement of which is anything but neutral. This was my primary motivation in suggesting a replacement for the Turing test with lossless compression of Wikipedia.

Completing this step, in combination with the previously linked article on a layer on top of the large language models that renders them Turing complete so as to support multi-step critical reasoning, will put the field under the algorithmic information criterion for model selection. Once that happens there will be a direct head-to-head contest between economic value and enforcement of the moral zeitgeist. Watch for those who try to obscure how direct this contest is. They are the tip of the spear of lies. They will almost certainly be flushed out by this.
