How To Nuke Large Language Model Bias?

We need to nuke what might be called “large language model bias”. The bias of which I speak is in two interdependent senses:

  1. Bias toward large models
  2. The most principled definition of “bias”: Bias operationally identified by lossless compression under Algorithmic Information Theory

These are interdependent:

  1. A corpus’s Algorithmic Information can only be approached by operationalizing the definition of “bias” as it exists in the corpus.
  2. Algorithmic Information is defined as the length of the smallest of all possible algorithms that output the corpus.
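To make the operational definition concrete, here is a minimal sketch of bias-as-compressibility. zlib is purely a stand-in for the ideal compressor, since Kolmogorov complexity is uncomputable; any real compressor only upper-bounds a corpus's Algorithmic Information:

```python
import os
import zlib

def compressed_size(corpus: bytes, level: int = 9) -> int:
    """Upper bound on the corpus's Algorithmic Information.

    zlib is a crude stand-in: Kolmogorov complexity is uncomputable,
    so every real compressor merely bounds it from above.
    """
    return len(zlib.compress(corpus, level))

# A corpus with strong regularities (exploitable "bias") compresses
# far better than incompressible noise of the same length.
patterned = b"the cat sat on the mat. " * 1000
noise = os.urandom(len(patterned))

assert compressed_size(patterned) < compressed_size(noise)
```

The shorter the losslessly compressed form, the more of the corpus's regularities (its "bias" in the operational sense) the compressor's implicit model has captured.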

To those of us who have been patiently lining up for over 15 years now, way down at Algorithmic Information Beach, awaiting the inevitable small language model tsunami borne of Kolmogorov et al, it is with a mixture of sardonic pity and horror that we watch, from a safe distance, hordes frantically paddling out to catch the “Large Language Model” wave set now breaking at Shannon Information Beach.


While it may be that we must wait until after the carnage has cleared and the great whites have had their fill at Shannon Beach, it does occur to me that such techwave lineups have large distributional tails. Being one of those pioneers suffering from 6-sigmaitis myself (not speaking of IQ but of a combination of salient basis vectors) I feel uncomfortable with the sardonic aspect of my pity toward them.

Might there be some overlap with long-tail funding sources (despite someone believing I am “autistic” if not “schizophrenic” and in need of medication)? Although I have long recognized, with a heavy heart, that what John Robb calls swarm alignment is dumbing-down funding sources (including those at Y Combinator, despite its “autistic” title referring to Haskell Curry’s combinatory calculus), the sheer size of the lineup at Shannon Beach, combined with my long-tail-techwave-lineup prior, makes me hopeful: a few might be induced to head on down here and line up for the tsunami at Algorithmic Information Beach.

Among the computational language model luminaries who have directly endorsed Hutter’s Prize are Noam Chomsky and Douglas Hofstadter (both recently), along with others whose names are lost to my memory at present over the course of 17 years. In his last public admonition to the field, Minsky very emphatically, if indirectly, endorsed Hutter’s Prize:

So here are a couple of spitballs:

  1. A crowd-funded Hutter Prize purse, running in parallel with and as a supplement to Marcus Hutter’s limited ability to increase the Hutter Prize purse to the level required to induce the aforementioned fat-tail funding sources to weigh in.
  2. A proprietary version of the Hutter Prize that offers much greater rewards in exchange for nondisclosure.

(These are not mutually-exclusive, of course.)

What sorts of entities would be best to manage such prize purses? We can’t rely on the X-Prize Foundation, as it long ago abandoned the idea that its prize criteria must reduce the subjective aspects of judging. They no longer grasp the critical need to reduce the judging argument surface beyond the catch-all phrase that decisions “are final”. Even the Methuselah Mouse Prize folk, who inspired the original prize criterion for Hutter, have gone flabby in the head. It’s sad, but at least one new organizational structure is required to revive the illustrious objective-judging-criterion visions of these organizations, IMHO.

I’m no lawyer, nor organizer. This, to me, is the main barrier to nuking large language model bias in both of the aforementioned senses.

I must now put to bed an additional sense of “bias” toward the “large”: not of models but of corpora. The Hutter Prize uses a mere 1GB corpus, while the large language models are drawing on corpora orders of magnitude larger! The pragmatic value of larger corpora is, at this stage of language model development, best seen as the value of rote memorization over synthetic intelligence:

Computational complexity (not to be confused with Algorithmic Information’s notion of Kolmogorov Complexity) of deductive intelligence is simply too large to ever permit us to make all predictions based on reduction to the Standard Model + General Relativity + Wave Function Collapse Events, as some might think theoretically optimal under Algorithmic Information.

Having now paid The Devil His Due, let me describe why it is we should consider the 1GB limit adopted by Hutter’s Prize to be even more pragmatic for the present stage of language model development:

Thomas K. Landauer, a colleague of Shannon at Bell Labs, in “How Much Do People Remember? Some Estimates of the Quantity of Learned Information in Long-term Memory”, estimated that the information a person takes in and integrates into memory over a lifetime amounts to a few hundred megabytes. Ten hundred megabytes of carefully curated knowledge (Hutter’s 1GB) should be more than adequate for the most important challenge now facing language models:

Unbiased language models that master a wide range of knowledge and run on commodity hardware.
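Landauer's order of magnitude is easy to sanity-check. The 2 bits/second retention rate and 70-year span below are illustrative assumptions in the spirit of his paper, not his exact figures:

```python
# Rough sanity check of Landauer's lifetime-memory estimate.
# Assumptions (illustrative, not Landauer's exact numbers):
#   ~2 bits/second integrated into long-term memory, ~70-year lifetime.
SECONDS_PER_YEAR = 365.25 * 24 * 3600
bits_retained = 2 * 70 * SECONDS_PER_YEAR
megabytes = bits_retained / 8 / 1e6
print(round(megabytes))  # roughly 550, i.e. "a few hundred megabytes"
```

So even under generous assumptions, a human lifetime of integrated knowledge fits comfortably inside Hutter's 1GB.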



That 70B parameter model, called “Chinchilla AI”*, is from DeepMind.

Well, actually, since Hutter’s PhD students founded DeepMind, the only thing that is “surprising” here is that DeepMind, thence Google, hasn’t backed the Hutter Prize with a billion-dollar purse.

I scare-quote “surprising” because I’ve held for some time that this aversion to the Algorithmic Information Criterion is not merely analogous to, but is equivalent to, the historic aversion to the scientific method during the early years it was being advanced.

One of the more troubling developments in the LLM space is this paper by a mob of lavishly-endowed midwits titled “Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models”. This is an enormous project to create “The” LLM benchmark consisting of a huge number of “tasks” when the only principled benchmark is the Algorithmic Information Criterion.

Unsurprisingly, they engage in tortured rhetoric to deal with “bias” and come up with the below graphs showing that as LLM size increases, so does “bias” as defined by the politically correct.

Larger, less PC.

The more “critical” graphs would show critical thinking score vs PC. I’m not sure how to wrench that out of this mob of PC midwits.

*Chinchilla AI doesn’t really advance the state of the art in model generation. All it demonstrates is that, by increasing the size and diversity of the training corpus, the current training algorithms (transformers) can use fewer computing resources to achieve equivalent or better performance on current LLM benchmarks – benchmarks heavily biased toward relatively superficial aspects of intelligence, such as high-school-level knowledge and common sense. Critical thinking? The demand for that from those who control the purse strings depends on how much they benefit from the public having access to enhanced critical thinking. Take one guess…


We here at Algorithmic Information Beach welcome any Quimby that shows up to surf with us, even if they end up being Jakes because they don’t get that the most important thing isn’t machine learning but is, rather, model selection, thereby inviting specious jeers from LessWrong.

Moore’s law should have subjected the social pseudosciences to Ockham’s Guillotine 50 years ago and spared the world what is now upon us. Popper and Kuhn derailed this. May they burn in Hell.


Here is a step in the right direction:


Here is another step in the right direction.

This paper is one step away from reifying identities latent in the data as “points of view” which can help to discount bias. Think about it in the context of Wikipedia’s policy of “neutral point of view”, the enforcement of which is anything but neutral. This was my primary motivation in suggesting a replacement for the Turing test with lossless compression of Wikipedia.

Completing this step, in combination with the previously linked article on a layer on top of the large language models rendering them Turing complete so as to support multi-step critical reasoning, will put the field under the Algorithmic Information Criterion for model selection. Once that happens, there will be a direct head-to-head contest between economic value and enforcement of the moral zeitgeist. Watch for those who try to obscure how direct this contest is. They are the tip of the spear of lies. They are almost certain to be flushed out by this.


Probably one of the most destructive titles for a paper in history is “Attention Is All You Need”, which sought to prove that recurrent* neural networks were outperformed by a non-recurrent architecture. Elon Musk’s AI guru Andrej Karpathy even went on to palaver with Lex Fridman, claiming that the Transformer is Turing complete:

Karpathy attempts to justify his assertion that the Transformer is “a general-purpose differentiable computer” by attributing to each Transformer layer the quality of a statement in a computer programming language. So, if you have N Transformer layers, you have N statements!

Yeah, but there’s just one thing that he ignores: There are no loops because there is no recurrence.
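The structural point can be sketched in a few lines; `layer`, `step`, and `done` are hypothetical stand-ins, not any real API:

```python
def transformer_forward(x, layers):
    # N layers = N fixed "statements": the computation halts after
    # exactly len(layers) steps, no matter what the input is.
    for layer in layers:
        x = layer(x)
    return x

def recurrent_forward(x, step, done):
    # General-purpose computation requires a loop whose iteration
    # count can depend on the data itself (and may not halt at all).
    while not done(x):
        x = step(x)
    return x
```

A fixed-depth network is a straight-line program: no setting of its weights lets it express “iterate until convergence” for an input that needs more than N steps.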

That Musk would look up to such illiteracy, stunning in both its theoretic and pragmatic bankruptcy, exemplifies why the field of machine learning has succumbed to the “algorithmic bias” industry’s theocratic alignment of science with the moral zeitgeist.

Imagine Newton struggling to incorporate the notion of velocity into the state of a physical system so that the concept of momentum could make its debut in the natural sciences, after millennia of Greek misapprehension that force is the essence of motion. Then some idiot from the King’s court shows up with otherwise quite useful techniques of correlation and factor analysis granted him by a peek into the 19th century through the eyes of Galton, claiming “Statistics are all you need.” This idiot then denounces Newton, with the full authority of the King, backing up his idiocy with all of the amazing things Galton’s techniques allow him to do.

That’s “Attention Is All You Need” in the present crisis due to Artificial Idiocy.

So my response to the Karpathy interview was simply this:

“Recurrence Is All You Need”

The key word here is “Need” as in “necessary” in order to achieve true “general-purpose” computation.

Oh, but what about Galton? I mean, statistics are so powerful and provide us with such pragmatic advantage! How can I claim Musk’s being misled in a pragmatic sense?

The answer is that while we must always give statistics its due – just as we must The Devil – we must never elevate The Devil to the place of God, lest we damn ourselves to the eternal torment of argument over causation. Always be mindful that statistics are merely a way of identifying steps (ie: “statements”) that must be incorporated into an ultimately recurrent model.

Fortunately there are a few sane folks out there looking into doing this, but I fear they are more marginalized, in relative terms, than a Newton who is barred forever from influence because some idiot managed to dazzle the King with a peek into the future pragmatics of statistics.

Here’s a paper by such marginalized Newtons, groping in the dark to recover what Alan Turing taught us nearly a century ago:

Recurrent Memory Transformer

Transformer-based models show their effectiveness across multiple domains and tasks. The self-attention allows to combine information from all sequence elements into context-aware representations. However, global and local information has to be stored mostly in the same element-wise representations. Moreover, the length of an input sequence is limited by quadratic computational complexity of self-attention.

In this work, we propose and study a memory-augmented segment-level recurrent Transformer (RMT). Memory allows to store and process local and global information as well as to pass information between segments of the long sequence with the help of recurrence.

We implement a memory mechanism with no changes to Transformer model by adding special memory tokens to the input or output sequence. Then the model is trained to control both memory operations and sequence representations processing.

Results of experiments show that RMT performs on par with the Transformer-XL on language modeling for smaller memory sizes and outperforms it for tasks that require longer sequence processing. We show that adding memory tokens to Tr-XL is able to improve its performance. This makes Recurrent Memory Transformer a promising architecture for applications that require learning of long-term dependencies and general purpose in memory processing, such as algorithmic tasks and reasoning.

*Moreover, the only sense in which the out-performed “recurrent” architectures could be called “recurrent” was a very degenerate one, because they were being used in a feed-forward layered architecture where the recurrence was subservient. This is rather like claiming that a digital multiply circuit’s use of flip-flops to communicate steps in multiplication is “recurrent” and hence a Turing complete “general-purpose computer”.
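The segment-level recurrence the RMT abstract describes can be sketched as follows; `transformer` is a hypothetical stand-in for the stateless model, and the memory slots are ordinary token positions carried from one segment to the next:

```python
def rmt_process(transformer, segments, memory, num_mem_tokens):
    # Sketch of the Recurrent Memory Transformer's outer loop: memory
    # tokens prepended to each segment flow through self-attention,
    # and their updated values are handed to the next segment. This
    # outer recurrence is what restores data-dependent iteration.
    outputs = []
    for segment in segments:
        processed = transformer(memory + segment)
        memory = processed[:num_mem_tokens]         # write: updated memory
        outputs.append(processed[num_mem_tokens:])  # read: segment output
    return outputs, memory
```

Because the loop runs once per segment, the number of computation steps grows with the length of the input sequence rather than being frozen at the model's layer count.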


If it can’t be done on a Commodore 64 it’s probably not gonna make a difference. 1 MHz should be enough.


Wolfram conducts a really good interview with Terry Sejnowski, who illustrates the point I’ve been making about the ML industry’s failure regarding recurrence and dynamical systems – and in particular the failure of large language models like ChatGPT.

At one point in the interview Sejnowski asserts quite correctly that the field should have paid a lot more attention to dynamical systems experts!

I was most excited to see someone of his reputation finally say the Emperor Has No Clothes and do so in a way that was almost a confession of prior negligence.

But then Sejnowski answers Wolfram’s question about ChatGPT in regard to dynamical systems and asserts that transformers, hence ChatGPT, embody a dynamical system.

Sejnowski is not referring to the Transformer model itself as being recurrent, but rather to the fact that the vector output by the Transformer is used to refine the conversation’s context for further dialogue. He then goes on to say that this is the kind of “recurrence” the human brain does. BUT that elides the internal dynamics of the human brain – dynamics that entail the causal chains of reasoning we call “thought”. So his answer was specious. Worse, when I brought up the issue, someone in the chat answered me, incorrectly, that the Transformer uses RNNs – and this was a guy who had actually worked with some of the luminaries in the field that influenced Marcus Hutter’s PhD students who founded DeepMind – though not Marcus Hutter himself.
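The kind of “recurrence” Sejnowski is actually describing amounts to nothing more than this outer loop, sketched below with `model` as a stand-in for one fixed-depth Transformer pass:

```python
def dialogue(model, context, turns):
    # The only "recurrence" in a deployed chat system: the fixed-depth
    # model's output is appended to the context and fed back in. No
    # loop exists inside the model itself, which is where the internal
    # dynamics of thought would have to live.
    for _ in range(turns):
        context = context + model(context)
    return context
```

The loop lives in the serving harness, not in the model; the model's own computation remains a fixed-depth pass with no internal state carried over.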

I did appreciate that he debunked the notion that “Hebbian learning” is merely “co-occurrence” of neuron firings – it is actually close-in-time occurrence, with one firing leading the other, providing the direction of neuron-connection growth. Moreover, he also pointed out that this is the origin of “causal” reasoning at the very lowest level of learning! Good for him!


For 17 years I’ve been telling people increased funding for the Hutter Prize is the way to fix bias in AI. Now POTUS exec-orders this.

“root out bias in the design and use of new technologies, such as artificial intelligence”

Time’s running out.

I strongly suggest contacting Marcus at his email address to discuss how you can support the Hutter Prize.

For my own part, I’m taking $100/month out of my income stream, which has been annihilated by my political pariah status and 20 years of caring for an HD-degenerating wife, to steadily increase the Hutter Prize.

This is the vault where I’m putting $100/month of BTC to increase the next Hutter Prize payout.


PS: Marcus suggested that I keep a vault for my monthly donations to the Hutter Prize under my control and I chose a BTC address for this purpose. The reason I am violating the so-called “best practices” of BTC by “reusing” the address is, first of all, because I want to be transparent and second of all because people shouldn’t be using BTC if they want anonymity!!! Come on folks, all this nonsense about avoiding BTC address reuse just strikes me as a trap for the unwary – giving a false sense of security while using BTC. If people are serious about protecting privacy they need to use other solutions such as Monero.

---------- Forwarded message ---------
From: James Bowery
Date: Thu, Feb 9, 2023 at 6:40 PM
Subject: Sending Donations For the Next Prize Award?
To: Marcus Hutter
Cc: Matt Mahoney


Given all the hype about large language models, and the growing suspicion about “bias”, I suspect now would be a pretty good time to start accepting monthly donations to gradually increase the size of the next HP payout. There may be quite a number of folks who get that the HP fills an important gap. Paypal has facilities for this kind of thing, as well as Patreon, etc.

A funding thermometer on the HP page showing the current level of prize payout may help to induce not only contestants but donors as well.

For my own part, I’d be willing to send $100/month toward this end without any requirement that I be credited. I just want to see more attention paid to the value of algorithmic information approximation in drawing a distinction between the “is” and the “ought” notions of “bias”, as well as in debunking the specious notion that “bigger is better” in ML.

– Jim


Have you considered making a video presentation targeting potential investors who may not have a background in computer science, explaining why you think LLM bias poses an existential threat and how the Hutter Prize is the solution?


Yes. The big problem I have is what educators call “placement” – not just in understanding but also in motivation of potential “investors” – although philanthropists may be a more proper word here.

I’ve gotten so little feedback it’s hard to figure out where to start. It’s like I’m out in the desert living on locusts and honey:

Take Ockham’s Razor and The Information Age SERIOUSLY!

If I could read minds – imputation of variables latent in the data and all that causal inference rot – I can only guess they’re silently thinking:

“Don’t be insulting. Of course I take Ockham’s Razor seriously. Of course I take Moore’s Law seriously. Of course I take the data flood seriously. The entire human race takes ALL that as seriously as a heart attack! Where have you been? And why are you wearing that lion skin? Could you just dump the lion skin and take a shower?”



The very day after I started using BTC to fight back against Biden’s AI “bias” Executive Order, I get this bizarre “recommendation” from YouTube:

If there is any group of people who should have been using Algorithmic Information for macrosocial model selection, it’s the Intelligence Community.

All those destination “Ft. Meade” supercomputers lined up downstairs at CDC Arden Hills back in the late 70s come to mind.

This looks like the same scammers who have been running crypto cons that appear to be SpaceX announcements, often using the SpaceX logo, and names like “Space X” or “SpaceX CEO”, where the video content is months-old live streams with Elon Musk (one of which was the same one shown in your tweet with Jack Dorsey and Elon), wrapped with the scam payload which, when I checked them out, was a “Send BTC or ETH here to double your money” with a fake chat window with people saying how well it worked for them. When you check out the domain name, it was always registered a few days before you see the scam video. For the OPENAI-X2.INFO in the one you posted:

Creation Date: 2023-02-16T12:11:34Z
Registry Expiry Date: 2024-02-16T12:11:34Z

The registrant ID is always “REDACTED FOR PRIVACY”.

I discussed these scams in a post here on 2022-02-22, ‘“SpaceX” Scammers—YouTube Just Doesn’t Care’.


Thanks. I’d never run across YouTube’s de facto participation in this kind of scam before. I’d seen stuff like that on Twitter, but YouTube had never bothered me with such recommendations. It’s probably just that my mentioning BTC in various places, combined with YouTube’s fight against fake news, disinformation, and grifters, led to the “recommendation”.


And how not to nuke the bias, but grow it strategically:


How much power have conservative organizations lost due merely to the threat that a weaponized IRS will engage in lawUNfare against them, as it clearly did under Obama? If you write off any such donation, you could end up in prison, and don’t tell me you’ll be able to pay.

The worst are full of passionate intensity…

One of the advantages of the Hutter Prize is that the “leftists” can hardly claim that Wikipedia is that “biased”, nor can they claim that payouts based on lossless compression are subject to “bias”.

It may be one of the few places where a “conservative” agenda could operate under 501C3 tax exemption. The leftists would have to not only become aware of the fact that lossless compression is the gold-standard for truth discovery, but become self-aware that they are hostile to truth.


NextBigFuture has an article on “scaling laws” for language models, with a recommendation on where to put funding, and I responded.


One of about 4 or 5 pedantic objections to using the Algorithmic Information Criterion for model selection is that the choice of UTM in defining algorithmic information is arbitrary. This has always seemed obstructionist to me, and not just for the obvious reason that the additive constant of an additional UTM emulator is small compared to just about any reasonably sized dataset (ie: 1GB of Wikipedia).

It is pedantic because there are obvious notions of UTM complexity going back at least to Shannon. Pedants will continue to hammer on this by objecting that the language used to describe the UTM determines the length of the description of the UTM. So shortly after Hutter announced his prize and people started in on this line of obstruction over at LessWrong, I came up with the idea of a UTM that emulates itself – the length of the self-emulator being the Algorithmic Information closure of the entire system.
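For the record, the “additive constant” at issue is the one in the invariance theorem. In the usual notation, for universal machines U and V:

```latex
% Invariance theorem: changing the reference UTM shifts Algorithmic
% Information by at most a machine-dependent additive constant.
K_U(x) \;\le\; K_V(x) + c_{UV} \qquad \text{for all } x,
% where c_{UV}, the length of an emulator of V written for U, does
% not depend on x.
```

The self-emulation proposal then pins down the reference machine by charging it its own description length, i.e. the length of U's self-emulator as run on U itself, closing the regress the pedants point at.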

That was 17 years ago and I still don’t see anything in the pedants’ literature dealing with this, but I did find one question over at cstheory StackExchange that asked the question in rather technical jargon. No response after 6 years! This seems to be something Wolfram should have been researching as part of his NKS, since he awarded $25,000 for the shortest UTM description (and a proof that it is universal).

So I asked the question there and moderators deleted it – perhaps because I posted it under “challenges”. So I just reposted it to Wolfram’s NKS group and took a screen shot in case the moderators delete it again:

PS: There are various games one can play to try to evade this question, such as those that arise when people talk about the combinatory calculus (SK combinators) or other concatenative formal languages as requiring descriptions external to their physical implementation. To these objections I can respond with my original suggestion to Hutter: to whatever level of abstraction from the physical one may wish to retreat, there is still the challenge of simulating that level of physical abstraction as the substratum of universal computation. So pick your level, obstructionists. NOR gates? Maxwell’s equations? QED? I don’t care.

PPS: In a kind of meta-ironic question to ChatGPT, I get this response consisting of word salad tossed with fresh steaming bullshit:



ChatNPC’s answer to my brain teaser made me think a truly evil thought:

Tailor it with LessWrong’s corpus and …

The Applied Brain Research guys appear to have a breakthrough in language modeling, described in this video seminar. I didn’t want to talk about this video until I had given them a chance to enter the Hutter Prize, which I suggested they do during the seminar – but it’s been nearly 6 months and I haven’t heard anything from them. They outperform transformers on language modeling benchmarks and are technically able to scale up more easily and economically in terms of CUDA cores and VRAM.

I’ve been watching the guys at ABR ever since they published their first paper on Legendre Memory Units a few years ago because of the parsimony of their models measured in terms of parameters – as well as their orientation toward dynamical systems. Both of these are big advantages in terms of the Algorithmic Information Criterion for causal model selection. This is the most promising of the emerging language modeling technologies.
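For readers who want the flavor of why LMUs are so parsimonious: the memory is a fixed linear dynamical system whose coefficients are derived from Legendre polynomials, not learned. A minimal sketch, following the coefficient formulas published in the original LMU paper (the forward-Euler step below is the crudest possible discretization and is only illustrative; this is not ABR's implementation):

```python
def lmu_matrices(d):
    # State-space (A, B) coefficients of the Legendre Memory Unit,
    # following Voelker et al. (2019), 0-indexed:
    #   a_ij = (2i+1) * (-1 if i < j else (-1)^(i-j+1))
    #   b_i  = (2i+1) * (-1)^i
    # The d-dimensional memory m holds a Legendre expansion of the
    # last theta seconds of the input signal, with zero trained
    # parameters in the memory itself.
    A = [[(2 * i + 1) * (-1 if i < j else (-1) ** (i - j + 1))
          for j in range(d)] for i in range(d)]
    B = [(2 * i + 1) * (-1) ** i for i in range(d)]
    return A, B

def euler_step(m, u, A, B, dt_over_theta):
    # Crude forward-Euler discretization of theta * m'(t) = A m(t) + B u(t);
    # real implementations use a zero-order-hold discretization.
    return [m_i + dt_over_theta *
            (sum(a * m_j for a, m_j in zip(row, m)) + b_i * u)
            for m_i, row, b_i in zip(m, A, B)]
```

The entire d-dimensional memory costs zero trained parameters, which is exactly the sort of parsimony the Algorithmic Information Criterion rewards.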