Kaido Orav's fx-cmix Wins 6911€ Hutter Prize Award!

Kaido Orav has just improved 1.38% on the Hutter Prize for Lossless Compression of Human Knowledge with his “fx-cmix” entry.

The Hutter Prize winners have, since 2006, “predicted the next token” as the basis of language modeling, many years before predicting the next token was cool. Although, to be fair, The Hutter Prize doesn’t restrict winners to mere “next token” prediction.

After all, scientists are free to repeatedly pore over their datasets, “compressing” them into world models. They are only restricted to next observation prediction when designing experiments to test their models! But it is a good idea to select the best model when designing an experiment, just as it is when engineering a technology.

The Hutter Prize uses the size of an executable archive of the data as an approximation of the most principled information criterion for model selection:

Algorithmic Information

Unlike the menagerie of less principled statistical information criteria, Algorithmic Information has been known since 1964 to be the least-biased approach to the natural sciences relative to a given selection of data. Since Wikipedia embodies wide-ranging knowledge encoded as language data, it was The Hutter Prize’s selection of data.
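To make “size of an executable archive” concrete, here is a minimal Python sketch. It is not the Hutter Prize's actual measurement procedure: zlib stands in for a real compressor, and `DECOMPRESSOR_SOURCE` and `archive_size` are hypothetical names used only for illustration. The point it shows is that the program needed to regenerate the data counts toward the total, so the number is an upper bound on the data's Algorithmic Information rather than the thing itself.

```python
import zlib

# Hypothetical stand-in for the decompressor program. In a self-extracting
# archive, the program that regenerates the data counts toward the total size.
DECOMPRESSOR_SOURCE = (
    b"import sys, zlib; "
    b"sys.stdout.buffer.write(zlib.decompress(sys.stdin.buffer.read()))"
)


def archive_size(data: bytes, level: int = 9) -> int:
    """Bytes of decompressor plus bytes of compressed payload."""
    payload = zlib.compress(data, level)
    # The archive must reproduce the data exactly (lossless).
    assert zlib.decompress(payload) == data
    return len(DECOMPRESSOR_SOURCE) + len(payload)


if __name__ == "__main__":
    regular = b"the cat sat on the mat. " * 400
    print("raw bytes:    ", len(regular))
    print("archive bytes:", archive_size(regular))
```

A better model of the data yields a smaller total, which is what makes the number usable for comparing models on a fixed corpus.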

The Hutter Prize is a scientific research prize. The sibling benchmark for technology is Matt Mahoney’s Large Text Compression Benchmark, which (unlike the Hutter Prize) has no resource constraints. The general purpose CPU constraint on the Hutter Prize is there, first and foremost, to avoid what Sara Hooker has described as “The Hardware Lottery”: Existing technological infrastructure may disfavor radical scientific discoveries that could otherwise point the way to new and better techniques. General purpose CPUs introduce less bias in research precisely because they are general purpose.


One of the more exasperating things about promoting the Hutter Prize – especially in places like ycombinator which has the imprimatur of Pope Sam “Gibz $7T” Altman – is the claim that large language models are evidence that achieving “AGI” requires orders of magnitude more data than the 1GB Wikipedia snapshot of the Hutter Prize.

Aside from the fact Hutter is widely recognized as the foremost authority on the rigorous definition of what “AGI” means in mathematical theory, there is the implication that the “throw everything including the kitchen sink at the learning algorithm” approach can achieve their less principled notions of “AGI”.

OK, fine.

So you, dear pseudonym-created-for-this-particular-exchange-only-to-disappear-once-youve-damaged-the-world, are certain the Hutter Prize is rendered worthless by <insert specious arguments>, right?

What would you consider an event that could change your mind?

At least then we can discuss your getting an insurance company to weigh in with a bet against the Hutter Prize’s value, in a manner not unlike that used to underwrite the Ansari X-Prize.

But then, of course, pseudonym-created-for-this-particular-exchange-only-to-disappear-once-youve-damaged-the-world disappears.


The “Grokked Transformers” field is starting to realize what I’ve been saying – if this week-old paper is to be believed:

Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization

An excerpt:

We find that the speed of improvement in generalization (grokking) …depends little on the absolute size of the training data.

What is important?

  1. the quality of the data and
  2. what they call “the critical data distribution”

When I first suggested Wikipedia as the prize corpus, Marcus mentioned that it was valuable because it was high quality data relative to the vast majority of natural language text available on the internet. (This despite my motive being to model the horrendous bias in its articles.) As for “critical data distribution” one might think of the kind of data that scientists seek when they design “critical experiments” to decide between competing models. In this respect Wikipedia isn’t so great.

Indeed, even from a “quality” standpoint, the aforelinked article on grokking transformers would see Wikipedia as abysmal. Despite all the care put into syntax if not semantics, the sentences are very far from the quasi-formal relations being expressed in the knowledge graphs used to train the grokking transformers.

Nevertheless, these guys are taking baby steps toward the day when it will be feasible to distill a natural language corpus, like Wikipedia, into various conflicting theories/formal systems/world models/foundation models. Indeed, one can already see this kind of thing in existing LLMs where one can prompt with something like:

From: Isaac Newton
To: Richard Dawkins

…and have it complete with a “Theory” of mind of both Isaac Newton and Richard Dawkins where the LLM’s Isaac Newton theory of mind has, within it, Isaac Newton’s own theory of mind of Richard Dawkins that he uses to decide how to express his ideas so that Richard can best understand/be persuaded.


There are a few efforts to create alternatives to Wikipedia for AI/compression corpora:

Left-wing: Towards a Books Data Commons for AI Training – Open Future

Right-wing: Brighteon.AI


Is there a “truth wing”? or an “actual reality wing”? Is the concept of “objective” now reduced to “hate speech”?


While those efforts have their own merits, they are fundamentally different in aim from the Hutter Prize as I intended it, and IMNSHO inferior in that aim.

We have to destroy The Ministry of Truth rather than fighting a rear-guard action. The nuclear weapon to destroy the Ministry of Truth is The Hutter Prize because it does everything The Ministry of Truth claims it is doing! The Ministry of Truth claims it is fighting “misinformation”, “bias”, “disinformation”, “fake news” by promoting “the science”, and on and on and on.

The reason Ingsoc inverts everything is to occupy the position of authority that would destroy it if it were occupied by anything remotely resembling what they claim they are!

Isn’t it obvious?

It’s HIV.

It targets the immune system for occupation.


Because it is the immune system whose job it is to detect it and destroy it.

The fundamental fallacy is the idea that a big steaming pile of half-truths based on various agendas advanced by liars with hidden identities and hidden agendas cannot, with sufficiently rigorous forensic epistemology, be admitted to evidence and thereby expose identities latent in the lies and expose their respective agendas.

I’m not saying such rigorous forensic epistemology is easy, but I am saying it is feasible so long as we restrict the amount of evidence that must be so-analyzed to just Wikipedia. If we do not restrict the amount of evidence, it makes the computational resources far more expensive hence the entire project less practical!


This is now easily my favorite AI YouTube channel, especially the four videos involving Grokked transformers, up to and including this one.

He sets forth a plausible business model for insurance at 54:24. Basically if you have structured knowledge, as in a business database, you can synthesize its quasi-formal language corpus and train the Grokked LLM with that. The *conjectured* result is an expert system that groks the business domain.

It is rather bemusing to me that people are befuddled by the phenomenon of “grokking” (and “double descent”) when to me it seems rather obvious what is going on:

When you adjust your loss function to more closely relate to the Algorithmic Information Criterion for causal model selection, what ends up happening is, at first, the conventional notion of an “error term” (squared error, etc.) dominates the gradient descent. Then it levels off and you apparently get no further improvement in your loss function for a long time because the regularization term(s) (the number of “parameters” in the model, ie: the number of algorithmic bits in the model) is a much smaller term in the loss function. But, it is still a small gradient. It is this reduction in the number of parameters that signals the onset of “grokking”.
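A minimal sketch of the two-part loss described above, under loudly stated assumptions: `two_part_loss`, the weight `lam`, and the use of a nonzero-parameter count as a proxy for “algorithmic bits” are all illustrative choices, not the loss used in any grokking paper. It only shows the relative sizes of the two terms at three hypothetical stages of training.

```python
def two_part_loss(errors, params, lam=0.01, tol=1e-9):
    """Return (error_term, complexity_term, total) for a toy two-part loss.

    error_term      : mean squared error over the residuals
    complexity_term : lam * (number of parameters not effectively zero),
                      a crude stand-in for the bits needed to state the model
    """
    error_term = sum(e * e for e in errors) / len(errors)
    n_active = sum(1 for p in params if abs(p) > tol)
    complexity_term = lam * n_active
    return error_term, complexity_term, error_term + complexity_term


if __name__ == "__main__":
    # Hypothetical snapshots; the numbers are made up for illustration.
    early = two_part_loss([1.0, -2.0, 1.0], [0.5] * 10)        # error dominates
    plateau = two_part_loss([0.0, 0.0, 0.0], [0.5] * 10)       # fits, but bloated
    grokked = two_part_loss([0.0, 0.0, 0.0], [0.5] * 2 + [0.0] * 8)
    for name, terms in (("early", early), ("plateau", plateau), ("grokked", grokked)):
        print(name, terms)
```

In the plateau snapshot the error term is already zero, so the only remaining way to lower the total is through the much smaller complexity term, which is the slow phase the paragraph above describes.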

Toward the end of the video he gets into ICL or “in context learning”. ICL, IMNSHO, is a dead end because it confuses inference with training. I don’t totally discount ICL’s potential either for some marginal benefit or for some sort of “black swan” breakthrough in ML, but I really don’t think it is a good idea to confuse inference with training. My simple test that demonstrates LLMs aren’t all they’re cracked up to be is intended to expose this distinction.

A visual representation of this appears in an earlier video:

The geometry of the model’s parameters becomes more symmetric, and therefore more compressible (ie: less random), at the onset of grokking.
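The link between regular structure and compressibility can be checked with a toy experiment. This sketch does not reproduce anything from the video: the “weights” are synthetic, `compressed_fraction` is a hypothetical helper, and zlib is only a weak, computable proxy for Algorithmic Information. It quantizes two parameter vectors to bytes and compares how well each compresses.

```python
import math
import random
import zlib


def compressed_fraction(weights, levels=256):
    """Quantize weights in [-1, 1] to bytes; return compressed size / raw size."""
    raw = bytes(
        max(0, min(levels - 1, int((w + 1.0) / 2.0 * levels))) for w in weights
    )
    return len(zlib.compress(raw, 9)) / len(raw)


if __name__ == "__main__":
    n = 4096
    rng = random.Random(0)  # fixed seed so the run is repeatable
    # Synthetic "pre-grokking" weights: no structure at all.
    random_weights = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    # Synthetic "post-grokking" weights: an exactly periodic pattern.
    structured_weights = [math.sin(2.0 * math.pi * (i % 64) / 64) for i in range(n)]
    print("random:    ", compressed_fraction(random_weights))
    print("structured:", compressed_fraction(structured_weights))
```

The periodic vector compresses to a small fraction of its raw size while the random one barely compresses at all, which is the sense of “less random” intended above.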