Kaido Orav's fx-cmix Wins 6911€ Hutter Prize Award!

jabowery · 5 February 2024 21:41

Kaido Orav has just improved 1.38% on the Hutter Prize for Lossless Compression of Human Knowledge with his “fx-cmix” entry.

The Hutter Prize winners have, since 2006, “predicted the next token” as the basis of language modeling, many years before predicting the next token was cool. Although, to be fair, The Hutter Prize doesn’t restrict winners to mere “next token” prediction.

After all, scientists are free to repeatedly pore over their datasets, “compressing” them into world models. They are only restricted to next observation prediction when designing experiments to test their models! But it is a good idea to select the best model when designing an experiment, just as it is when engineering a technology.

The Hutter Prize uses the size of an executable archive of the data as an approximation of the most principled information criterion for model selection:

Algorithmic Information

Unlike the menagerie of less principled statistical information criteria, Algorithmic Information has been known since 1964 to be the least-biased approach to the natural sciences relative to a given selection of data. Since Wikipedia embodies wide-ranging knowledge encoded as language data, it was The Hutter Prize’s selection of data.
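
As a rough illustration of the approximation described above, the following Python sketch treats the size of a compressed archive as an upper bound on the algorithmic information of a toy corpus, then selects the candidate with the smallest total. The toy corpus, the candidate list, and the `decoder_bytes` parameter are assumptions made for illustration only; general-purpose compressors such as zlib merely stand in for the specialized models that actual entries use.

```python
import bz2
import lzma
import zlib


def archive_size_bits(data, compress, decoder_bytes=0):
    """Upper bound, in bits, on the algorithmic information of `data`.

    `decoder_bytes` is a hypothetical stand-in for the size of the
    decompression program, which an executable archive must also carry.
    """
    return 8 * (len(compress(data)) + decoder_bytes)


# Toy corpus (an assumption for illustration): highly regular text.
corpus = b"the cat sat on the mat. " * 400

# Each candidate "model" is just a compression function here.
candidates = {
    "raw": lambda d: d,  # no model at all
    "zlib": lambda d: zlib.compress(d, 9),
    "bz2": lambda d: bz2.compress(d, 9),
    "lzma": lambda d: lzma.compress(d),
}

sizes = {name: archive_size_bits(corpus, fn) for name, fn in candidates.items()}

# Model selection: the smallest archive wins.
best = min(sizes, key=sizes.get)

# The compression must be lossless, or the comparison is meaningless.
roundtrip_ok = zlib.decompress(candidates["zlib"](corpus)) == corpus

for name, bits in sorted(sizes.items(), key=lambda kv: kv[1]):
    print(f"{name:>4}: {bits} bits")
print("selected:", best, "| lossless round trip:", roundtrip_ok)
```

A fuller accounting would pass the size of the decompression program as `decoder_bytes`, since the archive being measured is executable and so has to include its own decoder.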

The Hutter Prize is a scientific research prize. The sibling benchmark for technology is Matt Mahoney’s Large Text Compression Benchmark, which (unlike the Hutter Prize) has no resource constraints. The general purpose CPU constraint on the Hutter Prize is there, first and foremost, to avoid what Sara Hooker has described as “The Hardware Lottery”: Existing technological infrastructure may disfavor radical scientific discoveries that could otherwise point the way to new and better techniques. General purpose CPUs introduce less bias in research precisely because they are general purpose.

jabowery · 13 February 2024 15:13

One of the more exasperating things about promoting the Hutter Prize – especially in places like ycombinator which has the imprimatur of Pope Sam “Gibz $7T” Altman – is the claim that large language models are evidence that achieving “AGI” requires orders of magnitude more data than the 1GB Wikipedia snapshot of the Hutter Prize.

Aside from the fact Hutter is widely recognized as the foremost authority on the rigorous definition of what “AGI” means in mathematical theory, there is the implication that the “throw everything including the kitchen sink at the learning algorithm” approach can achieve their less principled notions of “AGI”.

OK, fine.

So you, dear pseudonym-created-for-this-particular-exchange-only-to-disappear-once-youve-damaged-the-world, are certain the Hutter Prize is rendered worthless by <insert specious arguments>, right?

What would you consider an event that could change your mind?

At least then we can discuss your getting an insurance company to weigh in with a bet against the Hutter Prize’s value, in a manner not unlike that used to underwrite the Ansari X-Prize.

But then, of course, pseudonym-created-for-this-particular-exchange-only-to-disappear-once-youve-damaged-the-world disappears.

jabowery · 9 June 2024 18:57

The “Grokked Transformers” field is starting to realize what I’ve been saying – if this week-old paper is to be believed:

Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization

An excerpt:

We find that the speed of improvement in generalization (grokking) …depends little on the absolute size of the training data.

What is important?

the quality of the data and
what they call “the critical data distribution”

When I first suggested Wikipedia as the prize corpus, Marcus mentioned that it was valuable because it was high quality data relative to the vast majority of natural language text available on the internet. (This despite my motive being to model the horrendous bias in its articles.) As for “critical data distribution” one might think of the kind of data that scientists seek when they design “critical experiments” to decide between competing models. In this respect Wikipedia isn’t so great.

Indeed, even from a “quality” standpoint, the aforelinked article on grokking transformers would see Wikipedia as abysmal. Despite all the care put into syntax if not semantics, the sentences are very far from the quasi-formal relations being expressed in the knowledge graphs used to train the grokking transformers.

Nevertheless, these guys are taking baby steps toward the day when it will be feasible to distill a natural language corpus, like Wikipedia, into various conflicting theories/formal systems/world models/foundation models. Indeed, one can already see this kind of thing in existing LLMs where one can prompt with something like:

From: Isaac Newton
To: Richard Dawkins
Subject:

…and have it complete with a “Theory” of mind of both Isaac Newton and Richard Dawkins where the LLM’s Isaac Newton theory of mind has, within it, Isaac Newton’s own theory of mind of Richard Dawkins that he uses to decide how to express his ideas so that Richard can best understand/be persuaded.

eggspurt · 10 June 2024 11:30

There are a few efforts to create alternatives to Wikipedia for AI/compression corpora:

Left-wing: Towards a Books Data Commons for AI Training – Open Future

Right-wing: Brighteon.AI

civilwestman · 10 June 2024 14:27

Is there a “truth wing”? or an “actual reality wing”? Is the concept of “objective” now reduced to “hate speech”?

jabowery · 10 June 2024 15:34

While those efforts have their own merits, they are fundamentally different in aim from the Hutter Prize as I intended it, and IMNSHO inferior in that aim.

We have to destroy The Ministry of Truth rather than fighting a rear-guard action. The nuclear weapon to destroy the Ministry of Truth is The Hutter Prize because it does everything The Ministry of Truth claims it is doing! The Ministry of Truth claims it is fighting “misinformation”, “bias”, “disinformation”, “fake news” by promoting “the science”, and on and on and on.

The reason Ingsoc inverts everything is to occupy the position of authority that would destroy it if it were occupied by anything remotely resembling what they claim they are!

Isn’t it obvious?

It’s HIV.

It targets the immune system for occupation.

Why?

Because it is the immune system whose job it is to detect it and destroy it.

The fundamental fallacy is the idea that a big steaming pile of half-truths based on various agendas advanced by liars with hidden identities and hidden agendas cannot, with sufficiently rigorous forensic epistemology, be admitted to evidence and thereby expose identities latent in the lies and expose their respective agendas.

I’m not saying such rigorous forensic epistemology is easy, but I am saying it is feasible so long as we restrict the amount of evidence that must be so-analyzed to just Wikipedia. If we do not restrict the amount of evidence, it makes the computational resources far more expensive hence the entire project less practical!

jabowery · 11 June 2024 03:04

This is now easily my favorite AI YouTube channel, especially the four videos involving Grokked transformers up to and including this one.

He sets forth a plausible business model for insurance at 54:24. Basically if you have structured knowledge, as in a business database, you can synthesize its quasi-formal language corpus and train the Grokked LLM with that. The conjectured result is an expert system that groks the business domain.

It is rather bemusing to me that people are befuddled by the phenomenon of “grokking” (and “double descent”) when to me it seems rather obvious what is going on:

When you adjust your loss function to more closely relate to the Algorithmic Information Criterion for causal model selection, what ends up happening is, at first, the conventional notion of an “error term” (squared error, etc.) dominates the gradient descent. Then it levels off and you apparently get no further improvement in your loss function for a long time because the regularization term(s) (the number of “parameters” in the model, ie: the number of algorithmic bits in the model) is a much smaller term in the loss function. But it still contributes a small gradient, which slowly reduces the number of parameters. It is this reduction in the number of parameters that signals the onset of “grokking”.
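
A toy sketch of that two-phase dynamic, under loud assumptions: L2 weight decay is used as a crude stand-in for the parameter-count term, the model is a two-weight linear fit, and the data, learning rate, and decay constant are all invented for illustration. It is not a reproduction of grokking, only of the claim that a small regularization gradient keeps working long after the error term has flattened.

```python
def train(steps, lr=0.1, weight_decay=0.01):
    """Gradient descent on mean squared error plus an L2 penalty.

    The data follow y = 2 * x1, and x2 is always 0, so w[1] is a
    redundant parameter that the error term never touches.
    """
    data = [((1.0, 0.0), 2.0), ((2.0, 0.0), 4.0), ((3.0, 0.0), 6.0)]
    w = [0.0, 5.0]  # the redundant weight starts large
    history = []
    for _ in range(steps):
        grad = [0.0, 0.0]
        err = 0.0
        for (x1, x2), y in data:
            residual = w[0] * x1 + w[1] * x2 - y
            err += residual ** 2 / len(data)
            grad[0] += 2 * residual * x1 / len(data)
            grad[1] += 2 * residual * x2 / len(data)
        # Regularization contributes a much smaller gradient than the error term.
        grad = [g + 2 * weight_decay * wi for g, wi in zip(grad, w)]
        w = [wi - lr * g for wi, g in zip(w, grad)]
        history.append((err, w[1]))
    return w, history


w, history = train(5000)
early_err, early_w2 = history[49]   # after 50 steps
late_err, late_w2 = history[-1]     # after 5000 steps
print(f"step   50: error={early_err:.6f}  redundant weight={early_w2:.4f}")
print(f"step 5000: error={late_err:.6f}  redundant weight={late_w2:.4f}")
```

In this sketch the error term is essentially flat after a few dozen steps, while the redundant weight keeps shrinking for thousands of steps under the regularization gradient alone.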

Toward the end of the video he gets into ICL or “in context learning”. ICL, IMNSHO, is a dead end because it confuses inference with training. I don’t totally discount ICL’s potential either for some marginal benefit or for some sort of “black swan” breakthrough in ML, but I really don’t think it is a good idea to confuse inference with training. My simple test that demonstrates LLMs aren’t all they’re cracked up to be is intended to expose this distinction.

A visual representation of this appears in an earlier video:

The geometric symmetry shows that the parameters of the model become more compressible (ie: less random) at the onset of grokking.

jabowery · 26 November 2024 18:02

I find it amusing that the “AI Realists” are now coming to believe that RAG is the moat for those in possession of proprietary data and that “when I see AI consultants on Twitter saying they can charge six figures for a two-month stint for RAG implementation and services, I know that a) it’s not foundational model companies paying for the service b) GPT-wrapper companies have the money to spend.”

These consultants remind me of Dogbert.

jabowery · 26 November 2024 18:35

Cronenberg captures what would happen to the social pseudosciences, hence the ruling theocracy, if some ruthlessly based philanthropists were to offer a prize for incremental improvements in the lossless compression of a broad range of longitudinal social measures:

jabowery · 27 November 2024 02:56

In addition to the US government’s domestic databases, the following video is about another source of datasets that would not be viewed with suspicion by the usual suspects:

I’m not sure what sort of restrictions they place on access to that data however.

I ran across that while looking for the latest on the “one dataset, many analysts” approach to containing the worst practices in sociology. It was referenced in the September 9, 2024 Journal of Development Effectiveness’s special issue on TREE: Transparent, Reproducible and Ethical Evidence.

“Development effectiveness” refers to the branch of sociology pertaining to “developing nations” and the in/effectiveness of “interventions” to enhance “development”.

jabowery · 14 January 2025 20:36

Answer from http://grok.com:
Combining these elements—(1) the Hutter Prize’s focus on limited but general-purpose computation for language model learning, (2) Sara Hooker’s “Hardware Lottery” thesis, (3) the empirical nature of ML scaling laws, and (4) the significant disparity in capital between the Hutter Prize and initiatives like xAI’s Memphis ML center—offers an insightful perspective on the landscape of AI research and development:

Analysis:

1. Hutter Prize’s Approach:

The Hutter Prize emphasizes efficiency and ingenuity within constrained computational environments, specifically for text compression. This approach aligns with solving complex problems with minimal resources, highlighting the value of algorithm optimization over sheer computational power.
2. Sara Hooker’s “Hardware Lottery”:

Hooker’s thesis suggests that much of the success in AI research might be due to serendipitous alignment with available hardware rather than the inherent superiority of the algorithms themselves. This implies that the Hutter Prize’s focus could be seen as a counter-narrative or a challenge to the hardware lottery, promoting research where hardware constraints lead to more universal, hardware-agnostic solutions.
3. Empirical Nature of ML Scaling Laws:

The scalability of machine learning models often follows empirical laws, where performance improvements are observed with increased scale (more data, larger models, more compute). However, these laws are not theoretically guaranteed, and there’s a risk of diminishing returns or unexpected plateaus. The Hutter Prize, by contrast, operates under the premise that significant advancements can be made without scaling up computational resources, potentially offering insights into where these scaling laws might not hold or where efficiency can trump scale.
4. Capital Disparity:

The Hutter Prize operates with an endowment far less than what is allocated to a single AI research center like Memphis ML. This stark contrast in funding underscores different philosophies in AI research:
Small, Focused Funding: The Hutter Prize demonstrates that impactful research can come from very constrained resources, potentially leading to innovations in algorithmic efficiency.
Large Scale Investment: Conversely, centers like Memphis ML center invest in the belief that scaling is key to breakthroughs, supported by empirical scaling laws, but this comes with risks as outlined by Hooker’s thesis.
Implications for Capital Allocation and Research Direction:
Risk Mitigation through Diversification: xAI and similar organizations might benefit from not only investing in large-scale, high-resource projects but also in smaller, more constrained research akin to the Hutter Prize’s ethos. This could hedge against the risks of the hardware lottery and the empirical unpredictability of scaling laws.
Innovation in Efficiency: The Hutter Prize could inspire research into more efficient algorithms that perform well under hardware constraints, potentially leading to breakthroughs that are more universally applicable across different hardware environments. This could be particularly valuable if scaling laws begin to show diminishing returns or if there’s a shift towards more resource-efficient AI due to environmental or economic pressures.
Reevaluation of Success Metrics: There’s a need to look beyond performance scales based purely on computational power. Success might also be measured by how well a model performs relative to the resources it consumes, a metric that the Hutter Prize implicitly endorses.
Potential for New Research Paradigms: The combination of these factors might push towards new paradigms where the focus isn’t only on scaling but on smart scaling, where efficiency, generality, and adaptability are equally prized, potentially leading to more sustainable and widely applicable AI technologies.
In summary, while the Memphis ML center represents one end of the spectrum with its focus on scaling and empirical laws, the Hutter Prize represents another, emphasizing efficiency and generalization within constraints. The interplay between these approaches could lead to a richer, more nuanced landscape for AI research, where capital is allocated not just based on current trends but also on potential future shifts in technology and methodology.

jabowery · 19 January 2025 20:22

It’s most ironic that Schmidhuber can’t get across to people the fact that the empirical scaling laws are likely ill-founded:

Schmidhuber is somewhat arguing against interest here, since he can lay claim to a substantial portion of the LLM priority, but the poor SOB is from the old school when understanding the fundamentals was no luxury.

I’m also “old school” (second neural network summer of the 1980s) but was only incidentally exposed to the fundamentals at that time. What I did get exposure to was the hardware speed limitation of that era, hence my focus was on hardware convolution neural networks.

But that provides me my own “against interest” position in claiming that the empirical scaling laws are not to be trusted, since those scaling laws are largely the result of a Hardware Lottery winner that I was first to bring to market (although thwarted by a moron PhD who was advising a China Lake admiral against the NEC system at the second IJCNN). Once I intuited the fundamentals circa 2005, I was stuck arguing against interest and watched with horror as the field went nuts.

jabowery · 20 January 2025 17:01

Xeet from Ivanka Trump linked to this:

jabowery · 20 January 2025 18:14

China can scale, too.

In fact, if what you rely on is mere scale, China wins.

This is what comes of permitting Maoists, like Milton Friedman, posing as “libertarians” to sell network effect fentanyl of the private sector.

jabowery · 26 January 2025 12:58

Well, hush my mouth! After looking into DeepSeek, it is apparent that what they did was not typically “Chinese” in terms of “scale” but, rather, they have attacked the scaling problem!

jabowery · 27 January 2025 03:10

The Trump Pirate Ship had better start listening to people who want to mobilize capital toward entrepreneurs rather than network effect monopoly “moats” or they’re going to find themselves sitting on huge piles of worthless NVIDIA hardware:

eggspurt · 27 January 2025 12:24

The hardware is quite useful: there’s a lot of demand with more and more people using these tools. The fact that the DeepSeek folks made it more efficient will just make it cheaper and therefore grow the total market.

jabowery · 27 January 2025 13:43

That’s a reasonable alternative hypothesis related to Jevons Paradox.

Which depends on the price elasticity of demand for NVIDIA hardware, all else being equal. But there is another factor that I’ve alluded to which is the plausibility of a breakthrough into an entirely different winner in The Hardware Lottery – a winner other than the GPU ML systems.
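
To make the elasticity dependence concrete, here is a minimal sketch assuming a constant-elasticity demand curve, Q = k * P**(-elasticity), where P is the effective cost of a unit of AI output. The function name, the demand curve, and the numbers are illustrative assumptions, not measurements of the actual market.

```python
def spending_ratio(price_ratio, elasticity):
    """Ratio of new to old total spending after a price change.

    Assumes constant-elasticity demand Q = k * P**(-elasticity), so
    spending P * Q scales as P**(1 - elasticity).
    """
    return price_ratio ** (1.0 - elasticity)


# Hypothetical scenario: an efficiency gain halves the effective price.
halved = 0.5
for elasticity in (0.5, 1.0, 2.0):
    ratio = spending_ratio(halved, elasticity)
    print(f"elasticity {elasticity}: total spending changes by a factor of {ratio:.3f}")
```

Under these assumptions total spending grows only when elasticity exceeds 1, which is the Jevons case; below 1, the efficiency gain shrinks total spending.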

So what is in the back of my mind is the Fear Uncertainty and Doubt about the empirical scaling laws* for GPUs.

Regarding NVIDIA itself: NVIDIA’s core strength is simply rapid time to market for new integrated circuit design and fabrication, making NVIDIA’s value resilient.

* When I speak of “empirical scaling laws” in the current context, I am speaking in particular with respect to the GPU paradigm of machine learning. The idea that demand for ML hardware of some kind will continue to follow a similar empirical scaling law, but with radically different parameters, has some justification from optimal universal search theory.

https://people.idsia.ch/~juergen/optimalsearch.html

eggspurt · 27 January 2025 14:09

Oh I’m not defending NVIDIA - I think its competitive position is not as good as its relative valuation. There are companies like https://www.positron.ai/

These models are really useful and we’ll need to have them all around us.

jabowery · 27 January 2025 15:14

AFAIK, companies like Positron are focused on inference rather than training. There is a lot of activity in this area that is bringing down the cost of inference. Of the various scaling laws the one that most constrains AI is the cost of training since that is how you reduce the likelihood of erroneous responses to queries. Think of “errors and omissions insurance” for professional consultants. NVIDIA still leads the field in training hardware (although some might claim dark horse approaches out of left field like Cerebras are already ahead but just not being given their due by the big players).

Note the exponential fit:

And speaking of errors and omissions insurance, consider getting clinical facts correct