Kaido Orav's fx-cmix Wins 6911€ Hutter Prize Award!

The “Grokked Transformers” field is starting to realize what I’ve been saying – if this week-old paper is to be believed:

Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization

An excerpt:

We find that the speed of improvement in generalization (grokking) …depends little on the absolute size of the training data.

What is important?

  1. the quality of the data and
  2. what they call “the critical data distribution”

When I first suggested Wikipedia as the prize corpus, Marcus mentioned that it was valuable because it was high-quality data relative to the vast majority of natural-language text available on the internet. (This despite my own motive being to model the horrendous bias in its articles.) As for “critical data distribution”, one might think of the kind of data that scientists seek when they design “critical experiments” to decide between competing models. In this respect, Wikipedia isn’t so great.

Indeed, even from a “quality” standpoint, the paper linked above would see Wikipedia as abysmal. Despite all the care put into its syntax, if not its semantics, Wikipedia’s sentences are very far from the quasi-formal relations expressed in the knowledge graphs used to train the grokking transformers.
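
To make “quasi-formal relations” concrete, here is a toy sketch (mine, not the paper’s actual data; the entities and relation names are made up) contrasting a Wikipedia-style sentence with the (head, relation, tail) atomic facts and two-hop compositions that, as I read it, the paper’s synthetic knowledge graphs consist of:

```python
# Toy contrast: a Wikipedia-style sentence vs. the kind of quasi-formal
# (head, relation, tail) facts the grokking paper trains on (my reading of its setup).

sentence = (
    "Isaac Newton, who was born in Woolsthorpe, a hamlet in Lincolnshire, "
    "England, developed the theory of universal gravitation."
)
print(sentence)

# Roughly the same content expressed as atomic facts.
atomic_facts = [
    ("Isaac Newton", "born_in", "Woolsthorpe"),
    ("Woolsthorpe", "located_in", "Lincolnshire"),
    ("Lincolnshire", "located_in", "England"),
]

def compose(facts):
    """Chain pairs of atomic facts into two-hop 'inferred facts', e.g.
    (Newton, born_in, Woolsthorpe) + (Woolsthorpe, located_in, Lincolnshire)
    -> (Newton, (born_in, located_in), Lincolnshire)."""
    derived = []
    for h1, r1, t1 in facts:
        for h2, r2, t2 in facts:
            if t1 == h2:
                derived.append((h1, (r1, r2), t2))
    return derived

for fact in compose(atomic_facts):
    print(fact)

# A model that groks data like this has internalized the composition rule,
# not a lookup table of training triples; Wikipedia's prose never states
# that rule explicitly.
```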

Nevertheless, these guys are taking baby steps toward the day when it will be feasible to distill a natural-language corpus like Wikipedia into various conflicting theories/formal systems/world models/foundation models. Indeed, one can already see this kind of thing in existing LLMs, where one can prompt with something like:

From: Isaac Newton
To: Richard Dawkins
Subject:

…and have it complete the message with a “theory of mind” of both Isaac Newton and Richard Dawkins, where the LLM’s theory of Isaac Newton’s mind contains, nested within it, Isaac Newton’s own theory of Richard Dawkins’s mind, which its “Newton” uses to decide how to express his ideas so that Dawkins can best understand and be persuaded.
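
For the curious, here is a minimal sketch of actually issuing that prompt, assuming the OpenAI Python client and its legacy completions endpoint; the model name and sampling settings are placeholders, and any completion-capable model would do:

```python
# Minimal sketch: feed the "email header" prompt to a completion model and
# let it continue. Model name and settings are placeholders, not endorsements.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "From: Isaac Newton\n"
    "To: Richard Dawkins\n"
    "Subject:"
)

response = client.completions.create(
    model="gpt-3.5-turbo-instruct",  # any completion-capable model works here
    prompt=prompt,
    max_tokens=400,
    temperature=0.8,
)

# The continuation is where the nested theories of mind show up: the model's
# "Newton" picks a subject line and an argument pitched at its "Dawkins".
print(prompt + response.choices[0].text)
```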
