Kaido Orav has just improved the Hutter Prize for Lossless Compression of Human Knowledge record by 1.38% with his “fx-cmix” entry.
The Hutter Prize winners have, since 2006, “predicted the next token” as the basis of language modeling, many years before predicting the next token was cool. To be fair, though, the Hutter Prize doesn’t restrict winners to mere “next token” prediction.
After all, scientists are free to repeatedly pore over their datasets, “compressing” them into world models. They are only restricted to next-observation prediction when designing experiments to test their models! But it is a good idea to select the best model when designing an experiment, just as it is when engineering a technology.
The Hutter Prize uses the size of an executable archive of the data as an approximation of the most principled information criterion for model selection:
Unlike the menagerie of less principled statistical information criteria, Algorithmic Information has been known since 1964 to be the least-biased approach to the natural sciences relative to a given selection of data. Since Wikipedia embodies wide-ranging knowledge encoded as language data, it was the Hutter Prize’s selection of data.
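To make the “archive size as model selection criterion” idea concrete, here is a minimal sketch in Python. It is only an illustration, not how any Hutter Prize entry works: the standard-library codecs zlib, bz2 and lzma stand in for competing “models”, and the names `archive_size`, `select_model`, `CANDIDATES` and the `decompressor_bytes` parameter are hypothetical. The sketch also assumes a decompressor size of zero by default, whereas the executable archive described above includes the program that reproduces the data.

```python
import bz2
import lzma
import zlib


def archive_size(compress, data, decompressor_bytes=0):
    """Bytes needed to reproduce `data`: compressed payload plus decompressor.

    `decompressor_bytes` is a hypothetical stand-in for the size of the
    program that undoes `compress`; it defaults to 0 as a simplification.
    """
    return len(compress(data)) + decompressor_bytes


def select_model(data, candidates):
    """Return the candidate with the smallest archive, plus all sizes."""
    sizes = {name: archive_size(fn, data) for name, fn in candidates.items()}
    best = min(sizes, key=sizes.get)
    return best, sizes


# Stand-in "models": general-purpose codecs from the standard library.
CANDIDATES = {
    "zlib": lambda d: zlib.compress(d, 9),
    "bz2": lambda d: bz2.compress(d, 9),
    "lzma": lzma.compress,
}

# Toy data; a real comparison would use the actual dataset.
sample = ("The Hutter Prize rewards lossless compression "
          "of human knowledge. " * 200).encode("utf-8")

best, sizes = select_model(sample, CANDIDATES)
print(f"raw size: {len(sample)} bytes")
for name, size in sorted(sizes.items(), key=lambda item: item[1]):
    print(f"{name}: {size} bytes")
print(f"selected model: {best}")
```

The selection rule is simply “smallest total wins”. Counting the decompressor matters once candidates differ in complexity, since otherwise a model could hide the data inside its own code.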
The Hutter Prize is a scientific research prize. The sibling benchmark for technology is Matt Mahoney’s Large Text Compression Benchmark, which (unlike the Hutter Prize) has no resource constraints. The general-purpose CPU constraint on the Hutter Prize is there, first and foremost, to avoid what Sara Hooker has described as “The Hardware Lottery”: existing technological infrastructure may disfavor radical scientific discoveries that could otherwise point the way to new and better techniques. General-purpose CPUs introduce less bias in research precisely because they are general purpose.