Algorithmic Information of the Human Genome

What if there were a prize like the Hutter Prize for Lossless Compression of Human Knowledge, but with the Wikipedia corpus replaced by a database of human genomes?

The idea would be to approximate the Algorithmic Information of humanity’s population genetics.

This turns the science of human history into a market where money talks and bullshit walks.

People make all kinds of inferences about history (and prehistory) from present data sources – sometimes reaching billions of years into the past (e.g. cosmology). When doing so, they are inferring historic processes. Algorithms model processes so as to “fit” the observed data. Algorithmic Information Theory tells us that if computable processes are generating our observations, we should be able to simulate those processes with algorithms. It further tells us that choosing the smallest algorithm (measured in bits of instructions) that exactly reproduces our prior observations (including errors, as literal constants) is the best anyone can do in deciding which model will yield the most accurate predictions.
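
To be concrete about the criterion being appealed to here (standard algorithmic-information / two-part-code notation; the formulation is mine, not a quotation from Hutter):

```latex
K_U(x) = \min\{\, |p| : U(p) = x \,\}
\qquad\text{and, in practice,}\qquad
\hat{M} = \arg\min_{M}\,\bigl[\, L(M) + L(x \mid M) \,\bigr]
```

Here U is a universal machine, L(M) is the number of bits needed to state a model and L(x | M) the number of bits needed to reproduce the observations x exactly given that model, with every deviation spelled out as a literal constant; the winning model is the one that minimizes the total.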

So, if we start with any dataset and get people competing to compress it without loss, we have turned the scientific domain of that dataset into a market! Forget about “prediction markets” like Ideosphere or Metaculus – this is the way forward in the age of Moore’s Law and Big Data.

Now consider what happens with a dataset of human genomes. What are the historic processes that generated those genomes? Isn’t that an interesting question? Wouldn’t you prefer to have some objective – even automatic – criterion for judging which model of history was best, given a standard and unbiased source of data? Note, I’m not talking here about “machine learning” or “artificial intelligence”. I’m just talking about a bullshit filter that works the way a high-trust society’s market works: filtering out bullshit at minimum transaction cost and letting competent human actors flourish.

Why start with a database of human genomes? The main reason is to avoid rhetorical arguments over “bias in the data”. The point is to minimize the “argument surface” because that is how theocrats attack scientists: By increasing the “argument surface” and thereby admitting more subjective criteria to any decision.

And we have to be careful here, but not just for the technical or theoretical reasons many might like to throw in the way of this to increase the argument surface. We have to be careful precisely because so many people would like to prevent the emergence of a genuine revolution that advances the natural sciences – just as there was religious objection to early advances in the philosophy of natural science. And, like the old theocracies, these people are not only powerful, they are increasingly aware that they are intellectually bankrupt abusers of their power. They’re so scared that they are doing everything they can to keep the tsunami of data and computation from meeting up with Algorithmic Information Theory to produce an objective judging criterion that can create a genuine scientific market.

They are already restricting access to human genome data with contracts binding people not to study socially relevant differences between races, despite the fact that Federal law uses “race” as a legal category in such questionable social policies as “affirmative action”.

Let me give you some background as to why I think we might be able to blast a hole in the city gate of this theocratic stronghold:

In 2005, at about the same time I suggested to Marcus Hutter the idea for The Hutter Prize for Lossless Compression of Human Knowledge, I wrote a little essay titled “Removing Lewontin’s Fallacy From Hamilton’s Rule”. The idea is pretty simple:

W. D. Hamilton came up with the basis of “kin selection” theory: “Hamilton’s inequality”, which basically just says that if a gene produces behaviors that help one’s relatives have enough children, then that gene is likely to spread even if one has no children oneself*. Lewontin then took that word “gene” and used it to convince nubile Boomer chicks with fat trust funds attending Harvard that “race is a social construct” because “there is more genetic variation within than between races”. We’ve all heard that one, because he then teamed up with his red diaper baby colleague, Stephen Jay Gould, to popularize their bullshit. They then went on the offensive to attack E. O. Wilson’s sociobiology as though it were a political rather than scientific controversy. I’m sure they got lots of Boomer coed nookie out of this at the height of the sexual revolution. But that’s no big deal. The big deal is that they held back the social sciences, and thence social policy, by many decades. Moreover, this may ultimately kill tens if not hundreds of millions as the proximate cause of a rhyme with The Thirty Years War for freedom from their theocratic bullshit.
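
For reference, the inequality itself in its usual textbook form (my notation, stripped of the footnoted complications):

```latex
r B > C
```

where r is the coefficient of relatedness between actor and recipient, B the reproductive benefit to the recipient and C the reproductive cost to the actor. The essay’s move, described in the next paragraph, is to read r off genetic correlation structures rather than off a single shared “gene”.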

Decades later, A. W. F. Edwards published “Human genetic diversity: Lewontin’s Fallacy”, wherein he pointed out that the word “gene” is inadequate for talking about race as a biological construct. One must talk about “genetic correlation structures”. If we then turn around and replace the word “gene” with “genetic correlation structure” in Hamilton’s theory of kin selection, we have disambiguated that theory and thereby pried some of Lewontin’s and Gould’s cold cadaverous fingers from the throat of Western Civilization – hopefully before their theocracy kills it entirely.

The only way one can minimize the size of an algorithm that outputs a human genome dataset is by identifying the genetic correlation structures that most parsimoniously fit the historic processes that created (and destroyed) them.
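
A toy illustration of the point, with zlib standing in crudely for a real genome model (the population size, sequence length and mutation rate are invented parameters, and “store the ancestor plus per-genome diffs” is about the simplest correlation structure one could exploit):

```python
import random
import zlib

random.seed(0)
BASES = "ACGT"

def mutate(seq, rate=0.002):
    """Copy a sequence, substituting each base with probability `rate`."""
    return "".join(random.choice(BASES) if random.random() < rate else b for b in seq)

# A made-up "ancestral" sequence and a small population descended from it.
ancestor = "".join(random.choice(BASES) for _ in range(100_000))
population = [mutate(ancestor) for _ in range(20)]

# Model 1: treat the genomes as unrelated -- compress each one independently.
independent = sum(len(zlib.compress(g.encode())) for g in population)

# Model 2: exploit the shared structure -- store the ancestor once, then only
# each descendant's differences (locus:base), from which it can be rebuilt exactly.
def diffs(ref, seq):
    return ";".join(f"{i}:{b}" for i, (a, b) in enumerate(zip(ref, seq)) if a != b)

shared = len(zlib.compress(ancestor.encode())) + sum(
    len(zlib.compress(diffs(ancestor, g).encode())) for g in population
)

print("independent model:    ", independent, "bytes")
print("shared-ancestry model:", shared, "bytes")
```

On these made-up numbers the shared-ancestry encoding comes out around an order of magnitude smaller than compressing each genome separately, while remaining perfectly lossless; a serious entry would be doing the same thing with far richer models of the historic processes that generated the correlations.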

But what about access to that genomic data locked up behind the city-gate of the theocracy’s stronghold?

One of the beautiful things about the Algorithmic Information Criterion for model selection is that one need not have direct access to the data in order for one’s model to win! All one needs is a model that can take whatever is in the hidden dataset and turn it into a program smaller than the programs others turn it into. One might call this “lossless compression”, but that is misleading, because it is domain-specific lossless compression – not zip or bzip, etc. In other words, you have a better way of simulating the history that generated the set of human genomes. You have a better model of historic processes – one that can, without knowing the details of the dataset, better compress those genomes sight unseen.

Submit the compression program and judge solely on the size of the executable archive output by the compressor – all judging done automatically!
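
A minimal sketch of what that automatic judging could look like (the file names, and the convention that a submission is a single self-extracting executable writing the dataset to stdout, are my assumptions, not any existing prize’s rules):

```python
import hashlib
import os
import subprocess

DATASET = "genomes.fa"           # the reference dataset (hypothetical file name)
SUBMISSION = "./self_extractor"  # entrant's self-extracting executable (hypothetical)

def score(submission_path, dataset_path, timeout=3600):
    """Return the submission's size in bytes if running it reproduces the
    dataset bit-for-bit, otherwise None. The score IS the size: smaller wins."""
    produced = subprocess.run(
        [submission_path], capture_output=True, timeout=timeout
    ).stdout
    with open(dataset_path, "rb") as f:
        reference = f.read()
    if hashlib.sha256(produced).digest() != hashlib.sha256(reference).digest():
        return None  # lossy or wrong output: disqualified
    return os.path.getsize(submission_path)

if __name__ == "__main__":
    result = score(SUBMISSION, DATASET)
    print("disqualified" if result is None else f"score: {result} bytes")
```

A real harness would also cap memory and run time, much as the Hutter Prize does, but nothing about the judging requires a human in the loop.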

Now, I’m not necessarily saying that Harvard or any other theocratic stronghold is going to permit a fair scientific competition pertaining to human society – they will permit that only over their dead bodies, with us trying to pry their cold dead fingers from their data, endowment and coeds. But if any substantial set of human genome data is made available to this objective market dynamic, even if the data is held behind a veil of secrecy, it could blow a hole in the theocracy’s city-gate.

*I won’t go into the controversy involving the late E.O. Wilson here regarding the evolution of sterile workers in eusocial species except to say that everyone is talking past each other because they can’t face the awful truth of the human condition presenting itself in the present descent into a rhyme with The Thirty Years War.

9 Likes

I should point out that, just as with any theocracy, the state finances an industrial-scale apologia for its theology. Ever since Edwards’s 2003 paper criticizing Lewontin, there has been a robust business sector putting spin on “What Lewontin really meant!” As with virtually all theocratic apologia, these works are verbose and vacuous. A good example is the very recent paper from the Royal Society (originally established to separate church and science) titled “The background and legacy of Lewontin’s apportionment of human genetic diversity”. To boil down the vacuous bullshit of all this apologia to its essence:

Lewontin was asking an innocently autistic meaningless question and got an innocently autistic meaningless answer. It’s not Lewontin’s fault that his meaningless answer was turned into a “sound bite” that sounded like it answered a profoundly meaningful question with implications for public policy affecting the lives of virtually everyone in the West. It was all their fault and, worst of all are the RACISTS like Edwards who criticized poor little innocent autistic Lewontin for asking and answering a meaningless question! How did Lewontin know people were going to go apeshit? I mean, even that bad boy Gould didn’t really have Lewontin’s support in advancing this “sound bite”!

Don’t bother reading the industrial exegesis seeking to “clarify” Lewontin in the wake of Edwards’s critique. It all boils down to the above. To clarify what I mean by “meaningless”, consider how important it would be to ask the following question:

How much taxonomic importance can we attach to the presence of a particular nucleotide at a particular locus on the human genome?

To take this further, ask yourself:

If you are given a bunch of files and a bit offset, how well can you classify the files based solely on whether there is a 1 or 0 at that offset?

A pathetic autist might ask a question like that, seek to quantify the answer and publish a paper in a prestigious journal – but we shouldn’t harass the poor kid about it! Now shut up and pay reparations!
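
If you want to feel just how little that question asks, here is a throwaway sketch (synthetic bit-string “files”, invented group structure and noise level) comparing the best you can do from a single offset with what the whole bit string gives you:

```python
import random

random.seed(1)
N_FILES, LENGTH = 200, 4096   # made-up corpus: 200 bit-string "files" of 4096 bits

def make_group(template, n, noise=0.45):
    """Files in a group share a template; each bit is flipped with prob `noise`,
    so within-group variation at any single offset swamps the between-group signal."""
    return [[b ^ (random.random() < noise) for b in template] for _ in range(n)]

template_a = [random.getrandbits(1) for _ in range(LENGTH)]
template_b = [random.getrandbits(1) for _ in range(LENGTH)]
files = make_group(template_a, N_FILES // 2) + make_group(template_b, N_FILES // 2)
labels = ["A"] * (N_FILES // 2) + ["B"] * (N_FILES // 2)

def accuracy_single_bit(offset):
    """Best possible deterministic rule using only the bit at `offset`."""
    best = 0
    for rule in (("A", "B"), ("B", "A")):   # bit == 0 -> rule[0], bit == 1 -> rule[1]
        hits = sum(rule[f[offset]] == y for f, y in zip(files, labels))
        best = max(best, hits)
    return best / N_FILES

def accuracy_whole_file():
    """Classify by Hamming distance to each group template (uses all the structure)."""
    hits = 0
    for f, y in zip(files, labels):
        d_a = sum(x != t for x, t in zip(f, template_a))
        d_b = sum(x != t for x, t in zip(f, template_b))
        hits += ("A" if d_a < d_b else "B") == y
    return hits / N_FILES

best_single = max(accuracy_single_bit(o) for o in range(0, LENGTH, 64))
print(f"best single-offset accuracy (64 offsets tried): {best_single:.2f}")
print(f"whole-file accuracy: {accuracy_whole_file():.2f}")
```

On these invented numbers the best single offset does little better than a coin flip while the aggregate over all offsets classifies essentially perfectly – which is exactly the gap between Lewontin’s per-locus apportionment and Edwards’s correlation structures.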

4 Likes

It appears there are now openly available databases of complete genome sequences from a wide variety of human populations, for example the 1000 Genomes Project, which is maintained by the International Genome Sample Resource (IGSR), now operated by the Data Coordination Centre of EMBL-EBI.


These kinds of open resources might provide the raw material for a compression challenge beyond reach of the academic gatekeepers.
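
For anyone wanting to poke at that data, here is a rough sketch of pulling a genotype matrix out of a VCF file with nothing but the standard library (the file name is a placeholder; the 1000 Genomes per-chromosome file names and URLs would need to be looked up on the IGSR site):

```python
import gzip

VCF_PATH = "chr22.genotypes.vcf.gz"  # placeholder for a 1000 Genomes per-chromosome VCF

def read_genotype_matrix(path, max_variants=1000):
    """Parse a (gzipped) VCF into sample names plus rows of alternate-allele
    counts (0, 1 or 2 per diploid sample; -1 where the call is missing)."""
    samples, rows = [], []
    with gzip.open(path, "rt") as f:
        for line in f:
            if line.startswith("##"):
                continue                                     # meta-information lines
            if line.startswith("#CHROM"):
                samples = line.rstrip("\n").split("\t")[9:]  # fixed VCF columns end at index 8
                continue
            fields = line.rstrip("\n").split("\t")
            row = []
            for call in fields[9:]:
                gt = call.split(":")[0]                      # GT comes first in each sample field
                alleles = gt.replace("|", "/").split("/")
                row.append(-1 if "." in alleles else sum(a != "0" for a in alleles))
            rows.append(row)
            if len(rows) >= max_variants:
                break
    return samples, rows

samples, genotypes = read_genotype_matrix(VCF_PATH)
print(len(samples), "samples,", len(genotypes), "variants read")
```

From there, the genotype matrix is exactly the kind of object a domain-specific compressor would be trying to model.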

4 Likes

Shi Huang’s tweet-thread exegesis, “The collective effects of genetic variants and complex traits”, illustrates just how dangerous to the theocracy a principled causal model selection criterion for population-level human evolutionary dynamics could be.

4 Likes

I don’t think his argument that higher genetic diversity among the San invalidates the “Out of Africa” theory makes any sense. Higher genetic diversity and low selection pressure for intelligence would seem to me to indicate that the area is where humanity originated.

4 Likes

And the beauty of the Algorithmic Information Criterion is that his underlying model of evolutionary dynamics and yours would be operationalized in such a way that selection of the state-of-the-art (SOTA) model would be unbiased and optimal.

It’s just that it would require hard work on the part of the proponents, and the complaint about such hard work is, as I have described before regarding such controversies, about as appealing as “But the dog ate my homework!”

4 Likes


Sad, indeed.

Jürgen Schmidhuber, one of the most influential researchers in AGI – having tutored Marcus Hutter, who was himself the PhD advisor of DeepMind co-founder Shane Legg – is considered something of a joke among “influencers” of what I’ll call the third neural net summer (the one that arose on the strength of GPUs). He’s a “joke” because he insists on calling attention to the fact that virtually all of what are considered the fundamental advances of the third summer were published during the second, and he was an author of many if not most of those papers.

2 Likes

The great Google development of artificial intelligence moves toward AGI with LaMDA on the latest server farms of GPUs – and yet a guy named John Walker made a similarly advanced neural network in 1987 for the Commodore 64.

7 Likes

Somehow, I am not surprised. Welcome to Scanalyst!

2 Likes

TDSTF: Transformer-based Diffusion probabilistic model for Sparse Time series Forecasting is particularly interesting for sparse datasets such as the human genome and social data because it applies a radically new approach to imputation of missing data. Although the paper is ostensibly about time series forecasting, its lack of recurrence puts it in the same bind as all other transformer-based sequence models.

HOWEVER, there is increasing interest in recurrent transformer models, with the expected good results, as in Block-Recurrent Transformers:

We introduce the Block-Recurrent Transformer, which applies a transformer layer in a recurrent fashion along a sequence, and has linear complexity with respect to sequence length. Our recurrent cell operates on blocks of tokens rather than single tokens during training, and leverages parallel computation within a block in order to make efficient use of accelerator hardware. The cell itself is strikingly simple. It is merely a transformer layer: it uses self-attention and cross-attention to efficiently compute a recurrent function over a large set of state vectors and tokens. Our design was inspired in part by LSTM cells, and it uses LSTM-style gates, but it scales the typical LSTM cell up by several orders of magnitude. Our implementation of recurrence has the same cost in both computation time and parameter count as a conventional transformer layer, but offers dramatically improved perplexity in language modeling tasks over very long sequences. Our model out-performs a long-range Transformer XL baseline by a wide margin, while running twice as fast. We demonstrate its effectiveness on PG19 (books), arXiv papers, and GitHub source code. Our code has been released as open source.
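
To make the quoted description concrete, here is a stripped-down sketch of the recurrent-cell idea in PyTorch – my own toy rendering of “self-attention plus cross-attention plus LSTM-style gates”, not the authors’ released code; the dimensions, the single gating layer and the omission of masking and normalization are all simplifications of mine:

```python
import torch
import torch.nn as nn

class BlockRecurrentCell(nn.Module):
    """Toy block-recurrent cell: a transformer layer applied to a block of tokens
    that also reads from, and gates its updates into, a fixed set of state vectors."""

    def __init__(self, d_model=256, n_heads=4, n_state=64):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)  # tokens read state
        self.state_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)  # state reads tokens
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.gate = nn.Linear(2 * d_model, 2 * d_model)  # LSTM-style input/forget gates
        self.n_state, self.d_model = n_state, d_model

    def init_state(self, batch_size):
        return torch.zeros(batch_size, self.n_state, self.d_model)

    def forward(self, tokens, state):
        # tokens: (batch, block_len, d_model); state: (batch, n_state, d_model)
        attn, _ = self.self_attn(tokens, tokens, tokens)   # tokens attend to tokens
        x = tokens + attn
        cross, _ = self.cross_attn(x, state, state)        # tokens attend to the recurrent state
        x = x + cross
        x = x + self.ffn(x)

        update, _ = self.state_attn(state, x, x)           # state attends to the new token block
        gates = torch.sigmoid(self.gate(torch.cat([state, update], dim=-1)))
        i_gate, f_gate = gates.chunk(2, dim=-1)
        new_state = f_gate * state + i_gate * torch.tanh(update)
        return x, new_state

# Usage: march over a long sequence one block at a time, carrying the state forward,
# so cost per block stays constant while information propagates across blocks.
cell = BlockRecurrentCell()
sequence = torch.randn(2, 4 * 128, 256)        # batch of 2, four blocks of 128 tokens
state = cell.init_state(batch_size=2)
for block in sequence.split(128, dim=1):
    out, state = cell(block, state)
print(out.shape, state.shape)                  # (2, 128, 256), (2, 64, 256)
```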

If you put these two papers together, you’ll be addressing two of the biggest problems facing forecasters.

PS: That the latter paper had to reintroduce LSTMs to transformers is precisely what was so destructive about the “Attention Is All You Need” meme. LSTMs had limitations, but the transformer folks, by over-selling what they were doing, showed themselves not competent to occupy the position they did.

PPS: It is no coincidence that one of the authors of that paper was from the Swiss AI Lab IDSIA, which is where Schmidhuber and Hutter were from.

3 Likes

That article says “You can … quickly name the American presidents that have the same names as automobiles.”

Old-timer’s has set in – I missed most of them.

I wonder how fast that would run today if it were ported to, say, M$ Small Basic.

2 Likes