Interpretability via Symbolic Distillation

One of the tragedies of the information age, reflected most egregiously in machine learning, is that it failed to recognize that it was actually engaged in formalizing the scientific method in a manner potentially far more rigorous than anything Leibniz dreamed of. So guys like Miles Cranmer end up isolated and alone rather than being central figures in the world of AI/AGI/ML.

That said, I really laid into Miles in this comment, but only because he’s worth criticizing, unlike the vast majority of the $300k/year AI experts parading their insolence around before our weary eyes:

Conflating inductive bias in mutation with selection bias is keeping ML in the dark ages. It’s up there with the philosophical nuisance of claiming that AIT (algorithmic information theory) is ill-founded because one may arbitrarily choose the instruction set used to encode the dataset.

Maybe it will help if I explain why the choice of “language” is not arbitrary, since “instruction set” and “language” are ultimately indistinguishable if you want a model that can be executed as an algorithm to simulate, and hence predict, the phenomena of interest:

Mathematicians spend centuries trying to come up with the minimum set of axioms for their systems. Yet, for some strange reason, even Turing Awardees like Yann LeCun, when confronted with the question of why they don’t use lossless compression as the information criterion for model selection, will hide behind the philosophical nuisance of “Well, I can just have an instruction that outputs the entire dataset!” No, Yann, you can’t do that if you are going to be intellectually honest and get on with the revolution in the scientific method offered by computation, rather than holding things back in the dark ages as you are doing with your Turing Award. But lest I be accused of having it in for Yann LeCun, even “Mr. Causality” himself, Judea Pearl – another Turing Award winner – has fallen into the same dark-age mentality regarding algorithmic information.
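To make it concrete, here is a toy accounting sketch (mine, with made-up byte counts, purely to show the arithmetic) of why the “instruction that outputs the entire dataset” move buys you nothing once the interpreter is charged to the bill: whatever bits you delete from the program reappear inside the interpreter that defines your magic instruction.

```python
# Toy illustration: the "dump the dataset" instruction doesn't shrink the bill
# once you charge for the interpreter that defines it. The byte counts are
# schematic; the point is the accounting, not the numbers.

dataset = bytes(range(256)) * 100            # 25,600 bytes of observations

# Language A: a small general-purpose instruction set whose interpreter
# knows nothing about this particular dataset.
interpreter_A = 2_000                        # fixed cost, independent of the data
program_A = 300                              # a short program that regenerates the data
cost_A = interpreter_A + program_A

# Language B: "I can just have an instruction that outputs the entire dataset!"
# Fine. But the interpreter that implements PRINT_DATASET must itself contain
# the dataset, so the bits removed from the program reappear in the interpreter.
interpreter_B = 2_000 + len(dataset)         # the dataset is hiding in here
program_B = 1                                # a single PRINT_DATASET opcode
cost_B = interpreter_B + program_B

print(f"Language A total description length: {cost_A:,} bytes")
print(f"Language B total description length: {cost_B:,} bytes")
# Choice of language shifts cost around by a constant (the interpreter);
# it cannot conjure compression out of thin air.
```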

But, ok, let’s say we escape this philosophical nuisance and recognize that measures of complexity should be based on algorithmic descriptions in terms of simple instruction sets with some reasonable relationship to the axioms of arithmetic upon which we ultimately found our higher-level language definitions, so as to minimize the algorithmic description of the data. We’re still left with the aforementioned conflation of mutation bias with selection bias! Sheesh…
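But before I get back to that conflation: in case “simple instruction set with some reasonable relationship to the axioms of arithmetic” sounds mystical, here is roughly the flavor I mean. This is a toy counter machine of my own, not anything canonical; any comparably spartan reference machine would do.

```python
# A deliberately spartan reference machine (a toy counter machine):
# a few registers, successor, predecessor, a conditional jump, and output.
# Close in spirit to the primitives arithmetic itself is founded on.
def run(program, steps=100_000):
    """program: list of tuples: ('INC', r), ('DEC', r), ('JNZ', r, addr),
    ('OUT', r), ('HALT',)."""
    regs, out, pc = [0] * 8, [], 0
    while pc < len(program) and steps > 0:
        op, *args = program[pc]
        steps -= 1
        if op == 'INC':
            regs[args[0]] += 1
        elif op == 'DEC':
            regs[args[0]] = max(0, regs[args[0]] - 1)
        elif op == 'JNZ' and regs[args[0]] != 0:
            pc = args[1]
            continue
        elif op == 'OUT':
            out.append(regs[args[0]])
        elif op == 'HALT':
            break
        pc += 1
    return out

# A short program capturing a regularity: emit 0, 1, ..., 9.
prog = [('INC', 1)] * 10 + [
    ('OUT', 0), ('INC', 0), ('DEC', 1), ('JNZ', 1, 10), ('HALT',)
]
assert run(prog) == list(range(10))
```

Your favorite high-level language isn’t banned; it just gets billed up front for the interpreter that maps it down to ops like these before any of its programs are counted.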

OK, look kids, here’s what you do: Go ahead and impose whatever arbitrary mutation bias you want in terms of whatever language you choose. If you insist, you may even indulge yourself and call this choice “arbitrary”. And of course, the way you mutate within this language is up to you as well – just have at it and be as “arbitrary” as you like. OK? I’m being really nice to you up to this point, right? But now, here’s where I’m going to be Mean Mr. Jim to you – so gird your loins:

When you compute your fitness in preparation for model SELECTION – that is to say, when you compute the measure of complexity – that is to say, when you compute your “loss” function – do your goddamn homework and encode your “arbitrary language” as a minimum-length interpreter written in an “instruction set” chosen to be reasonable. You think that’s a nasty homework assignment? Oh, but I’m not through with your sorry ass: You must also encode the errors as program literals so that the resulting algorithm outputs the dataset exactly, bit for bit, rather than summarizing the errors as “noise” in “metrics” like sum of squared errors. Scientists recognize that “noise” isn’t just “noise”. Failure to recognize that apparently random bit strings may be hiding critical information about the experimental setup, a bad theory, etc., and therefore failure to fully encode those errors, is the path to confirmation bias.
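So that nobody can claim the assignment was ambiguous, here is a minimal sketch of the whole loop. Everything provisional in it is mine: the toy expression language, the mutation operators, the INTERPRETER_COST placeholder, and zlib standing in for a proper prefix-free code (a real submission would count the actual interpreter written in the reference instruction set). What is not negotiable is the shape of the fitness: interpreter plus program plus errors encoded as literals, so that the description reproduces the dataset exactly.

```python
import random
import zlib

INTERPRETER_COST = 4_096   # placeholder: bytes for this toy DSL's interpreter,
                           # as written in the reference instruction set

def predict(expr, x):
    """Evaluate a toy expression DSL: 'x', integer constants, ('+', a, b), ('*', a, b)."""
    if expr == 'x':
        return x
    if isinstance(expr, int):
        return expr
    op, a, b = expr
    return predict(a, x) + predict(b, x) if op == '+' else predict(a, x) * predict(b, x)

def mutate(expr):
    """Arbitrary mutation bias: be as 'arbitrary' as you like in here."""
    r = random.random()
    if r < 0.3:                                  # replace with a fresh leaf
        return random.choice(['x', random.randint(-3, 3)])
    if r < 0.6:                                  # wrap in a new operator
        return (random.choice('+*'), expr, random.choice(['x', random.randint(-3, 3)]))
    if isinstance(expr, tuple):                  # descend into a child
        op, a, b = expr
        return (op, mutate(a), b) if r < 0.8 else (op, a, mutate(b))
    return expr

def description_length(expr, xs, ys):
    """SELECTION criterion: interpreter + program + errors as literals, lossless."""
    program = zlib.compress(repr(expr).encode())
    residuals = [y - predict(expr, x) for x, y in zip(xs, ys)]   # literals, not "noise"
    errors = zlib.compress(repr(residuals).encode())
    return INTERPRETER_COST + len(program) + len(errors)

# Toy integer-valued data so "exactly, bit for bit" is unambiguous.
xs = list(range(32))
ys = [3 * x * x + 2 for x in xs]

best = 'x'
best_cost = description_length(best, xs, ys)
for _ in range(20_000):
    candidate = mutate(best)                     # arbitrary language, arbitrary mutation
    cost = description_length(candidate, xs, ys)
    if cost < best_cost:                         # selection: total bits, nothing else
        best, best_cost = candidate, cost

print(best, best_cost)
# Whatever survives, the model plus its residual literals reconstructs ys
# exactly, by construction.
```

The only thing your choice of language can do now is shift a constant, its interpreter, onto the bill; it cannot smuggle the dataset, or the errors, off of it.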

And if you come back to me and tell me “The Dog Ate My Homework, Mean Mr. Jim,” you’ll be in detention.
