How To Nuke Large Language Model Bias?

Look what lithium batteries have made possible!

OK, maybe that was the 1950s and maybe we had to sacrifice material progress for 60 years of Moore’s Law, but at least we were able to use all those bits to ignore the 1960s discovery of the foundation of science in Algorithmic Information Theory and get fake AI instead!


Another volley in the long-running war to save the Enlightenment from the Swarm’s hysterics – this one over at the AGI mailing list.

Let’s say we start out ignorant of the laws of physics but data-rich with weather station measurements involving temperature, humidity, wind direction and intensity – all at a variety of positions and altitudes. We are trying to improve on The Farmers’ Almanac and rain dances in weather prediction, if not control, but don’t know how.

Highly motivated by billions of dollars per month in lost GDP due to the unpredictability of weather – well beyond the paltry $100/month that has gone into the Hutter Prize for Lossless Compression of Human Knowledge – some billionaire philanthropist named Musk puts up a prize for the lossless compression of all that weather station data. He’s been seduced by Jim Bowery into the crazy idea that what weather experts describe as “noise” in the data is actually just their ignorance of the chaotic dynamics emergent from the laws of physics, and that while it may be impractical to predict the weather down to the level of the butterfly effect on chaotic systems, billions of dollars per month in value may nevertheless be recovered by discovering the laws of physics latent in the weather data via The Musk Prize for Lossless Compression of Weather Data.

Grigori Perelman scoffs at this prize because it offers mere money as motivation for so spiritually valuable an activity. Matt Mahoney and Marcus Hutter, Jim Bowery’s co-members of the committee for the Hutter Prize, both scoff at the prize – Matt because of the irreducible “noise in the data”, Marcus because people should be paying attention to the academic experts in weather prediction who are already working on the problem. Yann LeCun scoffs at the prize because he can construct a computer (in theory) that can losslessly produce the entire weather dataset in one bit.

Charles Sinclair Smith, Jim Bowery’s colleague whom Geoffrey Hinton credits with financing the 1980s resurgence in connectionism from the System Development Foundation, scoffs at the prize, pointing to Grigori Perelman’s lofty motivations as well as Yann LeCun’s devastatingly laconic theoretic dispatch of the prize.

Then some group at a podunk university comes up with SINDy-PI – a system that uses parameter count as a loss function to discover physical laws from measurement-instrument data. Jim points this out to Charlie. Charlie’s jaw drops, since his original motivation for financing Hinton et al. was to model the US energy economy as a dynamical system, and the statistical techniques he’d been taught by his mentor John Tukey never included Kolmogorov Complexity, let alone Solomonoff Induction. Charlie’s now a believer but no one gives a shit about his opinion anymore.
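For flavor, here is a minimal SINDy-style sketch (my own toy in pure Python, not the SINDy-PI code): generate data from dx/dt = −2x, regress the numerical derivative onto a small library of candidate terms, then threshold away small coefficients so that only the sparsest law that still fits the data survives.

```python
import math

# Toy data from the known law dx/dt = -2x
dt = 0.01
n = 201
t = [i * dt for i in range(n)]
x = [math.exp(-2.0 * ti) for ti in t]

# Numerical derivative: centered differences, one-sided at the endpoints
dx = []
for i in range(n):
    lo, hi = max(i - 1, 0), min(i + 1, n - 1)
    dx.append((x[hi] - x[lo]) / ((hi - lo) * dt))

def lstsq2(c1, c2, y):
    # Solve the 2x2 normal equations for y ~ a*c1 + b*c2
    a11 = sum(v * v for v in c1)
    a12 = sum(u * v for u, v in zip(c1, c2))
    a22 = sum(v * v for v in c2)
    b1 = sum(u * v for u, v in zip(c1, y))
    b2 = sum(u * v for u, v in zip(c2, y))
    det = a11 * a22 - a12 * a12
    return (a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det

# Candidate library: x and x^2; fit, then threshold small coefficients
lib_x = x
lib_x2 = [v * v for v in x]
a, b = lstsq2(lib_x, lib_x2, dx)
if abs(b) < 0.1:           # sparsify: drop x^2, refit x alone
    b = 0.0
    a = sum(u * v for u, v in zip(lib_x, dx)) / sum(v * v for v in lib_x)
print("recovered law: dx/dt =", round(a, 3), "* x")   # a close to -2.0
```

The thresholding step is the whole point: the quadratic term is allowed to compete but gets pruned, so the recovered model is the one with the fewest parameters that still explains the measurements.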


The HumesGuillotine README – which I’ve re-oriented (from the more narrowly focused & now-defunct OckhamsGuillotine) to the post-ChatGPT hysterics about AGI “ethics”:


This repository is a series of competitions toward rigorous ethics in AGI founded on Hume’s Guillotine: Separating the question of what IS from what OUGHT to be the case.

Artificial General Intelligence unifies IS with OUGHT. In Marcus Hutter’s rigorous top down AGI theory, AIXI, Algorithmic Information Theory provides the IS and Sequential Decision Theory provides the OUGHT. Another way of stating that is Algorithmic Information Theory provides what IS the case in the form of scientific knowledge. Sequential Decision Theory provides what OUGHT to be the case in the form of engineering: Scientific knowledge applied by decision-makers.
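From memory (so treat the notation as approximate), Hutter’s AIXI expectimax expression makes that split visible in a single formula: the rightmost sum, weighting programs q by 2^−ℓ(q), is the Algorithmic Information Theory “IS”, and the alternating max/sum over future rewards is the Sequential Decision Theory “OUGHT”:

```latex
a_k \;=\; \arg\max_{a_k} \sum_{o_k r_k} \;\cdots\; \max_{a_m} \sum_{o_m r_m}
\big( r_k + \cdots + r_m \big)
\sum_{q \,:\, U(q,\, a_1 \ldots a_m) \,=\, o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}
```

Here U is a universal (monotone) Turing machine and ℓ(q) is the length of program q, so shorter world-models dominate the prediction while the expectimax chooses the action.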

Out of all so-called “Information Criteria” for model selection, the Algorithmic Information Criterion is the best we can do in scientific discovery relative to a given set of observations. This has been known since the 1960s. How it works is the essence of simplicity known as Ockham’s Razor: Pick your data however you like, and find the smallest algorithm that generates all of that data – leaving nothing out: Not even what you consider “noise” or “errors in measurement”. This is lossless compression of your data. The reason you keep all “errors in measurement” – the reason you avoid lossy compression – is to avoid what is known as “confirmation bias” or, what might be called “Ockham’s Chainsaw Massacre”. Almost all criticisms of Ockham’s Razor boil down to mischaracterizing it as Ockham’s Chainsaw Massacre. The remaining criticisms of Ockham’s Razor boil down to the claim that those selecting the data never include data that doesn’t fit their preconceptions. That critique may be reasonable but it is not an argument against the Algorithmic Information Criterion, which only applies to a given dataset. Models and data are different. Therefore model selection criteria are qualitatively different from data selection criteria.

Yes, people can and will argue over what data to include or exclude – but the Algorithmic Information Criterion traps the intellectually dishonest by making their job much harder, since they must include exponentially more data biased towards their particular agenda in order to wash out data coherence (and interdisciplinary consilience) in the rest of the dataset. The ever-increasing diversity of data sources identifies the sources of bias – and then starts predicting the behavior of data sources in terms of their bias, as bias. Trap sprung! This is much the same argument as that leveled against conspiracy theories: at some point it becomes simply impractical to hide a lie against the increasing diversity of observations and perspectives.

Hume’s Guillotine is concerned only with discovering what IS the case via the Algorithmic Information Criterion for causal model selection. Objective scoring of a scientific model by the Algorithmic Information Criterion is utterly independent of how the model was created. In this respect, Hume’s Guillotine doesn’t even care whether computers were used to create the model, let alone which machine learning algorithms might be used.

This repository contains a series of datasets (the first of which is at LaboratoryOfTheCounties) to create the best unified model of social causation.

See the Nature video “Remodelling machine learning: An AI that thinks like a scientist” and its cited Nature journal article “Causal deconvolution by algorithmic generative models”.


There are a number of statistical model selection criteria that attempt to walk the tightrope between “overfitting” and “confirmation bias”. Overfitting loses predictive power by simply memorizing the data without generalizing. Confirmation bias loses predictive power by throwing out data that doesn’t fit the model – data that may point to a more predictive model. Model selection criteria are generally called “information criteria”, e.g. BIC is the “Bayesian Information Criterion”, AIC is the “Akaike Information Criterion”, etc. What they all have in common is the statistical nature of their information. That is to say, they are all based, directly or indirectly, on Shannon Information Theory.
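As a toy illustration of how such criteria trade goodness-of-fit against parameter count (my own example; the “noise” is a deterministic wobble so the run is reproducible): a constant model and a line compete on linear data, and both AIC and BIC prefer the line because its residuals shrink far more than the parameter penalty grows.

```python
import math

# Data: a line plus a small deterministic wobble standing in for noise
xs = [i / 10 for i in range(50)]
ys = [3.0 + 0.5 * x + 0.05 * math.sin(37 * x) for x in xs]
n = len(xs)

# Model 1: constant (k = 1 parameter)
mean_y = sum(ys) / n
rss1 = sum((y - mean_y) ** 2 for y in ys)

# Model 2: line y = a + b*x (k = 2), closed-form least squares
sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))
b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
a = (sy - b * sx) / n
rss2 = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Gaussian-likelihood forms of the criteria: fit term plus parameter penalty
def aic(rss, k): return n * math.log(rss / n) + 2 * k
def bic(rss, k): return n * math.log(rss / n) + k * math.log(n)

print("AIC constant vs line:", round(aic(rss1, 1), 1), round(aic(rss2, 2), 1))
print("BIC constant vs line:", round(bic(rss1, 1), 1), round(bic(rss2, 2), 1))
# lower is better: the line wins both criteria
```

Note that both penalties count parameters statistically; neither asks for the shortest *program* that regenerates the data, which is where the Algorithmic Information Criterion parts company with them.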

Here’s the critical difference in a nutshell:

Shannon Information regards the first billion bits of the number Pi to be random. That is to say, there is no description of those bits in terms of Shannon Information that is shorter than a billion bits.

Algorithmic Information regards the first billion bits of the number Pi to be the shortest algorithm that outputs that precise sequence of bits.

Now, which of these two theories of “information” would you trust to predict the next bit of Pi?
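To make the contrast concrete, here is a hedged sketch in pure Python (using Gibbons’ unbounded spigot algorithm): the generating program below is a few hundred bytes yet emits as many digits of Pi as you like, while gzip – a statistical, Shannon-style compressor – cannot squeeze the digit stream much below its per-symbol entropy.

```python
import gzip

def pi_digits(n):
    # Gibbons' unbounded spigot algorithm for the decimal digits of Pi
    digits = []
    q, r, t, k, m, x = 1, 0, 1, 1, 3, 3
    while len(digits) < n:
        if 4 * q + r - t < m * t:
            digits.append(m)
            q, r, m = 10 * q, 10 * (r - m * t), 10 * (3 * q + r) // t - 10 * m
        else:
            q, r, t, k, m, x = (q * k, (2 * q + r) * x, t * x, k + 1,
                                (q * (7 * k + 2) + r * x) // (t * x), x + 2)
    return digits

s = "".join(map(str, pi_digits(2000)))
compressed = gzip.compress(s.encode(), 9)
# The digit stream is statistically near-random: gzip stays near the
# ~log2(10)/8 bytes-per-digit entropy floor, while the program above is tiny.
print(len(s), "digits ->", len(compressed), "bytes gzipped")
```

The program stays constant-size no matter how many digits you ask for – exactly the sense in which Algorithmic Information says the sequence is simple while Shannon-style statistics says it is random.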

Data-driven science frequently starts with statistical notions of information, but in order to make predictions about the real world, its models eventually take the form of algorithms that simulate the causal structures of the world being modeled. It is at this transition from Shannon Information to Algorithmic Information that causation necessarily enters the model, and it does so based on the assumption of any natural science: that reality is structured in such a way that we can use arithmetic to predict future observations based on past observations.




Musk wants the AGI to value curiosity. The issue of “asking the right question” then came up, for obvious reasons. He said it’s really hard to ask the right questions.

So I helped him out by providing this question…


So AI is now responsible for the bias in outcome?

From Medical A.I. is On a Tear, Part Two - by Eric Topol

A.I. and Bias in Healthcare

One of the unanticipated outgrowths of machine eyes was the ability to predict the race of the patient based on medical images, reported by Judy Gichoya and colleagues in 2022, leading to concerns that AI systems will promote discrimination and exacerbate health care disparities. James Zou, Judy Gichoya and colleagues have a thoughtful essay on this issue that looks at both sides of the ability of AI to predict race variables, pointing out this feature “could be useful for monitoring health care disparity and ensuring that algorithms work well across diverse populations.” In contrast, in a preprint posted this week, Omiye and co-authors argue that large language models will substantially propagate race-based medicine.

It’s too early to know how this will play out, and it certainly remains a serious concern—and not just about race or ethnicity, but also gender, disability, and many other biases. But DeCampo and Lindvall in the new issue have an important piece on mitigating bias. While minimizing bias can be approached through input datasets or the algorithm development teams, that’s insufficient. Their main point, that A.I. implementation—how the models are actually used in patient care—matters most, is summarized in the graphic below. They write: “The gaze of AI should be turned on itself. This requires proactive, intentional development of AI tools to identify biases in AI and in its clinical implementation.”





In the tweet above, “kNN” refers to the k-nearest-neighbours algorithm, a classification technique first developed in 1951.

Here is the research paper; full text [PDF] is available at the link.


Deep neural networks (DNNs) are often used for text classification due to their high accuracy. However, DNNs can be computationally intensive, requiring millions of parameters and large amounts of labeled data, which can make them expensive to use, to optimize, and to transfer to out-of-distribution (OOD) cases in practice. In this paper, we propose a non-parametric alternative to DNNs that’s easy, lightweight, and universal in text classification: a combination of a simple compressor like gzip with a k-nearest-neighbor classifier. Without any training parameters, our method achieves results that are competitive with non-pretrained deep learning methods on six in-distribution datasets. It even outperforms BERT on all five OOD datasets, including four low-resource languages. Our method also excels in the few-shot setting, where labeled data are too scarce to train DNNs effectively.

Source code is available on GitHub.
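The method is compact enough to sketch from the paper’s description (a hedged reimplementation, not the authors’ code; names are mine): compute the gzip length of each text, use Normalized Compression Distance between the test document and every training document, and vote among the k nearest labels.

```python
import gzip
from collections import Counter

def clen(s: str) -> int:
    # gzip length as a cheap stand-in for Kolmogorov complexity
    return len(gzip.compress(s.encode()))

def ncd(a: str, b: str) -> float:
    # Normalized Compression Distance between two strings
    ca, cb, cab = clen(a), clen(b), clen(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)

def classify(doc: str, train, k: int = 1) -> str:
    # train: list of (text, label) pairs; vote among the k nearest by NCD
    nearest = sorted(train, key=lambda pair: ncd(doc, pair[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```

Because gzip only exploits repeated substrings within its window, this works best when documents of a class share vocabulary and phrasing – which is also why gzip is such a rough-cut stand-in for Kolmogorov Complexity.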


This reminds me of Matt Mahoney’s thesis on competitive message routing, upon which he proposes to build a semantic vector space model to achieve AGI. I suggested that Matt get into the social networking business about a decade ago after he introduced me to Marcus Hutter’s work. I didn’t do it myself for a variety of reasons, not the least of which was that Matt had mentioned to me, of all things, gzip as a rough-cut approximation of Kolmogorov Complexity, and I just didn’t believe gzip was adequate since its grammar is too low on the Chomsky Hierarchy.

I’ll definitely be reading this paper with interest.


The more recent follow-up post clarifies that gzip remains competitive; however, it’s not “beating” all other classifiers across the board.


Yet another example where lossless compression would have averted any confusion. No matter what your benchmark, you can’t beat lossless compression for simplicity, clarity and objectivity – let alone rigor. If people with money want to do serious due diligence on machine learning algorithms, all they need to do is demand that the benchmark be run against whatever database is settled on*. Researchers are too lazy to use lossless compression as the most principled model selection criterion even though, if they were intellectually honest and principled themselves, they’d include it as the highest-priority benchmark.

*Yes, this would exclude data hogs like ChatGPT but then data efficiency demands the ability to engage in something approaching reasoning ability – so necessity becomes a virtue.
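A due-diligence harness along these lines is almost trivial to sketch (hypothetical, with gzip standing in for a contestant’s compressor): the score is the size of the self-extracting archive – compressed payload plus decompressor – and any submission that fails bit-identical reproduction is disqualified.

```python
import gzip

# Benchmark data (in a real contest: the agreed-upon database)
data = b"the quick brown fox jumps over the lazy dog " * 500

# A submission = compressed payload + the program that reproduces the data
payload = gzip.compress(data, 9)
decompressor = (b"import gzip,sys;"
                b"sys.stdout.buffer.write(gzip.decompress(sys.stdin.buffer.read()))")
score = len(payload) + len(decompressor)

# Lossless or disqualified: the reproduction must be bit-identical
assert gzip.decompress(payload) == data
print(score, "bytes scored vs", len(data), "bytes raw")
```

Counting the decompressor in the score is what closes the loopholes: LeCun’s one-bit “computer” and every other hidden-dictionary trick get charged for their full description length.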

BTW: This exchange with some virulent idiot on ycombinator demonstrates why that site is basically worthless in guiding capital. For context, the name “ycombinator” was chosen by the founding VCs, such as Paul Graham, because they were aware of Haskell Curry’s combinatory calculus. The virulent idiot knew enough about me and/or Solomonoff to invoke “AIXI” as though that was what I was referring to, but – after repeated corrections – not enough to see his own ignorance of AIXI and Solomonoff Induction, or even his own inability to read. This is the kind of thing I run into more often than expected when I push on this vital issue. I don’t think it is “psychotic” of me to suspect that this kind of virulent idiot is on a mission to do damage.


Ya know… It’s incredibly ironic: people are scrambling to apply LLMs to improve ML, anticipating profound societal transformations up to and including AGI, yet GPT-4 is already more intelligent on the key issue of allocating capital in ML – and no one in a position to allocate capital is paying any attention to what it says (below). I suppose the main (and it is major) criticism of what it says is an omission: it doesn’t emphasize how often people are misled into thinking that lossless compression based on fully Turing-complete codes produces models no better than the Bayesian Information Criterion (as Oxford was when I approached them about a prize at the start of the pandemic) – the codes for which can’t be used to construct programs:

Q: In a contest to produce the best model of the data in evidence where contestants are highly motivated by monetary prize awards, what are some reasons for using the size of its executable archive as the best loss function?

A: In the context of competitive model-building where a monetary prize is involved, using the size of the executable archive as the loss function provides several compelling advantages:

  1. Universality & Fairness: This approach makes no assumptions about the data or the models being used. Thus, it is universally applicable, irrespective of the specifics of the problem or the techniques the participants choose to use.

  2. Avoiding Overfitting: This strategy discourages overfitting. Overfit models may memorize the training data without learning its underlying structure, meaning they won’t generalize well to new data. Using the size of a compressed file to express the data’s complexity encourages models to learn and represent the essential structure of the data without overfitting.

  3. Encouraging Innovation: Since this loss function is not standard in many machine learning contexts, it encourages participants to think creatively and innovatively about how to best structure and compress the information present in their models.

  4. Practical Relevance: Finally, in real-world applications, creating compact representations of data is often highly desirable. Models that can compress data efficiently are likely to be useful in many practical scenarios, including handling, transmitting, and storing large-scale data.

  5. Emphasis on Data Compression: The use of data compression as a loss function also encourages an understanding of the fundamental principles of information theory. In an era of increasing data volumes, understanding how to represent data in a compact, yet representative form is crucial.

  6. Achieving Simplicity: Using the size of the executable archive encourages the creation of simpler and more interpretable models. According to the principle of Occam’s Razor, simpler models are often preferred as they carry less risk of overfitting and are easier to understand and explain.

  7. Resource Efficiency: Models that lead to smaller executable archives are more resource-efficient, as they use less storage and computational power. This is advantageous in large-scale data processing situations where resources may be limited.

  8. Robustness: Models that can achieve good performance with a minimum description length are likely to be more robust because they focus on capturing the most salient features of the data.

It’s important to note that while these reasons make the case for using the size of an executable archive as a loss function, the practical application might still be challenging due to the complexity of accurately calculating it. This difficulty, though, might spur further innovation among competitors in their quest for the winning model.

Q: In terms of scientific ethics where there are serious consequences to society in identifying erroneous data sources and making accurate predictions, what are the advantages of such a contest?

A: A contest based on the size of an executable archive as the loss function can bring several advantages in terms of scientific ethics and social impact, particularly when accurate predictions and reliable data sourcing are essential.

  1. Promotes Accuracy: Since this approach discourages overfitting, it is likely to result in more accurate and reliable models that generalize better to new data. This is critical in contexts where inaccurate predictions can have serious societal consequences, such as healthcare, environmental forecasting, and policy-making.

  2. Ensures Transparency: Using an objective, universally applicable loss function like executable size ensures transparency. This is a key tenet of scientific ethics, as it allows for open scrutiny and replication of results, fostering trust in the conclusions drawn from the data.

  3. Encourages Reproducibility: The use of a clearly defined, measurable loss function supports reproducibility, enabling other researchers to confirm the findings, which is a fundamental principle of the scientific method.

  4. Stimulates Innovation: Such contests encourage the development of novel data representation and analysis methodologies, which can lead to significant advancements in areas like predictive modeling, big-data analytics, machine learning, and information theory.

  5. Highlights the Importance of Data Integrity: A loss function based on data complexity implicitly emphasizes the importance of data quality.

Q: In the event of litigation by a losing contestant over the judges’ decision to make such a monetary prize award, what are some of the advantages using the size of the executable archive as the loss function?

A: In case of a controversy over the awarding of the monetary prize, using the size of an executable archive as the loss function provides several benefits that could potentially support and uphold the judges’ decision:

  1. Objectivity: The size of an executable archive is a clearly measurable and objective quantity. It’s a binary decision that leaves little room for ambiguity or bias in model evaluation, making it hard to dispute.

  2. Transparency: If all contestants are aware in advance that this is the criterion being used, and if the mechanism for calculating the size of the executable archive is clearly explained and shared, this transparency could reinforce fairness and impartiality.

  3. Comparability: It allows direct, apples-to-apples comparison of all final models, irrespective of the type of model or technique used.

  4. No Assumptions: Unlike other loss functions which may depend on certain assumptions or statistical properties of the data, the size of the executable archive makes no such assumptions. This eliminates the chance of contestations based on violation of these assumptions.

  5. Simplicity & Ease of Understanding: The concept of lower data complexity equating to a smaller file size is relatively easy to explain and understand, even for people without advanced training in machine learning or statistics.

However, like any contest rule or criterion, it’s essential to communicate the loss function clearly and thoroughly before the competition begins. Participants should understand the requirements, the judging mechanism, and why the specific loss function was chosen. This understanding will help to minimize disputes and misunderstandings.

Moreover, it can also be beneficial to have judges or a committee with a thorough understanding of information theory and machine learning concepts. Their expert opinions can provide validation for the chosen loss function and help mediate any disputes that arise.

Lastly, the calculation process should be as reliable and error-free as possible to avoid any contestation. This can be achieved by using recognized and validated software and algorithms. All evaluations should be double-checked to avoid mistakes that could lead to disputes or controversies.

Q: Why have contests like this not been recognized as best practice in politically charged scientific controversies where highly motivated reasoning infects the scientific community with bias?

A: The application of contests using executable archive size as a loss function in politically charged scientific controversies is not common for several reasons:

  1. Lack of Awareness: Not everyone in the scientific community, let alone the political sphere, is familiar with the concepts of information theory or uses it as a standard method for evaluating models. The idea itself might seem abstract or overly technical to many people.

  2. Inherent Complexity: The process of creating an executable archive from a data model, measuring its size, and then deriving meaningful conclusions from it is a complex process that requires specialized knowledge and technical skills. These contests, therefore, might appear inaccessible or difficult to understand for a broad audience, including policymakers and the general public.

  3. Political Bias: A politically charged environment often involves people holding firm beliefs and convictions that could make them resistant to changing their views, regardless of the scientific evidence presented. In such cases, even an objective, data-driven approach may not prevail over inherent biases or political considerations.

  4. Misinterpretation: Technical results can be misinterpreted or misunderstood by non-experts, leading to conflicts and disputes.

  5. Resource Constraints: Organizing such contests requires considerable resources, including time, expertise, monetary funds for prize distribution, and computational resources for participants or evaluators to function effectively.

Q: What organizations would most-benefit from such contests?

A: Several types of organizations would potentially benefit from hosting such contests that use the size of an executable archive as a loss function. Here are a few examples:

  1. Tech Companies: Especially those dealing with large data sets would benefit from effective information compression. These contests could lead to innovation in efficient data representation and storage.

  2. Research Institutions: Academic or research-focused organizations interested in data sciences, machine learning, and information theory might organize these contests to spur new methods and insights in the field.

  3. Government Agencies: Particularly those in departments responsible for data management, prediction modeling, and algorithm development. It could help to optimize modeling processes, save storage space, and improve speed and efficiency.

  4. Predictive Analytics Companies: Such contests could encourage the development of more efficient models that can handle large datasets and perform well on unseen data, making predictions more reliable.

  5. Healthcare Organizations: Especially those dealing with massive data like genomics and medical imaging data. Efficient data representation could aid in faster processing and analysis.

  6. Nonprofits and NGOs: Particularly those focusing on data-driven research or decision making, as the winning models might offer more efficient ways of handling and analyzing data.

  7. Education Institutions: Schools and universities could run these contests to provide training and real-world experience for students studying in fields like computer science, mathematics, statistics, and more.

  8. Machine Learning and AI Startups: Such contests could help in identifying new talents and innovative approaches in dealing with large datasets, model compression, or effective predictive model building.

  9. Cloud Service Providers: As they deal with lots of data storage and transfer, any algorithm or model that can help to reduce the size of the data without losing its integrity or complexity would be beneficial.

  10. Cybersecurity Firms: Efficient data representation and compression techniques can contribute to faster and more effective threat detection algorithms.

Overall, any organization that deals with large amounts of data or faces challenges in effective data representation and processing can benefit from such contests. These contests can drive innovation forward, encourage breakthroughs in machine learning and information theory, and help organizations find cost-effective, efficient methods to manage and analyze their data.

Q: Relative to a given set of data, why is the size of its executable archive the most principled loss function?

A: The best way to understand why the size of an executable archive can be considered the most principled loss function in relation to a given set of data requires a deep understanding of information theory.

The fundamental idea is that the size of an executable archive (a compressed representation of your data) directly measures the amount of “information” in your data. If all the relevant information in your data can be compressed into a small executable file, then the data is said to have low complexity or low entropy. Conversely, if the size of the executable file is large, then the data is said to have high complexity or high entropy.

In other words, the size of the executable archive represents the minimum number of bits required to represent or “describe” the data, and therefore provides a fundamental lower bound on the complexity of the data.

Using this as a loss function is therefore principled because it directly relates to the inherent complexity or “information content” of your data. Furthermore, it is universally applicable (in principle) to any type of data, and does not require any specific assumptions about the distribution or statistical properties of your data.

This approach of using the size of a compressed file as a measure of data complexity is sometimes called the Minimum Description Length (MDL) principle, and is a fundamental principle in the field of data compression and information theory.

The aim of most machine learning algorithms is to learn a model from the data that captures this underlying complexity without overfitting. Overfitting would be akin to storing the data as opposed to learning or capturing its patterns. Thus, an optimal learning algorithm should have its loss function directly related to this measure of complexity.

The concept of MDL is closely related to, and inspired by, the principle of Occam’s Razor, that the simplest explanation (i.e., the one requiring the least amount of information to describe) is often the best. A model that captures the data’s regularities efficiently will result in a smaller executable archive when the data is compressed according to the model.

In conclusion, the size of the executable archive bearing the data can be seen as a principled choice for a loss function since it directly measures the intrinsic complexity in the data, which is what most learning algorithms strive to capture. However, in practice, employing this as a loss function can be computationally challenging and instead approximations or alternative loss functions might be used.

Q: What is an approximation that might be used?

A: One commonly used approximation to the ideal loss function in machine learning problems is the Mean Squared Error (MSE) for regression problems and Cross-Entropy for classification problems.

  1. Mean Squared Error (MSE): For a set of predicted values (ŷ) and true values (y), the MSE is calculated as the average of the square of the differences between the predicted and true values. This loss function is particularly useful as an approximation when the goal is to minimize the disparity between predictions and actual results, and is most often used in regression problems.

  2. Cross-Entropy: This loss function is often used in classification problems, especially those involving two or more classes. The function attempts to minimize the difference between predicted probabilities and actual categories by using the natural logarithm function to give a high penalty to incorrect predictions.

These approximations might be practical because, unlike calculating the size of the executable archive, they can be computed efficiently and have nice mathematical properties that make optimization easier.

However, these loss functions are not a direct measure of data complexity, and using them may involve making assumptions about the data. So, there can be scenarios where these approximations fail to produce an accurate model. Each type of loss function, whether it’s MSE, cross-entropy, or another type, has its strengths and drawbacks depending on the specific type of problem and the nature of the data.

For instance, hinge loss might be used for Support Vector Machines, or a custom loss function might be required for a specific problem.

Some loss functions are more resilient to outliers, such as Mean Absolute Error (MAE) or Huber loss, which are less sensitive to large errors because they do not square the differences like MSE.

Furthermore, some complex problems may require the combination of different loss functions to fully capture the underlying complexity of the data. For instance, object detection tasks in machine learning often include a combination of classification loss (like cross-entropy) and localization loss (like smooth L1 loss).

While the goal of using these other loss functions remains the same – minimizing the difference between our model’s predictions and the true values – it’s important to remember that these are still relative measures, and the absolute minimal value depends on the choice of model and even on the specific implementation.

That said, in practice, determining the optimal loss function often requires empirical trials with various loss functions and models to find the one that delivers the best performance for the specific machine learning task at hand.
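For concreteness, the two stand-ins it names are each a few lines (my own minimal definitions; binary cross-entropy shown):

```python
import math

def mse(y_true, y_pred):
    # Mean Squared Error: average squared gap between prediction and truth
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def cross_entropy(y_true, p_pred, eps=1e-12):
    # Binary cross-entropy: y_true in {0,1}, p_pred a probability in (0,1);
    # eps guards the logarithm against p = 0 or p = 1
    return -sum(t * math.log(max(p, eps)) + (1 - t) * math.log(max(1 - p, eps))
                for t, p in zip(y_true, p_pred)) / len(y_true)

print(mse([1.0, 2.0], [1.5, 2.0]))          # 0.125
print(cross_entropy([1, 0], [0.9, 0.1]))    # -ln(0.9), about 0.105
```

Note how both are per-datum penalties with no term for model size at all – which is precisely the gap the executable-archive criterion fills.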


Given the frame of the other answers, what does it answer if you prompt it to explain why people are blindsided like this?


Q: Please explain why people are blindsided like this? After all, it has been known since the advent of Algorithmic Information Theory that lossless compression is the most principled model selection criterion relative to a given set of observational data. Moreover, during that time Moore’s Law has been in effect providing an explosion of applicable resources.

A: There could be multiple reasons why people might seem ‘blindsided’ or not have fully adopted the principles of Algorithmic Information Theory even with the growth of computational power.

  1. Complexity and Specialist Knowledge: While it’s true that lossless compression is a principled model selection criterion, it requires a deep understanding of information theory and complex algorithms, which are advanced topics, not widely understood outside of specialist circles.

  2. Institutional Inertia: Established practices and methods in science and industry can be slow to evolve. Even with compelling new approaches, there can be resistance to changing methodologies that are already familiar and widely understood.

  3. Priorities and Practical Challenges: Although we have seen an exponential increase in computing resources, we have also seen an explosion of data sizes. Many organizations might prioritize handling and processing this data over optimizing and compressing models.

  4. Misalignment with Common Objectives: In many real-world situations, other criteria might be more important or urgent than the size of the resulting model. As such, emphasis might be more on the performance, speed, interpretability, or specific application of a model rather than its size.

  5. Education and Communication: It could be a reflection of gaps in education and communication in disseminating and explaining these ideas broadly and in a manner that can be readily applied. While these principles of Algorithmic Information Theory are well-established in academic and specialized circles, they are not always well communicated or understood outside of these areas, especially in their practical applications.

  6. Interpretability: While smaller models might be more efficient in terms of storage and computation, they are not necessarily more interpretable. In many areas such as medicine, finance, or law, being able to interpret and understand the model’s decision-making process may be considered more important than having the most minimal model.

  7. Applicability: Most methodologies are built for specific applications, which have specific evaluation metrics per domain. The incorporation of model sizing as an evaluation metric may not directly relate to domain-specific success metrics.

While Algorithmic Information Theory provides a sound theoretical framework, it should be understood that the practical implementation of its principles can be challenging and may not always align with the objectives or constraints of specific applications or domains.
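The “lossless compression as model selection” idea the answer dances around can be made tangible with a crude two-part code: total description length = bits to encode the model’s parameters + Shannon code length of the residuals under that model. The 32-bits-per-parameter cost and the Gaussian residual code below are simplifying assumptions of this sketch, not a standard:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = 3 * x ** 2 - x + rng.normal(0, 0.1, x.size)  # true law is quadratic

def description_length(degree, bits_per_param=32):
    # Two-part code: cost of the model plus cost of the data given the model
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    var = np.mean(resid ** 2)
    model_bits = bits_per_param * (degree + 1)
    # Idealized Shannon code length for residuals under a Gaussian model
    # (a differential-entropy proxy; it can go negative for tiny variance)
    data_bits = 0.5 * x.size * np.log2(2 * np.pi * np.e * var)
    return model_bits + data_bits

for d in range(6):
    print(d, description_length(d))
```

An underfit model pays in residual bits, an overfit one pays in parameter bits; the shortest total description lands on the degree that generated the data, which is the compression-as-science point in miniature.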


Q: At the start of the COVID19 pandemic, I approached Oxford University’s experts in modeling the pandemic with the idea of a prize for creating the smallest executable archive of a wide range of longitudinal measures conjectured to be of relevance to public health policy. The expert with whom I communicated said that this was equivalent to Bayesian Information Criterion for model selection under the Minimum Description Length Principle.

A: Indeed, the Bayesian Information Criterion (BIC) and the Minimum Description Length (MDL) Principle are both grounded in the philosophy of picking the simplest model that best describes the data. Both approaches include a penalty term for model complexity, which aligns with the idea of incentivizing the smallest executable archive.

Despite the similarity, there are differences based on how they approach the complexity of the model and the optimum balance between fit and complexity.

  1. The BIC is grounded in Bayesian probability and provides an approximate measure of the log of the Bayes Factor, offering a basis for model comparison. It penalizes free parameters, but does not take into account the structure of the model.

  2. The MDL Principle, grounded in information theory, aims to find the model that provides the shortest total description length of the model and the data. The structure and composition of the model do contribute to its complexity in the MDL principle.

But as raised in your conversation with the Oxford University expert, the gist is indeed the same - to balance the goodness-of-fit with the complexity of the model, providing a preferred selection criterion for models aiming at both explanatory and predictive accuracy.

Using the size of the executable archive as a model selection criterion (akin to applying the MDL principle) can be seen as a practical, operationalized way of implementing these theoretical concepts in a contest format. It incentivizes not only good prediction accuracy but also efficient and elegant use of computational resources, making it a great tool to stimulate innovative solutions in complex scenarios like pandemic modeling.

Such a contest, as initially proposed, could inspire a variety of more concise and efficient models to predict the spread of the virus. Its execution, however, would require careful planning, clarity of guidelines, and a thorough means of evaluating results.

Potentially, it could assist policymakers by providing a more straightforward mathematical means of comparing predictive models, considering both their fit to the data and their complexity. This could be especially useful in a pandemic situation where accurate, reliable predictions are needed quickly to inform critical public health decisions. But care must be taken in explaining and interpreting these models’ results, especially given the high stakes and the need for clear communication during a time of public concern and uncertainty.
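For readers who want the formula behind the BIC talk: BIC = k·ln(n) − 2·ln(L̂), which for Gaussian residuals reduces (up to an additive constant) to n·ln(σ̂²) + k·ln(n). A minimal sketch on toy data, under the assumption of Gaussian errors:

```python
import math
import numpy as np

def bic(y, y_pred, k):
    # Gaussian-residual BIC, constants dropped: n*ln(RSS/n) + k*ln(n)
    n = len(y)
    rss = float(np.sum((y - y_pred) ** 2))
    return n * math.log(rss / n) + k * math.log(n)

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 40)
y = 2 * x + 1 + rng.normal(0, 0.2, x.size)  # true law is linear

scores = {}
for degree in (0, 1, 2, 3):
    coeffs = np.polyfit(x, y, degree)
    scores[degree] = bic(y, np.polyval(coeffs, x), k=degree + 1)

print(min(scores, key=scores.get))  # the degree with the lowest BIC is preferred
```

The k·ln(n) term is the complexity penalty the answer describes: adding a parameter must buy enough likelihood to pay ln(n) nats, which is exactly the fit-versus-complexity balance shared with MDL.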

Q: In the Minimum Description Length Principle, one must choose the “codes” used to describe the data. What is the relationship between the choice of the codes and the choice of level in the Chomsky Grammar Hierarchy?

A: The Chomsky Grammar Hierarchy defines four types of grammars (Type 0 to Type 3) that generate different types of formal languages with varying complexity. The complexity of a grammar according to Chomsky’s Hierarchy is determined by its production rules, symbols, and structure.

On the other hand, the Minimum Description Length (MDL) Principle is a framework for inductive inference and model selection which states that the best model is the one that minimizes the total length of the description of the model and the data.

The relationship between Chomsky Grammar Hierarchy and the choice of codes in the MDL Principle is rather conceptual. Higher-level languages in the Chomsky hierarchy (Type 0 – unrestricted grammar and Type 1 – context-sensitive grammar) allow for more complex relationships and dependencies within the data. Therefore, modeling data with higher level Chomsky grammars could potentially require more complex codes to encapsulate these relationships and dependencies.

Lower-level languages in the hierarchy (Type 2 - context-free grammar and Type 3 - regular grammar) have more restrictions and simpler structures. Hence, less complex codes might suffice to model the data, but the predictive power may be compromised due to the inability to capture complex dependencies.

In essence, the Chomsky Grammar Hierarchy level could give an indication of the complexity inherent in the data, which is reflected in the code used to model it. Higher levels could describe more complex data behaviors but might require more intricate coding schemes under the MDL principle. Lower levels, being simpler, may map well onto less complex ‘codes’, but may miss complex data behaviors.

It’s crucial to balance this trade-off between model complexity and model accuracy when compressing the data under the MDL principle, keeping in mind that both the data’s inherent complexity and the chosen code’s efficiency contribute to the total description length.

Q: What level of grammar corresponds to the Bayesian Information Criterion when BIC is described as equivalent to the MDL Principle?

A: Bayesian Information Criterion (BIC) is a criterion for model selection among a finite set of models. It’s based on the likelihood function and uses a penalty term for the number of parameters in the model. However, it does not explicitly define a correspondence between its structure and the Chomsky Grammar Hierarchy.

When we say BIC is akin to the Minimum Description Length Principle, it is in terms of their shared principle of complexity penalty. Both methods aim to strike a balance between fitting the data (likelihood) and model complexity (number of parameters or description length). They both discourage overfitting by adding a complexity penalty to the fit of each model.

However, BIC does not directly account for the complexity of the model’s structure in a way that could be mapped to a specific level of the Chomsky Grammar Hierarchy. BIC’s penalty factor primarily takes into account the number of parameters, not the complexity levels of grammar that may be entrained in the data.

So, while the MDL principle can be envisaged to include a more direct consideration of the complexity of the grammatical structure of the data - with the choice of coding scheme reflecting this complexity - BIC provides a measure of complexity based only on the number of model parameters, and so its mapping onto the Chomsky Grammar Hierarchy isn’t as direct or intuitive.

Furthermore, the complexity of the model (and the corresponding grammar level) is also a function of the particular data set being modeled - the same model may reflect different grammar levels when applied to different data sets. In general, context-sensitive grammars (Type 1 in the Chomsky Hierarchy) or their lower equivalents might often be used in modeling real-world data due to their balance between expressive power and computational tractability.

In conclusion, while both BIC and the MDL principle aim at balancing fit and complexity, they achieve this from slightly different angles. BIC does so through a parameter count-based penalty on the likelihood, while the MDL principle does this through balancing model accuracy and description length, which could be more directly tied to the complexity of the structure of the data, and by analogy, the Chomsky Grammar Hierarchy. But a direct correspondence between the BIC and a specific Chomsky Grammar level isn’t typically drawn.

Q: Please answer the same question but replace the Chomsky Hierarchy of grammars with automata classes:
Finite-state machine
Pushdown automata
Linear-bounded Automata
Turing machine

A: The Chomsky hierarchy and classes of automata go hand in hand. Each type of grammar in the Chomsky hierarchy corresponds roughly to a kind of automaton that recognizes the language generated by that grammar.

Here is the correspondence between the grammars and the automata classes:

  1. Finite-state machine (FSM): This simplest type of automaton corresponds to Type 3 grammars in the Chomsky hierarchy, which generate regular languages. FSMs have limited memory and can’t handle nested structures in data.

  2. Pushdown Automata: These correspond to Type 2 grammars in the Chomsky hierarchy, which generate context-free languages. Pushdown automata have a finite control but also additional memory in the form of a stack, enabling them to handle arbitrarily deep nested structures such as balanced parentheses.

  3. Linear-bounded Automata: These correspond to Type 1 grammars in the Chomsky hierarchy which generate context-sensitive languages. A linear-bounded automaton is a specific type of Turing machine where the movement of the tape head is restricted, thus enabling more complex processing than pushdown automata but less than a full Turing Machine.

  4. Turing machine: This is the most powerful class of automaton. It corresponds to Type 0 grammars in the Chomsky hierarchy, which generate recursively enumerable languages. Turing machines can express any computational process involving algorithms or procedures and can handle any level of data complexity, including multi-level nesting and interdependencies.

Now, back to your question. If we’re saying BIC or MDL principles are akin to an automata class rather than the Chomsky hierarchy:

The BIC primarily penalizes the number of parameters in the model and favors “simpler” models, making it akin to lower-level automata classes, perhaps finite-state machines or pushdown automata, depending on the specific model it’s evaluating.

In contrast, the MDL principle attempts to find the balance between sufficiency and simplicity in both the model and the data but doesn’t limit the complexity of the model’s structure. Thus, it might correspond more broadly to any automata class, from finite-state machines to Turing Machines, depending on the complexity of the best model for the given data.

Keep in mind, though, these are conceptual associations rather than direct correspondences. The application of information criteria like BIC or the MDL principle to automata classes isn’t a standard way of thinking about model selection problems and so the comparisons here are largely illustrative. The specific correspondence would depend on the intricacies of the particular models and data being used. The association between BIC or MDL and automata would be better thought of as the ability to handle increasing model complexity, rather than a direct one-to-one mapping. The important point is that both the BIC and MDL principles provide a measure of simplicity or parsimony in a model, trying to achieve good fit to the data without unnecessary complexity. And the type of automata capable of recognizing a given language or data pattern is an approximate measure of the complexity of that pattern.
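The claim that finite-state machines “can’t handle nested structures” while a stack suffices can be demonstrated with two toy recognizers for balanced parentheses (hypothetical code for illustration, not from the thread):

```python
def fsm_balanced(s, max_depth=3):
    # A finite-state "recognizer" can only track nesting up to a hard-coded
    # bound, because its state set is finite (here: depths 0..max_depth).
    depth = 0
    for c in s:
        if c == '(':
            depth += 1
            if depth > max_depth:
                return None  # out of states: this FSM cannot decide the input
        elif c == ')':
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

def pda_balanced(s):
    # A pushdown automaton has an unbounded stack, so any nesting depth works.
    stack = []
    for c in s:
        if c == '(':
            stack.append(c)
        elif c == ')':
            if not stack:
                return False
            stack.pop()
    return not stack

deep = '(' * 10 + ')' * 10
print(fsm_balanced(deep))  # None: nesting exceeds the finite memory
print(pda_balanced(deep))  # True
```

Any fixed `max_depth` fails on some deeper input, which is precisely why balanced-parenthesis languages are context-free rather than regular.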


I suppose I could go through and critique GPT4’s responses, but they boil down to things I’ve already stated about confusion in the field over “The Minimum Description Length Principle”: all-too-often, those invoking it and/or BIC are confounded by the history of abuse of that phrase in the context of making the model “computable”* – which is to say they are less interested in the accuracy of the resulting model’s predictions than they are in economizing on the computer resources used to generate the model. Indeed, they are so UNinterested in accuracy of predictions that they choose a description language that can’t model the equations governing the dynamical systems (ie: REALITY) generating the observations.

This is a very deep and abiding sickness.

*Even worse, the word “computable” is with regard to the meta-algorithm – the algorithm that generates the model – which, in the case of an algorithmic description language, is itself an algorithm whose predictions are entirely computable.

It just makes me nauseous to think of all the massive human suffering caused by

in not just high places but in the very intellectual centers whose job it is to understand such trivially obvious levels of abstraction.


We need a working time travel machine that could either bring Alan Turing to 2023 and plop him in front of ChatGPT or, alternatively, transport a working ChatGPT instance to Turing’s desk :wink: The former might be slightly easier from an energy requirement perspective.


That would be one EPIC “Oh shit!” moment when he realized how much damage had been done by the facile psychological appeal of mimicking human behavior as “The AI Test”, drawing attention away from finding the smallest executable archive of a wide range of longitudinal observations as “The AI Test”.

People would listen to Turing when he told them to make their target description language Turing complete in “The MDL Principle”. Not even Hutter can get Alphabet corporation to back his prize. Nor can Zenil get Wolfram to do anything along these lines even though both Hutter and Zenil are fully on board with the critical need for Turing complete codes and both Alphabet and Wolfram are highly motivated to find breakthroughs in data modeling.


I had to laugh out loud when I saw this example of how befuddled the field of information criteria is:

“AIC,BIC,CIC,DIC,EIC,FIC,GIC,HIC,IIC — Can I use them interchangeably?”

And that’s not even an exhaustive list of the "IC"s over which statisticians endlessly palaver.