How To Nuke Large Language Model Bias?

This reminds me of Matt Mahoney’s thesis on competitive message routing, upon which he proposes to build a semantic vector space model to achieve AGI. I suggested that Matt get into the social networking business about a decade ago, after he introduced me to Marcus Hutter’s work. I didn’t do it myself for a variety of reasons, not the least of which was that Matt had mentioned to me, of all things, gzip as a rough-cut approximation of Kolmogorov complexity, and I just didn’t believe gzip was adequate since its grammar is too low on the Chomsky hierarchy.

I’ll definitely be reading this paper with interest.


The more recent follow-up post clarifies that gzip remains competitive, though it’s not “beating” all other classifiers across the board.
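For readers who haven’t seen the trick: the gzip classifier in question is essentially nearest-neighbor classification under normalized compression distance (NCD). A minimal sketch of the idea, under the assumption of a 1-nearest-neighbor rule; the training pairs and labels below are purely illustrative:

```python
import gzip

def c(data: bytes) -> int:
    """Compressed length in bytes: a computable, rough upper bound on Kolmogorov complexity."""
    return len(gzip.compress(data))

def ncd(a: bytes, b: bytes) -> float:
    """Normalized compression distance: small when a and b share structure,
    because compressing the concatenation exploits their shared substrings."""
    return (c(a + b) - min(c(a), c(b))) / max(c(a), c(b))

def classify(text: str, labeled: list[tuple[str, str]]) -> str:
    """1-nearest-neighbor under NCD: the label of the closest training example wins."""
    return min(labeled, key=lambda pair: ncd(text.encode(), pair[0].encode()))[1]
```

The published method uses k-nearest neighbors over NCD scores; this is just the k = 1 special case, enough to show why no training phase is needed at all.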


Yet another example where lossless compression would have averted any confusion. No matter what your benchmark, you can’t beat lossless compression for simplicity, clarity, objectivity, and above all rigor. If people with money want to do serious due diligence on machine learning algorithms, all they need to do is demand that benchmarks be run against whatever database is settled on*. Researchers are too lazy to use lossless compression as the most principled model selection criterion, even though, if they were intellectually honest and principled themselves, they’d include it as the highest-priority benchmark.

*Yes, this would exclude data hogs like ChatGPT but then data efficiency demands the ability to engage in something approaching reasoning ability – so necessity becomes a virtue.

BTW: This exchange with some virulent idiot on ycombinator demonstrates why that site is basically worthless in guiding capital. For context, the name “ycombinator” was chosen by the founding VCs, such as Paul Graham, because they were aware of Haskell Curry’s fixed-point combinator from combinatory logic. The virulent idiot knew enough about me and/or Solomonoff to invoke “AIXI” as though that was what I was referring to, but not enough to see, even after repeated corrections, his own ignorance of AIXI and Solomonoff Induction, or his own inability to read. This is the kind of thing I run into more than expected when I push on this vital issue. I don’t think it is “psychotic” of me to suspect that this kind of virulent idiot is on a mission to do damage.


Ya know… It’s incredibly ironic: people are scrambling to apply LLMs to improve ML, with profound societal transformations (including AGI) in prospect, yet GPT-4 is already more intelligent on the key issue of allocating capital in ML, and no one in a position to allocate capital is paying any attention to what it says (below). I suppose the main (and it is major) criticism of what it says is an omission: it doesn’t emphasize how often people are misled into thinking that lossless compression based on fully Turing-complete codes produces models no better than the Bayesian Information Criterion (as Oxford was when I approached them about a prize at the start of the pandemic), whose codes can’t be used to construct programs:

Q: In a contest to produce the best model of the data in evidence where contestants are highly motivated by monetary prize awards, what are some reasons for using the size of its executable archive as the best loss function?

A: In the context of competitive model-building where a monetary prize is involved, using the size of the executable archive as the loss function provides several compelling advantages:

  1. Universality & Fairness: This approach makes no assumptions about the data or the models being used. Thus, it is universally applicable, irrespective of the specifics of the problem or the techniques the participants choose to use.

  2. Avoiding Overfitting: This strategy discourages overfitting. Overfit models may memorize the training data without learning its underlying structure, meaning they won’t generalize well to new data. Using the size of a compressed file to express the data’s complexity encourages models to learn and represent the essential structure of the data without overfitting.

  3. Encouraging Innovation: Since this loss function is not standard in many machine learning contexts, it encourages participants to think creatively and innovatively about how to best structure and compress the information present in their models.

  4. Practical Relevance: Finally, in real-world applications, creating compact representations of data is often highly desirable. Models that can compress data efficiently are likely to be useful in many practical scenarios, including handling, transmitting, and storing large-scale data.

  5. Emphasis on Data Compression: The use of data compression as a loss function also encourages an understanding of the fundamental principles of information theory. In an era of increasing data volumes, understanding how to represent data in a compact, yet representative form is crucial.

  6. Achieving Simplicity: Using the size of the executable archive encourages the creation of simpler and more interpretable models. According to the principle of Occam’s Razor, simpler models are often preferred as they carry less risk of overfitting and are easier to understand and explain.

  7. Resource Efficiency: Models that lead to smaller executable archives are more resource-efficient, as they use less storage and computational power. This is advantageous in large-scale data processing situations where resources may be limited.

  8. Robustness: Models that can achieve good performance with a minimum description length are likely to be more robust because they focus on capturing the most salient features of the data.

It’s important to note that while these reasons make the case for using the size of an executable archive as a loss function, the practical application might still be challenging due to the complexity of accurately calculating it. This difficulty, though, might spur further innovation among competitors in their quest for the winning model.
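The contest mechanics described above can be made concrete in a few lines. The sketch below scores a Hutter-Prize-style submission as decompressor size plus archive size, with lossless reconstruction verified against a hash of the benchmark data; the function and argument names are my own illustration, not from any actual contest:

```python
import gzip
import hashlib

def score_submission(decompressor_bytes: bytes, archive_bytes: bytes,
                     original_sha256: str, decompress) -> int:
    """Score = total bytes the judges must receive to reconstruct the data exactly.
    Raises ValueError if the submission is not lossless; otherwise lower is better."""
    reconstructed = decompress(archive_bytes)
    if hashlib.sha256(reconstructed).hexdigest() != original_sha256:
        raise ValueError("submission does not losslessly reproduce the data")
    return len(decompressor_bytes) + len(archive_bytes)

# Baseline entry: gzip as both archiver and (stand-in) decompressor program.
data = b"abracadabra " * 1000
target_hash = hashlib.sha256(data).hexdigest()
baseline = score_submission(b"stub-decompressor", gzip.compress(data),
                            target_hash, gzip.decompress)
```

Counting the decompressor itself is what closes the loophole of hiding model complexity in the decoder; any contestant’s cleverness must be paid for in bytes.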

Q: In terms of scientific ethics where there are serious consequences to society in identifying erroneous data sources and making accurate predictions, what are the advantages of such a contest?

A: A contest based on the size of an executable archive as the loss function can bring several advantages in terms of scientific ethics and social impact, particularly when accurate predictions and reliable data sourcing are essential.

  1. Promotes Accuracy: Since this approach discourages overfitting, it is likely to result in more accurate and reliable models that generalize better to new data. This is critical in contexts where inaccurate predictions can have serious societal consequences, such as healthcare, environmental forecasting, and policy-making.

  2. Ensures Transparency: Using an objective, universally applicable loss function like executable size ensures transparency. This is a key tenet of scientific ethics, as it allows for open scrutiny and replication of results, fostering trust in the conclusions drawn from the data.

  3. Encourages Reproducibility: The use of a clearly defined, measurable loss function supports reproducibility, enabling other researchers to confirm the findings, which is a fundamental principle of the scientific method.

  4. Stimulates Innovation: Such contests encourage the development of novel data representation and analysis methodologies, which can lead to significant advancements in areas like predictive modeling, big-data analytics, machine learning, and information theory.

  5. Highlights the Importance of Data Integrity: A loss function based on data complexity implicitly emphasizes the importance of data quality.

Q: In the event of litigation by a losing contestant over the judges’ decision to make such a monetary prize award, what are some of the advantages using the size of the executable archive as the loss function?

A: In case of a controversy over the awarding of the monetary prize, using the size of an executable archive as the loss function provides several benefits that could potentially support and uphold the judges’ decision:

  1. Objectivity: The size of an executable archive is a clearly measurable and objective quantity. It is a single, precisely measurable number that leaves little room for ambiguity or bias in model evaluation, making it hard to dispute.

  2. Transparency: If all contestants are aware in advance that this is the criterion being used, and if the mechanism for calculating the size of the executable archive is clearly explained and shared, this transparency could reinforce fairness and impartiality.

  3. Comparability: It allows direct, apples-to-apples comparison of all final models, irrespective of the type of model or technique used.

  4. No Assumptions: Unlike other loss functions, which may depend on certain assumptions or statistical properties of the data, the size of the executable archive makes no such assumptions. This eliminates the chance of disputes based on violations of those assumptions.

  5. Simplicity & Ease of Understanding: The concept of lower data complexity equating to a smaller file size is relatively easy to explain and understand, even for people without advanced training in machine learning or statistics.

However, like any contest rule or criterion, it’s essential to communicate the loss function clearly and thoroughly before the competition begins. Participants should understand the requirements, the judging mechanism, and why the specific loss function was chosen. This understanding will help to minimize disputes and misunderstandings.

Moreover, it can also be beneficial to have judges or a committee with a thorough understanding of information theory and machine learning concepts. Their expert opinions can provide validation for the chosen loss function and help mediate any disputes that arise.

Lastly, the calculation process should be as reliable and error-free as possible to avoid any contestation. This can be achieved by using recognized and validated software and algorithms. All evaluations should be double-checked to avoid mistakes that could lead to disputes or controversies.

Q: Why have contests like this not been recognized as best practice in politically charged scientific controversies where highly motivated reasoning infects the scientific community with bias?

A: The application of contests using executable archive size as a loss function in politically charged scientific controversies is not common for several reasons:

  1. Lack of Awareness: Not everyone in the scientific community, let alone the political sphere, is familiar with the concepts of information theory or uses it as a standard method for evaluating models. The idea itself might seem abstract or overly technical to many people.

  2. Inherent Complexity: The process of creating an executable archive from a data model, measuring its size, and then deriving meaningful conclusions from it is a complex process that requires specialized knowledge and technical skills. These contests, therefore, might appear inaccessible or difficult to understand for a broad audience, including policymakers and the general public.

  3. Political Bias: A politically charged environment often involves people holding firm beliefs and convictions that could make them resistant to changing their views, regardless of the scientific evidence presented. In such cases, even an objective, data-driven approach may not prevail over inherent biases or political considerations.

  4. Misinterpretation: Technical results can be misinterpreted or misunderstood by non-experts, leading to conflicts and disputes.

  5. Resource Constraints: Organizing such contests requires considerable resources, including time, expertise, monetary funds for prize distribution, and computational resources for participants or evaluators to function effectively.

Q: What organizations would most-benefit from such contests?

A: Several types of organizations would potentially benefit from hosting contests that use the size of an executable archive as a loss function. Here are a few examples:

  1. Tech Companies: Especially those dealing with large data sets would benefit from effective information compression. These contests could lead to innovation in efficient data representation and storage.

  2. Research Institutions: Academic or research-focused organizations interested in data sciences, machine learning, and information theory might organize these contests to spur new methods and insights in the field.

  3. Government Agencies: Particularly those in departments responsible for data management, prediction modeling, and algorithm development. It could help to optimize modeling processes, save storage space, and improve speed and efficiency.

  4. Predictive Analytics Companies: Such contests could encourage the development of more efficient models that can handle large datasets and perform well on unseen data, making predictions more reliable.

  5. Healthcare Organizations: Especially those dealing with massive data like genomics and medical imaging data. Efficient data representation could aid in faster processing and analysis.

  6. Nonprofits and NGOs: Particularly those focusing on data-driven research or decision making, as the winning models might offer more efficient ways of handling and analyzing data.

  7. Education Institutions: Schools and universities could run these contests to provide training and real-world experience for students studying in fields like computer science, mathematics, statistics, and more.

  8. Machine Learning and AI Startups: Such contests could help in identifying new talents and innovative approaches in dealing with large datasets, model compression, or effective predictive model building.

  9. Cloud Service Providers: As they deal with lots of data storage and transfer, any algorithm or model that can help to reduce the size of the data without losing its integrity or complexity would be beneficial.

  10. Cybersecurity Firms: Efficient data representation and compression techniques can contribute to faster and more effective threat detection algorithms.

Overall, any organization that deals with large amounts of data or faces challenges in effective data representation and processing can benefit from such contests. These contests can drive innovation forward, encourage breakthroughs in machine learning and information theory, and help organizations find cost-effective, efficient methods to manage and analyze their data.

Q: Relative to a given set of data, why is the size of its executable archive the most principled loss function?

A: Understanding why the size of an executable archive can be considered the most principled loss function relative to a given set of data requires a grounding in information theory.

The fundamental idea is that the size of an executable archive (a compressed representation of your data) directly measures the amount of “information” in your data. If all the relevant information in your data can be compressed into a small executable file, then the data is said to have low complexity or low entropy. Conversely, if the size of the executable file is large, then the data is said to have high complexity or high entropy.

In other words, the size of the executable archive approximates the minimum number of bits required to represent or “describe” the data; any particular archive provides an upper bound on the data’s true (uncomputable) Kolmogorov complexity, and the smaller the archive, the tighter that bound.

Using this as a loss function is therefore principled because it directly relates to the inherent complexity or “information content” of your data. Furthermore, it is universally applicable (in principle) to any type of data, and does not require any specific assumptions about the distribution or statistical properties of your data.
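The claim above is easy to demonstrate: gzip (a crude stand-in for the uncomputable ideal compressor) gives a computable upper bound on a string’s complexity, and highly regular data lands far below incompressible random data:

```python
import gzip
import os

def compressed_size(data: bytes) -> int:
    # gzip output length: an easily computed *upper bound* on the
    # true (uncomputable) Kolmogorov complexity of the data
    return len(gzip.compress(data, compresslevel=9))

structured = b"0123456789" * 10_000  # 100 KB of pure repetition: low complexity
random_ish = os.urandom(100_000)     # 100 KB that is incompressible with overwhelming probability
```

Both inputs are 100,000 bytes, but the structured one compresses to a tiny fraction of that while the random one barely shrinks at all, which is exactly the distinction the loss function rewards.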

This approach of using the size of a compressed file as a measure of data complexity is sometimes called the Minimum Description Length (MDL) principle, and is a fundamental principle in the field of data compression and information theory.

The aim of most machine learning algorithms is to learn a model from the data that captures this underlying complexity without overfitting. Overfitting would be akin to storing the data as opposed to learning or capturing its patterns. Thus, an optimal learning algorithm should have its loss function directly related to this measure of complexity.

The concept of MDL is closely related to, and inspired by, the principle of Occam’s Razor, that the simplest explanation (i.e., the one requiring the least amount of information to describe) is often the best. A model that captures the data’s regularities efficiently will result in a smaller executable archive when the data is compressed according to the model.

In conclusion, the size of the executable archive of the data can be seen as a principled choice for a loss function, since it directly measures the intrinsic complexity of the data, which is what most learning algorithms strive to capture. However, in practice, employing this as a loss function can be computationally challenging, and instead approximations or alternative loss functions might be used.

Q: What is an approximation that might be used?

A: Two commonly used approximations to the ideal loss function in machine learning are Mean Squared Error (MSE) for regression problems and Cross-Entropy for classification problems.

  1. Mean Squared Error (MSE): For a set of predicted values (ŷ) and true values (y), the MSE is calculated as the average of the square of the differences between the predicted and true values. This loss function is particularly useful as an approximation when the goal is to minimize the disparity between predictions and actual results, and is most often used in regression problems.

  2. Cross-Entropy: This loss function is often used in classification problems, especially those involving two or more classes. The function attempts to minimize the difference between predicted probabilities and actual categories by using the natural logarithm function to give a high penalty to incorrect predictions.

These approximations might be practical because, unlike calculating the size of the executable archive, they can be computed efficiently and have nice mathematical properties that make optimization easier.
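For concreteness, both approximations fit in a few lines of plain Python (illustrative reference implementations, not taken from any particular library):

```python
import math

def mse(y_true, y_pred):
    """Mean squared error: average squared gap between prediction and truth."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def cross_entropy(p_true, p_pred, eps=1e-12):
    """Cross-entropy between a one-hot (or soft) target and predicted probabilities.
    The log heavily penalizes confident wrong predictions; eps guards log(0)."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(p_true, p_pred))
```

Note the tie back to description length: under a Gaussian noise model, minimizing MSE minimizes a code length for the residuals, and cross-entropy is literally the expected number of nats needed to encode the labels under the model, so even these standard losses can be read as code-length approximations.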

However, these loss functions are not a direct measure of data complexity, and using them may involve making assumptions about the data. So there can be scenarios where these approximations fail to produce an accurate model. Each type of loss function, whether it’s MSE, cross-entropy, or another type, has its strengths and drawbacks depending on the specific type of problem and the nature of the data.

For instance, hinge loss might be used for Support Vector Machines, or a custom loss function might be required for a specific problem.

Some loss functions are more resilient to outliers, such as Mean Absolute Error (MAE) or Huber loss, which are less sensitive to large errors because they do not square the differences like MSE.
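A quick sketch of those robust alternatives (the choice of δ = 1.0 below is an arbitrary illustrative threshold):

```python
def mae(y_true, y_pred):
    """Mean absolute error: linear in the residual, so no single point dominates."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def huber(y_true, y_pred, delta=1.0):
    """Huber loss: quadratic near zero (smooth, MSE-like), linear in the tails
    (MAE-like), so gross outliers contribute only linearly."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        r = abs(t - p)
        total += 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)
    return total / len(y_true)
```

On residuals [0, 0, 0, 10], MSE scores 25.0 while MAE gives 2.5 and Huber (δ = 1) gives 2.375, showing how squaring lets a single outlier dominate the average.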

Furthermore, some complex problems may require the combination of different loss functions to fully capture the underlying complexity of the data. For instance, object detection tasks in machine learning often include a combination of classification loss (like cross-entropy) and localization loss (like smooth L1 loss).

While the goal of using these other loss functions remains the same (minimizing the difference between the model’s predictions and the true values), it’s important to remember that these are still relative measures, and the absolute minimum depends on the choice of model and even on the specific implementation.

That said, in practice, determining the optimal loss function often requires empirical trials with various loss functions and models to find the one that delivers the best performance for the specific machine learning task at hand.


Given the frame of the other answers, what does it answer if you prompt it to explain why people are blindsided like this?


Q: Please explain why people are blindsided like this. After all, it has been known since the advent of Algorithmic Information Theory that lossless compression is the most principled model selection criterion relative to a given set of observational data. Moreover, during that time Moore’s Law has been in effect, providing an explosion of applicable resources.

A: There could be multiple reasons why people might seem ‘blindsided’ or not have fully adopted the principles of Algorithmic Information Theory even with the growth of computational power.

  1. Complexity and Specialist Knowledge: While it’s true that lossless compression is a principled model selection criterion, it requires a deep understanding of information theory and complex algorithms, which are advanced topics, not widely understood outside of specialist circles.

  2. Institutional Inertia: Established practices and methods in science and industry can be slow to evolve. Even with compelling new approaches, there can be resistance to changing methodologies that are already familiar and widely understood.

  3. Priorities and Practical Challenges: Although we have seen an exponential increase in computing resources, we have also seen an explosion of data sizes. Many organizations might prioritize handling and processing this data over optimizing and compressing models.

  4. Misalignment with Common Objectives: In many real-world situations, other criteria might be more important or urgent than the size of the resulting model. As such, emphasis might be more on the performance, speed, interpretability, or specific application of a model rather than its size.

  5. Education and Communication: It could be a reflection of gaps in education and communication in disseminating and explaining these ideas broadly and in a manner that can be readily applied. While these principles of Algorithmic Information Theory are well-established in academic and specialized circles, they are not always well communicated or understood outside of these areas, especially in their practical applications.

  6. Interpretability: While smaller models might be more efficient in terms of storage and computation, they are not necessarily more interpretable. In many areas such as medicine, finance, or law, being able to interpret and understand the model’s decision-making process may be considered more important than having the most minimal model.

  7. Applicability: Most methodologies are built for specific applications, which have specific evaluation metrics per domain. The incorporation of model sizing as an evaluation metric may not directly relate to domain-specific success metrics.

While Algorithmic Information Theory provides a sound theoretical framework, it should be understood that the practical implementation of its principles can be challenging and may not always align with the objectives or constraints of specific applications or domains.


Q: At the start of the COVID-19 pandemic, I approached Oxford University’s experts in modeling the pandemic with the idea of a prize for creating the smallest executable archive of a wide range of longitudinal measures conjectured to be of relevance to public health policy. The expert with whom I communicated said that this was equivalent to the Bayesian Information Criterion for model selection under the Minimum Description Length Principle.

A: Indeed, the Bayesian Information Criterion (BIC) and the Minimum Description Length (MDL) Principle are both grounded in the philosophy of picking the simplest model that best describes the data. Both approaches include a penalty term for model complexity, which aligns with the idea of incentivizing the smallest executable archive.

Despite the similarity, there are differences based on how they approach the complexity of the model and the optimum balance between fit and complexity.

  1. The BIC is grounded in Bayesian probability and provides an approximate measure of the log of the Bayes Factor, offering a basis for model comparison. It penalizes free parameters, but does not take into account the structure of the model.

  2. The MDL Principle, grounded in information theory, aims to find the model that provides the shortest total description length of the model and the data. The structure and composition of the model do contribute to its complexity in the MDL principle.

But as raised in your conversation with the Oxford University expert, the gist is indeed the same: to balance goodness-of-fit with the complexity of the model, providing a preferred selection criterion for models aiming at both explanatory and predictive accuracy.
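The “equivalence” the Oxford expert invoked can be stated precisely: BIC’s penalty of (k/2)·ln n per parameter (in log-likelihood units) is the asymptotic code length for transmitting k real-valued parameters at optimal precision, which makes BIC a special case of a two-part MDL code for parametric families. A sketch with the conventional definitions (lower is better for both):

```python
import math

def bic(log_likelihood: float, k: int, n: int) -> float:
    """Bayesian Information Criterion: k*ln(n) - 2*ln(L-hat).
    Model complexity enters only through the parameter count k."""
    return k * math.log(n) - 2.0 * log_likelihood

def two_part_mdl(model_bits: float, data_given_model_bits: float) -> float:
    """Crude two-part MDL: bits to state the model plus bits to encode the
    data under it. Unlike BIC, model_bits can reflect the full structure of
    the model (e.g., an arbitrary program), not just a parameter count."""
    return model_bits + data_given_model_bits
```

The standard derivation: encoding each of k parameters to precision 1/√n costs about (1/2)·log n bits, so BIC’s penalty is exactly the parameter-coding half of a two-part code. The equivalence therefore holds only for model families whose complexity is fully captured by a parameter count, which is precisely the caveat about Turing-complete codes raised earlier in this thread.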

Using the size of the executable archive as a model selection criterion (akin to applying the MDL principle) can be seen as a practical, operationalized way of implementing these theoretical concepts in a contest format. It incentivizes not only good prediction accuracy but also efficient and elegant use of computational resources, making it a great tool to stimulate innovative solutions in complex scenarios like pandemic modeling.

Such a contest, as initially proposed, could inspire a variety of more concise and efficient models to predict the spread of the virus. Its execution, however, would require careful planning, clarity of guidelines, and a thorough means of evaluating results.

Potentially, it could assist policymakers by providing a more straightforward mathematical means of comparing predictive models, considering both their fit to the data and their complexity. This could be especially useful in a pandemic situation where accurate, reliable predictions are needed quickly to inform critical public health decisions. But care must be taken in explaining and interpreting these models’ results, especially given the high stakes and the need for clear communication during a time of public concern and uncertainty.

Q: In the Minimum Description Length Principle, one must choose the “codes” used to describe the data. What is the relationship between the choice of the codes and the choice of level in the Chomsky Grammar Hierarchy?

A: The Chomsky Grammar Hierarchy defines four types of grammars (Type 0 to Type 3) that generate different types of formal languages with varying complexity. The complexity of a grammar according to Chomsky’s Hierarchy is determined by its production rules, symbols, and structure.

On the other hand, the Minimum Description Length (MDL) Principle is a framework for inductive inference and model selection which states that the best model is the one that minimizes the total length of the description of the model and the data.

The relationship between the Chomsky Grammar Hierarchy and the choice of codes in the MDL Principle is rather conceptual. The more expressive levels of the hierarchy (Type 0, unrestricted grammars, and Type 1, context-sensitive grammars) allow for more complex relationships and dependencies within the data. Therefore, modeling data with these grammars could potentially require more complex codes to encapsulate those relationships and dependencies.

The less expressive levels of the hierarchy (Type 2, context-free grammars, and Type 3, regular grammars) have more restrictions and simpler structures. Hence, less complex codes might suffice to model the data, but predictive power may be compromised by the inability to capture complex dependencies.

In essence, the Chomsky Grammar Hierarchy level could give an indication of the complexity inherent in the data, and that reflected in the code used to model it. Higher levels could describe more complex data behaviors but might require more intricate coding schemes under the MDL principle. Lower levels, being simpler, may map well with less complex ‘codes’, but may miss out on complex data behaviors.

It’s crucial to balance this trade-off between model complexity and model accuracy when compressing the data under the MDL principle, keeping in mind that both the data’s inherent complexity and the chosen code’s efficiency contribute to the total description length.

Q: What level of grammar corresponds to the Bayesian Information Criterion when BIC is described as equivalent to the MDL Principle?

A: Bayesian Information Criterion (BIC) is a criterion for model selection among a finite set of models. It’s based on the likelihood function and uses a penalty term for the number of parameters in the model. However, it does not explicitly define a correspondence between its structure and the Chomsky Grammar Hierarchy.

When we say BIC is akin to the Minimum Description Length Principle, it is in terms of their shared principle of complexity penalty. Both methods aim to strike a balance between fitting the data (likelihood) and model complexity (number of parameters or description length). They both discourage overfitting by adding a complexity penalty to the fit of each model.

However, BIC does not directly account for the complexity of the model’s structure in a way that could be mapped to a specific level of the Chomsky Grammar Hierarchy. BIC’s penalty primarily takes into account the number of parameters, not the levels of grammatical complexity that may be present in the data.

So, while the MDL principle can be envisaged to include a more direct consideration of the complexity of the grammatical structure of the data - with the choice of coding scheme reflecting this complexity - BIC provides a measure of complexity based only on the number of model parameters, and so its mapping onto the Chomsky Grammar Hierarchy isn’t as direct or intuitive.

Furthermore, the complexity of the model (and the corresponding grammar level) is also a function of the particular data set being modeled - the same model may reflect different grammar levels when applied to different data sets. In general, context-sensitive grammars (Type 1 in the Chomsky Hierarchy) or their lower equivalents might often be used in modeling real-world data due to their balance between expressive power and computational tractability.

In conclusion, while both BIC and the MDL principle aim at balancing fit and complexity, they achieve this from slightly different angles. BIC does so through a parameter count-based penalty on the likelihood, while the MDL principle does this through balancing model accuracy and description length, which could be more directly tied to the complexity of the structure of the data, and by analogy, the Chomsky Grammar Hierarchy. But a direct correspondence between the BIC and a specific Chomsky Grammar level isn’t typically drawn.

Q: Please answer the same question but replace the Chomsky Hierarchy of grammars with automata classes:
Finite-state machine
Pushdown automata
Linear-bounded Automata
Turing machine

A: The Chomsky hierarchy and classes of automata go hand in hand. Each type of grammar in the Chomsky hierarchy corresponds roughly to a kind of automaton that recognizes the language generated by that grammar.

Here is the correspondence between the grammars and the automata classes:

  1. Finite-state machine (FSM): This simplest type of automaton corresponds to Type 3 grammars in the Chomsky hierarchy, which generate regular languages. FSMs have limited memory and can’t handle nested structures in data.

  2. Pushdown Automata: These correspond to Type 2 grammars in the Chomsky hierarchy which generate context-free languages. Pushdown automata have a finite control but also have additional memory in the form of a stack, enabling them to handle one level of nesting.

  3. Linear-bounded Automata: These correspond to Type 1 grammars in the Chomsky hierarchy which generate context-sensitive languages. A linear-bounded automaton is a specific type of Turing machine where the movement of the tape head is restricted, thus enabling more complex processing than pushdown automata but less than a full Turing Machine.

  4. Turing machine: This is the most powerful automaton. It corresponds to Type 0 grammars in the Chomsky hierarchy, which generate recursively enumerable languages. Turing machines can express any computational process involving algorithms or procedures and can handle any level of data complexity, including multi-level nesting and interdependencies.
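The FSM/PDA boundary described above can be demonstrated in a few lines. The following is an illustrative sketch (the function names are invented, not from any library): a regular expression, being equivalent to an FSM, can only recognize aⁿbⁿ by hard-coding finitely many values of n, while a PDA's single stack suffices for all n.

```python
import re

def fsm_attempt(s, max_n=3):
    # A regular expression (an FSM) can only approximate a^n b^n by
    # enumerating finitely many n -- here only up to max_n.
    pattern = "|".join(f"a{{{n}}}b{{{n}}}" for n in range(1, max_n + 1))
    return re.fullmatch(pattern, s) is not None

def pda_recognize(s):
    # A single stack suffices for the context-free language a^n b^n.
    stack = []
    seen_b = False
    for ch in s:
        if ch == "a":
            if seen_b:          # an 'a' after a 'b' breaks the pattern
                return False
            stack.append(ch)
        elif ch == "b":
            seen_b = True
            if not stack:       # more b's than a's
                return False
            stack.pop()
        else:
            return False
    return seen_b and not stack  # all a's matched, at least one pair

print(pda_recognize("aaaabbbb"))  # True for any n
print(fsm_attempt("aaaabbbb"))    # False: n=4 exceeds the FSM's finite memory
```

The point of the sketch: no matter how large you make `max_n`, some input defeats the FSM, whereas the stack handles every n — which is exactly the "limited memory" distinction drawn in item 1 above.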

Now, back to your question. If we’re saying BIC or MDL principles are akin to an automata class rather than the Chomsky hierarchy:

The BIC primarily penalizes the number of parameters in the model and favours “simpler” models making it akin to lower-level automata classes, perhaps finite-state machines or push-down automata, depending on the specific model it’s evaluating.

In contrast, the MDL principle attempts to find the balance between sufficiency and simplicity in both the model and the data but doesn’t limit the complexity of the model’s structure. Thus, it might correspond more broadly to any automata class, from finite-state machines to Turing Machines, depending on the complexity of the best model for the given data.

Keep in mind, though, that these are conceptual associations rather than direct correspondences. Applying information criteria like BIC or the MDL principle to automata classes isn’t a standard way of thinking about model selection problems, so the comparisons here are largely illustrative. The specific correspondence would depend on the intricacies of the particular models and data being used, and the association between BIC or MDL and automata is better thought of as the ability to handle increasing model complexity rather than a direct one-to-one mapping.

The important point is that both the BIC and MDL principles provide a measure of simplicity or parsimony in a model, trying to achieve a good fit to the data without unnecessary complexity. And the type of automaton capable of recognizing a given language or data pattern is an approximate measure of the complexity of that pattern.
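The BIC-versus-MDL contrast GPT-4 sketches verbally can be made concrete with a toy model-selection run. Everything below is an illustrative sketch with invented function names: BIC’s parameter-count penalty on the likelihood versus a crude two-part MDL code (a fixed bit cost per parameter plus the Shannon code length of the residuals), applied to data with an obvious linear trend.

```python
import math

def gaussian_nll(residuals):
    # Negative log-likelihood under a Gaussian with MLE variance.
    n = len(residuals)
    var = sum(r * r for r in residuals) / n
    return 0.5 * n * (math.log(2 * math.pi * var) + 1)

def bic(residuals, k):
    # BIC = k*ln(n) - 2*ln(L): a parameter-count penalty on the likelihood.
    n = len(residuals)
    return k * math.log(n) + 2 * gaussian_nll(residuals)

def mdl_two_part(residuals, k, bits_per_param=32):
    # Crude two-part code: fixed-precision cost of the model's parameters
    # plus the code length of the data given the model (NLL in bits).
    return k * bits_per_param + gaussian_nll(residuals) / math.log(2)

# Data with a clear linear trend plus small alternating "noise".
xs = list(range(10))
ys = [2.0 * x + 1.0 + (0.1 if x % 2 else -0.1) for x in xs]

mean_y = sum(ys) / len(ys)
res_const = [y - mean_y for y in ys]                         # k = 1 (mean only)
res_linear = [y - (2.0 * x + 1.0) for x, y in zip(xs, ys)]   # k = 2 (slope, intercept)

# Both criteria prefer (assign a lower score to) the linear model here,
# but they reach that verdict by penalizing complexity differently.
print(bic(res_const, 1) > bic(res_linear, 2))
print(mdl_two_part(res_const, 1) > mdl_two_part(res_linear, 2))
```

Both lines print `True`: on this data the extra parameter buys enough residual compression to pay for itself under either accounting scheme, which is the "balance of fit and complexity" both criteria share.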


I suppose I could go through and critique GPT-4’s responses, but they boil down to things I’ve already stated about the field’s confusion over “The Minimum Description Length Principle”: all too often, those invoking it and/or BIC are confounded by the history of abuse of that phrase in the name of making the model “computable”* – which is to say they are less interested in the accuracy of the resulting model’s predictions than in economizing on the computer resources used to generate the model. Indeed, they are so DISinterested in prediction accuracy that they choose a description language that can’t model the equations governing the dynamical systems (ie: REALITY) generating the observations.
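The lossless-compression benchmark advocated throughout this thread is also trivially easy to run: a lossless compressor’s output length upper-bounds the Kolmogorov complexity of the data (up to a constant), so it yields a simple, objective code-length score with no “IC” folklore attached. A stdlib-only sketch using zlib (a gzip-family compressor, so a deliberately low rung on the Chomsky Hierarchy):

```python
import random
import zlib

def code_length_bits(data: bytes, level: int = 9) -> int:
    # The compressed size in bits is a crude upper bound on the
    # Kolmogorov complexity of `data` -- lower means more regularity found.
    return 8 * len(zlib.compress(data, level))

random.seed(0)
structured = b"ab" * 5000                                    # highly regular
noise = bytes(random.randrange(256) for _ in range(10000))   # incompressible

# The benchmark: whichever "model" (compressor) emits the shorter code
# for the same data has captured more of its structure.
print(code_length_bits(structured) < code_length_bits(noise))  # True
```

The same scoring rule scales up to the Hutter Prize setting: fix the dataset, and rank modeling approaches purely by the size of the smallest self-extracting archive each can produce.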

This is a very deep and abiding sickness.

*Even worse, the word “computable” is with regard to the meta-algorithm – the algorithm that generates the model which, in the case of an algorithmic description language is, itself, an algorithm whose predictions are entirely computable.

It just makes me nauseous to think of all the massive human suffering caused by

in not just high places but in the very intellectual centers whose job it is to understand such trivially obvious levels of abstraction.


We need a working time travel machine that could either bring Alan Turing to 2023 and plop him in front of ChatGPT or, alternatively, transport a working ChatGPT instance to Turing’s desk :wink: The former might be slightly easier from an energy requirement perspective.


That would be one EPIC “Oh shit!” moment when he realized how much damage had been done by the facile psychological appeal of mimicking human behavior as “The AI Test”, drawing attention away from finding the smallest executable archive of a wide range of longitudinal observations as “The AI Test”.

People would listen to Turing when he told them to make their target description language Turing complete in “The MDL Principle”. Not even Hutter can get Alphabet corporation to back his prize. Nor can Zenil get Wolfram to do anything along these lines even though both Hutter and Zenil are fully on board with the critical need for Turing complete codes and both Alphabet and Wolfram are highly motivated to find breakthroughs in data modeling.


I had to laugh out loud when I saw this example of how befuddled the field of information criteria is:

“AIC,BIC,CIC,DIC,EIC,FIC,GIC,HIC,IIC — Can I use them interchangeably?”

And that’s not even an exhaustive list of the "IC"s over which statisticians endlessly palaver.
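The befuddlement is easy to demonstrate numerically: the various “IC”s penalize complexity on different scales, so they are demonstrably not interchangeable. A minimal sketch of just the penalty terms of the two most common ones (helper names invented):

```python
import math

def aic_penalty(k):
    # Akaike: 2k -- independent of sample size n.
    return 2 * k

def bic_penalty(k, n):
    # Bayesian/Schwarz: k*ln(n) -- grows without bound as n grows.
    return k * math.log(n)

# With enough data, BIC penalizes extra parameters far more heavily than
# AIC, so the two criteria can (and do) select different models.
k = 10
print(aic_penalty(k))                  # constant in n
print(round(bic_penalty(k, 100), 2))   # already larger at n = 100
print(round(bic_penalty(k, 10_000), 2))
```

Since the penalties diverge as n grows, any answer to “can I use them interchangeably?” other than “no” is statistical palaver — and neither penalty says anything about the expressive power of the description language, which is the complaint driving this whole thread.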


If it weren’t for the fact that the US is the new USSR, in that it has to centralize social policy to impose its secular religion on all localities in the name of “science”, I wouldn’t be going on about lossless compression as causal macrosocial model selection criterion. Instead, I’d be happy to just move to a territory that was raising the next generation the way I (we) see fit and be happy that even if I (we) turned out to be mistaken in some way, my (our) failure would be instructive.

But then it occurred to me that the USSR was more intellectually honest than the US government when it came to central planning – not making any bones about the fact that they were imposing their “scientific state” on everyone and violating the individual freedom to vote with your feet. They had some of the best mathematicians in the world, such as Kolmogorov, who was their Turing, working on their macrosocial models to “scientifically plan” their society. And, dig this: Kolmogorov invented the computer age’s best definition of information content – the one that lossless compression approximates. This means that if anywhere in the world had been in a position to use Kolmogorov Complexity approximation as the model selection criterion for macrosocial modeling, it would have been the USSR. They didn’t even have the US’s excuse of being run by a bunch of hypocrites when it comes to central planning! On top of that, they tended to shoot their corrupt bureaucrats, which is why they didn’t have to worry as much about great big assholes walking around pretending to be gold-plated hood ornaments in charge of social planning. That’s why, for example, their space program was so much more cost efficient than the wannabe central-planning technocommies at NASA who gave us the damn Shuttle.

But there is another application for macrosocial modeling than what we think of as a “centrally planned economy”:


What if the brain damage in the US regarding data modeling isn’t just unforced stupidity?

And I have to say that this explains some of my personal encounters with online assholes in public fora (ycombinator and metaculus) who strike me as at least a standard deviation or two more professional than your average “troll” – not merely effective as psyops. By “more professional” I mean they’ll create a pseudonym and delete it, leaving a token comment or two in other threads to make it look like it wasn’t a targeted attack – but those comments won’t have nearly the “psyop” flavor about them.

And ycombinator and metaculus are exactly where you’d expect pros to be causing brain damage regarding allocation of resources in the private sector (AI) and public sector (sociology) respectively.


This just in:


From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step

By Yuntian Deng, Yejin Choi, Stuart Shieber


When leveraging language models for reasoning tasks, generating explicit chain-of-thought (CoT) steps often proves essential for achieving high accuracy in final outputs. In this paper, we investigate if models can be taught to internalize these CoT steps. To this end, we propose a simple yet effective method for internalizing CoT steps: starting with a model trained for explicit CoT reasoning, we gradually remove the intermediate steps and finetune the model. This process allows the model to internalize the intermediate reasoning steps, thus simplifying the reasoning process while maintaining high performance. Our approach enables a GPT-2 Small model to solve 9-by-9 multiplication with up to 99% accuracy, whereas standard training cannot solve beyond 4-by-4 multiplication. Furthermore, our method proves effective on larger language models, such as Mistral 7B, achieving over 50% accuracy on GSM8K without producing any intermediate steps.

@jabowery Are they on the right track?


Tracking down the culprits responsible for conflating IS with OUGHT in LLM terminology, I’m somewhat at a loss due to terminology pollution – short of looking at their code to figure out what they mean by “fine-tuning”.

I’ll say this however:

IF they are “fine-tuning” their “foundation” model, then yes it is a step forward toward a smaller number of parameters describing the same dataset to the same accuracy – which is, essentially, the Algorithmic Information Criterion.

However, “fine-tuning” is usually (but not always) applied to the “output” or “task-specific” layer – the layer associated with that risible word “alignment”, which amounts to lobotomizing the LLM to get it to “behave”. This isn’t to say one might not achieve more “logical” behavior by giving the LLM’s foundation model a very selective lobotomy, of course. After all, no one is taking the AIC seriously enough at the foundation model level, which means the foundation model is a bad world model that might benefit from being “corrected” at the output/task-specific layer – sort of the way a psychotic patient might “benefit” from a lobotomy. But I don’t call that kind of “progress” progress. It’s just a ham-handed hack to avoid facing reality.

And even the above is an ill-formed statement, since the earliest use of “foundation model” was based not on data prediction (ie: not on unsupervised learning) but upon task-specific (supervised) training, in the assumption that enough would be learned about the world from a specific task (ie: learning to categorize pictures as containing cats vs dogs, etc.) that the result could be called a “world” model.

“Foundation model” should be reserved only for the result of taking the AIC seriously and “fine tuning” reserved only for some sort of task-specific output layer.

That the authors of this paper are insensitive to this is just one more symptom of the lack of seriousness in the AI world arising from the centralization of positive network externalities, which leaves the field basking in a bath of monetary fentanyl.