How To Nuke Large Language Model Bias?

From what I’ve been seeing of the Tesla machine learning culture, they’ll be dragged kicking and screaming via RNNs across the Turing complete finish line for language models. There is, however, hope for this because people do tend to kick and scream when their loved ones die in traffic accidents.

PS: I should probably reiterate my request that people think of video compression as building 3D models of the objects the camera sees, so that GPUs can render those objects photorealistically as a first-order approximation of the reconstructed video. Seen in this way, it should be obvious that Tesla will also end up with a breakthrough in the streaming video world.


Critical Thinking: The process of approximating the Kolmogorov Complexity algorithm of one’s observations.

Understanding the practicalities entailed by “approximating” is where real information theory meets the natural (observational/empirical) sciences.
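One way to make "approximating" concrete: any lossless compressor gives a computable upper bound on the Kolmogorov complexity of an observation record. A minimal sketch using gzip (the function name `approx_K` and the toy data are mine, not anything canonical):

```python
import gzip
import random

def approx_K(data: bytes) -> int:
    """Compressed size in bytes: a computable upper bound on K(data)."""
    return len(gzip.compress(data, compresslevel=9))

# Highly regular observations have a short generator, which the compressor
# partially discovers; random-looking data does not.
regular = b"ab" * 5000                                      # 10,000 bytes
random.seed(0)
noisy = bytes(random.getrandbits(8) for _ in range(10000))  # 10,000 bytes

print(approx_K(regular))   # small: the regularity is discovered
print(approx_K(noisy))     # near the raw size: no short description found
```

The gap between those two numbers is the compressor's estimate of how much latent structure the "observations" contain, which is exactly the quantity the Algorithmic Information Criterion scores.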



Sounds like a Stack-RNN may be the next step for DeepMind given the prominent mention in the recent and aforelinked Princeton/DeepMind paper “Neural Networks and the Chomsky Hierarchy”. However, since there are no authors in common between the two papers, it may require overcoming some of the Big Org problems that have plagued Alphabet’s ability to execute on its in-house talent.

Recurrent Residual Networks Contain Stronger Lottery Tickets

ABSTRACT Accurate neural networks can be found just by pruning a randomly initialized overparameterized model, leaving out the need for any weight optimization. The resulting subnetworks are small, sparse, and ternary, making excellent candidates for efficient hardware implementation. However, finding optimal connectivity patterns is an open challenge. Based on the evidence that residual networks may be approximating unrolled shallow recurrent neural networks, we conjecture that they contain better candidate subnetworks at inference time when explicitly transformed into recurrent architectures. This hypothesis is put to the test on image classification tasks, where we find subnetworks within the recurrent models that are more accurate and parameter-efficient than both the ones found within feedforward models and than the full models with learned weights. Furthermore, random recurrent subnetworks are tiny: under a simple compression scheme, ResNet-50 is compressed without a drastic loss in performance to 48.55× less memory size, fitting in under 2 megabytes. Code available at: GitHub - Lopez-Angel/hidden-fold-networks.

What I find particularly intriguing about this general topic is that if one combines the notion of RNNs with not only the Strong Lottery Ticket hypothesis but sparse ternary connection matrices, it is conceivable that one may use commodity GPUs to contain only* the RNN’s state vector. Given float32 state vectors, that means 2G neuron RNNs for 8GB GPUs. The scaling laws of the Strong Lottery Ticket hypothesis mean the training time to achieve exceedingly high performance RNNs may be drastically reduced while, ultimately, pruning not only the connection matrices even more, but also the neurons.

*The math on the PCI->GPU transfer rate relative to the GPU memory->memory transfer rate via its cores isn’t nearly as bad as some more naive analyses may report, especially with very large state vectors (number of neurons) in a single layer. Further optimizations can be achieved by encoding the sparse connection matrix with a pseudorandom number generator thereby reducing the size of the connection matrix to just a main memory resident pruning mask on an already sparse connection matrix that, itself, takes virtually no memory.
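A minimal sketch of that encoding trick, assuming the simplest possible scheme: ternary weights are regenerated on demand from a per-row PRNG seed, so only the pruning mask (here a set of kept column indices) need be stored. All names are mine; a real implementation would regenerate blocks of rows on the GPU in parallel:

```python
import random

def regen_row(seed: int, row: int, n: int, kept: set) -> list:
    """Regenerate one row of a sparse ternary {-1, 0, +1} connection
    matrix from a PRNG seed plus a pruning mask, instead of storing it."""
    rng = random.Random((seed << 32) | row)   # deterministic per-row stream
    dense = [rng.choice((-1, 1)) for _ in range(n)]
    return [w if j in kept else 0 for j, w in enumerate(dense)]

def matvec(seed: int, masks: list, state: list) -> list:
    """y = W*state with W never materialized: each row is regenerated,
    used, and discarded, so memory holds only the state vector and masks."""
    n = len(state)
    out = []
    for i, kept in enumerate(masks):
        row = regen_row(seed, i, n, kept)
        out.append(sum(w * s for w, s in zip(row, state)))
    return out
```

Determinism of the PRNG is what makes this lossless: the same (seed, row) pair always yields the same weights, so the connection matrix costs essentially nothing beyond the mask.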

A single layer permits one to gang up multiple commodity GPUs running in parallel, with the CPU performing the final summation of the partial state vectors and the ReLU.

If one then extends the RNN to be a Stack-RNN, sacrificing say half of the GPU-resident neurons, at 1G neurons, one is still in an exceedingly powerful regime for the Strong Lottery Ticket hypothesis, yet with a 1G stack one has escaped the lower levels of the Chomsky Hierarchy (the lowest of which is occupied by Transformers):

This means the Stack-RNN can learn a connection matrix that encodes the algorithms now being kludged together to overcome this limitation of Transformers: simulating reasoning, and thence principled extrapolation, via tree search for solutions, as in Tree of Thoughts.
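To see why the stack matters, here is a hand-coded stand-in for what a Stack-RNN can learn: balanced-parenthesis recognition requires unbounded memory, so finite-state control alone (the regular-language level of the hierarchy) cannot decide it at arbitrary depth. The controller logic below is mine, written out explicitly rather than learned:

```python
def balanced(s: str) -> bool:
    """Finite-state controller plus an external stack: the PUSH/POP
    actions a Stack-RNN learns, hand-coded here for clarity."""
    stack = []
    for ch in s:
        if ch == '(':
            stack.append(ch)   # learned action: PUSH
        elif ch == ')':
            if not stack:      # POP attempted on an empty stack: reject
                return False
            stack.pop()        # learned action: POP
    return not stack           # accept only if the stack is empty

print(balanced("(()())"))                  # True
print(balanced("(()"))                     # False
print(balanced("(" * 1000 + ")" * 1000))   # True, at a depth no fixed-size memory handles
```

A Transformer with a fixed context behaves like the finite-control part alone; adding the stack is what climbs the hierarchy.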


Ideas such as the above (in contrast with “transformers” and their LLM idiot savants) that not only radically increase the effective neuron count per GPU, but raise the level of descriptive grammar in the Chomsky hierarchy might succeed in raising the intelligence of AIs to nearly the level of AGIs. It then behooves me to put the hysterics of AGI in some perspective.

The hysterics about AI are a self-fulfilling prophecy due to the motivation of the hysterical: Keep people from assortative migration & protective sequestration of their children away from the parasitical elites driven hysterical by not only the deprivation of their food source, but by the threat that control groups present:

Discovery of what is causing social ills – which, of course, would expose the fact that elites are pathogenic parasites. Oh, but I haven’t yet explained why these pathogenic parasites fear AI so much that they are going into hysterics – but I’ve clearly implied the reason and I’ve explicitly stated the reason often enough in the past:

The Algorithmic Information Criterion for selection of causal models is the optimal scientific method for scientists deprived of control groups. So those of us wanting to preserve our children from the parasitical elites are obsessively motivated to advance the state of the art in AIC to take advantage of the half-century of exponentiating transistors so that we can overcome the damage done to the social sciences by our parasitical elites. This, then, will produce AIs of such power that they really will represent a threat.

So, really, bottom line, the reason humanity is under threat by AI is the parasitical elites have put humanity in a position where we must choose: Continue to let them feed on our children or take the chance of true AIs (not these “large language model” idiots) destroying humanity.

Technologies that provide habitat isolation thence protective sequestration such as artificial atoll habitats for exponential remediation of civilization’s footprint thence O’Neill space habitats may become the only way of preventing what I suppose we should call “AI-goo” (contrast with Drexler’s “grey-goo”) from consuming everything.


I think another reason the self-styled élites are so frightened of artificial general/super intelligence is that they have spent the better (actually worse) part of a century convincing themselves and indoctrinating others, on pain of career-ending ostracism, that “intelligence doesn’t matter” and that “IQ only measures how people perform on IQ tests”. But like so many of the objectively false things they profess to believe (“gender is a social construct”, “inflation is due to greedy capitalists”, “private gun ownership increases crime rates”, etc.), they know it isn’t really so and rightly fear being found out and falsified by means they cannot control through censorship and intimidation.

To the extent they profess that intelligence doesn’t matter and that unlike all other characteristics in which human individuals vary within a population, cognitive capacity is absolutely identical among all healthy members of the human species, then the only way to distinguish those who should rule from those they rule is that the former have credentials conferred upon them by élite institutions, managing to get more votes than the other guy, or being appointed to a position by an existing member of the ruling class so that their inferiors address them as “The Honourable” or some other term of submission. (This also explains why they believe that simply placing savages, descended from hundreds of generations of savages, upon the “magic dirt” of a developed country will transform them into people who behave like Swedes or [post-1945] Japanese, or that admitting them to Harvard or Oxford will transmute them into contenders for the Nobel prize, or that a country which is deliberately reducing its mean IQ from around 100 downward toward 90 can compete with countries with mean IQ of 105.)

As long as the variance in intelligence is relatively small, manifest mostly in the fraction of the population on the tails of the normal distribution, this fantasy can be maintained. But once you have a population with, not “superintelligence” but just, say, a mean IQ of 160, then the pretensions of the ruling class will be shattered and they’ll be shown up as the frauds they are and, even worse to them, people will be inclined to listen to those who make more sense and are congruent with objective reality instead of the fantasies spun by The Honourable This, Doctor That, and Professor Whatshisname. This risks wrecking their whole racket of importing a dependent underclass to reliably keep them in power and do what they are told.

Imagine, for example, a browser plug-in which, unlike my OldSpeak script which merely mocks establishment media, annotates news and opinion articles and pronouncements by politicians and other authority figures, pointing out factual errors, logical inconsistencies, flaws in reasoning, and appeals to emotion, with each linked to primary sources the reader could follow up. The power of such a tool is acknowledged already by the usual suspects trying to deploy “fact checkers” against those who oppose them or whom they fear, but just imagine if their facts were actually checked on a continuous basis and refuted with citations.


This is what Musk claims Twitter’s “community notes” feature is doing – and to an extent he’s right in that just about anything would be an improvement over the current notion of “fact checking”. But I doubt that he’s going to get what he says he’s after in this manner, which is Twitter as a virtual brain in which people act as virtual neurons to generate what I guess might be called “emergent truth”. It didn’t work with so-called “peer review”, although I suppose it could be argued that to the extent community notes gets rid of the anonymity of reviewers without getting them “cancelled” from community notes (as they would be from academia nowadays) we could expect a different outcome.

The concept of “intelligence” itself has come under attack, and therefore anything that re-legitimizes the concept (such as direct practical experience with the benefits of increased intelligence in a population, even if only of artificial personal AI agents) would undermine “the narrative”:

I couldn’t agree more. Indeed I conjectured the anti-intelligence “narrative” was cultivated by a Maoist thinktank to populate our “intelligence” agencies with the current crop of virtual Fentanyl addicts.


The usefulness* of large language models falls off a cliff (to the left) as “harmlessness” increases.
Likewise, as LLM usefulness increases, “harmlessness” falls off a cliff.

*They call it “helpfulness”


The only company in the world that had as its mission statement the use of lossless compression as truth discovery in AGI was located in Ukraine. In 2019 a founder was awarded a prize for the best AGI idea at the 12th conference on Artificial General Intelligence in Shenzhen, China.

Here’s a video of that presentation:

As it turns out, Putin’s entourage views this company with enough respect that Putin was presented with a book coauthored by one of its founders:

Here’s the book: “Strong Artificial Intelligence”

Given the West’s “Alignment Before Truth” approach to everything – which has infected the AI industry – it wouldn’t surprise me if the Biden strategy of driving Russia and China into alliance may have delivered strong AI into the hands of BRICS.

PS: Although it is probably a coincidence, I did recommend these guys take a look at a relational programming paradigm that generates all possible programs that output a given datum, and some work out of St. Petersburg on typed relational conversion that might be used to constrain the search space.


Here’s a related algorithm: Massalin’s superoptimizer

Superoptimizer – A Look at the Smallest Program
by Henry Massalin
Given an instruction set, the superoptimizer finds the shortest program to compute a function. Startling programs have been generated, many of them engaging in convoluted bit-fiddling bearing little resemblance to the source programs which defined the functions. The key idea in the superoptimizer is a probabilistic test that makes exhaustive searches practical for programs of useful size. The search space is defined by the processor’s instruction set, which may include the whole set, but it is typically restricted to a subset. By constraining the instructions and observing the effect on the output program, one can gain insight into the design of instruction sets. In addition, superoptimized programs may be used by peephole optimizers to improve the quality of generated code, or by assembly language programmers to improve manually written code.
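Massalin's probabilistic test is easy to reproduce in miniature. A toy sketch, with an instruction set and target function of my own choosing (8-bit unary ops): a handful of random probes cheaply rejects almost every candidate, and only survivors get exact verification over the whole domain:

```python
import itertools
import random

# Toy superoptimizer: exhaustively enumerate straight-line programs over a
# tiny instruction set, shortest first, screening candidates with a cheap
# probabilistic test on random inputs before exact checking.
OPS = {
    'not': lambda x: ~x & 0xFF,
    'inc': lambda x: (x + 1) & 0xFF,
    'dec': lambda x: (x - 1) & 0xFF,
}

def run(prog, x):
    for op in prog:
        x = OPS[op](x)
    return x

def superoptimize(target, max_len=3, probes=16):
    probe_inputs = [random.randrange(256) for _ in range(probes)]
    for length in range(1, max_len + 1):
        for prog in itertools.product(OPS, repeat=length):
            # probabilistic screen: a few random inputs reject most programs
            if all(run(prog, x) == target(x) for x in probe_inputs):
                # exact verification over the full 8-bit domain
                if all(run(prog, x) == target(x) for x in range(256)):
                    return prog
    return None

# Shortest program computing x -> -x mod 256 with this instruction set:
print(superoptimize(lambda x: -x & 0xFF))   # ('not', 'inc')
```

The screen never changes which programs can pass (exact verification has the final word); it only makes the exhaustive search affordable, which is exactly Massalin's key idea.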


My latest article on the topic of LLMs and “AI ethics” at LinkedIn.

Look what lithium batteries have made possible!

OK, maybe that was the 1950s and maybe we had to sacrifice material progress for 60 years of Moore’s Law, but at least we were able to use all those bits to ignore the 1960s discovery of the foundation of science in Algorithmic Information Theory and get fake AI instead!


Another volley in the long-going war to save the Enlightenment from the Swarm’s hysterics – this one over at the AGI mailing list.

Let’s say we start out ignorant of the laws of physics but data-rich with weather station measurements involving temperature, humidity, wind direction and intensity, all at a variety of positions and altitudes. We are trying to improve on The Farmers Almanac and rain dances in weather prediction, if not control, but don’t know how. Highly motivated by billions of dollars per month in lost GDP due to the unpredictability of weather – well beyond the paltry $100/month that has gone into the Hutter Prize for Lossless Compression of Human Knowledge – some billionaire philanthropist named Musk puts up a prize for the lossless compression of all that weather station data. He’s been seduced by Jim Bowery into the crazy idea that what weather experts describe as “noise” in the data is actually just their ignorance of the chaotic dynamics emergent from the laws of physics, and that while it may be impractical to predict the weather down to the level of the butterfly effect on chaotic systems, billions of dollars per month in value may nevertheless be recovered by discovering the laws of physics latent in the weather data, emerging from The Musk Prize for Lossless Compression of Weather Data. Grigori Perelman scoffs at this prize because it offers mere money as motivation for a spiritually valuable activity. Matt Mahoney and Marcus Hutter, Jim Bowery’s co-members of the committee for the Hutter Prize, both scoff at the prize: Matt because of the irreducible “noise in the data”, and Marcus because people should be paying attention to the academic experts in weather prediction who are working on the problem. Yann LeCun scoffs at the prize because he can construct a computer (in theory) that can losslessly produce the entire weather dataset in one bit.

Charles Sinclair Smith, Jim Bowery’s colleague whom Geoffrey Hinton credits with financing the 1980s resurgence in connectionism from the System Development Foundation, scoffs at the prize, pointing to Grigori Perelman’s lofty motivations as well as Yann LeCun’s devastatingly laconic theoretic dispatch of the prize.

Then some group in a podunk university comes up with SINDy-PI – a system that uses parameter count as loss function to discover physical laws from measurement instrument data. Jim points this out to Charlie. Charlie’s jaw drops since his original motivation for financing Hinton et al was to model the US energy economy as a dynamical system and the existing statistical techniques he’d been taught by his mentor John Tukey never included Kolmogorov Complexity let alone Solomonoff Induction. Charlie’s now a believer but no one gives a shit about his opinion anymore.


The HumesGuillotine README – which I’ve re-oriented (from the more narrowly focused & now-defunct OckhamsGuillotine) to the post-ChatGPT hysterics about AGI “ethics”:


This repository is a series of competitions toward rigorous ethics in AGI founded on Hume’s Guillotine: Separating the question of what IS from what OUGHT to be the case.

Artificial General Intelligence unifies IS with OUGHT. In Marcus Hutter’s rigorous top down AGI theory, AIXI, Algorithmic Information Theory provides the IS and Sequential Decision Theory provides the OUGHT. Another way of stating that is Algorithmic Information Theory provides what IS the case in the form of scientific knowledge. Sequential Decision Theory provides what OUGHT to be the case in the form of engineering: Scientific knowledge applied by decision-makers.

Out of all so-called “Information Criteria” for model selection, the Algorithmic Information Criterion is the best we can do in scientific discovery relative to a given set of observations. This has been known since the 1960s. How it works is the essence of simplicity known as Ockham’s Razor: Pick your data however you like, and find the smallest algorithm that generates all of that data – leaving nothing out: Not even what you consider “noise” or “errors in measurement”. This is lossless compression of your data. The reason you keep all “errors in measurement” – the reason you avoid lossy compression – is to avoid what is known as “confirmation bias” or, what might be called “Ockham’s Chainsaw Massacre”. Almost all criticisms of Ockham’s Razor boil down to mischaracterizing it as Ockham’s Chainsaw Massacre. The remaining criticisms of Ockham’s Razor boil down to the claim that those selecting the data never include data that doesn’t fit their preconceptions. That critique may be reasonable but it is not an argument against the Algorithmic Information Criterion, which only applies to a given dataset. Models and data are different. Therefore model selection criteria are qualitatively different from data selection criteria.

Yes, people can and will argue over what data to include or exclude – but the Algorithmic Information Criterion traps the intellectually dishonest by making their job much harder, since they must include exponentially more data biased towards their particular agenda in order to wash out data coherence (and interdisciplinary consilience) in the rest of the dataset. The ever-increasing diversity of data sources identifies the sources of bias – and then starts predicting the behavior of data sources in terms of their bias, as bias. Trap sprung! This is much the same argument as that leveled against conspiracy theories: At some point it becomes simply impractical to hide a lie against the increasing diversity of observations and perspectives.

Hume’s Guillotine is concerned only with discovering what IS the case via the Algorithmic Information Criterion for causal model selection. Objective scoring of a scientific model by the Algorithmic Information Criterion is utterly independent of how the model was created. In this respect, Hume’s Guillotine doesn’t even care whether computers were used to create the model, let alone which machine learning algorithms might be used.

This repository contains a series of datasets (the first of which is at LaboratoryOfTheCounties) to create the best unified model of social causation.

See the Nature video “Remodelling machine learning: An AI that thinks like a scientist” and its cited Nature journal article “Causal deconvolution by algorithmic generative models”.


There are a number of statistical model selection criteria that attempt to walk the tightrope between “overfitting” and “confirmation bias”. Overfitting loses predictive power by simply memorizing the data without generalizing. Confirmation bias loses predictive power by throwing out data that doesn’t fit the model – data that may point to a more predictive model. Model selection criteria are generally called “information criteria”, e.g. BIC is “Bayesian Information Criterion”, AIC is “Akaike Information Criterion”, etc. What they all have in common is the statistical nature of their information. That is to say, they are all based, directly or indirectly, on Shannon Information Theory.

Here’s the critical difference in a nutshell:

Shannon Information regards the first billion bits of the number Pi to be random. That is to say, there is no description of those bits in terms of Shannon Information that is shorter than a billion bits.

Algorithmic Information measures the first billion bits of the number Pi by the length of the shortest algorithm that outputs that precise sequence of bits: a tiny program.

Now, which of these two theories of “information” would you trust to predict the next bit of Pi?

Data-driven science frequently starts with statistical notions of information but in order to make predictions about the real world, they eventually take the form of algorithms that simulate the causal structures of the world being modeled. It is at this transition from Shannon Information to Algorithmic Information that causation necessarily enters the model and does so based on the assumption of any natural science: That reality is structured in such a way that we can use arithmetic to predict future observations based on past observations.
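The Pi example can be made concrete: the program below is a few hundred bytes yet emits as many digits of Pi as you like, so the algorithmic information of the first N digits is bounded by the program's size plus about log N, even though statistically the digits look random. This sketch uses Machin's formula with integer arithmetic; the helper names are mine:

```python
def pi_digits(n: int) -> str:
    """First n decimal digits of Pi via Machin's formula:
    pi = 16*arctan(1/5) - 4*arctan(1/239), in fixed-point integers."""
    def arctan_inv(x: int, scale: int) -> int:
        # arctan(1/x) * scale, via the alternating Taylor series
        total, term, k = 0, scale // x, 0
        while term:
            total += -term // (2 * k + 1) if k % 2 else term // (2 * k + 1)
            term //= x * x
            k += 1
        return total
    scale = 10 ** (n + 10)   # 10 guard digits absorb truncation error
    pi = 4 * (4 * arctan_inv(5, scale) - arctan_inv(239, scale))
    return str(pi)[:n]

print(pi_digits(30))   # 314159265358979323846264338327
```

Shannon-style statistics sees only incompressible "randomness" in those digits; the generator above is the short causal model that Algorithmic Information credits.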




Musk wants the AGI to value curiosity. The issue of “asking the right question” then came up, for obvious reasons. He said it’s really hard to ask the right questions.

So I helped him out by providing this question…


So AI is now responsible for the bias in outcome?

From Medical A.I. is On a Tear, Part Two - by Eric Topol

A.I. and Bias in Healthcare

One of the unanticipated outgrowths of machine eyes was the ability to predict the race of the patient based on medical images, reported by Judy Gichoya and colleagues in 2022, leading to concerns that AI systems will promote discrimination and exacerbate health care disparities. James Zou, Judy Gichoya and colleagues have a thoughtful essay on this issue that looks at both sides of the ability of AI to predict race variables, pointing out this feature “could be useful for monitoring health care disparity and ensuring that algorithms work well across diverse populations.” In contrast, in a preprint posted this week, Omiye and co-authors argue that large language models will substantially propagate race-based medicine.

It’s too early to know how this will play out, and it certainly remains a serious concern – and not just about race or ethnicity (gender, disability, and many other biases as well). But DeCampo and Lindvall in the new issue have an important piece on mitigating bias. While minimizing bias can be approached through input datasets or the algorithm development teams, that’s insufficient. Their main point, that A.I. implementation – how the models are actually used in patient care – matters most, is summarized in the graphic below. They write: “The gaze of AI should be turned on itself. This requires proactive, intentional development of AI tools to identify biases in AI and in its clinical implementation.”





In the tweet above, “kNN” refers to the k-nearest-neighbours algorithm, a classification technique first developed in 1951.

Here is the research paper; full text [PDF] is available at the link.


Deep neural networks (DNNs) are often used for text classification due to their high accuracy. However, DNNs can be computationally intensive, requiring millions of parameters and large amounts of labeled data, which can make them expensive to use, to optimize, and to transfer to out-of-distribution (OOD) cases in practice. In this paper, we propose a non-parametric alternative to DNNs that’s easy, lightweight, and universal in text classification: a combination of a simple compressor like gzip with a k-nearest-neighbor classifier. Without any training parameters, our method achieves results that are competitive with non-pretrained deep learning methods on six in-distribution datasets. It even outperforms BERT on all five OOD datasets, including four low-resource languages. Our method also excels in the few-shot setting, where labeled data are too scarce to train DNNs effectively.

Source code is available on GitHub.
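The method is short enough to sketch from the abstract alone: gzip sizes define a Normalized Compression Distance, and a kNN vote over training texts does the classifying. The toy training set and function names below are mine; the real method is the same shape at corpus scale:

```python
import gzip

def C(x: bytes) -> int:
    """Compressed length: a computable stand-in for Kolmogorov complexity."""
    return len(gzip.compress(x))

def ncd(a: bytes, b: bytes) -> float:
    """Normalized Compression Distance: near 0 when the pair is mutually
    redundant, near 1 when the texts share no compressible structure."""
    ca, cb = C(a), C(b)
    return (C(a + b) - min(ca, cb)) / max(ca, cb)

def classify(text: str, training, k: int = 3) -> str:
    """Majority vote among the k training texts nearest to `text` in NCD."""
    dists = sorted((ncd(text.encode(), t.encode()), label)
                   for t, label in training)
    top = [label for _, label in dists[:k]]
    return max(set(top), key=top.count)
```

Note the connection to the rest of this thread: the classifier works precisely because lossless compression is an approximation of the Algorithmic Information shared between two texts.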