How To Nuke Large Language Model Bias?

From what I’ve been seeing of the Tesla machine learning culture, they’ll be dragged kicking and screaming via RNNs across the Turing-complete finish line for language models. There is, however, hope for this, because people do tend to kick and scream when their loved ones die in traffic accidents.

PS: I should probably reiterate my request that people think of video compression as creating 3D models of objects in the environment seen by the camera such that GPUs can do photorealistic renderings of those objects as the first-order approximation of the video reconstruction. Seen in this way, it should be obvious that Tesla will also end up with a breakthrough in the streaming video world.


Critical Thinking: The process of approximating the Kolmogorov-complexity program of one’s observations, i.e., the shortest algorithm that outputs them.

Understanding the practicalities entailed by “approximating” is where real information theory meets the natural (observational/empirical) sciences.
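One hedged way to make “approximating” concrete: any lossless compressor gives a computable upper bound on the Kolmogorov complexity of one’s observations, so better compression serves as a proxy for a better model. A minimal Python sketch (zlib here is just a stand-in for a serious model-based compressor):

```python
import os
import zlib

def complexity_upper_bound(data: bytes) -> int:
    """Compressed length is a computable upper bound on the
    (uncomputable) Kolmogorov complexity of `data`."""
    return len(zlib.compress(data, level=9))

structured = b"ab" * 5000        # observations with a short "law"
random_ish = os.urandom(10_000)  # observations with (almost) none

low = complexity_upper_bound(structured)   # tiny: the regularity was found
high = complexity_upper_bound(random_ish)  # ~10 KB: nothing to find
```

The gap between `low` and `high` is exactly the sense in which “approximating” is practical: we can never compute the true minimum, but we can always compare candidate models by how far they shrink the data.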


Sounds like a Stack-RNN may be the next step for DeepMind given the prominent mention in the recent and aforelinked Princeton/DeepMind paper “Neural Networks and the Chomsky Hierarchy”. However, since there are no authors in common between the two papers, it may require overcoming some of the Big Org problems that have plagued Alphabet’s ability to execute on its in-house talent.

Recurrent Residual Networks Contain Stronger Lottery Tickets

ABSTRACT Accurate neural networks can be found merely by pruning a randomly initialized overparameterized model, eliminating the need for any weight optimization. The resulting subnetworks are small, sparse, and ternary, making them excellent candidates for efficient hardware implementation. However, finding optimal connectivity patterns is an open challenge. Based on evidence that residual networks may be approximating unrolled shallow recurrent neural networks, we conjecture that they contain better candidate subnetworks at inference time when explicitly transformed into recurrent architectures. This hypothesis is put to the test on image classification tasks, where we find subnetworks within the recurrent models that are more accurate and parameter-efficient than both those found within feedforward models and the full models with learned weights. Furthermore, random recurrent subnetworks are tiny: under a simple compression scheme, ResNet-50 is compressed to a 48.55× smaller memory footprint without a drastic loss in performance, fitting in under 2 megabytes. Code available at: GitHub - Lopez-Angel/hidden-fold-networks.
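A toy numpy sketch of the representation the abstract describes: magnitude-pruning a random matrix down to a sparse ternary mask plus one shared scale. (The paper’s actual method learns which weights to keep via a score-based search; this sketch substitutes naive magnitude thresholding purely to illustrate the sparse-ternary format.)

```python
import numpy as np

def ternary_subnetwork(w: np.ndarray, keep: float):
    """Keep the largest-magnitude fraction `keep` of a *random* weight
    matrix, collapsing survivors to {-1, +1} times one shared scale.
    Illustrates the sparse-ternary representation only; the paper
    learns the mask rather than thresholding by magnitude."""
    k = int(w.size * keep)
    thresh = np.partition(np.abs(w).ravel(), -k)[-k]
    mask = np.abs(w) >= thresh
    scale = np.abs(w[mask]).mean()      # one shared magnitude
    return np.sign(w) * mask, scale

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256))
ternary, scale = ternary_subnetwork(w, keep=0.1)
```

Storing `ternary` needs only ~1.6 bits per weight (a sparse {-1, 0, +1} code) plus one float for `scale`, which is where the hardware-efficiency claim comes from.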

What I find particularly intriguing about this general topic is that if one combines the notion of RNNs with not only the Strong Lottery Ticket hypothesis but sparse ternary connection matrices, it is conceivable that one may use commodity GPUs to hold only* the RNN’s state vector. Given a float32 state vector, that means a 2G-neuron RNN on an 8GB GPU. The scaling laws of the Strong Lottery Ticket hypothesis mean the training time needed to achieve exceedingly high-performance RNNs may be drastically reduced while, ultimately, pruning not only the connection matrices further but also the neurons themselves.

*The math on the PCI->GPU transfer rate relative to the GPU memory->memory transfer rate via its cores isn’t nearly as bad as some naive analyses may suggest, especially with very large state vectors (numbers of neurons) in a single layer. Further optimization can be achieved by encoding the sparse connection matrix with a pseudorandom number generator, reducing its storage to just a main-memory-resident pruning mask over an already sparse connection matrix that, itself, takes virtually no memory.
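The back-of-envelope arithmetic and the PRNG trick can be sketched as follows (the `fan_in` parameter, the seeding scheme, and the sign coding are my illustrative assumptions, not a worked-out design):

```python
import numpy as np

# The post's arithmetic: a float32 state vector costs 4 bytes/neuron,
# so an 8 GB GPU can hold an 8*2**30 / 4 = 2*2**30 ("2G") neuron state.
GPU_BYTES = 8 * 2**30
N_NEURONS = GPU_BYTES // 4

def row_connections(seed: int, row: int, n: int, fan_in: int):
    """Regenerate a row's sparse ternary connections from a PRNG seed,
    so the connection matrix is never stored: only a pruning mask over
    these regenerable candidate edges need live in main memory."""
    rng = np.random.default_rng([seed, row])
    cols = rng.choice(n, size=fan_in, replace=False)
    signs = rng.integers(0, 2, size=fan_in) * 2 - 1   # -1 or +1
    return cols, signs

# The same (seed, row) always regenerates the same edges.
c1, s1 = row_connections(seed=42, row=7, n=10_000, fan_in=64)
c2, s2 = row_connections(seed=42, row=7, n=10_000, fan_in=64)
```

Because the candidate edges are a deterministic function of `(seed, row)`, the only state that must cross the PCI bus per step is the state vector itself plus the (bit-packed) pruning mask.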

A single layer permits one to gang up multiple commodity GPUs running in parallel, with the final summation of the partial state vectors, and the ReLU, performed by the CPU.
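A numpy stand-in for that scheme, column-partitioning the connection matrix across simulated devices (real multi-GPU code would manage per-device transfers, but the reduction has the same shape):

```python
import numpy as np

def ganged_step(w_parts, x_parts, b):
    """One recurrent step with the connection matrix column-partitioned
    across 'devices': each part computes a partial matvec over its slice
    of the state vector; the 'CPU' sums the partials and applies ReLU."""
    partials = [w @ xp for w, xp in zip(w_parts, x_parts)]
    return np.maximum(0.0, sum(partials) + b)

rng = np.random.default_rng(1)
n, devices = 512, 4
w = rng.standard_normal((n, n))
x = rng.standard_normal(n)
b = np.zeros(n)

# Each simulated device owns n/devices columns of w and the matching
# slice of x; the result equals the monolithic single-device step.
y = ganged_step(np.split(w, devices, axis=1), np.split(x, devices), b)
```

Column partitioning is the natural choice here because each device needs only its slice of the state vector as input, while the output-side reduction is a single elementwise sum the CPU can afford.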

If one then extends the RNN to be a Stack-RNN, sacrificing, say, half of the GPU-resident neurons, then at 1G neurons one is still in an exceedingly powerful regime for the Strong Lottery Ticket hypothesis, yet with a 1G-deep stack one has escaped the lower levels of the Chomsky Hierarchy (the lowest of which is occupied by Transformers).

This means the Stack-RNN can learn a connection matrix that encodes the algorithms now being kludged together to overcome this limitation of Transformers: simulating reasoning, thence principled extrapolation, via tree search for solutions such as Tree of Thoughts.
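For concreteness, here is the soft stack update at the heart of a Stack-RNN, in the style of Joulin & Mikolov’s differentiable stack (the controller network that produces `top_new` and the action logits is omitted; scalar cells are an illustrative simplification):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def stack_step(stack, top_new, action_logits):
    """Soft PUSH/POP/NO-OP update of a differentiable stack.
    `stack` holds one scalar per cell; the three actions are mixed
    continuously so the whole step stays differentiable."""
    a_push, a_pop, a_noop = softmax(action_logits)
    pushed = np.concatenate(([top_new], stack[:-1]))  # shift down
    popped = np.concatenate((stack[1:], [0.0]))       # shift up
    return a_push * pushed + a_pop * popped + a_noop * stack

stack = np.array([1.0, 2.0, 3.0])
# Logits strongly favoring PUSH: the new value dominates the top cell.
out = stack_step(stack, top_new=5.0,
                 action_logits=np.array([10.0, 0.0, 0.0]))
```

It is this external, unboundedly deep memory that lifts the model above the finite-context regime; the recurrent controller itself stays fixed-size.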


Ideas such as the above (in contrast with “transformers” and their LLM idiot savants) that not only radically increase the effective neuron count per GPU, but raise the level of descriptive grammar in the Chomsky hierarchy might succeed in raising the intelligence of AIs to nearly the level of AGIs. It then behooves me to put the hysterics of AGI in some perspective.

The hysterics about AI are a self-fulfilling prophecy due to the motivation of the hysterical: Keep people from assortative migration & protective sequestration of their children away from the parasitical elites driven hysterical by not only the deprivation of their food source, but by the threat that control groups present:

Discovery of what is causing social ills – which, of course, would expose the fact that elites are pathogenic parasites. Oh, but I haven’t yet explained why these pathogenic parasites fear AI so much that they are going into hysterics – but I’ve clearly implied the reason and I’ve explicitly stated the reason often enough in the past:

The Algorithmic Information Criterion for selection of causal models is the optimal scientific method for scientists deprived of control groups. So those of us wanting to preserve our children from the parasitical elites are obsessively motivated to advance the state of the art in AIC to take advantage of the half-century of exponentiating transistors so that we can overcome the damage done to the social sciences by our parasitical elites. This, then, will produce AIs of such power that they really will represent a threat.

So, really, bottom line, the reason humanity is under threat by AI is the parasitical elites have put humanity in a position where we must choose: Continue to let them feed on our children or take the chance of true AIs (not these “large language model” idiots) destroying humanity.

Technologies that provide habitat isolation thence protective sequestration such as artificial atoll habitats for exponential remediation of civilization’s footprint thence O’Neill space habitats may become the only way of preventing what I suppose we should call “AI-goo” (contrast with Drexler’s “grey-goo”) from consuming everything.


I think another reason the self-styled élites are so frightened of artificial general/super intelligence is that they have spent the better (actually worse) part of a century convincing themselves and indoctrinating others, on pain of career-ending ostracism, that “intelligence doesn’t matter” and that “IQ only measures how people perform on IQ tests”. But like so many of the objectively false things they profess to believe (“gender is a social construct”, “inflation is due to greedy capitalists”, “private gun ownership increases crime rates”, etc.), they know it isn’t really so and rightly fear being found out and falsified by means they cannot control through censorship and intimidation.

To the extent they profess that intelligence doesn’t matter and that unlike all other characteristics in which human individuals vary within a population, cognitive capacity is absolutely identical among all healthy members of the human species, then the only way to distinguish those who should rule from those they rule is that the former have credentials conferred upon them by élite institutions, managing to get more votes than the other guy, or being appointed to a position by an existing member of the ruling class so that their inferiors address them as “The Honourable” or some other term of submission. (This also explains why they believe that simply placing savages, descended from hundreds of generations of savages, upon the “magic dirt” of a developed country will transform them into people who behave like Swedes or [post-1945] Japanese, or that admitting them to Harvard or Oxford will transmute them into contenders for the Nobel prize, or that a country which is deliberately reducing its mean IQ from around 100 downward toward 90 can compete with countries with mean IQ of 105.)

As long as the variance in intelligence is relatively small, manifest mostly in the fraction of the population on the tails of the normal distribution, this fantasy can be maintained. But once you have a population with, not “superintelligence” but just, say, a mean IQ of 160, then the pretensions of the ruling class will be shattered and they’ll be shown up as the frauds they are and, even worse to them, people will be inclined to listen to those who make more sense and are congruent with objective reality instead of the fantasies spun by The Honourable This, Doctor That, and Professor Whatshisname. This risks wrecking their whole racket of importing a dependent underclass to reliably keep them in power and do what they are told.

Imagine, for example, a browser plug-in which, unlike my OldSpeak script which merely mocks establishment media, annotates news and opinion articles and pronouncements by politicians and other authority figures, pointing out factual errors, logical inconsistencies, flaws in reasoning, and appeals to emotion, with each linked to primary sources the reader could follow up on. The power of such a tool is acknowledged already by the usual suspects trying to deploy “fact checkers” against those who oppose them or whom they fear, but just imagine if their facts were actually checked on a continuous basis and refuted with citations.


This is what Musk claims Twitter’s “community notes” feature is doing – and to an extent he’s right, in that just about anything would be an improvement over the current notion of “fact checking”. But I doubt that he’s going to get what he says he’s after in this manner, which is Twitter as a virtual brain in which people act as virtual neurons to generate what I guess might be called “emergent truth”. It didn’t work with so-called “peer review”, although I suppose it could be argued that, to the extent community notes removes the anonymity of reviewers without getting them “cancelled” from community notes (as they would be from academia nowadays), we could expect a different outcome.

There is also the fact that the concept of “intelligence” itself has come under attack, and that therefore anything that re-legitimizes the concept (such as direct practical experience with the benefits of increased intelligence in a population, even if only of artificial personal AI agents) would undermine “the narrative”:

I couldn’t agree more. Indeed, I conjectured the anti-intelligence “narrative” was cultivated by a Maoist think tank to populate our “intelligence” agencies with the current crop of virtual fentanyl addicts.


The usefulness* of large language models falls off a cliff as “harmlessness” increases.
Likewise, as LLM usefulness increases, “harmlessness” falls off a cliff.

*They call it “helpfulness”


The only company in the world that had as its mission statement the use of lossless compression as truth discovery in AGI was located in Ukraine. In 2019 a founder was awarded a prize for the best AGI idea at the 12th conference on Artificial General Intelligence in Shenzhen, China.

Here’s a video of that presentation:

As it turns out, Putin’s entourage views this company with enough respect that he was presented with a book coauthored by one of its founders:

Here’s the book: “Strong Artificial Intelligence”

Given the West’s “Alignment Before Truth” approach to everything – which has infected the AI industry – it wouldn’t surprise me if the Biden strategy of driving Russia and China into alliance may have delivered strong AI into the hands of BRICS.

PS: Although it is probably a coincidence, I did recommend these guys take a look at a relational programming paradigm that generates all possible programs that output a given datum, and some work out of St. Petersburg on typed relational conversion that might be used to constrain the search space.


Here’s a related algorithm: Massalin’s superoptimizer

Superoptimizer – A Look at the Smallest Program
by Henry Massalin

Given an instruction set, the superoptimizer finds the shortest program to compute a function. Startling programs have been generated, many of them engaging in convoluted bit-fiddling bearing little resemblance to the source programs which defined the functions. The key idea in the superoptimizer is a probabilistic test that makes exhaustive searches practical for programs of useful size. The search space is defined by the processor’s instruction set, which may include the whole set, but it is typically restricted to a subset. By constraining the instructions and observing the effect on the output program, one can gain insight into the design of instruction sets. In addition, superoptimized programs may be used by peephole optimizers to improve the quality of generated code, or by assembly language programmers to improve manually written code.
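A toy version of the idea, assuming a made-up three-op accumulator “ISA” (Massalin searched subsets of a real 68020 instruction set): enumerate programs exhaustively by length and accept the first one that survives a probabilistic test on random inputs.

```python
import random
from itertools import product

# A made-up single-accumulator instruction set (illustrative only).
OPS = {
    "inc": lambda a: a + 1,
    "dbl": lambda a: a * 2,
    "neg": lambda a: -a,
}

def run(prog, x):
    for op in prog:
        x = OPS[op](x)
    return x

def superoptimize(target, max_len=4, trials=32):
    """Shortest op sequence computing `target`, via Massalin's recipe:
    exhaustive search by length, with a cheap probabilistic test on
    random inputs standing in for a full equivalence proof."""
    random.seed(0)
    tests = [random.randrange(-1000, 1000) for _ in range(trials)]
    for length in range(1, max_len + 1):
        for prog in product(OPS, repeat=length):
            if all(run(prog, x) == target(x) for x in tests):
                return prog
    return None

prog = superoptimize(lambda x: 4 * x + 2)   # finds ('dbl', 'inc', 'dbl')
```

The probabilistic test is what makes the search tractable: a few random inputs reject almost all wrong candidates immediately, so the expensive full-equivalence check (omitted here) only ever needs to run on the rare survivors.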
