How To Nuke Large Language Model Bias?

Wolfram conducts a really good interview with Terry Sejnowski who illustrates the point I’ve been making about the failure of the ML industry regarding recurrence and dynamical systems – and in particular the failure of the large language models like ChatGPT.

At one point in the interview Sejnowski asserts quite correctly that the field should have paid a lot more attention to dynamical systems experts!

I was most excited to see someone of his reputation finally say the Emperor Has No Clothes and do so in a way that was almost a confession of prior negligence.

But then Sejnowski answers Wolfram’s question about ChatGPT with regard to dynamical systems by asserting that transformers, and hence ChatGPT, embody a dynamical system.

Sejnowski is not referring to the Transformer model itself as being recurrent, but rather to the fact that the vector output by the Transformer model is used to refine the conversation’s context for further dialogue. He then goes on to say that this is the kind of “recurrence” the human brain does. BUT that elides the internal dynamics of the human brain – dynamics that entail the causal chains of reasoning we call “thought”. So his answer was specious. Worse, when I brought up the issue, someone in the chat answered me, incorrectly, that the Transformer uses RNNs – and this was a guy who had actually worked with some of the luminaries in the field that influenced Marcus Hutter’s PhD students who founded DeepMind – but not Marcus Hutter himself.
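To make the distinction concrete, here is a minimal sketch (function names are mine, purely illustrative) of the two kinds of “recurrence” at issue:

```python
# Illustrative sketch only: in a ChatGPT-style loop, the sole "recurrence"
# is the outer loop appending each generated token back onto the context;
# the transformer pass itself is feedforward and carries no internal state.

def transformer_generate(transformer, context, n_tokens):
    for _ in range(n_tokens):
        token = transformer(context)   # stateless pass over the whole context
        context = context + [token]    # the "dynamics" live only in the prompt
    return context

# An RNN, by contrast, threads a hidden state through time -- the internal
# dynamics that Sejnowski's answer elides.
def rnn_generate(rnn_step, hidden, token, n_tokens):
    out = [token]
    for _ in range(n_tokens):
        hidden, token = rnn_step(hidden, token)
        out.append(token)
    return out
```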

I did appreciate that he debunked the notion that “Hebbian learning” is merely “co-occurrence” of neuron firings – it is actually close-in-time occurrence, with one firing leading the other, the order providing the direction of neuron-connection growth. Moreover, he also pointed out that this is the origin of “causal” reasoning at the very lowest level of learning! Good for him!
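That close-in-time, direction-giving rule is what neuroscientists call spike-timing-dependent plasticity (STDP). A minimal sketch of the idea (parameter values are arbitrary illustration, not anything Sejnowski endorsed):

```python
import math

def stdp_delta_w(t_pre, t_post, a_plus=0.01, a_minus=0.012, tau=20.0):
    """Spike-timing-dependent plasticity, schematically: if the presynaptic
    neuron fires shortly BEFORE the postsynaptic one (dt > 0), the synapse
    strengthens; if it fires AFTER (dt < 0), it weakens. The sign of dt --
    which firing led which -- is the lowest-level seed of causal direction."""
    dt = t_post - t_pre
    if dt > 0:
        return a_plus * math.exp(-dt / tau)    # pre led post: potentiate
    return -a_minus * math.exp(dt / tau)       # post led pre: depress
```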

2 Likes

For 17 years I’ve been telling people increased funding for the Hutter Prize is the way to fix bias in AI. Now POTUS exec-orders this.

“root out bias in the design and use of new technologies, such as artificial intelligence”

Time’s running out.

I strongly suggest contacting Marcus at his email address to discuss how you can support the Hutter Prize.

For my own part, I’m taking $100/month out of my income stream – a stream that has been annihilated by my political-pariah status and 20 years of caring for a wife degenerating from Huntington’s disease (HD) – to steadily increase the Hutter Prize.

This is the vault where I’m putting $100/month of BTC to increase the next Hutter Prize payout.

bitcoin:BC1QGWJ2GYVA5GZ2J7E6095CJJDA9QQJ2SJQYKDY73

PS: Marcus suggested that I keep a vault for my monthly donations to the Hutter Prize under my control, and I chose a BTC address for this purpose. The reason I am violating the so-called “best practices” of BTC by “reusing” the address is, first of all, that I want to be transparent, and second of all, that people shouldn’t be using BTC if they want anonymity!!! Come on, folks – all this nonsense about avoiding BTC address reuse just strikes me as a trap for the unwary, giving a false sense of security while using BTC. If people are serious about protecting privacy they need to use other solutions, such as Monero.

---------- Forwarded message ---------
From: James Bowery jabowery@gmail.com
Date: Thu, Feb 9, 2023 at 6:40 PM
Subject: Sending Donations For the Next Prize Award?
To: Marcus Hutter marcus.hutter@anu.edu.au
Cc: Matt Mahoney mattmahoneyfl@gmail.com

Marcus,

Given all the hype about large language models, and the growing suspicion about “bias”, I suspect now would be a pretty good time to start accepting monthly donations to gradually increase the size of the next HP payout. There may be quite a number of folks who get that the HP fills an important gap. PayPal has facilities for this kind of thing, as do Patreon and others.

A funding thermometer on the HP page, showing the current level of the prize payout, may help to attract not only contestants but donors as well.

For my own part, I’d be willing to send $100/month toward this end without any requirement that I be credited. I just want to see more attention paid to the value of algorithmic information approximation in drawing a distinction between the “is” and the “ought” notions of “bias”, as well as debunking the specious notion that “bigger is better” in ML.

– Jim

4 Likes

Have you considered making a video presentation targeting potential investors who may not have a background in computer science, explaining why you think LLM bias poses an existential threat and how the Hutter Prize is the solution?

6 Likes

Yes. The big problem I have is what educators call “placement” – not just in understanding but also in motivation of potential “investors” – although “philanthropists” may be the more proper word here.

I’ve gotten so little feedback it’s hard to figure out where to start. It’s like I’m out in the desert living on locusts and honey:

REPENT!
Take Ockham’s Razor and The Information Age SERIOUSLY!
REPENT!

If I could read minds – imputation of variables latent in the data and all that causal inference rot – I can only guess they’re silently thinking:

“Don’t be insulting. Of course I take Ockham’s Razor seriously. Of course I take Moore’s Law seriously. Of course I take the data flood seriously. The entire human race takes ALL that as seriously as a heart attack! Where have you been? And why are you wearing that lion skin? Could you just dump the lion skin and take a shower?”

Meanwhile…

5 Likes

The very day after I started using BTC to fight back against Biden’s AI “bias” Executive Order, I get this bizarre “recommendation” from YouTube:

If there is any group of people who should have been using Algorithmic Information for macrosocial model selection, it’s the Intelligence Community.

All those destination-“Ft. Meade” supercomputers lined up downstairs at CDC Arden Hills back in the late 70s come to mind.

This looks like the same scammers who have been running crypto cons dressed up as SpaceX announcements, often using the SpaceX logo and names like “Space X” or “SpaceX CEO”. The video content is months-old live streams featuring Elon Musk (one of which was the same stream shown in your tweet with Jack Dorsey and Elon), wrapped with the scam payload – which, when I checked them out, was a “Send BTC or ETH here to double your money” pitch alongside a fake chat window of people saying how well it worked for them. When you check the domain name, it was always registered a few days before the scam video appeared. For the OPENAI-X2.INFO in the one you posted:

Creation Date: 2023-02-16T12:11:34Z
Registry Expiry Date: 2024-02-16T12:11:34Z

The registrant ID is always “REDACTED FOR PRIVACY”.

I discussed these scams in a post here on 2022-02-22, ‘“SpaceX” Scammers—YouTube Just Doesn’t Care’.

4 Likes

Thanks. I’d never run across YouTube’s de facto participation in this kind of scam before. I’d seen stuff like that on Twitter, but YouTube had never bothered me with such recommendations before. It’s probably just that my having mentioned BTC in various places, combined with YouTube’s fight against fake news, disinformation, and grifters, led to the “recommendation”.

4 Likes

And how not to nuke the bias, but grow it strategically:

6 Likes

How much power have conservative organizations lost due merely to the threat that a weaponized IRS will engage in lawUNfare against them, as it clearly did under Obama? If you write off any such donation, you could end up in prison – and don’t tell me you’ll be able to pay.

The worst are full of passionate intensity…

One of the advantages of the Hutter Prize is that the “leftists” can hardly claim that Wikipedia is that “biased”, nor can they claim that payouts based on lossless compression are subject to “bias”.

It may be one of the few places where a “conservative” agenda could operate under 501(c)(3) tax exemption. The leftists would have to not only become aware of the fact that lossless compression is the gold standard for truth discovery, but become self-aware that they are hostile to truth.
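The “gold standard” here is just the two-part code of the Minimum Description Length principle: pick the model $M$ minimizing

$$\ell(M) + \ell(D \mid M),$$

where $\ell(M)$ is the length of the losslessly encoded model and $\ell(D \mid M)$ is the length of the data $D$ encoded with that model’s help. A “biased” model either pays for its bias in model bits or fails to compress the data – there is no third option for anyone to fudge.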

6 Likes

NextBigFuture has an article on “scaling laws” for language models, with a recommendation on where to put funding, and I responded.

3 Likes

One of the 4 or 5 stock pedantic objections to using the algorithmic information criterion for model selection is that the choice of UTM in defining algorithmic information is arbitrary. This has always seemed obstructionist to me, and not just for the obvious reason that the additive constant of an additional UTM emulator is small compared to just about any reasonably sized dataset (i.e., 1GB of Wikipedia).

It is pedantic because there are obvious notions of UTM complexity going back at least to Shannon. Pedants will continue to hammer on this by objecting that the language used to describe the UTM determines the length of the description of the UTM. So shortly after Hutter announced his prize and people started in on this line of obstruction over at LessWrong, I came up with the idea of a UTM that emulates itself – the length of the self-emulator being the Algorithmic Information closure of the entire system.
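For the record, the “small additive constant” is the standard invariance theorem: for any two universal machines $U$ and $V$ there is a constant $c_{UV}$ – the length of the shortest $U$-program emulating $V$ – such that

$$K_U(x) \le K_V(x) + c_{UV} \quad \text{for all strings } x,$$

and that constant is a few hundred bytes to a few kilobytes, against a 1GB corpus. The self-emulation idea then amounts to charging each candidate UTM the length of its own self-description – closing the regress, since the machine’s definition is expressed in the only language left standing: itself.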

That was 17 years ago and I still don’t see anything in the pedants’ literature dealing with this, but I did find one question over at cstheory stackexchange that asked the question in rather technical jargon. No response after 6 years! This seems to be something Wolfram should have been researching as part of his NKS, since he awarded $25,000 for the shortest UTM description (and proof that it is universal).

So I asked the question there and moderators deleted it – perhaps because I posted it under “challenges”. So I just reposted it to Wolfram’s NKS group and took a screen shot in case the moderators delete it again:

PS: There are various games one can play to try to evade this question, such as those that arise when people talk about the combinator calculus (SK combinators) or other concatenative formal languages as requiring descriptions external to their physical implementation. To these objections I can respond with my original suggestion to Hutter: to whatever level of abstraction from the physical one may wish to retreat, there is still the challenge of simulating that level of physical abstraction as the substratum of universal computation. So pick your level, obstructionists. NOR gates? Maxwell’s equations? QED? I don’t care.
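For the combinator objection specifically, the entire “description language” fits in a few lines. Here is a minimal SK reducer as a sketch (the term encoding is my own choice):

```python
# Minimal SK-combinator reducer (illustrative; the encoding is mine).
# Terms are 'S', 'K', other symbols, or ('app', f, x). S and K alone form
# a universal basis, so the "language needed to describe the machine" is
# essentially this handful of lines.

def step(t):
    """One leftmost (normal-order) reduction step; returns (term, changed)."""
    if isinstance(t, str):
        return t, False
    _, f, x = t
    if isinstance(f, tuple) and f[1] == 'K':              # (K a) b -> a
        return f[2], True
    if isinstance(f, tuple) and isinstance(f[1], tuple) and f[1][1] == 'S':
        a, b, c = f[1][2], f[2], x                        # ((S a) b) c -> (a c)(b c)
        return ('app', ('app', a, c), ('app', b, c)), True
    nf, changed = step(f)
    if changed:
        return ('app', nf, x), True
    nx, changed = step(x)
    return ('app', f, nx), changed

def normalize(t, max_steps=10_000):
    for _ in range(max_steps):
        t, changed = step(t)
        if not changed:
            break
    return t

# SKK is the identity combinator: ((S K) K) x -> x
I = ('app', ('app', 'S', 'K'), 'K')
print(normalize(('app', I, 'x')))   # -> 'x'
```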

PPS: Putting a kind of meta-ironic question to ChatGPT, I get this response consisting of word salad tossed with fresh steaming bullshit:

[screenshot of ChatGPT’s response]

2 Likes

ChatNPC’s answer to my brain teaser made me think a truly evil thought:

Tailor it with LessWrong’s corpus and …

The Applied Brain Research guys appear to have a breakthrough in language modeling, described in this video seminar. I didn’t want to talk about this video until I had given them a chance to enter the Hutter Prize, which I suggested they do during the seminar – but it’s been nearly 6 months and I haven’t heard anything from them. They outperform transformers on language-modeling benchmarks, and are technically able to scale up more easily and economically in terms of CUDA cores and VRAM.

I’ve been watching the guys at ABR ever since they published their first paper on Legendre Memory Units a few years ago, because of the parsimony of their models as measured in parameter counts – as well as their orientation toward dynamical systems. Both are big advantages in terms of the Algorithmic Information Criterion for causal model selection. This is the most promising of the emerging language-modeling technologies.
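For the curious, the heart of the LMU is a tiny linear dynamical system. Here is my NumPy transcription of the (A, B) matrices from their 2019 paper, stepped with forward Euler for brevity (the real implementation uses zero-order hold):

```python
import numpy as np

def lmu_matrices(order, theta):
    """Continuous-time (A, B) of the Legendre Memory Unit (after Voelker,
    Kajic & Eliasmith, 2019): 'order' state variables whose Legendre-
    polynomial readout optimally reconstructs a sliding window of length
    theta -- large memory capacity from very few parameters."""
    q = np.arange(order)
    r = (2 * q + 1) / theta
    i, j = np.meshgrid(q, q, indexing="ij")
    A = r[:, None] * np.where(i < j, -1.0, (-1.0) ** (i - j + 1))
    B = r * ((-1.0) ** q)
    return A, B

def lmu_step(m, u, A, B, dt=1.0):
    # dm/dt = A m + B u, advanced one Euler step per input sample u.
    return m + dt * (A @ m + B * u)
```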

2 Likes

The ABR paper on their Legendre Memory Unit approach to language modeling was rejected as of a few months ago. The arguments against publishing it were specious, based in large measure on the fact that ABR is resource-limited compared to the major corporations submitting papers. Here’s my response to that rejection on OpenReview.net:

I pinged ABR about their need for funding, because it is apparent that the risk-adjusted RoI on the Hutter Prize is too low:

A 1% improvement is big progress in the state of the art, but it only gets you 1% of 500k Euro – a mere 5k Euro.

This, combined with the Hutter Prize’s lack of recognition – however deserved – as an industry benchmark, renders it not worth their investment.
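For concreteness, using the payout formula as I understand the current rules (a 500,000€ fund):

$$\text{payout} = 500{,}000\,€ \times \frac{L - S}{L},$$

where $L$ is the size of the previous record and $S$ the size of the new one – so a 1% improvement ($S = 0.99\,L$) pays out 5,000€.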

3 Likes

When I went looking for stack-based RNNs today (while revisiting some thoughts I had years ago regarding minimum-size grammar induction), I discovered quite a surprise: the guys I most respect in the field had just published a paper on exactly that! Moreover, they have some pretty interesting things to say regarding my repeated assertion that people should be paying attention to the fact that Transformers aren’t recurrent – and should be looking for RNN versions thereof.

WELL… the paper’s analysis admits that RNNs are theoretically adequate, BUT it turns out they are not practically so – and by that I mean “practically” in a stronger sense than you’ll usually hear from the advocates of “Attention Is All You Need” (which, as I previously mentioned, is a really specious title attacking LSTM – a variant of RNN – as “impractical”).

Here’s the money quote:

Regarding the Chomsky hierarchy: some while ago I had to virtually rewrite the Wikipedia article on minimum description length to point out that unless you go to the top of the Chomsky hierarchy for your description language, you’re going to be barking up the wrong tree with the kinds of language models we’re now seeing emerge. Basically, the problem with RNNs is that they don’t use stacks or tapes or other structured memories that permit you to emulate dynamic memory allocation. While it is theoretically true that one can achieve this with a very wide RNN – essentially using its width to represent the data storage – it is a far less efficient form of memory allocation than is tolerable in practice.
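What I mean by structured memory, as a crude sketch: an RNN step coupled to an external stack, in the spirit of Joulin & Mikolov’s 2015 stack-augmented RNN (a non-differentiable caricature; the shapes and names are mine):

```python
import numpy as np

def stack_rnn_step(h, stack, x, W):
    """One step of an RNN with an external stack. The stack gives the
    network dynamically allocated memory, instead of forcing all storage
    into a fixed-width hidden state h."""
    top = stack[-1] if stack else np.zeros_like(h)
    h = np.tanh(W["x"] @ x + W["h"] @ h + W["s"] @ top)
    action = int(np.argmax(W["a"] @ h))      # 0 = push, 1 = pop, 2 = no-op
    if action == 0:
        stack.append(W["p"] @ h)             # push a projection of the state
    elif action == 1 and stack:
        stack.pop()
    return h, stack
```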

2 Likes

I decided to rudely inject myself into the dry academic conversation about Algorithmic Information Theory with this (existentially urgent) question:

2 Likes

Note the date:

I’m really in a quandary as to whether to call our current regime #InstitutionalAntiWhiteism or #InstitutionalZombiism

3 Likes

Well, ChatGPT got all of these questions right:

James Bowery:
what is the difference between statistical models and dynamical models?

ChatGPT: Statistical models and dynamical models are both used in various fields of science to make predictions and understand phenomena. However, there are some key differences between the two.

Statistical models are designed to capture relationships between variables and to make predictions based on these relationships. They are based on the assumption that the data is generated from some underlying probability distribution, and they use statistical methods to estimate the parameters of this distribution. Statistical models can be used to make predictions about new data, to identify significant features of the data, and to estimate the uncertainty associated with these predictions.

Dynamical models, on the other hand, are designed to capture how a system changes over time. They are based on mathematical equations that describe the evolution of a system’s state variables. Dynamical models can be used to simulate the behavior of a system under different conditions, to predict how the system will respond to perturbations, and to identify the key factors that influence the system’s behavior.

One key difference between statistical and dynamical models is the type of data they are suited to. Statistical models are typically used when there is a large amount of observational data available, and when the relationships between variables are complex and difficult to model directly. In contrast, dynamical models are typically used when the system being studied can be described by a relatively small number of state variables and a set of known equations governing their behavior.

Another key difference between statistical and dynamical models is the level of abstraction they provide. Statistical models typically focus on the relationships between observable variables, while dynamical models provide a more detailed picture of how the system’s state variables evolve over time.

In summary, statistical models are focused on making predictions based on observed data and the relationships between variables, while dynamical models are focused on describing the underlying mechanisms that drive a system’s behavior over time.

James Bowery:
which are more closely related to causality?

ChatGPT: Dynamical models are generally considered more closely related to causality than statistical models. This is because dynamical models explicitly model the underlying mechanisms that drive a system’s behavior over time, while statistical models typically focus on the relationships between observable variables.

Dynamical models are often used to test causal hypotheses by manipulating the system’s initial conditions or by simulating the effects of interventions. In contrast, statistical models are generally not well-suited for testing causal hypotheses, as they cannot distinguish between correlation and causation.

Moreover, dynamical models are often used in fields such as physics, engineering, and biology, where the underlying physical or biological mechanisms are well-understood. In these fields, dynamical models can be used to identify the key factors that drive a system’s behavior and to make predictions about the effects of perturbations or interventions.

However, in fields such as social sciences and economics, where the underlying mechanisms are often less well-understood, statistical models may be more useful for making predictions and identifying relationships between variables. In these fields, statistical models are often used to control for confounding variables and to identify associations between variables that may suggest causal relationships.

In summary, while both statistical models and dynamical models have their uses in different fields of science, dynamical models are generally considered more closely related to causality due to their ability to explicitly model the underlying mechanisms that drive a system’s behavior over time.

James Bowery:
to what levels of the chomsky hierarchy do they most closely relate?

ChatGPT: Dynamical models and statistical models are not directly related to the levels of the Chomsky hierarchy, as the Chomsky hierarchy is a classification scheme for formal languages, while dynamical and statistical models are used to model real-world phenomena.

However, it is possible to make some loose connections between these concepts. The Chomsky hierarchy classifies formal languages into four types, based on the type of grammar that generates them: regular, context-free, context-sensitive, and recursively enumerable. These types of grammars correspond to increasing levels of complexity and expressiveness.

In terms of the Chomsky hierarchy, regular languages are the simplest and can be recognized by finite-state automata. Context-free languages can be recognized by pushdown automata, while context-sensitive languages require linear-bounded automata. Recursively enumerable languages are the most complex and can be recognized by Turing machines.

In general, statistical models can be thought of as tools for modeling patterns in data, and these patterns may be related to regular or context-free languages. For example, language models in natural language processing are typically based on statistical models and are designed to predict the probability of a given sequence of words. These models are often based on n-gram models or recurrent* neural networks, which can be thought of as recognizing patterns in the data.

Dynamical models, on the other hand, are often used to model the behavior of systems over time, and these models may be related to context-sensitive or recursively enumerable languages. For example, differential equations are a type of dynamical model that are used to describe the behavior of physical systems over time, and these equations can be very complex and difficult to solve.

In summary, while there is no direct correspondence between statistical and dynamical models and the levels of the Chomsky hierarchy, statistical models may be more closely related to regular or context-free languages, while dynamical models may be more closely related to context-sensitive or recursively enumerable languages.

* NB (JAB): Just as more general languages can model less general ones, recurrent neural nets can be degenerated into statistical models. This is what had been done with LSTMs prior to “Attention Is All You Need”, and is therefore likely what ChatGPT is referring to.

James Bowery:
are transformer models more closely related to statistical or dynamical models?

ChatGPT: Transformer models are more closely related to statistical models than dynamical models.

Transformer models, such as the ones used in natural language processing (NLP), are based on neural networks that use self-attention mechanisms to encode the input data. These models are trained to predict the probability distribution of the next word in a sequence, given the previous words. This training is done through a process called backpropagation, where the model adjusts its parameters to minimize the difference between its predictions and the actual target output.

While transformer models do have some temporal aspects, such as the sequential processing of inputs, they do not explicitly model the underlying mechanisms that drive a system’s behavior over time, which is a key feature of dynamical models. Instead, transformer models are designed to capture the statistical regularities in the data, which is a key feature of statistical models.

James Bowery:
what machine learning models are more closely related to dynamical models?

ChatGPT: Recurrent neural networks (RNNs) and other sequence models are more closely related to dynamical models than other machine learning models.

RNNs are a type of neural network that is designed to process sequential data, such as time series or natural language. RNNs work by maintaining a “hidden state” that captures information about previous inputs, which is updated at each time step as new inputs are processed. This hidden state can be thought of as a summary of the system’s past behavior, and it is updated based on the current input and the previous hidden state.

By maintaining this hidden state over time, RNNs are able to model the temporal dependencies in the data, which is a key feature of dynamical models. RNNs can be used for a variety of tasks, including language modeling, machine translation, speech recognition, and image captioning.

Other sequence models, such as Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs), are also designed to model temporal dependencies and are closely related to dynamical models. These models are able to learn to selectively remember or forget information from the past, which allows them to capture long-term dependencies and avoid the vanishing gradient problem that can occur with standard RNNs.

In summary, RNNs and other sequence models are more closely related to dynamical models than other machine learning models, as they are designed to model the temporal dependencies in the data and maintain a hidden state that captures information about past inputs.

James Bowery:
can more context sensitive grammars model context free grammars?

ChatGPT: Yes, more context-sensitive grammars can model context-free grammars.

A context-free grammar is a type of formal grammar that generates a context-free language, which is a language that can be generated by a pushdown automaton. Context-free grammars are a subset of the more general class of context-sensitive grammars, which generate context-sensitive languages.

A context-sensitive grammar is a formal grammar in which the left-hand side of every production rule consists of a single nonterminal symbol and a context of terminal and/or nonterminal symbols to the left and right of the nonterminal. In other words, the production rules are more restrictive than context-free grammars, which only require the left-hand side to be a single nonterminal symbol.

Since context-sensitive grammars are more general than context-free grammars, they can generate any language that can be generated by a context-free grammar. This means that context-sensitive grammars can model context-free grammars by simply using a subset of their production rules.

In fact, every context-free grammar can be converted into an equivalent context-sensitive grammar. This is done by adding additional production rules that enforce the context-sensitive restrictions on the left-hand side of each production rule. This process is known as the “augmentation” of the context-free grammar, and it results in a context-sensitive grammar that generates the same language as the original context-free grammar.

In summary, context-sensitive grammars can model context-free grammars by using a subset of their production rules, and every context-free grammar can be converted into an equivalent context-sensitive grammar.

James Bowery:
can algorithmic information theory be formulated in terms of context sensitive grammars?

ChatGPT: Yes, algorithmic information theory can be formulated in terms of context-sensitive grammars.

Algorithmic information theory is a branch of information theory that deals with the amount of information contained in a finite object, such as a string of symbols. It is based on the concept of Kolmogorov complexity, which is a measure of the amount of information needed to describe an object.

Context-sensitive grammars are a formalism for generating languages, which can be used to generate strings of symbols. In fact, every recursively enumerable language, which includes all computable languages, can be generated by a context-sensitive grammar.

Therefore, it is possible to use context-sensitive grammars to generate the strings of symbols that are the objects of study in algorithmic information theory. Specifically, the Kolmogorov complexity of a string can be defined as the length of the shortest context-sensitive grammar that generates the string.

This definition of Kolmogorov complexity captures the intuitive notion of the amount of information needed to describe an object, since a short context-sensitive grammar implies that the object can be generated by a simple and concise rule.

In summary, algorithmic information theory can be formulated in terms of context-sensitive grammars by defining the Kolmogorov complexity of a string as the length of the shortest context-sensitive grammar that generates the string.

James Bowery:
can algorithmic information theory be formulated in terms of context free grammars?

ChatGPT: Algorithmic information theory cannot be formulated directly in terms of context-free grammars, since context-free grammars are not expressive enough to generate all recursively enumerable languages, which are the set of objects that algorithmic information theory is concerned with.

However, it is still possible to relate context-free grammars to algorithmic information theory through the concept of Chomsky normal form. Chomsky normal form is a specific form of context-free grammar in which all production rules have the form A → BC or A → a, where A, B, and C are nonterminal symbols, and a is a terminal symbol. In other words, the right-hand side of each production rule consists of exactly two nonterminal symbols or a single terminal symbol.

Every context-free grammar can be converted into an equivalent grammar in Chomsky normal form. Therefore, any recursively enumerable language generated by a context-free grammar can also be generated by a grammar in Chomsky normal form.

The conversion to Chomsky normal form involves adding new nonterminal symbols and production rules to the grammar, which increases its size. Therefore, the Kolmogorov complexity of a string may be different depending on whether it is generated by the original context-free grammar or the equivalent grammar in Chomsky normal form.

In summary, while algorithmic information theory cannot be formulated directly in terms of context-free grammars, it is still possible to relate them through the concept of Chomsky normal form, which allows us to consider the Kolmogorov complexity of strings generated by context-free grammars.

James Bowery:
is algorithmic information approximated by maximizing lossless compression under constrained resources?

ChatGPT: Yes, algorithmic information can be approximated by maximizing lossless compression under constrained resources.

The idea behind this approach is to use compression algorithms to measure the amount of information in a string. Compression algorithms are designed to remove redundancy in a string by finding regular patterns and replacing them with shorter codes. The more redundancy there is in a string, the more it can be compressed.

However, in practice, it is not feasible to compress a string to its minimum size, since this would require an infinite amount of computational resources. Therefore, compression algorithms are usually constrained to a fixed amount of computational resources, such as time or space. The goal is then to find the compression that achieves the maximum compression ratio within these constraints.

This approach can be used to approximate the Kolmogorov complexity of a string, which is a measure of the amount of information needed to describe the string. The Kolmogorov complexity of a string is the length of the shortest program (in some fixed programming language) that generates the string. Since it is not possible to compute the Kolmogorov complexity directly, one can use compression algorithms as a proxy to estimate it.

The idea is that the more a string can be compressed, the more redundancy it contains, and therefore, the less complex it is. By constraining the amount of computational resources, we can approximate the shortest program that generates the string under these constraints, which is a proxy for its Kolmogorov complexity.

In summary, algorithmic information can be approximated by maximizing lossless compression under constrained resources, which allows us to estimate the Kolmogorov complexity of a string.
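For concreteness, here is a trivial sketch of what “compression as a resource-bounded proxy for Kolmogorov complexity” means in practice (Python’s lzma standing in for any lossless compressor):

```python
import lzma
import os

def complexity_upper_bound(data: bytes) -> int:
    """The compressed length is an upper bound on the data's algorithmic
    information, up to the fixed size of the decompressor: the compressed
    bytes plus the decompressor form a program regenerating the data exactly."""
    return len(lzma.compress(data, preset=9))

print(complexity_upper_bound(b"ab" * 50_000))       # highly regular: tiny
print(complexity_upper_bound(os.urandom(100_000)))  # random: ~100,000 bytes
```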

2 Likes

LLMs are now downloadable, and this technology will be integrated everywhere:

It will be a develop-your-own-bias game.

5 Likes