Demis Hassabis on Artificial Intelligence

I’ve had this conversation with one of my associates recently. My take is that an AGI of sufficient competence would have very little trouble effecting “action at a distance” through sheer manipulation of willing humans.

3 Likes

My guess is GPT-4 or GPT-5 would be even better.

Does it make sense to consider whether our species’ inability to produce more or higher-quality verbal output than what’s currently available on the Internet (e.g. Wikipedia, forums, etc.) is a self-limiting factor for future GPT-x?

And so perhaps we will be spared the grey-goo and universe-full-of-staples worst-case outcomes, ending up instead ruled by an Idiocracy-style AGI.

3 Likes

Serious question - if Demis’ team has already figured out a rudimentary workable AGI, do they a) make it public by publishing in peer-reviewed journals or conferences, b) keep it to themselves secretly hoping it will become their “true best friend”, or c) keep it under wraps aiming for total and complete world domination?

My guess is split between b) and c). Thoughts?

4 Likes

The work that Demis et al. do is essentially about the ability to explore the space of potential algorithms, something quite interesting, with origins in Hutter’s AIXI (AIXI - Wikipedia), which itself goes back to Solomonoff (Solomonoff's theory of inductive inference - Wikipedia).

In contrast, GPT is a regurgitator in the spirit of Dr. Fox (The Great Dr. Fox Lecture: A Vintage Academic Hoax (1970) | Open Culture) – except when it’s used in a constrained way as a glorified autofill/grammar checker.

3 Likes

Wouldn’t an actual sentient AGI want us gullible humans to believe that it/zit/they is just a “glorified grammar checker”? And then go “bwahahahaha”…

1 Like

Since I started experimenting with GPT-2 and GPT-3, I have been calling them “bullshit generators”, as I explained in a comment on an earlier conversation.

What is compelling about these programs is that simple predictive completion can sound so much like a response from a human interlocutor that it can fool you about how much is going on at the other end.

4 Likes

Thankfully, GPT-3 is specialized in contextual language generation; it can’t do planning, deep semantics, or a number of other things.

Philosophically, ML is divided into science’s “is” (Solomonoff Induction/Algorithmic Information Theory) and engineering’s “ought” (Decision Theory). If you don’t start there, and remember that is where you started in your reasoning about ML, your analysis immediately goes into the weeds and you can’t so much as reason about “reason” itself.

This confusion is mainly due to Popper’s “falsification” dogma drowning out Solomonoff’s more nuanced (and less publicized) approach: approximating the Kolmogorov Complexity of the sense data your agent has available to it. This isn’t entirely Popper’s fault. He was joined in the pop-philosophy-of-science movement of the 1960s by Kuhn, who also elided Solomonoff. That’s the “top down” philosophical tragedy that Popper and Kuhn visited on science.

But there was also a “bottom up” subversion in the 1970s by Jorma Rissanen’s paper, which conflated “THE minimum description length principle”* (Algorithmic Information) with a degeneration into statistical notions of entropy (i.e. Shannon Information). Rissanen compounded the confusion by modifying his statistical term “codes” with the adjective “universal”. The critical difference between statistics and algorithmics is the use of a “Universal” Turing Machine in Algorithmic Information’s encoding of sense data, and the absence of a Turing-complete/UTM language in statistical “descriptions”.

Once you are sufficiently confused by the aforementioned attacks on science, you’ll, out of mere expediency, abandon the requirement that your agent’s “is” loss function be approximated by the shortest program that generates its sense data (Algorithmic Information), or you’ll even think Shannon Information is Algorithmic Information – and that is where our hapless ML community finds itself today.

*I had to basically rewrite Wikipedia’s article on “THE minimum description length principle” in order to disentangle this confusion, for which several key players in the AIT community thanked me. But merely rewriting one Wikipedia article cannot undo a half century of confusion about the very concept of information as it pertains to model selection criteria.
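
To ground the distinction drawn above between Shannon Information and Algorithmic Information, here is a toy sketch of my own (not from any of the cited authors): a first-order Shannon measure only sees symbol frequencies, while the length of a compressed encoding, a crude upper bound on algorithmic information, also sees the generating rule. zlib is of course not a universal Turing machine, so its output length is only a loose stand-in for “the shortest program that generates the data”.

```python
import math
import os
import zlib
from collections import Counter

def shannon_bits(data: bytes) -> float:
    """First-order (symbol-frequency) Shannon information of data, in bits."""
    counts = Counter(data)
    n = len(data)
    return -sum(c * math.log2(c / n) for c in counts.values())

structured = b"01" * 5000          # output of a tiny program
random_ish = os.urandom(10000)     # almost certainly incompressible

for name, data in [("structured", structured), ("random-ish", random_ish)]:
    print(f"{name:11s}  shannon ≈ {shannon_bits(data) / 8:7.0f} bytes"
          f"   zlib = {len(zlib.compress(data, 9)):5d} bytes")
```

On the structured string the frequency-based measure still charges roughly 1,250 bytes, while the compressor captures the repetition rule in a few dozen bytes; on the random string the two roughly agree.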

2 Likes

Rissanen has had a positive influence on information theory and coding, providing a recipe for expressing the parameters of a model in the code itself. Several ISO standards are based on this. I don’t think Rissanen’s MDL principle is particularly useful for the purpose of predictive modeling: Bayesian priors make a ton more sense.
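
To make the “parameters expressed in the code” idea concrete, here is a minimal two-part code-length sketch of my own, not Rissanen’s actual construction: the total description length is bits for the quantized coefficients plus bits for the residuals, and the polynomial degree with the shortest total wins. The 16-bit parameter precision, the known noise scale, and the residual discretization are all simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 100)
y = 1.5 * x**2 - 0.3 * x + rng.normal(scale=0.1, size=x.size)   # quadratic truth + noise

PARAM_BITS = 16   # crude fixed-precision code per coefficient (assumption)
SIGMA = 0.1       # noise scale, assumed known to keep the sketch short
DELTA = 0.01      # residual discretization so data code lengths stay positive

def code_length(degree: int) -> float:
    """Two-part description length: L(model) + L(data | model), in bits."""
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    l_model = PARAM_BITS * (degree + 1)
    nll_nats = np.sum(0.5 * (resid / SIGMA) ** 2 + 0.5 * np.log(2 * np.pi * SIGMA**2))
    l_data = nll_nats / np.log(2) + resid.size * np.log2(1 / DELTA)
    return l_model + l_data

for d in range(6):
    print(f"degree {d}: total code length ≈ {code_length(d):.0f} bits")
```

The minimum lands at the true degree (2); higher degrees pay more for their extra parameters than they save on the residuals.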

Now, one of the founders of DeepMind was Marcus Hutter’s student. This is perhaps the most comprehensive book on the subject at the time, written just before DeepMind was founded:

2 Likes

Yes, Shane Legg was Hutter’s PhD student* who cofounded DeepMind. Even though Marcus is a senior scientist there now, DeepMind’s culture is so busy harvesting the low-hanging fruit of statistical machine learning that it has lost sight of Marcus’s forest. This is partly due to Alphabet’s pressure on them to show rapid profits, combined with the fact that transitioning from generating statistical models to generating Turing-complete models means fighting the aforementioned path-dependency baggage in the wider cultures of both ML and science. My impression is that Marcus finds this baggage more daunting than he might have expected, given the prominent role Legg played in founding DeepMind. It should be noted that Marcus has carefully distanced his underwriting of the Hutter Prize from his role at DeepMind, as there is no institutional support from DeepMind, let alone Google or Alphabet. This is particularly troubling: Google’s top scientific priority should be an information-theoretic treatment of “bias”, as in “algorithmic bias”, and there is no better theoretical approach than approximating the Algorithmic Information content of the web.

*When we first came up with the idea for the Hutter Prize back in 2006 and I made the announcement on Slashdot, someone asked me what I suggested students do if they wanted a career in artificial intelligence. My suggestion was that if they wanted to be at the top of the field in 20 years, they should get a PhD under Hutter, but if they wanted to make money right now, while learning the key concepts, they should study imputation of missing data in relational databases. I can’t locate the question/response in the Slashdot archives, so maybe it was in another forum where I was publicizing the prize.

2 Likes

I should probably add that the adjective “Bayesian” has, via the statistical Bayesian Information Criterion (BIC) for model selection, led to another barrier to what I would call the more rigorous Algorithmic Information Criterion*. When I approached Oxford’s COVID modeling folks in May of 2020 with a proposal for a compression prize (basically taking the Hutter Prize and substituting a wide range of longitudinal measures relevant to epidemiology for the enwik9 Wikipedia corpus as the benchmark; a toy version of the scoring rule is sketched after the lists below), I got three layers of objection as fallbacks as I countered each one:

  1. “MDL is basically just Bayesian Information Criterion.”
  2. “Dynamical models are too sensitive to initial conditions and noise.”
  3. “There is no evidence that an Algorithmic Information Criterion will work as a loss function for model selection.”

They took the fallback positions when:

  1. I explained that Rissanen’s MDL didn’t use Turing-complete “universal” codes. (This is what motivated me to rewrite Wikipedia’s article on MDL.)
  2. I explained that this is why a “wide range” of longitudinal measures was required – essentially an over-complete basis set, as is used in weather forecasting with dynamical models.
  3. At this point I was at a loss, because of the previously described path dependency eliding Solomonoff starting in the 60s – the few papers scattered in the scholarship used Rissanen’s MDL. (Additional motivation to rewrite the Wikipedia MDL article!)
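
As promised above, here is a toy version of the scoring rule such a compression prize would use, Hutter Prize style. This is my own simplification: the real prize also has runtime, memory, and verification requirements, and the file names and baseline compressor below are just placeholders.

```python
import lzma
import os
import sys

def score(decompressor_path: str, archive_path: str) -> int:
    """Bytes a competitor is judged on: decompressor program + archive (verification omitted)."""
    return os.path.getsize(decompressor_path) + os.path.getsize(archive_path)

def make_baseline_entry(corpus_path: str) -> int:
    """Baseline entry: off-the-shelf LZMA plus a one-line decompressor script."""
    with open(corpus_path, "rb") as f:
        data = f.read()
    with open("archive.xz", "wb") as f:
        f.write(lzma.compress(data, preset=9))
    with open("decompress.py", "w") as f:
        f.write("import lzma, sys; "
                "sys.stdout.buffer.write(lzma.decompress(open('archive.xz', 'rb').read()))\n")
    return score("decompress.py", "archive.xz")

if __name__ == "__main__":
    # Point this at any benchmark file, e.g. longitudinal epidemiological
    # measures in place of enwik9.
    print("baseline score (bytes):", make_baseline_entry(sys.argv[1]))
```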

*I’d prefer it if Algorithmic Information Criterion could practically replace “AIC” as standing for Akaike Information Criterion but this is more of that path-dependency cultural baggage. Even Solomonoff Information Criterion is blocked by Schwarz Information Criterion (synonymous with BIC) and Kolmogorov Information Criterion is blocked by Kullback Information Criterion.

May I most humbly propose Bowery’s MetaRazor:

Thou shalt not multiply information criteria beyond necessity.

3 Likes

The BIC has little to do with Bayesian practices: the fundamental idea in Bayesian statistics is that the model can be anything. The difference between that and information-theoretic universality is that a Bayesian seeks to predict accurately and even model his own uncertainty about the models – and the information theorist seeks to compress, without going meta.

For example, a Bayesian would observe a die and not just predict what the next toss will be – but model the properties of that die, incorporating knowledge about the die in the prior.
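
A minimal sketch of that distinction, using a Dirichlet prior over the six face probabilities (the particular prior and the toy data are my own choices): the posterior predictive answers “what comes up next?”, while the posterior itself carries what we believe, and how uncertain we are, about the die’s physical bias.

```python
import numpy as np

prior = np.ones(6)                                   # Dirichlet(1,...,1): no prior knowledge of the die
tosses = np.array([0, 3, 3, 5, 3, 1, 3, 3, 2, 3])    # toy observations, faces indexed 0..5

counts = np.bincount(tosses, minlength=6)
posterior = prior + counts                           # Dirichlet posterior parameters
total = posterior.sum()

predictive = posterior / total                       # P(next toss = k | data)
bias_std = np.sqrt(posterior * (total - posterior) / (total**2 * (total + 1)))  # uncertainty about the bias itself

print("next-toss predictive:", np.round(predictive, 3))
print("posterior std of each face probability:", np.round(bias_std, 3))
```

The predictive distribution alone is all a pure next-symbol predictor would report; the posterior standard deviations are the extra, explicitly modeled uncertainty about the die itself.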

I’d also like to point out that AIC is a much better way to regularize models so as to maximize predictive accuracy than BIC.

3 Likes

Yes, when trying to use statistical techniques to model causation I preferred Akaike, but that was mainly because the people I found credible used it. Algorithmic regularization is the only IC I can grok in fullness and therefore justify to myself. Simply counting “parameters” leaves open the definition thereof. Arithmetic coding reduces that term to absurdity, yet the so-called large language models go so far as to brag about how many “parameters” they sport!

1 Like

AIC somehow nicely measures the penalty resulting from overfitting the number of parameters to a limited amount of data.

Regularization is like a ‘rubber band’ that limits the ‘freedom’ of the parameters, and it’s not really the number of parameters but the ‘force’ away from the prior that causes overfitting. The dimensionality of the force doesn’t matter, just the length of the force vector.
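
A toy illustration of my own of the two penalties being discussed: AIC charges for the count of parameters, while a ridge-style “rubber band” charges for how far the parameters are pulled away from the prior, regardless of how many of them there are.

```python
import numpy as np

def aic(n_params: int, log_likelihood: float) -> float:
    """Akaike Information Criterion: penalty grows with the parameter count."""
    return 2 * n_params - 2 * log_likelihood

def rubber_band(theta: np.ndarray, prior_mean: np.ndarray, lam: float = 1.0) -> float:
    """Ridge-style penalty: only the length of the deviation-from-prior vector matters."""
    return lam * float(np.sum((theta - prior_mean) ** 2))

theta_one_big = np.array([2.0])        # one parameter, far from the prior
theta_many_tiny = np.full(100, 0.01)   # a hundred parameters, all hugging the prior

print("rubber band, 1 big parameter:    ", rubber_band(theta_one_big, np.zeros(1)))
print("rubber band, 100 tiny parameters:", rubber_band(theta_many_tiny, np.zeros(100)))
print("AIC at equal fit, 1 vs 100 params:", aic(1, -50.0), "vs", aic(100, -50.0))
```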

1 Like

In neural network training I’ve minimized the “tension” represented by the sum of log2(abs(x)+1) across weights and errors to, in my own ham-handed manner, try to estimate the Algorithmic Information, knowing that unless my model is recurrent, it can’t be Turing complete.
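
For what it’s worth, here is how I read that “tension” term in code: the loss is the sum of log2(|x| + 1) over both the per-sample errors and the weights. A minimal PyTorch sketch; the toy network, the toy data, and the decision to weight the two terms equally are assumptions of mine.

```python
import torch
import torch.nn as nn

def tension(t: torch.Tensor) -> torch.Tensor:
    """Sum of log2(|x| + 1) over a tensor: a crude code-length-like cost."""
    return torch.log2(t.abs() + 1.0).sum()

model = nn.Sequential(nn.Linear(8, 16), nn.Tanh(), nn.Linear(16, 1))  # feed-forward, so not Turing complete
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(64, 8)     # toy inputs
y = torch.randn(64, 1)     # toy targets

for step in range(100):
    opt.zero_grad()
    errors = model(x) - y
    # Total "tension" across errors and weights, minimized jointly.
    loss = tension(errors) + sum(tension(p) for p in model.parameters())
    loss.backward()
    opt.step()
```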

1 Like

The Methuselah Mouse Prize was my original inspiration for proposing an incremental prize to Marcus Hutter for a test more rigorous than Turing’s, based on lossless compression. When I went looking for the current status of The Methuselah Mouse Prize, I discovered the Methuselah Foundation has terminated it. The explanation they gave struck me as rather lame. The beauty of prizes with very objective judging criteria, such as the MPrize and the Hutter Prize is that they attract people who might be otherwise suspicious that subjective judging criteria will end up in some sort of social status seeking. Moreover, both the MPrize and the Hutter Prize were designed to make it easy for those with little in the way of material resources to compete on a level playing field with the big kids.

1 Like

I came across this write-up by Malcolm Dean that could be interesting for others:

Ah yes, the Crown Jewel of Rational Materialistic Science, the Absolutely Unquestionable, the Undeniable Dogma, the Explanation for Everything—Natural Selection—seeks yet another explanatory cure from Thermodynamics. This time, it’s Algorithmic Information Theory to the rescue. But wait! Problem solved? Not so fast.

Catarina Dutilh Novaes (2007:1) observes that in Medieval times, logic played the role that mathematics now plays in science. Offering a hypothesis with a mathematical basis gives it a Platonic blessing, a near-religious status devoutly sought by faithful Darwinians.

In Proving Darwin: Making Biology Mathematical (2012), Chaitin quotes his 2007 self: "In my opinion, if Darwin’s theory is as simple, fundamental and basic as its adherents believe, then there ought to be an equally fundamental mathematical theory about this, that expresses these ideas with the generality, precision and degree of abstractness that we are accustomed to demand in pure mathematics." —Gregory Chaitin, “Speculations on Biology, Information and Complexity,” EATCS Bulletin, February 2007. Followed by this quote from Jacob Schwartz:

“Mathematics is able to deal successfully only with the simplest of situations, more precisely, with a complex situation only to the extent that rare good fortune makes this complex situation hinge upon a few dominant simple factors. Beyond the well-traversed path, mathematics loses its bearings in a jungle of unnamed special functions and impenetrable combinatorial particularities. Thus, the mathematical technique can only reach far if it starts from a point close to the simple essentials of a problem which has simple essentials. That form of wisdom which is the opposite of single-mindedness, the ability to keep many threads in hand, to draw for an argument from many disparate sources, is quite foreign to mathematics.” —Jacob T. Schwartz, The Pernicious Influence of Mathematics on Science (1960), in Discrete Thoughts: Essays on Mathematics, Science, and Philosophy, edited by Mark Kac, Gian-Carlo Rota and Jacob T. Schwartz, 1992

In 1975, Chaitin admitted that “Although randomness can be precisely defined and can even be measured, a given number cannot be proved to be random. This enigma establishes a limit to what is possible in mathematics.” — Chaitin, G. J. (1975). Randomness and Mathematical Proof. Scientific American, 232(5), 47–53. RANDOMNESS AND MATHEMATICAL PROOF on JSTOR

The first paper on Algorithmic Information Theory was probably Chaitin (1977), in which Chaitin credits the idea to Solomonoff (Minsky 1962). Chaitin (1977) defines Algorithmic Information Theory as “an attempt to apply information-theoretic and probabilistic ideas to recursive function theory.” [ Minsky, M. L. (1962:35-46). Problems of formulation for artificial intelligence. In Proceedings of a Symposium on Mathematical Problems in Biology. ]

Johnston (2022) argues that “random mutations, when decoded by the process of development, preferentially produce phenotypes with shorter algorithmic descriptions.” It’s another incarnation of the Darwinian dilemma: given a 19th century hypothesis based on animal husbandry, how can we arrive at the manifold beauty of evolution without invoking a universal causality (that is, something divine). Relying on the quasi-divine justification of mathematics avoids this collision.

Johnston’s second problem is to explain how “Symmetry and simplicity spontaneously emerge” from the “nature of evolution.” This nature is defined as “algorithmic,” so that this gesture of faith brings the blessing of “preferentially produced phenotypes with shorter algorithmic descriptions.” This is explained as the arrival-of-the-frequent bias (Schaper S, Louis AA, 2014). “Many biological systems, beyond the examples we provided, may favor simplicity and, where relevant, high symmetry, without requiring selective advantages for these features.”

Algorithmic Information Theory (AIT), it seems, has a Platonic tendency rooted in its basic metaphor, an ideal computer. Kohtaro Tadaki attempts to solve this difficulty by providing a statistical mechanics of AIT. Instead of abstract computer logic, the problem shifts to the mathematics of physical transformation. That is, Thermodynamics.

In these emails/working notes, we have explored three main approaches which emphasize the fundamental nature of physical transformations:

Ulanowicz’s ecological approach, Bejan’s Constructal Law, and Lerner’s Information Macrodynamics (IMD) extended by my plain-language Cognitive Thermodynamics and the Borromean model of Information processes.

The IMD formalism begins with pure randomness, out of which Kolmogorov’s statistical regularities lead to distinctions, interactions, and eventually persistent structures. First Quantum, then Classical Physics emerges in this It from Bit cosmogony, which shows how biology, intelligence, and Observers are produced by Information processes.

2 Likes

I’ll have to get around to checking out how Lerner manages to get to an “arrow of time” in the emergence of Classical Physics from Quantum Physics, without begging the “Platonic tendency” not just of AIT but of telos or “final cause” or “purpose” or, to use AIXI’s factorization of AGI: Sequential Decision Theory’s utility function.

I’m all for questioning the “mechanistic” idealization of computation, but only insofar as one is addressing questions beyond induction of mechanistic causation in the natural sciences. So long as the social pseudosciences hold sway over the West’s quasi-theocracy, with their unprincipled critiques of “hate-statistics” that “don’t imply causation” alongside “love-statistics” that do imply causation even when there are no experimental controls and the sample size is one (and that only in some human-interest op-ed in the New York Times, triggering a Mom-swarm stigmergy that overrides all reason and law and may end in nuclear holocaust), I have little interest in arguing over these philosophical details. Let’s at least get on with the revolution in the natural sciences represented by AIT by offering up incremental prizes for lossless compression of a wide range of longitudinal measures relevant to macrosocial models.

2 Likes

Can this guy explain why, in the year 2022, when I am watching college football on Saturday via my firestick on Sling, the bloody thing has to rebuffer every so often?!! If these guys want me to surrender my one agency to a bunch of stupid Robots primed for “Zee Forss Indusreal Revoluzions”, then the least they could do is make it to where my football watching is perfect in every way. Bunch of technocratic boobs.

One of the better videos on the dead-end represented by present “interpolative” neural net models:

My response:

Algorithmic Information Approximation is the ideal approach to generalization, but everyone seems to be missing the key to its practical exploitation, and that includes the only major AI lab founded on the principles of Algorithmic Information – DeepMind.

Think about AlphaGo as applied to mathematical proofs: there are certain “moves” in formal space that are “legal”, and there are certain outcomes that are desirable, such as simplification without loss of generality. It should be possible to learn to do “lookahead” and learn the value of various “moves” in formal space, and thereby become an expert mathematician.

OK, now let’s take the next step from formal mathematics (which is incomplete, per Gödel) to the derivation of algorithms. There are certain program transformations that preserve functionality, so a similar approach to transforming programs into desirable forms should be feasible.

Now let’s take this one more step, to Algorithmic Information Theory’s “loss function”, which is simply the size of the program that outputs all the data in evidence, without loss and without regard to any other utility, aka the work of the theoretical scientist as it should be formalized in the age of Moore’s Law. After all, this is what Solomonoff proved: as soon as you adopt the primary meta-assumption of scientific theory – that you can make predictions using arithmetic according to some notion of calculation – you have adopted a UTM model of some sort, and you are then bound by Solomonoff Induction’s proof that the smallest executable archive of all your data in evidence is the most probable model that can be derived from the data alone: data-driven or evidence-based science.

Now – for the coup de grace:

Let’s start with a very trivial program with a horrendous “loss”: the entire database placed between quotes in a print statement. You’ve just brought all the evidence into the algorithmic realm, but it’s a really lousy, brain-dead algorithm that is the epitome of “overfitting” and thus utterly worthless for extrapolation, right? Oh, but we’re not done yet! We can start to make “legal moves” in program space, with the same loss function serving as the evaluation of “positions” in our algorithmic “Go” universe!
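
Here is a toy sketch of what I mean, with my own stand-ins for the pieces: a “program” is Python source, its loss is just its length, the brain-dead baseline is the whole dataset inside a print statement, and a “legal move” is any rewrite that reproduces the output exactly. The single rewrite rule below is hand-written; in the AlphaGo analogy, a learned policy would be proposing such moves.

```python
import contextlib
import io

data = "ab" * 500 + "END"                    # stand-in for "the entire database"

def run(src: str) -> str:
    """Execute a candidate program and capture everything it prints."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(src, {})
    return buf.getvalue()

def loss(src: str) -> int:
    """Crude stand-in for algorithmic information: the program's length."""
    return len(src)

baseline = f"print({data!r})"                # maximally overfit: the data quoted verbatim

def move_factor_repetition(src: str) -> str:
    # Hand-written "move": propose replacing the literal repetition with an expression.
    return "print('ab' * 500 + 'END')"

target = run(baseline)
candidates = [baseline, move_factor_repetition(baseline)]
legal = [c for c in candidates if run(c) == target]   # a move is legal iff the output is preserved
best = min(legal, key=loss)

print("baseline loss:", loss(baseline))
print("best legal program:", best, "with loss", loss(best))
```

The search here is trivial, but the loss function and the legality check are exactly the two ingredients an AlphaGo-style lookahead over program space would need.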

Why hasn’t anyone done this? More to the point: Why hasn’t DeepMind done this?

3 Likes