The Life & Times of Claude Shannon

“A Mind at Play: How Claude Shannon Invented the Information Age”, by Jimmy Soni & Rob Goodman, ISBN 978-1-4767-6668-3, 366 pages (2017).

Claude Shannon (1916 - 2001) once said: “A very small percentage of the population produces the greatest proportion of the important ideas”. Although his development of the mathematics which underpin communications technology has been critical to much of the modern world, making him one of that very small percentage, Shannon is relatively little known. This may be in part due to his intense modesty. One minor telling detail: Ludwig Boltzmann, who developed the thermodynamic equation for entropy, had that equation carved prominently onto his gravestone; Shannon’s gravestone similarly carries the equation he developed for communications entropy – but carved onto the back of the headstone, mostly hidden by a bush.

Shannon’s lack of interest in self-promotion clearly presented a challenge to his biographers. Accordingly, Soni & Goodman dug deep and created this beautifully-crafted & heavily-documented tale, which sets Shannon’s life & contributions into their historical contexts. The authors demonstrate the low signal-to-noise ratio which plagued early telegraphy by summarizing the problems with the first trans-Atlantic cables. They set the stage for Shannon’s exploration of machine intelligence by recounting the story of “The Turk”, a chess-playing mechanical mannequin which astonished audiences in the 1820s. (It was a hoax.)

Looking back from today, Claude Shannon’s life seems to belong to a different world rather than simply last century. It was a time when a boy from a small town in Michigan could get a top-notch education; when the US had world-leading industrial research organizations like Bell Labs; when a university would give a prominent academic the freedom to pursue his interests without the burden of “publish or perish”.

Shannon came to the attention of Vannevar Bush when he took a job at MIT in 1936, working on Bush’s differential analyzer, a mechanical calculating device. Shannon’s 1937 master’s thesis showed how Boolean algebra could revolutionize the design of relay switching circuits. Bush recognized Shannon’s “almost universal genius” and helped direct Shannon’s career toward Bell Labs, which he joined full-time in 1941. During World War II Shannon worked on cryptography and other technologies about which little is in the public domain.

While working at Bell Labs, Shannon recognized that the content of a message was irrelevant to the task of transmitting it over a wire in the face of inevitable noise. He summarized years of work on how to optimize message transmission in a 1948 Bell Labs technical publication – “A Mathematical Theory of Communication”. The authors do an excellent job of explaining the key concepts without bogging down the lay reader in too much mathematics. That publication made Shannon the founder of the field of digital communication at the very time that the computer age was beginning to sprout. Some would argue it also marked the end of his significant contributions to science & engineering.

In 1956, Shannon left Bell Labs (although the Labs insisted on keeping him on salary) and took up a professorial position at MIT, where he had the freedom to pursue his own interests – thinking machines (including machines built with his own hands), juggling (both himself & with machines he built), unicycles, the stock market, even gambling. Working with a collaborator, he built a wearable device the size of a pack of cigarettes which provided a probabilistic edge in predicting the outcome of a spin of a roulette wheel. They made a successful trial run in Las Vegas – and then decided the possibility of running afoul of the Mafia was not worth the risk.

Sadly, Shannon was diagnosed with Alzheimer’s disease in 1983. In 1993 he had to be moved into a nursing facility, where he died in 2001.

Shannon had said “… semantic aspects of communication are irrelevant to the engineering problem”. He gave us the theoretical understanding to vastly improve the engineering efficiency of communications. It might be a comment on the rest of us that we have used much of that high efficiency to perfect the broad distribution of cat videos.


As Kelvin R. Throop said, over and over again, until people started to throw overripe vegetables at him, “Once Pareto gets into your head, you’ll never get him out.”

The implications of his work, and the intellectual foundation it laid, have had profound consequences for physics, philosophy, and information science—the latter a discipline he, in large part, invented.

Central to this is the concept of Shannon entropy, which provides a rigorous definition for the amount of information in a message, and the essential insight that information consists exclusively of what you do not already know. If the content of a message is already known to the recipient, it has zero information content. If a message can be losslessly compressed to half its original size and then expanded upon receipt, its entropy is half that of its original representation. Counter-intuitively, the message with the greatest entropy is one which is completely random: it cannot be compressed, its content cannot be predicted by any means, and it can only be transmitted explicitly. This has deep implications for cryptography, data compression, and the definition of the scientific method as a means of compressing the complexity of observations of the natural world into compact mathematical laws or models with predictive power.

Fourmilab’s “ENT” utility computes the Shannon entropy of an arbitrary stream of data and performs other tests to seek structure which may allow it to be compressed or predicted.
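Shannon’s measure is simple enough to sketch. Here is a minimal, order-0 estimate in the spirit of what ENT reports (bits of entropy per byte of the stream); the function name is mine, not ENT’s, and ENT itself performs several further tests beyond this one:

```python
import math
from collections import Counter

def shannon_entropy_bits_per_byte(data: bytes) -> float:
    """Order-0 Shannon entropy of a byte stream, in bits per byte (0.0 to 8.0)."""
    counts = Counter(data)
    n = len(data)
    # H = -sum over symbols of p * log2(p)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A repetitive message carries little information per byte...
print(shannon_entropy_bits_per_byte(b"aaaaaaaabbbbbbbb"))  # 1.0
# ...while uniformly distributed bytes reach the 8-bit maximum.
print(shannon_entropy_bits_per_byte(bytes(range(256))))    # 8.0
```

A fully random stream scores near 8 bits per byte, which is why such a test can detect structure: anything that scores lower is, in principle, compressible.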

It was many years after Shannon defined entropy for information that it was realised that entropy in thermodynamics is essentially the same thing. The reason Maxwell’s Demon cannot violate the second law of thermodynamics is an information processing issue, and any time information is lost in a computational process, a minimum amount of energy must be lost as heat (Landauer’s principle). This realisation is central to the fields of reversible computing and quantum computation.

The realisation that information is as fundamental a component of the universe as matter and energy is central to understanding mysteries ranging from the origin of life to the behaviour of economic systems, and the essential insight was in Claude Shannon’s work.


Shannon’s paper is worth reading in the original: it is crisp, clear, and succinct, yet it had an impact like no other.


Certainly we all have benefitted tremendously from the uncommon abilities of a Newton, an Einstein, a Shannon. Consequently, it is worth taking a moment to ponder what would have happened to Shannon if he had had the misfortune to be born into today’s world.

A small-town white boy from an intact right-leaning family in fly-over country? No chance! Today’s equivalent of Vannevar Bush would likely be the kind of far-left Democrat woman who seems to dominate government, academia, and industry – she would be looking to advance the careers of people who tick the correct Woke boxes, regardless of ability. Today’s Shannon would likely become one of Thomas Gray’s untapped opportunities:

“Full many a flower is born to blush unseen
And waste its sweetness on the desert air.”

Meritocracy – it is an idea worth considering.


A good way of thinking about the difference between Shannon information and Kolmogorov Complexity is to consider the way Shannon estimated the amount of information carried in text:

Consider what the human is doing:

Based on a lifetime of having read texts, the human comes to each letter, as it is revealed, with prior expectations about the source of the text. These prior expectations are, in fact, not only a “language” model but a world model – a world that contains within it objects such as “authors” of things like “text”. So the “human entropy” is highly dependent on something that is going on in the mind of the human:

Learning from experience of the world.

Depending on what that “prior” is, each unit of the message carries a different amount of information.

Kolmogorov Complexity is concerned with the entirety of one’s experiences of the world, measured by the minimum size of the executable binary that outputs those experiences as the phenomena we call data. Any “surprise” that greets a “Language Model” undergoing a similar Shannon Test (a more objective test of intelligence than the Turing Test) is greeted by the Solomonoff Induction process as a need to refactor the algorithm – perhaps to be smaller than it would be if it merely appended the surprise datum to the end of the algorithm’s output as a program literal (say, as a residual error term to add to the program’s expected observation).
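The dependence of measured information on the predictor’s prior can be made concrete with a toy sketch. This is an order-0 caricature of Shannon’s prediction experiment, not his actual procedure, and all names here are mine:

```python
import math
from collections import Counter

def surprisal_bits(text: str, prior_text: str) -> float:
    """Total bits needed to encode `text` under a simple letter-frequency
    model learned from `prior_text` (Laplace-smoothed over a 7-bit ASCII
    alphabet).  A richer prior means fewer bits of 'surprise'."""
    counts = Counter(prior_text)
    total = sum(counts.values())
    vocab = 128  # assumed alphabet size
    bits = 0.0
    for ch in text:
        p = (counts[ch] + 1) / (total + vocab)
        bits += -math.log2(p)  # surprisal of this character under the prior
    return bits

msg = "the theory of communication"
# The same message costs fewer bits for a predictor with a better prior:
naive = surprisal_bits(msg, "")
informed = surprisal_bits(msg, "a mathematical theory of communication " * 50)
print(f"{naive:.0f} bits with no prior, {informed:.0f} bits with an informed prior")
```

With an empty prior every character costs the full 7 bits; with experience of similar text the cost per character drops sharply – the “information” in the message is relative to the model in the receiver’s head.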


That this hasn’t caught on in the philosophy of science as a way of objectively judging between different models of the same data is really a crime against humanity.

(speaking of harping on a topic until people throw rotten vegetables at me)


What we have now is mobocracy: give us power or we start torching neighborhoods. And fakeocracy: whoever tells the best story, wins – this applies to the billions going into “AI” as well as to positions of power going to those whose CVs look good. They’re all subtypes of idiocracy that arose from positive psychology and boosting idiots’ self-esteem.

Meritocracy is a bit backwards anyway: developing merit requires opportunities. And once one’s had merit, maybe giving them even more opportunities isn’t the best use of resources.

Instead, we need resultocracy: you give resources & positions to those who are going to do the most with them in terms of a multiplier effect.

One of the early things that Eliezer Yudkowsky did was his Intuitive Explanation of Bayes’ Theorem. Maybe you can do something like that for Kolmogorov Complexity?


LessWrong spawned a cottage industry whose primary product is ever more pedantic ways to obscure the obvious regarding Kolmogorov Complexity, i.e. Occam’s Razor. Take, for example, the LessWrong article that came out concurrently with ChatGPT:

In particular take note of this quote:

the component corresponding to randomness is a much larger contributor to the K complexity, by 6 orders of magnitude! This arises from the fact that we want the Turing Machine to recreate the string exactly, accounting for every bit.

This is a Chimera with a thousand faces – every one of which is an “expert” who just knows which data are “random” and can be reduced to some standard statistic like root mean squared error as the “loss function” (with, perhaps, one of the zoo of “information criteria for model selection” to avoid “overfitting”).

So, having just now found that article on a search of LessWrong for examples of the Curse of Yudkowsky placed on our hapless world, I posted this response (that I’ve been posting to people ever since the Hutter Prize began):

No one ever “gets it” mainly because they are either “experts” who think I’m a kook, or they are one of the vast demoralized populace who believe in “experts” – especially those who call themselves “skeptics” as they heap ridicule on those who are skeptical of “experts”. Don’t want to be further demoralized by ridicule, now do you?


Nature’s loss function is Darwinian.


“Meritocracy” is one of those words, like “liberal” and “fascist”, which has so mutated since it was coined (in 1958) that it means almost nothing, or conveys Shannon information of essentially zero bits, if used without further explanation. “Meritocracy” first appeared in Michael Young’s 1958 novel The Rise of the Meritocracy (link is to my review). Young was an intellectual stalwart of the British Labour Party and author of its first postwar manifesto. His book is a wry dystopia couched as a Ph.D. thesis written in the year 2034. As I note in the review, “Young’s dry irony and understated humour has gone right past many readers, especially those unacquainted with English satire, moving them to outrage, as if George Orwell were thought to be advocating Big Brother.”

The meritocracy of this book is nothing like what politicians and business leaders mean when they parrot the word today (one hopes, anyway)! In the future envisioned here, psychology and the social sciences advance to the point that it becomes possible to determine the IQ of individuals at a young age, and that this IQ, combined with motivation and effort of the person, is an almost perfect predictor of their potential achievement in intellectual work. Given this, Britain is seen evolving from a class system based on heredity and inherited wealth to a caste system sorted by intelligence, with the high-intelligence élite “streamed” through special state schools with their peers, while the lesser endowed are directed toward manual labour, and the sorry side of the bell curve find employment as personal servants to the élite, sparing their precious time for the life of the mind and the leisure and recreation it requires.

And yet the meritocracy is a thoroughly socialist society: the crème de la crème become the wise civil servants who direct the deployment of scarce human and financial capital to the needs of the nation in a highly-competitive global environment. Inheritance of wealth has been completely abolished, existing accumulations of wealth confiscated by “capital levies”, and all salaries made equal (although the élite, naturally, benefit from a wide variety of employer-provided perquisites—so is it always, even in merito-egalitopias). The benevolent state provides special schools for the intelligent progeny of working class parents, to rescue them from the intellectual damage their dull families might do, and prepare them for their shining destiny, while at the same time it provides sports, recreation, and entertainment to amuse the mentally modest masses when they finish their daily (yet satisfying, to dullards such as they) toil.

But there is a fly in this soothing ointment.

Young’s meritocracy is a society where equality of opportunity has completely triumphed: test scores trump breeding, money, connections, seniority, ethnicity, accent, religion, and all of the other ways in which earlier societies sorted people into classes. The result, inevitably, is drastic inequality of results—but, hey, everybody gets paid the same, so it’s cool, right? Well, for a while anyway…. As anybody who isn’t afraid to look at the data knows perfectly well, there is a strong hereditary component to intelligence. Sorting people into social classes by intelligence will, over the generations, cause the mean intelligence of the largely non-interbreeding classes to drift apart (although there will be regression to the mean among outliers on each side, mobility among the classes due to individual variation will preserve or widen the gap). After a few generations this will result, despite perfect social mobility in theory, in a segregated caste system almost as rigid as that of England at the apogee of aristocracy. Just because “the masses” actually are benighted in this society doesn’t mean they can’t cause a lot of trouble, especially if incited by rabble-rousing bored women from the élite class. (I warned you this book will enrage those who don’t see the irony.) Toward the end of the book, this conflict is building toward a crisis. Anybody who can guess the ending ought to be writing satirical future history themselves.

Young writes,

Contrast the present — think how different was a meeting in the 2020s of the National Joint Council, which has been retained for form’s sake. On the one side sit the I.Q.s of 140, on the other the I.Q.s of 99. On the one side the intellectual magnates of our day, on the other honest, horny-handed workmen more at home with dusters than documents. On the one side the solid confidence born of hard-won achievement; on the other the consciousness of a just inferiority.

This is a tale of how “equal opportunity” can lead to gross disparity in outcomes unless suppressed by a tyrannical regime of redistribution. I don’t think this is what present-day advocates of “meritocracy” (Tony Blair was particularly fond of the word) envision, but perhaps they do.


The ideal system for a society in which we are all inter-dependent would be (a) to keep “rule by others” to an essential minimum and (b) to select the best people to perform that minimum necessary “rule by others”. The challenge with any such meritocratic system is – How to select those “best people”.

What we are really looking for is competence. But competence is like pornography – easy to recognize, difficult to define. We can all quickly recognize a competent carpenter or a competent auto mechanic, but how could we identify a competent bureaucrat or a competent politician?

Back in the days of small tribes, it was natural that the individual whom everyone recognized as most competent would end up leading the hunting party. As societies got larger, the founders of successful dynasties were necessarily competent. But the descendants of a Genghis Khan may not be as competent.

Sadly, alternatives to parentage as a method for selecting leaders have not proven much better. China’s long-term use of the Keju examination system for selecting Imperial bureaucrats did not save China from backwardness and occupation by foreigners. Today’s “democracy” is clearly proving to be a very poor way to select competent politicians. Perhaps the human inability to develop a reliable method for selecting competent leaders is why history tells us that every organized society degenerates and collapses after a couple of centuries or so?

The development of a truly reliable long-term method of meritocratic selection for competence might be the greatest advance a human society could ever make.


Humanity has long ago evolved societal structures where the competent rule over the expendable. It’s about time we stopped being in denial about the competitive efficacy of such systems, and instead question the ideological opposition to efficacy.

Systems compete with one another, and ‘algorithms for promoting competence’ are selected along with the CPUs executing these algorithms. Algorithms might be the societal values, religions might be operating systems, societies are networks, and we’re the computers.


The realisation that information is as fundamental a component of the universe as matter and energy is central to understanding mysteries ranging from the origin of life to the behaviour of economic systems

A related, hard-to-summarize 2014 blog post:

Compression, Entanglement and a Possible Basis for Morphic Fields


So the universe is analogous to a class of computational processes, some more efficient than others, with the most efficient being heavily favored as representations, which compress natural patterns of evolution of matter and fields so that required resources are minimized to model or instantiate the universe. These compressed representations of patterns have a supra-physical, informational component which is encoded in the thermal radiations of all matter and fields, which cause a cascade of entanglements which in turn have the history of the universe’s changing patterns encoded within them. The entanglement of the particles in new patterns with those of past patterns requires the new pattern to be consistent with all the quantum informational constraints of the past patterns. The only consistent universes are those where all the past information from all past patterns is still implicit in each and every new pattern, sub-pattern and interaction. So the past patterns can serve as templates for later patterns, with a size-dependent degree of clarity, as with parts of a hologram, and allow effective compression of all similar situations in the past to each local region of the universe. The thermal radiation information field compresses all similar past situations because it is not truly possible to erase information, but only to turn it into “heat” which is basically just information that one has decided to ignore. Everything in the universe that “stays happened” (as opposed to quantum eraser-type situations) is on the permanent, ineradicable record.

These templates are patterns in both space and time, allowing for example the progressive elaboration of structures in the development of embryos, and so can most effectively be modeled by generative programs which produce the evolving state of the simulation or instantiation, rather than just static data, that is, efficiency implies not just compressibility but minimizing the Kolmogorov complexity of the computational processes analogous to the physical situations. This allows not just physical structures but patterns of behavior and modes of development to be optimized for their analogous computational processes’ equivalents of memory space and processing power, and thus gives not just a memory but a super pattern-recognition capability in every part of the universe, which can read a developing situation and compare it with everything in the past light-cone, thus compressing it to effectively require only the new, original information content it embodies to be added to the thermal motions and radiations that communicate past interactions and patterns among the parts of the universe through quantum phases and entanglements’ implications. The past patterns it embodies are already in the information field, but each repetition and close variation makes them “stronger”, or more compressed and efficient.

Effectively this is like compression with unlimited side information available. The information capacity of thermal radiation is enormous given it has about 10^19 to 10^21 photons per joule. Even the milli-atto-joules characteristic of the smallest molecular motions give rise to photons. To see the potential power of this sort of compression, movies would be very easy to “compress” to send over a wire if the sender and all viewers already had a copy of every movie ever made as “side information” – only a serial number or tag code would need to be transmitted to “transmit” gigabytes of movie. (But in such a large data set as the universe’s information field there probably is a shortage of short tag codes, codes shorter than the patterns they represent, even if the codes are context-dependent.)
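The “unlimited side information” point in the movie example can be made concrete with a toy sketch (the catalogue and function names here are mine, purely for illustration):

```python
# Both ends of the channel share the same catalogue as side information.
catalogue = {
    0: b"...gigabytes of movie A...",
    1: b"...gigabytes of movie B...",
}

def transmit(movie: bytes) -> int:
    """Sender: with shared side information, only a short tag crosses the wire."""
    for tag, known in catalogue.items():
        if known == movie:
            return tag
    raise ValueError("not in the shared catalogue; must be sent explicitly")

def receive(tag: int) -> bytes:
    """Receiver: expands the tag back into the full data from side information."""
    return catalogue[tag]

tag = transmit(catalogue[1])          # the whole "movie" crosses as one small int
assert receive(tag) == catalogue[1]   # receiver reconstructs it exactly
```

The channel cost is the length of the tag, not of the data – which is why the supply of tags shorter than the patterns they name becomes the binding constraint.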

The information field, being in its heat diffusion the same as the wave equation with time replaced by imaginary time, implies that its dynamics occur in imaginary time, which is like a small cylindrical manifold with a particle that changes phase as it spirals along it helically, as in electron zitter motion, rather than staying at one angle on the cylinder as in normal time. (See Zitterbewegung comment on the article on “time crystals” on the Simons Foundation site, reposted here.) It is recurrent time, cyclical time, perhaps not time but eternity. And among the compressed patterns in the information field are all the people who ever lived and every thought and action they ever had or did. Not just the dead ones, either, nor just the distant past, but the past that starts a nanosecond ago, even a yoctosecond ago. In fact, the parts of the future that are implied by the past are already in the field, so it’s really somewhat atemporal or eternal.

So the afterlife, precognition, remote viewing and telepathy are implications of this view. It even suggests how it is possible to give a remote viewing target with only an arbitrary code number. The code and the target are physically associated on the envelope or in the computer and the target information is sent via the code in the same way that the movies were “sent” in the example of compression with unlimited side information.

See Daniel Burfoot’s book: “Notes on a New Philosophy of Empirical Science” (arXiv:1104.5466v1 [cs.LG], 28 Apr 2011) for more down-to-earth applications of the idea of treating science as a data compression problem, compressing vast quantities of experience and experiment down to pithy theories and equations.

There are some similar posts on my blog, and others that may be interesting.



Thank goodness!

What a relief!

A book I don’t have to write even if I had the talent and time to do so!

And the time is ripe for people to read it.

May his work be more widely read and may he be spared vicious character assassination by various pseudonymous identities who are threatened by algorithmic information as model selection criterion (as I’ve occasionally been subjected to when bringing it up in “threatening” ways). But given that he cites physics as the empirical science that is most amenable to algorithmic information approximation as the gold standard model selection criterion, and at least one of these pseudonymous identities was a physicist, one can only imagine the kind of attacks he’ll be subject to by the social pseudosciences since that is the current front line in the battle for relevant truth.

I may have to admonish him to edit and amplify to become a central theme, what he says here almost as a side comment:

The programming language format provides the largest possible degree of flexibility, and also the largest overhead cost. But in the limit of large data, the programming language format may very well be the best choice.

Failure to properly emphasize this has been perhaps the single greatest stumbling block to the information age’s acceptance of this more rigorous approach to empirical science.

A less critical problem (although it is an actual error rather than emphasis misdirection) is his statement that includes the “compressor” size rather than the decompressor size in the approximate measure of algorithmic information:

Indeed, when dealing with specialized compressors, the distinction between “program” and “encoded data” becomes almost irrelevant.
The critical number is not the size of the compressed file, but the net size of the encoded data plus the compressor itself.

PS: ycombinator is a mosh pit.


Dear Daniel Burfoot,

Your important book “Notes on a New Philosophy of Empirical Science” is becoming more timely although it has been overdue since at least Solomonoff and the dawn of Moore’s Law.

A few suggestions:

This passage is more central than you may realize:

The programming language format provides the largest possible degree of flexibility, and also the largest overhead cost. But in the limit of large data, the programming language format may very well be the best choice.

One of the primary failures of statistics is the lack of dynamical systems models – and reality is one big dynamical system. This is particularly important when dealing with issues of “causality” - which is at the heart of the most controversial scientific questions regarding complex systems such as society. Although there is no finite computer that is “Turing Complete” in the abstract sense, it is vital that models at least be recurrent so as to incorporate time as an element. In this respect any formally “Turing Complete” code is, in fact, a finite state machine since the hardware is finite. But it should be at least that capable.

So I would suggest a fairly major edit to emphasize dynamical models – especially unified models of complex systems where there is no good separation between dependent and independent variables.

My other suggestions are not nearly as important, but here they are:

  • Make the following phrases into quasi-idioms: “data selection” “model creation” and “model selection”. For some background as to why these need reification see the readme at the Hume’s Guillotine repo.

  • Wherever you use the word “compressor” as part of the approximate measure of algorithmic information, you should use “decompressor”. Although this is a somewhat pedantic nit – saying “compressor” is a technical error. For example, see this passage where you should have used “decompressor”:

The critical number is not the size of the compressed file, but the net size of the encoded data plus the compressor itself.

The compressor takes data and generates a program of minimum size that, when executed, regenerates (decompresses) the original data.

  • One of the ways that scientists, even physicists, react poorly to the use of algorithmic information approximation as model selection is the impression that humans are to be replaced. The way I try to defuse this bomb is to emphasize that not only is “data selection” irreducibly subjective – hence must involve the scientific community – but that even once a selection of data is agreed on, scientists will be responsible for introducing their priors to the compressors which may automate only part of the “model creation” process. The only place where their role will be pretty much eliminated is in the final stage of “model selection” since that will be reduced to a single figure of merit: The size of the executable archive of the data their community has selected.

  • In terms of the foregoing suggestion about programming language as the encoding format, it is important to note the role of program literals as the way one encodes “noise” for the exact reconstruction during decompression. As I like to say to people when this issue comes up and they try to short-cut the encoding of “noise” with “measures” like RMSE: “One man’s noise is another man’s ciphertext.”
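The compressor/decompressor distinction in the second suggestion can be sketched concretely. In this toy upper bound (my own construction, using zlib as a stand-in compressor), only the decompressor is counted in the figure of merit, and incompressible “noise” necessarily survives as a literal in the encoded data rather than being waved away with a summary statistic:

```python
import os
import zlib

# A tiny decompressor program; ENCODED would be spliced in as a program literal.
DECOMPRESSOR_SRC = b"import sys,zlib;sys.stdout.buffer.write(zlib.decompress(ENCODED))"

def approx_algorithmic_information(data: bytes) -> int:
    """Crude upper bound on the algorithmic information of `data`: the size
    of the encoded data plus the size of the *decompressor* needed to
    regenerate it exactly.  The compressor's own size never enters the
    measure: it is not shipped with the archive."""
    return len(DECOMPRESSOR_SRC) + len(zlib.compress(data, 9))

structured = b"ABAB" * 1000   # regular pattern: a small program suffices
noise = os.urandom(4000)      # "noise": must ride along verbatim as a literal
print(approx_algorithmic_information(structured))  # small
print(approx_algorithmic_information(noise))       # exceeds 4000: no shortcut
```

The structured stream collapses to a few dozen bytes plus the fixed decompressor, while the random stream costs slightly more than its own length – exactly the “one man’s noise is another man’s ciphertext” point.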

Again, thank you for taking on a very challenging, very important and long-overdue task in the philosophy of science.

– Jim


Thanks for the YC link – a mosh pit indeed, with gwern getting a few slams in and the book’s author, Daniel Burfoot, joining in as well. It’s great that the book is getting some recent attention; I linked to it back in 2014.

I’ve read many of your comments on Next Big Future; you seem a lot more focused on AIT and compression here, though. I’m still working on Ed Jaynes and MaxEnt basics after 20+ years, though I’ve also been studying compressed sensing ideas for a long time. Applications I’ve been thinking about:

  • a workable design for a deep hyperspectral video camera using only a handful of line sensors
  • super-compression of financial time series, using the rate of return of an optimum series of trades for the uncompressed data as the reference for judging the quality of the compressed version – if the optimum series of trades is the same, then the compression is perfect; the degree to which the rate of return of the compressed optimum trade sequence is less than that reference when run on the original uncompressed data gives a measure of information loss.
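That trade-sequence figure of merit might be sketched like this – a long-only toy with hypothetical helper names; transaction costs, short positions, and position sizing are all ignored:

```python
def optimal_return(prices):
    """Multiplier earned by the optimal long-only trade sequence:
    hold precisely through every rise, stand aside through every fall."""
    r = 1.0
    for a, b in zip(prices, prices[1:]):
        if b > a:
            r *= b / a
    return r

def compression_fidelity(original, compressed):
    """Run the trade sequence that is optimal for the *compressed* series
    against the *original* prices; the shortfall versus the true optimum
    measures the information lost in compression (1.0 == lossless)."""
    decisions = [b > a for a, b in zip(compressed, compressed[1:])]
    r = 1.0
    for (a, b), hold in zip(zip(original, original[1:]), decisions):
        if hold:
            r *= b / a
    return r / optimal_return(original)

orig = [100, 102, 101, 105, 103]
print(compression_fidelity(orig, orig))                       # 1.0: perfect
print(compression_fidelity(orig, [100, 100, 100, 105, 103]))  # < 1.0: lossy
```

The appeal of this loss function is that it is denominated in the quantity the application actually cares about (return forgone) rather than in a generic distortion measure like RMSE.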

I’ll restrain myself from going on too much in this reply, but I think quantitative finance is the application that would get the most attention (and money).


Ever since the DotCon bubble burst leaving us with The LAST Big Future, NBF’s Brian Wang reminds me to some extent of “Criswell PREDICTS!”

Everyone was breathlessly wondering what would provide a resurgence of tech optimism (the next bubble). It was around this time that a descendant of a Mad magazine cartoonist hyped his mysterious revolutionary invention that would be The Next Big Future – so profound as to motivate the blaring of the apocalyptic trumpet in the form of Jeff Bezos’s honking laughter at a demonstration thereof.

But see here’s the thing:

There is no better theory of data-driven prediction than AIT. That’s why it does no good to bring it up with people like Brian, or, for that matter, others who have attained social status by predicting. My attempt to get metaculus to pay attention fell on deaf ears, and Brian is a “star” over at metaculus in much the same manner as my former housemate Karl Hallowell became a star at ideosphere: lots of arbitrage grunt work on topics so far from black swan predictions as to make one wonder who could possibly be interested in their predictions.

No, there aren’t that many environments in which one would not be considered a mere annoyance for appropriately emphasizing AIT, which is one reason why I believe myself to be considered something other than a mere annoyance here.


Having been challenged by one of us to come up with an “Algorithmic Information for Dummies”, I entered into a Socratic Dialogue with GPT-4, asking it questions, to see if I could guide it to self-examination of its own “knowledge”.

One can only imagine what Socrates would have done with a student who, suffering from the “little professor” syndrome of Aspergers, talked down to Socrates with bullshit while engaging in zero critical self-awareness.


Hector Zenil put together this brief video trying to get people to attend his Santa Fe Institute course on algorithmic information theory and its relationship to data-driven causal discovery (although he says “model driven” which seems an interesting inversion):


I don’t think he understands machine learning enough to talk about it - but it’s interesting to see him drawing the distinction between model-driven and data-driven perspectives.

You might be interested in Malcolm R. Forster’s work on popularizing Whewell’s consilience - which is, actually, a very neat way of explaining one example of algorithmic information theory to philosophers of science.

The successful unification of disparate phenomena is a feature of good scientific theories that William Whewell referred to as the consilience of inductions.

Here’s more about Whewell: