The most important chapter in the era during which the upper Midwest dominated the world’s computer industry. I’ve written about my own small part in attempting to bypass the personal computer era with mass market computer networking while I was at CDC’s Arden Hills operations working on the PLATO network. The Wall Street press was having none of it and basically got the fashion-conscious middle management to deep-six it. The same types have completely overtaken the nervous system of the world-wide network today.
“A team of 34 people, based in a small Wisconsin town…”
I recall seeing the transcript of a speech Cray gave to “The Agency” (presumably NSA) that described the above history and then some. I do recall he made a point of mentioning that he always maintained a size limit on his engineering staff of about 40. This, of course, is now known as one of “the numbers” (e.g. “Dunbar” of 150 which would be about the size of Cray’s “tribe” including wives and children) for which humans are evolved. I’m trying to locate a copy of that speech since it apparently isn’t online anywhere – probably because it was classified at the time I saw a pirated copy we had at the UofIL PLATO lab. I’ve asked some of the old guard if they have any copies stuffed away in a box somewhere to retrieve.
One thing not emphasized enough in the video is the importance of the memory control unit, which permitted “dual head” CPU systems that “interleaved” memory access between banks based on the low address bits, so in addition to parallel functional units, multiple memory banks could be in operation. This left an impression on me due to the need for shared memory access between CPUs to be as fast as possible. That, plus the need to reduce memory latency for short vector lengths, is what motivated me to come up with a wild idea for keeping memory access on-chip by sacrificing cache in favor of shared memory with an analog crossbar mutex. Going off chip incurs a very serious time penalty. While I understand that cache is important for latency in the era of speed-of-light delay due to wire length (what I’ve seen called the “radius of control” problem in system-on-chip design), this still seems to me a major gap in the architecture space – particularly given the need for sparse representations in AI and their correspondence to lossless compression of training data, hence Solomonoff Induction approximation – where locality of reference is minimized by Occam’s Razor.
Some of this speed-of-light latency limit can be addressed by stacking memory chips with through-silicon vias (TSVs).
In the world of Seymour Cray’s former employer, Univac, interleaved memory was a matter of some controversy. Traditional Univac architecture separated memory into banks which, as I recall, were 32,768 (36-bit) words in size. The memory controller could perform simultaneous reads or writes from two separate banks, but if it were processing two accesses to the same bank, it had to execute them serially, doubling the access time. Because most instructions involved one access to fetch the instruction and another to load or store data, programs were usually structured to separate “ibank” (instruction bank) and “dbank” (data bank) into separate blocks the operating system would load into different physical banks, which would make the program run around twice as fast as if instructions and data were all in the same bank.
This made sense for a single processor system, but for multiprocessor systems (the 1108 could have up to 3 CPUs and 2 separate I/O controllers [IOCs], while the 1110 could have 6 CAUs [command/arithmetic units] and 4 IOUs [input/output units]), the system was prone to long memory waits when multiple processors were accessing instructions and data from the same physical bank (even if the memory belonged to different programs being run simultaneously). The answer was to interleave all memory based upon the low address bit. This resulted in greater throughput for the multiprocessor access case but slower performance when running a program with strict ibank/dbank separation on a single processor.
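To make that trade-off concrete, here is a toy Python model. It is my own sketch, not Univac's actual memory controller logic. It assumes a memory of just two banks of 32,768 words, one time unit per access, and that simultaneous accesses to the same bank are simply serialised. The access patterns `single_cpu` and `two_cpus` are hypothetical, picked to show each layout at its best and worst.

```python
from collections import Counter

BANK_WORDS = 32_768  # bank size in words, as recalled above


def contiguous_bank(addr):
    """Traditional layout: each bank holds one contiguous block of addresses."""
    return addr // BANK_WORDS


def interleaved_bank(addr):
    """Interleaved layout: the low address bit selects the bank."""
    return addr & 1


def total_time(cycles, bank_of):
    """Cost of a run: accesses to different banks overlap,
    accesses to the same bank in one cycle are serialised."""
    total = 0
    for addrs in cycles:
        per_bank = Counter(bank_of(a) for a in addrs)
        total += max(per_bank.values())
    return total


# One CPU with strict ibank/dbank separation: instruction fetch from the
# first bank's address range, data (hypothetical stride-2 walk) from the second.
single_cpu = [(pc, BANK_WORDS + 2 * pc) for pc in range(1000)]

# Two CPUs fetching instructions for different programs that both live in
# the first bank's address range (hypothetical, out of phase by one word).
two_cpus = [(pc, 10_001 + pc) for pc in range(1000)]

for name, cycles in (("single CPU", single_cpu), ("two CPUs", two_cpus)):
    print(name,
          "contiguous:", total_time(cycles, contiguous_bank),
          "interleaved:", total_time(cycles, interleaved_bank))
```

In this toy model the single-CPU program takes 1000 time units with contiguous banks and 1500 with interleaving, while the two-CPU case takes 2000 with contiguous banks and 1000 with interleaving. The exact ratios depend entirely on the made-up access patterns; only the direction of the effect is the point.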
Since the 1108 billed programs based on CPU time, this meant that unless cost were adjusted, it would cost more to run a program on a machine with interleaved memory than on one with separate contiguous banks. Users were not happy paying more for running the same program as before. On the 1110, Univac introduced a “storage reference counter” which would provide a repeatable measure of memory accesses regardless of delays in accessing memory, allowing consistent billing regardless of how programs were laid out in physical memory or conflicts in memory accesses by processors.
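A rough illustration of why counting references gives repeatable billing where elapsed time does not. This is a hypothetical sketch: the function `run`, the two-bank memory, and the access pattern are all my assumptions, and the details of the real 1110 counter are not given above.

```python
def run(cycles, bank_of):
    """Return (elapsed_time, storage_references) for one program run.

    Assumption: one time unit per access, and accesses that hit the
    same bank in the same cycle are serialised.
    """
    elapsed = 0
    references = 0
    for addrs in cycles:
        banks = [bank_of(a) for a in addrs]
        elapsed += max(banks.count(b) for b in banks)  # worst bank sets the pace
        references += len(addrs)                       # independent of layout
    return elapsed, references


# Hypothetical program: one instruction fetch plus one data access per cycle.
program = [(pc, 32_768 + 2 * pc) for pc in range(1000)]

contiguous = run(program, lambda a: a // 32_768)  # separate ibank/dbank
interleaved = run(program, lambda a: a & 1)       # low-bit interleave

print("contiguous :", contiguous)
print("interleaved:", interleaved)
```

The elapsed time differs between the two layouts (1000 versus 1500 units here), but the reference count is 2000 in both, which is the property that makes it usable as a billing measure.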
A fully populated 6600 used 5 address bits to select the bank – which meant far less contention. My recollection is that Cray had more to do with the memory controller innovations than he did the “score board” (superscalar architecture) – which was Thornton’s baby. Cray did, however, pay a lot of attention to optimizing the instruction set (nascent RISC) based on the actual FORTRAN code used by the science community (e.g. national labs). Maybe that’s contributed to people eliding Thornton in the less-than-adequate historical accounts of superscalar architecture.
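For what 5 bank-select bits buy you, a minimal sketch, assuming (as in the interleaving discussion above) that the bank is chosen by the low address bits. The helper names are mine.

```python
BANK_BITS = 5
NUM_BANKS = 1 << BANK_BITS  # 5 select bits give 32 banks


def bank(addr):
    """Bank selected by the low BANK_BITS bits of the address (assumed scheme)."""
    return addr & (NUM_BANKS - 1)


def banks_touched(start, stride, count):
    """How many distinct banks a strided run of `count` accesses lands in."""
    return len({bank(start + k * stride) for k in range(count)})


print(banks_touched(0, 1, 32))   # sequential words
print(banks_touched(0, 2, 32))   # every other word
print(banks_touched(0, 32, 32))  # stride equal to the number of banks
```

Sequential accesses spread across all 32 banks, a stride of 2 uses 16 of them, and a stride of 32 hammers a single bank, so how much contention you avoid depends on the access pattern as well as the bank count.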
Cray had a bit of a conflict with the NSA over the choice of instruction set.
In the original concept of RISC (Dave Patterson, UC Berkeley), one of the main ideas was that the instruction sets of core memory era computers were optimised to perform as many arithmetic and logic operations as possible per instruction because instruction fetches were time consuming compared to the speed of logic circuitry. The advocates of RISC argued that once memory access time was comparable to that of logic operations, better performance could be obtained with a simple instruction set and large register files, counting on the compiler to optimise code to perform as many operations as possible register to register without accessing memory.
Architectures such as SIMD (Illiac, etc.) and vector (Cray) went the other direction: allow a single instruction to perform processing of bulk data which could be pipelined (taking advantage of interleaving to do fetches in parallel).
Now we have most machines running on a perfect hybrid (or bastard) architecture, where the instruction set seen by the programmer or compiler is indescribably baroque but is transformed by the processor’s microcode into a very long instruction word (VLIW) internal code which the hardware actually executes with multiple functional units, register renaming, multi-level cache, and branch prediction all getting into the game to try to squeeze out more performance.
It seems crazy to plow all that real estate into doing work that could be done with higher level language descriptions involving AND and OR parallelism (i.e., work compilers should be doing). It has always seemed like the hardware world and software world have never gotten over what Backus tried to remedy with his Turing Award lecture. Instead they make programmers “compile” higher level descriptions into lower level descriptions for the high purpose of making the hardware designers reverse that work to get the parallelism back.
If the mutex circuit worked, nowadays (3nm → 200M transistors/mm^2) you should be able to get about 100 Cray-1s along one side of a 1mm^2 die, with about 100 shared RAM banks between them on the same die. The registers etc. would all remain the same, for a lot faster processing. That means about 100x lower shared memory latency for scalar operations than going off-die. Although the total on-die shared memory would be tiny by today’s standards, that can be increased by going to a 1cm^2 die, which would cut the on-die latency advantage to about 10x over going off-die. That would also increase the Cray-1 cores to 1000 and the shared memory for sparse AI operations to around 1GB. You don’t need to keep very many of those cores busy to pay for the real estate.
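The arithmetic behind those figures, as a back-of-envelope Python sketch. It uses only the numbers quoted above (200M transistors/mm^2, 100 cores per mm of die edge, 100x latency advantage at 1mm) plus one assumption of mine: that on-die latency grows linearly with the length of the die edge. The function name `scale` is hypothetical.

```python
DENSITY = 200_000_000       # transistors per mm^2, the 3nm figure quoted above
CORES_PER_MM = 100          # Cray-1 cores along one side, per mm of die edge
ADVANTAGE_AT_1MM = 100      # on-die vs off-die latency advantage for a 1mm edge


def scale(side_mm):
    """Return (cores, latency_advantage, transistors_per_core_and_bank).

    Assumes a square die and that on-die latency scales linearly with
    the die edge, so the advantage over off-die shrinks as the die grows.
    """
    cores = CORES_PER_MM * side_mm
    advantage = ADVANTAGE_AT_1MM // side_mm
    transistors = DENSITY * side_mm ** 2
    return cores, advantage, transistors // cores


print(scale(1))   # 1mm x 1mm die
print(scale(10))  # 1cm x 1cm die
```

On these assumptions a 1mm^2 die gives 100 cores with a 2M transistor budget per core-plus-bank pair, and a 1cm^2 die gives 1000 cores with a 20M budget each and a 10x latency advantage.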
That was the idea with the Intel Itanium: a very long instruction word (VLIW) architecture where each instruction could encode several operations, with the conflict resolution and scheduling delegated to the compiler. Simulation indicated it should easily out-perform both RISC and CISC (x86 and VAX-type architectures) but in reality performance was disappointing, perhaps because compiler developers did not make sufficient investment in optimising for the architecture. Itanium performed poorly emulating the x86 architecture, which discouraged migration to it. The series was discontinued with the last shipments in July 2021.
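As a sketch of what "scheduling delegated to the compiler" means in practice, here is a minimal greedy packer in Python that groups independent operations into fixed-width bundles. It is a toy, not Itanium's actual bundling rules or any real compiler's scheduler; the operation names, the dependency table, and the slot width of 3 are illustrative assumptions.

```python
def schedule(ops, deps, width=3):
    """Greedily pack operations into bundles of at most `width` slots.

    ops  : operation names in program order
    deps : dict mapping an operation to the set of operations it needs first
    An operation may only be issued once everything it depends on has
    been issued in an earlier bundle.
    """
    done = set()
    remaining = list(ops)
    bundles = []
    while remaining:
        ready = [op for op in remaining if deps.get(op, set()) <= done]
        bundle = ready[:width]
        if not bundle:
            raise ValueError("dependency cycle")
        bundles.append(bundle)
        done.update(bundle)
        remaining = [op for op in remaining if op not in bundle]
    return bundles


# Hypothetical fragment: c needs a and b, d needs c, e is independent.
ops = ["a", "b", "c", "d", "e"]
deps = {"c": {"a", "b"}, "d": {"c"}}
print(schedule(ops, deps))
```

Here the five operations pack into three bundles, `[a, b, e]`, `[c]`, `[d]`, and two of the three bundles are mostly empty slots. How well real code fills those slots is exactly the burden such an architecture places on the compiler.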
I shouldn’t have said “work the compilers should be doing” since that brings to mind exactly what I was trying to avoid by going to higher level, declarative expressions for AND parallelism (conjunction of inputs for functional programming) and OR parallelism (disjunction of inputs for relational programming). There was a lot of work on “auto-vectorizing compilers” going back to the Cray-1’s vector instructions, trying to speed things up with the existing FORTRAN code. There was also work going on with FORTRAN compilers on the 6000 series to take up the remaining functional unit utilization left by Thornton’s nascent superscalar scoreboarding.
With high level programming languages you’re actually relieving the compiler and the instruction decoding hardware of the work of reconstructing the parallelism that was in the programmer’s head all along.
I don’t know if the Itanium would have been able to support the kind of high level programming language parallelism I envision being supported by 1000 Cray-1s accessing 1000 on-chip shared memory banks. But that’s what I had in mind when I spoke of going to higher level programming languages.
A (very) rare video (or even audio) presentation by Seymour Cray, which includes a bit of history in the first several minutes.
The thing to keep in mind about the “cult of personality” surrounding Cray is not just his insistence on a Dunbar Number tribe isolated from middle management, and not just that the isolation be in the countryside, and not just that the isolation go to restriction of telephone conversations from “HQ”, but that a good deal of the system design was done in his head – often during long drives (at times having to keep his kids quiet to avoid distracting him).
This image of the way “communication” occurs inside a human head at the pre-verbal level – ideas emerging into formal representation only after they have had a chance to organically grow in a massively parallel process of intuitive conciliation prior to being butchered into “words” – helped form my ideas about the value of individualism and helped me see how dangerous social environments, with their demands for verbal formulations, can be to creativity.
PS: When Cray Computer Corporation (the GaAs computer company) went out of business, I contacted their GaAs fab guy to find out the disposition of the GaAs line. My recollection was that the potential for GaAs to beat CMOS was there, and not just in speed but in manufacturing cost. But due to industrial path dependencies, and what Sara Hooker has recently called “The Hardware Lottery” regarding AI hardware, it was crowded out of the capitalization feeding trough.