Incoherent Ramblings On Computer Architecture

publié sur 2021-05-06 13 min de lecture

All computers are too bloated, slow, inflexible, complicated, and poorly designed.

The most abstract notion of what a "computer" is its high level architecture. And every "computer" follows the same von Neumann architecture (some just have extra steps).

In its simplest, a von Neumann computer (which is every computer) has three parts: a central processing unit (or CPU), memory (any place which stores data), and something to transmit data from the CPU to memory and vice-versa (this is called a "bus").

Times have changed, and there are many more elements that go into a computer, but they can all still be simplified into these three things. For our examples, let's pretend that hard drives and RAM are the same thing because they essentially do the same things: store data (it's just that RAM is faster), so they're just going to both be called "memory".

So with our simplified example, here's how a computer executes a program (which sits in memory):

The CPU fetches the next instruction of the program from memory.
It executes the instruction.
If the instruction requires a piece of data, it fetches the data from memory.
If the instruction creates data it will need later, it gives the data to memory.
If its the last instruction, it's finished, otherwise go to 1.

The task of any program is to change the contents of the memory in some major way, which means endlessly passing data back and forth through the bus. This shared bus between the CPU and memory leads to the "von Neumann bottleneck" as the work that the CPU can do is limited by the data transfer rate of the bus. This can seriously limit the effective processing speed when the CPU is required to perform minimal processing on large amounts of data. The CPU is continually forced to wait for needed data to move to or from memory.

A large part of the traffic in the bottleneck is not just the data but also names of that data, as well as operations and data used only to compute such names. With our current address-based memory architecture, before data can be sent through the bus its address in memory must be in the CPU, so it must either be also sent through the bus from memory or be generated by some CPU operation (operations which also come from memory).

The von Neumann bottleneck was described by John Backus in his 1977 ACM Turing Award lecture. In it he expressed that:

Surely there must be a less primitive way of making big changes in the store than by pushing vast numbers of words back and forth through the von Neumann bottleneck. Not only is this tube a literal bottleneck for the data traffic of a problem, but, more importantly, it is an intellectual bottleneck that has kept us tied to word-at-a-time thinking instead of encouraging us to think in terms of the larger conceptual units of the task at hand. Thus programming is basically planning and detailing the enormous traffic of words through the von Neumann bottleneck, and much of that traffic concerns not significant data itself, but where to find it.

The last sentence, I think is the most important. Many game developers over the past few years have been letting go of object-oriented principles in favour of data-oriented design, which is motivated through efficient use of CPU cache lines, a thing we simplified over in our previous example. Basically, modern CPUs have several caches which can store small amounts of data to alleviate some of the constant back and forth in the buses.

I think it would be easiest to explain the concept with an example.

Imagine we have a city of people and we want to count how many people have the profession of "baker". Using OOP, we'd have a class called Person with an attribute of profression and we'd just loop through each Person object and checking their profession. In C#,⁽¹⁾ it would look like this.

⁽¹⁾ I don't actually know any C#. I wanted to use Python but I thought it would be clearer if I used a more strict OOP language and one that's actually used in game dev.

class Person {
  int size;
  string name;
  string profression;
  int age;

  static void Main(string[] args) {
    // an array of 100,000 people
    Person[] people = new Person[100000];

    //
    // give each person in the array their attributes here
    //

    // count bakers
    int bakerCount = 0;
    foreach (Person p in people) {
      if (p.profression == "baker") {
        bakerCount++;
      }
    }

    Console.WriteLine("Number of bakers: {0}", bakerCount);
  }
}

I'm now realising, since I'm not going to actually run this code, I could have just written pseudocode instead of wasting 20 mins looking at C# tutorials. But this should have been easy enough to understand anyway.

Every time you request a byte of memory that is not already present in one of the CPU caches, an entire cache line is fetched from the memory even if you need only 1 byte.

Let's say the address reference to name and profression is 4 bytes each, and the integer age is also 4 bytes, and maybe let's also add another 8 bytes for class headers. That bring the total class size to 16 bytes. If we take in consideration that on a modern CPU the cache line size is 64 bytes, this means we can fetch at most 4 instances per cache line which will make us fetch from memory 25,000 times to get all 100,000 people. But out of those 64 bytes we are only using 16 bytes (4 of the profression references), so we're wasting 48 bytes

In the worst case scenario, the instances are not allocated one after another and we can fetch only one instance per cache line because they aren't in a contiguous block of memory. For the entire city that means we have to fetch from memory 100,000 times and for every cache line fetched we're wasting 60 bytes.

Since fetching the data from memory is expensive and prone to the von Neumann bottleneck, we would prefer to keep the number of memory fetches as small as possible. This could be achieved in two ways:

By reducing the amount of data that has to be fetched for our task.
By keeping the necessary data in contiguous blocks in order to fully utilize the cache lines.

We can improve this by modelling the entire city at once instead of each individual person.

class City {
  int size;
  string[] names;
  string[] profressions;
  int[] ages;

  static void Main(string[] args) {
    City city = new City();

    //
    // fill the city with people and update `size` here
    //

    // count bakers
    int bakerCount = 0;
    for (int i = 0; i < city.size; i++) {
      if (city.profressions[i] == "baker") {
        bakerCount++;
      }
    }

    Console.WriteLine("Number of bakers: {0}", bakerCount);
  }
}

Both examples are algorithmically equivalent O(n), but the data-oriented solution will outperform the object-oriented one.⁽²⁾

⁽²⁾ In theory. I don't know how to run C# and actually benchmark it.

The data oriented example fetches less data and that data is fetched in contiguous chunks (because that's how arrays work). So now we can fetch 16 profression references at once to fill up the cache line without any waste. Since we are only fetching the data that we need, we only have to fetch from memory 6,250 times, a 4x speed-up from our previous best-case scenario, and a 16x speed-up from the worst-case.

I'm sure you could think of some very obvious drawbacks data-oriented design (although most of them come from the fact that C# is object-oriented and not data-oriented, so it doesn't really have the tools to support it), but the point is that - in terms of performance, this is objectively better.

The real problem here is that this is C#, a garbage collected language, I thought I wouldn't have to ever think about memory. Yet here I am, talking about cache lines and contiguous bytes. The abstraction clearly isn't working.

There are very many boring attempts to solve the bottleneck (and they mostly just lessen it's impact instead of "solving" it). Things like CPU caching (as above), branch predicting, and implementing the computer as a "system on chip". My favourite way would completely solve our issue⁽³⁾ above (the wasting bytes in the cache part), and it's good 'ol content addressing.

⁽³⁾ If anybody apart from me seemed to care about it, so we had actual working devices using it and wasn't just some niche research topic.

Content-addressable memories (CAM) are an implementation of memory for an associative computational model. The memory of an associative computer takes some of the responsibility for processing. Only intermediate results are exchanged between memory and processor which greatly reduces the amount of communication between them.

CAM is a memory device that uses a technology similar to the one used in RAMs to store information but, contrary to RAM, the CAM selects the physical location based on the data contents. While the RAM requires the address of the location in which the information is stored, the processor has only to describe the data it wants to access and the CAM selects stored data matching the description.

Valid descriptions of memory words usually include combinations of the following properties: matching a binary pattern, being smaller or larger than a value, being in a range of values, being the largest value stored, and being the smallest value stored within a CAM. These comparisons are performed in parallel with the time required to execute the operation essentially independent from the number of words stored in memory.^[1]

Ahh, content addressing, what can't you do?

Obviously this doesn't solve the von Neumann bottleneck as there's still a bus to transfer data, it's just faster. But isn't faster good enough? I'm pretty content with the speed of my M1 MacBook, so why do I keep railing on von Neumann architecture?

Our fixation on von Neumann languages has continued the primacy of the von Neumann computer, and our dependency on it has made non-von Neumann languages uneconomical and has limited their development.

[…]

The dominance of von Neumann languages has left designers with few intellectual models for practical computer designs beyond variations of the von Neumann computer.

[…]

Furthermore, in an effort to introduce storage and to improve their efficiency on von Neumann computers, applicative systems have tended to become engulfed in a large von Neumann system. For example, pure Lisp is often buried in large extensions with many von Neumann features.

These are more quotes from John Backus' lecture in 1977, and they're still incredibly relevant today 44 years later. We haven't improved at all.

To make an effective computer system and make full use of the computer, you are forced into a single model of thinking. To even use a computer system, you are forced into a single model of thinking.

The "von Neumann correspondence" is the recognition that the problems - technical, conceptual and managerial in computer systems design correspond to certain problems in the user interface.^[2] The communication channels between user and computer have a relatively low capacity in comparison with the memory and CPU of the computer.

A CPU can directly ask for what it wants and completely understand it, but the computer has to represent its information in some way for the user. "Modes" in a user interface arise because of this communication bandwidth bottleneck - information not transferred between computer and user increases "modiness".

A modeless system (or, pedantically, a system with precisely one mode) is a system in which user commands can be viewed as pure state transitions. A modey system is one where commands effect state transitions, but possibly also change the command set.

So a modeless system would be WYSIWYG (what you see is what you get), where there is a simple relation between information transmitted from the computer to the user and the results obtained). That is, what the user can see (information transmitted from computer to user) is what the user gets (information transmitted from computer to results).

Lisp can be considered WYSIWYG (if you're me, and bend the facts enough) because its user interface compares to what data is being processed, where once an expression is evaluated, it is treated as data.

Having to unnecessarily refine the distinction between program and data (or commands and state) is a direct result of von Neumann architecture. Data is finite and program is infinitary (where they could possibly do an infinite number of things). This means, where data can be compared for equality by an effective procedure, a program cannot. Equally, we can observe that a user can compare results and other data for equality by inspection (however tedious this may be), but has no effective procedure for comparing commands.

Non-von Neumann languages permit infinitary data (e.g. lazy structures). Similar distinctions arise in user interface design, for instance between commands the user submits and the data that is being manipulated. In a "non-von Neumann" user interface, the user could manipulate commands (program) as data (like with "undoing").

Modes and WYSIWYG are generally recognised as major issues in the design of easy to use interactive systems. Typically, claims are made that "good" systems are WYSIWYG and have few modes.

I don't think I'm saying what I want to say very clear. So let's back up for a bit.

In order to work with a computer system (effectively), you need to understand how the computer system works (as in, you do what the computer wants and not the other way around). I know this is completely okay with you, because you know how a computer works. That's great dude. But most people don't.

The most high-level programming languages hide their abstractions and suffer from it, making them less useful then they could and should be. The simplest and most intuitive UIs hide their implementation detail and suffer from it, making them less useful then they could and should be.

There are numerous indications that the applicative style of programming can become more powerful than the von Neumann style. Therefore it is important for programmers to develop a new class of history-sensitive models of computing systems that embody such a style and avoid the inherent efficiency problems that seem to attach to lambda-calculus based systems. Only when these models and their applicative languages have proved their superiority over conventional languages will we have the economic basis to develop the new kind of computer that can best implement them. Only then, perhaps, will we be able to fully utilize large-scale integrated circuits in a computer design not limited by the von Neumann bottleneck.

Have some imagination, use Lisp, and one day, maybe we'll be able to write Haskell without monads.

^[1]: https://pdxscholar.library.pdx.edu/cgi/viewcontent.cgi?article=5186&context=open_access_etds

^[2]: https://ieeexplore.ieee.org/document/209312