Content Addressed Identification

publié sur 2021-03-01 13 min de lecture

I hate having to think of titles for these posts. Since I don't plan any of these things out, when I start writing I'll have no real idea of how it will turn out, and they might go in a completely different direction to what I originally thought.

This is most problematic with the slugs I use for the post path. Have you noticed that the previous post is on the path /post/invest_in_oil, but has very little to do with investing except for a joke at the end? That's because the post went through 3 rewrites and I was too lazy to change the file name of the markdown document from my first draft.

If you've been here for a long time, you might also remember that the path to each post was on /article/ instead of /post/. I've changed this location some time ago, but apparently some search engines still link to the old path, forcing me to set-up a redirect from /article/ to /post/ or people would be directed to a dead link.

If you've assigned a label to a thing and that thing changes, are you supposed to give it a new label or just try to tell people that the label means something different and hope they understand? What happens if a label is incorrect from inception, but it caught on so people just keep using it?⁽¹⁾

⁽¹⁾ Like a certain political ideology.

There's a very simple answer to all of these: don't have labels and describe things by what they are.

How would this work? Let's look at some examples.

IPFS

IPFS (the InterPlanetary File System) is a hypermedia distribution protocol addressed by content and identities. It enables the creation of completely distributed applications, and in doing so aims to make the web faster, safer, and more open.

IPFS is a distributed file system that seeks to connect all computing devices with the same system of files. In some ways, this is similar to the original aims of the Web, but IPFS is actually more similar to a single BitTorrent swarm exchanging Git objects. You can read more about its origins in the paper IPFS - Content Addressed, Versioned, P2P File System.

This is what the web should be. IPFS is a peer-to-peer (p2p) storage network. Content is accessible through peers located anywhere in the world, that might relay information, store it, or do both. IPFS knows how to find what you ask for using its content address rather than its location.

I highly encourage you to read more about it. HTTP is a flawed protocol and IPFS eliminates the need for websites to have a central origin server, making it perhaps our best chance to entirely re-architect the Internet — before its own internal contradictions unravel it from within. I'm only going to talk about content addressing here, but there is a lot more that makes it work.

IPFS uses content addressing to identify content by what's in it rather than by where it's located. With HTTP (and every file system based on the Filesystem Hierarchy Standard), we identify content by it's location, e.g:

https://justin.duch.me/post/content_based_identification
/Users/justin/dev/github.com/beanpuppy/justin.duch.me/_posts/2021-03-03-content_based_identification.md

By contrast, every piece of content that uses the IPFS protocol has a content identifier, or CID, that is its hash. Many distributed systems make use of content addressing through hashes as a means for not just identifying content but also linking it together - everything from the commits that back your code to the blockchains that run cryptocurrencies leverage this strategy. However, the underlying data structures in these systems are not necessarily interoperable.

This is where the Interplanetary Linked Data (IPLD) project comes in. IPLD translates between hash-linked data structures allowing for the unification of the data across distributed systems. IPLD provides libraries for combining pluggable modules (parsers for each possible type of IPLD node) to resolve a path, selector, or query across many linked nodes, allowing you to explore data regardless of the underlying protocol. IPLD provides a way to translate between content-addressable data structures: "Oh, you use Git-style, no worries, I can follow those links. Oh, you use Ethereum, I got you, I can follow those links too!"

IPFS/IPNS hashes are big, ugly strings that aren't easy to memorize. So IPFS gives us DNSLink which allows you to use the existing Domain Name System (DNS) to provide human-readable links to IPFS/IPNS content. It does this by allowing you to insert the hash into a TXT record on your nameserver.

With this, IPFS has archiving and permanency in built into the protocol. Let's say this post has a CID like F6450F3FB903⁽²⁾, which would be the hash of all the text in here. Someone can then pin it to their node and start serving the post as well (as they should). Now, If I were to change anything in this post I would get a completely new CID, maybe something like 4F579386372C. But the other nodes will still be serving the old post F6450F3FB903.

So while I didn't really do anything, the nature of the protocol saved the history of this post.

⁽²⁾ CIDs don't actually look like this.

Of course, people want to update and change content all the time and don't want to send new links every time they do it. This is possible with IPFS in a number of ways, such as using a Mutable File System (MFS) or the previously mentioned DNSLink.

It's important to remember in all of these situations, using IPFS is participatory and collaborative. If nobody using IPFS has the content identified by a given address available for others to access, you won't be able to get it. On the other hand, content can't be removed from IPFS as long as someone is interested enough to make it available, whether that person is the original author or not. Note that this is similar to the current web, where it is also impossible to remove content that's been copied across an unknowable number of websites; the difference with IPFS is that you are always able to find those copies.

Unison

Unison is a functional, typed language largely influenced by Haskell, Erlang and a research language called Frank. Unison treats a codebase as an content addressable database where the "content" is a function definition. In Unison, the "codebase" is a somewhat abstract concept (unlike other languages where a codebase is a set of files) where you can inject definitions, somewhat similar to a Lisp image.

One can think of a program as a graph where every node is a definition and a definition's content can refer to other definitions. Unison content-addresses each node and aliases the address to a human-readable name.

When writing a new function, Unison calculates the hash of the implementation and instead of storing text files, what you save in the code base is the abstract syntax tree (AST) of the function, where references to other functions are made using the corresponding hashes. In this way, a management of the codebase is achieved that allows, among other things, the following:

Not having to recompile anything
Trivial renaming
Cache test results
Eliminate dependency conflicts
Persistent typing and simple storage

The Unison codebase is append only so definitions are never modified or deleted, only new ones are added, which means it can be versioned and synchronised with Git or similar tools without ever generating conflicts, and many types of information can be cached without worrying about invalidation.

The Unison codebase manager is the piece that makes all these things possible by storing the AST and becoming the only source of truth and not relying on textual representations found in text files. The human readable names you give are stored separately from the definitions, so renaming is fast and 100% accurate because Unison only needs to change the name associated with the hash in one place.

We're used to thinking of a program as a thing that describes what a single OS process will do, and then using a separate layer of technologies outside of our programming languages to "configure" many separate programs into a single distributed, elastic "system". This gets complicated. The core language of Unison starts with the premise that no matter how many nodes a computation occupies, it should be expressible via a single program, not many separate programs. Unison programs can describe their own deployment, elastically scale and orchestrate themselves, and deploy themselves in parallel onto any number of nodes for execution.

Since Unison is based of Haskell, it has a ML/Haskell type syntax which can be off-putting for some people. But because the codebase is actually just an AST, it would technically be possible to have many different syntaxes for the codebase manager to translate into an AST. There doesn't look like there's been that much activity on that front from the creators, but it is planned and is something I am very interesting in seeing be done and I'll be keenly following it's progress in the GitHub issue.

Speaking of GitHub, one disadvantage that Unison has is that it doesn't really work with Git repo viewers like GitHub. Since Unison doesn't really deal with text files, there's no way to actually view the codebase of a Unison project with GitHub as you can see by their base libraries. It seems there will need to be a special GitHub-like site for Unison that integrates the Unison codebase manager.

This isn't a programming language post, so I skipped out on any code snippets, but I'd recommend reading the tour to learn more.

Nix

Nix is a purely functional package manager. Which means packages are built by functions that don't have side-effects, and they never change after they have been built. Nix stores packages in the Nix store, usually the directory /nix/store, where each package has its own unique subdirectory such as:

/nix/store/b6gvzjyb2pg0kjfwrjmg1vfhh54ad73z-firefox-33.1/

Where b6gvzjyb2pg0… is a unique identifier for the package that captures all its dependencies (it's a cryptographic hash of the package's build dependency graph). Like all the others I've talked about, this gives us some cool and powerful features.

Again, Nix is much more than content addressing, and NixOS is a completely declarative OS (which is arguably much more exciting), but content addressing is an important part of it and what we're here for, so I'm mainly going to talk about that.

With Nix, you can have multiple versions or variants of a package installed at the same time. This is especially important when different applications have dependencies on different versions of the same package — it prevents the "DLL hell". Because of the hashing scheme, different versions of a package end up in different paths in the Nix store, so they don't interfere with each other.

It also gives easy multi-user support. This means that non-privileged users can securely install software. Each user can have a different profile, a set of packages in the Nix store that appear in the user's PATH. If a user installs a package that another user has already installed previously, the package won't be built or downloaded a second time. At the same time, it is not possible for one user to inject a Trojan horse into a package that might be used by another user because the hash would become different.

Since package management operations never overwrite packages in the Nix store but just add new versions in different paths, they are atomic. So during a package upgrade, there is no time window in which the package has some files from the old version and some files from the new version — which would be bad because a program might well crash if it's started during that period. And since packages aren't overwritten, the old versions are still there after an upgrade. This means that you can rollback to the old version.

Nix is also extremely useful for developers as it makes it easy to automatically set up the build environment for a package. In Nix, you can look at any package and compute the entire dependency graph. This graph includes not only the names of the packages, but also precisely which version⁽³⁾ of each package was used as a build input. This goes for both build inputs and runtime dependencies.

⁽³⁾ By "version" I mean not only the version as listed in the release notes, but every detail of how the package was built: which version of Python was used? And then transitively: what version of C was used to compile that Python?

Thanks to Nix's enforced determinism, you can trivially build container environments (and all of their required packages) in parallel across a fleet of build machines, similar to Docker.

In a container environment, Nix allows you to share packages with the host machine, and then each Nix container you spin up has the necessary runtime dependencies bind-mounted into the container's root file system. As a result, Nix has much better de-duplication (read: zero) than Docker. Also, by the same graph traversal mechanism, it's trivial to take any environment, serialize the runtime dependency graph, and send that graph to another machine, back it up somewhere, create a bootable ISO for CD/USB, do whatever.

These containers perform better than Docker as they don't use virtualisation and for multi-container deployment, there is NixOps. You write some Nix files describing the VMs you want to deploy, and run a bash command to deploy them to various back-ends (AWS, Azure, etc…). Again, the big difference here is that you can incrementally modify these VMs in a safe way using Nix. If you change your deployment config file, Nix will figure out something has changed, and modify the corresponding VMs to achieve the desired state.

Everything I've shown is still new and early in development. Nix is probably the most robust of the bunch, but even then it can get confusing as hell. Now obviously you can't content address everything, unless we start figuring out how to hash physical objects, but content addressing and declarative systems are very interesting concepts I'd like to see in more things.

I think there's something to be said about how content addressing has enabled so many benefits in different ways to a variety of projects. You could easily apply it to real life, where you can say that: the descriptiveness of natural languages makes casual, arbitrary names of abstract concepts too ambiguous and open to interpretation. So referring to something by what it is rather than what it's called is generally better.

Or in another, simpler way:

PLEASE STOP CALLING YOURSELF AN ANARCHO-CAPITALIST.