Content-addressable everything

The most important idea in git isn’t branches. It isn’t distributed workflow. It isn’t the rebase vs. merge argument. It’s that the identity of a commit is the hash of its content.

That sentence is easy to gloss past. It’s also the reason git won and the reason every version-control system that came after it (Mercurial aside) looks like git. The content-addressable store made possible a whole family of properties that previous systems had to fight for and usually failed to get. Reproducibility. Deduplication. Untrusted replication. Offline work. Cheap branching. All of these fall out of one decision: stop naming things by where they live or when they were made. Name them by what they are.

Now apply the same decision to a running machine. The consequences are at least as interesting.

What location-addressing gets wrong

Almost everything in systems has historically been location-addressed. A file lives at /etc/nginx.conf. A row lives at primary-key 12. A VM lives on node-042. When you want the thing, you ask for its location, and whoever owns that location gives it to you.

Location-addressing has one deep problem: the name doesn’t tell you what the content is. Two files at the same path on two different servers can be wildly different. Two VMs with the same name on different days can be unrelated. The address is an index into mutable storage; the same address can point to different bytes over time.

Version control before git inherited this model. Subversion named every file by a path plus a repository-wide, monotonically increasing revision number. If you wanted “the file as it was at revision 1428,” you asked the central server to look it up. If you wanted to know whether your copy of src/foo.c@1428 was the same bytes as somebody else’s src/foo.c@1428, you asked the server, because the revision number didn’t encode the content. Two servers could disagree about what @1428 meant and you’d have no way to tell.

This is a bigger problem than it sounds like. It means:

You need a trusted central server to resolve names
Replication is lossy by default (two replicas can silently drift)
Deduplication requires a coordinator who can compare content
The identity of history depends on who’s telling you about it

These aren’t minor ergonomic issues. They make whole classes of software hard to build.

Git’s idea

Linus’s trick, stolen from earlier content-addressable stores, was: every object in the repository is named by the SHA-1 of its own bytes. Git prepends a short type-and-length header to each object before hashing, but the idea is: a file blob’s name is sha1(contents). A tree’s name is sha1(tree entries). A commit’s name is sha1(tree-ref + parent-refs + metadata). Four consequences fall out of this, each of which would have been worth a paper on its own:

1. Identity equals content

Barring a hash collision, two objects with the same hash are byte-identical. Two objects with different hashes are not. You never have to ask “is this the same version I saw before”; you look at the hash. No coordination, no server call, no ambiguity.
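To make this concrete, here is a minimal Python sketch of how git names a blob. Git hashes a short header ("blob", the byte length, and a NUL byte) followed by the raw contents. The function name git_blob_id is ours, chosen for illustration.

```python
import hashlib


def git_blob_id(contents: bytes) -> str:
    """Return the SHA-1 object ID git assigns to a blob with these contents."""
    # Git hashes a header ("blob <size>\0") followed by the raw bytes.
    header = b"blob " + str(len(contents)).encode("ascii") + b"\0"
    return hashlib.sha1(header + contents).hexdigest()


if __name__ == "__main__":
    first = git_blob_id(b"hello\n")
    second = git_blob_id(b"hello\n")
    changed = git_blob_id(b"hello!\n")
    print(first == second)   # same bytes, same name
    print(first == changed)  # different bytes, different name
```

The result should match what `git hash-object` reports for a file with the same bytes, which is the whole point: anyone can compute the name without asking a server.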

2. Deduplication for free

If two repositories, two branches, or two machines ever produce the same content, they produce the same hash. Storage systems can deduplicate aggressively without any semantic understanding of what they’re storing. The whole world of git objects, across every repo on earth, is a single flat namespace where identical content hashes identically.
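A toy store shows why no coordinator is needed. This is a sketch, not git’s object database: the ContentStore class and its methods are hypothetical, and it uses SHA-256 over the raw bytes for simplicity.

```python
import hashlib


class ContentStore:
    """A toy content-addressed store: objects are keyed by the SHA-256 of their bytes."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        # Storing identical bytes again is a no-op because the key already exists.
        self._objects.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def __len__(self) -> int:
        return len(self._objects)
```

Putting the same bytes twice returns the same key and leaves one stored object. The store never compares contents semantically; the hash does all the work.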

3. Reproducibility

“The project at commit c1a2b3c4” means the same thing forever, everywhere, to everyone. There is no “my version of c1a2b3c4” vs “your version of c1a2b3c4.” If the hash matches, the bytes match. This is the only reliable way to reproduce a build: not by describing the source, but by pinning its content hash.

4. Untrusted replication

Because the hash is derived from content, you can pull objects from anyone (a stranger on the internet, a mirror whose operator you don’t know, a cached copy from a proxy) and verify their integrity by recomputing the hash. As long as you got the commit ID from somewhere you trust, the sender can’t lie about what that commit is. They can refuse to give it to you, but they can’t substitute a different commit under the same name.
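The check itself is a few lines. In this sketch, fetch_verified and IntegrityError are hypothetical names, the fetch callable stands in for any untrusted transport, and SHA-256 over the raw bytes stands in for whatever hash scheme the store uses.

```python
import hashlib
from typing import Callable


class IntegrityError(Exception):
    """Raised when fetched bytes do not hash to the requested ID."""


def fetch_verified(object_id: str, fetch: Callable[[str], bytes]) -> bytes:
    """Fetch an object from an untrusted source and check it against its ID."""
    data = fetch(object_id)
    actual = hashlib.sha256(data).hexdigest()
    if actual != object_id:
        # The source returned bytes that do not match the name we asked for.
        raise IntegrityError(f"expected {object_id}, got {actual}")
    return data
```

The caller only needs to trust where the ID came from. The source of the bytes can be anyone.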

The decade after git

Much of the infrastructure software built in the decade after git borrowed this trick, sometimes explicitly, usually without acknowledging the debt:

Docker image layers: content-addressed. You can pull a layer from any registry; the hash verifies authenticity and enables dedup.
IPFS: the whole filesystem is content-addressed. Same idea, generalized beyond source code.
Nix: every store path is named by a hash, by default a hash of the build inputs. Two builds with identical inputs get identical names. Reproducible builds become the default instead of an aspiration.
Blockchain: every block is hashed, every transaction is a commitment to content. The mechanism is content-addressing at the tip of an append-only log.
CDN caches: static asset URLs are increasingly hash-suffixed (app.c1a2b3c4.js) so changing content can’t silently collide with cached versions.

Every one of these is the same move: stop referring to things by where they are; refer to them by what they are. And every one of these, in its domain, unlocked capabilities that were painful or impossible before.

The primitive still missing

The glaring hole in this list is the thing we interact with every day: running computers. VMs are still location-addressed. “The staging VM.” “VM-042 on node-07.” “The EC2 instance in us-east-1c.” Every one of those names refers to a specific machine at a specific address, and what’s running on it is mutable — anything can happen to it between now and the next time you look. This is why:

Snapshots are second-class. You take a snapshot, you get an ID, the ID refers to a blob in some vendor’s storage. It’s not a commit; it’s a reference to a managed object. Different vendors use different schemes. You can’t copy a snapshot from AWS to GCP without re-encoding everything.
Reproducing production is hard. “The bug only happens in prod” is a phenomenon that exists only because we lack a way to say “give me a byte-identical copy of the machine that was running at the moment of the bug.” We can approximate with AMIs, runbooks, terraform, but each of those is a recipe for approximately the same thing, not a way to get the same thing exactly.
CI preview environments are a bespoke feature. Every company that offers them — Vercel, Netlify, Render — had to invent per-stack machinery. There’s no universal primitive to say “branch the production machine at this commit, make it the preview.”
Dev/prod drift is endless. Your local environment is an arbitrary accumulation of years of apt-get calls. Prod is a different accumulation. They diverge because there’s no hash of their content to compare against.

All of these are the VCS problems of the 90s. All of them get solved by the same move: make running machines content-addressable.

Content-addressable machines

Apply git’s idea to a VM:

The content of a VM at a moment in time is its memory state plus its filesystem state plus some minimal metadata (CPU registers, open files, etc.)
Hash that content. Call the hash the commit ID.
Two VMs with the same hash are byte-identical running machines.
Two different hashes are different machines.
A VM commit is immutable. A machine at commit c1a2b3c4 is that machine forever, restorable anywhere, verifiable by anyone.
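One way to picture the list above is a commit ID derived from a canonical manifest of the machine’s state. This is an illustrative sketch only. The manifest layout, the 4 KiB page size, and the names vm_commit_id and page_hashes are assumptions for the example, not a description of any real VM format.

```python
import hashlib
import json
from typing import Optional

PAGE_SIZE = 4096  # assumed page size for this sketch


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def page_hashes(memory: bytes) -> list[str]:
    """Hash memory one fixed-size page at a time."""
    return [sha256_hex(memory[i:i + PAGE_SIZE]) for i in range(0, len(memory), PAGE_SIZE)]


def vm_commit_id(memory: bytes, files: dict[str, bytes],
                 metadata: dict, parent: Optional[str]) -> str:
    """Derive a commit ID from a canonical manifest of the machine's state."""
    manifest = {
        "parent": parent,
        "memory_pages": page_hashes(memory),
        "files": {path: sha256_hex(data) for path, data in files.items()},
        "metadata": metadata,  # must be JSON-serializable in this sketch
    }
    # sort_keys and fixed separators make the serialization deterministic.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return sha256_hex(canonical.encode("utf-8"))
```

Hashing a manifest of page hashes rather than one giant byte string means a single changed page changes the commit ID while every unchanged page keeps its own name.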

Now work out the consequences:

Deduplication across the fleet

A thousand VMs that were all branched from the same golden image share their base layers. They don’t each copy ten gigabytes — they each reference the same ten gigabytes by hash. Storage usage is proportional to divergence, not to number of VMs. A project that spawns a million ephemeral workers doesn’t need a million times the disk.
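A small simulation makes "proportional to divergence" concrete. It is a sketch under assumptions: images are split into 4 KiB pages, each page is stored by its SHA-256, and every branched VM dirties exactly one page. The names store_image and simulate_fleet are hypothetical.

```python
import hashlib

PAGE_SIZE = 4096  # assumed page size for this sketch


def store_image(image: bytes, store: dict[str, bytes]) -> list[str]:
    """Split an image into pages, store each page by hash, return the page IDs."""
    refs = []
    for offset in range(0, len(image), PAGE_SIZE):
        page = image[offset:offset + PAGE_SIZE]
        key = hashlib.sha256(page).hexdigest()
        store.setdefault(key, page)  # pages already present cost nothing extra
        refs.append(key)
    return refs


def simulate_fleet(num_vms: int, pages_per_image: int) -> int:
    """Branch num_vms machines from one golden image; each dirties one page.

    Returns the number of unique pages the store ends up holding.
    """
    store: dict[str, bytes] = {}
    # Give every golden page distinct contents so the base image does not dedup.
    golden = b"".join(i.to_bytes(4, "big") * (PAGE_SIZE // 4)
                      for i in range(pages_per_image))
    store_image(golden, store)
    for vm in range(num_vms):
        image = bytearray(golden)
        # Each VM writes a unique marker into its first page.
        image[0:8] = (vm + 1).to_bytes(8, "big")
        store_image(bytes(image), store)
    return len(store)
```

With 1,000 VMs and a 16-page image, a naive copy would hold 16,000 pages. Here the store holds the 16 golden pages plus one diverged page per VM.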

Reproducibility without runbooks

“Run the workload against c1a2b3c4” is a thing you can say today, tomorrow, next year. The commit is byte-identical. Runbooks become unnecessary. The environment is the hash.

Moving state becomes moving hashes

You don’t have to ship gigabytes of VM image to another region, another cluster, another customer. You ship the hash, and the receiving side either already has the bytes or pulls them on demand. Exactly like pulling a git object you don’t have yet. Same mental model, whole-machine granularity.
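The transfer logic can be sketched as a set difference followed by a verified copy. The function name sync_commit and the dictionaries standing in for the remote and local stores are assumptions made for illustration.

```python
import hashlib


def sync_commit(page_ids: list[str], remote: dict[str, bytes],
                local: dict[str, bytes]) -> int:
    """Pull only the objects the local store lacks; return how many were copied."""
    transferred = 0
    for page_id in page_ids:
        if page_id in local:
            continue  # already have these bytes, nothing to ship
        data = remote[page_id]
        # Verify before accepting, since the remote is not assumed trustworthy.
        if hashlib.sha256(data).hexdigest() != page_id:
            raise ValueError(f"object {page_id} failed verification")
        local[page_id] = data
        transferred += 1
    return transferred
```

What crosses the wire first is the list of hashes. The bytes follow only for the IDs the receiver has never seen.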

Merges, forks, and provenance

If a VM’s identity is its content hash, and each commit records its parent hash, you have a directed acyclic graph of state over time. You can walk it. You can ask “what VMs derive from commit c1a2b3c4?” You can identify the lineage of a production bug. You can branch off an old commit to test a fix against the state at the time of the failure.
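The lineage questions reduce to graph walks over parent pointers. A minimal sketch, assuming a hypothetical Commit record and an in-memory graph keyed by commit ID (the short labels in the usage note stand in for real hashes):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    commit_id: str
    parents: tuple[str, ...] = ()


def ancestors(graph: dict[str, Commit], commit_id: str) -> set[str]:
    """Every commit reachable by following parent links from commit_id."""
    seen: set[str] = set()
    stack = list(graph[commit_id].parents)
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(graph[current].parents)
    return seen


def descendants(graph: dict[str, Commit], commit_id: str) -> set[str]:
    """Every commit that has commit_id somewhere in its ancestry."""
    return {cid for cid in graph if commit_id in ancestors(graph, cid)}
```

Given a graph where "a" and "c" both branch from "base" and "b" branches from "a", descendants(graph, "base") returns all three, which is the "what derives from this commit" query. A production store would index children rather than rescanning the graph.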

Trust without a central authority

Because the hash is derived from content, you can verify a commit is what it claims to be by rehashing its bytes. This matters for air-gapped environments, untrusted transport, and adversarial scenarios. A malicious intermediary can’t give you a different VM under the same commit ID. The commit is what the commit is.

Cross-vendor portability

A content-addressable VM format doesn’t care where it runs. If two clouds implement the same hash scheme, c1a2b3c4 on AWS and c1a2b3c4 on GCP are the same machine. The vendor lock-in that plagues snapshots goes away the moment the ID is derived from content rather than issued by the vendor.

What this unlocks

Once VMs are content-addressable, a bunch of workflows that were always painful become obvious:

Bisecting production bugs. Walk the commit history, restore any commit, test against it. Same loop as git bisect — applied to whole machines.
Sharing environments. Send someone a commit ID. They restore. They’re running the exact same machine as you.
Cache warming across the fleet. A warmed cache is just a committed state. Share the commit, skip the warming.
Deterministic CI. Build in a committed environment. The commit is the reproducibility guarantee. Same commit, same output forever, as long as the build itself is deterministic.
Rollback with memory. Restoring to c1a2b3c4 isn’t “redeploy the binaries from commit c1a2b3c4.” It’s “boot the exact machine, memory and processes included, that was running when you committed c1a2b3c4.” Full rollback, not a re-approximation.
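The bisect workflow from the list above is an ordinary binary search over commit history. In this sketch the is_bad callable is a stand-in for "restore this commit and run the failing test", and the function name bisect_first_bad is ours.

```python
from typing import Callable, Sequence


def bisect_first_bad(history: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first bad commit in a linear history ordered oldest to newest.

    Assumes history[0] is good, history[-1] is bad, and that once a commit
    is bad every later commit is bad too.
    """
    lo, hi = 0, len(history) - 1  # invariant: history[lo] good, history[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(history[mid]):
            hi = mid
        else:
            lo = mid
    return history[hi]
```

Each probe halves the search range, so the number of restores grows with the logarithm of the history length. That is why restore cost matters so much: the loop is only practical when each restore is cheap.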

These aren’t new ideas in the abstract — they’re the obvious consequences of content-addressing, as soon as the primitive is cheap. The reason none of them were standard-issue is that content-addressing running machines requires sub-millisecond branching and restoration. Without that, you can theorize about content-addressable VMs but you can’t use them in a tight loop.

Where git’s model breaks down

The analogy isn’t perfect. Two places it diverges:

A running VM includes memory. Git only had to hash bytes on disk. Memory pages change faster and are larger than source code. A content-addressable VM store has to dedup memory aggressively or it drowns in storage. Copy-on-write at the memory-page level is the way out, but it means the VM’s commit identity has to be defined carefully — it’s the hash of the semantic state, not naively the hash of the memory image.
Running is state itself. A git commit is a static thing. A VM commit is the state of a running process — a pause point in a continuous execution. Restoring a VM commit isn’t like checking out a file; it’s resurrecting a process tree with its open sockets and in-memory data structures. This is harder than git, and it’s why the primitive wasn’t cheap until recently.
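Copy-on-write at page granularity can be sketched as a chain of overlays. This is a toy model, not how a hypervisor implements it: the CowMemory class is hypothetical, pages are plain byte strings, and the parent is assumed to be treated as immutable once branched.

```python
from typing import Optional

PAGE_SIZE = 4096  # assumed page size for this sketch


class CowMemory:
    """Copy-on-write page table: reads fall through to the parent, writes stay private."""

    def __init__(self, parent: Optional["CowMemory"] = None) -> None:
        self._parent = parent
        self._pages: dict[int, bytes] = {}

    def read(self, page_no: int) -> bytes:
        node: Optional["CowMemory"] = self
        while node is not None:
            if page_no in node._pages:
                return node._pages[page_no]
            node = node._parent
        return bytes(PAGE_SIZE)  # pages never written read as zeros

    def write(self, page_no: int, data: bytes) -> None:
        # Only the writer's overlay changes; the parent's page is untouched.
        self._pages[page_no] = data

    def branch(self) -> "CowMemory":
        """Create a child that shares every page until it writes."""
        return CowMemory(parent=self)

    def private_pages(self) -> int:
        return len(self._pages)
```

A fresh branch holds zero private pages, so its storage cost starts at nothing and grows only with the pages it dirties.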

Both of these are engineering problems, not conceptual ones. The conceptual move — identity equals content — transfers directly.

The shift

Once you’ve seen the move, you start seeing everywhere it hasn’t been applied yet. Kubernetes pod state. Database replicas. Long-running LLM sessions. Build caches. Dev environments. Every one of them is a location-addressed thing today that should probably be content-addressed tomorrow. The reason they’re not yet is almost always the same reason VMs weren’t: the primitive to do it cheaply didn’t exist. Once it does, the pattern spreads.

This is what Vers is: the observation that running machines deserved what source code got twenty years ago. Everything that falls out of that observation (branching, restoring, snapshotting, deduplication, reproducibility, shareable state) falls out quickly, because it’s the same story git told, applied to a bigger artifact.

Further reading

Architecture

How Vers commits are actually content-addressed. Overlay filesystem, copy-on-write memory, the commit graph.

Time-travel debugging

What debugging looks like when every commit is a restoration point.

Why stateless ran out

The case for treating running state as something worth preserving at all.

Core concepts

Projects, VMs, commits — the user-facing model of content-addressed running machines.
