The most important idea in git isn’t branches. It isn’t distributed workflow. It isn’t the rebase vs. merge argument. It’s that the identity of a commit is the hash of its content.
That sentence is easy to gloss past. It’s also the reason git won and the reason every version-control system that came after it, Mercurial aside, looks like git. The content-addressable store made possible a whole family of properties that previous systems had to fight for and usually failed to get. Reproducibility. Deduplication. Untrusted replication. Offline work. Cheap branching. All of these fall out of one decision: stop naming things by where they live or when they were made. Name them by what they are.
Now apply the same decision to a running machine. The consequences are at least as interesting.
What location-addressing gets wrong
Almost everything in systems has historically been location-addressed. A file lives at /etc/nginx.conf. A row lives at primary-key 12. A VM lives on node-042. When you want the thing, you ask for its location, and whoever owns that location gives it to you.
Location-addressing has one deep problem: the name doesn’t tell you what the content is. Two files at the same path on two different servers can be wildly different. Two VMs with the same name on different days can be unrelated. The address is an index into mutable storage; the same address can point to different bytes over time.
Version control before git inherited this model. Subversion gave every file a path and a monotonically-increasing revision number. If you wanted “the file as it was at revision 1428,” you asked the central server to look it up. If you wanted to know whether your copy of src/foo.c@1428 was the same bytes as somebody else’s src/foo.c@1428, you asked the server, because the revision number didn’t encode the content. Two servers could disagree about what @1428 meant and you’d have no way to tell.
This is a bigger problem than it sounds like. It means:
- You need a trusted central server to resolve names
- Replication is lossy by default (two replicas can silently drift)
- Deduplication requires a coordinator who can compare content
- The identity of history depends on who’s telling you about it
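The drift problem fits in a few lines. This is a minimal sketch, assuming two hypothetical replicas modelled as plain dictionaries keyed by `path@revision`; the names `replica_a`, `replica_b`, and `same_name_same_bytes` are illustrative and not part of any real system.

```python
# Two hypothetical replicas of a location-addressed store.
# Keys are names ("path@revision"); values are the bytes stored under that name.
replica_a = {"src/foo.c@1428": b"int main(void) { return 0; }\n"}
replica_b = {"src/foo.c@1428": b"int main(void) { return 1; }\n"}  # silently drifted


def same_name_same_bytes(name: str, a: dict, b: dict) -> bool:
    """The only way to check agreement is to fetch and compare the content itself."""
    return a[name] == b[name]


name = "src/foo.c@1428"
# Both replicas happily resolve the name...
assert name in replica_a and name in replica_b
# ...but nothing in the name reveals that they disagree.
drifted = not same_name_same_bytes(name, replica_a, replica_b)
```

Detecting the drift required shipping both copies to one place and comparing them, which is exactly the coordination step the list above describes.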
Git’s idea
Linus’s trick, stolen from earlier content-addressable stores, was: every object in the repository is named by the SHA of its own bytes. A file blob’s name is sha1(contents). A tree’s name is sha1(tree entries). A commit’s name is sha1(tree-ref + parent-refs + metadata).
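To make the blob case concrete, here is a small sketch of how a blob ID can be computed in a SHA-1 repository. One detail the shorthand sha1(contents) leaves out: git hashes a short header (`blob <size>` followed by a NUL byte) ahead of the raw bytes. The function name `git_blob_id` is chosen here for illustration.

```python
import hashlib


def git_blob_id(contents: bytes) -> str:
    """Return the SHA-1 object ID git assigns to a blob with these contents."""
    # Git hashes a short header ("blob <size>\0") followed by the raw bytes.
    header = b"blob " + str(len(contents)).encode("ascii") + b"\0"
    return hashlib.sha1(header + contents).hexdigest()


# The empty blob has the same ID in every SHA-1 git repository.
empty_id = git_blob_id(b"")
```

Running `git hash-object` on a file with the same bytes should give the same ID, with no repository or server involved in the computation.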
Four consequences fall out of this, each of which would have been worth a paper on its own:
1. Identity equals content
Two objects with the same hash are byte-identical. Two objects with different hashes are not. You never have to ask “is this the same version I saw before”; you look at the hash. No coordination, no server call, no ambiguity.
2. Deduplication for free
If two repositories, two branches, or two machines ever produce the same content, they produce the same hash. Storage systems can deduplicate aggressively without any semantic understanding of what they’re storing. The whole world of git objects, across every repo on earth, is a single flat namespace where identical content hashes identically.
3. Reproducibility
“The project at commit c1a2b3c4” means the same thing forever, everywhere, to everyone. There is no “my version of c1a2b3c4” vs “your version of c1a2b3c4.” If the hash matches, the bytes match. This is the only reliable way to reproduce a build: not by describing the source, but by pinning its content hash.
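Pinning by hash can be sketched as follows. This is an illustration under assumptions, not git's own code: SHA-256 stands in for the hash, a dictionary stands in for whatever store or mirror serves the bytes, and `content_id` and `fetch_pinned` are hypothetical helper names.

```python
import hashlib


def content_id(data: bytes) -> str:
    # SHA-256 is an assumption for this sketch; any collision-resistant hash works.
    return hashlib.sha256(data).hexdigest()


def fetch_pinned(store: dict, pinned_id: str) -> bytes:
    """Return the bytes for pinned_id, refusing anything whose hash does not match."""
    data = store[pinned_id]
    if content_id(data) != pinned_id:
        raise ValueError(f"content does not match pinned id {pinned_id[:8]}")
    return data


source = b"print('hello')\n"
pin = content_id(source)

honest_store = {pin: source}
tampered_store = {pin: b"print('goodbye')\n"}  # different bytes under the same name

ok = fetch_pinned(honest_store, pin)

try:
    fetch_pinned(tampered_store, pin)
    tamper_detected = False
except ValueError:
    tamper_detected = True
```

The consumer never has to trust the store. It only has to recompute one hash and compare it with the pin it already holds.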
4. Untrusted replication
Because the hash is derived from content, you can pull objects from anyone (a stranger on the internet, a mirror you don’t know the operator of, a cached copy from a proxy) and verify their authenticity by recomputing the hash. The sender can’t lie about what a commit is. They can refuse to give it to you, but they can’t substitute a different commit under the same name.
The decade after git
Most software built in the decade after git borrowed this trick, sometimes explicitly, usually without acknowledging the debt:
- Docker image layers: content-addressed. You can pull a layer from any registry; the hash verifies authenticity and enables dedup.
- IPFS: the whole filesystem is content-addressed. Same idea, generalized beyond source code.
- Nix: every package is named by a hash of its build inputs. Two builds with identical inputs land at the same store path, so reproducible builds become the default target instead of an aspiration.
- Blockchain: every block is hashed, every transaction is a commitment to content. The mechanism is content-addressing at the tip of an append-only log.
- CDN caches: static asset URLs are increasingly hash-suffixed (app.c1a2b3c4.js) so changing content can’t silently collide with cached versions.
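The CDN case is the easiest of these to show in code. A minimal sketch, assuming SHA-256 truncated to eight hex digits; real build tools pick their own hash and length, and `hashed_asset_name` is a hypothetical helper.

```python
import hashlib
from pathlib import PurePosixPath


def hashed_asset_name(filename: str, contents: bytes, digits: int = 8) -> str:
    """Insert a short content hash before the extension: app.js -> app.<hash>.js."""
    digest = hashlib.sha256(contents).hexdigest()[:digits]
    path = PurePosixPath(filename)
    return f"{path.stem}.{digest}{path.suffix}"


v1 = hashed_asset_name("app.js", b"console.log('v1');")
v2 = hashed_asset_name("app.js", b"console.log('v2');")
# Different bytes produce different names, so a cache keyed on the name
# can never serve stale content for the new version.
```

Because the name changes whenever the bytes change, the asset can be cached indefinitely under each name.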
The primitive still missing
The glaring hole in this list is the thing we interact with every day: running computers. VMs are still location-addressed. “The staging VM.” “VM-042 on node-07.” “The EC2 instance in us-east-1c.” Every one of those names refers to a specific machine at a specific address, and what’s running on it is mutable: anything can happen to it between now and the next time you look. This is why:
- Snapshots are second-class. You take a snapshot, you get an ID, the ID refers to a blob in some vendor’s storage. It’s not a commit; it’s a reference to a managed object. Different vendors use different schemes. You can’t copy a snapshot from AWS to GCP without re-encoding everything.
- Reproducing production is hard. “The bug only happens in prod” is a phenomenon that exists only because we lack a way to say “give me a byte-identical copy of the machine that was running at the moment of the bug.” We can approximate with AMIs, runbooks, terraform, but each of those is a recipe for approximately the same thing, not a way to get the same thing exactly.
- CI preview environments are a bespoke feature. Every company that offers them — Vercel, Netlify, Render — had to invent per-stack machinery. There’s no universal primitive to say “branch the production machine at this commit, make it the preview.”
- Dev/prod drift is endless. Your local environment is an arbitrary accumulation of years of apt-get calls. Prod is a different accumulation. They diverge because there’s no hash of their content to compare against.
Content-addressable machines
Apply git’s idea to a VM:
- The content of a VM at a moment in time is its memory state plus its filesystem state plus some minimal metadata (CPU registers, open files, etc.)
- Hash that content. Call the hash the commit ID.
- Two VMs with the same hash are byte-identical running machines.
- Two different hashes are different machines.
- A VM commit is immutable. A machine at commit c1a2b3c4 is that machine forever, restorable anywhere, verifiable by anyone.
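The steps above can be sketched in a few lines. This is a toy under stated assumptions, not a description of how Vers or any real hypervisor computes commit IDs: memory and disk are plain byte strings, pages are 4096 bytes, the hash is SHA-256, and `pages_root` and `vm_commit_id` are hypothetical names.

```python
import hashlib
import json

PAGE_SIZE = 4096  # assumed page size for this sketch


def h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def pages_root(image: bytes) -> str:
    """Hash each fixed-size page, then hash the ordered list of page hashes."""
    page_ids = [h(image[i:i + PAGE_SIZE]) for i in range(0, len(image), PAGE_SIZE)]
    return h("\n".join(page_ids).encode("ascii"))


def vm_commit_id(memory: bytes, disk: bytes, metadata: dict) -> str:
    """Derive a commit ID from memory state, filesystem state, and metadata."""
    record = {
        "memory": pages_root(memory),
        "disk": pages_root(disk),
        "metadata": metadata,
    }
    # sort_keys gives a canonical encoding, so equal content yields equal bytes.
    return h(json.dumps(record, sort_keys=True).encode("utf-8"))


memory = bytes(PAGE_SIZE * 2)
disk = b"rootfs" * 1000
meta = {"vcpus": 1, "rip": 0}  # illustrative metadata fields

a = vm_commit_id(memory, disk, meta)
b = vm_commit_id(memory, disk, meta)                    # same content, same ID
c = vm_commit_id(memory[:-1] + b"\x01", disk, meta)     # one byte differs
```

Hashing per page first, rather than hashing the whole image in one pass, is a design choice in this sketch: it lets unchanged pages be recognized and shared between commits.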
Deduplication across the fleet
A thousand VMs that were all branched from the same golden image share their base layers. They don’t each copy ten gigabytes; they each reference the same ten gigabytes by hash. Storage usage is proportional to divergence, not to number of VMs. A project that spawns a million ephemeral workers doesn’t need a million times the disk.
Reproducibility without runbooks
“Run the workload against c1a2b3c4” is a thing you can say today, tomorrow, next year. The commit is byte-identical. Runbooks become unnecessary. The environment is the hash.
Moving state becomes moving hashes
You don’t have to ship gigabytes of VM image to another region, another cluster, another customer. You ship the hash, and the receiving side either already has the bytes or pulls them on demand. Exactly like pulling a git object you don’t have yet. Same mental model, whole-machine granularity.
Merges, forks, and provenance
If a VM’s identity is its content hash, and each commit records its parent hash, you have a directed acyclic graph of state over time. You can walk it. You can ask “what VMs derive from commit c1a2b3c4?” You can identify the lineage of a production bug. You can branch off an old commit to test a fix against the state at the time of the failure.
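Both queries are ordinary graph walks. A minimal sketch, assuming the commit graph is available as a mapping from commit ID to parent IDs; the IDs other than c1a2b3c4 are made up, and `descendants` and `lineage` are hypothetical helpers.

```python
from collections import defaultdict, deque

# Hypothetical commit graph: commit ID -> list of parent commit IDs.
parents = {
    "c1a2b3c4": [],
    "d4e5f6a7": ["c1a2b3c4"],
    "e8f9a0b1": ["c1a2b3c4"],
    "f2c3d4e5": ["d4e5f6a7"],
}


def descendants(parents: dict, root: str) -> set:
    """Return every commit that derives, directly or indirectly, from root."""
    children = defaultdict(list)
    for commit, commit_parents in parents.items():
        for parent in commit_parents:
            children[parent].append(commit)
    seen = set()
    queue = deque([root])
    while queue:
        for child in children[queue.popleft()]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def lineage(parents: dict, commit: str) -> list:
    """Walk first parents back to the root, newest first."""
    chain = [commit]
    while parents[chain[-1]]:
        chain.append(parents[chain[-1]][0])
    return chain


derived = descendants(parents, "c1a2b3c4")
history = lineage(parents, "f2c3d4e5")
```

Here `derived` answers “what VMs derive from this commit?” and `history` gives the chain of states leading up to a particular machine.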
Trust without a central authority
Because the hash is derived from content, you can verify a commit is what it claims to be by inspecting its bytes. This matters for air-gapped environments, untrusted transport, and adversarial scenarios. A malicious intermediate can’t give you a different VM under the same commit ID. The commit is what the commit is.
Cross-vendor portability
A content-addressable VM format doesn’t care where it runs. If two clouds implement the same hash scheme, c1a2b3c4 on AWS and c1a2b3c4 on GCP are the same machine. The vendor lock-in that plagues snapshots goes away the moment the ID is derived from content rather than issued by the vendor.
What this unlocks
Once VMs are content-addressable, a bunch of workflows that were always painful become obvious:
- Bisecting production bugs. Walk the commit history, restore any commit, test against it. Same loop as git bisect, applied to whole machines.
- Sharing environments. Send someone a commit ID. They restore. They’re running the exact same machine as you.
- Cache warming across the fleet. A warmed cache is just a committed state. Share the commit, skip the warming.
- Deterministic CI. Build in a committed environment. The commit is the reproducibility guarantee. Same commit → same output forever.
- Rollback with memory. Restoring to c1a2b3c4 isn’t “redeploy the binaries from commit c1a2b3c4.” It’s “boot the exact machine, memory and processes included, that was running when you committed c1a2b3c4.” Full rollback, not a re-approximation.
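The bisecting workflow in the list above is a plain binary search over commits. A sketch, assuming a linear history ordered oldest to newest and a test that stays bad once it turns bad; `restore_and_test` is a stand-in for restoring a VM commit and running the failing check, not a real API.

```python
from typing import Callable, Sequence


def bisect_commits(history: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first bad commit in history (ordered oldest to newest).

    Assumes history[0] is good, history[-1] is bad, and that once a commit
    is bad every later commit is bad too.
    """
    lo, hi = 0, len(history) - 1  # invariant: history[lo] good, history[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(history[mid]):
            hi = mid
        else:
            lo = mid
    return history[hi]


history = [f"commit-{i:02d}" for i in range(16)]
tested = []


def restore_and_test(commit: str) -> bool:
    # Stand-in for "restore this VM commit and run the failing check".
    tested.append(commit)
    return commit >= "commit-11"


first_bad = bisect_commits(history, restore_and_test)
```

With sixteen commits the search needs only four restores, which is why cheap restoration matters more than cheap storage for this workflow.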
Where git’s model breaks down
The analogy isn’t perfect. Two places it diverges:
- A running VM includes memory. Git only had to hash bytes on disk. Memory pages change faster and are larger than source code. A content-addressable VM store has to dedup memory aggressively or it drowns in storage. Copy-on-write at the memory-page level is the way out, but it means the VM’s commit identity has to be defined carefully: it’s the hash of the semantic state, not naively the hash of the memory image.
- Running is state itself. A git commit is a static thing. A VM commit is the state of a running process — a pause point in a continuous execution. Restoring a VM commit isn’t like checking out a file; it’s resurrecting a process tree with its open sockets and in-memory data structures. This is harder than git, and it’s why the primitive wasn’t cheap until recently.
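The storage effect of page-level sharing can be illustrated with a toy store. This sketch is not kernel copy-on-write and deliberately takes the naive route of hashing raw page bytes, which the first point above warns is not enough for a real commit identity. `PageStore`, the 4096-byte page size, and the eight-page image are assumptions made for the example.

```python
import hashlib

PAGE_SIZE = 4096  # assumed page size


class PageStore:
    """Content-addressed page store: each distinct page is kept once."""

    def __init__(self) -> None:
        self.pages = {}  # page hash -> page bytes

    def commit(self, image: bytes) -> list:
        """Store an image and return its manifest (ordered list of page hashes)."""
        manifest = []
        for offset in range(0, len(image), PAGE_SIZE):
            page = image[offset:offset + PAGE_SIZE]
            page_id = hashlib.sha256(page).hexdigest()
            self.pages.setdefault(page_id, page)  # no-op if already stored
            manifest.append(page_id)
        return manifest

    def restore(self, manifest: list) -> bytes:
        """Rebuild the full image from its manifest."""
        return b"".join(self.pages[page_id] for page_id in manifest)


store = PageStore()
# A hypothetical 8-page golden image whose pages all differ.
golden = b"".join(bytes([i]) * PAGE_SIZE for i in range(8))
base = store.commit(golden)

# A branch that dirtied a single byte in page 3.
branch_image = bytearray(golden)
branch_image[3 * PAGE_SIZE] = 0xFF
branch = store.commit(bytes(branch_image))

changed_pages = sum(x != y for x, y in zip(base, branch))
```

Two eight-page images end up costing nine stored pages rather than sixteen: storage tracks divergence, not the number of machines.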
The shift
Once you’ve seen the move, you start seeing everywhere it hasn’t been applied yet. Kubernetes pod state. Database replicas. Long-running LLM sessions. Build caches. Dev environments. Every one of them is a location-addressed thing today that should probably be content-addressed tomorrow. The reason they’re not yet is almost always the same reason VMs weren’t: the primitive to do it cheaply didn’t exist. Once it does, the pattern spreads.
This is what Vers is. The observation that running machines deserved what source code got twenty years ago. Everything that falls out of that observation (branching, restoring, snapshotting, deduplication, reproducibility, shareable state) falls out quickly, because it’s the same story git told, applied to a bigger artifact.
Further reading
Architecture
How Vers commits are actually content-addressed. Overlay filesystem, copy-on-write memory, the commit graph.
Time-travel debugging
What debugging looks like when every commit is a restoration point.
Why stateless ran out
The case for treating running state as something worth preserving at all.
Core concepts
Projects, VMs, commits — the user-facing model of content-addressed running machines.