
Here is the typical shape of a production bug investigation. A user reports something weird. By the time the report reaches engineering, it’s six hours old. The logs from the moment of the bug have rolled off. The user’s session is gone. The container that handled their request was recycled at the next deploy. The data they triggered the bug on may have been modified by them or by a subsequent job. You have their description, some log lines, maybe a stack trace, and a very strong hunch that nobody will reproduce this in a test environment. So you do archaeology. You read the code. You stare at the logs you have. You try to imagine the chain of events that produced the symptoms. You write a one-off script to try to reconstruct an approximation of the state. You fail, twice. Eventually you either find the bug by reasoning alone or you shrug and close the ticket because it hasn’t recurred. This is how almost all serious production debugging is done. It is fundamentally a detective activity performed on corpses. It is expensive, it is error-prone, and it is entirely downstream of one property of our infrastructure: the state that produced the bug isn’t preserved anywhere, so you can’t go back to it. Now imagine you could. That’s what this essay is about.

What “time-travel debugging” has meant so far

The phrase has an existing meaning in the trade. Tools like rr, Pernosco, and Microsoft’s Time Travel Debugging in WinDbg record a deterministic replay of a single process and let you step backward through its execution. Fantastic technology; I use rr regularly. These tools are limited in exactly one way: they have to be turned on before the bug happens. You run your test under rr, the bug reproduces, you step backward through it. If you didn’t start recording, the replay doesn’t exist. And since the overhead is non-trivial (5–10%, worse for some workloads), you don’t run every production service under rr all the time. Deterministic replay is an incredible technique for the bugs you can reproduce, where “reproduce” means “trigger a second time in a controlled setting.” It does not help with the bug that already happened in production while you weren’t looking. That’s the class of bug time-travel-debugging-as-a-production-primitive would help with.

What changes when every commit is a restoration point

Imagine your production environment takes a commit every N seconds, or on every deploy, or on every significant state change — whatever fits the workload. A commit here doesn’t mean a source-code commit; it means a content-addressable snapshot of the full VM: memory, filesystem, open sockets, running processes, in-flight requests. Call this rate of commits C. What does debugging look like at different values of C?
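
A minimal sketch of the committer at, say, C = once per minute, assuming a hypothetical `commit()` hook that takes the full-VM snapshot and returns its content-addressed ID (how commits are actually triggered is platform-specific):

```python
import time
from typing import Callable

def commit_loop(commit: Callable[[], str], interval_s: float = 60.0) -> None:
    """Take a full-VM commit every interval_s seconds, forever."""
    while True:
        commit_id = commit()  # snapshot: memory, filesystem, sockets, processes
        print(f"restoration point: {commit_id}")
        time.sleep(interval_s)
```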

C = 0: today, for most services

Never commit. State is lost the moment it changes. Debugging is detective work.

C = once per deploy

Committing on deploy is basically a blue-green rollout. When a deploy breaks, you restore the previous commit. Useful, but the window of “this commit ran” is hours to days. Too wide to pin a specific bug.

C = once per minute

Now you have a sliding window of the last hundred minutes of production. When a bug report arrives, you restore the commit closest to the user’s timestamp. You’re looking at the actual machine, memory and all, as it was. Logs still exist because the processes that wrote them are still running. Caches still have entries. Open sockets are still open (well, almost: the remote peers have long since moved on, so those connections will reset on first use). This is a big jump. At C=1/min, “reproduce the bug in prod state” stops being a research project.

C = every request

Now the debugging primitive isn’t “restore a minute-accurate snapshot of prod.” It’s “restore the exact state at the moment this user’s request was handled.” You can step backward through the state the request saw. You can run the request again, against the committed state, and see what happened. You can branch the commit, change one variable, and see what would have happened.

C = every tool call (for agents)

For long-running agents, a commit per tool call is the natural rate. Every time the agent makes a decision, the state before and after is committed. When the agent ends up somewhere weird, you walk back through the commit graph to find the decision that took it there, restore to before it, branch, try a different decision, see the alternate history.

What’s blocking C from being high isn’t philosophical. It’s the latency and storage cost of a commit. If each commit takes two seconds and a gigabyte of dedicated storage, committing per-request is absurd. If each commit takes 258µs and deduplicates aggressively against the base image, committing per-request is routine.
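
For the agent case, the bracketing can live in the tool-call wrapper. A minimal sketch, assuming a hypothetical `commit(label)` hook that snapshots the VM and returns a commit ID:

```python
from typing import Any, Callable

ToolFn = Callable[..., Any]
trace: list[tuple[str, str]] = []  # (before, after) commit IDs, one pair per decision

def committed_call(commit: Callable[[str], str], tool: ToolFn,
                   *args: Any, **kwargs: Any) -> Any:
    """Bracket one agent decision with restoration points."""
    before = commit(f"before {tool.__name__}")
    result = tool(*args, **kwargs)
    after = commit(f"after {tool.__name__}")
    trace.append((before, after))  # walk this back when the agent goes somewhere weird
    return result
```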

The debugging workflows that become possible

Bisect the state, not just the code

git bisect lets you find the commit that introduced a bug by binary-searching source-code history. It is one of the most powerful debugging tools in common use, and it works because commits are cheap, restorable, and content-addressed. Now imagine vers bisect: binary-search the production state commits, restoring each candidate, running a predicate against it, and converging on the commit where the bug first appeared. The code might not have changed between commits. The data might have. Or the load pattern. Or the feature-flag configuration. Bisecting on state — not source — finds bugs that have no source-code origin.
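
A sketch of what the core of such a tool could look like; `is_good` is a caller-supplied predicate that would restore the commit (or a branch of it) and probe the live machine:

```python
from typing import Callable, Sequence

def bisect_state(commits: Sequence[str],
                 is_good: Callable[[str], bool]) -> str:
    """Return the first commit for which is_good() is False.

    commits is ordered oldest to newest; is_good(commits[0]) is assumed
    True and is_good(commits[-1]) False, mirroring git bisect's contract.
    """
    lo, hi = 0, len(commits) - 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(commits[mid]):
            lo = mid   # bug not present yet; search later commits
        else:
            hi = mid   # bug already present; search earlier commits
    return commits[hi]  # the first bad commit
```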

Branch from the moment of failure

The classic debugging flow is “reproduce the bug, then try to fix it.” The second half — trying to fix it — is what chews through reproductions. Each fix attempt modifies the state, and if the fix is wrong, you need to get back to the buggy state to try again. With commits and branches, the loop looks different:
  1. Restore the commit at the moment of failure.
  2. Branch it. (This is the “scratch environment” where you try things.)
  3. Try a fix. Did it work? If yes, note it; if no, discard the branch.
  4. Return to the commit-of-failure. Branch again. Try another fix.
  5. Run five fix attempts against the same exact state without ever having to rebuild it.
The iteration speed on this loop is bounded by how fast you can branch, which is 258µs, which is effectively instant. Debugging stops feeling like a chain of expensive reproduction attempts and starts feeling like exploring a garden of alternate timelines.
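
A sketch of that loop, against a hypothetical client interface (method names here are illustrative, not the real Vers API):

```python
from typing import Callable, Optional, Protocol

class Snapshots(Protocol):
    def branch(self, commit_id: str) -> str: ...   # new VM from a commit, ~instant
    def discard(self, vm_id: str) -> None: ...     # throw a timeline away

Fix = Callable[[str], bool]  # applies a change to a VM, re-runs the repro

def try_fixes(snaps: Snapshots, failure_commit: str,
              fixes: list[Fix]) -> Optional[Fix]:
    """Run every candidate fix against a fresh branch of the same state."""
    for fix in fixes:
        vm = snaps.branch(failure_commit)  # the exact buggy state, never rebuilt
        try:
            if fix(vm):
                return fix
        finally:
            snaps.discard(vm)
    return None
```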

Share the bug state

You’re working on a bug. Your colleague has an idea. Today you describe the state to them, they try to reproduce it locally, they get a different state, they offer a fix based on the state they got, you try it against yours, it doesn’t work. The chain of translation has lost signal at every step. With commits: you send them the commit ID. They restore. They’re debugging the exact same machine you’re debugging. Not a re-creation. The original. The conversation isn’t about “what state are you in” — both of you are in the same state, by hash.

A/B the fix in the state that produced the bug

Here’s the most interesting one. You have a fix. Before shipping, you want to know if the fix actually addresses the bug. Today you ship it to staging and hope. Or you write a test that approximates the bug’s conditions and validate against that. With committed state, you apply the fix to a branch of the commit-of-failure and observe directly whether the bug still reproduces. You’re A/B-testing in the literal state that produced the bug, not in a synthetic approximation. If the fix doesn’t address it, you know instantly, because the bug is right there, not hiding behind a failed reproduction.
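
A sketch of the comparison, with the same hypothetical interface as above; `apply_fix` and `reproduces` are stand-ins for however you patch the branch and re-run the failing request:

```python
from typing import Callable, Protocol

class Snapshots(Protocol):
    def branch(self, commit_id: str) -> str: ...
    def discard(self, vm_id: str) -> None: ...

def fix_addresses_bug(snaps: Snapshots, failure_commit: str,
                      apply_fix: Callable[[str], None],
                      reproduces: Callable[[str], bool]) -> bool:
    """Control must still show the bug; treated, with the fix, must not."""
    control = snaps.branch(failure_commit)
    treated = snaps.branch(failure_commit)
    try:
        apply_fix(treated)
        return reproduces(control) and not reproduces(treated)
    finally:
        snaps.discard(control)
        snaps.discard(treated)
```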

Regression tests that are machines, not scripts

The test suite entry for a bug fix usually looks like a script: set up conditions, trigger the bug, assert the fix worked. Writing one is expensive (you’re re-implementing the bug’s conditions in code), brittle (the conditions drift), and lossy (the real state had properties the script can’t encode). With committed state, the regression test is: “restore commit c1a2b3c4, run the request, expect success.” The conditions aren’t simulated — they are the conditions, byte-identical. The test runs against an actual machine in the state that previously failed. No drift. No encoding loss.
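
In pytest terms, the whole test can be a few lines; `restore_vm` and `run_request` are hypothetical fixtures a harness would provide, and the commit ID is the one captured at the original failure:

```python
# conftest.py (not shown) would define restore_vm and run_request as fixtures.
FAILURE_COMMIT = "c1a2b3c4"  # the committed machine that previously failed

def test_bug_no_longer_reproduces(restore_vm, run_request):
    vm = restore_vm(FAILURE_COMMIT)      # boot the byte-identical machine
    response = run_request(vm)           # re-run the request that failed then
    assert response.status_code == 200   # previously: an error
```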

What the state is vs. what the logs say

Logs are the state of the art in preserving enough of past state to debug later. And logs are limited in obvious ways: they capture what you thought to log, at the time you wrote the code. Anything you didn’t think was worth logging is gone. Anything in memory is gone. The state of caches is gone. The exact timing of events relative to each other is often fuzzy. A committed VM is a lossless capture of state. Everything in memory. Every file. Every open socket’s buffered bytes. The kernel’s view of the process tree. Restoring a commit isn’t “replaying the logs” — it’s booting the actual machine as it was at the commit moment. The difference in debugging power is enormous. Logs tell you what you thought was important. A committed state lets you ask questions you didn’t know to ask. “What was in the LRU cache?” Fine, restore, dump it. “What was the in-flight HTTP request’s partial body?” Fine, look. “What state was the connection pool in?” Fine, inspect it.

The thing that made this impractical was the cost of committing. If committing takes a second and a gigabyte, you don’t commit continuously in production. You commit rarely. The logs remain your primary tool, because they’re cheap to produce. Once commits are microseconds and dedup-heavy, the economics flip. Commit continuously. Use commits as the primary debugging artifact. Logs become summaries, not the source of truth.
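
A sketch of that inspection loop, assuming hypothetical `branch` and `run` hooks for booting a branch of the commit and executing a command on it:

```python
from typing import Callable, Protocol

class Snapshots(Protocol):
    def branch(self, commit_id: str) -> str: ...

def inspect(snaps: Snapshots, run: Callable[[str, str], str],
            commit_id: str) -> None:
    """Ask the questions you didn't know to log."""
    vm = snaps.branch(commit_id)            # never disturb the commit itself
    print(run(vm, "redis-cli --scan"))      # what was in the cache?
    print(run(vm, "ss -tanp"))              # what state were the sockets in?
    print(run(vm, "cat /proc/meminfo"))     # what did memory pressure look like?
```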

The obvious counterargument: storage

If you commit production state every minute at a microsecond cost, where does the storage go? Two answers:
  1. Content-addressing dedups aggressively. A commit that only changed a handful of memory pages from its parent stores only those pages; the rest are references to the parent’s content. Consecutive commits that are mostly unchanged are cheap. You don’t store N copies of the VM; you store the divergence (there’s a toy sketch of this after the list).
  2. Retention policies apply, obviously. You don’t keep every minute of commits forever. You keep the last hour at minute-granularity, the last day at hour-granularity, the last week at day-granularity, the last month at checkpoint-granularity. Same rolling-window logic you already use for logs and metrics. The difference is that each retained commit is orders of magnitude richer than a log line.
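
As a toy illustration of the dedup claim, the sketch below content-addresses fixed-size memory pages and stores each distinct page once; a commit is then just a manifest of hashes. This is not Vers’s actual on-disk format, only the shape of the idea.

```python
import hashlib

PAGE_SIZE = 4096
pages: dict[str, bytes] = {}  # content-addressed page store, shared by all commits

def commit_memory(memory: bytes) -> list[str]:
    """A commit is just an ordered manifest of page hashes.

    Pages already in the store cost nothing; only the divergence
    from earlier commits is stored.
    """
    manifest = []
    for off in range(0, len(memory), PAGE_SIZE):
        page = memory[off:off + PAGE_SIZE]
        digest = hashlib.sha256(page).hexdigest()
        pages.setdefault(digest, page)  # dedup against everything ever stored
        manifest.append(digest)
    return manifest
```

Two consecutive commits that differ in a few pages share every other entry in the store; the second commit costs the changed pages plus a list of hashes.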

Where this is and where it’s going

The loop described in this essay — commit-heavy production, branch-to-investigate, restore-to-reproduce — isn’t theoretical. It works today with Vers for the workloads where the commits and branches are cheap enough. The class of bugs that are addressable this way grows as the commit rate you can afford in production grows. The interesting bet is that production debugging as detective-work-on-corpses is a temporary artifact of expensive state. Once the state is cheap to preserve, debugging stops being about reconstruction and starts being about inspection. We didn’t reason our way to git bisect from first principles; cheap commits made it possible, and then someone noticed and wrote the tool. The same sequence will play out for production state. I expect that within a few years, production services that aren’t routinely committing their state will look the same way services without structured logs look today: under-instrumented, under-auditable, debugging the hard way out of habit. The tooling hasn’t caught up to the primitive yet, but the primitive is here.

Further reading

Content-addressable everything

Why committing state is coherent at all — and how git’s idea generalizes to running machines.

The cost of rebuilding state

The general case: most engineering time goes into reconstructing state. Debugging is one acute instance.

Database state testing

A tutorial that uses branching-from-a-committed-state to test multiple migrations against the same baseline.

Core concepts

Projects, VMs, HEAD, branches, commits — the primitives that make this loop cheap.