The Cutover Manual: Multi-Stage Rollout Patterns for 2026
The deliverable is not a merged PR. It is traffic, on production, served safely, while the metrics that matter stay flat or improve. Most engineering teams treat cutover as the last 10% of the work. It is the part that decides whether the previous 90% was worth shipping at all.
We see this pattern often when we get called in. The new system was built, the tests passed, the PR was approved by three reviewers. Then the team tried to put it in front of users on a Friday afternoon, something broke in a way staging never reproduced, and the rollback took six hours because nobody had ever practiced one. The post-mortem will call it a deployment issue. It was not. It was a planning issue that showed up at deployment time.
This is a working manual for the rollout patterns that keep production stable while you change it under load. Five patterns, where each one fits, where each one breaks, what your rollback profile looks like. Then the 5-minute rollback target and how to drill it on a Tuesday before you need it on a Saturday.
Shadow Mode
Shadow mode is the pattern where the new system processes a full copy of production traffic but does not serve any results to users. The old system still owns the response. The new system computes alongside, writes its output to a comparison log, and you grade the difference.
Use shadow mode when correctness is what matters and the new code is a rewrite of business logic, not a refactor. Pricing engines. Risk scoring. Tax calculation. Fraud rules. Anywhere a wrong answer is worse than a slow answer, and where synthetic test data cannot give you coverage because the input space is too rich. Shadow mode lets you run the new system against the actual production input distribution for a week or a month, count the divergences, chase down every one before any user sees a result from the new path.
Do not use shadow mode for read-heavy paths where the result is hard to compare deterministically, or for systems whose side effects cannot be made idempotent. If the new system would send an email, charge a card, or trigger a webhook, shadow mode requires a clean way to stub those side effects without changing the code under test. If you cannot stub cleanly, you are not running a shadow; you are running a parallel system that is also doing real things, which is the bug you were trying to avoid.
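A minimal sketch of what stubbing side effects can look like, assuming the side-effecting dependency (here a mailer) is passed into the handler rather than imported globally. `RecordingMailer`, `handle_with_shadow`, and the handler signatures are hypothetical names for illustration, not part of any particular framework.

```python
from dataclasses import dataclass, field


@dataclass
class RecordingMailer:
    """Stub that records what would have been sent instead of sending it."""
    sent: list = field(default_factory=list)

    def send(self, to: str, body: str) -> None:
        self.sent.append((to, body))


def handle_with_shadow(request, old_handler, new_handler, live_mailer, log):
    """Serve from the old path; run the new path alongside with stubbed side effects."""
    # The old system owns the response and the real side effects.
    response = old_handler(request, live_mailer)

    # The new system gets a recording stub, so the code under test is unchanged
    # but nothing real is sent.
    shadow_mailer = RecordingMailer()
    record = {"request": request, "old": response}
    try:
        shadow = new_handler(request, shadow_mailer)
        record.update(new=shadow, match=(shadow == response),
                      shadow_side_effects=list(shadow_mailer.sent))
    except Exception as exc:  # a shadow failure must never reach the user
        record.update(error=repr(exc), match=False)
    log.append(record)
    return response
```

In a real service the shadow call would usually run asynchronously or on a mirrored copy of the request so it cannot add latency to the response path; it is inline here only to keep the sketch short.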
The pitfall: the comparison harness becomes the bug. We have seen shadow runs where 4% of requests showed a divergence and the team spent three weeks finding root causes in the new code before realizing the comparison logic was normalising timestamps in two different timezones. The harness needs the same rigour as the production code. Schema-validate the diff records. Sample and hand-grade a few hundred. Treat the harness like the load-bearing piece it is.
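The timezone failure above is the kind of thing a small normalisation step in the harness can rule out. The sketch below is one way to do it, assuming responses are plain dicts and lists that may contain `datetime` values; `normalise` and `diverges` are hypothetical helper names.

```python
from datetime import datetime, timezone


def normalise(value):
    """Recursively convert timestamps to UTC ISO strings before comparing."""
    if isinstance(value, datetime):
        if value.tzinfo is None:
            # A naive timestamp is ambiguous; fail loudly instead of guessing.
            raise ValueError("naive timestamp in diff record")
        return value.astimezone(timezone.utc).isoformat()
    if isinstance(value, dict):
        return {key: normalise(item) for key, item in value.items()}
    if isinstance(value, list):
        return [normalise(item) for item in value]
    return value


def diverges(old, new) -> bool:
    """True only when the two outputs differ after normalisation."""
    return normalise(old) != normalise(new)
```

Rejecting naive timestamps is a design choice: it turns a silent source of false divergences into an error you see on the first run.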
The rollback profile is excellent because there is nothing to roll back. The new system was never on the response path. You delete the shadow, delete the comparison job, and production keeps running on the old code. The cost is compute and the operational complexity of running a comparison pipeline that nobody enjoys maintaining.
Feature Flags
Feature flags are the workhorse of modern deployment strategies. Every other pattern on this list either uses flags directly or borrows the same conceptual machinery. A flag is a runtime switch that selects between code paths, and the value of that switch can be scoped per user, per account, per region, per request header, or per any other attribute you can read at request time.
Variants matter. A boolean flag (on or off) is the simplest case. A percentage flag (on for 10% of users, sticky by user ID) is what you reach for during a feature flag rollout to real traffic. A multivariate flag (control vs. variant A vs. variant B) is what you use when you are running an experiment, not just a release. Most production systems end up with all three kinds, and the discipline is keeping straight which flag is which.
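To make "sticky by user ID" concrete, here is a minimal sketch of a percentage flag that hashes the flag key and user ID into a stable bucket. The function names and the 10,000-bucket resolution are assumptions for illustration; hosted flag platforms have their own bucketing schemes.

```python
import hashlib


def bucket(flag_key: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in [0, 10000)."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 10_000


def is_enabled(flag_key: str, user_id: str, rollout_percent: float) -> bool:
    """True if this user falls inside the current rollout percentage."""
    # Each bucket is 0.01% of users, so 10% covers buckets 0..999.
    return bucket(flag_key, user_id) < rollout_percent * 100
```

Because the bucket depends only on the flag key and the user ID, a user who is in at 10% stays in at 25%, and including the flag key in the hash keeps the same users from being first in line for every flag.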
Hosted platforms like LaunchDarkly are the default for teams that can spend on the line item. Unleash and Flagsmith are the open-source paths if you want to self-host or you have data residency rules that make a third-party service awkward. The platform matters less than the practice. A homegrown YAML file in the repo will do the job for a small team if the deploy story is fast. What you cannot get away with is flag values that require a code deploy to change. That defeats the point.
Pitfall one: flag debt. Every flag is a fork in the codebase. After two years of shipping, a typical team has 80 to 200 flags: half are fully rolled out, a third nobody remembers, and a quarter were written by someone who has since left the company. Each one is a branch in every code path it touches, multiplying the test surface and slowing the build. The fix is mechanical. Every flag gets an owner, an expiry date, and a cleanup ticket created the day the flag goes to 100%. The cleanup ticket is the deliverable. Without it, the flag stays forever.
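The owner, expiry, and cleanup-ticket rule is easy to enforce in CI if flags are declared in one place. A minimal sketch, assuming a homegrown registry; the `Flag` fields and the `flag_debt` helper are hypothetical, not the schema of any flag platform.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Flag:
    key: str
    owner: str
    expires: date
    cleanup_ticket: Optional[str] = None
    rollout_percent: int = 0


def flag_debt(flags, today: date) -> list:
    """Return one human-readable problem per rule a flag breaks."""
    problems = []
    for flag in flags:
        if flag.expires < today:
            problems.append(
                f"{flag.key}: expired {flag.expires.isoformat()}, owner {flag.owner}"
            )
        if flag.rollout_percent == 100 and not flag.cleanup_ticket:
            problems.append(f"{flag.key}: at 100% with no cleanup ticket")
    return problems
```

Failing the build when `flag_debt` returns anything is one option; posting the list to the owning team's channel is a gentler one.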
Pitfall two: the flag-of-flags problem. Teams start nesting flags to express complex rollout rules. “If feature A is on and feature B is off and the user is in the EU and the request came after 9 AM.” This compiles to a truth table nobody can reason about, which means nobody knows what production is actually doing. The rule we hold pods to: a flag’s enabled rule fits in one sentence. If it does not, you are designing a small DSL and you have not noticed.
Pitfall three: performance cost. The flag evaluation path runs on every request. Bad SDK choices, blocking network calls to fetch flag values, or repeated evaluations inside hot loops have all caused production incidents we have walked into. Cache values per request. Evaluate once at the entry point. Treat the flag SDK like any other dependency that can take down the service.
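"Cache values per request" can be as small as a wrapper that memoises evaluations for the lifetime of one request. This is a sketch under the assumption that your SDK exposes some `evaluate(key, context)` callable; `RequestFlags` is a hypothetical name.

```python
class RequestFlags:
    """Evaluates each flag at most once per request and remembers the answer."""

    def __init__(self, evaluate, context):
        self._evaluate = evaluate  # assumed SDK call: evaluate(key, context) -> bool
        self._context = context
        self._cache = {}

    def enabled(self, key: str) -> bool:
        if key not in self._cache:
            self._cache[key] = self._evaluate(key, self._context)
        return self._cache[key]
```

Build one `RequestFlags` at the entry point and pass it down. A side benefit is consistency: a flag that flips mid-request cannot send the same request down two different paths.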
When do you delete the flag? Once it has been at 100% for 14 consecutive days with no rollback. In practice teams do not delete flags, because deletion is a code change, code changes need review, and the engineer who wrote the flag has moved on. The fix is a quarterly flag-cleanup sprint that gets time on the roadmap, not a hope that someone will get to it.
Canary
Canary deployment is the pattern where the new code goes to 1% of traffic first, you watch the metrics for a defined window, then you go to 5%, then 25%, then 50%, then 100%. The name comes from the canary in the coal mine. The 1% is the early warning system for the 99%.
Canary works when you have enough traffic for 1% to be statistically meaningful (a few thousand requests per minute as a floor), when the rollout target is a stateless service or a service whose state can tolerate two versions running simultaneously, and when you have observability good enough to tell within 10 minutes whether the canary is healthy. If any of those is missing, canary is theatre.
The pitfall that catches teams: the slow rollout that hides slow regressions. A canary at 1% for 24 hours and then 5% for 48 hours sounds careful, but a regression that shows up at p99 latency under peak load will not surface until the canary is large enough to take peak load itself. We have seen teams celebrate a “successful” canary rollout on Thursday and then watch the same code degrade Monday morning when real traffic hit. The fix is to time the canary stages to span at least one full traffic cycle (day and night, weekday and weekend) before promoting, or to use synthetic load to compress the cycle.
The metrics that matter at each stage are not the same. At 1% you watch error rates and crash signals; that is what the sample size supports. At 5% you read p50 and p95 latency. At 25% you read p99 and look at downstream effects: database query patterns, cache hit rates, queue depths. At 50% and 100% you watch business metrics: conversion, checkout completion, signup rate, whatever the system exists to do. A canary that watches error rate the whole way through and never checks business metrics is a canary that ships a 2% revenue regression with confidence.
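The stage-to-metric mapping above can be written down as data so the promotion gate checks the right things at each weight. A minimal sketch: the metric names and the tolerance are hypothetical placeholders, and the baseline and canary dicts are assumed to come from your observability stack.

```python
# Metrics where a higher number is worse; everything else is treated as
# "higher is better" (for example a conversion rate).
LOWER_IS_BETTER = {"error_rate", "p50_ms", "p95_ms", "p99_ms", "queue_depth"}

# Which metrics are worth reading at each canary weight (percent of traffic).
STAGE_METRICS = {
    1: ["error_rate"],
    5: ["error_rate", "p50_ms", "p95_ms"],
    25: ["error_rate", "p50_ms", "p95_ms", "p99_ms", "queue_depth"],
    50: ["error_rate", "p95_ms", "p99_ms", "checkout_completion"],
    100: ["error_rate", "p95_ms", "p99_ms", "checkout_completion"],
}


def regression(metric: str, baseline: float, canary: float) -> float:
    """Relative change in the bad direction; positive means the canary is worse."""
    if baseline == 0:
        raise ValueError(f"baseline for {metric} is zero; cannot compute a ratio")
    delta = (canary - baseline) / baseline
    return delta if metric in LOWER_IS_BETTER else -delta


def failing_metrics(weight: int, baseline: dict, canary: dict,
                    tolerance: float = 0.05) -> list:
    """Metrics checked at this stage that regressed by more than the tolerance."""
    return [
        metric for metric in STAGE_METRICS[weight]
        if regression(metric, baseline[metric], canary[metric]) > tolerance
    ]
```

A single relative tolerance is a simplification; in practice business metrics usually want a tighter threshold and a longer observation window than error rates do.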
Tooling-wise, the load balancer or service mesh does the work. AWS ALB target groups with weighted routing. Envoy or Istio with subset routing. Argo Rollouts if you are on Kubernetes and want the canary state machine declared as a CRD. Pick one, learn it deeply, do not switch midway through a rollout.
Rollback profile: change the weight, traffic shifts back, the bad version drains. If the new version wrote anything to a shared data store that the old version cannot read, your rollback is not a weight change, it is an outage. Canary assumes backward compatibility at the data layer. Most rollout failures we see are at this seam.
Blue-Green
Blue-green is two complete production environments, blue and green, running in parallel. One is live (say, blue). You deploy the new version to green, run smoke tests against it, then flip the load balancer to send all traffic to green. Blue sits idle as the rollback target. If green misbehaves, you flip the load balancer back to blue and you are done, in seconds, not minutes.
The appeal of blue-green is the cleanest rollback story of any pattern on this list. A weight flip on the load balancer is one operation, atomic from the user’s perspective, fast enough that the rollback fits inside the 5-minute target with room to spare. For services where the cost of a bad deploy is high and the cost of running double infrastructure is acceptable, blue-green is the right call.
Where blue-green falls down is the data layer. Two copies of the application are easy. Two copies of the production database are not. If the new version requires a schema change, you choose between deploying that schema change to a database both colours share (and now blue and green both need to speak it, which means you are doing a dual-write migration anyway) or running two databases with replication between them (and now you have a split-brain risk during the cutover window). Most teams end up hybrid: blue-green at the application layer, dual-write or expand-contract at the data layer.
Cost is the other constraint. You pay for double the compute for the duration of the rollout window, which in a managed cloud is real money. Teams that run blue-green continuously pay for it every day. Teams that spin up green only at deploy time get the cost back but pay in deploy speed, since the freshly-provisioned environment has cold caches, cold JIT, cold everything.
Pitfall: the smoke test that runs against an empty green environment. Green has no real traffic shape until you flip the load balancer. Smoke tests can pass and you can still discover a config issue that only manifests under concurrency, or a connection pool problem that only manifests at production QPS. The fix is shadowing some real traffic to green before the cutover, or doing the cutover as a fast canary (10% to green, watch for 5 minutes, then 100%) instead of an instant swap.
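The fast-canary variant of the cutover is short enough to sketch. Here `set_green_weight` and `green_is_healthy` are hypothetical callables that wrap your load balancer and your health checks; `green_is_healthy` is assumed to block for the watch window (5 minutes in the example above) before it answers.

```python
def cut_over_to_green(set_green_weight, green_is_healthy, steps=(10, 100)) -> bool:
    """Shift traffic to green in steps; flip everything back to blue on a failure."""
    for weight in steps:
        set_green_weight(weight)
        if not green_is_healthy():
            set_green_weight(0)  # all traffic back to blue, the rollback target
            return False
    return True
```

With `steps=(100,)` this degrades to the instant swap, so the same code path serves both styles of cutover and the rollback branch gets exercised either way.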
Tooling is straightforward: ALB or NLB target groups, or DNS-based routing if you can tolerate the propagation window (often you cannot). Kubernetes service routing with two deployments behind one service does the same thing inside a cluster.
Dual-Write
Dual-write is the pattern that delivers a zero downtime migration when you are changing the shape of your data, not just the shape of your code. It is also the pattern teams most often skip and most often regret skipping.
The five steps:
- Expand. Add the new schema (new table, new columns, new database) alongside the old. Write nothing to it yet. Deploy the change.
- Dual-write. Application code writes to both old and new on every mutation. Reads still come from old. The new schema is being populated, but nothing reads from it. Run a backfill job to populate the new schema from the historical data in old. Run a consistency check that compares old and new on every write.
- Dual-read. Reads start coming from both old and new. Old is still the source of truth. The dual-read is a comparison check that flags any divergence and alerts a human. This stage runs until the divergence rate is at an acceptable noise floor (say, less than 0.001%).
- Cut over reads. Reads come from new. Writes still go to both. If anything blows up, you can roll back reads to old in a flag flip.
- Drop old. Writes stop going to old. The old schema, table, or database is dropped after a quarantine period (we hold pods to 30 days minimum).
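The five steps above can be sketched as a single store wrapper whose behaviour depends on the current stage. This is an illustration only: two dicts stand in for the old and new schemas, the stage would normally be a runtime flag, and `MigrationStage` and `MigratingStore` are hypothetical names.

```python
from enum import IntEnum


class MigrationStage(IntEnum):
    EXPAND = 1          # new schema exists, nothing writes to it
    DUAL_WRITE = 2      # writes go to both, reads come from old
    DUAL_READ = 3       # reads compare old and new, old is still the truth
    CUT_OVER_READS = 4  # reads come from new, writes still go to both
    DROP_OLD = 5        # writes stop going to old


class MigratingStore:
    def __init__(self, old: dict, new: dict, stage: MigrationStage):
        self.old, self.new, self.stage = old, new, stage
        self.divergences = []  # keys where old and new disagreed on read

    def write(self, key, value) -> None:
        if self.stage < MigrationStage.DROP_OLD:
            self.old[key] = value
        if self.stage >= MigrationStage.DUAL_WRITE:
            self.new[key] = value

    def read(self, key):
        if self.stage >= MigrationStage.CUT_OVER_READS:
            return self.new.get(key)
        value = self.old.get(key)
        if self.stage == MigrationStage.DUAL_READ and self.new.get(key) != value:
            self.divergences.append(key)  # in production: alert a human
        return value
```

Rolling back reads in step 4 is then a matter of setting the stage back to `DUAL_READ`, which only works because writes kept going to old the whole time. The backfill job and the write-time consistency check are left out of the sketch.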
This is a five-step migration that takes a quarter of calendar time on a meaningful system, and it saves the company every time. The alternative is a “big bang” migration where you take a maintenance window, stop the world, run a migration script, hope, and bring traffic back. We have been called in to clean up enough of those to have strong opinions. The big bang either works or it does not, and when it does not, the rollback is “restore from backup and pretend the last six hours did not happen,” which is not a rollback, it is a disaster.
Tooling for the database side: logical replication is the modern primitive (Postgres logical replication, Debezium with Kafka, Aurora’s built-in replication). For online schema changes that do not require a full new database, gh-ost and pt-online-schema-change are the workhorses on MySQL, and Postgres’s native ADD COLUMN with NULL default is usually enough on the Postgres side if you are careful about defaults and indexes.
The pitfall: the consistency check that runs forever. Teams set up the comparison job in step 2, it surfaces a handful of divergences, the team fixes the easy ones, and then there is a long tail of three or four divergences a week that nobody can root-cause and the migration stalls. The disciplined move is to set a budget at the start. We are willing to spend N engineer-weeks on dual-write before we either ship it or abandon it. Without a budget, the migration project becomes a permanent fixture of the team.
The second pitfall: the rollback path that nobody tested. Dual-write has a rollback at every step, but only if the application code can be flipped back to “read from old, write to old only” in a flag flip. If the new schema has columns that the old code does not know how to handle, or if the new schema has been live long enough that there is data in it that old code would corrupt, the rollback is gone. The fix is to keep the old code path live, tested, and runnable through the entire dual-write window.
The 5-Minute Rollback Target
The target: the team can roll back any production change in under five minutes. That is not the recovery time of the underlying incident, just the time from “we have decided to roll back” to “traffic is back on the old version.” Five minutes is the threshold below which a bad deploy stays an annoyance and above which it becomes an incident with public-facing impact, a status page update, and a post-mortem.
What does five minutes require? Three things.
One: a single command (or single button) that rolls back. Not a runbook with twelve steps that the on-call engineer has not read in three months. The rollback action is the rollback. If your rollback requires SSH-ing to a box, running a script, then redeploying a different artifact, then clearing a cache, you do not have a rollback, you have a recovery procedure. Recovery procedures take an hour. Five minutes requires a button.
Two: feature flags as the primary control plane for risky changes. The deploy is the boring part. The risky part is gated behind a flag, and the flag flips to off in a few seconds. This is why feature flag rollout is the connective tissue across every other pattern on this list. Canary is a percentage flag with a load balancer attached. Blue-green is a flag at the LB layer. Dual-write is a stack of flags at the read and write paths. The flag is the kill switch.
Three: no DDL that is hard to reverse. A column drop. A type narrowing. A constraint that the old code violates. Any of these takes the rollback away, because reversing them is an outage of its own. The discipline is expand-contract at every schema change. Expand always; contract only after the new code has been at 100% for long enough that nobody is rolling back to a version that needs the old shape.
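One way to keep hard-to-reverse DDL out of ordinary deploys is a crude lint over migration files. The sketch below is deliberately conservative and naive: the patterns are assumptions aimed at Postgres-style SQL, it flags every column type change (not only narrowings), and it splits on semicolons without understanding string literals.

```python
import re

# (pattern, reason) pairs for statements that are hard to roll back.
HARD_TO_REVERSE = [
    (re.compile(r"\bDROP\s+(TABLE|COLUMN)\b", re.IGNORECASE), "drops data"),
    (re.compile(r"\bALTER\s+COLUMN\b.*\bTYPE\b", re.IGNORECASE | re.DOTALL),
     "changes a column type"),
    (re.compile(r"\bSET\s+NOT\s+NULL\b", re.IGNORECASE),
     "adds a constraint old code may violate"),
]


def risky_statements(sql: str) -> list:
    """Return (statement, reason) for each statement that matches a risky pattern."""
    findings = []
    statements = [part.strip() for part in sql.split(";") if part.strip()]
    for statement in statements:
        for pattern, reason in HARD_TO_REVERSE:
            if pattern.search(statement):
                findings.append((statement, reason))
    return findings
```

A finding does not have to block the migration; it can simply require a sign-off that the contract step is happening on purpose, after the new code has been at 100% long enough.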
The drill: actually run a rollback during a non-incident, on a Tuesday, with a stopwatch. Pick a recent low-risk deploy, declare an exercise, and roll it back. Time it. If it takes longer than five minutes, the rollback is broken and you have learned this on a Tuesday instead of a Saturday. Run the drill quarterly at minimum. The teams that do this stop being surprised by their own systems.
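The stopwatch can be part of the tooling. A minimal sketch, assuming you can wrap your one-command rollback in a callable and have some way to ask whether traffic is back on the old version; `rollback`, `traffic_on_old`, and `timed_rollback` are hypothetical names, and the clock and sleep functions are injectable so the drill logic itself can be tested.

```python
import time


def timed_rollback(rollback, traffic_on_old, budget_s=300.0, poll_s=1.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Run the rollback and time it from decision to traffic on the old version.

    Returns (elapsed_seconds, within_budget).
    """
    start = clock()  # the moment "we have decided to roll back"
    rollback()
    while not traffic_on_old():
        if clock() - start > budget_s:
            return clock() - start, False  # stop waiting: the drill has failed
        sleep(poll_s)
    elapsed = clock() - start
    return elapsed, elapsed <= budget_s
```

Recording the elapsed time from each quarterly drill gives you a trend line, which is more useful than a single pass or fail.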
What This Looks Like in Practice
A few patterns from the rescue work in our case studies. Names redacted, shape preserved.
A fintech we picked up was three months into rewriting their pricing engine and had no cutover plan. The math in the new engine was right, but the old engine had a decade of edge-case patches baked in. We ran six weeks of shadow mode, found 23 divergences (14 bugs in the new code, 7 in the old, 2 legitimate differences the business had to resolve), and cut over with a feature flag rollout over another four weeks. Zero customer-visible regressions on cutover day.
A SaaS platform was on a single Postgres instance running out of headroom and needed to be split into per-tenant shards. The original plan was a weekend maintenance window. We replaced it with a five-step dual-write over eleven weeks. The migration finished on a Wednesday morning. Nobody noticed. That is the success criterion.
In one MVP rescue, the team had built a new auth system and wanted to ship it on a Friday. We ran a 1% canary for 48 hours, caught a token refresh edge case that only triggered for sessions over 18 hours old, then held at 10% for a week and went to 100% on a Tuesday afternoon. The Friday deploy would have shipped a logout bug to every long-session user.
Closing
The team that ships safely is not the team that takes the longest. It is the team that has drilled the rollback. Cutover is engineering, not paperwork. The patterns on this list are tools to pick from based on what the change requires, and the only way to know which tool fits is to have used each of them on something that mattered.
If your team is staring down a migration, a rewrite, or a cutover that nobody has slept well thinking about, the fastest way out is usually to bring in people who have shipped this pattern before. That is most of what our pods do. If the math of building this capability internally vs. renting it for a quarter is on your mind, our pod calculator is a reasonable place to start the comparison.
Ship the rollback first.