
Production-Grade Engineering: The 2026 Audit Checklist

Metafic Team, May 19, 2026

There are three lies a software team tells itself before launch. It works on my laptop. It works in staging. We tested it last Tuesday and nothing broke. None of those statements describe the system anyone will actually run on Monday morning, when real users hit it from three time zones at once with stale browser tabs, broken cookies, half-finished sessions from a deployment two weeks ago, and a Stripe webhook that has been retrying since 6am.

A production-ready checklist is not a vibe. It is a list of specific things that, if missing, will eventually wake someone up at 2am. We have done this audit on a lot of codebases that were “ready”. The pattern is consistent. The team built something that worked, built it fast, and built it well by the standards of a prototype. Then they shipped to real load and discovered the gap between functioning software and production software is roughly the gap between a working car engine on a workbench and a car that survives a Mumbai monsoon at rush hour.

The cost of skipping this audit is invisible until it isn’t. A six-figure customer churns because their data was wrong for three days and nobody noticed. A 4-hour outage costs a startup its Series A meeting. A rollback that should have taken five minutes takes four hours because nobody wrote the runbook. The audit below is the one we run before our pods hand a system back to a client. If you are asking “is my MVP ready for production”, read all seven sections honestly. Most teams fail at least three.

The seven sections of production-readiness

1. Observability

You cannot operate what you cannot see. The first thing we check on any audit is whether the team can answer four questions in under two minutes, from a dashboard, without SSH-ing into a box: request rate, error rate, p95 latency, saturation on the most contended resource. These are the four golden signals, and most teams have three out of four at best.

What to check: structured logs (JSON, not free-text), metrics flowing into a time-series store (Prometheus, Datadog, whatever, but flowing), distributed tracing across at least the synchronous service boundaries, and SLOs that are written down somewhere a human can read. Alerts fire on symptoms (users are seeing errors) not causes (CPU is at 80%). Cause alerts produce 3am pages nobody acts on, and within six weeks every alert is muted.

What good looks like: a new engineer joins, opens one dashboard, and within five minutes describes the current health of the system. A spike in 5xx responses fires an alert within 90 seconds. Every alert has a runbook link, and the runbook works. Logs have a correlation ID that ties a single user request through every service it touches.
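To make the structured-log and correlation-ID points concrete, here is a minimal sketch using only the Python standard library. The logger name `checkout`, the `correlation_id` context variable, and the ID value `req-123` are all hypothetical. In a real service, middleware would set the ID from an incoming request header and pass it on to downstream calls.

```python
import contextvars
import io
import json
import logging

# Hypothetical context variable holding the ID of the request being handled.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        })


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
correlation_id.set("req-123")  # normally set by middleware, not by hand
log.info("payment authorized")
entry = json.loads(buffer.getvalue())
```

Because every line is a JSON object carrying the same ID, a log search on that ID returns the request's full path across services, provided each service forwards the ID.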

Common failure modes: logging everything at INFO so the signal is buried under heartbeats. Metrics that measure the wrong thing (queue depth without queue age, request count without status code breakdown). Alerts on absolute thresholds that made sense in March and made no sense by September after the user base grew 5x. Traces that stop at the API gateway because the team never instrumented the async worker tier. Dashboards built by one engineer who left, full of red panels everyone has agreed to ignore.

2. Error handling

The defining feature of production code is that the unexpected has already happened to it. A prototype handles the happy path. Production code handles the database connection dropping, the third-party API returning a 502 at the worst moment, the queue backing up, the disk filling, and a malformed payload arriving that was technically never possible according to the spec.

What to check: every external call has an explicit timeout (no defaults, write the number). Every retry has exponential backoff with jitter, a maximum attempt count, and a circuit breaker behind it. Every write that can be retried has an idempotency key, so the customer is not charged twice when the network blips between the API gateway and the payment service. Every async job has a dead letter queue, and the dead letter queue has an alert on depth. No try/except block swallows an exception without logging it with full context.
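The retry rule above can be sketched as follows. This is an illustration under assumptions, not a drop-in client: the function name `call_with_retry` and its default numbers are made up, it retries on every exception (a real client would retry only errors known to be transient), and the timeout and circuit breaker are left out to keep it short.

```python
import random
import time


def call_with_retry(func, *, max_attempts=4, base_delay=0.2, max_delay=5.0,
                    sleep=time.sleep, rng=random.random):
    """Call func(); on failure retry with capped exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error, do not swallow it
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(rng() * cap)  # full jitter: wait a random time in [0, cap)


# Simulated dependency that fails twice, then recovers.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated blip")
    return "ok"


delays = []
# Injecting sleep and rng keeps the example deterministic and instant.
result = call_with_retry(flaky, sleep=delays.append, rng=lambda: 1.0)
```

The jitter is what keeps a fleet of clients from retrying in lockstep. The attempt cap is what keeps a short blip from turning into a retry storm.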

What good looks like: when a downstream dependency goes down, the system degrades. Reads still work. Writes queue. The UI shows a sensible message. Nothing crashes. When the dependency comes back, the queue drains and the system catches up without manual intervention. The error budget for the month is not blown by a single 20-minute incident.

Common failure modes: a single unhandled promise rejection in Node that takes down the worker. A retry storm that turns a five-second blip into a 30-minute outage because every service in the chain is retrying simultaneously. Idempotency that is not actually idempotent because the key is derived from a timestamp. Dead letter queues that nobody monitors, accumulating six months of failed jobs. The most common failure: a generic except Exception: pass block someone added during a Friday afternoon deploy. We find one of these in roughly 80% of the codebases we audit. Finding and fixing these is the bulk of what our pods do in MVP rescue engagements.
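The timestamp-derived key deserves a concrete counterexample. The sketch below derives the key from what the operation means (customer, order, amount) instead of when it was sent, so a retry produces the same key. `PaymentService` is an in-memory stand-in, and all names and values are hypothetical. Many real payment APIs instead have the client generate one random key per logical operation and reuse it on every retry, which works equally well.

```python
import hashlib
import json


def idempotency_key(customer_id: str, order_id: str, amount_cents: int) -> str:
    """Derive the key from the operation's identity, never from the send time."""
    payload = json.dumps(
        {"customer": customer_id, "order": order_id, "amount": amount_cents},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class PaymentService:
    """In-memory stand-in for a payment service with an idempotency table."""

    def __init__(self):
        self._results = {}
        self.charges_made = 0

    def charge(self, key: str, amount_cents: int) -> dict:
        if key in self._results:
            return self._results[key]  # replay: return the stored outcome
        self.charges_made += 1
        result = {"charge_id": f"ch_{self.charges_made}", "amount": amount_cents}
        self._results[key] = result
        return result


service = PaymentService()
key = idempotency_key("cust_1", "order_42", 1999)
first = service.charge(key, 1999)
second = service.charge(key, 1999)  # simulated retry after a network blip
```

The retry returns the original charge instead of creating a second one. A key built from a timestamp would differ on each attempt, and the customer would be charged twice.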

3. Security

Security is the section that hurts to read because nobody wants to find what is in it. The audit is uncomfortable on purpose. The premise: assume an attacker with reasonable motivation has 30 days, a copy of your stack, and your dependency list. What do they own?

What to check, starting with the basics that get missed most often: no secrets in a .env.example with real values commented out. No API keys in client-side bundles (grep the production JavaScript bundle for every key prefix your team uses, right now). All secrets live in a vault, KMS, or secrets manager, with audit logs on access. Rotation is automated, or at minimum documented with a calendar reminder. Every input is validated at the trust boundary, not three function calls deep. Every output that crosses into a different context (HTML, SQL, shell, JSON) is encoded for that context.
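The bundle grep can be automated with a few lines. This is a rough sketch, not a replacement for a dedicated secret scanner: the prefixes listed are examples only, the eight-character minimum is an arbitrary assumption, and the sample bundle string is invented.

```python
import re

# Example prefixes only; replace with whatever your providers and team use.
KEY_PREFIXES = ("sk_live_", "AKIA", "ghp_")


def find_suspect_keys(bundle_text: str, prefixes=KEY_PREFIXES) -> list:
    """Return substrings made of a known key prefix plus 8 or more key-like characters."""
    alternation = "|".join(re.escape(p) for p in prefixes)
    pattern = re.compile(rf"(?:{alternation})[A-Za-z0-9_]{{8,}}")
    return pattern.findall(bundle_text)


# Invented bundle fragment with one publishable-looking key and one secret-looking key.
sample_bundle = 'const a="pk_test_ok";fetch(u,{headers:{k:"sk_live_EXAMPLEONLY1234"}});'
hits = find_suspect_keys(sample_bundle)
```

Run against the built production bundle in CI, a non-empty result fails the build before the key reaches a browser.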

Authentication and authorization are separate concerns and live in separate code paths. Authentication answers “who are you”. Authorization answers “what are you allowed to do”. The number of breaches we have seen from an authenticated user calling an endpoint that should have required a specific role is, frankly, embarrassing. The fix is boring: a policy layer that runs on every request, evaluated against a clear set of rules, deny by default.
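A deny-by-default policy check can be very small. The sketch below assumes a flat table of allowed (role, action) pairs. The role and action names are hypothetical, and a real policy layer would usually also consider resource ownership and tenant.

```python
from typing import Optional

# Hypothetical policy table: only pairs listed here are allowed.
ALLOWED = {
    ("admin", "invoice:refund"),
    ("admin", "invoice:read"),
    ("member", "invoice:read"),
}


def is_authorized(role: Optional[str], action: str) -> bool:
    """Deny by default: anything not explicitly listed is refused."""
    return (role, action) in ALLOWED
```

The important property is the failure direction. A new endpoint that nobody added to the table is refused, not silently open to every authenticated user.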

The dependency tree is the next attack surface. Run a software bill of materials (SBOM) generator on the production build and run the dependency list through a vulnerability scanner weekly. A package with a known critical CVE that is six months old is a finding. Automate dependency updates with a sensible patch cadence, human review for major versions.

The OWASP Top 10 audit is not optional. SQL injection, broken auth, sensitive data exposure, XSS, broken access control, security misconfiguration, the whole list. Add security headers (Content-Security-Policy, Strict-Transport-Security, X-Frame-Options, Referrer-Policy) and verify them with an automated scan that runs in CI. If the team has never run a basic penetration test, run one. A half-day external review is trivial compared to a breach disclosure.
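The header check is easy to express as a CI assertion. The sketch below takes a plain dict of response headers, however your HTTP client produces it, and reports which of the four headers named above are missing. It checks presence only, not whether the values are sensible.

```python
REQUIRED_HEADERS = (
    "Content-Security-Policy",
    "Strict-Transport-Security",
    "X-Frame-Options",
    "Referrer-Policy",
)


def missing_security_headers(headers: dict) -> list:
    """Return required headers absent from a response, comparing names case-insensitively."""
    present = {name.lower() for name in headers}
    return [h for h in REQUIRED_HEADERS if h.lower() not in present]


# Invented response headers for illustration.
sample_response_headers = {
    "content-security-policy": "default-src 'self'",
    "Strict-Transport-Security": "max-age=31536000",
    "Content-Type": "text/html",
}
missing = missing_security_headers(sample_response_headers)
```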

What good looks like: secrets rotate automatically. Dependencies are patched within a defined SLA. The team has a security incident response plan, written down, with phone numbers in it. Every input is validated, every output is encoded, every endpoint has explicit authorization. There is a CVE alert pipeline that creates a ticket within 24 hours of a critical disclosure in a dependency.

Common failure modes: the JWT signing key sitting in a Docker image layer that was pushed to a public registry six months ago. A debug endpoint left enabled in production that returns the full environment. An S3 bucket that is “private” in the console but has a bucket policy that says otherwise. A rate limiter that operates per-instance instead of globally, so attackers just round-robin across the fleet.

4. Data

Data is where small mistakes become unfixable mistakes. Code can be redeployed. A botched migration that dropped a column on Friday and was discovered on Monday is, for the customers whose records were in that column, irreversible.

What to check: backup strategy with point-in-time recovery, tested by actually restoring to a fresh environment within the last 90 days (not “we have backups”, but “we have restored from backup and the system came up clean”). Defined RPO (recovery point objective) and RTO (recovery time objective). These are business decisions, signed off by someone with authority. If the company sells to enterprise customers and the answer to “what is our RPO” is a shrug, the next procurement questionnaire will be painful.

Schema migrations should be forward- and backward-compatible. A deploy of new code should not break old code, and vice versa. The pattern is well-known: expand the schema, deploy the code that writes to both old and new, backfill, deploy the code that reads from new, then contract the schema in a later release. Skipping any of these steps makes rollback impossible, which means the team will not roll back even when they should.
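The expand-and-contract sequence can be walked through on a toy table. The sketch below uses an in-memory SQLite database and a hypothetical rename of `name` to `full_name`. On a real system each numbered step would be its own deploy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('Ada Lovelace')")

# 1. Expand: add the new column as nullable, so old code keeps working.
conn.execute("ALTER TABLE users ADD COLUMN full_name TEXT")


# 2. Dual-write: new code writes both columns; old code can still read `name`.
def create_user(name: str) -> None:
    conn.execute("INSERT INTO users (name, full_name) VALUES (?, ?)", (name, name))


create_user("Grace Hopper")

# 3. Backfill rows written before the dual-write deploy.
conn.execute("UPDATE users SET full_name = name WHERE full_name IS NULL")

# 4. New code reads from the new column.
names = [row[0] for row in conn.execute("SELECT full_name FROM users ORDER BY id")]

# 5. Contract (dropping `name`) belongs in a later release,
#    once no deployed code reads the old column.
old_column_gaps = conn.execute(
    "SELECT COUNT(*) FROM users WHERE name IS NULL"
).fetchone()[0]
```

Until step 5, the old column is still fully populated, so rolling back to the previous release at any point finds the data it expects.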

Data retention is a question with a legal answer. Map out every piece of personally identifiable information (PII) the system stores. Know where it lives, who can access it, how long it is kept, and what happens when a user requests deletion under GDPR or CCPA. The PII inventory is a real document, updated when the schema changes, reviewed by someone who is not an engineer. Encryption at rest and in transit is table stakes. Column-level encryption for payment data, health records, or government IDs is the next level, and the right call for any regulated workload.

What good looks like: backups are tested monthly via automated restore. Migrations are reviewed by two engineers before merge. The PII inventory is current. A deletion request from a user can be fulfilled within the legally required window without a panic.

Common failure modes: a production database with no backup, or with backups going to the same physical region as the primary. A migration that runs ALTER TABLE on a 100-million-row table during peak hours and locks reads for 20 minutes. A “soft delete” that leaves PII in the database forever, in violation of the retention policy. Encryption keys stored in the same database as the encrypted data.

5. Cutover

Cutover is the moment of maximum risk: pushing new code to production, taking traffic, and (when things go wrong) backing it out. The team that gets cutover right will outlive the team that gets it almost right.

What to check: every deploy is behind a feature flag, or behind a blue-green or canary mechanism that shifts traffic gradually. A new feature ships to 1% of traffic, then 10%, then 50%, then 100%, with metrics evaluated at each step. The rollback target is five minutes from “we have a problem” to “the problem is gone”. If the team’s rollback takes longer, it will not happen during an incident, because by the time the decision is made the on-call engineer will be told to “just push a fix forward”. Pushing fixes forward at 3am is how outages turn into outages-plus-data-corruption.
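A gradual rollout needs a stable way to decide who is in the 1%. One common approach, sketched here with hypothetical names, hashes the flag name and user ID into one of 100 buckets. The same user always lands in the same bucket, so raising the percentage only adds users and never flips anyone back.

```python
import hashlib


def in_rollout(user_id: str, flag: str, percent: int) -> bool:
    """Deterministically place a user in one of 100 buckets for this flag."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


users = [f"user-{i}" for i in range(1000)]
at_1 = {u for u in users if in_rollout(u, "new-checkout", 1)}
at_10 = {u for u in users if in_rollout(u, "new-checkout", 10)}
```

Including the flag name in the hash means different flags pick different users, so the same unlucky 1% does not receive every experiment.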

Every deployment has a runbook. The runbook says what is being deployed, what could go wrong, how to detect it, and how to roll back. It is written by a human who has thought about the deployment, reviewed in the deploy PR.

For database changes, dual-write patterns are non-negotiable on anything customer-facing. The application writes to both old and new schemas during the transition, reads from old, then reads from new, then writes only to new. This sequence allows rollback at every step. Cutting over in a single release leaves only one path forward when something breaks: roll forward, into a postmortem.

What good looks like: deploys happen multiple times a day, calmly, with one person watching dashboards. A bad deploy is detected within minutes and reversed within five. The team has rolled back this month and it was a non-event. Feature flags are cleaned up within 30 days of full rollout, so the codebase does not become a museum of dead flags.

Common failure modes: a deploy script that does too much at once (schema migration, code deploy, cache flush, config change) so when it fails partway through, the system is in an undefined state. A feature flag system that fails closed instead of open during an outage, taking down the application when the flag service goes down. A runbook from 2024 that references a service that was renamed last year.
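The flag-service failure mode has a simple guard: the lookup falls back to a known default instead of raising. The sketch below assumes a hypothetical `fetch_flag` callable standing in for whatever client the flag provider offers. The right default for each flag is a product decision, not something the code can choose.

```python
import logging

logger = logging.getLogger("flags")


def flag_enabled(fetch_flag, name: str, default: bool = False) -> bool:
    """Ask the flag service; on any failure return a known default instead of raising."""
    try:
        return bool(fetch_flag(name))
    except Exception:
        # Log with context so the outage is visible, then degrade.
        logger.warning("flag lookup failed for %s; using default %s",
                       name, default, exc_info=True)
        return default


def flag_service_down(name):
    raise TimeoutError("simulated flag service outage")


during_outage = flag_enabled(flag_service_down, "new-checkout", default=False)
healthy = flag_enabled(lambda name: True, "new-checkout", default=False)
```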

6. Performance

Performance is not the same as speed. Performance is whether the system holds its latency budget under realistic load, with realistic data, with realistic concurrency. A benchmark that runs 10,000 RPS against an endpoint that returns “hello world” tells you nothing.

What to check: a load test against the actual production shape of traffic. Real endpoints, real payload sizes, real concurrency patterns, real database state (anonymized production data or a generated dataset of the same scale). The test runs against the production environment, or a copy with identical infrastructure, not the dev cluster on a shared MacBook. Latency budgets are written down per endpoint, broken out by p50, p95, and p99. The p99 is the one that matters for user perception; the p50 is the one teams optimize accidentally.
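A small worked example shows why the budget needs all three percentiles. The sketch below uses the nearest-rank definition and an invented sample of 100 latencies in which two requests are slow. Monitoring tools often use interpolated or approximate percentiles, so their numbers can differ slightly.

```python
def percentile(samples, p: int):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, (p * len(ordered) + 99) // 100)  # integer ceiling of p% of n
    return ordered[rank - 1]


# Invented latencies in milliseconds: 98 fast requests and 2 slow ones.
latencies_ms = [20] * 98 + [900, 1500]
p50, p95, p99 = (percentile(latencies_ms, p) for p in (50, 95, 99))
```

Here p50 and p95 both come out at 20 ms while p99 is 900 ms. A dashboard showing only the median would report a healthy endpoint while one request in fifty takes close to a second or more.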

Database query analysis on every endpoint in the request path. No N+1 queries. EXPLAIN plans on the slow query log, reviewed weekly. Indexes that match actual access patterns, not the patterns from 18 months ago. The slow query log is empty most days, and when it isn’t, someone fixes it.
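For readers who have not met the N+1 pattern by name, here is a toy reproduction on an in-memory SQLite database. The schema and the `run` helper that counts statements are invented for the example. The same shape shows up behind most ORMs when a loop touches a lazy relationship.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE items (id INTEGER PRIMARY KEY, order_id INTEGER, sku TEXT);
    INSERT INTO orders (id, customer) VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO items (order_id, sku) VALUES (1, 'x'), (1, 'y'), (2, 'z');
""")

query_count = 0


def run(sql, params=()):
    """Execute a statement and count it, so the two approaches can be compared."""
    global query_count
    query_count += 1
    return conn.execute(sql, params).fetchall()


def items_n_plus_one():
    # One query for the orders, then one more query per order.
    result = {}
    for (order_id,) in run("SELECT id FROM orders ORDER BY id"):
        rows = run("SELECT sku FROM items WHERE order_id = ? ORDER BY id", (order_id,))
        result[order_id] = [sku for (sku,) in rows]
    return result


def items_single_query():
    # One joined query, grouped in application code.
    result = {}
    rows = run("SELECT o.id, i.sku FROM orders o "
               "LEFT JOIN items i ON i.order_id = o.id ORDER BY o.id, i.id")
    for order_id, sku in rows:
        result.setdefault(order_id, [])
        if sku is not None:
            result[order_id].append(sku)
    return result


query_count = 0
slow = items_n_plus_one()
slow_queries = query_count

query_count = 0
fast = items_single_query()
fast_queries = query_count
```

With three orders the difference is four queries against one. With 10,000 orders the loop issues 10,001 round trips for the same result.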

Caching has an explicit invalidation strategy. “Cache for five minutes and hope” is not a strategy; it is the reason customers see stale data after they update their profile. The cache key includes a version, the cache is invalidated on write, and the cache miss path is tested under load so a cold cache does not collapse the database. A capacity model exists: when traffic doubles, here is what saturates first, here is the runway, here is the next bottleneck.
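The versioned-key and invalidate-on-write ideas can be sketched with dicts standing in for the database and the cache. `ProfileStore` and the `v2` version label are hypothetical. A real deployment with a shared cache also has to handle the race between a concurrent read and write, which this single-process example does not show.

```python
class ProfileStore:
    """In-memory stand-ins for a database and a cache, to show invalidate-on-write."""

    KEY_VERSION = "v2"  # bump when the cached shape changes

    def __init__(self):
        self.db = {}
        self.cache = {}
        self.db_reads = 0

    def _key(self, user_id):
        return f"profile:{self.KEY_VERSION}:{user_id}"

    def get(self, user_id):
        key = self._key(user_id)
        if key in self.cache:
            return self.cache[key]
        self.db_reads += 1  # cache miss: this is the path to load test
        value = self.db[user_id]
        self.cache[key] = value
        return value

    def update(self, user_id, profile):
        self.db[user_id] = profile
        self.cache.pop(self._key(user_id), None)  # invalidate on write


store = ProfileStore()
store.update("u1", {"name": "Ada"})
store.get("u1")
store.get("u1")                      # served from cache
reads_before_update = store.db_reads
store.update("u1", {"name": "Ada L."})
after_update = store.get("u1")       # fresh value, not the stale one
```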

What good looks like: the team knows their headroom. They can answer “what is our breaking point” with a number. A spike that doubles requests is a non-event because autoscaling already accounted for it.

Common failure modes: a beautifully fast endpoint in isolation that collapses at 50 concurrent users because of lock contention on a single row. A cache that returns stale data because the invalidation logic has a race condition. A “read replica” that is actually four seconds behind the primary, causing UI bugs that are impossible to reproduce locally. The classic: a single Redis instance that becomes the entire system’s bottleneck and is held together with prayer.

7. Testing

Coverage is not the metric. We have seen 100% line coverage on codebases that ship critical bugs every release, because the tests assert functions were called, not that the system behaved correctly. Coverage tells you what code ran, not what it did.
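The difference between asserting a call and asserting behavior is easiest to see side by side. In the sketch below, every name is hypothetical and the `login` function is buggy on purpose. The first test executes every line of it and passes. Only the second test, which checks the outcome, notices that a wrong password still logs the user in.

```python
from unittest.mock import MagicMock


class SessionStore:
    def __init__(self):
        self.active = set()

    def start(self, user):
        self.active.add(user)


def login(user, password, check_password, sessions):
    """Buggy on purpose: the result of the password check is ignored."""
    check_password(user, password)
    sessions.start(user)


# Weak test: full line coverage, asserts only that a collaborator was called.
mock_sessions = MagicMock()
login("ada", "correct", lambda u, p: True, mock_sessions)
mock_sessions.start.assert_called_once_with("ada")  # passes despite the bug
weak_test_passed = True

# Behavioral test: asserts what the user would experience.
real_sessions = SessionStore()
login("ada", "wrong", lambda u, p: False, real_sessions)
wrong_password_logged_in = "ada" in real_sessions.active  # True exposes the bug
```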

What to check: critical paths are tested end-to-end. The five most important user journeys (signup, login, the core action, billing, account deletion) have automated tests that run in CI against a production-like environment. Contract tests sit at every service boundary so a change in one service that breaks a consumer is caught at build time, not at runtime.

For systems that are mission-critical from day one (anything handling money, anything with regulatory exposure, anything with significant downstream dependencies), chaos testing belongs in the launch checklist, not on the someday list. Kill a database replica during a load test. Inject latency into the third-party API call. Drop the cache layer. The system degrades gracefully, recovers automatically, and the team knows what happens because they have already seen it.

What good looks like: CI runs the critical path tests in under 15 minutes. A failing test blocks the merge, and the team treats a flaky test as a bug, not background noise. Manual QA complements automation rather than serving as the only safety net. (Our pods include a dedicated manual QA engineer for exactly this reason: automation catches regressions on known paths, and humans find the bugs automation cannot.) The team has practiced a database failover and knows what happens to in-flight transactions.

Common failure modes: a 90-minute test suite, so nobody runs it locally and main stays broken for hours at a time. Flaky tests retried automatically until they pass, hiding real bugs in race conditions. Mocks that diverge from the real service so the tests pass and production fails. The 100%-coverage codebase that ships a regression in the login flow because the test asserted the login function was called, not that the user was actually logged in.

The pre-launch dry run

Before flipping the production switch, run the system in shadow mode for a week. The mechanics: take real production traffic from the existing system (or a representative sample of synthetic traffic that matches the production shape) and replay it against the new system, in parallel, with the responses discarded. The new system sees real load. The customers see nothing. The team sees everything.
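The shadow mechanics reduce to a small wrapper. The sketch below is synchronous and in-process for brevity, with hypothetical `primary` and `shadow` callables. A real setup would mirror traffic out of band so the shadow cannot add latency, and would make sure the shadow system cannot send emails, charge cards, or write to shared stores.

```python
def handle_with_shadow(request, primary, shadow, mismatches):
    """Serve from primary; run shadow on the same request and record any difference."""
    response = primary(request)
    try:
        shadow_response = shadow(request)
        if shadow_response != response:
            mismatches.append((request, response, shadow_response))
    except Exception as exc:
        # A shadow failure is data for the team, never an error for the user.
        mismatches.append((request, response, repr(exc)))
    return response  # the caller only ever sees the primary's answer


def old_system(request):
    return request["qty"] * 2


def new_system(request):
    if request["qty"] > 1000:  # simulated bug on an unusually large request
        raise ValueError("too many line items")
    return request["qty"] * 2


mismatches = []
responses = [
    handle_with_shadow(r, old_system, new_system, mismatches)
    for r in ({"qty": 1}, {"qty": 50000})
]
```

The oversized request is served normally by the existing system and lands in the mismatch list for the new one. A week of shadow traffic is meant to surface exactly this kind of finding.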

Shadow mode catches bugs that load tests miss because real traffic has a long tail of weirdness no synthetic test generates. The malformed payload from an iOS app version nobody on the team still runs. The customer with 50,000 line items in a single request. The webhook that arrives 90 minutes late because Stripe was queuing it during their own incident. After a week of shadow traffic, the metrics dashboards are honest, the error rates are real, the latency distribution is the one users will actually see.

During the same week, run a tabletop incident exercise. Pick three plausible failure scenarios (the database goes down, the payment provider returns 500s for an hour, a key dependency is breached and needs to be rotated). Walk through each with the team, in a room, runbooks open. Time the response. Find the gaps. Fix them before launch, not during the first incident.

Teams that skip the dry run discover their production dashboards under stress for the first time during a real outage. Teams that do the dry run have already seen the system fail, in controlled conditions, with the on-call rotation watching. The difference shows up in the postmortems for the next 18 months.

Closing

Production-readiness is not a checkbox. The seven sections above are not a launch gate signed off once and forgotten. They are a posture: a way of building software that assumes things will go wrong and engineers for that assumption from the start. A team that audits itself against this list every quarter, honestly, ends up with a system that survives growth, scale, adversarial users, and the inevitable Monday morning when something nobody anticipated happens.

Teams that skip the audit usually call someone else to do it for them, after the first serious incident, at considerable expense. We have run this audit on dozens of post-MVP codebases that were “almost ready”. The pattern in the rescue work documented in our case studies is consistent: the team built something good, hit production, discovered the gap, and needed a senior pod to close it before the next funding round. The fix is rarely a rewrite. It is usually three weeks of disciplined work against a checklist exactly like this one.

If you are weighing whether to staff this in-house, contract it out, or stitch together freelancers, the comparison against hiring and against agency models like Toptal is worth a read. The right answer depends on the system, the timeline, and the cost of getting it wrong. The wrong answer is to skip the audit and hope. Hope is the absence of an engineering practice.
