
Case Study: Rebuilding a SaaS Monolith Into Microservices in 8 Weeks

Metafic Team, May 5, 2026

CloudMetrics had 47,000 paying users on a Rails monolith that took 45 minutes to deploy. Every Tuesday, their VP of Engineering blocked off the afternoon for “deploy and pray.” A bug in the billing module once took down event ingestion for 22 minutes, which meant 22 minutes of lost analytics data for every customer on the platform.

They had $22M in Series B funding and a contractual commitment to ship real-time dashboards by Q3. The monolith made that impossible.

What Was Actually Wrong

The entire application was one Rails process talking to one PostgreSQL database. Six distinct domains (ingestion, processing, reporting, billing, auth, and the API layer) lived in the same repo, the same deploy pipeline, and the same runtime.

The internal team had tried to break it apart twice. Both times, customer tickets and feature requests pulled engineers back to product work within two weeks. The CTO told us: “We kept starting the migration on Monday and abandoning it by Thursday.”

That pattern is common. The team that maintains the product cannot also rewrite the product. You need a second team.

How We Sequenced the Migration

We deployed a pod: one architect, three backend engineers, one DevOps engineer, and a PM. The pod was writing code by day four.

The critical decision was extraction order. Most teams start with something easy, to build confidence. We started with billing, then auth, then analytics. Here is why.

Week 1: Domain mapping and strangler fig setup

The architect spent five days reading code and drawing boundaries. We set up an nginx-based API gateway in front of the monolith. All traffic flowed through it, but initially it just proxied everything to Rails. This gave us the routing layer we would need to redirect traffic service by service.

We provisioned the target infrastructure using Terraform: an EKS cluster on AWS with three node groups, a shared RDS Aurora cluster (temporary, for the transition period), and Datadog for observability across both old and new systems.

Weeks 2-3: Billing extraction

Billing went first because it had the clearest domain boundary and the most dangerous coupling. A billing bug should never take down event ingestion. We extracted it into a standalone service (still Ruby, since the logic was well-tested and did not need a rewrite). Stripe webhooks pointed to the new service. The gateway routed /billing/* to the new service and everything else to Rails.
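The routing rule in that last step is the core of the strangler fig pattern, and it is small enough to sketch. The snippet below is an illustration in Python, not the nginx configuration from the project; the upstream names `billing-service` and `rails-monolith` and the `pick_upstream` helper are hypothetical.

```python
# Sketch of strangler-fig routing: extracted path prefixes go to new services,
# everything else falls through to the monolith.
# Upstream names are hypothetical placeholders, not real hostnames.

MONOLITH = "rails-monolith"

# One prefix is added each time a service is extracted.
ROUTES = {
    "/billing": "billing-service",
}


def pick_upstream(path: str, routes: dict = ROUTES) -> str:
    """Return the upstream for a request path, preferring the longest matching prefix."""
    best = ""
    for prefix in routes:
        # Match "/billing" and "/billing/..." but not "/billing-export".
        if path == prefix or path.startswith(prefix + "/"):
            if len(prefix) > len(best):
                best = prefix
    return routes[best] if best else MONOLITH


if __name__ == "__main__":
    for p in ("/billing/invoices/42", "/events", "/billing-export"):
        print(p, "->", pick_upstream(p))
```

Defaulting to the monolith is the important design choice here: any path that has not been explicitly claimed by a new service keeps behaving exactly as it did before the migration started.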

Week 4: Auth extraction

Auth was next because every subsequent service needed it. We built a lightweight auth service in Go that issued JWTs. The gateway validated tokens, which meant downstream services did not need to know about auth at all. We ran the old session-based auth and the new token-based auth in parallel for 72 hours, comparing results on every request. Three edge cases surfaced: sessions that had expired but that the old system still treated as valid, a timezone bug in token expiry, and a race condition when users changed passwords mid-session.
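For readers who have not worked with JWTs, here is a minimal sketch of issuing and validating an HS256-signed token using only the Python standard library. It is not the Go service from this project. The function names, the claim set (`sub`, `exp`), and the shared-secret setup are assumptions for illustration. Expiry is stored as Unix epoch seconds, which is one common way to keep timezones out of expiry checks; we are not claiming it is how the bug above was fixed.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url(data: bytes) -> str:
    # JWTs use URL-safe base64 without padding.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    # Restore the padding that was stripped when encoding.
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input: str, secret: bytes) -> bytes:
    return hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()


def issue_token(user_id: str, secret: bytes, ttl_seconds: int,
                now: Optional[float] = None) -> str:
    """Issue a token whose expiry is epoch seconds, independent of any timezone."""
    now = time.time() if now is None else now
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {"sub": user_id, "exp": int(now) + ttl_seconds}
    signing_input = (_b64url(json.dumps(header).encode("utf-8")) + "."
                     + _b64url(json.dumps(payload).encode("utf-8")))
    return signing_input + "." + _b64url(_sign(signing_input, secret))


def validate_token(token: str, secret: bytes,
                   now: Optional[float] = None) -> Optional[str]:
    """Return the user ID if the signature is valid and the token is unexpired, else None."""
    now = time.time() if now is None else now
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
        # The algorithm is fixed to HS256 here; the header's "alg" is never trusted.
        expected = _sign(header_b64 + "." + payload_b64, secret)
        if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
            return None
        payload = json.loads(_b64url_decode(payload_b64))
    except ValueError:
        # Covers malformed structure, bad base64, bad JSON, and non-ASCII input.
        return None
    if not isinstance(payload, dict) or payload.get("exp", 0) <= now:
        return None
    return payload.get("sub")
```

In a gateway setup like the one described, only `validate_token` would run on the request path; downstream services would receive the already-verified user ID.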

Weeks 5-6: Analytics extraction (the big one)

This was the service that mattered most. We rebuilt the ingestion pipeline in Go because Ruby could not handle the throughput target (1 billion events/day in load testing). Events landed in a Kafka topic, were validated and enriched by a consumer group, and were then written to a new TimescaleDB instance. The reporting layer read from TimescaleDB instead of scanning the main PostgreSQL database.
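The validate-and-enrich step is easiest to reason about as a pure function that sits between the Kafka consumer and the database writer. The sketch below shows that shape in Python. It is not the production Go code, and the event schema (`event_id`, `customer_id`, `name`, `timestamp`) and the `received_at` enrichment field are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional

# Hypothetical schema, for illustration only.
REQUIRED_FIELDS = ("event_id", "customer_id", "name", "timestamp")


def validate_and_enrich(raw: dict, received_at: datetime) -> Optional[dict]:
    """Return an enriched copy of the event, or None if it should be rejected.

    `received_at` is assumed to be a timezone-aware datetime.
    """
    # Reject events with missing or empty required fields.
    if any(not raw.get(field) for field in REQUIRED_FIELDS):
        return None
    try:
        ts = datetime.fromisoformat(raw["timestamp"])
    except (TypeError, ValueError):
        return None
    if ts.tzinfo is None:
        # Reject naive timestamps rather than guess a timezone.
        return None
    event = dict(raw)  # copy, so the caller's dict is not mutated
    event["timestamp"] = ts.astimezone(timezone.utc).isoformat()
    event["received_at"] = received_at.astimezone(timezone.utc).isoformat()
    return event
```

Keeping this step free of I/O means it can be unit-tested without a broker or a database, and the consumer loop around it stays trivial.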

Week 7: Cleanup and database separation

We moved user management out of the monolith and finished separating the databases. Each service got its own schema, then its own database instance. We used Datadog APM to trace every cross-service call and verify latency budgets.

Week 8: Parallel run and cutover

We ran both architectures simultaneously for five days. A comparison worker checked outputs from old and new systems on a sample of requests. It caught three data consistency bugs: a rounding difference in revenue aggregation, a missing timezone conversion on event timestamps, and a pagination edge case where the old system returned one extra row.
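A comparison worker of this kind has two parts: deciding which requests to sample, and diffing the two responses. Here is a minimal Python sketch of both. It is not the worker we ran. The function names, the `revenue` field, and the half-cent tolerance are assumptions chosen to illustrate how rounding and row-count differences can be surfaced.

```python
import hashlib
import math


def in_sample(request_id: str, percent: int) -> bool:
    """Deterministically select roughly `percent`% of requests by hashing the ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100 < percent


def diff_responses(old: dict, new: dict,
                   money_fields=("revenue",), tolerance=0.005) -> list:
    """Return human-readable differences between the old and new response bodies."""
    problems = []
    for key in sorted(set(old) | set(new)):
        if key not in old or key not in new:
            problems.append(f"{key}: present in only one system")
        elif key in money_fields:
            # Flag monetary differences larger than the tolerance.
            if not math.isclose(old[key], new[key], rel_tol=0.0, abs_tol=tolerance):
                problems.append(f"{key}: {old[key]!r} vs {new[key]!r}")
        elif (isinstance(old[key], list) and isinstance(new[key], list)
              and len(old[key]) != len(new[key])):
            # Row-count mismatches get their own message (e.g. pagination bugs).
            problems.append(f"{key}: {len(old[key])} rows vs {len(new[key])} rows")
        elif old[key] != new[key]:
            problems.append(f"{key}: {old[key]!r} != {new[key]!r}")
    return problems
```

Hashing the request ID, rather than sampling at random, means the same request is always in or out of the sample, which makes a reported mismatch reproducible.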

The cutover happened on a Tuesday at 10am ET. Zero downtime.

The Numbers

Deploy time went from 45 minutes to 4 minutes. Individual services deploy independently, so a billing fix does not require redeploying the analytics pipeline.

The team went from one deploy per week to 8-12 deploys per day.

Event processing capacity hit the 5x target. The Go ingestion service handled 1.1 billion events/day in load testing without autoscaling.
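To put the load-test figure in per-second terms: a day has 86,400 seconds, so 1.1 billion events/day averages out to roughly 12,700 events per second. The helper below is an illustrative calculation only, and its name is our own. Real traffic is bursty, so peak rates will be higher than this average.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_events_per_second(events_per_day: float) -> float:
    """Convert a daily event count into an average per-second rate."""
    return events_per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    print(round(average_events_per_second(1.1e9)))
```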

Onboarding time for new engineers dropped from three weeks to one. Smaller codebases with clear boundaries are just easier to learn.

The internal team shipped real-time dashboards four weeks after the migration finished, well within the contractual deadline.

What We Would Do Differently

We should have set up the Datadog comparison dashboards in week 1, not week 6. Earlier visibility into behavioral differences between old and new systems would have caught issues sooner.

The temporary shared database in weeks 2-4 created some locking issues under load. In retrospect, we should have moved to database-per-service from the start, even if it meant more data sync work early on.

Four Months Later

CloudMetrics has onboarded 40% more customers and processes 3x their pre-migration event volume. We are still working with them, now building multi-tenant data isolation so their enterprise customers can get dedicated processing pipelines.

If your SaaS platform is stuck behind a monolith that takes 45 minutes to deploy, we have done this before. We can talk specifics about your architecture and give you an honest assessment of the timeline.