# From Vibe-Coded MVP to Production-Grade SaaS in 6 Weeks: A Case Study
A founder shipped an MVP with Cursor in 2 weeks. It got traction. Then it started breaking. Here's how we rescued it and made it production-grade.
A founder shipped a SaaS MVP with Cursor in 2 weeks. The product was a project management tool for construction companies — think Asana, but with industry-specific features like permit tracking, subcontractor management, and daily site reports.
He posted it on LinkedIn. Three construction companies signed up that week. Within a month, he had 47 paying users and $4,200 MRR.
Then it started breaking.
## The "before" snapshot
When he came to us, the codebase was 100% AI-generated. Cursor + Next.js 14 + Supabase + Stripe. Here's what we found during the initial audit.
### Authentication: Leaking across tenants
The RLS (Row Level Security) policies were checking auth.uid() but not the organization. User A from Company X could see User B's data from Company Y by modifying the API request. This is the same cross-tenant data exposure bug we described in our codebase audit post, and it shows up in almost every AI-generated multi-tenant app.
In this case, 47 paying companies were sharing a database with no tenant isolation. Any user who knew how to open browser dev tools could access any other company's project data, financials, and subcontractor information.
### Error handling: None
The entire application had zero try-catch blocks. Zero error boundaries. Zero fallback UI. When the Supabase connection timed out (which happened during peak hours because of the connection pooling issue below), users saw a raw Next.js error page with a stack trace.
In production. With paying customers.
### Webhook handling: Race conditions
The Stripe webhook endpoint had no signature verification and no idempotency handling. Stripe's retry mechanism was firing duplicate events, causing some users to be charged twice and others to have their subscription status flip between "active" and "canceled" multiple times per hour.
The founder was manually fixing subscription statuses in the Supabase dashboard every morning. He'd been doing this for three weeks.
### N+1 queries: 8-second page loads
The project dashboard — the page every user hits first — was making 47 individual database queries to render a single page: one query for the project list, then one query per project for the latest activity, then one query per project for the team members.
The pattern is 1 + 2N queries for N projects, so even a user with only 15 projects triggers 31 queries to render one dashboard. Page load: 8.2 seconds. On mobile: timeout.
```typescript
// What Cursor generated — classic N+1
export async function getProjects(userId: string) {
  const { data: projects } = await supabase
    .from("projects")
    .select("*")
    .eq("user_id", userId);

  // N+1: one query per project for activity
  for (const project of projects) {
    const { data: activity } = await supabase
      .from("activity_logs")
      .select("*")
      .eq("project_id", project.id)
      .order("created_at", { ascending: false })
      .limit(5);
    project.recentActivity = activity;
  }

  // N+1 again: one query per project for members
  for (const project of projects) {
    const { data: members } = await supabase
      .from("project_members")
      .select("*, users(*)")
      .eq("project_id", project.id);
    project.members = members;
  }

  return projects;
}
```
### Connection pooling: Not configured
Supabase's default connection limit is 60 for the Pro plan. The app was opening a new connection for every request and never closing them. During peak hours (7-9 AM when construction teams start their day), connections exhausted within minutes. The entire app went down for all users.
This happened 3-4 times per week. The founder would restart the Supabase project from the dashboard, which would kill all connections and temporarily fix it — until the next morning.
### Monitoring: Zero visibility
No error tracking. No performance monitoring. No uptime monitoring. No alerting. The founder learned about outages from angry customer emails. Sometimes hours after the outage started.
## The numbers
| Metric | Before |
|---|---|
| Page load (dashboard) | 8.2 seconds |
| Errors per day | ~120 |
| Uptime (30-day) | ~94% |
| Connection pool exhaustion | 3-4x/week |
| Webhook failures | ~20% of events |
| Time to detect outages | 1-3 hours |
| Security vulnerabilities | 7 critical |
## Week 1-2: Security audit and critical fixes
We don't start with refactoring. We start with the things that can destroy the business overnight.
### Day 1-2: Threat assessment
We mapped every API endpoint, every database table, and every RLS policy. The full threat assessment found 7 critical vulnerabilities:
- Cross-tenant data access (no org-level RLS)
- Stripe webhook endpoint accepting unverified requests
- Supabase service role key exposed in client-side code
- No input validation on any API route
- File upload endpoint accepting any file type with no size limit
- Password reset flow with no rate limiting
- Admin API routes with no role-based access control
### Day 3-5: Auth and tenant isolation
We rewrote every RLS policy with organization-level checks:
```sql
-- Before: user-level only (broken for multi-tenant)
CREATE POLICY "select_projects" ON projects
  FOR SELECT USING (auth.uid() = created_by);

-- After: organization-level isolation
CREATE POLICY "select_projects" ON projects
  FOR SELECT USING (
    EXISTS (
      SELECT 1 FROM org_members
      WHERE org_members.user_id = auth.uid()
        AND org_members.org_id = projects.org_id
    )
  );
```
Every table got the same treatment. We also added a test suite that attempts cross-tenant access for every RLS policy — if any test passes, the CI pipeline fails.
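The intent of that policy is easy to mirror in plain TypeScript, and it's also the shape of the cross-tenant tests: a sketch for illustration only, with hypothetical types (`OrgMember`, `Project`) rather than names from the actual codebase.

```typescript
// Illustrative sketch of the org-level check the RLS policy enforces.
// Types and names here are hypothetical, not from the actual codebase.
interface OrgMember {
  userId: string;
  orgId: string;
}

interface Project {
  id: string;
  orgId: string;
}

// Mirrors: EXISTS (SELECT 1 FROM org_members
//                  WHERE user_id = auth.uid() AND org_id = projects.org_id)
function canSelectProject(
  userId: string,
  project: Project,
  orgMembers: OrgMember[]
): boolean {
  return orgMembers.some(
    (m) => m.userId === userId && m.orgId === project.orgId
  );
}
```

A cross-tenant test then asserts the negative case: a user who belongs only to Company X must get nothing back when querying Company Y's projects.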
### Day 6-8: Webhook security
We implemented proper Stripe webhook handling with signature verification, idempotency, and event ordering. The same pattern from our auth loops post: verify the signature, check for duplicate events, process idempotently.
We also backfilled subscription statuses from Stripe's source of truth. 6 users had incorrect subscription states in the database. All 6 were paying customers who had been silently downgraded to the free tier by duplicate webhook events.
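The idempotency half of that pattern can be sketched without the Stripe SDK. In the real handler the event object comes from Stripe's `stripe.webhooks.constructEvent()`, which also verifies the signature; the sketch below models only the duplicate suppression, with hypothetical names (`handleEvent`, `processedEventIds`):

```typescript
// Sketch of idempotent webhook processing. Stripe event IDs are stable
// across retries, so a processed-events record is enough to deduplicate.
interface WebhookEvent {
  id: string;
  type: string;
}

// Stand-in for a processed_events table; a real app persists this in the DB
// so deduplication survives restarts and works across server instances.
const processedEventIds = new Set<string>();

function handleEvent(
  event: WebhookEvent,
  apply: (event: WebhookEvent) => void
): "processed" | "duplicate" {
  if (processedEventIds.has(event.id)) {
    // Stripe retried delivery; the event was already applied once
    return "duplicate";
  }
  apply(event);
  processedEventIds.add(event.id);
  return "processed";
}
```

With this in place, the same event delivered twice mutates state only once, so retries can no longer double-charge a customer or flip a subscription between "active" and "canceled".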
### Day 9-10: Secret rotation and input validation
We moved the Supabase service role key to server-side only, added the server-only import guard, and rotated every credential. Then we added Zod validation to every API endpoint and form submission.
```typescript
import { z } from "zod";

const createProjectSchema = z.object({
  name: z.string().min(1).max(100),
  description: z.string().max(2000).optional(),
  startDate: z.string().datetime(),
  endDate: z.string().datetime().optional(),
  budget: z.number().positive().max(100_000_000).optional(),
  type: z.enum(["residential", "commercial", "infrastructure"]),
});
```
Every endpoint went from "trust whatever the client sends" to "validate everything, reject anything unexpected."
End of Week 2 status: All 7 critical vulnerabilities closed. Zero cross-tenant data access possible. Webhooks processing correctly. Secrets secured.
## Week 3-4: Backend refactor
With the security fires out, we moved to performance and reliability.
### Connection pooling
We configured Supabase's built-in connection pooler (PgBouncer) in transaction mode and updated the application to use the pooled connection string:
```typescript
// Before: direct connection (exhausts pool)
const supabase = createClient(
  process.env.NEXT_PUBLIC_SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!
);

// After: pooled connection via PgBouncer
const supabase = createClient(
  process.env.NEXT_PUBLIC_SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!,
  {
    db: {
      schema: "public",
    },
    auth: {
      persistSession: false,
    },
  }
);
```
On the Supabase side, we configured the pooler for the connection string used in server-side code and set the pool size to match the expected concurrent users. Connection exhaustion stopped immediately.
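As an illustration of what actually changes: Supabase exposes a separate pooled connection string (PgBouncer listens on port 6543, direct Postgres on 5432). The helper below is our own sketch of that substitution, not a Supabase API; real projects should copy the pooled string straight from the dashboard.

```typescript
// Illustrative only: derive a pooled (PgBouncer) connection string from
// the direct one by swapping the Postgres port for the pooler port.
const DIRECT_PORT = "5432";
const POOLER_PORT = "6543";

function toPooledConnectionString(directUrl: string): string {
  const url = new URL(directUrl);
  if (url.port === DIRECT_PORT) {
    url.port = POOLER_PORT;
  }
  return url.toString();
}
```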
### Killing the N+1 queries
We replaced the loop-based data fetching with a single query using Supabase's nested select:
```typescript
// After: single query with joins
export async function getProjects(orgId: string) {
  const { data, error } = await supabase
    .from("projects")
    .select(`
      *,
      activity_logs (
        id, action, created_at, user:users(name)
      ),
      project_members (
        user:users(id, name, avatar_url),
        role
      )
    `)
    .eq("org_id", orgId)
    .order("updated_at", { ascending: false })
    .limit(20);

  // DatabaseError is the app's custom error class, caught by the route's error boundary
  if (error) throw new DatabaseError("Failed to fetch projects", error);
  return data;
}
```
One query instead of 31. We also added database indexes on the columns used in WHERE clauses and ORDER BY:
```sql
CREATE INDEX idx_projects_org_updated ON projects (org_id, updated_at DESC);
CREATE INDEX idx_activity_logs_project ON activity_logs (project_id, created_at DESC);
CREATE INDEX idx_project_members_project ON project_members (project_id);
```
### Response caching
For data that doesn't change frequently (project metadata, team members), we added a caching layer using Next.js unstable_cache with revalidation:
```typescript
import { unstable_cache } from "next/cache";

export const getCachedProjects = unstable_cache(
  async (orgId: string) => getProjects(orgId),
  ["projects"],
  {
    revalidate: 60, // Revalidate every 60 seconds
    tags: ["projects"],
  }
);
```
When a project is updated, we invalidate the cache tag:
```typescript
import { revalidateTag } from "next/cache";

export async function updateProject(projectId: string, data: ProjectUpdate) {
  const { error } = await supabase.from("projects").update(data).eq("id", projectId);
  if (error) throw new DatabaseError("Failed to update project", error);
  revalidateTag("projects");
}
```
### Error boundaries and handling
We added error boundaries at every route segment, a global error handler, and try-catch blocks around every database and external API call:
```tsx
// app/dashboard/error.tsx
"use client";

export default function DashboardError({
  error,
  reset,
}: {
  error: Error & { digest?: string };
  reset: () => void;
}) {
  return (
    <div className="flex flex-col items-center justify-center p-8">
      <h2 className="text-xl font-semibold">Something went wrong</h2>
      <p className="mt-2 text-muted">
        We've been notified and are looking into it.
      </p>
      <button
        onClick={reset}
        className="mt-4 rounded-lg bg-primary px-4 py-2 text-white"
      >
        Try again
      </button>
    </div>
  );
}
```
Every external call got a wrapper with retry logic, timeout, and structured error logging:
```typescript
async function withRetry<T>(
  fn: (signal: AbortSignal) => Promise<T>,
  options: { retries: number; timeout: number; label: string }
): Promise<T> {
  for (let attempt = 1; attempt <= options.retries; attempt++) {
    // Abort the attempt if it runs past the timeout. fn must forward the
    // signal to fetch (or whatever it calls) for the abort to take effect.
    const controller = new AbortController();
    const timeoutId = setTimeout(() => controller.abort(), options.timeout);
    try {
      return await fn(controller.signal);
    } catch (error) {
      if (attempt === options.retries) {
        logger.error(`${options.label} failed after ${options.retries} attempts`, {
          error,
        });
        throw error;
      }
      // Exponential backoff: 2s, 4s, 8s, ...
      await new Promise((r) => setTimeout(r, 1000 * Math.pow(2, attempt)));
    } finally {
      clearTimeout(timeoutId);
    }
  }
  throw new Error("Unreachable");
}
```
End of Week 4 status: Dashboard loads in under 500ms. No connection pool exhaustion. Proper error handling everywhere. Caching layer active.
## Week 5-6: Monitoring, CI/CD, and hardening
### Monitoring and alerting
We set up a complete observability stack:
- Error tracking with Sentry — every uncaught error gets captured with context, user info, and breadcrumbs.
- Uptime monitoring with BetterStack — checks every 30 seconds, alerts via Slack and SMS within 60 seconds of downtime.
- Performance monitoring — Core Web Vitals tracked per page, with alerts when LCP exceeds 2.5 seconds.
- Custom metrics — webhook processing time, database query duration, API response times. All dashboarded.
The founder went from "I find out about outages from customer emails" to "I get a Slack alert within 60 seconds with the root cause."
### CI/CD pipeline
Before: deployment was git push to a Vercel branch. No checks. No tests. No gates.
After:
- Type checking — `tsc --noEmit` catches type errors before deployment
- Linting — Biome enforces code quality standards
- Security tests — Cross-tenant access tests run on every PR
- Preview deployments — Every PR gets a preview URL for manual testing
- Production deployment — Only from the main branch, only after all checks pass
```yaml
# .github/workflows/ci.yml
name: CI
on: [push, pull_request]

jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: pnpm
      - run: pnpm install --frozen-lockfile
      - run: pnpm check-types
      - run: pnpm lint
      - run: pnpm test:security
      - run: pnpm build
```
### Load testing
We ran load tests simulating 200 concurrent users during peak morning hours. Before our changes, the app fell over at 30 concurrent users. After:
- 200 concurrent users: all requests under 800ms
- 500 concurrent users: p95 at 1.2 seconds, no errors
- 1000 concurrent users: p95 at 2.8 seconds, 0.1% error rate
The app could now handle 10x its current user base without any infrastructure changes.
### Documentation and handoff
We documented everything:
- Architecture decision records (ADRs) for every major technical choice
- Runbook for common operational tasks (restart, rollback, scale)
- Database schema documentation with relationship diagrams
- API endpoint documentation with request/response examples
The founder could now onboard a developer who would understand the system without reverse-engineering it.
## The "after" numbers
| Metric | Before | After | Change |
|---|---|---|---|
| Page load (dashboard) | 8.2s | 380ms | 95.4% faster |
| Errors per day | ~120 | 2-3 | 97.5% reduction |
| Uptime (30-day) | ~94% | 99.95% | Production-grade |
| Connection pool exhaustion | 3-4x/week | 0 | Eliminated |
| Webhook success rate | ~80% | 99.9% | Reliable |
| Time to detect outages | 1-3 hours | < 60 seconds | Real-time |
| Security vulnerabilities | 7 critical | 0 | Closed |
| Concurrent user capacity | ~30 | 500+ | 16x increase |
## What we learned (again)
This project reinforced patterns we see in every vibe-coded rescue:
### AI tools are excellent at scaffolding
Cursor generated a working MVP in 2 weeks. The UI was clean, the feature set was right, the product-market fit was validated quickly. That's genuinely valuable. The founder would not have gotten 47 paying customers without the speed that Cursor enabled.
### AI tools are terrible at production engineering
Auth, security, performance, error handling, monitoring, connection management, webhook idempotency — these are all patterns that require understanding why they exist, not just how to implement them. AI tools generate the "how" without the "why," which means they skip it entirely when it's not explicitly requested.
### The gap is predictable
Every vibe-coded app we audit has the same categories of issues: auth/security, performance (N+1 queries, missing indexes, no caching), error handling (none), and operational readiness (no monitoring, no CI/CD, no documentation). The specifics vary. The categories don't.
### The rescue window is narrow
This founder came to us at $4,200 MRR with 47 users. If he'd waited another 3 months — more users, more data, more compounding technical debt — the rescue would have been twice as expensive and taken twice as long. The best time to fix a vibe-coded MVP is right after it gets traction.
## The takeaway
Vibe coding is not the problem. Stopping at the vibe code is.
If your AI-generated MVP has traction, you're sitting on a validated product with an unvalidated foundation. The product-market fit is real. The code isn't ready for what comes next. The longer you wait, the more expensive the rescue becomes.
The founder in this case study went from "my app breaks every morning" to "I haven't thought about infrastructure in two months." His MRR grew from $4,200 to $11,800 in the three months after the rescue — because he could finally focus on the product instead of firefighting.
That's the ROI of production-grade engineering: it's not a cost, it's what lets you grow.
## Ready to get started?
In the same situation? We take your vibe-coded MVP and make it production-grade in 4-6 weeks.