11 min read · Kactuz Team

The AI Integration Stack We Use for Every SaaS Client in 2026

After shipping AI features for multiple B2B SaaS products, here's the stack we've standardized on — and why we chose every piece.

AI Integration · Architecture · Stack · TypeScript

After shipping AI features for multiple B2B SaaS products, we've stopped debating the stack. We standardized it.

Every client gets a variation of the same architecture. Not because we're lazy — because we've already paid the cost of experimenting with alternatives and know where each tool breaks down. The stack below is what survives production: real users, real scale, real edge cases.

Here's every piece and why it's there.

The stack at a glance

Layer | Tool | Why
LLM Interface | Vercel AI SDK | Model-agnostic, streaming-native, TypeScript-first
Model Routing | LiteLLM | Unified API, fallback chains, cost tracking
Vector Store | pgvector or Pinecone | Depends on scale and operational complexity
Observability | Langfuse | Open-source, self-hostable, production-grade tracing
Guardrails | Custom + Zod | Output validation, PII filtering, hallucination checks
Prompt Management | Code (not dashboards) | Version-controlled, testable, reviewable
Cost Control | Per-user/org budgets | Built into the routing layer

Let's go layer by layer.

Vercel AI SDK: The LLM interface layer

We use the Vercel AI SDK as the interface between our application code and LLM providers. Not LangChain. Not a direct API wrapper. Here's why.

Model-agnostic by design

The AI SDK provides a unified interface across providers. Switching from Claude to GPT-4o is a one-line change:

import { generateText } from "ai";
import { anthropic } from "@ai-sdk/anthropic";
import { openai } from "@ai-sdk/openai";

// Switch models without changing application logic
const result = await generateText({
  model: anthropic("claude-sonnet-4-20250514"),
  // model: openai("gpt-4o"),  // swap provider in one line
  prompt: "Summarize this document...",
});

This matters because model deprecation is constant. OpenAI sunsets models every 6-12 months. Anthropic releases new Claude versions quarterly. If your application code is tightly coupled to one provider's API format, every model change is a refactor. With the AI SDK, it's a config change.

Streaming is native

The SDK handles Server-Sent Events (SSE) streaming out of the box. This is critical for any user-facing AI feature — nobody wants to wait 15 seconds for a response to appear all at once.

import { streamText } from "ai";
import { anthropic } from "@ai-sdk/anthropic";

export async function POST(req: Request) {
  const { messages } = await req.json();

  const result = streamText({
    model: anthropic("claude-sonnet-4-20250514"),
    messages,
  });

  return result.toDataStreamResponse();
}

On the client, the useChat hook handles the streaming UI automatically. No manual EventSource management. No buffering logic. It just works.

TypeScript-native

The entire SDK is built in TypeScript with full type safety. Structured output with Zod schemas is first-class:

import { generateObject } from "ai";
import { z } from "zod";

const sentimentSchema = z.object({
  sentiment: z.enum(["positive", "negative", "neutral"]),
  confidence: z.number().min(0).max(1),
  reasoning: z.string(),
});

const result = await generateObject({
  model: anthropic("claude-sonnet-4-20250514"),
  schema: sentimentSchema,
  prompt: "Analyze the sentiment of this customer review...",
});

// result.object is fully typed as { sentiment, confidence, reasoning }

No JSON.parse() and praying. The output conforms to the schema or it throws.

Why not LangChain

LangChain introduced too many abstractions for what we need. Its chain and agent abstractions add layers of indirection that make debugging harder, not easier. The AI SDK gives us the right level of abstraction: model-agnostic calls with streaming support, without hiding what's actually happening.

LiteLLM: Model routing and fallbacks

LiteLLM sits between our application and the model providers. It handles three things we don't want to build ourselves:

1. Unified API across 100+ models

LiteLLM translates any model call into the OpenAI-compatible format. This means we can route to Claude, GPT-4o, Gemini, Mistral, or any other provider through the same API interface. Combined with the AI SDK, our application code never touches provider-specific logic.

2. Fallback chains

Models go down. OpenAI has had multiple outages in 2026 alone. Our standard fallback chain:

# litellm config
model_list:
  - model_name: "primary"
    litellm_params:
      model: "anthropic/claude-sonnet-4-20250514"
      api_key: "sk-ant-..."
  - model_name: "fallback-gpt4o"
    litellm_params:
      model: "openai/gpt-4o"
      api_key: "sk-..."
  - model_name: "fallback-haiku"
    litellm_params:
      model: "anthropic/claude-haiku-4-20250514"
      api_key: "sk-ant-..."

router_settings:
  num_retries: 2
  timeout: 30
  # Ordered fallback chain: if "primary" fails, try GPT-4o, then Haiku
  fallbacks: [{"primary": ["fallback-gpt4o", "fallback-haiku"]}]

If Claude goes down, the request automatically routes to GPT-4o. If that's also down, it falls back to Haiku (faster, cheaper, but still functional). The user never sees an error.

3. Cost tracking per request

LiteLLM logs every request with token counts and costs. We pipe this into our observability layer and use it to enforce per-user and per-organization spending limits. No surprise $50K bills at the end of the month.
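To illustrate the arithmetic behind those limits, here is a minimal cost-accounting sketch. The price table is hypothetical (real per-token prices vary by model and change over time), and `requestCostUSD` is our name for an illustrative helper, not a LiteLLM API:

```typescript
// Hypothetical per-million-token prices in USD; real prices vary by
// model and date, and LiteLLM ships its own up-to-date price map.
const PRICES: Record<string, { input: number; output: number }> = {
  "claude-sonnet-4-20250514": { input: 3, output: 15 },
  "gpt-4o": { input: 2.5, output: 10 },
};

// Cost of one request, given the token counts logged for it.
function requestCostUSD(
  model: string,
  inputTokens: number,
  outputTokens: number
): number {
  const p = PRICES[model];
  if (!p) throw new Error(`unknown model: ${model}`);
  return (inputTokens * p.input + outputTokens * p.output) / 1_000_000;
}
```

Summing these per-request costs by user and organization is what makes spending limits enforceable.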

Vector store: pgvector vs Pinecone

This is the only layer where we don't have a single answer. The right choice depends on two factors.

Use pgvector when:

  • You're already running PostgreSQL (as most SaaS products are)
  • Your vector dataset is under 5M rows
  • You want to keep your infrastructure simple (no new service to manage)
  • You need to join vector results with relational data in a single query

-- pgvector: semantic search with relational joins in one query
SELECT
  d.title,
  d.content,
  d.created_by,
  u.name as author_name,
  1 - (d.embedding <=> $1::vector) as similarity
FROM documents d
JOIN users u ON d.created_by = u.id
WHERE d.org_id = $2
  AND 1 - (d.embedding <=> $1::vector) > 0.7
ORDER BY d.embedding <=> $1::vector
LIMIT 10;

That query does semantic search, filters by organization, joins with user data, and returns ranked results — all in one database call. With Pinecone, you'd need two round-trips (vector search + relational lookup).
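For reference, `<=>` in that query is pgvector's cosine distance operator, so `1 - (a <=> b)` is cosine similarity. A quick sketch of the quantity being compared against the 0.7 threshold:

```typescript
// Cosine similarity between two embedding vectors: the value that
// `1 - (embedding <=> query)` yields in the pgvector query above.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Identical directions score 1, orthogonal vectors score 0, which is why 0.7 works as a "related but not identical" cutoff.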

Use Pinecone when:

  • Your vector dataset exceeds 5M rows
  • You need sub-100ms query latency at scale
  • You're doing high-throughput concurrent queries
  • You want managed infrastructure with zero operational overhead

For most B2B SaaS products with tens of thousands to low millions of documents, pgvector handles the load comfortably. We default to pgvector and only recommend Pinecone when the data scale demands it.

Langfuse: Observability

LLM features are black boxes without observability. You can't improve what you can't measure. Langfuse gives us:

Trace every LLM call

Every request gets a trace with: input prompt, model used, output, latency, token count, cost, and any tool calls. When a user reports "the AI gave me a wrong answer," we can pull the exact trace and see what happened.

import { Langfuse } from "langfuse";

const langfuse = new Langfuse({
  publicKey: process.env.LANGFUSE_PUBLIC_KEY,
  secretKey: process.env.LANGFUSE_SECRET_KEY,
});

// Wrap any AI SDK call with tracing
const trace = langfuse.trace({ name: "document-summary" });
const generation = trace.generation({
  name: "summarize",
  model: "claude-sonnet-4-20250514",
  input: messages,
});

const result = await generateText({
  model: anthropic("claude-sonnet-4-20250514"),
  messages,
});

generation.end({ output: result.text });

Evaluation scores

We attach quality scores to traces — both automated (did the output match the schema? did it pass guardrails?) and human feedback (user thumbs up/down). Over time, this builds a dataset for measuring whether prompt changes actually improve quality.

Why not LangSmith

LangSmith is solid, but it's tightly coupled to the LangChain ecosystem. Since we use the Vercel AI SDK, Langfuse integrates more naturally. It's also open-source and self-hostable, which matters for clients with data residency requirements.

Guardrails: Output validation

LLMs hallucinate. They leak PII. They generate offensive content. They ignore instructions. Guardrails are not optional.

Structured output validation with Zod

Every LLM call that returns structured data goes through a Zod schema. The AI SDK's generateObject handles this at the provider level, but we add a second validation layer for defense in depth:

import { z } from "zod";

const recommendationSchema = z.object({
  items: z.array(z.object({
    id: z.string().uuid(),
    score: z.number().min(0).max(1),
    reason: z.string().max(500),
  })).max(10),
  metadata: z.object({
    modelUsed: z.string(),
    latencyMs: z.number(),
    cached: z.boolean(),
  }),
});

// Validate after generation — defense in depth
const validated = recommendationSchema.safeParse(result.object);
if (!validated.success) {
  logger.error("LLM output validation failed", {
    errors: validated.error.issues,
    rawOutput: result.object,
  });
  // Return fallback response, not an error
  return getFallbackRecommendations(userId);
}

Key pattern: When validation fails, return a fallback — not an error. Users don't care that the LLM misbehaved. They care that the feature works.

PII filtering

Before sending user data to any LLM, we strip PII that isn't necessary for the task:

function stripPII(text: string): string {
  return text
    .replace(/\b[\w.+-]+@[\w-]+\.[\w.]+\b/g, "[EMAIL]")
    .replace(/\b\d{3}[-.]?\d{3}[-.]?\d{4}\b/g, "[PHONE]")
    .replace(/\b\d{3}-\d{2}-\d{4}\b/g, "[SSN]")
    .replace(/\b(?:\d{4}[-\s]?){3}\d{4}\b/g, "[CARD]");
}

This isn't foolproof — no regex catches every PII pattern. But it catches the common cases and demonstrates due diligence. For clients in regulated industries, we add more aggressive filtering or use on-premise models.

Hallucination checks

For RAG features, we check that the LLM's response is grounded in the retrieved context:

async function checkGrounding(
  response: string,
  context: string[]
): Promise<{ grounded: boolean; score: number }> {
  const result = await generateObject({
    model: anthropic("claude-haiku-4-20250514"),
    schema: z.object({
      grounded: z.boolean(),
      score: z.number().min(0).max(1),
      ungroundedClaims: z.array(z.string()),
    }),
    prompt: `Check if this response is fully supported by the provided context.
Context: ${context.join("\n")}
Response: ${response}
Return whether the response is grounded and identify any claims not in the context.`,
  });

  return result.object;
}

Yes, this is using an LLM to check an LLM. It's not perfect. But it catches the obvious hallucinations — fabricated statistics, invented product features, made-up citations — which are the ones that damage user trust.

Architecture patterns

Beyond individual tools, these patterns appear in every integration we build.

Cost caps per user and organization

Uncapped AI usage is a path to bankruptcy. Every client gets per-user and per-org spending limits:

async function checkBudget(orgId: string, userId: string): Promise<boolean> {
  const orgUsage = await getMonthlyUsage(orgId);
  const userUsage = await getDailyUsage(userId);

  const orgLimit = await getOrgBudget(orgId); // e.g., $500/month
  const userLimit = 5.00; // $5/day per user

  if (orgUsage >= orgLimit || userUsage >= userLimit) {
    return false; // Block the request, show upgrade prompt
  }

  return true;
}

Prompt versioning in code, not dashboards

We store prompts in code, not in a third-party dashboard. Reasons:

  1. Version control. Prompts change with the feature. They should be in the same PR.
  2. Code review. Prompt changes get reviewed like any other code change.
  3. Testing. Prompts can be tested with eval suites in CI.
  4. Rollback. git revert works.

// prompts/document-summary.ts
export const DOCUMENT_SUMMARY_PROMPT = {
  version: "2.1.0",
  system: `You are a document summarizer for a B2B project management tool.
Rules:
- Summarize in 3-5 bullet points
- Focus on action items and decisions
- Never fabricate information not in the document
- If the document is unclear, say so`,
  template: (doc: string, context: string) =>
    `Summarize this document for the project team:\n\n${doc}\n\nProject context: ${context}`,
} as const;
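Point 3 above, testing prompts in CI, can start as an ordinary unit test. A self-contained sketch: the `PROMPT` object here mirrors the shape of the export above, and the checks are illustrative examples, not a prescribed eval suite:

```typescript
// Minimal prompt object with the same shape as the exported one above.
const PROMPT = {
  version: "2.1.0",
  template: (doc: string, context: string) =>
    `Summarize this document for the project team:\n\n${doc}\n\nProject context: ${context}`,
} as const;

// CI smoke checks: the template must embed both inputs verbatim,
// and the version string must be valid semver.
function checkPrompt(): void {
  const out = PROMPT.template("DOC_BODY", "PROJECT_CTX");
  if (!out.includes("DOC_BODY") || !out.includes("PROJECT_CTX")) {
    throw new Error("template dropped an input");
  }
  if (!/^\d+\.\d+\.\d+$/.test(PROMPT.version)) {
    throw new Error("version is not semver");
  }
}
```

Cheap checks like these catch the most common prompt regression: a refactor that silently drops an interpolated input.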

Model selection by task complexity

Not every task needs the most expensive model. We route based on task complexity:

function selectModel(task: "simple" | "moderate" | "complex") {
  switch (task) {
    case "simple":
      // Classification, extraction, simple Q&A
      return anthropic("claude-haiku-4-20250514");
    case "moderate":
      // Summarization, analysis, structured output
      return anthropic("claude-sonnet-4-20250514");
    case "complex":
      // Multi-step reasoning, code generation, creative tasks
      return anthropic("claude-opus-4-20250514");
  }
}

This alone typically reduces API costs by 40-60% versus sending everything to the largest model.

What we don't use (and why)

LangChain

Too many abstractions for production TypeScript work. Its chain abstractions made sense when LLMs needed more hand-holding. With modern models and the AI SDK, we don't need the overhead.

Embedding-only search (no hybrid)

Pure vector search misses exact matches. A user searching for "invoice INV-2024-001" expects an exact match, not a semantic approximation. We always combine vector search with keyword search (BM25 or PostgreSQL full-text search) in a hybrid approach.
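One standard way to combine the two result lists is reciprocal rank fusion (RRF). The post doesn't prescribe a merge strategy, so this is an illustrative sketch:

```typescript
// Reciprocal rank fusion: merge ranked result-ID lists (e.g. one from
// vector search, one from keyword/BM25 search) into a single ranking.
// k = 60 is the conventional damping constant.
function rrfMerge(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, rank) => {
      // Earlier ranks contribute more; unseen IDs start at 0.
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

A document that ranks well in both lists outranks one that tops only a single list, which is exactly the behavior hybrid search is after.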

Prompt playgrounds / managed prompt services

Third-party prompt management tools add a deployment step between code and production. Prompts live in code, get reviewed in PRs, and deploy with the application. No separate deployment pipeline, no sync issues.

Auto-scaling GPU infrastructure

For inference (not training), API-based models from Anthropic and OpenAI handle the scaling. We haven't found a client use case where self-hosted models beat API models on cost-adjusted quality. When we do, we'll add it to the stack.

The principle behind the stack

Every tool in this stack earns its place by solving a problem we've actually hit in production. No "nice to have" layers. No tools added because a blog post said they were important.

The stack is opinionated because production AI is opinionated. The choices that matter aren't "which LLM framework" — they're "how do you handle a model outage at 2 AM" and "what happens when a user's query costs $3 in tokens" and "how do you know the AI's answer was wrong."

Those are engineering problems, not AI problems. And this stack is built to solve them.

Ready to get started?

Want this stack in your product? Start with an AI Audit, then move to an AI Sprint.
