
MCP Server Architecture: How Production MCP Servers Are Actually Built

Four layers every production MCP server needs — transport, JSON-RPC shape, tool definitions, and guardrails. With real code for rate limiting, circuit breakers, and structured errors.

Ralph Duin · April 17, 2026 · 8 min read


TL;DR — A production MCP server is not "an AI plugin." It's a tool runtime with four hard-to-get-right layers: transport (stdio vs HTTP), JSON-RPC 2.0 routing, tool definitions (schemas + handlers), and guardrails (rate limits, circuit breakers, response caps). Most MCP servers you find on GitHub nail layer 3 and punt on layer 4 — which is exactly the layer that matters when Claude hits your server 60 times a minute at 2am.

I've shipped two production MCP servers: AppHandoff and MCP Beast. This post is what I'd tell the version of me who started the first one — the architecture decisions that actually matter once real agents start hammering your endpoints.

The four layers

  1. Transport — stdio (local subprocess) or HTTP/SSE (remote). Pick one.
  2. JSON-RPC 2.0 router — the SDK handles this; don't rewrite it.
  3. Tool definitions — name + JSON-Schema + handler. Where most tutorials stop.
  4. Guardrails — rate limits, circuit breakers, response caps, auth. Where production starts.

Transport: stdio vs HTTP

Use stdio when the server needs local files or CLI credentials. Use HTTP when multiple users share the server or you want a single cloud deployment.

My actual ~/.cursor/mcp.json:

{
  "mcpServers": {
    "apphandoff": { "type": "http", "url": "https://api.apphandoff.com/api/mcp-bot", "autoConnect": true },
    "context7": { "type": "http", "url": "https://mcp.context7.com/mcp", "headers": { "X-API-Key": "[REDACTED]" } },
    "fly": { "type": "stdio", "command": "fly", "args": ["mcp", "server"] },
    "mailgun": { "type": "stdio", "command": "npx", "args": ["-y", "@mailgun/mcp-server"],
      "env": { "MAILGUN_API_KEY": "[REDACTED]" } }
  }
}

The choice has real operational consequences. stdio servers inherit the parent process's file descriptors and credentials, which is fantastic for local dev and terrible for supply-chain safety — every npx -y some-mcp-server is running untrusted code with your shell's permissions. I only use stdio for servers I've audited or vendors I trust (Fly, Mailgun, my own). Anything else goes over HTTP behind a bearer token so the blast radius is bounded by what the token can do.

HTTP has its own failure modes. SSE streams drop silently when a load balancer idle-timeouts at 60s — your agent hangs waiting for a tools/call response that will never arrive. Either keep idle timeouts above your longest tool's p99 latency, or use plain HTTP POST and design tools to return fast (long work goes to a job queue with a separate get_job_status tool). AppHandoff's proxy uses POST-only for this reason; the two times I tried SSE in production I regretted it.
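
The long-running-work pattern is worth spelling out. Here's a minimal sketch: the first tool returns a job id immediately and a second tool polls for the result. The names (start_export, the in-memory jobs map) are illustrative, not AppHandoff's actual API.

const jobs = new Map<string, { status: 'running' | 'done' | 'error'; result?: unknown }>()

async function runExport(projectId: string): Promise<unknown> {
  // ...the actual slow work lives here...
  return { projectId, exported: true }
}

// tools/call handler for start_export: kick off the work, answer in milliseconds
async function startExport(args: { projectId: string }) {
  const jobId = crypto.randomUUID()
  jobs.set(jobId, { status: 'running' })
  runExport(args.projectId)
    .then(result => jobs.set(jobId, { status: 'done', result }))
    .catch(() => jobs.set(jobId, { status: 'error' }))
  return { job_id: jobId, status: 'running' }
}

// tools/call handler for get_job_status: the agent polls until the job settles
async function getJobStatus(args: { job_id: string }) {
  return jobs.get(args.job_id) ?? { status: 'error', code: 'unknown_job' }
}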

The JSON-RPC 2.0 client (40 lines)

When AppHandoff proxies tool calls to user-configured remote MCP servers:

// apphandoff/apps/web/lib/mcp-foreign.ts
async function execCustomMcp(tool, args, secrets, dryRun) {
  const serverUrl = tool.config.mcp_server_url
  const headers = {
    'Content-Type': 'application/json',
    ...(secrets.MCP_AUTH_TOKEN ? { Authorization: `Bearer ${secrets.MCP_AUTH_TOKEN}` } : {}),
  }

  if (dryRun) {
    const res = await fetchWithTimeout(serverUrl, {
      method: 'POST', headers,
      body: JSON.stringify({ jsonrpc: '2.0', id: 1, method: 'tools/list', params: {} }),
    }, 10_000)
    const json = await res.json()
    return { data: { tools_count: (json?.result?.tools ?? []).length }, status: 'success' }
  }

  const res = await fetchWithTimeout(serverUrl, {
    method: 'POST', headers,
    body: JSON.stringify({
      jsonrpc: '2.0', id: 1, method: 'tools/call',
      params: { name: tool.config.remote_tool_name ?? tool.name, arguments: args },
    }),
  }, 10_000)
  const json = await res.json()
  if (json.error) return { error: json.error.message, status: 'error' }
  return { data: json.result, status: 'success' }
}

Every outbound call is wrapped in a 10-second timeout. If your MCP server blocks for 60 seconds, the agent gives up and marks the tool broken.
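
fetchWithTimeout isn't shown above; a minimal version, assuming it's nothing more than fetch plus an AbortController, looks like this:

// Abort the request if it outlives timeoutMs; always clear the timer.
async function fetchWithTimeout(url: string, init: RequestInit, timeoutMs: number) {
  const controller = new AbortController()
  const timer = setTimeout(() => controller.abort(), timeoutMs)
  try {
    return await fetch(url, { ...init, signal: controller.signal })
  } finally {
    clearTimeout(timer)
  }
}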

Tool definitions: three rules

  1. Self-describing. search_github_issues beats tool_1. Include units, ranges, and enum values in the description.
  2. Idempotent where possible. list_handoff_tickets is safe to retry; send_email is not. Say so in the description.
  3. Informatively failable. {"status": "error", "code": "rate_limited", "retry_after_ms": 30000} beats Error: 429.

The third rule is the one I missed on AppHandoff v1. Once I made errors structured, the agent started waiting and retrying on its own. That single change cut support tickets roughly in half.
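
A sketch of what "structured" means inside a handler (the downstream call and field names are illustrative):

// Translate a downstream 429 into something the agent can act on, instead of throwing a bare error.
async function searchGithubIssues(args: { query: string }) {
  const res = await fetch(`https://api.github.com/search/issues?q=${encodeURIComponent(args.query)}`)
  if (res.status === 429) {
    const retryAfterSec = Number(res.headers.get('retry-after') ?? 30)
    return { status: 'error', code: 'rate_limited', retry_after_ms: retryAfterSec * 1000 }
  }
  if (!res.ok) return { status: 'error', code: 'upstream_error', http_status: res.status }
  return { status: 'success', data: await res.json() }
}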

Tools vs resources vs prompts

MCP exposes three primitive types and most teams conflate them. Tools are verbs the agent calls — they have side effects or compute answers, and the agent decides when to invoke them. Resources are nouns the agent reads — a file, a row, a rendered page, addressable by URI, fetched on demand. Prompts are parameterized templates the user picks from a menu. If you model everything as a tool, the agent has to guess from descriptions which ones are safe to call speculatively, and it will guess wrong on the expensive ones. My rule: anything idempotent and cheap that the agent might want to prefetch goes on the resource surface; anything with side effects or cost stays a tool. MCP Beast got this wrong on the first pass — we shipped read_config as a tool, and Claude called it on every turn "just in case." Moving it to a resource cut per-session tool calls by about a third.
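
In the TypeScript SDK the split looks roughly like this. A sketch only: method names follow @modelcontextprotocol/sdk's McpServer and may differ across versions, and the handlers are stubs.

import { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js'
import { z } from 'zod'

const server = new McpServer({ name: 'example', version: '1.0.0' })

// Cheap, idempotent, safe to prefetch: expose it as a resource, not a tool.
server.resource('app_config', 'config://app', async (uri) => ({
  contents: [{ uri: uri.href, text: JSON.stringify({ region: 'iad', plan: 'pro' }) }],
}))

// Side effects and cost: keep it a tool, so the agent has to decide to call it.
server.tool('send_email', { to: z.string(), subject: z.string(), body: z.string() },
  async ({ to }) => {
    // ...actually send the email here...
    return { content: [{ type: 'text', text: `queued email to ${to}` }] }
  })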

JSON Schema gotchas

The schema in a tool definition isn't just documentation — it's the contract the model generates against. Three things bite people: (1) additionalProperties: false is almost always what you want, otherwise the model invents fields and you silently drop them; (2) prefer enum over free-form strings for anything with a known set — the model picks from the list instead of hallucinating; (3) put units in the field name, not just the description — timeout_ms beats timeout because the description gets summarized away when the tool list is large.
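
Put together, a schema that follows all three might look like this (a hypothetical restart_machine tool; the fields are illustrative):

// Closed object, enum'd region, units in the field name.
const restartMachineInputSchema = {
  type: 'object',
  additionalProperties: false,   // reject fields the model invents
  required: ['machine_id', 'region'],
  properties: {
    machine_id: { type: 'string', description: 'Fly Machine ID to restart' },
    region: { type: 'string', enum: ['iad', 'lhr', 'syd'], description: 'Region the Machine runs in' },
    timeout_ms: { type: 'integer', minimum: 1_000, maximum: 60_000, default: 10_000 },
  },
} as const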

Guardrail 1: rate limiting (60/min, 5s cache)

// mcp-foreign.ts — checkRateLimit
const RATE_LIMIT_PER_MIN = 60
const RATE_LIMIT_CACHE_TTL_MS = 5_000

export async function checkRateLimit(projectId: string) {
  const cached = rateLimitCache.get(projectId)
  if (cached && Date.now() < cached.expiresAt) {
    return { allowed: cached.allowed, retryAfterMs: cached.retryAfterMs }
  }

  const oneMinuteAgo = new Date(Date.now() - 60_000).toISOString()
  const { count } = await supa
    .from('foreign_tool_calls')
    .select('*', { count: 'exact', head: true })
    .eq('project_id', projectId)
    .gte('created_at', oneMinuteAgo)

  const result = (count ?? 0) >= RATE_LIMIT_PER_MIN
    ? { allowed: false, retryAfterMs: 30_000 }
    : { allowed: true }
  rateLimitCache.set(projectId, { ...result, expiresAt: Date.now() + RATE_LIMIT_CACHE_TTL_MS })
  return result
}

The 5-second cache cuts Postgres reads per call from 2 to ~0.03.

Guardrail 2: circuit breaker (closed → open → half-open)

Five errors in a 5-minute window open the circuit for 15 minutes; three successes in the half-open state close it:

const CIRCUIT_ERROR_THRESHOLD = 5
const CIRCUIT_WINDOW_MS = 5 * 60_000
const CIRCUIT_COOLDOWN_MS = 15 * 60_000
const HALF_OPEN_TRIAL_SUCCESSES = 3

Without this, a downstream outage turns into cascading timeouts that hold Claude's context hostage.
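
A minimal in-memory sketch of the state machine those constants drive (per-tool bookkeeping and persistence are left out):

type CircuitState = 'closed' | 'open' | 'half_open'

const circuit = { state: 'closed' as CircuitState, errorTimes: [] as number[], openedAt: 0, halfOpenSuccesses: 0 }

// Called before each tool call: open circuits block until the cooldown elapses, then go half-open.
function allowCall(now = Date.now()): boolean {
  if (circuit.state === 'open') {
    if (now - circuit.openedAt < CIRCUIT_COOLDOWN_MS) return false
    circuit.state = 'half_open'
    circuit.halfOpenSuccesses = 0
  }
  return true
}

// Called after each tool call with the outcome.
function recordResult(ok: boolean, now = Date.now()) {
  if (ok) {
    if (circuit.state === 'half_open' && ++circuit.halfOpenSuccesses >= HALF_OPEN_TRIAL_SUCCESSES) {
      circuit.state = 'closed'
      circuit.errorTimes = []
    }
    return
  }
  if (circuit.state === 'half_open') {
    circuit.state = 'open'          // any error during the trial re-opens immediately
    circuit.openedAt = now
    return
  }
  circuit.errorTimes = circuit.errorTimes.filter(t => now - t < CIRCUIT_WINDOW_MS)
  circuit.errorTimes.push(now)
  if (circuit.errorTimes.length >= CIRCUIT_ERROR_THRESHOLD) {
    circuit.state = 'open'
    circuit.openedAt = now
  }
}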

Guardrail 3: response size cap (2 MB)

const MAX_FOREIGN_BODY_BYTES = 2 * 1024 * 1024

async function readBodyCapped(res: Response, maxBytes: number) {
  const reader = res.body?.getReader()
  if (!reader) return ''   // no body (e.g. a 204): nothing to cap
  let totalBytes = 0
  const chunks: Uint8Array[] = []
  while (true) {
    const { done, value } = await reader.read()
    if (done) break
    totalBytes += value.byteLength
    if (totalBytes > maxBytes) { reader.cancel(); throw new Error(`Response exceeded ${maxBytes} bytes`) }
    chunks.push(value)
  }
  return new TextDecoder().decode(Buffer.concat(chunks))
}

Agents don't know what's big. A 40 MB JSON payload will push the context over the limit and crash the session.

Guardrail 4: structured error categorization

Log every call with a status you can aggregate on: success | error | timeout | auth_failure.

const totalErrors = rows.filter(r => r.status === 'error').length
const totalTimeouts = rows.filter(r => r.status === 'timeout').length
const totalAuthFailures = rows.filter(r => r.status === 'auth_failure').length
const errorRatePct = Math.round(
  ((totalErrors + totalTimeouts + totalAuthFailures) / total) * 1000
) / 10

When error rate on a single tool jumps from 2% to 40%, you need to know within 5 minutes.
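
A sketch of the kind of per-tool check you'd run on a schedule against those logged rows (the threshold and the tool_name column are illustrative):

// Group recent calls by tool and flag any tool whose error rate crossed the threshold.
function toolsToAlert(rows: { tool_name: string; status: string }[], thresholdPct = 20) {
  const byTool = new Map<string, { total: number; bad: number }>()
  for (const r of rows) {
    const agg = byTool.get(r.tool_name) ?? { total: 0, bad: 0 }
    agg.total += 1
    if (r.status !== 'success') agg.bad += 1
    byTool.set(r.tool_name, agg)
  }
  return [...byTool.entries()]
    .filter(([, agg]) => agg.total >= 10 && (agg.bad / agg.total) * 100 >= thresholdPct)
    .map(([toolName]) => toolName)
}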

Guardrail 5: auth and tenant scoping

Every production MCP server ends up being multi-tenant whether you planned for it or not — the moment a second customer connects, you need per-tenant rate limits, per-tenant audit logs, and per-tenant blast radius on a leaked credential. AppHandoff issues bearer tokens scoped to a single project; the token resolves to a project_id on every call, and every query (foreign_tool_calls, handoff_tickets, rate-limit checks) is filtered by it at the database layer via Supabase RLS. That means a compromised token can't read another tenant's data even if a tool handler forgets to filter — the policy catches it.

The common failure mode I see in other people's MCP servers: a single shared API key per server, passed through the Authorization header unchanged to every downstream call. When that key leaks (and it will — agents paste tool output into logs, screenshots, support tickets) you're rotating credentials across every tenant at once. Issue scoped tokens, log which token made each call, and make revocation a single SQL update instead of a deploy.
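
A sketch of the token-to-project resolution (the api_tokens table and its columns are illustrative, not AppHandoff's actual schema):

import { createHash } from 'node:crypto'

// Resolve the incoming bearer token to exactly one project; every later query filters on that id.
async function resolveProjectId(req: Request): Promise<string> {
  const token = req.headers.get('authorization')?.replace(/^Bearer\s+/i, '') ?? ''
  if (!token) throw new Error('auth_failure: missing bearer token')

  const tokenHash = createHash('sha256').update(token).digest('hex')   // never store raw tokens
  const { data } = await supa
    .from('api_tokens')
    .select('project_id, revoked_at')
    .eq('token_hash', tokenHash)
    .maybeSingle()

  if (!data || data.revoked_at) throw new Error('auth_failure: invalid or revoked token')
  return data.project_id   // revoking = setting revoked_at on one row, no deploy needed
}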

Deploying: Fly.io + Supabase

  • Keep at least 1 Machine running. Cold-start on an agent call is a terrible experience.
  • Auth with Bearer tokens scoped to the caller. Every call logs which token invoked it.
  • If you can't answer "which tool is slow right now?" in 30 seconds, you haven't finished building.

See Fly.io + Supabase: the reliability stack that scales.

When not to build an MCP server

If the only requirement is "let Claude read our docs," a REST endpoint + retrieval is simpler. MCP earns its overhead when multiple AI clients hit the same surface, the agent needs to take actions (not just read), and per-tool audit logs matter.

See the full decision framework in MCP Server vs REST API.


Need a production MCP server built? Describe the system you want to make AI-native and I'll tell you what the first version should look like.
