How We Route Haiku → Sonnet in Production

Not every task needs the same model. Our 18-layer agent stack picks the model per-task based on complexity, latency requirements, and cost. Here's the system.

The Core Problem

If you run Claude Sonnet for every request, you get excellent quality — but the cost adds up fast when you're serving $297 audits and $997 sprints. If you run Haiku for everything, you save money but lose reasoning quality on the tasks that matter. The answer is routing: use the cheapest model that can do the job, and escalate when needed.

Our Routing Tiers

Tier	Model	Latency	Use Cases
1	Claude Haiku	~300ms	Intent classification, tagging, simple transforms, message routing
2	Claude Sonnet	~2s	Report generation, multi-step analysis, campaign copy, code generation
3	Claude Opus	~5s	Architecture decisions, ambiguous judgment calls, complex reasoning chains

How the Decision Happens

The routing logic lives in the first layer of our 18-layer agent pipeline. When a task enters the system, the router (itself a Haiku call) classifies the task's complexity and selects the model. The classification is based on:

Token budget — short-output tasks (classification, tagging) go to Haiku
Reasoning depth — tasks requiring multi-step logic or cross-referencing go to Sonnet
Stakes — client-facing deliverables where quality matters most go to Sonnet or Opus
Latency sensitivity — interactive chatbot responses need Haiku speed; batch processing can wait for Sonnet

Real Example: The AI Business Audit Pipeline

When we run a $297 business audit through the system:

Haiku classifies the intake form fields and determines which research paths to activate
DeerFlow (multi-agent framework running Sonnet) executes the research — competitor scans, Instagram analysis, revenue leak identification
Sonnet synthesizes the research into the 10-page report structure
Haiku formats the final output and generates the table of contents

The Haiku calls at steps 1 and 4 cost fractions of a cent and complete in milliseconds. The Sonnet calls at steps 2 and 3 do the heavy reasoning. The total API cost per audit is well under the margin needed to sustain $297 pricing.

What We've Learned

Don't over-route. Early versions tried to route every sub-task independently. The overhead of classification calls exceeded the savings. Now we route at the pipeline-stage level, not per-message.
Haiku is underrated for classification. It handles intent detection, entity extraction, and simple transforms with near-perfect accuracy at 10x lower cost than Sonnet.
Prompt caching compounds the savings. When you route correctly AND cache the system prompt across calls within a pipeline run, the cost curve flattens dramatically at scale.
Human review is still the last layer. No model routing replaces a human reading every deliverable before it ships. The routing optimizes cost and speed, not judgment.

The Business Case

Model routing is why we can offer fixed-price services ($297 audit, $997 reactivation sprint) instead of hourly billing. The cost per delivery is predictable and low enough to sustain margin. Without routing, every audit would eat 3–5x more API cost, and the $297 price point wouldn't work.

This isn't clever engineering for its own sake. It's the infrastructure that makes human-led AI economically viable for solo operators.

← Back to Field Notes · How We Build With Claude →