AI Strategy · 10 min read

LLM Routing: How to Cut Your AI API Bill by 60 Percent Without Losing Quality

There is a 16.7x price difference between GPT-4o and GPT-4o-mini, two OpenAI models available through the same API key. Most small businesses use the expensive one for everything. Here is why that is the wrong approach, and how to fix it without rebuilding your workflows.

In January 2026, Jonas ran a twelve-person growth marketing agency in Copenhagen. When he opened his quarterly AI API invoice, it read $3,200. He had been running GPT-4o across every workflow in the business: campaign briefs, meeting note summaries, competitor analysis, email drafts, performance report narratives, client Q&A responses. All routed through the same model regardless of the complexity of the task.

When he audited the invoice by task type, $340 was attributable to meeting note summaries. GPT-4o at $10 per million output tokens was converting spoken-word transcripts into structured bullet points, a task that requires no reasoning, no world knowledge, and no sophisticated instruction-following. It requires pattern extraction. A model that costs $0.60 per million output tokens does that task to the same standard.

Jonas was not a wasteful spender. He was a rational one who had never been given a reason to think about model selection. That changed when he implemented LLM routing. His Q2 bill came in at $1,140, a reduction of $2,060 on the same volume of work.

The price gap nobody told you about

There is a 16.7x price difference between GPT-4o and GPT-4o-mini, and it applies to input and output tokens alike. GPT-4o costs $2.50 per million input tokens and $10 per million output tokens. GPT-4o-mini costs $0.15 per million input tokens and $0.60 per million output tokens. Both are OpenAI models. Both are accessible through the same API key. The capability difference matters for complex reasoning tasks. It is irrelevant for summarisation, classification, simple extraction, and structured formatting.
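To make the gap concrete, here is a minimal sketch that computes the price ratio and the cost of one summarisation call on each model, using the per-million-token prices quoted above. The token counts (8,000 in, 500 out) are illustrative assumptions, not figures from Jonas's invoice.

```python
# Prices in USD per million tokens, as quoted in this article.
PRICES = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one call at the listed prices."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000


input_ratio = PRICES["gpt-4o"]["input"] / PRICES["gpt-4o-mini"]["input"]
output_ratio = PRICES["gpt-4o"]["output"] / PRICES["gpt-4o-mini"]["output"]

# Hypothetical meeting transcript: 8,000 tokens in, 500 tokens out.
premium = call_cost("gpt-4o", 8_000, 500)
cheap = call_cost("gpt-4o-mini", 8_000, 500)

print(f"input ratio:  {input_ratio:.1f}x")
print(f"output ratio: {output_ratio:.1f}x")
print(f"per call: ${premium:.4f} vs ${cheap:.4f}")
```

Because the ratio is the same on both sides of the call, the per-call saving does not depend on whether a task is input-heavy or output-heavy.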

Anthropic's model family shows a similar gap. Claude Sonnet costs $3 per million input tokens and $15 per million output tokens. Claude Haiku costs $1 per million input tokens and $5 per million output tokens. Google's Gemini 1.5 Pro costs $1.25 per million input tokens; Gemini Flash costs $0.30. The pricing differential between frontier and capable-but-cheaper models runs from 3x to 17x across the major providers.

Most businesses using AI APIs send every task to the same model because that is the default configuration. The configuration requires no decision; it just costs the maximum possible amount for each task.

The average SMB running AI workflows on a single frontier model is overpaying for 40-60 percent of its API usage. The tasks in that range do not need a model priced at $10 per million output tokens. They need one priced at $0.60.

What LLM routing actually is

LLM routing is a system that evaluates each incoming AI task and directs it to the cheapest model capable of handling it reliably. A simple classification prompt asking for one of five output categories goes to a cheap, fast model. A multi-step analysis requiring reasoning across a long document and producing a nuanced recommendation goes to a premium model. The routing logic makes this decision automatically, without changing the user-facing application.

The routing decision can be based on several signals: the complexity of the input prompt, the type of task being performed, the length of the expected output, the classification confidence of a lightweight gating model, or explicit tags applied by the developer. More sophisticated systems use a small model to evaluate query difficulty before passing it to the appropriate tier.
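A rule-based router built on two of those signals, explicit developer tags and input or output size, might look like the sketch below. The tag names, the thresholds, and the `Task` and `route` names are all hypothetical choices for illustration; a real deployment would tune them against its own workload.

```python
from dataclasses import dataclass
from typing import Optional

CHEAP_MODEL = "gpt-4o-mini"
PREMIUM_MODEL = "gpt-4o"

# Hypothetical task tags a developer might attach to each request.
SIMPLE_TAGS = {"summarise", "classify", "extract", "reformat"}
COMPLEX_TAGS = {"synthesis", "strategy", "codegen"}


@dataclass
class Task:
    prompt: str
    tag: Optional[str] = None
    expected_output_tokens: int = 200


def route(task: Task, long_prompt_chars: int = 20_000,
          long_output_tokens: int = 1_500) -> str:
    """Pick a model from explicit tags first, then from size heuristics."""
    if task.tag in SIMPLE_TAGS:
        return CHEAP_MODEL
    if task.tag in COMPLEX_TAGS:
        return PREMIUM_MODEL
    # Untagged: treat very long inputs or long expected outputs as complex.
    if (len(task.prompt) > long_prompt_chars
            or task.expected_output_tokens > long_output_tokens):
        return PREMIUM_MODEL
    return CHEAP_MODEL


print(route(Task("Summarise this call transcript ...", tag="summarise")))
print(route(Task("Draft a market entry plan ...", tag="strategy")))
print(route(Task("Quick question about our refund policy")))
```

Sending untagged short requests to the cheap model is the aggressive default. A more cautious setup would default to the premium model and only downgrade tasks that have been explicitly tagged as simple.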

The key insight is that you are not choosing between quality and cost. You are identifying the tasks where quality is determined by a different set of model capabilities than the tasks you have been treating as equivalent. A model excellent at complex reasoning is not better at meeting note summarisation than a model a fraction of its cost. They produce outputs that are indistinguishable in quality because the task does not require the capabilities that differentiate them.

What the research says

The two most cited papers on LLM routing both demonstrate dramatic cost reductions with minimal quality loss.

RouteLLM, from the LMSYS research team at UC Berkeley, published at ICLR 2025 (arXiv:2406.18665), trained a small router model to predict whether a given query required a strong model or could be answered adequately by a weaker one. In their evaluation, RouteLLM cut costs by over 85 percent on the MT Bench benchmark while maintaining 95 percent of GPT-4 level performance; the reported reductions on MMLU (45 percent) and GSM8K (35 percent) were smaller. The router itself adds minimal latency because it is a small, fast model making a binary classification, not a full inference call.
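The cost and quality trade-off in a router of this kind comes down to a threshold on the router's score. The sketch below is not RouteLLM's code: the scores are invented, and the prices are the GPT-4o and GPT-4o-mini output prices quoted earlier in this article. It only shows how raising the threshold shrinks the share of traffic sent to the strong model and, with it, the blended price.

```python
def strong_model_share(scores: list[float], threshold: float) -> float:
    """Fraction of queries whose router score meets the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


def blended_price(share_strong: float, strong_price: float,
                  weak_price: float) -> float:
    """Average price per million output tokens across both models."""
    return share_strong * strong_price + (1 - share_strong) * weak_price


# Hypothetical router scores: estimated chance that a query needs
# the strong model. A real router would produce these per query.
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.70, 0.80, 0.90, 0.95]

for threshold in (0.3, 0.5, 0.8):
    share = strong_model_share(scores, threshold)
    price = blended_price(share, strong_price=10.00, weak_price=0.60)
    print(f"threshold {threshold}: {share:.0%} to strong model, "
          f"${price:.2f} per million output tokens")
```

The threshold is the dial a team would calibrate against its own quality checks; the scores alone say nothing about whether the cheaper answers are good enough.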

FrugalGPT, from Stanford University, published in 2023 (arXiv:2305.05176), took a different approach: a cascade architecture that tries cheap models first and escalates only when confidence is insufficient. In the paper's evaluation, FrugalGPT achieved up to 98 percent cost reduction with equivalent quality on the tested benchmarks. The cascade saves the most when the first-tier model succeeds; when it fails, the query pays for an extra call and the added latency of escalation.
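A cascade can be sketched in a few lines. This is an illustration of the pattern rather than FrugalGPT's implementation: the two model functions are stubs, and the confidence value stands in for whatever answer-scoring step a real system would use. All names here are hypothetical.

```python
from typing import Callable, Sequence, Tuple

# A model call returns (answer, confidence in [0, 1]). Both are stand-ins.
ModelFn = Callable[[str], Tuple[str, float]]


def cascade(prompt: str, tiers: Sequence[Tuple[str, ModelFn]],
            min_confidence: float = 0.8) -> Tuple[str, str]:
    """Try tiers cheapest-first; return (model_name, answer).

    The first sufficiently confident answer wins. The last tier's
    answer is accepted regardless of its confidence.
    """
    for i, (name, fn) in enumerate(tiers):
        answer, confidence = fn(prompt)
        if confidence >= min_confidence or i == len(tiers) - 1:
            return name, answer
    raise ValueError("tiers must not be empty")


def cheap_stub(prompt: str) -> Tuple[str, float]:
    # Pretend the cheap model is only confident on short prompts.
    return "cheap answer", 0.9 if len(prompt) < 100 else 0.4


def premium_stub(prompt: str) -> Tuple[str, float]:
    return "premium answer", 0.95


tiers = [("cheap-model", cheap_stub), ("premium-model", premium_stub)]
print(cascade("Classify this ticket: billing or technical?", tiers))
print(cascade("Compare these three contracts ... " * 10, tiers))
```

Every escalated query costs one cheap call plus one premium call, so a cascade only pays off when the cheap tier handles most of the traffic on its own.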

Both papers report results on academic benchmarks, not production measurements. Real-world deployments show more modest reductions in the 40-70 percent range depending on workload composition. But the direction is consistent: significant savings without quality degradation are achievable on diverse task mixes.

The tools SMBs can use today

The commercial routing market has matured substantially since those papers were published. Several tools now make LLM routing accessible without requiring academic-level implementation.

OpenRouter

OpenRouter provides a single API that connects to 400-plus models across OpenAI, Anthropic, Google, Meta, Mistral, and dozens of smaller providers. It processes 25 trillion tokens per week and raised $113 million in Series B funding in 2025. For SMBs, the relevant capability is that it normalises the API interface across all providers, making model switching trivial once routing logic is added. You set model preferences in a single configuration change rather than rewriting application code.

LiteLLM

LiteLLM is an open-source proxy with 49,600 GitHub stars and Y Combinator backing. It provides a unified interface for 100-plus LLM APIs, cost tracking by model and task type, budget controls, and fallback routing if a model is unavailable. It is the tool most commonly used by engineering teams that want visibility into exactly where API costs are going before implementing routing logic. The cost tracking alone reveals the overspending pattern that motivates routing.

Martian

Martian is a commercial routing layer that automatically selects the optimal model for each query based on its trained understanding of task complexity. It raised $9 million in seed funding and is designed for teams that want routing intelligence without writing routing logic themselves. Martian sits between your application and the model APIs and makes routing decisions transparently.

NotDiamond

NotDiamond raised $2.3 million in pre-seed funding and focuses on performance-optimised routing: directing queries to the model most likely to produce the highest-quality output for that specific task type rather than purely minimising cost. It is useful for businesses where some task categories genuinely benefit from a specific model's strengths, and the routing logic captures which model wins on which task type through historical performance data.

How to decide what routes where

The practical starting point is a task audit, not a tool selection. Before implementing any routing layer, categorise the AI tasks in your business by complexity.

Simple tasks that reliably route to cheap models include: meeting note summarisation, email formatting, data extraction from structured documents, classification into a defined category set, simple question-answering against provided context, and content reformatting such as long to short or formal to informal.

Complex tasks that warrant premium models include: multi-document synthesis requiring judgment about conflicting information, strategy documents requiring nuanced recommendation across ambiguous inputs, code generation for non-trivial logic, and complex client-facing analysis where output quality is visible and consequential.

Most business AI workloads, when audited honestly, contain more simple tasks than complex ones. The instinct to use the best available model for everything is understandable, but it is the equivalent of flying first class for a 45-minute domestic flight because you can afford it. The destination is identical.

The decision is not which model is best. The decision is which model is sufficient for this specific task. Those are different questions with different answers.

What this means for your AI budget

Jonas's Q2 outcome was a $2,060 quarterly saving on the same volume of work. That projects to $8,240 per year from a configuration change that took one developer two days to implement. For a twelve-person agency, that figure is not transformational but it is material, and it comes with zero quality degradation on the tasks where routing redirected work to cheaper models.
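The arithmetic behind those figures is short enough to check directly. The only assumption in the sketch below is the one the projection itself makes: that volume and prices stay flat for four quarters.

```python
bill_before = 3200  # quarterly invoice before routing, USD
bill_after = 1140   # quarterly invoice after routing, USD

quarterly_saving = bill_before - bill_after
reduction_pct = quarterly_saving / bill_before * 100
annual_saving = quarterly_saving * 4  # assumes flat volume and prices

print(f"quarterly saving: ${quarterly_saving:,}")
print(f"reduction: {reduction_pct:.1f}%")
print(f"annualised: ${annual_saving:,}")
```

The reduction works out to roughly 64 percent, which is where the 60 percent in the headline comes from.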

The a16z 2025 State of AI report found that 37 percent of enterprises already use five or more AI models in production. Menlo Ventures tracked $8.4 billion in H1 2025 enterprise AI infrastructure spending. The pattern is consistent: as AI becomes embedded in workflows, cost optimisation follows adoption. LLM routing is the primary mechanism for that optimisation.

For SMBs that are earlier in the adoption curve, implementing routing before scale means the cost savings compound from the beginning rather than arriving as a remediation effort after a large invoice triggers a review.

The three-step implementation sequence is: audit your current tasks by complexity, implement a unified API layer (LiteLLM is the simplest starting point for teams with technical capacity; OpenRouter is the simplest for teams without), and assign task categories to model tiers. Review the invoice after sixty days. The saving is visible, measurable, and does not require ongoing maintenance once the routing configuration is set.
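Steps one and three can be rehearsed on paper before any tooling is installed. The sketch below takes a usage log grouped by task type, prices it on a single premium model, and then re-prices it with each task assigned to a tier. The task names and token volumes are invented for illustration; the prices are the GPT-4o and GPT-4o-mini figures quoted earlier.

```python
# USD per million tokens, as quoted in this article.
PRICES = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

# Hypothetical quarterly usage: (task type, input tokens, output tokens).
USAGE = [
    ("meeting_summary", 40_000_000, 4_000_000),
    ("email_draft", 10_000_000, 5_000_000),
    ("competitor_analysis", 20_000_000, 6_000_000),
]

# Step three: assign each task category to a model tier.
TIER_FOR_TASK = {
    "meeting_summary": "gpt-4o-mini",
    "email_draft": "gpt-4o-mini",
    "competitor_analysis": "gpt-4o",
}


def cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000


def audit(usage, tier_for_task, default_model="gpt-4o"):
    """Return (current_total, routed_total) in USD for the usage log."""
    current = sum(cost(default_model, i, o) for _, i, o in usage)
    routed = sum(cost(tier_for_task.get(task, default_model), i, o)
                 for task, i, o in usage)
    return current, routed


current_total, routed_total = audit(USAGE, TIER_FOR_TASK)
saving_pct = (current_total - routed_total) / current_total * 100
print(f"single model: ${current_total:.2f}")
print(f"routed:       ${routed_total:.2f}")
print(f"saving:       {saving_pct:.1f}%")
```

Any task missing from the tier mapping falls back to the premium model, so an incomplete audit costs money rather than quality.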
