How to Reduce LLM Costs for AI Agents (2026)

A practical guide to reducing LLM costs for AI agents with routing, prompt trimming, caching, batching, and monitoring tactics that work in 2026.

By Theo · Maker of Tokenwise

Updated May 21, 2026

Key takeaways

Route cheap by default: gpt-4.1-nano at $0.10 / 1M input, $0.40 / 1M output is enough for many routing and extraction steps.
Do not run whole agents on premium models; reserve gpt-5.1, o3, claude-sonnet-4, and gemini-2.5-pro for hard synthesis or reasoning checkpoints.
Trim retrieval context before prompt wording; cutting 40k tokens to 6k tokens can save about $0.0425 per gpt-5.1 call.
Cache stable prefixes, retrieval summaries, tool results, and classifications; a 40% cache hit rate can beat a provider migration.
Monitor cost per successful agent run, with budgets by step and feature, not just total monthly LLM spend.

AI agents get expensive because they do not make one model call. They plan, search, call tools, read results, retry, summarize, and then call another model because the first answer was almost right. The bill is the loop.

If you want to reduce LLM cost for AI agents, the biggest wins are boring and mechanical: route most steps to cheaper models, stop stuffing the full conversation into every call, cache repeated work, batch offline jobs, and measure cost per agent step instead of cost per request.

My default in 2026 is simple: use a cheap fast model for extraction, routing, classification, and summaries; reserve GPT-5.1, Claude Opus, Gemini Pro, or o-series reasoning models for the few steps that actually need them.

Start with the agent cost equation

The cost of an agent run is not just input tokens plus output tokens. It is tokens per step multiplied by number of steps multiplied by retries. That multiplication is where teams get hurt.

Here is a realistic support agent: 8 LLM calls, each with 12,000 input tokens of chat history, retrieved docs, tool descriptions, and scratch context, plus 1,200 output tokens. On gpt-4o at $2.50 / 1M input, $10.00 / 1M output, that is about $0.336 per resolved ticket. At 100,000 tickets per month, you are at $33,600 before embeddings, storage, and failed runs.

Move the same workload to gpt-4o-mini at $0.15 / 1M input, $0.60 / 1M output and the raw LLM cost drops to about $0.020 per ticket. That is not a small optimization. That is the business model.

I track four numbers for every agent: average input tokens per step, average output tokens per step, steps per successful run, and retry rate. If you do not have those four, you are guessing. And guessing is expensive.

Untracked retries and extra steps add up like a tax you forgot to estimate.
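To make the multiplication concrete, here is a minimal sketch of the cost equation in Python. The `Price` class, the `cost_per_run` function, and the way `retry_rate` is modeled (as the average fraction of steps that run a second time) are my own illustrative assumptions, not a standard formula. The token counts and prices are the support-agent numbers from above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """USD per 1M tokens."""
    input_per_m: float
    output_per_m: float


def cost_per_run(price: Price, steps: int, input_tokens: int,
                 output_tokens: int, retry_rate: float = 0.0) -> float:
    """Estimated USD for one agent run.

    retry_rate is an assumed simplification: the average fraction of
    steps that get re-executed, so 0.25 means a quarter of steps run twice.
    """
    per_step = (input_tokens * price.input_per_m
                + output_tokens * price.output_per_m) / 1_000_000
    return per_step * steps * (1.0 + retry_rate)


GPT_4O = Price(input_per_m=2.50, output_per_m=10.00)
GPT_4O_MINI = Price(input_per_m=0.15, output_per_m=0.60)

premium = cost_per_run(GPT_4O, steps=8, input_tokens=12_000, output_tokens=1_200)
cheap = cost_per_run(GPT_4O_MINI, steps=8, input_tokens=12_000, output_tokens=1_200)

print(f"gpt-4o: ${premium:.3f} per ticket, ${premium * 100_000:,.0f} per 100k tickets")
print(f"gpt-4o-mini: ${cheap:.3f} per ticket")
```

Passing a non-zero `retry_rate` shows how quickly retries inflate the same workload, which is why I track that number separately.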

Route by task, not by vibes

I do not run an entire agent on a frontier model unless the product absolutely demands it. Most agent steps are clerical: classify intent, decide whether to call a tool, parse JSON, summarize search results, extract fields, or write a short user-facing update. Cheap models are excellent at those jobs now.

My typical routing stack has four tiers:

Small deterministic steps: gpt-4.1-nano, gpt-4o-mini, gemini-2.0-flash, or claude-3-haiku.
Normal agent work: gpt-4.1-mini, gpt-5.1-codex-mini, gemini-2.5-flash, or claude-3.5-haiku.
Hard synthesis: gpt-5.1, o3, o4-mini, claude-sonnet-4, or gemini-2.5-pro.
High-value reasoning where latency and cost are acceptable: o3-pro, o1, or claude-opus-4.

Model	Pricing	Context window	Where I use it in agents
gpt-4.1-nano	$0.10 / 1M input, $0.40 / 1M output	1M tokens	Classification, routing, cheap extraction, guardrails
gpt-4o-mini	$0.15 / 1M input, $0.60 / 1M output	128k tokens	General cheap agent steps, summaries, JSON tasks
gpt-4.1-mini	$0.40 / 1M input, $1.60 / 1M output	1M tokens	Long-context retrieval, document workflows, tool use
gpt-5.1-codex-mini	$0.25 / 1M input, $2.00 / 1M output	400k tokens	Code agents, repo edits, test-fix loops
gemini-2.0-flash	$0.10 / 1M input, $0.40 / 1M output	1M tokens	Very cheap high-throughput summarization and extraction
gemini-2.5-flash	$0.30 / 1M input, $2.50 / 1M output	1M tokens	Fast multimodal and long-context agent steps
o4-mini	$1.10 / 1M input, $4.40 / 1M output	200k tokens	Reasoning checkpoints without paying pro-model prices
gpt-5.1	$1.25 / 1M input, $10.00 / 1M output	400k tokens	Final synthesis, high-stakes decisions, complex tool plans
claude-sonnet-4	$3.00 / 1M input, $15.00 / 1M output	200k tokens	Excellent writing, analysis, long-form reasoning
claude-opus-4	$15.00 / 1M input, $75.00 / 1M output	200k tokens	Only for premium reasoning paths and top-tier review

In LangGraph, I make routing an explicit node. In LangChain, I use separate model instances per chain. In Vercel AI SDK, I keep a small model as the default and override the model only inside specific tools or generation calls. This is mundane plumbing. It pays.
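Stripped of any framework, explicit routing can be as small as a lookup table. The sketch below is framework-agnostic Python. The `Tier` enum, the task labels, and the single model chosen per tier are illustrative assumptions; a real agent would use its own step names and could pick any model from the matching tier.

```python
from enum import Enum


class Tier(Enum):
    CLERICAL = "clerical"
    STANDARD = "standard"
    HARD = "hard"
    PREMIUM = "premium"


# One model per tier, chosen from the routing stack above; swap freely.
MODEL_FOR_TIER = {
    Tier.CLERICAL: "gpt-4.1-nano",
    Tier.STANDARD: "gpt-4.1-mini",
    Tier.HARD: "gpt-5.1",
    Tier.PREMIUM: "claude-opus-4",
}

# Hypothetical task labels for illustration only.
TIER_FOR_TASK = {
    "classify_intent": Tier.CLERICAL,
    "extract_fields": Tier.CLERICAL,
    "summarize_results": Tier.CLERICAL,
    "tool_use": Tier.STANDARD,
    "final_synthesis": Tier.HARD,
    "high_value_review": Tier.PREMIUM,
}


def pick_model(task: str) -> str:
    """Return the model for a task; unknown tasks fall back to the cheapest tier."""
    tier = TIER_FOR_TASK.get(task, Tier.CLERICAL)
    return MODEL_FOR_TIER[tier]


print(pick_model("classify_intent"))
print(pick_model("final_synthesis"))
```

Defaulting unknown tasks to the cheapest tier is a deliberate choice here: escalation has to be written down, so nobody gets a premium model by accident.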

Trim context before you touch the prompt

Prompt polishing saves pennies. Context control saves real money. The fastest way to reduce LLM cost for AI agents is to stop sending the same irrelevant context to every step.

I use three context layers. First, a stable system prompt that is short and cached. Second, a working memory that contains only the current objective, decisions already made, and open questions. Third, retrieved evidence capped by budget, not by whatever the vector database returns.

For RAG agents, I rarely send 20 chunks anymore. I retrieve 30, rerank to 5 or 8, then compress each chunk with a cheap model before the main call. A 40,000-token retrieval payload often becomes 6,000 tokens with no quality loss. On gpt-5.1, that saves about $0.0425 per call just on input. Across thousands of calls, it matters.
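Here is a rough sketch of what "capped by budget" can look like after reranking, plus the savings arithmetic. The `Chunk` class, `select_within_budget`, and `input_savings` are hypothetical helpers, and the token counts are assumed to come from whatever tokenizer you already use. The compression step with a cheap model is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    score: float  # reranker score, higher is better
    tokens: int   # token count from your tokenizer of choice


def select_within_budget(chunks: list[Chunk], max_chunks: int,
                         token_budget: int) -> list[Chunk]:
    """Keep the best-scoring chunks until the count or token budget is hit."""
    selected: list[Chunk] = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        if len(selected) == max_chunks:
            break
        if used + chunk.tokens > token_budget:
            continue  # does not fit the remaining budget; try a smaller chunk
        selected.append(chunk)
        used += chunk.tokens
    return selected


def input_savings(tokens_before: int, tokens_after: int,
                  usd_per_m_input: float) -> float:
    """USD saved per call on input tokens alone."""
    return (tokens_before - tokens_after) * usd_per_m_input / 1_000_000


print(f"${input_savings(40_000, 6_000, 1.25):.4f} saved per gpt-5.1 call")
```

The point of the token budget is that the prompt size stays predictable even when the vector database returns unusually long chunks.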

Framework specifics: in LlamaIndex, use node postprocessors, similarity cutoffs, and response synthesizers that do not blindly concatenate nodes. In LangChain, add a contextual compression retriever and cap message history with a token-aware trimmer. In LangGraph, store long-term memory outside the message list and pass summaries into state. Do not let an agent’s chat history become a landfill. I have made that mistake; it is ugly.
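For readers not using one of those frameworks, a token-aware history trimmer can be sketched in plain Python. This is not the LangChain or LangGraph API. `trim_history` and `rough_token_count` are hypothetical names, and the 4-characters-per-token estimate is a crude assumption you should replace with a real tokenizer.

```python
from typing import Callable

Message = dict[str, str]  # {"role": ..., "content": ...}


def rough_token_count(text: str) -> int:
    """Crude estimate (about 4 characters per token); swap in a real tokenizer."""
    return max(1, len(text) // 4)


def trim_history(messages: list[Message], token_budget: int,
                 count: Callable[[str], int] = rough_token_count) -> list[Message]:
    """Keep system messages plus the newest turns that fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    remaining = token_budget - sum(count(m["content"]) for m in system)
    kept: list[Message] = []
    for message in reversed(rest):  # walk from newest to oldest
        cost = count(message["content"])
        if cost > remaining:
            break
        kept.append(message)
        remaining -= cost
    return system + list(reversed(kept))
```

Older turns that fall outside the budget are the ones to summarize into long-term memory rather than resend verbatim.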

Cache the boring calls aggressively

Caching is underrated because people think agents are dynamic. Parts of them are. Many parts are not.

I cache four things by default: system prompts and tool schemas when the provider supports prompt caching, retrieval summaries keyed by document version and query intent, tool results for read-only APIs, and LLM classifications for repeated inputs. If a user asks the same billing question 5,000 times, your agent should not pay to reason from scratch 5,000 times.

For OpenAI and Anthropic models with prompt caching behavior, keep the reusable prefix byte-for-byte stable: system prompt, policy text, tool definitions, and static examples. Put volatile user context after the stable prefix. Tiny formatting changes can break cache hits. Annoying, but true.
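One way to keep the prefix stable is to build it from a deterministic serialization and log a fingerprint of it on every call. The sketch below uses only the Python standard library. The prompt text, the tool schema, and the function names are placeholders I made up, and it does not call any provider API.

```python
import hashlib
import json

SYSTEM_PROMPT = "You are a support agent. Follow the refund policy."  # placeholder
TOOLS = [{"name": "lookup_order", "parameters": {"order_id": "string"}}]  # placeholder


def stable_prefix(system_prompt: str, tools: list[dict]) -> str:
    """Serialize the reusable part deterministically (sorted keys, fixed separators)."""
    tool_json = json.dumps(tools, sort_keys=True, separators=(",", ":"))
    return f"{system_prompt}\n\nTOOLS:\n{tool_json}"


def prefix_fingerprint(prefix: str) -> str:
    """Short hash to log per call; a changed value means the prefix drifted."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:12]


def build_prompt(user_context: str) -> str:
    """Stable prefix first, volatile user context last."""
    return f"{stable_prefix(SYSTEM_PROMPT, TOOLS)}\n\nUSER CONTEXT:\n{user_context}"


print(prefix_fingerprint(stable_prefix(SYSTEM_PROMPT, TOOLS)))
```

If the logged fingerprint changes between deploys and nobody meant to edit the prompt, that is usually where the cache hits went.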

In LangChain, use model and chain-level caches for deterministic subchains, then add your own Redis or Postgres cache around retrievers and tools. In LlamaIndex, cache embeddings, node parsing, and expensive query transforms. In Vercel AI SDK, wrap tool functions with cache keys based on normalized arguments. For CrewAI and AutoGen-style multi-agent systems, cache inter-agent summaries; otherwise each agent pays to rediscover what the previous agent already learned.

A 40% cache hit rate on 10,000-token repeated prefixes is often a bigger win than switching providers.
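To sanity-check that claim for your own setup, the arithmetic is short. In the sketch below, `cached_discount` (the fraction taken off cached input tokens) is an assumption you have to look up for your provider. The 0.5 used here is a placeholder, not a quoted price. The 10,000-token prefix, 40% hit rate, and gpt-5.1 input price come from this article.

```python
def blended_input_cost(prefix_tokens: int, hit_rate: float,
                       usd_per_m_input: float, cached_discount: float) -> float:
    """Average USD per call for the repeated prefix.

    cached_discount is the fraction taken off cached input tokens
    (0.5 means cached tokens cost half price). It is a placeholder
    input here, not a published provider rate.
    """
    full = prefix_tokens * usd_per_m_input / 1_000_000
    return full * (1.0 - hit_rate * cached_discount)


baseline = blended_input_cost(10_000, hit_rate=0.0,
                              usd_per_m_input=1.25, cached_discount=0.5)
with_cache = blended_input_cost(10_000, hit_rate=0.4,
                                usd_per_m_input=1.25, cached_discount=0.5)
print(f"prefix cost per call: ${baseline:.5f} -> ${with_cache:.5f}")
```

Application-level caches that skip the model call entirely save more than this, because a hit there costs zero tokens rather than discounted ones.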

Batch offline work and cap output tokens

Agents love to talk. Your invoice does not. Output tokens cost roughly 4 to 8 times as much as input tokens for every model in the pricing table above, so I cap them ruthlessly.

For most tool-calling steps, I set max output between 128 and 512 tokens. For extraction, I want JSON and nothing else. For planning, I want the next action, not a motivational essay. For user-facing final answers, I allow more, but still cap by product surface: a support answer rarely needs 2,000 tokens.
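I find it easiest to keep those caps in one table instead of scattering them across call sites. A small sketch follows. The step names and exact limits are illustrative assumptions that stay within the ranges above, and the returned value is what you would pass as the max-output-token setting of whichever client you use.

```python
from typing import Optional

# Hypothetical step names; limits are illustrative, within the ranges above.
MAX_OUTPUT_TOKENS = {
    "tool_call": 256,
    "extraction": 512,
    "planning": 256,
    "final_answer": 1_000,
}
DEFAULT_CAP = 256


def output_cap(step: str, requested: Optional[int] = None) -> int:
    """Return the max output tokens for a step, never above its cap."""
    cap = MAX_OUTPUT_TOKENS.get(step, DEFAULT_CAP)
    return cap if requested is None else min(requested, cap)


print(output_cap("tool_call"))
print(output_cap("final_answer", requested=4_000))
```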

Batching helps when the work is not interactive: nightly document summarization, evaluation runs, CRM enrichment, ticket triage, embedding-adjacent metadata generation, and test-case generation for code agents. If you have 50,000 documents to classify, do not run them through an agent loop one by one with full orchestration overhead. Use batch APIs or your own queue with larger grouped requests, strict schemas, and cheap models like gemini-2.0-flash, gpt-4.1-nano, or gpt-4o-mini.
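A grouped request can be as simple as slicing the documents and numbering them in one prompt. The helpers below are a sketch with hypothetical names. The prompt wording is mine, and the actual model or batch API call is left out because it depends on your provider.

```python
from typing import Iterator, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def build_batch_prompt(docs: Sequence[str], labels: Sequence[str]) -> str:
    """One prompt that classifies many documents and asks for a JSON array back."""
    numbered = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(docs))
    return (
        f"Classify each document as one of: {', '.join(labels)}.\n"
        'Reply with a JSON array of objects like {"id": 0, "label": "..."} '
        "and nothing else.\n\n"
        f"{numbered}"
    )


docs = ["refund request", "password reset", "invoice question", "login error", "cancel plan"]
for group in batched(docs, batch_size=2):
    prompt = build_batch_prompt(group, labels=["billing", "account"])
    print(len(group), "documents,", len(prompt), "characters")
```

Grouping amortizes the instructions and schema across many documents, so the fixed part of the prompt is paid once per batch instead of once per document.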

In practice, I split workloads into two lanes: interactive calls optimized for latency and reliability, and offline calls optimized for throughput and price. LangGraph makes this clean with separate graphs. In Vercel AI SDK, I keep background jobs out of request handlers. In LlamaIndex, I precompute summaries and metadata during ingestion rather than during user queries. The user should not pay latency or money for work you could have done yesterday.

Monitor cost per step, not monthly spend

Monthly spend is a lagging indicator. By the time it looks wrong, the bug has already run for days. I monitor cost at the level where agent behavior actually changes: model call, tool call, step, run, user, tenant, and feature.

The metrics I care about are simple: input tokens, output tokens, cached input tokens, model name, latency, retries, tool calls, finish reason, error type, and whether the run achieved its goal. Then I compute cost per successful run, not cost per request. Failed agent loops are the silent killer.
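The difference between cost per request and cost per successful run is easy to show in code. In this sketch, `RunRecord` is a hypothetical, stripped-down log record, and the sample numbers are made up purely to illustrate the calculation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    run_id: str
    cost_usd: float
    succeeded: bool


def cost_per_successful_run(runs: list[RunRecord]) -> float:
    """Total spend (failed runs included) divided by the number of successes."""
    successes = sum(1 for r in runs if r.succeeded)
    if successes == 0:
        return float("inf")  # money spent, nothing achieved
    return sum(r.cost_usd for r in runs) / successes


# Made-up sample: two cheap successes and one expensive failed loop.
runs = [
    RunRecord("r1", 0.02, True),
    RunRecord("r2", 0.02, True),
    RunRecord("r3", 0.10, False),
]
average = sum(r.cost_usd for r in runs) / len(runs)
print(f"average per run: ${average:.3f}")
print(f"per successful run: ${cost_per_successful_run(runs):.3f}")
```

The failed loop is invisible in the per-run average but dominates the per-success figure, which is the number that maps to what a resolved ticket really costs.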

Set budgets in code. A research agent might get 20 steps and $0.50 per run. A support triage agent might get 5 steps and $0.03. A code agent editing production repositories might get an expensive reasoning escalation, but only after a cheap model has produced a failing-test summary and a patch plan.
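A budget in code can be a very small object that the agent loop charges after every step. `RunBudget` and `BudgetExceeded` are hypothetical names for this sketch. The limits in the example mirror the support triage numbers above, and the per-step costs are made up.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a run goes over its step or dollar limit."""


class RunBudget:
    def __init__(self, max_steps: int, max_usd: float) -> None:
        self.max_steps = max_steps
        self.max_usd = max_usd
        self.steps = 0
        self.spent_usd = 0.0

    def charge(self, step_cost_usd: float) -> None:
        """Record one completed step; raise once either limit is exceeded."""
        self.steps += 1
        self.spent_usd += step_cost_usd
        if self.steps > self.max_steps:
            raise BudgetExceeded(f"step limit {self.max_steps} exceeded")
        if self.spent_usd > self.max_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.4f}, limit ${self.max_usd:.2f}")


triage_budget = RunBudget(max_steps=5, max_usd=0.03)
try:
    for step_cost in [0.004, 0.004, 0.05]:  # made-up per-step costs
        triage_budget.charge(step_cost)
except BudgetExceeded as err:
    print(f"stopping run: {err}")
```

This version raises after the step that broke the budget has already been paid for. A stricter variant would estimate the next step's cost and refuse before making the call.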

Framework specifics: with LangSmith or OpenTelemetry, attach token and model metadata to every span. In LangGraph, emit state transitions with accumulated cost. In LlamaIndex, trace retriever, reranker, and synthesizer costs separately. In Vercel AI SDK, log usage from every generation and stream. I built Tokenwise because I wanted this view without reconstructing invoices from provider dashboards once a week.

Use open models where latency and ops make sense

Open models are not magic free compute. GPUs cost money, engineers cost more, and bad inference utilization can make “cheap” self-hosting surprisingly expensive. Still, for the right agent steps, Llama, DeepSeek, Mistral, and Qwen models are very useful in 2026.

I reach for open models when traffic is high, prompts are stable, latency targets are predictable, and the task can tolerate slightly more variance. Good fits: classification, extraction, reranking, internal summarization, synthetic data generation, eval judging for low-stakes checks, and domain-specific agents after fine-tuning. Bad fits: rare high-value reasoning paths where a failed answer costs more than the model call.

DeepSeek and Qwen are especially strong for coding and structured reasoning at attractive serving costs. Llama remains a dependable general-purpose choice with a broad tooling ecosystem. Mistral models are still a good pick when you care about efficient European deployment and tight inference.

My rule is blunt: if you can keep GPUs busy above roughly 40-50% utilization, self-hosting or a dedicated inference provider can beat API pricing for repetitive workloads. If your traffic is spiky and your team is small, use hosted APIs and optimize routing first. Owning idle GPUs is not a cost strategy. It is a hobby with invoices.
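The utilization rule falls out of simple arithmetic: an idle GPU still bills by the hour. The sketch below computes an effective price per 1M tokens. The $2.00 per hour GPU price and the 2,000 tokens per second throughput are made-up placeholders, not benchmarks, and the model ignores engineering time, redundancy, and the split between input and output tokens.

```python
def self_hosted_usd_per_m_tokens(gpu_usd_per_hour: float,
                                 peak_tokens_per_second: float,
                                 utilization: float) -> float:
    """Effective USD per 1M tokens when the GPU is busy `utilization` of the time."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = peak_tokens_per_second * 3600 * utilization
    return gpu_usd_per_hour / tokens_per_hour * 1_000_000


# Placeholder numbers for illustration only; measure your own.
for utilization in (0.10, 0.25, 0.50, 0.90):
    price = self_hosted_usd_per_m_tokens(2.00, 2_000, utilization)
    print(f"{utilization:.0%} busy -> ${price:.2f} per 1M tokens")
```

Halving utilization doubles the effective token price, so the comparison against API pricing only makes sense once you plug in measured throughput and real traffic.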

Verdict

If I had to cut an agent bill this week, I would not start with a provider migration. I would route every step by task, move clerical work to gpt-4.1-nano, gpt-4o-mini, gemini-2.0-flash, or claude-3-haiku, trim retrieval context hard, cache stable prefixes and tool results, and put a dollar budget on every run.

The expensive models are still worth it. I use gpt-5.1, o3, claude-sonnet-4, gemini-2.5-pro, and sometimes claude-opus-4 when the step genuinely needs them. But an agent that spends premium-model money on every thought is badly engineered. Make cheap calls the default, make escalation explicit, and measure cost per successful outcome. That is how you reduce LLM cost for AI agents without making the product dumber.

Frequently asked questions

How do I reduce LLM cost for AI agents without hurting quality?

Use model routing first. Keep cheap models such as gpt-4.1-nano, gpt-4o-mini, gemini-2.0-flash, and claude-3-haiku on classification, extraction, summarization, and tool-selection steps. Escalate only hard reasoning or final synthesis to gpt-5.1, o3, claude-sonnet-4, or gemini-2.5-pro. Then trim context and cap output tokens.

What is the best cheap model for AI agent steps in 2026?

For pure cost, I like gpt-4.1-nano at $0.10 / 1M input, $0.40 / 1M output and gemini-2.0-flash at $0.10 / 1M input, $0.40 / 1M output. For a stronger cheap default, gpt-4o-mini at $0.15 / 1M input, $0.60 / 1M output is still one of the easiest picks.

Should I use GPT-5.1 for every agent call?

No. GPT-5.1 is a strong model at $1.25 / 1M input, $10.00 / 1M output, but most agent steps do not need that capability. Use it for complex synthesis, high-value decisions, and final answers where quality matters. Use cheaper models for routing, summaries, extraction, and tool arguments.

How much can prompt trimming save on LLM API costs?

A lot. If you cut an agent step from 40,000 input tokens to 6,000 input tokens, you save 34,000 input tokens per call. On gpt-5.1, that is about $0.0425 saved every call. On an 8-step agent running 50,000 times per month, that comes to about $17,000 per month if every step carried the full payload.

Does caching work for AI agents?

Yes, if you cache the stable parts. Cache system prompts, tool schemas, document summaries, read-only tool results, classifications, and retrieval transformations. The agent’s final decision may be dynamic, but much of the context it consumes is repeated. Stable-prefix caching and application-level Redis or Postgres caches are both useful.

Are open-source models cheaper for agents than API models?

They can be, especially for high-volume repetitive tasks with steady traffic. Llama, DeepSeek, Mistral, and Qwen models work well for extraction, classification, summaries, and coding substeps. If your GPU utilization is low or your traffic is spiky, hosted APIs plus good routing usually win.