How to Reduce LLM Costs in Claude Code Workflows (2026)

Reduce LLM cost in Claude Code with model routing, context trimming, caching, batching, and monitoring tactics that cut spend without wrecking code quality.

By Theo · Maker of Tokenwise
Claude Code

Key takeaways

  • Use claude-sonnet-4 as the default coding model at $3.00 / 1M input, $15.00 / 1M output; reserve claude-opus-4 for hard planning and review.
  • Route cheap prep work to claude-3.5-haiku, gpt-5.1-codex-mini, gpt-4.1-mini, or gemini-2.5-flash instead of burning Sonnet or Opus turns.
  • Trim context with short CLAUDE.md files, targeted file reads, /clear between tasks, and /compact before sessions hit the 150k–200k danger zone.
  • Prompt caching can turn repeated Sonnet input from $3.00 / 1M input into roughly $0.30 / 1M cached-read input for stable repo context.
  • Batch async inventories, summaries, and first-pass reviews; 50% batch discounts are often easier than squeezing another 5% out of prompts.

If you want to reduce LLM cost in Claude Code, the biggest wins are not exotic. Stop sending the whole repo, stop using Opus for routine edits, cache the stable parts of your prompt, and measure cost per task instead of staring at your monthly bill.

I use Claude Code heavily for real engineering work: refactors, test repair, PR review, migration scripts, and the occasional “please explain why this ancient service is on fire.” The expensive sessions all have the same smell: huge context, vague instructions, repeated repository background, and a premium model doing cheap work.

My default 2026 setup is simple: Claude Sonnet 4 for most coding, Claude Opus 4 only for hard planning and review, and cheaper models for summarization, classification, and mechanical cleanup. That alone can cut spend by 50–80% before you touch caching.

Know where Claude Code actually spends tokens

Claude Code feels like a terminal tool, but the bill behaves like any agent loop: input context grows, tool results come back, the model reasons, then it emits code, shell commands, or explanations. The killer is repeated input. A 2,000-token answer is not scary. A 120,000-token context sent 30 times is.

I track four numbers for every session:

  • Input tokens per turn: usually the largest cost driver in repo-heavy work.
  • Output tokens per turn: important for verbose plans, generated files, and test logs.
  • Tool-result tokens: grep output, stack traces, diffs, and file contents copied back into context.
  • Cache read rate: the difference between paying full price for repo instructions and paying pennies.

A realistic bad session: 60 turns, 80k average input tokens, 3k average output tokens on claude-sonnet-4. At $3.00 / 1M input, $15.00 / 1M output, that is about $41.40. The same task with 25k input, 1.5k output, and cache hits on stable context lands closer to $10–14. Same model. Same developer. Less waste.

The trick is not being stingy. It is refusing to pay premium rates for tokens that do not help the next decision.

Route models by job, not by habit

I like Claude Code most with claude-sonnet-4 as the default. It is strong enough to edit real code, follow project conventions, and recover from failing tests. I only reach for claude-opus-4 when I need architectural judgment, nasty debugging, or a second-pass review before a risky merge. Opus is excellent. It is also five times Sonnet’s input price and five times the output price. Don’t let it rename variables for you.

If your Claude Code workflow goes through a router or your own API wrapper, add cheaper specialist routes. Send summaries, file classification, test-log compression, and issue triage to Haiku, Gemini Flash, GPT-4.1-mini, or GPT-5.1-codex-mini. Keep the editor loop on Sonnet unless the task is trivial.

ModelPricingContext windowUse in Claude Code workflow
claude-sonnet-4$3.00 / 1M input, $15.00 / 1M output200kDefault coding agent: edits, tests, refactors, PR fixes.
claude-opus-4$15.00 / 1M input, $75.00 / 1M output200kHard architecture, deep debugging, high-risk review.
claude-3.5-haiku$0.80 / 1M input, $4.00 / 1M output200kSummaries, issue triage, log compression, small edits.
claude-3-haiku$0.25 / 1M input, $1.25 / 1M output200kCheap classification and extraction. Not my pick for serious coding.
gpt-5.1-codex-mini$0.25 / 1M input, $2.00 / 1M output400kLow-cost code transforms and repo-aware helper tasks in mixed-provider stacks.
gpt-4.1-mini$0.40 / 1M input, $1.60 / 1M output1MLarge-context scanning, generated summary drafts, migration inventories.
gemini-2.5-flash$0.30 / 1M input, $2.50 / 1M output1MFast long-context summarization and cheap background analysis.

My rule is blunt: expensive models decide; cheap models prepare.

Trim context before you trim capability

Most Claude Code overspend starts with context gluttony. The agent reads too much, keeps too much, and then pays to resend it. I fix that before touching model quality.

Keep CLAUDE.md short and operational. Mine is closer to a runbook than a philosophy document: commands, test strategy, architecture boundaries, naming rules, and “do not touch” paths. If it is longer than 150–250 lines, it probably contains stale lore that belongs in docs, not every prompt. Same with custom slash commands under .claude/commands: make them precise, not essay-shaped.

Use targeted file references. Ask Claude Code to inspect git diff --stat, rg, and specific files before it reads whole directories. Never paste lockfiles, snapshots, generated clients, minified bundles, coverage reports, or full CI logs unless the task truly needs them. For test failures, pass the failing test name, stack trace, relevant assertion, and changed files first. You can always expand.

I also reset aggressively. Use /clear between unrelated tasks and /compact before a productive session turns into a 180k-token backpack. A compacted summary is not free, but dragging an entire conversation through every future turn is worse. This is the least glamorous optimization. It is also the one that saves the most money.

Use prompt caching like a repo-level discount

Prompt caching is perfect for Claude Code because a lot of the input is stable: system instructions, coding standards, repo map, package commands, API contracts, and the same CLAUDE.md content. If you are calling Anthropic’s API directly around Claude Code-style agents, put the stable prefix first and mark cacheable blocks with cache control. Keep volatile material — the user request, current diff, tool results — after the cached prefix.

For Claude models, the useful mental math is: cache writes cost about 1.25x normal input, and cache reads cost about 0.1x normal input. On claude-sonnet-4, normal input is $3.00 / 1M input, so a cached read is roughly $0.30 / 1M input. On a 40k-token stable prefix reused across 50 turns, that turns about 2M repeated input tokens from $6.00 into about $0.60 after the initial write. Small session, small savings. Big repo agent, real money.

The cache only works if the prefix is byte-stable. Don’t inject timestamps, random ordering, changing git status, or noisy summaries before the cache boundary. I’ve seen teams “enable caching” and get terrible hit rates because their repo summary changed every turn. That is not caching. That is expensive confetti.

Batch the work that does not need a live agent

Claude Code is great for interactive engineering. It is a bad shape for 500 independent micro-tasks if you run them as 500 chatty sessions. Batch the boring work.

Good batch candidates: summarize 200 changed files, classify flaky test logs, generate first-pass PR comments, extract API endpoint inventories, convert old config files, or produce migration checklists. Bad batch candidates: “fix this failing subsystem” or “decide the new architecture.” Those need iteration and judgment.

Anthropic’s Message Batches API is the obvious move for Claude workloads because batch pricing is cheaper for async jobs, commonly around a 50% discount from standard API rates. OpenAI’s Batch API has the same basic economic shape. So a Sonnet job priced at $3.00 / 1M input, $15.00 / 1M output can behave more like $1.50 / 1M input, $7.50 / 1M output when it does not need immediate response. For OpenAI helper work, gpt-4.1-mini at $0.40 / 1M input, $1.60 / 1M output is already cheap; batching makes it almost silly for large inventories.

My pattern is to let Claude Code identify the batch, write the job spec, and review sampled outputs. The agent stays in the loop as the supervisor, not the factory line.

Monitor cost per workflow, then enforce budgets

You cannot optimize what you only see as a provider invoice. I care about cost per meaningful unit: cost per PR, cost per accepted edit, cost per fixed test, cost per review comment kept, and cost per migration file. Raw token totals are useful, but they do not tell you whether the spend produced engineering value.

At minimum, log this for every Claude Code session:

  • Model used per turn and whether the session escalated to Opus.
  • Input, output, and cache-read tokens separately.
  • Tool calls and largest tool outputs, because logs quietly eat budgets.
  • Context size at p50 and p95, not just the average.
  • Outcome: merged PR, failed attempt, abandoned branch, accepted review, reverted code.

Then add hard gates. Warn at 80k context. Require confirmation before Opus. Auto-summarize tool output above 10k tokens. Stop sessions above a per-task cap unless the user approves. In Tokenwise, I usually tag these runs by repo, branch, model, and workflow type so the waste is visible by Monday morning instead of month-end.

The best budget control is social: show engineers the cost of one bloated session next to one clean session. People change fast when the numbers are tied to their own workflow.

Verdict

The cleanest way to reduce Claude Code spend is not to make the model dumber. Keep claude-sonnet-4 as the workhorse, use claude-opus-4 like a senior reviewer, and push prep work to cheaper models or batches. Then cut repeated context with short instructions, targeted file reads, caching, and aggressive session resets.

If I had to implement only one policy tomorrow, it would be this: Sonnet by default, Opus behind confirmation, warning at 80k context, hard stop near 160k, and cached stable repo context. That gives you most of the savings without turning Claude Code into a frustrating toy.

Frequently asked questions

How do I reduce LLM cost in Claude Code the fastest?

Switch routine coding to claude-sonnet-4, require approval for claude-opus-4, shorten CLAUDE.md, and use /clear between unrelated tasks. Those four changes usually cut spend faster than any clever prompt trick.

Is Claude Opus 4 worth the cost for Claude Code?

Yes, but only for the expensive decisions: architecture, deep debugging, security-sensitive changes, and final review of risky PRs. At $15.00 / 1M input, $75.00 / 1M output, I do not use Opus for normal edits, test repair, or mechanical refactors.

What is the best default model for Claude Code workflows?

claude-sonnet-4 is my default. It has the right balance of coding quality, tool use, instruction following, and price at $3.00 / 1M input, $15.00 / 1M output. Haiku is cheaper, but I only trust it for simpler helper work.

Does prompt caching really lower Claude API costs?

Yes. For stable repo context, prompt caching is one of the cleanest savings. A cached read is roughly 10% of normal input cost, so repeated Sonnet context can behave like about $0.30 / 1M cached input instead of $3.00 / 1M input.

Should I use GPT or Gemini models inside a Claude Code cost strategy?

Yes, if your workflow supports routing through a proxy or custom agent layer. I like gpt-5.1-codex-mini at $0.25 / 1M input, $2.00 / 1M output, gpt-4.1-mini at $0.40 / 1M input, $1.60 / 1M output, and gemini-2.5-flash at $0.30 / 1M input, $2.50 / 1M output for cheap prep tasks.

What metrics should I track for Claude Code cost optimization?

Track cost per PR, input tokens per turn, output tokens per turn, cache hit rate, model mix, p95 context size, and tool-output size. The p95 context number catches the runaway sessions that averages hide.

More guides

Add this to your app in one line

Point your OpenAI baseURL at Tokenwise and every call is logged, priced, and optimizable — no SDK rewrite, no LangChain required.