Free playbook · ~10 min read

The $1k LLM bill rescue playbook

Ten specific things to do this week — model swaps, prompt caches, max_tokens caps, batching. Each tactic has real code, before/after math, and a 15-minute hook.


10 specific things to do this week. Each one is concrete, copy-pasteable, and bounded to 15-60 minutes of work.

If you're spending $500-$2,000/month on LLM APIs and you've never actually sat down to optimize, I can almost guarantee you can halve the bill in a week without anyone noticing. I've done this for three of my own projects and helped a handful of other makers do it for theirs. The savings always come from the same boring places: wrong model on cheap calls, no cache, no max_tokens cap, three days of forgotten retries hammering the API.

This isn't a "5 tips for LLM costs" listicle. The tactics below are ranked roughly by highest payoff, lowest risk — read top-down and stop when you've hit your target.

Quick legend:

  • TS = TypeScript snippet, PY = Python snippet
  • Numbers in the math tables come from public 2026 pricing, rounded to two decimals

Quick context on me: I run two production AI apps and burned $400 last month on a misconfigured Anthropic fallback because I forgot the chain was set to retry on 5xx. The lessons here are the receipts from those mistakes.


Tactic 1 — Move cheap-tier calls off the frontier model

The single biggest line item on most bills is the same call you've been making for months: a classification, a router, a tag extractor. And you're paying $2.50 or more per 1M input tokens for it.

gpt-4o costs ~17x more than gpt-4o-mini per token ($2.50 vs $0.15 per 1M input). claude-sonnet-4 costs about 3x more than claude-haiku-4-5 ($3 vs $1 per 1M input). For a binary classification, JSON extraction from a structured prompt, or a "which tool should I call" routing decision, the frontier model is overkill. Mini-tier models pass 95%+ of the time when your eval set is well-written.

Do this in 15 minutes:

TS — OpenAI:

// before
const res = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: classifyPrompt(input) }],
});

// after
const res = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: classifyPrompt(input) }],
});

PY — Anthropic:

# before
res = anthropic.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=200,
    messages=[{"role": "user", "content": classify_prompt(input)}],
)

# after
res = anthropic.messages.create(
    model="claude-haiku-4-5",
    max_tokens=200,
    messages=[{"role": "user", "content": classify_prompt(input)}],
)

Before/after math (100k classification calls/month, 800 input + 50 output tokens each):

| Path | Input cost | Output cost | Monthly total |
| --- | --- | --- | --- |
| gpt-4o ($2.50/$10 per 1M) | $200 | $50 | $250 |
| gpt-4o-mini ($0.15/$0.60 per 1M) | $12 | $3 | $15 |
| Savings | | | $235/mo (94%) |
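
If you want to rerun the table with your own volumes, the arithmetic fits in one small helper. This is a minimal sketch: `monthly_cost` is a name I made up for illustration, and the prices are the per-1M-token figures quoted in the table, so swap in whatever your provider currently charges.

```python
def monthly_cost(
    calls: int,
    in_tokens: int,
    out_tokens: int,
    in_price_per_m: float,
    out_price_per_m: float,
) -> float:
    """Monthly USD cost for `calls` requests, given per-1M-token prices."""
    input_cost = calls * in_tokens / 1_000_000 * in_price_per_m
    output_cost = calls * out_tokens / 1_000_000 * out_price_per_m
    return round(input_cost + output_cost, 2)


# 100k classification calls/month, 800 input + 50 output tokens each.
frontier = monthly_cost(100_000, 800, 50, 2.50, 10.00)
mini = monthly_cost(100_000, 800, 50, 0.15, 0.60)
savings = round(frontier - mini, 2)
print(f"frontier=${frontier} mini=${mini} savings=${savings}")
```

Plug in your real average token counts from your logs before trusting the result; the 800/50 split is only the example's assumption.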

Run your eval set against both before you flip the switch (see Tactic 7). 19 times out of 20, the mini-tier passes. The one time it doesn't is usually a prompt that needs rewriting, not a model that needs upgrading.


Tactic 2 — Use Anthropic's prompt cache for stable system prompts

If you have any prompt with a system message longer than ~1k tokens that doesn't change between calls (a long instruction, a knowledge dump, a few-shot example block), you're leaving 90% on the table.

Anthropic's prompt cache discounts cached input tokens to ~10% of the regular price. The catch: the first call (cache write) costs 25% extra. A write plus one read costs 1.35x the base price versus 2x uncached, so caching pays off as soon as you get a single cache hit before the 5-minute TTL expires (the TTL resets on every hit).

import anthropic

client = anthropic.Anthropic()
SYSTEM = open("long_system_prompt.txt").read()  # ~3000 tokens

res = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=600,
    system=[
        {
            "type": "text",
            "text": SYSTEM,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": user_query}],
)
print(res.usage)
# usage = { input_tokens, cache_creation_input_tokens, cache_read_input_tokens, ... }

The first call returns cache_creation_input_tokens=3000. Every subsequent call within 5 minutes returns cache_read_input_tokens=3000 instead. Reads are priced at 10% of base input; creations at 125%.

Before/after math (a 3k-token system prompt, 500 calls/day, Claude Sonnet 4 at $3/1M input):

| Path | Daily input cost | Monthly |
| --- | --- | --- |
| No cache | 500 × 3000 / 1M × $3 = $4.50 | $135 |
| With cache (1 write/hour, 8 writes/day) | (8 × 3000 × 1.25 + 492 × 3000 × 0.10) / 1M × $3 = $0.53 | $16 |
| Savings | | $119/mo (88%) |

If your traffic is bursty (one call every few hours), the cache TTL will expire between calls and you'll pay the 25% write premium every time. In that case skip this tactic — it'll cost you ~25% more, not save anything.
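
To check which side of that line your own traffic falls on, you can model the write/read split directly. A rough sketch, assuming the 1.25x write and 0.10x read multipliers quoted above; the function names are hypothetical and the write count is something you would estimate from your own traffic pattern.

```python
def uncached_daily_cost(calls: int, prompt_tokens: int, price_per_m: float) -> float:
    """Daily USD cost of sending the system prompt uncached on every call."""
    return calls * prompt_tokens / 1_000_000 * price_per_m


def cached_daily_cost(
    calls: int,
    writes: int,
    prompt_tokens: int,
    price_per_m: float,
    write_mult: float = 1.25,  # cache write premium quoted above
    read_mult: float = 0.10,   # cache read discount quoted above
) -> float:
    """Daily USD cost when `writes` calls create the cache and the rest read it."""
    reads = calls - writes
    billed_tokens = prompt_tokens * (writes * write_mult + reads * read_mult)
    return billed_tokens / 1_000_000 * price_per_m


# Steady traffic: 500 calls/day, 8 cache writes, 3k-token prompt, $3/1M input.
steady = cached_daily_cost(500, 8, 3000, 3.0)

# Bursty traffic: every call misses the cache, so every call is a write.
bursty = cached_daily_cost(6, 6, 3000, 3.0)
bursty_baseline = uncached_daily_cost(6, 3000, 3.0)
print(f"steady=${steady:.2f} bursty=${bursty:.4f} vs uncached=${bursty_baseline:.4f}")
```

When `writes` equals `calls`, the cached path is exactly 25% more expensive than no cache at all, which is the bursty case described above.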

OpenAI also has prompt caching, applied automatically to prompts of 1024 tokens or more with no setup. Same idea, smaller discount (~50% off cached input). You don't need to do anything to enable it, but the cache matches on the prompt prefix, so put the stable system prompt first and keep it byte-stable (no timestamps, no random IDs, no shuffled examples).


Tactic 3 — Cap your max_tokens

Open your code. Search for max_tokens. I bet you find one of these:

  • It's set to 4096, the default
  • It's set to 2000 because you copy-pasted from a tutorial
  • It's not set at all

Now go look at your actual output lengths. Most real production calls return well under 500 tokens. Anything you set above your real p99 is tokens you'll never generate — except the provider will happily keep going if the model rambles, and you'll pay for every token of it.

// before — accepts up to 4096 tokens you'll never read
const res = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  max_tokens: 4096,
  messages: [...],
});

// after — capped to what you actually use
const res = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  max_tokens: 500, // p99 was 380 last week
  messages: [...],
});

The waste formula:

monthly_waste = (cap_set - your_p99_observed) × output_price_per_token × call_volume

Concrete: 200k calls/month, max_tokens=4096, real p99 = 400, model = gpt-4o ($10/1M output):

waste = (4096 - 400) × $0.00001 × 200000 = $7,392/mo of potential overrun

You won't hit that every call — but on the 1-5% of runaway calls where the model loops or hallucinates a wall of text, you'll pay full freight. Capping at 500 tokens means a runaway call costs you $0.005 instead of $0.04.

For tasks where you genuinely need flexibility (long-form generation, coding), set max_tokens to 2x your observed p99, not the API default.
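
Finding your real p99 takes a few lines once you log output token counts (for OpenAI that is `usage.completion_tokens` on each response). Below is a minimal sketch using a nearest-rank percentile; `suggest_max_tokens` and its `headroom` parameter are names I made up for illustration.

```python
import math


def suggest_max_tokens(completion_tokens: list[int], headroom: float = 2.0) -> int:
    """Return the nearest-rank p99 of observed output lengths times `headroom`."""
    if not completion_tokens:
        raise ValueError("need at least one observed completion length")
    ordered = sorted(completion_tokens)
    n = len(ordered)
    # Nearest-rank p99: ceil(0.99 * n), computed in integers to avoid float drift.
    rank = (99 * n + 99) // 100
    p99 = ordered[rank - 1]
    return math.ceil(p99 * headroom)


# Example with synthetic lengths 1..100 tokens: p99 is 99, so the cap is 198.
cap = suggest_max_tokens(list(range(1, 101)))
print(cap)
```

Use `headroom=2.0` for the flexible tasks mentioned above and something closer to 1.3 for tightly structured outputs.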


Tactic 4 — Trim conversation context

Conversational agents get more expensive with every turn, because each call resends the whole history. If you're sending the full message history every call, by turn 20 you're sending ~4000 tokens of history for every 200-token user message.

A 10-line helper fixes most of this:

// keep system + last N user/assistant pairs
export function trimMessages<T extends { role: string }>(
  messages: T[],
  keepLastN = 6,
): T[] {
  const system = messages.filter((m) => m.role === "system");
  const turns = messages.filter((m) => m.role !== "system");
  const trimmed = turns.slice(-keepLastN * 2); // pairs
  return [...system, ...trimmed];
}

PY version:

def trim_messages(messages: list[dict], keep_last_n: int = 6) -> list[dict]:
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    trimmed = turns[-keep_last_n * 2:]
    return [*system, *trimmed]

For agents that need long memory, the better pattern is summarization: every 10 turns, replace the older messages with a 2-sentence summary generated by a mini-tier model. The summary costs $0.001 to generate and replaces ~3000 tokens of context.
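
Here is one way the summarization pattern could look. It is a sketch under assumptions: `compact_history` is a hypothetical helper, the `summarize` callable stands in for your mini-tier model call, and the summary is stored as an extra system message, which fits OpenAI-style message lists but would need adapting for APIs that take the system prompt separately.

```python
from typing import Callable

Message = dict[str, str]


def compact_history(
    messages: list[Message],
    summarize: Callable[[list[Message]], str],
    keep_recent: int = 10,
) -> list[Message]:
    """Replace everything older than the last `keep_recent` turns with a summary."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    if len(turns) <= keep_recent:
        return list(messages)  # nothing old enough to summarize yet

    older, recent = turns[:-keep_recent], turns[-keep_recent:]
    summary: Message = {
        "role": "system",
        "content": "Summary of earlier conversation: " + summarize(older),
    }
    return [*system, summary, *recent]


# Usage with a fake summarizer; in production this would call a mini-tier model.
history = [{"role": "system", "content": "You are helpful."}] + [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"message {i}"}
    for i in range(14)
]
compacted = compact_history(history, lambda older: f"{len(older)} earlier messages")
print(len(history), "->", len(compacted))
```

Injecting the summarizer keeps the helper testable without network access and lets you swap the summarizing model later.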

Before/after math (a chat agent, 5000 daily turns, average turn = 200 tokens, gpt-4o-mini at $0.15/1M input):

| Path | Avg context | Monthly input |
| --- | --- | --- |
| Full history | 4000 tokens | $90 |
| Last 6 turns | 1200 tokens | $27 |
| Savings | | $63/mo (70%) |

Looks small? Savings scale linearly with traffic: a B2B chatbot handling 100x the turns saves ~100x as much.


Tactic 5 — Stream + early-exit on long outputs

Two-for-one tactic: streaming gets you a faster perceived response (good UX) and lets you abort the generation as soon as you have what you need (good wallet).

If you're extracting structured data (JSON, a tag list, a yes/no), you usually know the response is "done" well before the model decides to stop. Stream, watch the buffer, abort when you have a closing } or your stop condition fires.

import OpenAI from "openai";

const openai = new OpenAI();
const controller = new AbortController();

const stream = await openai.chat.completions.create(
  {
    model: "gpt-4o-mini",
    max_tokens: 1000,
    stream: true,
    messages: [
      { role: "system", content: "Return JSON only, then stop." },
      { role: "user", content: "Extract name, age, city from: ..." },
    ],
  },
  { signal: controller.signal },
);

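// Returns true once the text parses as complete JSON.
function tryParse(text: string): boolean {
  try {
    JSON.parse(text);
    return true;
  } catch {
    return false;
  }
}

let buffer = "";
for await (const chunk of stream) {
  buffer += chunk.choices[0]?.delta?.content ?? "";
  // Once the buffer holds a complete JSON object, kill the stream.
  if (buffer.includes("}") && tryParse(buffer)) {
    controller.abort();
    break;
  }
}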

Note on billing: with most providers, you're still billed for the generated tokens that arrived before abort, not the model's planned full completion. So if you abort at token 50 of a 400-token plan, you save 350 tokens of output cost. This matters most on gpt-4o-tier and Claude Sonnet where output is $10-15/1M.

For Anthropic, the same pattern works — stream=True, watch chunk.delta.text, close the SSE connection client-side when done. There's no "abort" callback because the network close acts as one.
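
The stop condition itself is provider-independent, so here is a PY sketch of just that part. It takes any iterable of text chunks and returns as soon as the buffer parses as JSON. `first_json_object` is a hypothetical helper; wiring it to a real stream (for example the text chunks from your SDK's streaming interface) and closing that stream afterwards is left to the caller.

```python
import json
from typing import Any, Iterable, Optional


def first_json_object(chunks: Iterable[str]) -> tuple[Optional[Any], int]:
    """Consume chunks until the buffer parses as JSON.

    Returns (parsed_value, chunks_consumed), or (None, chunks_consumed)
    if the stream ended without ever forming valid JSON.
    """
    buffer = ""
    consumed = 0
    for chunk in chunks:
        consumed += 1
        buffer += chunk
        if "}" not in chunk:
            continue  # an object cannot have closed in this chunk, skip the parse
        try:
            return json.loads(buffer), consumed
        except json.JSONDecodeError:
            continue  # a nested object closed, the outer one is still open
    return None, consumed


# Simulated stream: the model keeps talking after the JSON is complete.
fake_stream = ['{"name": "Ada", ', '"age": 36, "city": ', '"London"}', " Hope", " that helps!"]
parsed, used = first_json_object(iter(fake_stream))
print(parsed, used)
```

Returning the chunk count makes it easy to log how much of each stream you actually consumed before the early exit.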


Tactic 6 — Cache deterministic responses

If two users (or the same user twice) ask the same question, you should pay for the LLM call once.

The simplest version is exact-match caching: hash the prompt, store the response by hash, TTL on the entry. Five lines of code for a 30-60% hit rate on most real apps (people genuinely ask the same things).

import crypto from "node:crypto";
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
type Message = OpenAI.Chat.ChatCompletionMessageParam;

// In-memory cache: prompt hash -> response text plus expiry timestamp.
const cache = new Map<string, { value: string; expiresAt: number }>();

// Hash the full request (messages + model) so different models never collide.
function cacheKey(messages: Message[], model: string): string {
  return crypto
    .createHash("sha256")
    .update(JSON.stringify({ messages, model }))
    .digest("hex");
}

export async function cachedComplete(
  messages: Message[],
  model: string,
  ttlMs = 60 * 60 * 1000,
): Promise<string> {
  const key = cacheKey(messages, model);
  const hit = cache.get(key);
  if (hit && hit.expiresAt > Date.now()) return hit.value; // fresh hit, no API call

  const res = await openai.chat.completions.create({ model, messages });
  const value = res.choices[0]!.message.content!;
  cache.set(key, { value, expiresAt: Date.now() + ttlMs });
  return value;
}

For production, replace Map with Redis / KV. For semantic caching (hits on "similar" prompts, not just identical), you need embeddings + a vector index. That's a weekend of work to build well; Tokenwise's edge cache does it automatically at the proxy layer if you want to skip the rebuild.

Before/after math (10k daily calls at roughly $0.001 per call, 35% identical-prompt rate, gpt-4o):

| Path | Daily LLM calls | Monthly |
| --- | --- | --- |
| No cache | 10,000 | $300 |
| Exact cache (35% hit) | 6,500 | $195 |
| Semantic cache (50% hit) | 5,000 | $150 |
| Savings (semantic) | | $150/mo (50%) |

Tradeoff: caching kills personalization. If your prompts include the user's name, timestamp, or session ID, the hit rate drops to ~0%. Normalize those out (template the prompt, hash the template, pass user-specific vars separately).
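
One way to do that normalization, sketched in PY: build the cache key from the template ID plus only the variables that change the answer, and drop the volatile ones. The function name, the field names (`user_name`, `timestamp`, `session_id`), and the template ID are all hypothetical, and this is only safe when the response really does not depend on the dropped fields.

```python
import hashlib
import json

VOLATILE_FIELDS = frozenset({"user_name", "timestamp", "session_id"})


def normalized_cache_key(
    template_id: str,
    variables: dict[str, str],
    model: str,
    volatile: frozenset[str] = VOLATILE_FIELDS,
) -> str:
    """Hash the template, model, and answer-relevant variables only."""
    stable = {k: v for k, v in variables.items() if k not in volatile}
    payload = json.dumps(
        {"template": template_id, "vars": stable, "model": model},
        sort_keys=True,  # key order must not change the hash
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Two users asking the same question at different times share one cache entry.
key_a = normalized_cache_key(
    "faq_v1",
    {"question": "How do I reset my password?", "user_name": "Ana", "timestamp": "09:00"},
    "gpt-4o-mini",
)
key_b = normalized_cache_key(
    "faq_v1",
    {"question": "How do I reset my password?", "user_name": "Bo", "timestamp": "17:30"},
    "gpt-4o-mini",
)
print(key_a == key_b)
```

If the greeting must include the user's name, add it in your own code after the cached response comes back rather than asking the model to do it.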


Tactic 7 — Run an eval set, pick the cheapest model that passes

Most people pick models by reading the announcement post. That's how you end up running gpt-4o on a binary classifier.

The right way: write 20-50 canonical inputs with expected outputs, run each one through every candidate model, score them. Then pick the cheapest one that hits your quality bar.

The methodology — 3 paragraphs:

1. Build the eval set from your actual production traffic. Pull 50 real examples from your logs. Don't write synthetic ones — your synthetic data will pass everything and your real users will hit edge cases you didn't imagine. For each example, write down the expected output (or the rubric: "must include X, must not say Y").

2. Run a sweep. For each model in your candidate set (gpt-4o-mini, claude-haiku, gemini-flash, llama-3-70b on Groq), generate the response for all 50 examples. Score each with either an exact-match function (if your output is structured), a string similarity check, or LLM-as-judge with a 3rd model evaluating both responses. Keep the eval prompt simple — "given input X, response A scored 1-5, response B scored 1-5".

3. Build a cost-quality scatter. Plot quality score on Y, cost-per-1k-calls on X. Pick the leftmost (cheapest) model that's above your acceptable quality threshold. There will almost always be a model that's 5-10x cheaper for a <5% quality drop.
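
The selection step at the end of the sweep is simple enough to automate. A minimal sketch, assuming you have already reduced each model's run to a mean quality score and a cost per 1k calls; the `SweepResult` type, the model names, and the numbers are placeholders, not measurements.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SweepResult:
    model: str
    quality: float            # mean eval score, normalized to 0-1
    cost_per_1k_calls: float  # USD


def cheapest_passing(results: list[SweepResult], threshold: float) -> Optional[SweepResult]:
    """Return the lowest-cost model whose quality meets the threshold, if any."""
    passing = [r for r in results if r.quality >= threshold]
    if not passing:
        return None
    return min(passing, key=lambda r: r.cost_per_1k_calls)


# Placeholder sweep output: illustrative values only.
sweep = [
    SweepResult("model-a", quality=0.97, cost_per_1k_calls=2.50),
    SweepResult("model-b", quality=0.93, cost_per_1k_calls=0.30),
    SweepResult("model-c", quality=0.81, cost_per_1k_calls=0.10),
]
pick = cheapest_passing(sweep, threshold=0.90)
print(pick.model if pick else "no model meets the bar")
```

A `None` result is useful information too: it usually means the prompt or the quality bar needs another look before any model swap.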

If you want this automated, the tokenwise-audit Claude skill (in the Claude Skills bundle) bootstraps an eval set from your last 1000 production calls, runs the sweep across 6 candidate models, and dumps a markdown report with the verdict per call type.


Tactic 8 — Provider fallback for resilience AND cost

OpenAI goes down. Anthropic rate-limits you at 3 PM EST. When that happens, you have two bad choices: hard-fail the user, or have the user's request hang while your retry loop hammers the API.

The right pattern is a fallback chain: try the primary model, on transient failure try a different provider with similar-quality output. This both fixes reliability and — bonus — lets you fall back to cheaper providers when the expensive one fails.

const PROVIDERS = [
  { name: "openai", model: "gpt-4o-mini", client: openaiClient },
  { name: "groq", model: "llama-3.1-70b-versatile", client: groqClient },
  { name: "deepseek", model: "deepseek-chat", client: deepseekClient },
];

export async function completeWithFallback(messages: object[]): Promise<string> {
  let lastError: Error | undefined;
  for (const { name, model, client } of PROVIDERS) {
    try {
      const res = await client.chat.completions.create({ model, messages });
      return res.choices[0]!.message.content!;
    } catch (err) {
      lastError = err as Error;
      const status = (err as { status?: number }).status;
      // Retry only transient failures. 4xx (except 429) means OUR bug.
      if (status && status < 500 && status !== 429) throw err;
      console.warn(`[fallback] ${name} failed: ${(err as Error).message}`);
    }
  }
  throw lastError ?? new Error("All providers failed");
}

Warning from personal experience: make sure your fallback chain is bounded. I once shipped a chain that on a 5xx would loop back to provider 1 with no max retries. OpenAI had an outage. The loop ran for two hours. $400 of "retry" calls to a borked endpoint that was returning success codes with garbage payloads. Now I cap retries at 3 across the whole chain, full stop.
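
For reference, a hard cap across the whole chain can look like the PY sketch below. It is an illustration of the bounding idea rather than the exact code I run: `TransientError` stands in for a 429 or 5xx, each provider is a plain callable, and backoff between attempts is left out for brevity.

```python
from typing import Callable, Optional, Sequence


class TransientError(Exception):
    """Stand-in for a retryable provider failure (429 or 5xx)."""


def complete_bounded(
    providers: Sequence[Callable[[str], str]],
    prompt: str,
    max_attempts: int = 3,
) -> str:
    """Cycle through providers, but never make more than `max_attempts` calls."""
    if not providers:
        raise ValueError("need at least one provider")
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        provider = providers[attempt % len(providers)]
        try:
            return provider(prompt)
        except TransientError as err:
            last_error = err  # remember it and move on to the next provider
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error


# Usage with fake providers that record every call they receive.
call_log: list[str] = []


def always_down(prompt: str) -> str:
    call_log.append("down")
    raise TransientError("503 from provider")


def healthy(prompt: str) -> str:
    call_log.append("up")
    return "ok: " + prompt


print(complete_bounded([always_down, healthy], "hello"))
```

The attempt counter lives outside the provider loop on purpose, so wrapping around to provider 1 still counts against the same budget.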

Groq's Llama 3.1 70b is roughly $0.59 per 1M input / $0.79 per 1M output: about 4x cheaper than gpt-4o on input, though still pricier than gpt-4o-mini ($0.15/$0.60). DeepSeek is even cheaper. If your primary is a frontier-tier model and you set these as fallbacks, on outage days you actually save money instead of just preserving uptime.


Tactic 9 — Batch where the API supports it

OpenAI's Batch API gives you 50% off in exchange for accepting up to 24-hour latency. For any work that isn't user-facing — nightly embedding refreshes, classification of yesterday's records, summarization of a corpus — this is free money.

import json
from openai import OpenAI

client = OpenAI()

# Write a JSONL file of requests
with open("batch_input.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps({
            "custom_id": record["id"],
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "messages": [{"role": "user", "content": record["text"]}],
                "max_tokens": 100,
            },
        }) + "\n")

# Upload + create batch
batch_file = client.files.create(
    file=open("batch_input.jsonl", "rb"),
    purpose="batch",
)
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
# Check batch.status periodically; when "completed", download output_file_id.

Anthropic's Message Batches API is the same idea (also 50% off). Gemini also has a batch endpoint.
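
Once the OpenAI batch above completes, the output file is JSONL again, one result per line keyed by your `custom_id`. The sketch below parses that text into successes and failures. It assumes each line has the documented `custom_id`, `response.status_code`, `response.body`, and `error` fields; `parse_batch_output` is a hypothetical helper, and fetching the file contents from `output_file_id` is left to your client code.

```python
import json


def parse_batch_output(jsonl_text: str) -> tuple[dict[str, str], dict[str, str]]:
    """Split batch output lines into {custom_id: content} and {custom_id: error}."""
    results: dict[str, str] = {}
    failures: dict[str, str] = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        row = json.loads(line)
        custom_id = row["custom_id"]
        response = row.get("response") or {}
        if row.get("error") or response.get("status_code") != 200:
            failures[custom_id] = json.dumps(row.get("error") or response)
            continue
        results[custom_id] = response["body"]["choices"][0]["message"]["content"]
    return results, failures


# Two sample lines in the assumed output shape: one success, one failure.
sample = "\n".join([
    json.dumps({
        "custom_id": "rec-1",
        "response": {
            "status_code": 200,
            "body": {"choices": [{"message": {"role": "assistant", "content": "spam"}}]},
        },
        "error": None,
    }),
    json.dumps({
        "custom_id": "rec-2",
        "response": None,
        "error": {"code": "server_error", "message": "try again"},
    }),
])
ok, failed = parse_batch_output(sample)
print(ok, list(failed))
```

Results are not guaranteed to come back in input order, which is why everything is keyed by `custom_id` instead of position.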

When batch makes sense:

  • Nightly cron jobs (cost classification of yesterday's logs)
  • Pre-computing embeddings for a content library
  • A/B testing your prompt variants over a corpus

When it absolutely doesn't:

  • Anything user-facing (chat, search, real-time agents)
  • Sub-minute scheduled jobs (you'll backlog)
  • Workloads where you need partial-result fast iteration

Before/after math (50k classifications/day, run as batch overnight, gpt-4o-mini, 800 in / 50 out):

| Path | Daily cost | Monthly |
| --- | --- | --- |
| Real-time API | $7.50 | $225 |
| Batch API (50% off) | $3.75 | $113 |
| Savings | | $112/mo (50%) |

This is the lowest-risk, highest-ratio change in the whole playbook if you have any non-interactive workload. The code change is moving from "POST and wait" to "queue and pick up tomorrow".


Tactic 10 — Audit, then repeat monthly

The first audit gives you the biggest jump — usually a 40-60% cut. Then the bill creeps back up as new features ship, the team adds calls, models drift, prompts grow. The compounding loop:

  1. First of every month, pull last month's per-prompt cost breakdown.
  2. Identify the top 5 most expensive call patterns (by total $, not per-call $).
  3. For each one, ask the 3 questions: can it run on a cheaper model? can it be cached? can it be batched?
  4. Make one change, ship it, measure for a week.
  5. Next month, do it again.
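
Steps 1 and 2 are a group-by over your request logs. A minimal sketch, assuming each log record carries a `prompt_name` label and a `cost_usd` figure; both field names are hypothetical, so map them to whatever your logging actually stores.

```python
from collections import defaultdict


def top_spenders(records: list[dict], n: int = 5) -> list[tuple[str, float, int]]:
    """Rank call patterns by total spend, returning (name, total_usd, call_count)."""
    totals: dict[str, float] = defaultdict(float)
    counts: dict[str, int] = defaultdict(int)
    for record in records:
        name = record["prompt_name"]
        totals[name] += record["cost_usd"]
        counts[name] += 1
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]
    return [(name, round(total, 4), counts[name]) for name, total in ranked]


# Tiny illustrative log: the cheap-per-call classifier still tops the bill.
log = (
    [{"prompt_name": "classify_ticket", "cost_usd": 0.002} for _ in range(1000)]
    + [{"prompt_name": "weekly_report", "cost_usd": 0.40} for _ in range(3)]
    + [{"prompt_name": "chat_reply", "cost_usd": 0.01} for _ in range(50)]
)
for name, total, calls in top_spenders(log, n=2):
    print(f"{name}: ${total} across {calls} calls")
```

Ranking by the summed column rather than the per-call price is what surfaces high-volume cheap calls like the classifier in this example.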

A $1,200/month bill cut by 50% saves $7,200/year, but only if it stays cut. With 5% monthly creep-back and no follow-up, the bill climbs from $600 back past $1,000 by month 12 and the year's saving shrinks to roughly $4,850. The monthly loop is what locks in the savings.

If you want this audit automated continuously, the tokenwise-audit Claude skill from the Claude Skills bundle runs the analysis on demand from your terminal. Or Tokenwise itself sends you a weekly Insights email with the same breakdown delivered to your inbox — no setup beyond the proxy URL change.


The short version

If you only have one hour, do these four:

  1. Swap one frontier-tier call to mini-tier (Tactic 1) — 15 min, 80%+ savings on that call
  2. Set max_tokens to 2x your real p99 on every call (Tactic 3) — 15 min, protects against runaway calls
  3. Wire exact-match caching on your most-repeated prompt (Tactic 6) — 30 min, 30-50% savings
  4. Schedule the monthly audit (Tactic 10) — 5 min in your calendar, prevents regression

That gets you to ~40% savings on most bills the same afternoon. The remaining 30-50% is the long tail: provider fallback, batching, evals, prompt caching.


If you want all of this automated continuously, that's Tokenwise. 1-line setup, weekly insights email, $19/mo. Or stay here and DIY — both work.

