How to Reduce LLM Costs in TypeScript Applications (2026)

Reduce LLM cost in TypeScript with routing, prompt trimming, caching, batching, and monitoring tactics I use in real apps to cut API spend.

By Theo · Maker of Tokenwise
TypeScript

Key takeaways

  • Route aggressively: use gpt-4o-mini at $0.15 / 1M input, $0.60 / 1M output or gpt-4.1-mini at $0.40 / 1M input, $1.60 / 1M output as defaults, not premium models.
  • A 1,000,000-request workload at 2,000 input and 400 output tokens costs about $7,200/month on gpt-4.1 but only $540/month on gpt-4o-mini.
  • Trim context before the call: cutting input tokens by 35% can save roughly $1,400/month on a gpt-4.1 workload with 2B monthly input tokens.
  • Cache summaries, extraction, labels, embeddings, and other deterministic calls; use model, prompt version, input, and parameters in the cache key.
  • Monitor cost per route, user, prompt version, cache hit, and retry count; invoice-level observability is too late.

If your TypeScript app calls an LLM on every request, your bill is not a mystery. It is arithmetic: input tokens, output tokens, model choice, retries, and how often you send the same context again.

The fastest way to reduce LLM cost in TypeScript is not one trick. I usually cut spend with five moves: route most traffic to cheaper models, trim context before the API call, cache deterministic work, batch background jobs, and monitor cost per route.

My strong default in 2026: start with gpt-4.1-mini, gpt-4o-mini, gemini-2.0-flash, or claude-3-haiku. Escalate only when the request proves it deserves a premium model. That one decision saves more money than weeks of prompt tinkering.

Start with token math, not vibes

I always begin with the boring formula because it catches expensive mistakes fast:

monthly cost = inputTokens / 1,000,000 × inputPrice + outputTokens / 1,000,000 × outputPrice.

Say a Next.js app handles 1,000,000 AI requests per month. Each request sends 2,000 input tokens and receives 400 output tokens. On gpt-4.1 at $2.00 / 1M input, $8.00 / 1M output, that is $4,000 for input plus $3,200 for output: $7,200/month. The same traffic on gpt-4o-mini at $0.15 / 1M input, $0.60 / 1M output is $540/month.
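The arithmetic above is easy to check in code. This is a minimal sketch using the per-million prices quoted in this article; `Workload` and `monthlyCost` are illustrative names, and the price constants should be treated as placeholders because provider pricing changes.

```typescript
// Prices are USD per 1M tokens, copied from the figures quoted in this article.
type Price = { in: number; out: number };

type Workload = {
  requestsPerMonth: number;
  inputTokensPerRequest: number;
  outputTokensPerRequest: number;
};

// monthly cost = inputTokens / 1M × inputPrice + outputTokens / 1M × outputPrice
function monthlyCost(w: Workload, price: Price): number {
  const inputTokens = w.requestsPerMonth * w.inputTokensPerRequest;
  const outputTokens = w.requestsPerMonth * w.outputTokensPerRequest;
  return (inputTokens / 1_000_000) * price.in + (outputTokens / 1_000_000) * price.out;
}

const workload: Workload = {
  requestsPerMonth: 1_000_000,
  inputTokensPerRequest: 2_000,
  outputTokensPerRequest: 400,
};

const onGpt41 = monthlyCost(workload, { in: 2.0, out: 8.0 });      // about 7,200
const onGpt4oMini = monthlyCost(workload, { in: 0.15, out: 0.6 }); // about 540
```

Running the same function over each candidate model is a quick way to see what a default change is worth before touching any prompts.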

That gap is why I do not let product code call “the best model” by default. I make model choice explicit at the API boundary. In TypeScript, that usually means one server-side wrapper used by every route handler, queue worker, cron job, and agent tool.

type LlmTask = 'classify' | 'extract' | 'draft' | 'code' | 'reason';

// price is USD per 1M tokens, e.g. { in: 2.0, out: 8.0 } for gpt-4.1.
function estimateCost(inputTokens: number, outputTokens: number, price: { in: number; out: number }) {
  return inputTokens / 1_000_000 * price.in + outputTokens / 1_000_000 * price.out;
}

Small wrapper. Big savings. I’ve seen teams discover that one forgotten summarization endpoint cost more than their main chat product.

Pick a cheap default and route upward

My default is simple: use the cheapest model that passes your eval, then route the weird cases upward. For TypeScript SaaS apps, I reach for gpt-4.1-mini when I want reliable instruction following, gpt-4o-mini when price matters most, gemini-2.0-flash for very cheap high-throughput work, and claude-3-haiku for fast lightweight Claude calls.

Model | Pricing | Context window | Where I use it
--- | --- | --- | ---
gpt-4o-mini | $0.15 / 1M input, $0.60 / 1M output | 128k tokens | Classification, extraction, rewrite, cheap chat fallback
gpt-4.1-mini | $0.40 / 1M input, $1.60 / 1M output | 1M tokens | Default app model for solid quality without premium pricing
gpt-4.1 | $2.00 / 1M input, $8.00 / 1M output | 1M tokens | Long-context analysis, stronger coding, hard structured outputs
gpt-5.1 | $1.25 / 1M input, $10.00 / 1M output | 400k tokens | Top-tier reasoning and product-critical answers
gpt-5.1-codex-mini | $0.25 / 1M input, $2.00 / 1M output | 400k tokens | Code review, patch generation, TypeScript refactors
o4-mini | $1.10 / 1M input, $4.40 / 1M output | 200k tokens | Cheap reasoning, mathy planning, tool-heavy flows
o3 | $2.00 / 1M input, $8.00 / 1M output | 200k tokens | Reasoning tasks where mini models fail evals
claude-sonnet-4 | $3.00 / 1M input, $15.00 / 1M output | 200k tokens | Writing quality, agent workflows, code comprehension
claude-opus-4 | $15.00 / 1M input, $75.00 / 1M output | 200k tokens | Rare premium analysis. I do not use it as a default.
gemini-2.0-flash | $0.10 / 1M input, $0.40 / 1M output | 1M tokens | Bulk summarization, cheap extraction, document triage
gemini-2.5-flash | $0.30 / 1M input, $2.50 / 1M output | 1M tokens | Fast multimodal and long-context tasks
gemini-2.5-pro | $1.25 / 1M input, $10.00 / 1M output | 1M tokens | Long documents, strong reasoning, multimodal analysis
claude-3-haiku | $0.25 / 1M input, $1.25 / 1M output | 200k tokens | Low-latency summaries and routing decisions

For open models, I like DeepSeek-V3, DeepSeek-R1, Llama 3.3 70B, Mistral Small, and Qwen3-Coder for controlled workloads. I use them when traffic is predictable enough that GPU utilization stays high. Otherwise managed APIs still win on engineering time.

Implement routing in your TypeScript boundary

Routing should live behind one function, not scattered through React Server Components, NestJS services, Hono handlers, and BullMQ workers. I like a tiny policy layer that chooses a model from task type, input length, user plan, and confidence requirement.

type RouteInput = {
  task: 'classify' | 'extract' | 'draft' | 'code' | 'reason';
  inputTokens: number;
  userPlan: 'free' | 'pro' | 'enterprise';
  risk: 'low' | 'medium' | 'high';
};

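// First matching rule wins, so the task-specific rules run before the plan check.
// That means a free-plan user still gets gpt-5.1 for high-risk reasoning.
function chooseModel(x: RouteInput): string {
  if (x.task === 'classify') return 'gpt-4o-mini';
  // Very long extraction inputs go to the cheapest long-context model.
  if (x.task === 'extract' && x.inputTokens > 100_000) return 'gemini-2.0-flash';
  if (x.task === 'code') return 'gpt-5.1-codex-mini';
  if (x.task === 'reason' && x.risk === 'high') return 'gpt-5.1';
  if (x.userPlan === 'free') return 'gpt-4o-mini';
  // Default for everything else.
  return 'gpt-4.1-mini';
}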

In Next.js, call this in your Route Handler before invoking the provider SDK. In NestJS, put it in an injectable LlmService. In Hono or Fastify, keep it in the request context and log the selected model with the route name.

The key is escalation. Run the cheap model first for tasks that can be verified: JSON extraction with a schema, classification with allowed labels, retrieval answers with citations. If validation fails, retry on gpt-4.1, o3, or claude-sonnet-4. Paying premium prices for 5% of traffic is fine. Paying them for 100% is lazy.
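Here is one way the cheap-first, escalate-on-failure loop could look. It is a sketch under assumptions: `CallModel` is a hypothetical stand-in for whatever provider SDK call you use, and `callWithEscalation` and `parseLabel` are names I made up for illustration.

```typescript
// Hypothetical provider call: swap in your SDK of choice.
type CallModel = (model: string, prompt: string) => Promise<string>;

type Attempt<T> = { model: string; value: T };

// Try each model in order (cheapest first) and return the first output
// that passes validation. `validate` returns null when the output is unusable.
// Errors thrown by callModel (network, rate limits) are not caught here.
async function callWithEscalation<T>(
  callModel: CallModel,
  models: string[],
  prompt: string,
  validate: (raw: string) => T | null,
): Promise<Attempt<T>> {
  for (const model of models) {
    const raw = await callModel(model, prompt);
    const value = validate(raw);
    if (value !== null) return { model, value };
  }
  throw new Error(`No model produced valid output: ${models.join(', ')}`);
}

// Example validator: classification restricted to allowed labels.
const LABELS = ['bug', 'billing', 'feature', 'other'] as const;
type Label = (typeof LABELS)[number];

function parseLabel(raw: string): Label | null {
  const candidate = raw.trim().toLowerCase();
  return (LABELS as readonly string[]).includes(candidate) ? (candidate as Label) : null;
}
```

A call such as `callWithEscalation(callModel, ['gpt-4o-mini', 'gpt-4.1'], prompt, parseLabel)` returns the model that finally answered, which is worth logging so you can see what share of traffic actually escalates.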

Trim prompts and context before the call

Most LLM waste I see is self-inflicted: giant system prompts, duplicated chat history, full documents pasted into every turn, and tool schemas that read like legal contracts. Context windows are bigger now, but large context is still billed. A 1M-token window is not an invitation to dump your database into the prompt. Ask me how I learned that one.

In TypeScript, I use four trimming rules:

  • Cap history by tokens, not messages. Keep the last useful turns plus a running summary.
  • Retrieve fewer chunks. For RAG, start with topK 4–8. Do not send 30 chunks because search felt uncertain.
  • Strip markup. Remove nav, footers, repeated boilerplate, hidden text, base64 blobs, and tracking junk before summarizing pages.
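  • Compress schemas. Tool descriptions and JSON schema descriptions count as input tokens. Keep them short.

// Assumes ChatMessage (with a string `content`) and estimateTokens(text)
// are defined elsewhere in your codebase.
function trimMessages(messages: ChatMessage[], maxTokens: number) {
  const kept: ChatMessage[] = [];
  let total = 0;
  // Walk from newest to oldest so the most recent turns survive.
  for (const msg of [...messages].reverse()) {
    const tokens = estimateTokens(msg.content);
    if (total + tokens > maxTokens) break; // budget exhausted: drop this and everything older
    kept.unshift(msg); // restore chronological order
    total += tokens;
  }
  return kept;
}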
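The token-cap rule needs a message type and a token estimator, and a plain newest-first trim can drop the system prompt. The sketch below fills those gaps under stated assumptions: the `ChatMessage` shape is hypothetical, the 4-characters-per-token estimate is a rough heuristic for English text rather than a real tokenizer, and `trimHistory` is an illustrative variant, not a library function.

```typescript
type ChatMessage = { role: 'system' | 'user' | 'assistant'; content: string };

// Rough heuristic (about 4 characters per token for English text).
// An assumption for illustration; use your provider's tokenizer for exact counts.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keep every system message, then add the most recent turns that still fit.
// System messages are moved to the front of the result.
function trimHistory(messages: ChatMessage[], maxTokens: number): ChatMessage[] {
  const system = messages.filter((m) => m.role === 'system');
  const turns = messages.filter((m) => m.role !== 'system');

  let total = system.reduce((sum, m) => sum + estimateTokens(m.content), 0);
  const kept: ChatMessage[] = [];

  // Walk backwards so the newest turns win the budget.
  for (const msg of [...turns].reverse()) {
    const tokens = estimateTokens(msg.content);
    if (total + tokens > maxTokens) break;
    kept.unshift(msg);
    total += tokens;
  }
  return [...system, ...kept];
}
```

If you also keep a running summary, one option is to store it as a system message so it is pinned the same way.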

For Vercel AI SDK apps, do this before streamText or generateObject. For LangChain.js, trim before building the final prompt, not after retrieval. A 35% input-token cut on the earlier example saves $1,400/month on gpt-4.1 without touching product behavior.
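For the "strip markup" rule, a few regular expressions go a long way before a page reaches the model. This is a rough sketch, not a real HTML parser: `stripMarkup` is an illustrative name, it does not detect CSS-hidden text, and nested elements of the same type can confuse it. For messy pages a proper parser is the safer choice.

```typescript
// Regex-based cleanup: a rough sketch, not a full HTML parser.
function stripMarkup(html: string): string {
  return html
    // Drop whole elements that rarely carry content worth summarizing.
    .replace(/<(script|style|nav|footer|noscript)\b[^>]*>[\s\S]*?<\/\1>/gi, ' ')
    // Drop inline base64 payloads (data URIs).
    .replace(/data:[\w.+-]+\/[\w.+-]+;base64,[A-Za-z0-9+/=]+/g, ' ')
    // Drop HTML comments, then any remaining tags.
    .replace(/<!--[\s\S]*?-->/g, ' ')
    .replace(/<[^>]+>/g, ' ')
    // Collapse the whitespace left behind.
    .replace(/\s+/g, ' ')
    .trim();
}
```

Comparing the estimated token count before and after this step on a sample of real pages is a cheap way to see how much of the input was never content.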

Cache deterministic and repeated work

Caching is the highest-ROI cost tactic after model routing. I cache anything where the same input should produce the same answer: summaries, embeddings, moderation labels, entity extraction, SQL explanation, code review of a specific commit, and support macro drafts. I do not cache open-ended chat turns unless the product can tolerate repeated wording.

Use a stable hash of the normalized input, model, prompt version, tool schema version, and decoding parameters. If you change the system prompt, the cache key must change too.

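import { createHash } from 'node:crypto';

// Hash every input that can change the answer. JSON.stringify is only stable
// when keys are always written in the same order, as in the literal below.
function llmCacheKey(parts: unknown) {
  return createHash('sha256')
    .update(JSON.stringify(parts))
    .digest('hex');
}

// normalize() and text are assumed to exist in your code
// (for example: trim, collapse whitespace, lowercase where safe).
const key = llmCacheKey({
  model: 'gpt-4.1-mini',
  promptVersion: 'extract-v7', // bump whenever the system prompt changes
  temperature: 0,
  input: normalize(text)
});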

In Next.js, you can combine route-level caching for public AI-generated content with Redis or Upstash for LLM response caching. In NestJS, wrap the LLM service with CacheInterceptor for safe tasks, then use Redis directly for larger payloads. In queue workers, cache before enqueueing follow-up jobs so retries do not double-spend.
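A get-or-compute wrapper around the key is the other half of the pattern. The sketch below makes several assumptions: `KvStore` is a hypothetical minimal interface, the in-memory `Map` only stands in for Redis or Upstash, and `stableStringify` is my own helper for making key order irrelevant. It also ignores TTLs and concurrent misses, both of which matter in production.

```typescript
// Serialize with sorted object keys so { a, b } and { b, a } produce the same string.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map((v) => stableStringify(v)).join(',')}]`;
  if (value !== null && typeof value === 'object') {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${stableStringify(obj[k])}`)
      .join(',');
    return `{${body}}`;
  }
  const s = JSON.stringify(value);
  return s === undefined ? 'null' : s; // undefined and functions have no JSON form
}

// Minimal async key-value interface; an in-memory Map stands in for Redis here.
interface KvStore {
  get(key: string): Promise<string | undefined>;
  set(key: string, value: string): Promise<void>;
}

function memoryStore(): KvStore {
  const map = new Map<string, string>();
  return {
    get: async (key) => map.get(key),
    set: async (key, value) => {
      map.set(key, value);
    },
  };
}

// Return the cached response when present; otherwise call the model once and store it.
async function cachedCall(
  store: KvStore,
  keyParts: unknown,
  compute: () => Promise<string>,
): Promise<{ value: string; cacheHit: boolean }> {
  // In production you would hash this string (e.g. SHA-256) before using it as a key.
  const key = stableStringify(keyParts);
  const hit = await store.get(key);
  if (hit !== undefined) return { value: hit, cacheHit: true };
  const value = await compute();
  await store.set(key, value);
  return { value, cacheHit: false };
}
```

Returning `cacheHit` alongside the value makes it easy to log hit rate per route, and to notice when a prompt-version bump has quietly emptied the cache.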

Realistic numbers: at the 2,000-in / 400-out shape, 1,000,000 monthly requests cost about $1,440 on gpt-4.1-mini. If 20% of those requests are repeated summaries and every repeat is served from cache, the bill drops to about $1,152, a saving of roughly $288/month. Not glamorous. Very real.
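The cache arithmetic is worth writing down, because the saving scales with the share of repeated traffic rather than with the raw hit count. This sketch assumes every repeated request is a cache hit that costs nothing on the LLM side, and it ignores the first uncached call for each unique input; `billWithCache` is an illustrative name.

```typescript
// Bill after caching, assuming repeated requests are served from cache at no LLM cost.
function billWithCache(uncachedBill: number, repeatedShare: number): number {
  if (repeatedShare < 0 || repeatedShare > 1) throw new Error('repeatedShare must be between 0 and 1');
  return uncachedBill * (1 - repeatedShare);
}

// 1,000,000 requests at 2,000 input / 400 output tokens on gpt-4.1-mini
// ($0.40 / 1M input, $1.60 / 1M output, as quoted in this article).
const requests = 1_000_000;
const inputCost = ((requests * 2_000) / 1_000_000) * 0.4;  // about 800
const outputCost = ((requests * 400) / 1_000_000) * 1.6;   // about 640
const fullBill = inputCost + outputCost;                   // about 1,440
const cachedBill = billWithCache(fullBill, 0.2);           // about 1,152
const monthlySaving = fullBill - cachedBill;               // about 288
```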

Batch background work and constrain outputs

Interactive chat and background processing need different cost strategies. User-facing routes need latency. Backfills, nightly summarization, enrichment, evaluation, and re-indexing need throughput and price discipline.

For background work, I batch by task type and model. Use BullMQ, Temporal, Cloud Tasks, or a plain Postgres job table; the tool matters less than the rule: do not send one tiny request when you can pack twenty homogeneous items into a structured batch. Ask for an array of results and validate it with Zod.

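import { z } from 'zod';

// One result per input item; `id` ties each result back to the item that was sent.
const BatchResult = z.array(z.object({
  id: z.string(),
  label: z.enum(['bug', 'billing', 'feature', 'other']),
  confidence: z.number().min(0).max(1)
}));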

Batching cuts repeated system prompts and tool schemas. If your system prompt is 700 tokens and you classify 100,000 tickets one by one, that is 70M prompt tokens before user content. Batch 20 tickets per call and you cut that overhead to 3.5M tokens. On gemini-2.0-flash at $0.10 / 1M input, the overhead drops from about $7 to $0.35, so the dollar number is small. On claude-sonnet-4 at $3.00 / 1M input, it drops from about $210 to $10.50, and the unbatched version hurts.
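The batching overhead math and the chunking itself fit in a few lines. This is a sketch with illustrative names (`chunk`, `systemPromptOverhead`, `missingIds`); the last helper reflects one design choice I would suggest, which is checking that every ID you sent came back, since models sometimes skip items in a long batch.

```typescript
// Split items into fixed-size batches (the last one may be smaller).
function chunk<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) throw new Error('size must be a positive integer');
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Tokens spent re-sending the same system prompt, once per call.
function systemPromptOverhead(itemCount: number, batchSize: number, systemPromptTokens: number): number {
  return Math.ceil(itemCount / batchSize) * systemPromptTokens;
}

// IDs that were sent in a batch but did not come back in the model's result array.
function missingIds(sent: { id: string }[], returned: { id: string }[]): string[] {
  const seen = new Set(returned.map((r) => r.id));
  return sent.filter((s) => !seen.has(s.id)).map((s) => s.id);
}

const oneByOne = systemPromptOverhead(100_000, 1, 700); // 70,000,000 tokens
const batched = systemPromptOverhead(100_000, 20, 700); // 3,500,000 tokens
```

Items reported by `missingIds` can be re-queued in a smaller batch instead of retrying the whole call.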

Also constrain output. Use JSON schemas, max output tokens, short answer modes, and stop sequences. Output tokens usually cost more than input tokens: gpt-5.1 charges $1.25 / 1M input but $10.00 / 1M output, an 8× gap. Rambling is a billing bug.

Monitor cost per route, user, and feature

You cannot reduce what you only see at invoice time. I instrument every LLM call with model, route, user or workspace ID, prompt version, input tokens, output tokens, cache hit, latency, retry count, and estimated cost. That sounds like a lot until you get your first surprise bill; then it sounds minimal.
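A thin wrapper can capture those fields without tying you to a vendor. The sketch below assumes your provider call can be adapted to return token usage; `withCostLogging`, `CallMeta`, `CallOutcome`, and the `sink` callback are hypothetical names. Calls that throw are not recorded here, so add a `try`/`finally` if you want failures logged too.

```typescript
type Usage = { inputTokens: number; outputTokens: number };
type Price = { in: number; out: number }; // USD per 1M tokens

type CallMeta = {
  model: string;
  route: string;
  workspaceId: string;
  promptVersion: string;
  price: Price;
};

type LlmCallRecord = Omit<CallMeta, 'price'> & Usage & {
  cacheHit: boolean;
  retryCount: number;
  latencyMs: number;
  estimatedCostUsd: number;
};

// What the wrapped call must return; report zero usage for cache hits.
type CallOutcome<T> = { result: T; usage: Usage; cacheHit?: boolean; retryCount?: number };

// Run the call, then emit exactly one record describing what it cost.
async function withCostLogging<T>(
  meta: CallMeta,
  call: () => Promise<CallOutcome<T>>,
  sink: (record: LlmCallRecord) => void,
): Promise<T> {
  const started = Date.now();
  const outcome = await call();
  const { price, ...labels } = meta;
  const estimatedCostUsd =
    (outcome.usage.inputTokens / 1_000_000) * price.in +
    (outcome.usage.outputTokens / 1_000_000) * price.out;
  sink({
    ...labels,
    ...outcome.usage,
    cacheHit: outcome.cacheHit ?? false,
    retryCount: outcome.retryCount ?? 0,
    latencyMs: Date.now() - started,
    estimatedCostUsd,
  });
  return outcome.result;
}
```

The `sink` can be anything: a Postgres insert, a structured log line, or a function that copies the record onto a tracing span.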

Metric | Why I track it | Action when it spikes
--- | --- | ---
Cost per route | Find the endpoint burning money | Route to cheaper model, cap context, add cache
Input tokens per request | Catch context bloat and RAG over-retrieval | Trim history, lower topK, strip HTML
Output tokens per request | Detect verbose prompts and missing limits | Set max output tokens, require concise JSON
Cache hit rate | Prove caching is actually working | Normalize inputs, fix prompt-version churn
Retry rate | Retries silently multiply spend | Fix schema prompts, timeouts, provider fallback logic
Cost per workspace | Stop one customer from subsidizing another | Add quotas, rate limits, or plan-based routing
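Once per-call records exist, the first two cost rows in the table are a group-by away. A minimal sketch, assuming records shaped like the hypothetical `CostRecord` below (in practice this is often a SQL `GROUP BY` over the same columns); `costBy` and `ranked` are illustrative names.

```typescript
type CostRecord = { route: string; workspaceId: string; estimatedCostUsd: number };

// Sum estimated cost per group; keyOf picks the dimension (route, workspace, ...).
function costBy(records: CostRecord[], keyOf: (r: CostRecord) => string): Map<string, number> {
  const totals = new Map<string, number>();
  for (const r of records) {
    const key = keyOf(r);
    totals.set(key, (totals.get(key) ?? 0) + r.estimatedCostUsd);
  }
  return totals;
}

// Groups sorted from most to least expensive.
function ranked(totals: Map<string, number>): Array<[string, number]> {
  return Array.from(totals.entries()).sort((a, b) => b[1] - a[1]);
}

const sample: CostRecord[] = [
  { route: '/api/chat', workspaceId: 'ws_a', estimatedCostUsd: 0.5 },
  { route: '/api/summarize', workspaceId: 'ws_a', estimatedCostUsd: 2 },
  { route: '/api/summarize', workspaceId: 'ws_b', estimatedCostUsd: 1.25 },
];

const byRoute = ranked(costBy(sample, (r) => r.route));
const byWorkspace = ranked(costBy(sample, (r) => r.workspaceId));
```

With the sample data, `/api/summarize` ranks first by route and `ws_a` ranks first by workspace, which is the kind of view that tells you where to add a cache or a quota.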

In OpenTelemetry, I attach these fields as span attributes around the provider call. In a Next.js app, wrap the SDK function. In NestJS, put it in an interceptor. In workers, log the job ID and batch ID too.

This is also where Tokenwise fits in my own stack: I built it because I wanted LLM cost traces by feature and prompt version, not a monthly blob from a provider dashboard. If you do not use a tool, at least write the numbers to Postgres. Anything beats guessing.

Verdict

If I had to cut a TypeScript app’s LLM bill this week, I would not start with exotic prompt tricks. I would ship a central LLM wrapper, make gpt-4.1-mini or gpt-4o-mini the default, route hard cases to gpt-5.1, o3, or claude-sonnet-4, trim context before every call, and cache deterministic work.

The winning pattern is boring and repeatable: cheap default, measured escalation, smaller prompts, fewer duplicate calls, bounded outputs, and cost per feature in your logs. Do that and you can cut 50–80% from many LLM bills without making the product feel worse. That is the kind of optimization I like: visible on the invoice, invisible to the user.

Frequently asked questions

How do I reduce LLM cost in TypeScript quickly?

Start by centralizing every LLM call behind one TypeScript service, then add model routing, token logging, max output limits, and response caching. The biggest immediate win is usually switching default traffic from premium models to gpt-4o-mini, gpt-4.1-mini, gemini-2.0-flash, or claude-3-haiku.

Which model should I use as the default for a TypeScript app?

My default pick is gpt-4.1-mini for general product features because it is reliable and still cheap at $0.40 / 1M input, $1.60 / 1M output. If price matters more than quality, I use gpt-4o-mini at $0.15 / 1M input, $0.60 / 1M output or gemini-2.0-flash at $0.10 / 1M input, $0.40 / 1M output.

Should I cache LLM API responses?

Yes, for deterministic tasks. Cache summaries, extracted entities, classifications, embeddings, moderation decisions, and generated metadata. Do not blindly cache personal chat unless repeated answers are acceptable. Include the model, prompt version, input hash, temperature, and schema version in the cache key.

How many tokens should I send to an LLM?

Send the smallest context that passes your eval. For chat, keep recent turns plus a summary. For RAG, start with 4–8 chunks, not 30. For tools, shorten descriptions. Large context windows like 1M tokens are useful for hard cases, but they are still billable input.

Is batching worth it for LLM calls?

Yes, especially for background jobs. Batching reduces repeated system prompts, repeated tool schemas, network overhead, and queue churn. It works best for homogeneous tasks like ticket classification, document summarization, metadata extraction, and eval runs.

How do I monitor LLM cost in Next.js or NestJS?

Wrap the provider SDK call and log model, route, workspace ID, prompt version, input tokens, output tokens, cache hit, retries, latency, and estimated cost. In Next.js, do this in your Route Handler or server action wrapper. In NestJS, put it in an injectable LLM service or interceptor.
