Methodology

Where the ~30% actually comes from

“Cut 30% without losing quality” is the kind of claim you’ve learned to discount — so here’s the honest breakdown of every lever, and how each one is verified against your own traffic before it counts.

The 30% is a blend, not a single trick, and it varies by workload. It comes from five levers, each measured separately. None of them quietly downgrades your output: every savings figure on your dashboard is something Tokenwise can show its work for.

01

Model right-sizing

Most apps default to the newest, most expensive model for every task. The price spread between frontier and mid-tier models is enormous, and a lot of traffic (classification, extraction, short summaries) doesn't need the top model. Tokenwise finds the calls where a cheaper model holds quality and recommends the swap.

How it’s verified

Before a swap is suggested, it's scored with an LLM-as-judge regression check. You apply it as an A/B traffic split — not a hard cutover — and the realized savings are measured on live traffic before they're counted.
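To make the split and the gate concrete, here is a minimal sketch, not Tokenwise's actual implementation. The function names, the 2-point tolerance, and the idea of hashing a request ID are all assumptions made for illustration; the judge scores are taken as given numbers between 0 and 1.

```python
import hashlib


def in_candidate_arm(request_id: str, candidate_share: float) -> bool:
    """Deterministically route a share of requests to the cheaper candidate model.

    Hypothetical helper: hashing the request ID keeps the same request in the
    same arm on every call, so the split is stable and repeatable.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < candidate_share * 2**64


def passes_regression_gate(baseline_scores, candidate_scores, max_drop=0.02):
    """True if the candidate's mean judge score is within max_drop of baseline.

    max_drop is an assumed tolerance, not a documented Tokenwise setting.
    """
    baseline_mean = sum(baseline_scores) / len(baseline_scores)
    candidate_mean = sum(candidate_scores) / len(candidate_scores)
    return baseline_mean - candidate_mean <= max_drop
```

With a `candidate_share` of 0.1, roughly one request in ten goes to the candidate model, and the rest of your traffic is untouched until the realized savings are measured.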

02

Semantic caching at the edge

Repeated or near-identical prompts (FAQs, retries, common queries) don't need to hit the provider twice. Tokenwise caches responses at the Cloudflare edge keyed on meaning, so a cache hit skips the provider call and costs you nothing, where a miss would cost a full generation.

How it’s verified

Cache hits are tracked explicitly and excluded from your cost numbers — they show up as savings, not spend. You see hit rate and dollars saved per workspace.
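"Keyed on meaning" usually means comparing prompt embeddings rather than exact strings. The sketch below shows one common way to do that, with explicit hit tracking. It is an illustration under assumptions: the `SemanticCache` class, the 0.95 similarity threshold, and the linear scan are hypothetical, and the embeddings are passed in as plain lists of floats rather than produced by any particular model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Toy in-memory cache that matches prompts by embedding similarity."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold  # assumed cutoff for "near-identical"
        self.entries = []           # list of (embedding, response) pairs
        self.hits = 0
        self.misses = 0

    def lookup(self, embedding):
        """Return the closest cached response at or above the threshold, else None."""
        best_score, best_response = 0.0, None
        for stored, response in self.entries:
            score = cosine_similarity(embedding, stored)
            if score > best_score:
                best_score, best_response = score, response
        if best_response is not None and best_score >= self.threshold:
            self.hits += 1
            return best_response
        self.misses += 1
        return None

    def store(self, embedding, response):
        self.entries.append((embedding, response))

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Counting hits and misses at lookup time is what makes a per-workspace hit rate possible; a production cache would swap the linear scan for a vector index.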

03

Prompt caching

Providers like Anthropic and OpenAI bill cached input tokens at a fraction of the normal rate. If your prompts share a large stable prefix (system prompts, few-shot examples, RAG context), structuring them for the provider's cache cuts input cost sharply.

How it’s verified

Tokenwise measures your real cache-read vs fresh-input token split and the dollars that cache reads saved — computed from the actual usage your provider reports.
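The arithmetic behind that figure is simple enough to sketch. The example below assumes your provider's usage report splits input tokens into fresh and cache-read counts; the `Usage` type, the function names, and the per-million-token prices are placeholders, not real provider rates. It also ignores any surcharge a provider may apply when writing to the cache.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Input-token counts for a period (hypothetical shape)."""
    fresh_input_tokens: int
    cache_read_tokens: int


def cache_read_savings(usage, input_price_per_mtok, cached_price_per_mtok):
    """Dollars saved because cache-read tokens were billed at the cached rate."""
    per_token_discount = (input_price_per_mtok - cached_price_per_mtok) / 1_000_000
    return usage.cache_read_tokens * per_token_discount


def cache_read_share(usage):
    """Fraction of input tokens that were served from the provider's cache."""
    total = usage.fresh_input_tokens + usage.cache_read_tokens
    return usage.cache_read_tokens / total if total else 0.0


# Placeholder numbers: 800k of 1M input tokens read from cache,
# at an assumed $3.00 normal vs $0.30 cached price per million tokens.
example = Usage(fresh_input_tokens=200_000, cache_read_tokens=800_000)
saved = cache_read_savings(example, 3.00, 0.30)
```

In this made-up case the cache-read share is 80% and the saving is 800,000 × $2.70 / 1,000,000 = $2.16.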

04

Fallback efficiency & reliability

Failed and retried requests cost money and latency. Same-shape fallback chains route a failed call to a healthy provider instead of retrying blindly, so you pay for one successful completion rather than several failures plus a retry.

How it’s verified

Retries saved and fallback engagements are tracked as reliability stats, so the efficiency gain is observable, not assumed.
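A fallback chain can be pictured as an ordered list of providers that all accept the same request shape. The sketch below is an assumption-laden illustration: `ProviderError`, `complete_with_fallback`, and the provider callables are hypothetical names, and each provider is tried exactly once rather than retried.

```python
class ProviderError(Exception):
    """Raised by a provider call that failed (hypothetical error type)."""


def complete_with_fallback(prompt, providers):
    """Try each (name, call) pair once, in order.

    Returns (provider_name, response, attempts). A failed provider is never
    retried; the request moves straight to the next one in the chain.
    """
    errors = []
    for attempts, (name, call) in enumerate(providers, start=1):
        try:
            return name, call(prompt), attempts
        except ProviderError as exc:
            errors.append(f"{name}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(errors))
```

The returned attempt count is the kind of number that can feed a "fallback engagements" stat: anything above 1 means the chain was used.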

05

Applied recommendations

On top of the automatic levers, Tokenwise surfaces specific recommendations — enable caching on this tag, switch this route's model, add a fallback here — that you can apply or dismiss.

How it’s verified

Applied recommendations are tracked with a verified actualMonthlySavings figure, validated daily against real traffic. Only savings that actually materialize show up in your savings total.
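One way to picture how the verified total is assembled: only applied recommendations count, and only those with a measured `actualMonthlySavings` land in the verified bucket. In the sketch below, `actualMonthlySavings` comes from the text above; the `status` and `estimatedMonthlySavings` fields and the function itself are assumptions made for illustration.

```python
def savings_totals(recommendations):
    """Split applied recommendations into verified and estimated dollars.

    Each recommendation is a dict. A missing or None actualMonthlySavings
    means the daily check has not yet confirmed it on real traffic.
    """
    verified = 0.0
    estimated = 0.0
    for rec in recommendations:
        if rec.get("status") != "applied":
            continue  # dismissed or pending recommendations never count
        actual = rec.get("actualMonthlySavings")
        if actual is not None:
            verified += actual
        else:
            estimated += rec.get("estimatedMonthlySavings", 0.0)
    return {"verified": verified, "estimated": estimated}
```

Under these assumptions, a recommendation that was projected to save money but measured at zero contributes zero to the verified figure.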

Verified vs estimated

Your savings total separates what’s been verified on real traffic from what’s an estimate. A daily job re-checks applied recommendations so the verified number reflects money you actually kept, not a projection. How we handle your prompts and keys is on the security page.

Frequently asked

Is the 30% just 'use a cheaper model'?

No. Model right-sizing is one lever; on its own, swapping models typically saves around a quarter of the cost of the affected traffic. The rest comes from caching (edge + prompt), fallback efficiency, and applied recommendations. The headline figure is a blend across all of them, and it varies by workload: some apps save more, some less.

How do you prove quality didn't drop?

Every model-switch recommendation is gated by an LLM-as-judge regression check before it's suggested, and you apply it via an A/B split rather than a hard cutover. You can roll back any rule in one click. We optimize cost-per-quality, not cost alone.

Are cache hits counted as savings or hidden from cost?

Cache hits are excluded from your cost figures by default — they're savings, not spend — and surfaced separately so you can see hit rate and dollars saved. Latency metrics also exclude cache hits so their near-instant responses don't distort your percentiles.
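To see why cache hits would distort percentiles, consider a workload where half the requests are near-instant hits. The sketch below uses a nearest-rank percentile and made-up latencies; the request shape (`latency_ms`, `cache_hit`) and function names are assumptions, not Tokenwise's schema.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list; pct in (0, 100]."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def provider_latency_percentile(requests, pct):
    """Percentile over requests that actually reached a provider."""
    latencies = [r["latency_ms"] for r in requests if not r["cache_hit"]]
    return percentile(latencies, pct)


# Made-up traffic: five cache hits at 5 ms, five provider calls at 400-800 ms.
requests = (
    [{"latency_ms": 5, "cache_hit": True} for _ in range(5)]
    + [{"latency_ms": ms, "cache_hit": False} for ms in (400, 500, 600, 700, 800)]
)
```

With this sample, the median over all ten requests is 5 ms, while the median over provider calls alone is 600 ms, which is the number that reflects what an uncached request really experiences.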

Does the savings number include things that didn't happen yet?

The savings total separates verified savings (measured on real traffic) from estimated run-rate. Applied recommendations are re-checked daily by a verify-savings job, so the verified figure reflects money you actually kept.

See your own number

Plug in one line and Tokenwise measures these levers on your actual traffic — then emails you the ones worth pulling.