Methodology
Where the ~30% actually comes from
“Cut 30% without losing quality” is the kind of claim you’ve learned to discount — so here’s the honest breakdown of every lever, and how each one is verified against your own traffic before it counts.
The 30% is a blend, not a single trick, and it varies by workload. It comes from five levers, each measured separately. None of them quietly downgrades your output: every savings figure on your dashboard is something Tokenwise can show its work for.
Model right-sizing
Most apps default to the newest, most expensive model for every task. The price spread between frontier and mid-tier models is enormous, and a lot of traffic (classification, extraction, short summaries) doesn't need the top model. Tokenwise finds the calls where a cheaper model holds quality and recommends the swap.
How it’s verified
Before a swap is suggested, it's scored with an LLM-as-judge regression check. You apply it as an A/B traffic split — not a hard cutover — and the realized savings are measured on live traffic before they're counted.
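As a rough illustration of the gate and the split described above, here is a minimal Python sketch. It is not Tokenwise's implementation: the function names, the 0.02 tolerance, and the hash-based bucketing are all assumptions chosen for the example. It shows a swap being gated on mean judge scores, a deterministic A/B assignment per request, and realized savings counted only for calls that actually went to the cheaper model.

```python
import hashlib


def passes_regression_gate(baseline_scores, candidate_scores, max_drop=0.02):
    """Return True if the candidate's mean judge score is within max_drop
    of the baseline's. max_drop is a hypothetical tolerance."""
    baseline_mean = sum(baseline_scores) / len(baseline_scores)
    candidate_mean = sum(candidate_scores) / len(candidate_scores)
    return baseline_mean - candidate_mean <= max_drop


def ab_bucket(request_id, candidate_share):
    """Deterministically assign a request to 'candidate' or 'baseline'.

    Hashing the request id means the same request always lands in the
    same arm, and roughly candidate_share of traffic goes to the candidate.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    fraction = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return "candidate" if fraction < candidate_share else "baseline"


def realized_savings(baseline_cost_per_call, candidate_cost_per_call, candidate_calls):
    """Dollars saved on the calls that were actually served by the candidate."""
    return (baseline_cost_per_call - candidate_cost_per_call) * candidate_calls
```

Starting with a small `candidate_share` (say 0.1) keeps most traffic on the original model while the live comparison accumulates.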
Semantic caching at the edge
Repeated or near-identical prompts (FAQs, retries, common queries) don't need to hit the provider twice. Tokenwise caches responses at the Cloudflare edge, keyed on meaning rather than exact text, so a cache hit costs you nothing where a miss would cost a full generation.
How it’s verified
Cache hits are tracked explicitly and excluded from your cost numbers — they show up as savings, not spend. You see hit rate and dollars saved per workspace.
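To make "keyed on meaning" and hit-rate tracking concrete, the following Python sketch shows one common way a semantic cache can work: compare an embedding of the incoming prompt against stored ones and return a stored response when similarity clears a threshold. Everything here is an assumption for illustration, including the `SemanticCache` class, the 0.9 threshold, and `toy_embed`, which is a bag-of-words stand-in for a real embedding model. How Tokenwise actually keys its edge cache is not described on this page.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in for a real embedding model: lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, threshold=0.9):
        self.threshold = threshold  # hypothetical similarity cutoff
        self.entries = []           # list of (embedding, response)
        self.hits = 0
        self.misses = 0

    def get(self, prompt):
        """Return a cached response for a similar-enough prompt, else None."""
        query = toy_embed(prompt)
        for embedding, response in self.entries:
            if cosine(query, embedding) >= self.threshold:
                self.hits += 1
                return response
        self.misses += 1
        return None

    def put(self, prompt, response):
        self.entries.append((toy_embed(prompt), response))

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

A linear scan is fine for a sketch; a production cache would use an approximate nearest-neighbor index and a real embedding model, and the threshold would need tuning so that distinct questions are not served the same answer.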
Prompt caching
Providers like Anthropic and OpenAI bill cached input tokens at a fraction of the normal rate. If your prompts share a large stable prefix (system prompts, few-shot examples, RAG context), structuring them for the provider's cache cuts input cost sharply.
How it’s verified
Tokenwise measures your real cache-read vs fresh-input token split and the dollars that cache reads saved — computed from the actual usage your provider reports.
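The arithmetic behind that figure can be sketched in a few lines of Python. The function name and the per-million-token prices in the usage note are placeholders, not any provider's actual rates, and the sketch ignores any surcharge a provider may apply when a prefix is first written to its cache.

```python
def prompt_cache_savings(cache_read_tokens, fresh_input_tokens,
                         input_price_per_mtok, cached_price_per_mtok):
    """Estimate dollars saved by cache reads over a billing window.

    Prices are dollars per million tokens. Each cache-read token is
    assumed to have cost cached_price instead of the full input_price.
    """
    total_input = cache_read_tokens + fresh_input_tokens
    share = cache_read_tokens / total_input if total_input else 0.0
    saved = (cache_read_tokens / 1_000_000) * (input_price_per_mtok - cached_price_per_mtok)
    return {"cache_read_share": share, "dollars_saved": saved}
```

For example, with hypothetical prices of 3.00 and 0.30 dollars per million tokens, 800,000 cache-read tokens and 200,000 fresh input tokens give a cache-read share of 0.8 and about 2.16 dollars saved.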
Fallback efficiency & reliability
Failed and retried requests cost money and latency. Same-shape fallback chains route a failed call to a healthy provider instead of retrying blindly, so you pay for one successful completion rather than several failures plus a retry.
How it’s verified
Retries saved and fallback engagements are tracked as reliability stats, so the efficiency gain is observable, not assumed.
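A fallback chain of this kind can be sketched as follows. This is a minimal Python illustration under stated assumptions: each provider is modeled as a plain callable that takes the same prompt argument, and the names `call_with_fallback` and `AllProvidersFailed` are hypothetical, not part of any Tokenwise API.

```python
class AllProvidersFailed(Exception):
    """Raised when every provider in the chain has failed."""


def call_with_fallback(prompt, providers):
    """Try each (name, callable) in order and return the first success.

    Returns (provider_name, response, failures_before_success) so that
    fallback engagements can be counted as a reliability stat.
    """
    errors = []
    for name, call in providers:
        try:
            response = call(prompt)
        except Exception as exc:  # a real client would catch narrower errors
            errors.append((name, exc))
            continue
        return name, response, len(errors)
    raise AllProvidersFailed(errors)
```

Because each provider is tried at most once, a single outage costs one failed attempt rather than a series of blind retries against the same endpoint.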
Applied recommendations
On top of the automatic levers, Tokenwise surfaces specific recommendations — enable caching on this tag, switch this route's model, add a fallback here — that you can apply or dismiss.
How it’s verified
Applied recommendations are tracked with a verified actualMonthlySavings figure, validated daily against real traffic. Only savings that actually materialize show up in your savings total.
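One way to picture how a savings total can keep verified money apart from projections is the short Python sketch below. Only the `actualMonthlySavings` field name comes from this page; the `applied` and `estimatedMonthlySavings` keys, the record shape, and the function itself are assumptions for illustration.

```python
def savings_totals(recommendations):
    """Split applied recommendations into verified and estimated dollars.

    A recommendation counts as verified only once it carries a measured
    actualMonthlySavings value; until then its estimate stays in a
    separate bucket. Dismissed or unapplied recommendations count for nothing.
    """
    verified = 0.0
    estimated = 0.0
    for rec in recommendations:
        if not rec.get("applied"):
            continue
        actual = rec.get("actualMonthlySavings")
        if actual is not None:
            verified += actual
        else:
            estimated += rec.get("estimatedMonthlySavings", 0.0)
    return {"verified": verified, "estimated": estimated}
```

Note that once a measured value exists it replaces the estimate entirely, so a recommendation that saved less than projected lowers the total rather than being padded by its original forecast.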
Verified vs estimated
Your savings total separates what’s been verified on real traffic from what’s an estimate. A daily job re-checks applied recommendations so the verified number reflects money you actually kept, not a projection. How we handle your prompts and keys is on the security page.
Frequently asked
Is the 30% just 'use a cheaper model'?
No. Model right-sizing is one lever; on its own, swapping models typically saves around a quarter of the cost of the affected traffic. The rest comes from caching (edge + prompt), fallback efficiency, and applied recommendations. The headline figure is a blend across all of them, and it varies by workload: some apps save more, some less.
How do you prove quality didn't drop?
Every model-switch recommendation is gated by an LLM-as-judge regression check before it's suggested, and you apply it via an A/B split rather than a hard cutover. You can roll back any rule in one click. We optimize cost-per-quality, not cost alone.
Are cache hits counted as savings or hidden from cost?
Cache hits are excluded from your cost figures by default — they're savings, not spend — and surfaced separately so you can see hit rate and dollars saved. Latency metrics also exclude cache hits so their near-instant responses don't distort your percentiles.
Does the savings number include things that didn't happen yet?
The savings total separates verified savings (measured on real traffic) from estimated run-rate. Applied recommendations are re-checked daily by a verify-savings job, so the verified figure reflects money you actually kept.
See your own number
Plug in one line and Tokenwise measures these levers on your actual traffic — then emails you the ones worth pulling.