Free tool

LLM speed benchmarks

Tokens per second and time-to-first-token across every major model and host.

TPS isn’t everything — but for streaming UX, voice agents, and agent loops, it’s the difference between snappy and laggy. Two numbers matter: output throughput (tokens per second once the stream starts) and time to first token (the delay before anything appears). Reasoning models trade both for quality; specialty hosts like Groq trade quality for both.

The same model can be ten times faster on one host than another — Llama 3.3 70B on Groq vs. Bedrock is the canonical example. Host matters at least as much as the model name. We list Groq variants below because they’re the standard cheap-fast lane; for self-host or other regional providers, expect ~⅓ the TPS shown here.

Lightning

>500 TPS

Specialty inference hardware. Use for voice, sub-second agent loops, or long generations where every second compounds.

Model	Provider	Output TPS	Time to first token	Notes
Llama 3.1 8B (Groq)	Groq	750	200 ms	Lightning-fast on Groq's LPU hardware.

Fast

150–500 TPS

Frontier-tier hosted models tuned for streaming UX. Good default for chat and search interfaces.

Model	Provider	Output TPS	Time to first token	Notes
Llama 3.3 70B (Groq)	Groq	280	250 ms	Open-weight model. Pricing + speed depend on host.
Gemini 2.0 Flash	Google	250	300 ms	—
Gemini 1.5 Flash	Google	190	350 ms	—
Claude Haiku 4.5	Anthropic	180	400 ms	—
GPT-4.1 mini	OpenAI	150	400 ms	—
o3-mini	OpenAI	150	1.5 s	—
Mistral Small 3	Mistral	150	350 ms	—

Standard

50–150 TPS

Workhorse range. Quality leaders live here; the stream still feels live for most chat UIs.

Model	Provider	Output TPS	Time to first token	Notes
GPT-4o mini	OpenAI	140	400 ms	—
GPT-4o	OpenAI	110	600 ms	—
GPT-4.1	OpenAI	100	600 ms	—
Claude Sonnet 4.6	Anthropic	90	800 ms	—
Grok-3	xAI	90	600 ms	—
Mistral Large 2	Mistral	85	600 ms	—
Claude 3.5 Sonnet	Anthropic	80	900 ms	Older but still popular for cost-stability reasons.
Grok-2	xAI	70	700 ms	—
Gemini 1.5 Pro	Google	60	800 ms	2M context — the largest of any production model.
DeepSeek V3	DeepSeek	60	700 ms	Off-peak (UTC 16:30–00:30) is 50% cheaper.
Claude Opus 4.7	Anthropic	50	1.2 s	Cache write 1.25× input price. Extended thinking optional. JSON via prefill or tool-use trick.

Slow

<50 TPS

Reasoning and top-quality models. Don't stream these to a typing UI — render with a spinner.

Model	Provider	Output TPS	Time to first token	Notes
o1	OpenAI	30	8.0 s	Reasoning tokens billed but hidden from output.
DeepSeek R1	DeepSeek	30	2.5 s	Reasoning model — outputs include chain-of-thought.

Source: public LLM speed benchmarks (artificialanalysis.ai, provider docs), last verified May 24, 2026. Real-world speed depends on prompt length, region, and provider load. For live measurements, artificialanalysis.ai updates daily. Same data also at /api/llm-prices.json.

See real P50/P95 latency for your own traffic — try Tokenwise.