
LLM speed benchmarks

Tokens per second and time-to-first-token across every major model and host.


TPS isn’t everything — but for streaming UX, voice agents, and agent loops, it’s the difference between snappy and laggy. Two numbers matter: output throughput (tokens per second once the stream starts) and time to first token (the delay before anything appears). Reasoning models trade both for quality; specialty hosts like Groq trade quality for both.
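As a rough illustration of how the two numbers are separated, here is a minimal sketch that derives them from token arrival timestamps. The function name `stream_metrics` and the timestamp inputs are hypothetical; this is not the methodology behind the figures on this page, just one common way to define the metrics.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class StreamMetrics:
    ttft_s: float       # time to first token, in seconds
    output_tps: float   # tokens per second once the stream has started


def stream_metrics(request_sent_s: float, token_times_s: Sequence[float]) -> StreamMetrics:
    """Compute TTFT and output throughput from arrival timestamps (seconds).

    Assumption: one timestamp per output token, in arrival order.
    """
    if len(token_times_s) < 2:
        raise ValueError("need at least two token timestamps to measure throughput")
    # TTFT: delay between sending the request and the first token arriving.
    ttft = token_times_s[0] - request_sent_s
    # Throughput: tokens received after the first one, divided by the time
    # the stream was active. The wait before the first token is excluded.
    streaming_duration = token_times_s[-1] - token_times_s[0]
    if streaming_duration <= 0:
        raise ValueError("token timestamps must span a positive duration")
    tps = (len(token_times_s) - 1) / streaming_duration
    return StreamMetrics(ttft_s=ttft, output_tps=tps)


if __name__ == "__main__":
    # Synthetic stream: first token at 0.4 s, then one token every 10 ms.
    times = [0.4 + i * 0.01 for i in range(101)]
    print(stream_metrics(0.0, times))
```

Keeping the pre-stream wait out of the throughput figure is what lets the two metrics move independently: a model can have a long delay before the first token and still stream quickly afterwards.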

The same model can be ten times faster on one host than on another; Llama 3.3 70B on Groq vs. Bedrock is the canonical example. The host matters at least as much as the model name. We list Groq variants below because they’re the standard cheap-fast lane; if you self-host or use other regional providers, expect roughly ⅓ of the Groq TPS shown here.
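To see what a throughput gap means for a user, a back-of-the-envelope estimate is response time ≈ time to first token + output tokens ÷ TPS. The sketch below applies that to the Llama 3.3 70B (Groq) figures from this page and to a hypothetical host at one third of that throughput. The helper name, the 500-token response length, and the assumption that time to first token stays the same on the slower host are all illustrative.

```python
def estimated_response_time_s(ttft_s: float, output_tps: float, output_tokens: int) -> float:
    """Rough wall-clock time for a full response: wait for the first token,
    then stream the remaining output at a constant rate."""
    if output_tps <= 0:
        raise ValueError("output_tps must be positive")
    return ttft_s + output_tokens / output_tps


if __name__ == "__main__":
    tokens = 500  # assumed response length, for illustration only
    # Figures for Llama 3.3 70B on Groq as listed on this page.
    on_groq = estimated_response_time_s(0.25, 280, tokens)
    # Hypothetical host at one third of the throughput, same TTFT assumed.
    one_third = estimated_response_time_s(0.25, 280 / 3, tokens)
    print(f"{on_groq:.2f} s vs {one_third:.2f} s")  # prints "2.04 s vs 5.61 s"
```

For short replies the time to first token dominates; for long generations the throughput term does, which is why the gap widens as output length grows.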

Lightning

>500 TPS

Specialty inference hardware. Use for voice, sub-second agent loops, or long generations where every second compounds.

| Model | Provider | Output TPS | Time to first token | Notes |
| --- | --- | --- | --- | --- |
| Llama 3.1 8B (Groq) | Groq | 750 | 200 ms | Lightning-fast on Groq's LPU hardware. |

Fast

150–500 TPS

Frontier-tier hosted models tuned for streaming UX. Good default for chat and search interfaces.

| Model | Provider | Output TPS | Time to first token | Notes |
| --- | --- | --- | --- | --- |
| Llama 3.3 70B (Groq) | Groq | 280 | 250 ms | Open-weight model. Pricing + speed depend on host. |
| Gemini 2.0 Flash | Google | 250 | 300 ms | |
| Gemini 1.5 Flash | Google | 190 | 350 ms | |
| Claude Haiku 4.5 | Anthropic | 180 | 400 ms | |
| GPT-4.1 mini | OpenAI | 150 | 400 ms | |
| o3-mini | OpenAI | 150 | 1.5 s | |
| Mistral Small 3 | Mistral | 150 | 350 ms | |

Standard

50–150 TPS

Workhorse range. Quality leaders live here; the stream still feels live for most chat UIs.

| Model | Provider | Output TPS | Time to first token | Notes |
| --- | --- | --- | --- | --- |
| GPT-4o mini | OpenAI | 140 | 400 ms | |
| GPT-4o | OpenAI | 110 | 600 ms | |
| GPT-4.1 | OpenAI | 100 | 600 ms | |
| Claude Sonnet 4.6 | Anthropic | 90 | 800 ms | |
| Grok-3 | xAI | 90 | 600 ms | |
| Mistral Large 2 | Mistral | 85 | 600 ms | |
| Claude 3.5 Sonnet | Anthropic | 80 | 900 ms | Older but still popular for cost-stability reasons. |
| Grok-2 | xAI | 70 | 700 ms | |
| Gemini 1.5 Pro | Google | 60 | 800 ms | 2M context, the largest of any production model. |
| DeepSeek V3 | DeepSeek | 60 | 700 ms | Off-peak (UTC 16:30–00:30) is 50% cheaper. |
| Claude Opus 4.7 | Anthropic | 50 | 1.2 s | Cache write 1.25× input price. Extended thinking optional. JSON via prefill or tool-use trick. |

Slow

<50 TPS

Reasoning and top-quality models. Don't stream these to a typing UI — render with a spinner.

| Model | Provider | Output TPS | Time to first token | Notes |
| --- | --- | --- | --- | --- |
| o1 | OpenAI | 30 | 8.0 s | Reasoning tokens billed but hidden from output. |
| DeepSeek R1 | DeepSeek | 30 | 2.5 s | Reasoning model; outputs include chain-of-thought. |
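The four tiers used on this page can be expressed as a small lookup. This is a sketch only: the function name `speed_tier` is hypothetical, and the boundary handling (150 TPS counts as Fast, 50 TPS as Standard, exactly 500 TPS as Fast) is inferred from where the rows above are placed rather than stated anywhere on the page.

```python
def speed_tier(output_tps: float) -> str:
    """Map an output throughput (tokens per second) to this page's tier names.

    Boundary assumption: lower bounds are inclusive, matching the tables,
    where 150 TPS models are listed under Fast and a 50 TPS model under Standard.
    """
    if output_tps > 500:
        return "Lightning"
    if output_tps >= 150:
        return "Fast"
    if output_tps >= 50:
        return "Standard"
    return "Slow"


if __name__ == "__main__":
    # Sample throughputs taken from the tables on this page.
    for tps in (750, 280, 150, 90, 50, 30):
        print(tps, speed_tier(tps))
```

A check like this could drive a UI choice, for example streaming tokens directly for the three faster tiers and showing a spinner for the Slow tier, as the page suggests.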

Source: public LLM speed benchmarks (artificialanalysis.ai, provider docs), last verified May 24, 2026. Real-world speed depends on prompt length, region, and provider load. For live measurements, see artificialanalysis.ai, which updates daily. The same data is also available at /api/llm-prices.json.
