LLM speed benchmarks
Tokens per second and time-to-first-token across every major model and host.
TPS isn’t everything — but for streaming UX, voice agents, and agent loops, it’s the difference between snappy and laggy. Two numbers matter: output throughput (tokens per second once the stream starts) and time to first token (the delay before anything appears). Reasoning models trade both for quality; specialty hosts like Groq trade quality for both.
The same model can be ten times faster on one host than another; Llama 3.3 70B on Groq vs. Bedrock is the canonical example. Host matters at least as much as the model name. We list Groq variants below because they’re the standard cheap-fast lane; for self-hosted deployments or other regional providers, expect roughly ⅓ of the TPS shown here.
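To see how the two numbers combine, a rough estimate of the wall-clock time for a full reply is time to first token plus output tokens divided by throughput. The sketch below is a back-of-the-envelope model, not a measurement. The function name `estimate_reply_seconds`, the 500-token reply length, and the `SELF_HOST_FACTOR` constant (the ~⅓ rule of thumb) are assumptions for illustration, and the model ignores prompt length and provider load.

```python
# Rough model: total time = time to first token + output tokens / throughput.
SELF_HOST_FACTOR = 1 / 3  # assumption: the ~1/3 rule of thumb for non-Groq hosts


def estimate_reply_seconds(output_tokens: int, tps: float, ttft_s: float) -> float:
    """Estimate wall-clock seconds to stream a full reply."""
    if tps <= 0:
        raise ValueError("tps must be positive")
    return ttft_s + output_tokens / tps


# Llama 3.3 70B on Groq, using the figures listed on this page: 280 TPS, 250 ms.
groq = estimate_reply_seconds(500, tps=280, ttft_s=0.25)
# Same model on a slower host; TTFT is held fixed for simplicity (an assumption).
other = estimate_reply_seconds(500, tps=280 * SELF_HOST_FACTOR, ttft_s=0.25)

print(f"Groq: {groq:.2f} s, other host: {other:.2f} s")
# prints "Groq: 2.04 s, other host: 5.61 s"
```

For short replies the time to first token dominates; for long generations throughput does, which is why both columns appear in every table.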
Lightning
>500 TPS. Specialty inference hardware. Use for voice, sub-second agent loops, or long generations where every second compounds.
| Model | Provider | Output TPS | Time to first token | Notes |
|---|---|---|---|---|
| Llama 3.1 8B (Groq) | Groq | 750 | 200 ms | Lightning-fast on Groq's LPU hardware. |
Fast
150–500 TPS. Frontier-tier hosted models tuned for streaming UX. Good default for chat and search interfaces.
| Model | Provider | Output TPS | Time to first token | Notes |
|---|---|---|---|---|
| Llama 3.3 70B (Groq) | Groq | 280 | 250 ms | Open-weight model. Pricing + speed depend on host. |
| Gemini 2.0 Flash | Google | 250 | 300 ms | — |
| Gemini 1.5 Flash | Google | 190 | 350 ms | — |
| Claude Haiku 4.5 | Anthropic | 180 | 400 ms | — |
| GPT-4.1 mini | OpenAI | 150 | 400 ms | — |
| o3-mini | OpenAI | 150 | 1.5 s | — |
| Mistral Small 3 | Mistral | 150 | 350 ms | — |
Standard
50–150 TPS. Workhorse range. Quality leaders live here; the stream still feels live for most chat UIs.
| Model | Provider | Output TPS | Time to first token | Notes |
|---|---|---|---|---|
| GPT-4o mini | OpenAI | 140 | 400 ms | — |
| GPT-4o | OpenAI | 110 | 600 ms | — |
| GPT-4.1 | OpenAI | 100 | 600 ms | — |
| Claude Sonnet 4.6 | Anthropic | 90 | 800 ms | — |
| Grok-3 | xAI | 90 | 600 ms | — |
| Mistral Large 2 | Mistral | 85 | 600 ms | — |
| Claude 3.5 Sonnet | Anthropic | 80 | 900 ms | Older but still popular for cost-stability reasons. |
| Grok-2 | xAI | 70 | 700 ms | — |
| Gemini 1.5 Pro | Google | 60 | 800 ms | 2M context, the largest of any production model. |
| DeepSeek V3 | DeepSeek | 60 | 700 ms | Off-peak (UTC 16:30–00:30) is 50% cheaper. |
| Claude Opus 4.7 | Anthropic | 50 | 1.2 s | Cache write 1.25× input price. Extended thinking optional. JSON via prefill or tool-use trick. |
Slow
<50 TPS. Reasoning and top-quality models. Don't stream these to a typing UI; render with a spinner.
| Model | Provider | Output TPS | Time to first token | Notes |
|---|---|---|---|---|
| o1 | OpenAI | 30 | 8.0 s | Reasoning tokens billed but hidden from output. |
| DeepSeek R1 | DeepSeek | 30 | 2.5 s | Reasoning model — outputs include chain-of-thought. |
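The four tiers above reduce to a threshold lookup. The sketch below is one way to express it. The boundary handling is an assumption inferred from where the tables place models (150 TPS lands in Fast, 50 TPS in Standard), and the function names `speed_tier` and `should_stream` are hypothetical.

```python
def speed_tier(tps: float) -> str:
    """Map output tokens per second to the tier names used on this page."""
    if tps > 500:
        return "Lightning"
    if tps >= 150:  # assumption: 150 counts as Fast, as GPT-4.1 mini does above
        return "Fast"
    if tps >= 50:   # assumption: 50 counts as Standard, as Claude Opus 4.7 does above
        return "Standard"
    return "Slow"


def should_stream(tps: float) -> bool:
    """Apply the Slow-tier guidance: show a spinner instead of streaming."""
    return speed_tier(tps) != "Slow"


for name, tps in [("Llama 3.1 8B (Groq)", 750), ("GPT-4.1 mini", 150),
                  ("Claude Opus 4.7", 50), ("o1", 30)]:
    print(f"{name}: {speed_tier(tps)}, stream={should_stream(tps)}")
```

Throughput alone does not capture the o3-mini case in the Fast table, where a 1.5 s time to first token sits alongside 150 TPS, so a UI decision may also want to check time to first token.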
Source: public LLM speed benchmarks (artificialanalysis.ai, provider docs), last verified May 24, 2026. Real-world speed depends on prompt length, region, and provider load. For live measurements, artificialanalysis.ai updates daily. Same data also at /api/llm-prices.json.
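Because published numbers vary with prompt length, region, and load, it can help to measure both metrics on your own traffic. The sketch below times any iterable of output tokens. `measure_stream` and `fake_stream` are hypothetical names, and the fake generator stands in for a real provider's streaming client, which is not shown here.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_stream(tokens: Iterable[str]) -> Tuple[float, float, int]:
    """Return (ttft_seconds, output_tps, token_count) for a token stream.

    TPS is computed over the window after the first token arrives,
    matching "tokens per second once the stream starts".
    """
    start = time.perf_counter()
    first = None
    last = start
    count = 0
    for _ in tokens:
        last = time.perf_counter()
        if first is None:
            first = last
        count += 1
    if first is None:
        raise ValueError("stream produced no tokens")
    ttft = first - start
    decode_time = last - first
    # The first token opens the window, so count - 1 tokens arrive inside it.
    tps = (count - 1) / decode_time if decode_time > 0 else 0.0
    return ttft, tps, count


def fake_stream(n: int, ttft_s: float, gap_s: float) -> Iterator[str]:
    """Stand-in for a provider stream: waits ttft_s, then emits n tokens."""
    time.sleep(ttft_s)
    for i in range(n):
        if i:
            time.sleep(gap_s)
        yield f"tok{i}"


ttft, tps, count = measure_stream(fake_stream(20, ttft_s=0.05, gap_s=0.005))
print(f"TTFT {ttft * 1000:.0f} ms, {tps:.0f} TPS over {count} tokens")
```

A single run is noisy. Repeating the measurement and looking at the median and tail values gives a fairer picture than any one sample.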
See real P50/P95 latency for your own traffic — try Tokenwise.