You're paying flagship prices for routine work.
One expensive model is most of your spend, but the actual tasks are simple — drafting, classification, Q&A. Engineers reach for the flagship by default.
Tokens are the units LLMs read and write — roughly ¾ of a word in English. You're billed per million of them. Most teams could cut their bill 40–70% by changing two or three things. Here's where the money leaks — and the short list of fixes.
Four leaks most teams have right now
Pricing as of May 2026. Verify current rates at the provider's site before budgeting.
Forget the textbook definition. Here's what you need to know to read your bill.
1 token ≈ 4 characters ≈ ¾ of an English word. Short common words are usually one token. Long words break into multiple tokens. Code, numbers, and other languages tokenize differently — usually less efficiently.
Input tokens are what you send the model (your prompt, attached documents, conversation history). Output tokens are what it writes back.
Reading is cheap. Writing is expensive — the model has to think one token at a time to produce output, which burns compute. Every current frontier provider charges 3–6× more for output than input.
max_tokens and ask for terse responses. If it's input-heavy (summarization, Q&A), use caching.Think of the context window as the model's desk. It can only see what's on the desk right now. Frontier models today have desks that hold 200,000 to 1,000,000 tokens (roughly 500–2,500 pages). Anything off the desk is forgotten.
The catch: a bigger desk doesn't mean a free desk. Every token you put on it is billed every time you talk to the model. A long conversation that grows turn by turn re-sends prior turns each time — your bill grows quadratically if you're not careful.
Six patterns that account for most surprise bills. If you recognize one in your workload, jump to the lever that fixes it.
The per-million-token rate is the least interesting variable in your bill. The interesting one is how much waste piles up before you ever look at the invoice. Most teams hit at least one of these every day.
One expensive model is most of your spend, but the actual tasks are simple — drafting, classification, Q&A. Engineers reach for the flagship by default.
A long system prompt, knowledge base, or codebase appears verbatim in every request. Each call re-reads the same tokens — at full price.
Long multi-turn chats. Each turn re-sends prior turns, so turn 20 costs roughly 20× turn 1 in input tokens. Nobody notices because the per-call price looks identical.
200-page PDFs in context when the answer lives in three paragraphs. Whole codebases when the model needs three functions. The model is reading tonnage you don't need it to read.
No max-tokens ceiling, no "be terse" in the prompt. The model writes paragraphs when a sentence would do — and output is the expensive side of the bill.
Nightly summarizations, bulk classification, embeddings jobs — all hitting the synchronous, premium-priced API when they don't need to be.
Plug in your workload. Try a different model. Toggle caching. The number moves with you.
Tune inputs. See your bill change in real time.
Reused system prompt + KB; short user turns, medium answers.
Reuse the same long context across requests.
Overnight / async workloads. 50% off.
Monthly cost
$70.20
$854.10 / yr · $2.34 / day
Per request
$0.0029
Per user / mo
$70.20
Same workload, every model
Per request: (input tokens × input price) + (output tokens × output price) + cached portion priced at the cache-read rate.
Today: 2.5K in @ $1.00/M + 400 out @ $5.00/M = $0.0029/req.
Monthly: per-request × 800 req/day × 30 days.
Cache discount applied to 70% of input tokens.
Sortable. The stars mark the right default for most workloads — not the most expensive option, the one that actually does the job.
| Batch | Best for | ||||||
|---|---|---|---|---|---|---|---|
Gemini 2.5 Flash-Lite | Google | $0.1000 | $0.4000 | $0.0100 | 50% off | 1M | Bulk classification, embeddings-adjacent tasks |
Gemini 2.5 Flash | Google | $0.3000 | $2.50 | $0.0300 | 50% off | 1M | Cheap-but-capable for high-volume workloads |
Claude Haiku 4.5 | Anthropic | $1.00 | $5.00 | $0.1000 | 50% off | 200K | High-volume routing, classification, fast chat |
Gemini 2.5 Pro | Google | $1.25 | $10.00 | $0.1250 | 50% off | 1M | Long-context (1M) work at flagship cost |
GPT-5.4 | OpenAI | $2.50 | $15.00 | $0.2500 | 50% off | 1M | Balanced reasoning, multimodal, broad use |
GPT-4o | OpenAI | $2.50 | $10.00 | $1.25 | 50% off | 128K | Legacy integrations, voice-first, multimodal |
Claude Sonnet 4.6 | Anthropic | $3.00 | $15.00 | $0.3000 | 50% off | 1M | Daily driver — production apps, drafting, analysis |
Claude Opus 4.7 | Anthropic | $5.00 | $25.00 | $0.5000 | 50% off | 1M | Hardest reasoning, agentic coding, deep research |
GPT-5.5 | OpenAI | $5.00 | $30.00 | $0.5000 | 50% off | 1M | Frontier reasoning when accuracy beats cost |
Bulk classification, embeddings-adjacent tasks
Cheap-but-capable for high-volume workloads
High-volume routing, classification, fast chat
Long-context (1M) work at flagship cost
Balanced reasoning, multimodal, broad use
Legacy integrations, voice-first, multimodal
Daily driver — production apps, drafting, analysis
Hardest reasoning, agentic coding, deep research
Frontier reasoning when accuracy beats cost
Solid default starting points if you don't want to think hard. Escalate to a flagship only when these can't do the job.
In rough priority order. If you're only going to do one thing, do #1.
Capability is a spectrum. Most routine work (drafting, summarizing, classifying) does not need a frontier model. Start small; escalate only when the small model demonstrably fails.
If you send the same long context repeatedly (a system prompt, a knowledge base, a codebase), cache it. Subsequent reads cost ~10% of the standard input rate.
Most providers offer a 50% discount for asynchronous batch jobs that return within ~24h. Perfect for nightly summarizations, bulk classification, and embeddings-style work.
Don't dump a 200-page codebase when the model needs three functions. Use retrieval or scoping to send only the slice that matters. Less input = less cost, faster answers, fewer hallucinations.
When a chat drags on, summarize the relevant facts into a short brief and start a new conversation. You pay for context every turn — pruning it pays back fast.
Set a max_tokens ceiling and ask explicitly for "the shortest correct answer." Output is the expensive side of the bill — short answers compound.
Let the model fetch what it needs (search, database query, file read) instead of dumping the whole haystack into context up front. You pay for the needle, not the haystack.
The levers above are personal. Here are the ones a CFO/COO gets that an individual doesn't.
Per-seat is predictable but undercharges power users and overcharges the long tail. Metered is fair but harder to budget. The mature answer: per-seat floor with metered overage above a usage threshold. Most provider dashboards now expose both views.
Every major provider supports per-key spend caps. Set them. The number-one cause of a runaway bill is an unmonitored agent looping on retries — caps turn a five-figure incident into a two-figure one.
Build (or buy) a router that sends easy queries to a cheap model and escalates hard ones. Even a crude classifier — based on input length, presence of code, or a small first-pass model — routinely cuts spend 50–70% with no observable quality drop.
Tag every API key with the team, product, or feature that owns it. Allocate the bill the same way you allocate AWS. Without this, AI cost becomes a single fuzzy line item nobody owns — which is how you end up with a $200k surprise.
Watch for: sudden token spikes (loops or prompt injection), output runs longer than normal (broken stop conditions), spikes in flagship-model usage (someone hardcoded the wrong model). A simple daily alert beats a postmortem.
At ~$10–20k/month committed spend, most providers will negotiate. Ask for: committed-use discounts, dedicated capacity (no rate limits), BAA for healthcare, data residency, and a named TAM. Multi-vendor leverage helps — even if you don't end up multi-vendor.
The tightest possible metering creates friction that suppresses adoption — and the productivity tax of slow AI adoption dwarfs the token bill. For most orgs under $50k/month, the right policy is generous defaults + tagged spend visibility. Save tight controls for the runaway-cost surface area (autonomous agents, public-facing endpoints).
The words your team uses, in your language.