AI Essentials

What is a token,
and how do teams waste them?

Tokens are the units LLMs read and write — roughly ¾ of a word in English. You're billed per million of them. Most teams could cut their bill 40–70% by changing two or three things. Here's where the money leaks — and the short list of fixes.

Four leaks most teams have right now

The biggest leak is model choice. The same task can cost 50× more on a flagship than on a fast model — often with no quality difference. Reach for the smaller one first.
Output costs 3–6× more than input. Reading is cheap; writing is expensive. A "be terse" rule plus a max-tokens cap usually pays for itself in a week.
Caching is free money most teams leave on the table. Re-sending a long context (system prompt, knowledge base, codebase) without caching wastes 80–90% of input spend.
Long conversations bill quadratically. Each turn re-sends every prior turn. Turn 20 of a chat can cost 20× turn 1. Compact and restart.

Pricing as of May 2026. Verify current rates at the provider's site before budgeting.

The unit

What is a token, in plain English?

Forget the textbook definition. Here's what you need to know to read your bill.

The ¾-of-a-word rule

1 token ≈ 4 characters ≈ ¾ of an English word. Short common words are usually one token. Long words break into multiple tokens. Code, numbers, and other languages tokenize differently — usually less efficiently.

"The cat sat on the mat."≈ 7 tokens

"Quarterly EBITDA grew 14%."≈ 10 tokens

"antidisestablishmentarianism"≈ 6 tokens

Input vs. output — why output costs more

Input tokens are what you send the model (your prompt, attached documents, conversation history). Output tokens are what it writes back.

Reading is cheap. Writing is expensive — the model has to think one token at a time to produce output, which burns compute. Every current frontier provider charges 3–6× more for output than input.

Rule of thumb: if your workload is output-heavy (drafting, generation), tighten max_tokens and ask for terse responses. If it's input-heavy (summarization, Q&A), use caching.

Context windows — the model's desk

Think of the context window as the model's desk. It can only see what's on the desk right now. Frontier models today have desks that hold 200,000 to 1,000,000 tokens (roughly 500–2,500 pages). Anything off the desk is forgotten.

The catch: a bigger desk doesn't mean a free desk. Every token you put on it is billed every time you talk to the model. A long conversation that grows turn by turn re-sends prior turns each time — your bill grows quadratically if you're not careful.

The diagnostic

Where your tokens get wasted

Six patterns that account for most surprise bills. If you recognize one in your workload, jump to the lever that fixes it.

The per-million-token rate is the least interesting variable in your bill. The interesting one is how much waste piles up before you ever look at the invoice. Most teams hit at least one of these every day.

You're paying flagship prices for routine work.

One expensive model is most of your spend, but the actual tasks are simple — drafting, classification, Q&A. Engineers reach for the flagship by default.

5–25× more than necessaryFix: Lever 1 — Pick the smaller model first

You're sending the same long context every call.

A long system prompt, knowledge base, or codebase appears verbatim in every request. Each call re-reads the same tokens — at full price.

80–90% of input is throwaway re-readsFix: Lever 2 — Turn on prompt caching

Your conversations grow without compaction.

Long multi-turn chats. Each turn re-sends prior turns, so turn 20 costs roughly 20× turn 1 in input tokens. Nobody notices because the per-call price looks identical.

40–70% of long-chat spendFix: Lever 5 — Compact long conversations

You dump whole documents when a slice would do.

200-page PDFs in context when the answer lives in three paragraphs. Whole codebases when the model needs three functions. The model is reading tonnage you don't need it to read.

50–90% of input is unread weightFix: Lever 4 — Trim your context

Your outputs ramble.

No max-tokens ceiling, no "be terse" in the prompt. The model writes paragraphs when a sentence would do — and output is the expensive side of the bill.

20–60% of output spendFix: Lever 6 — Cap output length

You're paying interactive rates for overnight work.

Nightly summarizations, bulk classification, embeddings jobs — all hitting the synchronous, premium-priced API when they don't need to be.

50% above what you should payFix: Lever 3 — Use the Batch API

Interactive

See your bill

Plug in your workload. Try a different model. Toggle caching. The number moves with you.

Interactive Cost Calculator

Tune inputs. See your bill change in real time.

Use case

Reused system prompt + KB; short user turns, medium answers.

Model

Avg input tokens / request~1875 words

Avg output tokens / request~300 words

Requests / day

UsersFor per-seat cost

Prompt caching

Reuse the same long context across requests.

% of input that hits the cache70%

Batch API

Overnight / async workloads. 50% off.

Your bill

Monthly cost

$70.20

$854.10 / yr · $2.34 / day

Per request

$0.0029

Per user / mo

$70.20

Same workload, every model

SelectedCheapest

How we computed this

Per request: (input tokens × input price) + (output tokens × output price) + cached portion priced at the cache-read rate.

Today: 2.5K in @ $1.00/M + 400 out @ $5.00/M = $0.0029/req.

Monthly: per-request × 800 req/day × 30 days.

Cache discount applied to 70% of input tokens.

Comparison

Every model worth knowing

Sortable. The stars mark the right default for most workloads — not the most expensive option, the one that actually does the job.

					Batch		Best for
Gemini 2.5 Flash-Lite	Google	$0.1000	$0.4000	$0.0100	50% off	1M	Bulk classification, embeddings-adjacent tasks
Gemini 2.5 Flash	Google	$0.3000	$2.50	$0.0300	50% off	1M	Cheap-but-capable for high-volume workloads
Claude Haiku 4.5	Anthropic	$1.00	$5.00	$0.1000	50% off	200K	High-volume routing, classification, fast chat
Gemini 2.5 Pro	Google	$1.25	$10.00	$0.1250	50% off	1M	Long-context (1M) work at flagship cost
GPT-5.4	OpenAI	$2.50	$15.00	$0.2500	50% off	1M	Balanced reasoning, multimodal, broad use
GPT-4o	OpenAI	$2.50	$10.00	$1.25	50% off	128K	Legacy integrations, voice-first, multimodal
Claude Sonnet 4.6	Anthropic	$3.00	$15.00	$0.3000	50% off	1M	Daily driver — production apps, drafting, analysis
Claude Opus 4.7	Anthropic	$5.00	$25.00	$0.5000	50% off	1M	Hardest reasoning, agentic coding, deep research
GPT-5.5	OpenAI	$5.00	$30.00	$0.5000	50% off	1M	Frontier reasoning when accuracy beats cost

Gemini 2.5 Flash-Lite

Google

1M ctx

Bulk classification, embeddings-adjacent tasks

Input / M$0.1000

Output / M$0.4000

Cached$0.0100

Batch50% off

Gemini 2.5 Flash

Google

1M ctx

Cheap-but-capable for high-volume workloads

Input / M$0.3000

Output / M$2.50

Cached$0.0300

Batch50% off

Claude Haiku 4.5

Anthropic

200K ctx

High-volume routing, classification, fast chat

Input / M$1.00

Output / M$5.00

Cached$0.1000

Batch50% off

Gemini 2.5 Pro

Google

1M ctx

Long-context (1M) work at flagship cost

Input / M$1.25

Output / M$10.00

Cached$0.1250

Batch50% off

GPT-5.4

OpenAI

1M ctx

Balanced reasoning, multimodal, broad use

Input / M$2.50

Output / M$15.00

Cached$0.2500

Batch50% off

GPT-4o

OpenAI

128K ctx

Legacy integrations, voice-first, multimodal

Input / M$2.50

Output / M$10.00

Cached$1.25

Batch50% off

Claude Sonnet 4.6

Anthropic

1M ctx

Daily driver — production apps, drafting, analysis

Input / M$3.00

Output / M$15.00

Cached$0.3000

Batch50% off

Claude Opus 4.7

Anthropic

1M ctx

Hardest reasoning, agentic coding, deep research

Input / M$5.00

Output / M$25.00

Cached$0.5000

Batch50% off

GPT-5.5

OpenAI

1M ctx

Frontier reasoning when accuracy beats cost

Input / M$5.00

Output / M$30.00

Cached$0.5000

Batch50% off

Solid default starting points if you don't want to think hard. Escalate to a flagship only when these can't do the job.

Levers

The 7 cost levers (personal use)

In rough priority order. If you're only going to do one thing, do #1.

Pick the smaller model first

Capability is a spectrum. Most routine work (drafting, summarizing, classifying) does not need a frontier model. Start small; escalate only when the small model demonstrably fails.

Up to 25× cheaperWhen: always (as the default policy)

Turn on prompt caching

If you send the same long context repeatedly (a system prompt, a knowledge base, a codebase), cache it. Subsequent reads cost ~10% of the standard input rate.

80–90% off inputWhen: reused context > 1,000 tokens

Use the Batch API for non-urgent work

Most providers offer a 50% discount for asynchronous batch jobs that return within ~24h. Perfect for nightly summarizations, bulk classification, and embeddings-style work.

50% off everythingWhen: overnight / async OK

Trim your context

Don't dump a 200-page codebase when the model needs three functions. Use retrieval or scoping to send only the slice that matters. Less input = less cost, faster answers, fewer hallucinations.

50–90% off inputWhen: long documents/codebases

Compact long conversations

When a chat drags on, summarize the relevant facts into a short brief and start a new conversation. You pay for context every turn — pruning it pays back fast.

40–70% on long chatsWhen: > 20 turns or > 50k tokens

Cap output length

Set a max_tokens ceiling and ask explicitly for "the shortest correct answer." Output is the expensive side of the bill — short answers compound.

20–60% on outputWhen: drafting / generation tasks

Tool use over re-prompting

Let the model fetch what it needs (search, database query, file read) instead of dumping the whole haystack into context up front. You pay for the needle, not the haystack.

30–80% on inputWhen: agentic workflows, RAG

Org rollout

What changes when you meter the whole org

The levers above are personal. Here are the ones a CFO/COO gets that an individual doesn't.

Per-seat vs. metered billing

Per-seat is predictable but undercharges power users and overcharges the long tail. Metered is fair but harder to budget. The mature answer: per-seat floor with metered overage above a usage threshold. Most provider dashboards now expose both views.

Budget caps by team and product

Every major provider supports per-key spend caps. Set them. The number-one cause of a runaway bill is an unmonitored agent looping on retries — caps turn a five-figure incident into a two-figure one.

Model routing

Build (or buy) a router that sends easy queries to a cheap model and escalates hard ones. Even a crude classifier — based on input length, presence of code, or a small first-pass model — routinely cuts spend 50–70% with no observable quality drop.

Internal chargebacks

Tag every API key with the team, product, or feature that owns it. Allocate the bill the same way you allocate AWS. Without this, AI cost becomes a single fuzzy line item nobody owns — which is how you end up with a $200k surprise.

Anomaly detection

Watch for: sudden token spikes (loops or prompt injection), output runs longer than normal (broken stop conditions), spikes in flagship-model usage (someone hardcoded the wrong model). A simple daily alert beats a postmortem.

Vendor negotiation

At ~$10–20k/month committed spend, most providers will negotiate. Ask for: committed-use discounts, dedicated capacity (no rate limits), BAA for healthcare, data residency, and a named TAM. Multi-vendor leverage helps — even if you don't end up multi-vendor.

Cost of governance vs. cost of tokens

The tightest possible metering creates friction that suppresses adoption — and the productivity tax of slow AI adoption dwarfs the token bill. For most orgs under $50k/month, the right policy is generous defaults + tagged spend visibility. Save tight controls for the runaway-cost surface area (autonomous agents, public-facing endpoints).

Reference

Glossary

The words your team uses, in your language.

What is a token, and how do teams waste them?

The ¾-of-a-word rule

Input vs. output — why output costs more

Context windows — the model's desk

You're paying flagship prices for routine work.

You're sending the same long context every call.

Your conversations grow without compaction.

You dump whole documents when a slice would do.

Your outputs ramble.

You're paying interactive rates for overnight work.

Interactive Cost Calculator

Gemini 2.5 Flash-Lite

Gemini 2.5 Flash

Claude Haiku 4.5

Gemini 2.5 Pro

GPT-5.4

GPT-4o

Claude Sonnet 4.6

Claude Opus 4.7

GPT-5.5

Pick the smaller model first

Turn on prompt caching

Use the Batch API for non-urgent work

Trim your context

Compact long conversations

Cap output length

Tool use over re-prompting

Per-seat vs. metered billing

Budget caps by team and product

Model routing

Internal chargebacks

Anomaly detection

Vendor negotiation

Cost of governance vs. cost of tokens

Token

Context window

Input vs. output tokens

Prompt caching

Batch API

Fine-tuning vs. RAG

Model tier

Agentic workflow

Tool use

What is a token,
and how do teams waste them?