AI Essentials

    What is a token, and how do teams waste them?

    Tokens are the units LLMs read and write — roughly ¾ of a word in English. You're billed per million of them. Most teams could cut their bill 40–70% by changing two or three things. Here's where the money leaks — and the short list of fixes.

    Four leaks most teams have right now

    • The biggest leak is model choice. The same task can cost 50× more on a flagship than on a fast model — often with no quality difference. Reach for the smaller one first.
    • Output costs roughly 4–8× more than input at the rates listed below. Reading is cheap; writing is expensive. A "be terse" rule plus a max-tokens cap usually pays for itself in a week.
    • Caching is free money most teams leave on the table. Re-sending a long context (system prompt, knowledge base, codebase) without caching wastes 80–90% of input spend.
    • Long conversations bill quadratically. Each turn re-sends every prior turn. Turn 20 of a chat can cost 20× turn 1. Compact and restart.

    Pricing as of May 2026. Verify current rates at the provider's site before budgeting.

    The unit

    What is a token, in plain English?

    Forget the textbook definition. Here's what you need to know to read your bill.

    The ¾-of-a-word rule

    1 token ≈ 4 characters ≈ ¾ of an English word. Short common words are usually one token. Long words break into multiple tokens. Code, numbers, and other languages tokenize differently, usually less efficiently.

    "The cat sat on the mat." · 7 tokens
    "Quarterly EBITDA grew 14%." · 10 tokens
    "antidisestablishmentarianism" · 6 tokens

    Input vs. output — why output costs more

    Input tokens are what you send the model (your prompt, attached documents, conversation history). Output tokens are what it writes back.

    Reading is cheap. Writing is expensive: the model produces output one token at a time, which burns more compute than reading the prompt in bulk. The providers in the comparison below charge roughly 4–8× more for output than for input.

    Rule of thumb: if your workload is output-heavy (drafting, generation), tighten max_tokens and ask for terse responses. If it's input-heavy (summarization, Q&A), use caching.
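
    To see which side of the bill dominates, split the cost of a typical request into its input and output parts. The sketch below is one way to do that. The function names and the 50% threshold are assumptions made for illustration, and prices are in dollars per million tokens.

```python
def cost_split(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Return (input_cost, output_cost, output_share) for one request."""
    input_cost = input_tokens * input_price_per_m / 1_000_000
    output_cost = output_tokens * output_price_per_m / 1_000_000
    total = input_cost + output_cost
    output_share = output_cost / total if total else 0.0
    return input_cost, output_cost, output_share


def suggest_lever(output_share, threshold=0.5):
    """Crude rule of thumb: the threshold is an illustrative assumption."""
    if output_share >= threshold:
        return "cap output (max_tokens, terse prompts)"
    return "cache or trim input"


# Q&A-style request: long prompt, short answer, at $1.00 in / $5.00 out per million.
_, _, qa_share = cost_split(2500, 400, 1.00, 5.00)
print(f"{qa_share:.0%} of cost is output -> {suggest_lever(qa_share)}")

# Drafting-style request: short prompt, long answer, same prices.
_, _, draft_share = cost_split(500, 1500, 1.00, 5.00)
print(f"{draft_share:.0%} of cost is output -> {suggest_lever(draft_share)}")
```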

    Context windows — the model's desk

    Think of the context window as the model's desk. It can only see what's on the desk right now. Frontier models today have desks that hold 200,000 to 1,000,000 tokens (roughly 500–2,500 pages). Anything off the desk is forgotten.

    The catch: a bigger desk doesn't mean a free desk. Every token you put on it is billed every time you talk to the model. A long conversation that grows turn by turn re-sends prior turns each time — your bill grows quadratically if you're not careful.
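
    The quadratic growth is easy to check with arithmetic. This sketch assumes every turn adds the same number of tokens and that the full history is re-sent each time, with no caching or compaction. Both are simplifying assumptions, and the function names are illustrative.

```python
def input_tokens_per_turn(turns: int, tokens_per_turn: int) -> list[int]:
    """Input tokens billed at each turn when every prior turn is re-sent."""
    return [k * tokens_per_turn for k in range(1, turns + 1)]


def cumulative_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens billed over the conversation: t * n * (n + 1) / 2."""
    return tokens_per_turn * turns * (turns + 1) // 2


per_turn = input_tokens_per_turn(20, 500)
print(per_turn[0], per_turn[-1])          # turn 1 vs. turn 20
print(cumulative_input_tokens(20, 500))   # total billed across all 20 turns
```

    With 500 tokens added per turn, turn 20 bills 10,000 input tokens against 500 for turn 1. The whole chat bills 105,000 input tokens, while only 10,000 tokens of new content were ever written.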

    The diagnostic

    Where your tokens get wasted

    Six patterns that account for most surprise bills. If you recognize one in your workload, jump to the lever that fixes it.

    The per-million-token rate is the least interesting variable in your bill. The interesting one is how much waste piles up before you ever look at the invoice. Most teams hit at least one of these every day.

    You're paying flagship prices for routine work.

    One expensive model is most of your spend, but the actual tasks are simple — drafting, classification, Q&A. Engineers reach for the flagship by default.

    You're sending the same long context every call.

    A long system prompt, knowledge base, or codebase appears verbatim in every request. Each call re-reads the same tokens — at full price.

    80–90% of input is throwaway re-reads · Fix: Lever 2, Turn on prompt caching

    Your conversations grow without compaction.

    Long multi-turn chats. Each turn re-sends prior turns, so turn 20 costs roughly 20× turn 1 in input tokens. Nobody notices because the per-call price looks identical.

    You dump whole documents when a slice would do.

    200-page PDFs in context when the answer lives in three paragraphs. Whole codebases when the model needs three functions. The model is reading tonnage you don't need it to read.

    50–90% of input is unread weight · Fix: Lever 4, Trim your context

    Your outputs ramble.

    No max-tokens ceiling, no "be terse" in the prompt. The model writes paragraphs when a sentence would do — and output is the expensive side of the bill.

    20–60% of output spend · Fix: Lever 6, Cap output length

    You're paying interactive rates for overnight work.

    Nightly summarizations, bulk classification, embeddings jobs — all hitting the synchronous, premium-priced API when they don't need to be.

    2× what you should pay (batch is 50% off) · Fix: Lever 3, Use the Batch API

    Interactive

    See your bill

    The interactive cost calculator is not reproduced here. It lets you plug in your workload, try a different model, and toggle caching and batch pricing. Its default scenario is shown below.

    Scenario: reused system prompt + KB; short user turns, medium answers.

    Input per request: 2.5K tokens (~1,875 words)
    Output per request: 400 tokens (~300 words)
    Requests: 800 per day
    Prompt caching (reuse the same long context across requests): 70% of input hits the cache
    Batch (overnight / async workloads, 50% off): not applied in this example

    Your bill

    Monthly cost: $70.20 ($854.10 / yr · $2.34 / day)
    Per request: $0.0029
    Per user / mo: $70.20

    How we computed this

    Per request: (uncached input tokens × input price) + (cached input tokens × cache-read price) + (output tokens × output price).

    Today: 2.5K in @ $1.00/M + 400 out @ $5.00/M = $0.0029/req.

    Monthly: per-request × 800 req/day × 30 days.

    Cache discount applied to 70% of input tokens.
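
    The same arithmetic as a short script, in case you want to run your own numbers. This is a sketch, not the calculator's actual code. The function name and parameters are illustrative, the default cache-read price of 10% of the input rate is an assumption, and so is applying the batch discount as a flat 50% off the total.

```python
def cost_per_request(input_tokens, output_tokens, input_price, output_price,
                     cache_hit_rate=0.0, cache_read_price=None, batch=False):
    """Dollar cost of one request. Prices are in dollars per million tokens."""
    if cache_read_price is None:
        # Assumption: cached reads cost about 10% of the standard input rate.
        cache_read_price = 0.10 * input_price
    cached = input_tokens * cache_hit_rate
    uncached = input_tokens - cached
    cost = (uncached * input_price
            + cached * cache_read_price
            + output_tokens * output_price) / 1_000_000
    if batch:
        # Assumption: flat 50% off; providers may combine discounts differently.
        cost *= 0.5
    return cost


per_request = cost_per_request(2500, 400, 1.00, 5.00, cache_hit_rate=0.70)
monthly = per_request * 800 * 30
print(f"${per_request:.4f} per request, ${monthly:.2f} per month")
```

    With the inputs from the worked example this reproduces $0.0029 per request and $70.20 per month.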

    Comparison

    Every model worth knowing

    The stars mark the right default for most workloads: not the most expensive option, but the one that actually does the job.

    Model                  Provider   Context  Input / M  Output / M  Cached / M  Batch    Best for
    Gemini 2.5 Flash-Lite  Google     1M       $0.10      $0.40       $0.01       50% off  Bulk classification, embeddings-adjacent tasks
    Gemini 2.5 Flash       Google     1M       $0.30      $2.50       $0.03       50% off  Cheap-but-capable for high-volume workloads
    Claude Haiku 4.5       Anthropic  200K     $1.00      $5.00       $0.10       50% off  High-volume routing, classification, fast chat
    Gemini 2.5 Pro         Google     1M       $1.25      $10.00      $0.125      50% off  Long-context (1M) work at flagship cost
    GPT-5.4                OpenAI     1M       $2.50      $15.00      $0.25       50% off  Balanced reasoning, multimodal, broad use
    GPT-4o                 OpenAI     128K     $2.50      $10.00      $1.25       50% off  Legacy integrations, voice-first, multimodal
    Claude Sonnet 4.6      Anthropic  1M       $3.00      $15.00      $0.30       50% off  Daily driver: production apps, drafting, analysis
    Claude Opus 4.7        Anthropic  1M       $5.00      $25.00      $0.50       50% off  Hardest reasoning, agentic coding, deep research
    GPT-5.5                OpenAI     1M       $5.00      $30.00      $0.50       50% off  Frontier reasoning when accuracy beats cost

    Solid default starting points if you don't want to think hard. Escalate to a flagship only when these can't do the job.
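
    To compare models on your own workload, price the same request against every row. The sketch below copies the input and output prices from the comparison above, so treat them as a snapshot. The function names are illustrative, and caching and batch discounts are left out to keep the comparison simple.

```python
# (input, output) prices in dollars per million tokens, copied from the comparison above.
PRICES = {
    "Gemini 2.5 Flash-Lite": (0.10, 0.40),
    "Gemini 2.5 Flash": (0.30, 2.50),
    "Claude Haiku 4.5": (1.00, 5.00),
    "Gemini 2.5 Pro": (1.25, 10.00),
    "GPT-5.4": (2.50, 15.00),
    "GPT-4o": (2.50, 10.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Claude Opus 4.7": (5.00, 25.00),
    "GPT-5.5": (5.00, 30.00),
}


def request_cost(model, input_tokens, output_tokens):
    """Dollar cost of one uncached, non-batch request on the given model."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def rank_models(input_tokens, output_tokens):
    """All models sorted from cheapest to most expensive for this request shape."""
    costs = {m: request_cost(m, input_tokens, output_tokens) for m in PRICES}
    return sorted(costs.items(), key=lambda item: item[1])


ranked = rank_models(2500, 400)
for model, cost in ranked:
    print(f"{model:<22} ${cost:.5f}")
cheapest, priciest = ranked[0], ranked[-1]
print(f"{priciest[0]} costs {priciest[1] / cheapest[1]:.0f}x {cheapest[0]}")
```

    For a 2,500-in, 400-out request, the most expensive row costs about 60× the cheapest one. Price is only half the decision; the cheaper model still has to do the job.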

    Levers

    The 7 cost levers (personal use)

    In rough priority order. If you're only going to do one thing, do #1.

    01

    Pick the smaller model first

    Capability is a spectrum. Most routine work (drafting, summarizing, classifying) does not need a frontier model. Start small; escalate only when the small model demonstrably fails.

    Up to 25× cheaper · When: always (as the default policy)
    02

    Turn on prompt caching

    If you send the same long context repeatedly (a system prompt, a knowledge base, a codebase), cache it. Subsequent reads cost ~10% of the standard input rate.

    80–90% off input · When: reused context > 1,000 tokens
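
    How much caching saves depends on how much of your input hits the cache. A rough sketch of the arithmetic follows. It assumes cached reads cost 10% of the standard input rate, as stated above, and ignores any surcharge a provider may apply when the cache is first written. The function names are illustrative.

```python
def effective_input_price(input_price, cache_hit_rate, cache_read_ratio=0.10):
    """Blended input price per million tokens for a given cache hit rate."""
    uncached_share = 1 - cache_hit_rate
    return input_price * (uncached_share + cache_hit_rate * cache_read_ratio)


def input_savings(cache_hit_rate, cache_read_ratio=0.10):
    """Fraction of input spend saved by caching."""
    return cache_hit_rate * (1 - cache_read_ratio)


for hit_rate in (0.70, 0.90, 1.00):
    print(f"{hit_rate:.0%} cache hits -> {input_savings(hit_rate):.0%} off input")
```

    Savings in the 80–90% range need roughly nine in ten input tokens to hit the cache, which is realistic when a long shared prefix dwarfs each user turn.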
    03

    Use the Batch API for non-urgent work

    Most providers offer a 50% discount for asynchronous batch jobs that return within ~24h. Perfect for nightly summarizations, bulk classification, and embeddings-style work.

    50% off everything · When: overnight / async OK
    04

    Trim your context

    Don't dump a whole codebase when the model needs three functions, or a 200-page document when the answer lives in three paragraphs. Use retrieval or scoping to send only the slice that matters. Less input = less cost, faster answers, fewer hallucinations.

    50–90% off input · When: long documents/codebases
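
    A deliberately naive sketch of scoping: score each chunk of a document by how many words it shares with the question and send only the top few. Real retrieval usually uses embeddings or a search index; word overlap is swapped in here only to keep the example self-contained. The function names are illustrative.

```python
import re


def words(text):
    """Lowercased set of word-like tokens in the text."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def top_chunks(question, chunks, k=3):
    """Return the k chunks that share the most words with the question."""
    question_words = words(question)
    ranked = sorted(chunks, key=lambda c: len(question_words & words(c)), reverse=True)
    return ranked[:k]


chunks = [
    "Refund policy: refunds are issued within 30 days.",
    "Office hours are 9 to 5.",
    "Shipping takes 5 business days.",
]
# Only the best-matching chunk goes into the prompt instead of the whole document.
print(top_chunks("How many days until a refund is issued?", chunks, k=1))
```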
    05

    Compact long conversations

    When a chat drags on, summarize the relevant facts into a short brief and start a new conversation. You pay for context every turn — pruning it pays back fast.

    40–70% on long chats · When: > 20 turns or > 50k tokens
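
    One way to automate this is to check the history before every call and swap it for a brief once it crosses a threshold. In the sketch below, `summarize` is a hypothetical callable you supply (for example, a call to a cheap model), the token estimate is the 4-characters rule of thumb, and the thresholds are the ones given above.

```python
def estimate_tokens(text):
    """Rough estimate: about 4 characters per token."""
    return (len(text) + 3) // 4


def maybe_compact(history, summarize, max_turns=20, max_tokens=50_000):
    """Return the history unchanged, or a one-message brief if it has grown too long.

    history:   list of message strings, oldest first.
    summarize: hypothetical callable taking the history and returning a short brief.
    """
    total_tokens = sum(estimate_tokens(message) for message in history)
    if len(history) <= max_turns and total_tokens <= max_tokens:
        return history
    brief = summarize(history)
    # Start the next conversation from the brief alone.
    return [f"Summary of earlier conversation: {brief}"]


# Stand-in summarizer for demonstration; a real one would call a model.
fake_summarize = lambda messages: f"{len(messages)} earlier messages condensed"
long_chat = [f"message {i}" for i in range(25)]
print(maybe_compact(long_chat, fake_summarize))
```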
    06

    Cap output length

    Set a max_tokens ceiling and ask explicitly for "the shortest correct answer." Output is the expensive side of the bill — short answers compound.

    20–60% on output · When: drafting / generation tasks
    07

    Tool use over re-prompting

    Let the model fetch what it needs (search, database query, file read) instead of dumping the whole haystack into context up front. You pay for the needle, not the haystack.

    30–80% on input · When: agentic workflows, RAG

    Org rollout

    What changes when you meter the whole org

    The levers above are personal. Here are the ones a CFO/COO gets that an individual doesn't.

    Per-seat vs. metered billing

    Per-seat is predictable but undercharges power users and overcharges the long tail. Metered is fair but harder to budget. The mature answer: per-seat floor with metered overage above a usage threshold. Most provider dashboards now expose both views.

    Budget caps by team and product

    Every major provider supports per-key spend caps. Set them. The number-one cause of a runaway bill is an unmonitored agent looping on retries — caps turn a five-figure incident into a two-figure one.
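
    Provider-side caps are the backstop. A client-side guard inside the retry loop can stop a runaway agent even earlier. The sketch below is one possible shape, not any provider's API: `SpendCap`, `BudgetExceeded`, and `run_with_retries` are hypothetical names, and the per-call cost is an estimate you pass in.

```python
class BudgetExceeded(RuntimeError):
    """Raised when the next call would push spend past the cap."""


class SpendCap:
    def __init__(self, limit_usd):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd):
        """Record a charge, refusing it if it would exceed the limit."""
        if self.spent_usd + cost_usd > self.limit_usd:
            raise BudgetExceeded(
                f"cap ${self.limit_usd:.2f} reached (spent ${self.spent_usd:.2f})"
            )
        self.spent_usd += cost_usd


def run_with_retries(call, cap, cost_per_call, max_attempts=5):
    """Retry a flaky call, checking the budget before every attempt."""
    for _ in range(max_attempts):
        cap.charge(cost_per_call)  # refuse before spending, not after
        try:
            return call()
        except ConnectionError:
            continue  # transient failure: try again if budget allows
    raise RuntimeError("all attempts failed")


cap = SpendCap(limit_usd=1.00)
print(run_with_retries(lambda: "ok", cap, cost_per_call=0.02), cap.spent_usd)
```

    Because the budget check sits outside the `try`, a `BudgetExceeded` error ends the loop instead of being retried.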

    Model routing

    Build (or buy) a router that sends easy queries to a cheap model and escalates hard ones. Even a crude classifier — based on input length, presence of code, or a small first-pass model — routinely cuts spend 50–70% with no observable quality drop.
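
    A crude router of the kind described above fits in a dozen lines. This sketch uses only input length and a rough check for code. The model names are placeholders, and the 8,000-character threshold and the code-detection pattern are illustrative assumptions you would tune against your own traffic.

```python
import re

# Rough signs of code: a triple-backtick fence, Python keywords, or lines ending in { } ;
CODE_HINT = re.compile(r"`{3}|\bdef |\bclass |[{};]\s*$", re.MULTILINE)


def route(prompt, cheap="small-model", flagship="flagship-model", long_prompt_chars=8_000):
    """Send long or code-heavy prompts to the flagship, everything else to the cheap model."""
    if len(prompt) > long_prompt_chars or CODE_HINT.search(prompt):
        return flagship
    return cheap


print(route("Classify this ticket: printer is broken"))
print(route("Fix this:\ndef f(x):\n    return x +"))
```

    The heuristic will misroute some prompts. Log the routing decision next to a quality signal so you can see when the cheap path is failing and adjust.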

    Internal chargebacks

    Tag every API key with the team, product, or feature that owns it. Allocate the bill the same way you allocate AWS. Without this, AI cost becomes a single fuzzy line item nobody owns — which is how you end up with a $200k surprise.

    Anomaly detection

    Watch for: sudden token spikes (loops or prompt injection), output runs longer than normal (broken stop conditions), spikes in flagship-model usage (someone hardcoded the wrong model). A simple daily alert beats a postmortem.
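
    A simple daily alert can be as small as comparing today's number against a trailing average. In this sketch, the metric names, the 3× factor, and the 7-day window are all illustrative assumptions. The input is whatever daily usage export your provider or gateway gives you.

```python
from statistics import mean


def is_spike(history, today, factor=3.0, min_days=7):
    """True if today's value exceeds `factor` times the trailing average."""
    if len(history) < min_days:
        return False  # not enough data for a baseline yet
    baseline = mean(history[-min_days:])
    return baseline > 0 and today > factor * baseline


def daily_alerts(metrics_history, metrics_today):
    """Names of metrics that spiked today."""
    return [
        name
        for name, today in metrics_today.items()
        if is_spike(metrics_history.get(name, []), today)
    ]


history = {
    "total_tokens": [100_000] * 7,
    "output_tokens_per_request": [400] * 7,
    "flagship_requests": [50] * 7,
}
today = {"total_tokens": 500_000, "output_tokens_per_request": 450, "flagship_requests": 900}
print(daily_alerts(history, today))
```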

    Vendor negotiation

    At ~$10–20k/month committed spend, most providers will negotiate. Ask for: committed-use discounts, dedicated capacity (no rate limits), a business associate agreement (BAA) for healthcare, data residency, and a named technical account manager (TAM). Multi-vendor leverage helps, even if you don't end up multi-vendor.

    Cost of governance vs. cost of tokens

    The tightest possible metering creates friction that suppresses adoption — and the productivity tax of slow AI adoption dwarfs the token bill. For most orgs under $50k/month, the right policy is generous defaults + tagged spend visibility. Save tight controls for the runaway-cost surface area (autonomous agents, public-facing endpoints).

    Reference

    Glossary

    The words your team uses, in your language.