GPT-5.4 Mini API: Pricing, Context & Production Use

The GPT-5.4 Mini API has been out since March 17. I’ve been routing real traffic through it for about three months, alongside three other models. This piece is the work note on where it actually fits.

Dora here. A clarification before anything else, because I keep seeing people confused about it: GPT-5.4 is no longer OpenAI’s frontier. GPT-5.5 is. We covered the 5.5 launch in a separate piece, and the pre-launch 5.4 leak pieces from March are by definition stale. What I’m writing here is narrower — mini and nano as the low-cost routing tier, which is the role they’re settling into.

If you’re building model routing for production AI workloads, that distinction matters more than the spec sheet does.

Where GPT-5.4 Mini Fits

Released March 17, 2026 as a fast / low-cost variant

OpenAI shipped GPT-5.4 mini and nano on March 17 as smaller siblings of the GPT-5.4 release a couple of weeks earlier. The framing in the announcement is honest about what they are: distilled, cheaper, faster versions for high-volume work. More than 2× faster than GPT-5 mini, per OpenAI’s numbers. (Their numbers, their conditions. I haven’t run the controlled benchmark — but in actual workload latency, faster matches what I see.)

The bigger question for builders isn’t “is mini good” — it’s “good for what.” That’s the part the announcement underplays and the routing question makes obvious.

API access (mini also on ChatGPT free tier)

Three places to use GPT-5.4 mini: the OpenAI API directly, inside ChatGPT (including the free tier; yes, the free tier), and through aggregators that route to OpenAI’s endpoint. GitHub Copilot also picked it up on day one: their changelog landed March 17 alongside the OpenAI announcement.

Nano is API-only. No ChatGPT surface for it. Worth knowing if you’re tempted to point users at nano directly: you can’t. They only ever reach it through the API integration you build around it.

Pricing & Context

Input/output rates & cached input

Numbers as of publication date, from OpenAI’s official model page. These shift, so check before you commit:

GPT-5.4 mini: $0.75 per 1M input, $4.50 per 1M output, $0.075 per 1M cached input
GPT-5.4 nano: $0.20 per 1M input, $1.25 per 1M output

The cached input rate is the one I’d actually plan around. $0.075 is one-tenth of the standard input rate (a 90% discount), and for any workload with a stable system prompt or repeated context (almost every agent, most chat interfaces, anything RAG-shaped), caching ends up doing the heavy lifting on cost. The headline GPT-5.4 mini pricing is the worst case, not the typical case.

One footnote: regional processing (data residency) endpoints carry a 10% uplift. Not huge, but worth modeling if you’re routing through EU or other regional surfaces.
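To make the caching effect concrete, here is a minimal cost sketch using only the rates quoted above. The function name, the `cached_fraction` parameter, and the choice to apply the 10% regional uplift to the whole bill are my assumptions, not OpenAI’s billing logic. No cached rate is quoted for nano, so the sketch falls back to its full input rate.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rates:
    """USD per 1M tokens, copied from the figures quoted in this post."""
    input_per_m: float
    output_per_m: float
    cached_input_per_m: Optional[float] = None  # None: no cached rate quoted


MINI = Rates(input_per_m=0.75, output_per_m=4.50, cached_input_per_m=0.075)
NANO = Rates(input_per_m=0.20, output_per_m=1.25)

REGIONAL_UPLIFT = 1.10  # assumption: the 10% uplift applies to the whole bill


def estimate_cost(rates: Rates, input_tokens: int, output_tokens: int,
                  cached_fraction: float = 0.0, regional: bool = False) -> float:
    """Estimate the USD cost of one call.

    cached_fraction is the share of input tokens served from the prompt
    cache; it is something you measure on your own traffic, not a setting.
    """
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    # Fall back to the full input rate when no cached rate is known.
    cached_rate = (rates.cached_input_per_m
                   if rates.cached_input_per_m is not None
                   else rates.input_per_m)
    cached_tokens = input_tokens * cached_fraction
    fresh_tokens = input_tokens - cached_tokens
    cost = (fresh_tokens * rates.input_per_m
            + cached_tokens * cached_rate
            + output_tokens * rates.output_per_m) / 1_000_000
    return cost * REGIONAL_UPLIFT if regional else cost
```

For a 10,000-token prompt with 500 output tokens on mini, this comes to about $0.00975 uncached and about $0.00435 with 80% of the input cached, which is why the cached rate is the number to plan around.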

Context window

OpenAI’s docs list 400K tokens of context, 128K max output for mini. I’ve seen aggregator pages quote different numbers (one had 1.1M, which doesn’t match the source). When in doubt, the official model page wins — and the official number is 400K.

I tested at 350K (a packed agent transcript plus tool outputs). Worked fine. Didn’t push it to the edge — at this price point I’d rather route the genuinely-long-context cases up to a frontier model than stress-test mini’s ceiling.

Best-Fit Production Workloads

High-volume, latency-sensitive tasks

This is where mini earns its slot in the routing table. The pattern that has held up across the projects I’ve put it on:

Classification, extraction, light reformatting — anything where you need a structured answer fast and the reasoning is one or two steps. Mini handles it at a fraction of frontier cost.
Long chat sessions with simple turns — when 80% of turns don’t need heavy reasoning, paying full GPT-5.5 rates on all of them is just wrong.
High-fanout subtasks — generate 50 variants of something, score 200 retrieved docs, etc. The unit cost difference compounds fast.

Where I’ve watched it fall over: anything that needs deep multi-step planning, or where the model has to decide what to do rather than execute a well-specified step. (I had a workflow where mini was the planner. Three days. Switched it. Don’t be me.)

Tool use & agent subtasks

Worth flagging because it’s the part that surprised me. Per the announcement, on OSWorld-Verified — a computer-use benchmark — mini approaches the full GPT-5.4 and substantially beats GPT-5 mini. In actual use I’d describe it as: reliable at executing tool calls once a plan exists, less reliable at deciding which tool to reach for in an ambiguous spot.

So the pattern that works:

Frontier model (GPT-5.5 or another) plans and decides.
Mini executes the steps — calls the tools, parses the results, hands back to the planner.

OpenAI calls this “subagents in Codex.” The general shape is older than that term — it’s just the standard heavy-planner / cheap-executor split. Mini is unusually good at the executor seat.
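The planner / executor split can be sketched in a few lines. This is a shape, not an implementation: `plan` and `execute` are hypothetical callables that you would back with a frontier model and with mini respectively, and the `max_steps` guard is my own addition.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces: the planner turns a goal into well-specified
# steps, the executor carries out exactly one step and returns its result.
Planner = Callable[[str], List[str]]
Executor = Callable[[str], str]


def run_task(goal: str, plan: Planner, execute: Executor,
             max_steps: int = 20) -> List[Tuple[str, str]]:
    """Ask the planner for steps, then run each step on the cheap executor."""
    steps = plan(goal)
    if len(steps) > max_steps:
        # Guard against a runaway plan fanning out into unbounded calls.
        raise ValueError(f"plan has {len(steps)} steps, limit is {max_steps}")
    results = []
    for step in steps:
        results.append((step, execute(step)))
    # The caller hands these (step, result) pairs back to the planner.
    return results
```

In a real loop the planner would read the returned pairs and decide whether to re-plan. The point of the split is that only `plan` ever has to decide what to do.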

Routing Mini in a Multi-Model Setup

When to escalate to a frontier model

Routing is the whole game with mini. Use it everywhere blindly and you’ll feel reasoning failures on hard turns. Don’t use it at all and you’ll burn money on easy turns. The escalation rules I use, roughly in order of importance:

Escalate on plan-shaped questions. Anything that requires choosing a strategy, decomposing an ambiguous goal, or weighing trade-offs. Mini is fine when it knows what to do. It struggles when it has to figure out what to do.
Escalate on inputs >272K tokens. Not because mini can’t take 400K — it can — but because once your prompts are that large, the workload usually involves cross-document reasoning that benefits from a frontier model. (GPT-5.5 also charges 2× input above 272K, so the cost picture changes there too.)
Escalate on high-stakes single calls. If the answer matters and there’s no human review downstream, pay the extra. The cost delta on one call is irrelevant; the cost of being wrong isn’t.
Don’t escalate just because the question “feels hard.” That’s how mini ends up underused. Many “feels hard” questions are actually well-specified and mini handles them. Test before assuming.

A practical setup: have a cheap classifier (mini itself, or nano) decide the route. It’s not perfect, but it’s better than routing everything to frontier or routing everything to mini.
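Written down as code, the escalation rules above look roughly like this. The `Task` fields and the route labels are hypothetical names for illustration. In practice the `needs_planning` flag would come from the cheap classifier, so it is only as good as that classifier.

```python
from dataclasses import dataclass

LONG_INPUT_THRESHOLD = 272_000  # tokens; the escalation cutoff from the rules above


@dataclass
class Task:
    input_tokens: int
    needs_planning: bool = False  # set by your classifier (mini or nano)
    high_stakes: bool = False     # being wrong is expensive
    human_review: bool = True     # someone checks the output downstream


def choose_route(task: Task) -> str:
    """Return "frontier" or "mini", applying the rules in order of importance."""
    if task.needs_planning:
        return "frontier"  # plan-shaped: strategy, decomposition, trade-offs
    if task.input_tokens > LONG_INPUT_THRESHOLD:
        return "frontier"  # very large prompts tend to need cross-document reasoning
    if task.high_stakes and not task.human_review:
        return "frontier"  # the cost of a wrong answer outweighs the cost delta
    return "mini"          # default: well-specified work stays on mini
```

Note what is missing: there is no “feels hard” rule. Anything that doesn’t trip one of the three explicit checks stays on mini until testing says otherwise.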

Limits & Trade-offs

A few real ones I’ve hit, listed for what they are:

Nano is meaningfully weaker than mini. The pricing gap suggests they’re in the same conversation. They’re not. Nano works for very narrow tasks (cheap classification, sub-step routing). For anything that needs even modest reasoning, mini wins by a wider margin than the price ratio suggests. Don’t reach for nano on cost alone.
Context window vs context usability. 400K is the ceiling. The model is still better at staying coherent on the first 100K than the last 100K, same as nearly every large-context model. Plan your prompt accordingly.
The mini in the OpenAI API is also the mini in ChatGPT free. This matters less for builders and more for product positioning — if you’re building something users could just go do in ChatGPT for free, the differentiation has to come from your application, not from access to the model.
GPT-5.4 is no longer frontier. I noted this up top but it’s worth repeating in the limits section. Don’t pitch a product around “powered by GPT-5.4” as if it were cutting-edge — anyone paying attention knows it isn’t. The honest pitch is the routing logic, not the model name.

I’d also flag the boring thing: API behavior shifts. Pin model snapshots if you care about reproducibility. The floating aliases (gpt-5.4-mini) will silently move to newer snapshots over time.
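One cheap guard is to have your config loader refuse floating aliases. The sketch below assumes snapshot IDs end in a date suffix (YYYY-MM-DD), the way earlier OpenAI snapshots such as gpt-4o-2024-08-06 did. I haven’t confirmed the exact snapshot names for mini, so look them up on the model page. The function names are mine.

```python
import re

# Assumption: a pinned snapshot ID ends in -YYYY-MM-DD,
# e.g. gpt-4o-2024-08-06. A bare alias like gpt-5.4-mini does not.
_DATED_SUFFIX = re.compile(r"-\d{4}-\d{2}-\d{2}$")


def is_pinned(model_id: str) -> bool:
    """True if the model ID looks like a dated snapshot, not a floating alias."""
    return bool(_DATED_SUFFIX.search(model_id))


def require_pinned(model_id: str) -> str:
    """Return the ID unchanged, or raise if it is a floating alias."""
    if not is_pinned(model_id):
        raise ValueError(
            f"{model_id!r} looks like a floating alias; pin a dated snapshot"
        )
    return model_id
```

Calling `require_pinned` at startup turns a silent behavior change into a loud configuration error.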

FAQ

Is GPT-5.4 Mini only available through the API or also in ChatGPT?

Both. The GPT-5.4 Mini API is the developer surface. The same model also runs in ChatGPT, including the free tier. Nano is API-only.

What’s the real context window size for GPT-5.4 Mini?

400K tokens of context, 128K max output, per OpenAI’s official docs. Some aggregator pages list other numbers; when they conflict, OpenAI’s model page wins.

Does GPT-5.4 Mini support tool use and multimodal inputs?

Yes. Text and image inputs, plus function calling, web search, file search, computer use, and skills via the Responses API. Text output only. Strong on tool execution; less strong on deciding which tool to reach for under ambiguity.

When should I route a task to GPT-5.5 instead of using Mini?

When the task requires planning rather than execution, when the input exceeds ~272K tokens, when the single-call answer quality matters more than the cost delta, or when you’re seeing reasoning failures that you can attribute to model capability rather than prompt shape. For everything else, mini.

How do I decide between GPT-5.4 Mini and other models in a routing setup?

Run a small holdout of real production traffic through each candidate model. Measure cost, latency, and a task-specific quality metric you actually care about — not a generic benchmark. Then route accordingly. The decision is empirical and per-workload; no general rule survives contact with your data.
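A minimal sketch of that comparison, assuming you have already logged one record per call. The `Sample` fields and the `summarize` name are hypothetical. The quality score is whatever task-specific metric you chose, and p95 latency uses the simple nearest-rank method.

```python
import math
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class Sample:
    model: str
    cost_usd: float
    latency_s: float
    quality: float  # your task-specific score, e.g. 0.0 to 1.0


def summarize(samples: List[Sample]) -> Dict[str, Dict[str, float]]:
    """Aggregate holdout results per model: cost, p95 latency, quality."""
    by_model: Dict[str, List[Sample]] = {}
    for s in samples:
        by_model.setdefault(s.model, []).append(s)

    report = {}
    for model, rows in by_model.items():
        latencies = sorted(r.latency_s for r in rows)
        # Nearest-rank percentile: smallest value covering 95% of the samples.
        p95_index = max(0, math.ceil(0.95 * len(latencies)) - 1)
        report[model] = {
            "n": len(rows),
            "mean_cost_usd": mean(r.cost_usd for r in rows),
            "p95_latency_s": latencies[p95_index],
            "mean_quality": mean(r.quality for r in rows),
        }
    return report
```

Read the report per workload, not overall. A model that loses on mean quality across all traffic can still be the right route for the slice where it matches the frontier model at a fraction of the cost.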

Conclusion

The interesting thing about the GPT-5.4 Mini API isn’t its power. It’s that it’s the right model to put at the executor seat of a routing setup — the cheap, fast layer that does the bulk of the work while a frontier model handles the small fraction of turns where capability matters most.

If your stack is still single-model — one model handles everything — you’re either overpaying on easy turns or underperforming on hard ones. Or both. The thing mini is good at isn’t being the smartest model in the room. It’s being the cheapest model that’s smart enough for most of the room.

What I’d actually do before adding it to a routing layer:

Run a week of real traffic through it on the workloads you’d give it. Measure cost, latency, and quality on whatever metric your product actually depends on. Pin the snapshot. Build the escalation rule before you ship, not after.

Three months is enough to say mini holds up in production. Not enough to say anything about long-term price stability — that’s on OpenAI.

Verify every number here against the docs the day you actually build.

More to come.
