GPT-5.4 Mini Pricing: Input, Cached & Output Cost

GPT-5.4 Mini pricing explained: input, cached input, and output token costs, and why small models cut high-volume API bills.

By Dora 9 min read

Hello, it’s Dora. I have been routing a high-volume classification workload through OpenAI for a few weeks. The question I kept getting from finance was the one builders ask whenever a “mini” model ships: does the math actually work, or does the cheaper per-token rate get eaten by something else?

This piece is for anyone running the same calculation on GPT-5.4 Mini pricing right now — cost-aware builders, finance-adjacent eng leads. I will break down input, cached input, and output rates, walk through where the unit economics work, and flag where Mini quietly costs more than the headline suggests. All rates come from OpenAI’s official API pricing page, verified as of publication date — these change, so check before quoting.

This is not the integration guide. Routing and the API contract live in a separate piece. This one is about money.

GPT-5.4 Mini Cost Breakdown

Input vs output rates

GPT-5.4 Mini is priced at $0.75 per million input tokens and $4.50 per million output tokens. The 6x output multiplier is the part most people internalize too late. Input is cheap. Output is what burns budget.

For context, where Mini sits in the GPT-5.4 family as of publication date:

| Model | Input ($/M) | Cached Input ($/M) | Output ($/M) |
| --- | --- | --- | --- |
| GPT-5.5 | $5.00 | $0.50 | $30.00 |
| GPT-5.4 | $2.50 | $0.25 | $15.00 |
| GPT-5.4 Mini | $0.75 | $0.075 | $4.50 |

Mini is roughly 3.3x cheaper than GPT-5.4 and 6.7x cheaper than GPT-5.5. That ratio is the whole story when you decide whether a workload belongs on Mini.

Standard rates apply to context windows under ~270K input tokens. Above that, OpenAI applies 2x input and 1.5x output pricing to the full session — including under Batch and Flex tiers. The single most expensive surprise on the platform.
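
To make the long-context surcharge concrete, here is a minimal sketch of a per-request cost estimate. It assumes the threshold is exactly 270,000 input tokens (the text only says "~270K") and treats a single request as the whole session; the function and constant names are mine, not OpenAI's.

```python
# Rates in USD per million tokens, taken from the pricing table above.
MINI_INPUT_PER_M = 0.75
MINI_OUTPUT_PER_M = 4.50

# Assumptions: the real threshold is approximate, and a "session" is one request here.
LONG_CONTEXT_THRESHOLD = 270_000
LONG_INPUT_MULTIPLIER = 2.0
LONG_OUTPUT_MULTIPLIER = 1.5


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one uncached GPT-5.4 Mini request."""
    input_rate = MINI_INPUT_PER_M
    output_rate = MINI_OUTPUT_PER_M
    if input_tokens > LONG_CONTEXT_THRESHOLD:
        # The surcharge applies to every token, not just the overflow.
        input_rate *= LONG_INPUT_MULTIPLIER
        output_rate *= LONG_OUTPUT_MULTIPLIER
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


print(f"200K in / 1K out: ${request_cost(200_000, 1_000):.4f}")
print(f"300K in / 1K out: ${request_cost(300_000, 1_000):.4f}")
```

Because the multiplier hits the full request, crossing the threshold raises the bill in a step, not gradually.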

Cached input discount

Cached input on GPT-5.4 Mini is $0.075 per million tokens — a 90% reduction off the standard input rate, matching the pattern across the GPT-5.4 and GPT-5.5 families. The cache is automatic: no API flag, no code change. If your request reuses a prefix OpenAI has already computed, those tokens are billed at the cached rate.

The rules that matter:

  • The prefix must be byte-for-byte stable. A timestamp in your system prompt kills the cache.
  • The prefix needs to be long enough (around 1,024 tokens minimum).
  • The cache expires after several minutes of inactivity.
  • Output pricing does not change.

For RAG applications with large stable system prompts, cache hit rates above 70% are realistic. For single-turn workloads with no shared preamble, the cache barely helps. The savings are real but conditional, which is why I never quote them as a flat percentage to finance.
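
One way to show finance the conditional nature of the discount is a blended input rate. This is a rough sketch: `cached_fraction` is my own parameter and means the share of input tokens billed at the cached rate, which is not the same as the share of requests that hit the cache.

```python
INPUT_PER_M = 0.75          # USD per million uncached input tokens
CACHED_INPUT_PER_M = 0.075  # USD per million cached input tokens


def effective_input_rate(cached_fraction: float) -> float:
    """Blend the cached and uncached input rates by token share."""
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    return (cached_fraction * CACHED_INPUT_PER_M
            + (1.0 - cached_fraction) * INPUT_PER_M)


for fraction in (0.0, 0.3, 0.7):
    rate = effective_input_rate(fraction)
    saving = 1.0 - rate / INPUT_PER_M
    print(f"{fraction:.0%} cached: ${rate:.4f}/M input ({saving:.0%} off)")
```

Even with 70% of input tokens cached, the input line falls by about 63%, not 90%, and the output line does not move at all.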

Why Small Models Win on High-Volume

Cost-per-task math vs frontier

The right unit is not cost per token. It is cost per successful task. A representative workload — classifying 1M short support tickets, 800 input tokens and 200 output each, no caching:

| Model | Input cost | Output cost | Total |
| --- | --- | --- | --- |
| GPT-5.5 | $4,000 | $6,000 | $10,000 |
| GPT-5.4 | $2,000 | $3,000 | $5,000 |
| GPT-5.4 Mini | $600 | $900 | $1,500 |

Based on OpenAI’s standard pricing as of June 2026. Example calculation only — actual prices may vary by region, cache hit rate, volume discounts, or other factors. Always check the official OpenAI pricing page before quoting.
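
The table above can be reproduced with a few lines of arithmetic. A minimal sketch, using the rates quoted in this post; the dictionary keys are display labels, not API model identifiers.

```python
# (input, output) rates in USD per million tokens, as quoted in this post.
RATES = {
    "GPT-5.5": (5.00, 30.00),
    "GPT-5.4": (2.50, 15.00),
    "GPT-5.4 Mini": (0.75, 4.50),
}


def workload_cost(model: str, requests: int, input_tokens: int, output_tokens: int):
    """Return (input_cost, output_cost, total) in USD, with no caching."""
    input_rate, output_rate = RATES[model]
    input_cost = requests * input_tokens * input_rate / 1_000_000
    output_cost = requests * output_tokens * output_rate / 1_000_000
    return input_cost, output_cost, input_cost + output_cost


# 1M tickets, 800 input tokens and 200 output tokens each.
for model in RATES:
    inp, out, total = workload_cost(model, 1_000_000, 800, 200)
    print(f"{model}: input ${inp:,.0f}, output ${out:,.0f}, total ${total:,.0f}")
```

Swap in your own request count and token averages to get the same three-way comparison for your workload.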

Mini comes in at 15% of GPT-5.5’s bill. The question is whether Mini’s accuracy on your task is close enough that the savings are real and not deferred to re-work, escalations, or human review.

I keep an eval set on every workload I move to Mini. If quality drops below threshold, the spreadsheet stops mattering.

Batch & Flex discounts

Two further levers:

  • Batch API: flat 50% discount on input and output, asynchronous processing within 24 hours. On Mini: $0.375 input and $2.25 output per million. Most batches complete in 1–6 hours.
  • Flex pricing: also 50% off, with variable latency rather than asynchronous queuing. Useful for non-user-facing internal tools.
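
As a quick check on the discounted rates, here is a sketch that applies the 50% discount to Mini's standard rates and re-prices the 1M-ticket example. It assumes the discount applies uniformly to input and output, as described above; the helper names are mine.

```python
STANDARD_RATES = {"input": 0.75, "output": 4.50}  # USD per million tokens


def apply_discount(rates: dict, discount: float) -> dict:
    """Scale every rate by (1 - discount)."""
    return {kind: rate * (1.0 - discount) for kind, rate in rates.items()}


def workload_total(rates: dict, requests: int, input_tokens: int, output_tokens: int) -> float:
    """Total USD cost of a workload at the given per-million rates."""
    input_cost = requests * input_tokens * rates["input"] / 1_000_000
    output_cost = requests * output_tokens * rates["output"] / 1_000_000
    return input_cost + output_cost


batch_rates = apply_discount(STANDARD_RATES, 0.50)
print(batch_rates)
print(f"Standard: ${workload_total(STANDARD_RATES, 1_000_000, 800, 200):,.0f}")
print(f"Batch:    ${workload_total(batch_rates, 1_000_000, 800, 200):,.0f}")
```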

Verify exact billing on a small test batch before committing a large workload — the OpenAI developer pricing docs are the source of truth.

Hidden Cost Factors

Output token sprawl

The failure mode I see most often, and it has nothing to do with the model.

Output tokens cost 6x input tokens on Mini. If you do not constrain output length, a model that occasionally returns 2,000 tokens when 200 would do is silently charging you 10x the output cost on those calls. Fixes are boring:

  • Set max_tokens on every call.
  • Use structured outputs or strict JSON modes where the schema bounds the response.
  • For classification, return a label. Not an explanation. A label.

I caught one workload where Mini was averaging 480 output tokens because the system prompt asked for “a brief justification.” Brief, apparently, means whatever the model wants. After removing the justification field and adding max_tokens, output dropped to 12 tokens. The bill dropped accordingly.
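
To put a number on that drop, here is a small sketch of the output line at 480 versus 12 average output tokens. The request volume is hypothetical (I reuse the 1M figure from the earlier example); only the token averages and the rate come from this post.

```python
MINI_OUTPUT_PER_M = 4.50  # USD per million output tokens


def output_cost(requests: int, avg_output_tokens: float,
                rate_per_m: float = MINI_OUTPUT_PER_M) -> float:
    """USD spent on output tokens for a given volume and average length."""
    return requests * avg_output_tokens * rate_per_m / 1_000_000


REQUESTS = 1_000_000  # hypothetical volume, for illustration only

before = output_cost(REQUESTS, 480)  # with the "brief justification" field
after = output_cost(REQUESTS, 12)    # label only, max_tokens set
print(f"Before: ${before:,.0f}  After: ${after:,.0f}  Ratio: {before / after:.0f}x")
```

The ratio does not depend on volume: the output line scales directly with average output length.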

Output sprawl eats any savings the cheaper per-token rate gives you. First thing to audit.

Caching requires stable prefixes

A “consistent system prompt” in code review terms is not the same as “byte-for-byte stable.” If your system prompt includes the current date, a user-specific personalization field at the top, a retrieved document block that varies per request, or any A/B variant — caching is killed for that prefix.

The cache only applies to the longest shared prefix from the start of the input. Fix is structural: pin stable content to the beginning, put variables at the end.
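
The structural fix can be sketched as a prompt builder plus a crude shared-prefix check. This is illustrative only: the function names are hypothetical, and the check counts characters, whereas the real cache works on tokens.

```python
def render_prompt(stable_instructions: str, request_date: str, stable_first: bool) -> str:
    """Assemble a prompt with the per-request date either last or first."""
    if stable_first:
        return f"{stable_instructions}\nDate: {request_date}"
    return f"Date: {request_date}\n{stable_instructions}"


def shared_prefix_length(a: str, b: str) -> int:
    """Count the leading characters two prompts have in common."""
    count = 0
    for char_a, char_b in zip(a, b):
        if char_a != char_b:
            break
        count += 1
    return count


instructions = "Classify the support ticket into one of the allowed labels. " * 20

for stable_first in (False, True):
    monday = render_prompt(instructions, "2026-06-01", stable_first)
    tuesday = render_prompt(instructions, "2026-06-02", stable_first)
    shared = shared_prefix_length(monday, tuesday)
    print(f"stable_first={stable_first}: {shared} of {len(monday)} chars shared")
```

With the date first, the two prompts diverge almost immediately; with the date last, the whole instruction block stays reusable.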

This sounds obvious. I have still seen production systems where caching was assumed to be active and was not. Check your cache hit rate. Do not assume.
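
A rough way to check is to aggregate the usage data you already log. The sketch below assumes each logged record is a dict shaped like the usage object on Chat Completions responses, with `prompt_tokens` and `prompt_tokens_details.cached_tokens`; verify those field names against your SDK version before relying on it.

```python
def cached_token_fraction(usage_records: list) -> float:
    """Share of prompt tokens that were billed at the cached rate."""
    total_prompt_tokens = 0
    cached_tokens = 0
    for usage in usage_records:
        total_prompt_tokens += usage.get("prompt_tokens", 0)
        # Assumed field layout; records without details count as zero cached.
        details = usage.get("prompt_tokens_details") or {}
        cached_tokens += details.get("cached_tokens", 0)
    if total_prompt_tokens == 0:
        return 0.0
    return cached_tokens / total_prompt_tokens


# Made-up sample records, for illustration only.
sample = [
    {"prompt_tokens": 2000, "prompt_tokens_details": {"cached_tokens": 1024}},
    {"prompt_tokens": 2000, "prompt_tokens_details": {"cached_tokens": 0}},
    {"prompt_tokens": 1000},
]
print(f"Cached share of input tokens: {cached_token_fraction(sample):.1%}")
```

If this number sits near zero on a workload you assumed was cached, the prefix is changing somewhere.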

Who Should Use Mini for Cost

Best-fit vs cases where it underperforms

Mini works well for high-volume classification, tagging, routing decisions, summarization of structured input, first-pass extraction where a frontier model reviews edge cases, and bounded chat. Anything where you can constrain output length aggressively.

Mini underperforms on multi-step reasoning across long chains, on outputs where quality determines user trust (legal drafting, customer-facing analysis), on tasks where edge cases are common and expensive to miss, and where the eval gap between Mini and GPT-5.4 on your task is measurable.

The honest test is your own eval set on your own data. If Mini’s accuracy is within an acceptable band of GPT-5.4, the savings are real. If not, you are paying for a discount in re-work, support tickets, and user friction — which never shows up on the API bill but absolutely shows up somewhere.
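
Here is one way to put the off-bill cost next to the API cost. The success rates and the cost of handling a failed ticket below are hypothetical, chosen only to show the mechanics; the per-ticket API costs follow from the 800-in / 200-out example.

```python
def expected_cost_per_task(api_cost: float, success_rate: float, failure_cost: float) -> float:
    """API cost plus the expected downstream cost of the tasks that fail."""
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must be between 0 and 1")
    return api_cost + (1.0 - success_rate) * failure_cost


# Per-ticket API cost: $1,500 and $5,000 per 1M tickets in the earlier example.
MINI_API_COST = 0.0015
GPT54_API_COST = 0.0050

# Hypothetical inputs: replace with your own eval results and review costs.
mini = expected_cost_per_task(MINI_API_COST, success_rate=0.95, failure_cost=0.25)
gpt54 = expected_cost_per_task(GPT54_API_COST, success_rate=0.97, failure_cost=0.25)
print(f"Mini: ${mini:.4f} per ticket, GPT-5.4: ${gpt54:.4f} per ticket")
```

With these made-up inputs, a two-point accuracy gap is enough to make Mini the more expensive option per ticket, which is why the eval comes before the spreadsheet.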

I audit every few weeks on workloads that have been on Mini for a while. Drift is real. The OpenAI API pricing comparison breakdowns keep an updated view if you need to re-evaluate routing.

FAQ

How much does GPT-5.4 Mini actually cost compared to GPT-5.5?

Roughly one-seventh per token. GPT-5.4 Mini is $0.75 input and $4.50 output per million tokens; GPT-5.5 is $5.00 input and $30.00 output. At equivalent volume with no caching, Mini costs about 15% of GPT-5.5. Whether that translates to actual savings depends on whether Mini handles your task without quality regression. See OpenAI’s GPT-5.5 model page for the cross-model pricing table.

Does cached input pricing apply to GPT-5.4 Mini?

Yes. Cached input on GPT-5.4 Mini is $0.075 per million tokens, a 90% discount off the standard $0.75 rate. Automatic when your prompt reuses a stable prefix OpenAI has already computed. The cache expires after several minutes of inactivity and requires the prefix to be long enough (around 1,024 tokens). Output pricing is unaffected.

Is there a batch discount for GPT-5.4 Mini?

Yes. The Batch API gives a flat 50% discount on both input and output for GPT-5.4 Mini, with processing within 24 hours. Effective batch rates are $0.375 input and $2.25 output per million tokens. Flex pricing offers a similar 50% discount with variable latency instead of asynchronous queuing.

How do I estimate the real cost of a high-volume workload on Mini?

Three numbers: tokens in per request, tokens out per request, request count per month. Multiply by the relevant rate. For accuracy, sample 100 real requests, measure actual token counts, and check your cache hit rate. Calculators give you a ceiling. Sampled data gives you reality.
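
The sampling approach can be sketched as follows. The function and parameter names are my own, the defaults are the Mini rates quoted in this post, and `cached_fraction` is the share of input tokens billed at the cached rate.

```python
from statistics import mean


def estimate_monthly_cost(samples, requests_per_month, cached_fraction=0.0,
                          input_rate=0.75, cached_rate=0.075, output_rate=4.50):
    """Project monthly USD cost from sampled (input_tokens, output_tokens) pairs."""
    avg_input = mean(tokens_in for tokens_in, _ in samples)
    avg_output = mean(tokens_out for _, tokens_out in samples)
    # Blend cached and uncached input rates by token share.
    blended_input = cached_fraction * cached_rate + (1.0 - cached_fraction) * input_rate
    per_request = (avg_input * blended_input + avg_output * output_rate) / 1_000_000
    return per_request * requests_per_month


# Made-up sample; in practice use ~100 real requests with measured token counts.
sample = [(780, 190), (820, 210), (800, 200)]
print(f"No caching:  ${estimate_monthly_cost(sample, 1_000_000):,.2f}")
print(f"50% cached:  ${estimate_monthly_cost(sample, 1_000_000, cached_fraction=0.5):,.2f}")
```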

When is GPT-5.4 Mini too weak for the task even if it’s cheaper?

When your eval set shows measurable quality regression compared to GPT-5.4 or GPT-5.5 on the outputs that matter. Common triggers: multi-step reasoning, instruction-following on long prompts, edge cases where a wrong answer is expensive, outputs that need to read as polished writing. Savings are not real if they push work into manual review or user-facing errors. Run the eval first.

Conclusion

GPT-5.4 Mini pricing is straightforward on paper: $0.75 input, $0.075 cached input, $4.50 output, per million tokens. Where the numbers get interesting is in cache hit rates, output token discipline, batch routing, and whether the model is actually good enough for the task.

Cheaper per-token rates are necessary but not sufficient. Output sprawl, broken caching, and quality regressions can each independently erase the savings. The math only works if you do the math on your own workload.

I am running one more migration this month — a summarization pipeline currently on GPT-5.4 that I think will hold up on Mini with tighter output limits. Cost projection says 70% savings. We will see what the eval says. Continuing next week.
