Opus 4.8 1M Fast API: Context, Speed & Token Cost

Hey, it’s Dora here. I already have Opus 4.7 in my routing table. The question this piece answers is whether the Opus 4.8 1M Fast configuration deserves a slot in the same table, and on what conditions. If you’re running a multi-model setup in production and trying to decide whether to flip on 1M context, Fast Mode, or both, this is the breakdown.

Not a release review. Not a migration guide. Just the cost-and-latency math on the two switches that matter.

1M Context at Standard Pricing

The first thing worth understanding is what Anthropic is not charging extra for.

No long-context surcharge

Anthropic’s pricing docs confirm: Opus 4.8 includes the full 1M-token context window at standard pricing. There’s no tier shift at 200K, no cliff at 512K, no separate long-context SKU. Input bills at $5/M and output at $25/M whether your prompt is 10K or 900K.

This matters more than it looks. Most long-context models price in tiers — past some threshold, the entire request flips to a 2x rate. If you’re running models from different providers behind a single routing layer, that asymmetry is one of the more annoying things to model. With Opus 4.8 the math stays flat, which makes cost prediction across the routing table consistent.
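To make the flat-versus-tiered difference concrete, here is a minimal sketch. The $5/$25 rates are the ones quoted above; the tiered scheme (a 200K threshold and a 2x multiplier) is a hypothetical stand-in for "some other provider", not any specific vendor’s rate card.

```python
# Rates quoted in the article: USD per million tokens.
INPUT_RATE_PER_M = 5.00
OUTPUT_RATE_PER_M = 25.00


def flat_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD when the rate does not depend on prompt size."""
    return (input_tokens * INPUT_RATE_PER_M
            + output_tokens * OUTPUT_RATE_PER_M) / 1_000_000


def tiered_cost(input_tokens: int, output_tokens: int,
                threshold: int = 200_000, multiplier: float = 2.0) -> float:
    """Hypothetical tiered scheme: past the threshold, the whole request
    bills at a higher rate. Threshold and multiplier are assumptions."""
    base = flat_cost(input_tokens, output_tokens)
    return base * multiplier if input_tokens > threshold else base


# Same output size, growing prompt: the flat column scales linearly,
# the tiered column jumps once the prompt crosses the threshold.
for prompt_tokens in (10_000, 200_000, 900_000):
    print(prompt_tokens,
          round(flat_cost(prompt_tokens, 2_000), 2),
          round(tiered_cost(prompt_tokens, 2_000), 2))
```

The point for a routing layer is that the flat function needs no branch on prompt size, so one formula covers every route.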

The trade-off is the tokenizer. Opus 4.7 introduced a new tokenizer that Anthropic documents as producing up to 1.35x as many tokens as 4.6 for the same input. The original Opus 4.7 announcement explains the trade: the change improves performance across many tasks, at the cost of mapping the same input to roughly 1.0–1.35x the tokens. Opus 4.8 inherits the tokenizer. Independent measurement on technical content (code, JSON) lands closer to 1.4x in practice. So “flat headline price” comes with “input volume goes up.” Net cost for code-heavy workloads is meaningfully higher than the rate card suggests. Plain English prose is mostly unaffected.
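A rough way to see what the tokenizer does to the bill: keep the rate fixed and scale the token count. This is a sketch only. The inflation factors are the ones mentioned above (1.0 for prose, 1.35 documented ceiling, 1.4 from independent measurement), and the 400K baseline is a made-up example prompt.

```python
def effective_input_cost(baseline_tokens: int, inflation: float,
                         rate_per_m: float = 5.00) -> float:
    """USD input cost after the tokenizer maps the same text to
    `inflation` times as many tokens as the older tokenizer did."""
    return baseline_tokens * inflation * rate_per_m / 1_000_000


# Hypothetical prompt that measured 400K tokens on the old tokenizer.
baseline = 400_000
for label, factor in (("prose", 1.0), ("documented max", 1.35), ("measured code", 1.4)):
    print(label, round(effective_input_cost(baseline, factor), 2))
```

The rate never changes in this sketch; only the volume does, which is exactly why the rate card alone understates code-heavy workloads.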

128K max output

128K max output synchronously, 300K via the beta header for Batch. The 1M context number is input-side; output remains capped. This is a common source of “why did this fail” tickets: a long-context request that hits the output ceiling mid-generation. If you’re moving an existing workflow from a 200K model to Opus 4.8 Fast Mode or the standard 1M variant, double-check that max_tokens has been raised. The new tokenizer eats into the budget faster.
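One way to sanity-check max_tokens during a migration is to scale the old value by an assumed tokenizer inflation and clamp it to the synchronous ceiling. A minimal sketch: the 128K ceiling comes from the text above, the 1.35 default is the documented upper bound, and the helper name is hypothetical.

```python
SYNC_MAX_OUTPUT = 128_000  # synchronous output ceiling cited above


def suggested_max_tokens(old_max_tokens: int, inflation: float = 1.35,
                         ceiling: int = SYNC_MAX_OUTPUT) -> tuple[int, bool]:
    """Scale an old max_tokens setting for the new tokenizer.

    Returns (new_value, clamped). `clamped` is True when the scaled value
    would exceed the ceiling, meaning the workflow may need to split its
    output rather than just raise the limit.
    """
    scaled = int(round(old_max_tokens * inflation))
    if scaled > ceiling:
        return ceiling, True
    return scaled, False


print(suggested_max_tokens(8_000))    # small budget: scaled, not clamped
print(suggested_max_tokens(100_000))  # large budget: hits the ceiling
```

Treat the clamped flag as the signal worth alerting on; a request that needs more than the ceiling will fail regardless of how the limit is set.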

Fast Mode Explained

Fast Mode is the lever that actually changes the Opus 4.8 1M Fast decision. The Opus 4.8 Fast Mode endpoint and the standard endpoint serve the same model with the same capabilities, but with very different cost and latency profiles.

2.5x speed, research-preview status

Fast Mode runs about 2.5x faster than the standard endpoint at the same output quality. Same model weights. Same context window. What changes is throughput.

It’s a research preview on the API, gated by waitlist. Inside Claude Code, the /fast command flips the session mid-flight. On the API you need access enabled per organization. The “research preview” framing is worth taking seriously — capacity, availability windows, and exact pricing structure are still subject to change. Don’t build a production SLA around it yet.

$10/$50 (2x standard, 3x cheaper than 4.7)

This is where the math gets interesting. Claude Opus 4.8 Fast is priced at exactly 2x standard: $10 input, $50 output per million tokens. On Opus 4.7 the equivalent Fast tier was $30/$150, six times the standard rate. Anthropic dropped that to 2x with 4.8. The Claude Opus 4.8 Fast rate change is the biggest single shift in how this tier fits into a routing decision.

Three observations.

First, Fast Mode used to be a luxury tier — turn it on for demos, turn it off for production because the multiplier killed the budget. At 2x, it’s now in range to leave on for latency-sensitive routes. The economic argument flipped.

Second, the 2x multiplier applies across the full context window. There’s no separate Fast rate at 1M. So Opus 4.8 1M Fast is just standard 1M pricing × 2. Easy to model.

Third, the cost case for Fast Mode still has to clear the bar any premium tier has to clear: is the latency improvement worth more than running the same workload through a cheaper model that’s already fast enough? For many routes, the answer is still no.
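The three rate cards in this section fit in a small lookup table. A sketch using only the prices quoted above; the dictionary keys are labels I made up for illustration, not real model IDs, and the 900K-in / 4K-out request is a hypothetical long-context call.

```python
# (input USD per million tokens, output USD per million tokens), from the article.
RATES = {
    "opus-4.8-standard": (5.00, 25.00),
    "opus-4.8-fast": (10.00, 50.00),
    "opus-4.7-fast": (30.00, 150.00),
}


def request_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one request on the given tier."""
    input_rate, output_rate = RATES[tier]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# Hypothetical long-context request: 900K tokens in, 4K tokens out.
for tier in RATES:
    print(tier, round(request_cost(tier, 900_000, 4_000), 2))
```

Because the multiplier is uniform across input and output, the ratio between tiers is the same for any request shape: 2x for 4.8 Fast over standard, 6x for the old 4.7 Fast tier.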

How Cost Stacks

Multiple pricing modifiers can apply to the same request, and they don’t all compose the same way.

Fast + prompt caching + data residency multipliers

Fast Mode pricing stacks with other modifiers:

Prompt caching multipliers apply on top of Fast Mode pricing. Cache writes and cache reads are computed against the Fast base rate, not the standard rate. So a cached hit on a Fast Mode request still costs more in absolute terms than the same cached hit on standard.
Data residency multipliers apply on top of Fast Mode pricing as well. If you’re paying a regional premium for EU or other data-residency requirements, that premium is computed against the Fast rate.

The practical implication for Opus 4.8 token usage modeling: if you’re already running with caching + residency multipliers in your existing routing table, the Fast Mode case isn’t a single 2x. It’s 2x compounded with whatever multipliers you were already paying. Run the math for your actual configuration before deciding the trade-off is acceptable.

Minimum cacheable prompt length on Opus 4.8 dropped to 1,024 tokens, down from earlier thresholds. This is a small win for short-prompt agent loops where caching previously didn’t kick in.

New tokenizer (~35% more tokens)

I mentioned this above; it’s worth flagging again in the cost-stacking context. The headline rate hasn’t moved since Opus 4.5 — $5/$25. But Opus 4.7 and 4.8 both use the new tokenizer that can consume up to 35% more tokens for identical input. Anthropic’s own pricing page states this directly.

So when you’re stacking modifiers, the base unit isn’t the same as it was on 4.6. A “20% caching savings” on 4.8 is computed against an input volume that’s 30-40% larger to begin with for code-heavy workloads. If you’re benchmarking 4.8 against an older model on cost, normalize for token count, not just rate.
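Normalizing for token count can be as simple as measuring the same samples under both models and comparing totals. A sketch: the sample pairs below are invented numbers standing in for count_tokens results on your own content, and the helper names are hypothetical.

```python
def inflation_ratio(samples: list[tuple[int, int]]) -> float:
    """Ratio of new-tokenizer tokens to old-tokenizer tokens.

    Each sample is (old_count, new_count) for the same piece of content.
    Totals are compared so that large documents weigh more than small ones.
    """
    old_total = sum(old for old, _ in samples)
    new_total = sum(new for _, new in samples)
    return new_total / old_total


def normalized_cost_ratio(old_rate_per_m: float, new_rate_per_m: float,
                          inflation: float) -> float:
    """How much more the same content costs on the new model."""
    return (new_rate_per_m / old_rate_per_m) * inflation


# Invented measurements: two code-heavy files and one prose document.
samples = [(1_000, 1_380), (2_000, 2_700), (500, 520)]
ratio = inflation_ratio(samples)
print(round(ratio, 3))
print(round(normalized_cost_ratio(5.00, 5.00, ratio), 3))
```

With identical rates on both sides, the cost ratio collapses to the inflation ratio, which is the whole argument of this section in one line.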

When to Turn Fast On

The honest answer: not by default. The Claude API Fast Mode option is a tool for specific request shapes, not a global toggle. Think of it as a per-route decision, not a per-organization one.

Latency-sensitive vs cost-sensitive workloads

The cases where Fast Mode actually earns its 2x:

Interactive copilots where time-to-first-token and tokens-per-second visibly affect the user experience. The 2.5x speed difference is perceptible.
On-call summarizers, alert triage, customer-facing agents — anywhere wall-clock latency is the dominant cost.
Demo and sales paths where the model needs to feel snappy.

The cases where it’s a waste:

Batch processing. Doesn’t matter how fast the model is if you’re not waiting on it. Just use the Batch API at half price.
Background agents running unattended overnight.
Routes where a smaller, faster model would meet quality requirements anyway. Sonnet 4.6 is already much cheaper and faster. If the task doesn’t need Opus-tier capability, Fast Mode is the wrong axis to optimize on.

In a routing table, my mental model: Fast Mode is the upgrade you apply to Opus routes you’ve already justified. It’s not a substitute for routing decisions that should have happened upstream.
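That mental model translates into a short decision function. This is a sketch of my own ordering of the checks, not anything from Anthropic’s docs; the Route fields and the tier labels are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    name: str
    needs_opus: bool          # does the task need Opus-tier capability?
    latency_sensitive: bool   # is someone waiting on the response?
    is_batch: bool = False    # can the work run asynchronously?


def pick_tier(route: Route) -> str:
    """Apply the upstream routing decision first, Fast Mode last."""
    if not route.needs_opus:
        return "smaller-model"   # wrong axis to optimize with Fast Mode
    if route.is_batch:
        return "opus-batch"      # nobody is waiting; take the discount
    if route.latency_sensitive:
        return "opus-fast"       # the one case that earns the 2x
    return "opus-standard"


routes = [
    Route("copilot", needs_opus=True, latency_sensitive=True),
    Route("nightly-report", needs_opus=True, latency_sensitive=False, is_batch=True),
    Route("tagging", needs_opus=False, latency_sensitive=True),
]
for r in routes:
    print(r.name, pick_tier(r))
```

The order of the checks is the design choice: capability first, batchability second, latency last, so Fast Mode only ever applies to routes that already justified Opus.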

Where Fast is unavailable (Batch, AWS)

Two hard constraints from the docs:

Fast Mode is not available with the Batch API. Batch is asynchronous, so the latency premium has no value. Anthropic doesn’t sell it. If you want the cost optimization on non-latency-sensitive work, the Batch API is the other direction — 50% discount on input and output, but you give up real-time response.
Fast Mode is not available on Claude Platform on AWS. If your production deployment is on AWS Bedrock specifically and you’ve routed around Anthropic’s direct API for compliance or contracting reasons, Fast Mode isn’t on the menu. The standard endpoint is. Worth double-checking your deployment path before architecting around Fast.

It’s available on the direct Claude API and through other supported channels, but verify per the official Fast Mode docs before committing.
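Both constraints are easy to encode as a guard in front of the router. A minimal sketch: the multipliers come from the figures above (Batch at half price, Fast at 2x), while the platform labels "direct" and "aws" and the function name are hypothetical.

```python
# Price multipliers relative to the standard endpoint, per the article.
TIER_MULTIPLIER = {"batch": 0.5, "standard": 1.0, "fast": 2.0}


def resolve_tier(requested: str, platform: str) -> str:
    """Return the tier to actually use on a given platform.

    A request is on exactly one tier, so Batch and Fast can never be
    combined. Fast is downgraded to standard where it is unavailable.
    """
    if requested not in TIER_MULTIPLIER:
        raise ValueError(f"unknown tier: {requested}")
    if requested == "fast" and platform == "aws":
        return "standard"
    return requested


print(resolve_tier("fast", "direct"))
print(resolve_tier("fast", "aws"))
print(TIER_MULTIPLIER["fast"] / TIER_MULTIPLIER["batch"])  # price spread, Fast vs Batch
```

Modeling the tier as a single value rather than two booleans makes the Batch-plus-Fast combination unrepresentable, which is simpler than validating it away.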

Limits & Trade-offs

A short list of the things that bit me or would have if I hadn’t caught them upfront:

Cache invalidation on model switch. Prompt caches are partitioned per model. Moving from 4.7 to 4.8, or from 4.8 standard to 4.8 Fast, invalidates the cached prefix. First few sessions on the new endpoint pay full cache-write cost. Plan the migration window accordingly.
Token count drift. Same content, run through count_tokens on 4.6 versus 4.8, yields different numbers. If you have billing dashboards or rate-limit predictions built on historical 4.6 token counts, Opus 4.8 token usage will read as a step-change the day you flip the model ID, even before any workflow changes.
Effort levels affect output cost. Opus 4.8 has effort tiers (high default, extra, max in Claude Code). Higher effort means more reasoning tokens, billed at output rates whether displayed or not. The same prompt can produce very different bills depending on effort.
Fast Mode capacity. Research preview means available capacity isn’t guaranteed. For production routes, have a fallback to the standard endpoint built in.

FAQ

Does turning on 1M context + Fast Mode on Opus 4.8 actually double my costs?

Roughly, yes — but the base unit matters. The 1M context is at standard pricing, so context size alone doesn’t increase the rate. Fast Mode is a flat 2x on input and output. Stack caching multipliers and data residency on top, and the real cost depends on configuration. Per-request bills will be approximately 2x your previous standard 1M cost, before factoring in cache hit ratios.

Is Fast Mode generally available or still in research preview?

Research preview on the Claude API, gated by access, as of publication. Available immediately in Claude Code via the /fast command. Capacity and pricing structure may change before GA. Check the Fast Mode docs for current status.

Can I use Fast Mode with Batch API or on AWS?

No to both. Fast Mode is incompatible with the Batch API (batch is async, so latency has no value), and it’s not available on Claude Platform on AWS. As of publication, that leaves the direct Claude API and the other supported channels listed in the Fast Mode docs.

How much does the new tokenizer increase my token usage on Opus 4.8?

Anthropic’s documented range is 1.0x to 1.35x the token count of pre-4.7 models, with code and structured data hitting the top of that range. Plain English prose is barely affected. Independent measurement on real workloads has reported slightly above the documented top end for technical content. Run count_tokens on representative samples of your actual workload before relying on a single multiplier.

When is it actually worth enabling Fast Mode versus staying on standard?

When wall-clock latency directly affects the experience or the workflow downstream, and the task genuinely needs Opus-tier quality. Interactive copilots, real-time agents, customer-facing chat. Not worth it for batch jobs, background processing, or routes where a cheaper faster model would meet the quality bar anyway.

Conclusion

The Opus 4.8 1M Fast decision is two switches, not one. The 1M context switch is essentially free at the rate level: you pay through the tokenizer, not the rate card. The Fast Mode switch is a real 2x, but at a multiplier low enough to be defensible on routes where latency actually matters.

If you’re already running a multi-model setup: standard 1M slots into the long-context Opus role at the same cost shape as 4.7; Fast Mode is a new lever worth selectively enabling on latency-critical routes, not a default. Model the cache and residency stacking before committing. Re-run count_tokens on real samples before trusting any cost projection.

That’s where my data ends. Research-preview status means treat the numbers as current, not committed.
