
MiniMax M3 API: Pricing, 1M Context & Production Use

MiniMax M3 API explained for builders: 1M context, native multimodal input, coding & agent workloads, and production cost notes.

By Dora 9 min read

The MiniMax M3 API went live June 1. I started testing it that week. Two weeks in is the earliest I’d write anything down; before that, you’re still being impressed by demos.

This is a work note, not a model review. Benchmarks are everywhere. What I cared about was narrower: where does the MiniMax M3 model actually fit in production, what does the 1M context cost in practice, and when should you go direct versus through an aggregator.

A few things up front.

Most of the headline numbers (59.0% SWE-Bench Pro, the >9× / 15× speedups, 83.5 on BrowseComp) are vendor-reported. I’d treat them as a ceiling under favorable conditions, not a floor on your codebase.

The 1M context is real. The pricing on it is in two tiers. That matters more than people think.

Open weights landed on Hugging Face around day ten. If you read a launch piece saying “weights forthcoming,” it’s already stale.

What MiniMax M3 Is (for builders)

API availability & access paths

Three ways in. Direct through MiniMax’s open platform. Through an aggregator — OpenRouter, Fireworks, others. Or self-host from Hugging Face.

I tried the first two. Self-hosting I’ll leave to people with the GPUs for it: the MiniMax M3 parameter count is ~428B total with ~23B activated per token (MoE), so it’s not a single-consumer-GPU exercise.

The two paths I did test felt different in ways that aren’t obvious from the docs. Direct is cheaper per token. Aggregators give you one billing surface across many models. Which one matters depends on a question most teams haven’t answered yet — and I’ll come back to that later because it’s where I see people stall.

1M context with a 512K guaranteed minimum

This is the line worth reading carefully. The MiniMax M3 API supports up to 1M tokens of context. The number to plan against is 512K, the guaranteed minimum. The 1M ceiling is conditional.

I ran a test at ~480K (a stitched-together repo + design docs + a long thread). It came back consistently across three runs, latency in the range I’d expect.

Pushed to ~700K (added the project’s full issue history). Latency spread widened noticeably. And the billing tier changed.

So in practice: 512K is your reliable working number. The upper half of the window exists, but it’s a budget item, not free capacity.

The architecture under it is MSA — MiniMax Sparse Attention — documented in the launch post. The relevant detail for cost planning is that per-token compute at 1M context drops to roughly 1/20 of M2. Without that ratio, the long-context tier wouldn’t be economically possible at all.

What M3 Is Built For

Coding & agentic workloads

The MiniMax M3 model is positioned for long-horizon coding and agent work. From two weeks of poking at it, that framing is honest.

Single-turn Q&A is fine. Unexciting, but fine. Where it changes is in long sessions — read a repo, plan, execute, iterate, recover from something breaking halfway through. MiniMax’s own demo runs M3 for 12 hours, 18 commits, reproducing an ICLR paper. That’s the workload the architecture seems to be shaped around.

The relevant MiniMax M3 benchmark numbers, all vendor-reported, are 59.0% SWE-Bench Pro, 66.0% Terminal Bench 2.1, 74.2% MCP Atlas. VentureBeat’s coverage lines them up against GPT-5.5 and Gemini 3.1 Pro. The framing (“eclipsing”) I’d cool down by 30%. The numbers are real. The conditions were MiniMax’s lab and MiniMax’s scaffolding.

What it means for builders: if you’re building a coding assistant, a desktop agent, or anything holding multi-step plans, M3 is in the shortlist. If your workload is short prompts at high concurrency, you’re paying for context you’re not using.

Native multimodal (image, video) input

Multimodal is native, not bolted on. Text, image, and video go into the same context. Output is text.

I gave it a UI screenshot + a 30-second screen recording + a chunk of related backend code and asked it to figure out what the user was actually trying to do. It got there. Not in one shot — I had to nudge it on the second turn. (The first guess was reasonable but wrong.) That counts as working, by my bar.

The detail I want to flag, because it bit me in a different test: image and video tokens share the same pool as text. A short clip can eat a serious chunk of your 512K window before any prompt text. I checked the token count on a 15-second 720p clip — it was higher than my mental model. Worth measuring before extrapolating.

Production Cost & Limits

I’m not putting specific per-million-token figures here. Provider rates shift, and there’s a separate piece on MiniMax M3 pricing mechanics that handles the math properly. What you need at the planning stage is the shape.

Standard vs long-context (>512K) rate

Two tiers on the MiniMax M3 API:

  • ≤512K input tokens — standard rate. Covers most chat, coding, and agent loops.
  • >512K input tokens — higher long-context rate. Aimed at full-repo reasoning, ultra-long documents, hours-long agent sessions.

This split is the single thing I’d design around. A system that lives in the 100K–300K band has different unit economics than one that routinely brushes 700K. I discovered this the way I’d guess most teams do: by looking at a bill.
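To make the shape concrete, here is a minimal sketch of a two-tier input cost estimate. The rates are placeholders you supply yourself, not MiniMax’s prices. Two assumptions are baked in and worth checking against the docs: that 512K means 512,000 tokens, and that a request over the threshold is billed entirely at the long-context rate rather than only on the excess.

```python
from dataclasses import dataclass

# Assumption: "512K" is taken as 512,000 tokens. Check whether your
# provider means 524,288 instead.
LONG_CONTEXT_THRESHOLD = 512_000


@dataclass(frozen=True)
class Rates:
    """Per-million-token input rates. Placeholders: fill in from your provider."""
    standard_per_m: float
    long_context_per_m: float


def input_cost(input_tokens: int, rates: Rates) -> float:
    """Estimate input cost for one request.

    Assumption: once input exceeds the threshold, the whole request is
    billed at the long-context rate (not just the tokens above it).
    """
    if input_tokens < 0:
        raise ValueError("input_tokens must be non-negative")
    if input_tokens > LONG_CONTEXT_THRESHOLD:
        rate = rates.long_context_per_m
    else:
        rate = rates.standard_per_m
    return input_tokens / 1_000_000 * rate


if __name__ == "__main__":
    # Made-up unit rates, only to show the jump between bands.
    demo = Rates(standard_per_m=1.0, long_context_per_m=2.0)
    for tokens in (300_000, 512_000, 700_000):
        print(tokens, round(input_cost(tokens, demo), 4))
```

With any pair of rates where the long-context one is higher, the cost curve has a step at the threshold, which is why the 100K–300K system and the 700K system need separate budgets.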

What worked for me: cap default routing at 512K, require an explicit flag to push past it. Then the cost shows up at the call site, not at the end of the month.
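Roughly what that guard can look like, as a sketch rather than my exact code. The names `check_input_budget` and `allow_long_context` are hypothetical, and the token count is whatever your own tokenizer or usage logs report before the call goes out.

```python
DEFAULT_INPUT_CAP = 512_000  # assumption: 512K taken as 512,000 tokens


class LongContextNotAllowed(Exception):
    """Raised when a request would cross the cap without an explicit opt-in."""


def check_input_budget(
    input_tokens: int,
    *,
    allow_long_context: bool = False,
    cap: int = DEFAULT_INPUT_CAP,
) -> str:
    """Return the tier a request falls in, or raise if it lacks the opt-in."""
    if input_tokens <= cap:
        return "standard"
    if not allow_long_context:
        raise LongContextNotAllowed(
            f"{input_tokens} input tokens exceeds the {cap} cap; "
            "pass allow_long_context=True to proceed"
        )
    return "long_context"


if __name__ == "__main__":
    print(check_input_budget(480_000))
    print(check_input_budget(700_000, allow_long_context=True))
```

The flag is keyword-only on purpose: anyone pushing past the cap has to type it out at the call site, so it shows up in code review.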

Shared token pool across modalities

Already mentioned this but it deserves its own line. There is no separate multimodal allowance. Images are tokens. Frames are tokens. They eat the same window text does, and they cross 512K the same way text does.

For agent loops that pull in screenshots every turn, this builds up faster than the napkin math suggests. Audit a real session’s token count. Don’t trust the synthetic benchmark.
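A small sketch of that audit, assuming you can log per-turn token counts split by modality (the `Turn` fields are hypothetical, not an API response shape) and that every turn’s tokens stay in context for the rest of the session.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

PLANNING_LIMIT = 512_000  # assumption: 512K taken as 512,000 tokens


@dataclass(frozen=True)
class Turn:
    """Token counts for one turn, taken from your own logs."""
    text_tokens: int
    image_tokens: int = 0
    video_tokens: int = 0

    @property
    def total(self) -> int:
        return self.text_tokens + self.image_tokens + self.video_tokens


def audit_session(turns: Iterable[Turn], limit: int = PLANNING_LIMIT) -> dict:
    """Sum tokens per modality and find the first turn (1-based) at which
    the cumulative context exceeds the limit, or None if it never does."""
    totals = {"text": 0, "image": 0, "video": 0}
    running = 0
    crossed_at: Optional[int] = None
    for index, turn in enumerate(turns, start=1):
        totals["text"] += turn.text_tokens
        totals["image"] += turn.image_tokens
        totals["video"] += turn.video_tokens
        running += turn.total
        if crossed_at is None and running > limit:
            crossed_at = index
    return {"totals": totals, "grand_total": running, "crossed_at_turn": crossed_at}


if __name__ == "__main__":
    # Illustrative numbers only, not measurements.
    session = [
        Turn(text_tokens=100_000, image_tokens=50_000),
        Turn(text_tokens=200_000, video_tokens=100_000),
        Turn(text_tokens=50_000, image_tokens=30_000),
    ]
    print(audit_session(session))
```

The per-modality split is the useful part: it tells you whether screenshots or text are what pushed the session over.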

Direct API vs Aggregation Layer

The decision I see teams stuck on. Most of them stall longer than they should.

When each makes sense

Go direct if:

  • M3 is committed to as your primary model and you don’t expect to swap.
  • Per-token cost matters more than integration surface.
  • You need the full 1M (some aggregators cap lower — Fireworks shipped with a 500K cap and is lifting it in stages).
  • Maintaining a model-specific integration doesn’t bother you.

Go through an aggregator if:

  • You already run more than one model in production, or you’ll need to.
  • You want to A/B M3 against, say, Claude Opus or DeepSeek V4 without rebuilding the request path.
  • Unified billing, retries, fallback routing, and observability matter.
  • You don’t yet know which model your workload settles on.

The honest case for aggregation isn’t lower per-token cost. Usually it’s marginally higher. The case is that model-switching freedom has economic value, and that value compounds the more your product touches generation across providers. WaveSpeedAI sits in that layer, as do OpenRouter and Fireworks; each makes different trade-offs on routing, latency, and coverage.

My rough rule, for what it’s worth: single-model coding agent → direct. Mixed text + image + video across providers → aggregator. Not a hard rule. A starting position.
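If it helps to pin the starting position down, the two checklists can be encoded as a tiny function. This is a sketch of my own rule of thumb, with hypothetical parameter names; it is not a policy from any provider, and the ordering of the checks is a judgment call.

```python
from typing import List, Tuple


def suggest_access_path(
    *,
    models_in_production: int,
    expect_to_switch: bool,
    needs_unified_ops: bool,
    needs_full_1m: bool,
) -> Tuple[str, List[str]]:
    """Return a suggested path ("direct" or "aggregator") plus caveats."""
    caveats: List[str] = []
    multi_model = models_in_production > 1
    if multi_model or expect_to_switch or needs_unified_ops:
        path = "aggregator"
        if needs_full_1m:
            # Some aggregators cap context below 1M.
            caveats.append("verify the aggregator's context cap before committing")
    else:
        path = "direct"
        caveats.append("you own a model-specific integration")
    return path, caveats


if __name__ == "__main__":
    print(suggest_access_path(
        models_in_production=1,
        expect_to_switch=False,
        needs_unified_ops=False,
        needs_full_1m=True,
    ))
```

Treat the output as a prompt for the conversation your team hasn’t had yet, not as an answer.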

Limits & Trade-offs

Open weights, technical report, and what’s still missing

Weights are on Hugging Face. Community GGUF quantizations are up. Two things worth knowing:

License terms haven’t fully settled. Don’t assume Apache 2.0 or MIT — check before building commercial product on the local-deployment path.

And not every inference engine supports MSA yet. The ones that don’t fall back to dense attention, which forfeits a chunk of the speed advantage. If you self-host, verify engine support before you benchmark — otherwise your numbers will look worse than they should.

The ARC-AGI gap

One thing the launch coverage tends to skip. M3’s coding and agentic numbers do not translate cleanly to general abstract-reasoning benchmarks like ARC-AGI. The model is shaped for what it advertises — coding, tool use, long-horizon agents, multimodal grounding — not abstract puzzles.

That’s not a criticism. It’s the shape of the model. Knowing it saves a wrong bet.

FAQ

Is the MiniMax M3 API already live and stable for production use?

Yes. Live since June 1, 2026. Stable enough that multiple aggregators are routing real traffic. As with any model under three months old, expect occasional behavior shifts as the provider tunes things — pin your prompts, keep an eval suite.

What’s the real guaranteed context length on MiniMax M3 (512K or 1M)?

512K guaranteed, 1M ceiling. Under 512K behaves consistently. Past that, you cross into both higher pricing and wider latency variance. Some aggregators cap lower than 1M at launch. Plan around 512K.

Is MiniMax M3 natively multimodal for image and video input?

Yes. Native, not adapter. Text, image, and video share the same token pool and context window. Output is text only.

Are the open weights for MiniMax M3 available yet?

Yes. They landed on Hugging Face around day ten after launch. ~428B total parameters, ~23B activated per token (MoE). A single consumer GPU won’t run it; expect multi-GPU or quantized inference. License terms haven’t fully settled, so verify them before commercial use.

Should I access MiniMax M3 directly or through an aggregator?

Direct if you’re committed to a single model and want the lowest per-token cost. Aggregator if you’re running more than one model or expect to switch. The answer depends on your workload, not on which one is “better.”

Conclusion

The MiniMax M3 API is interesting less because it tops any single axis and more because of the combination: frontier-level coding, native multimodal, a real (if tiered) 1M context, open weights, all in one model. That combination collapses integration surface that used to require stitching providers together.

What I’d do before committing to production:

Run your actual workload, not a benchmark. Measure where prompts land relative to 512K. Decide routing policy before launch. Pick direct or aggregator based on how many models you’ll run, not on sticker price alone. If you self-host, check MSA support in your engine.

Two weeks isn’t long. The model will keep evolving and so will the pricing. That’s the expiration date on this piece.

To be verified, as always, against whatever the docs say the day you actually build.
