← Blog

Dieser Artikel ist noch nicht in Ihrer Sprache verfügbar. Die englische Version wird angezeigt.

Gemini 3.5 Flash vs 3.1 Pro: Speed, Agents, and Cost

Gemini 3.5 Flash now outperforms 3.1 Pro on coding and agent benchmarks at lower cost. Here's the production routing decision builders need to make.

By Dora 9 min read
Gemini 3.5 Flash vs 3.1 Pro: Speed, Agents, and Cost

Dora here. I’ve been sitting with the Gemini​ 3.5 ​Flash​ vs 3.1 pro numbers since Google’s I/O 2026 launch on May 19 and the short version is this: the hierarchy inversion is real, it’s not marketing, and it affects routing decisions you might already have locked into a config file.

Flash models are supposed to trade quality for speed. 3.5 Flash breaks that contract — at least on the workloads most production agents actually run.

Why This Comparison Is Unusual: Flash Beating the Previous Pro

What Google Showed at I/O 2026

Gemini 3.5 Flash shipped GA on May 19, stable model ID Gemini-3.5 Flash with no preview suffix. The headline claim from Google: it outperforms Gemini 3.1 Pro on coding and agentic benchmarks while running roughly 4x faster than comparable frontier models, often at less than half the cost.

The Tier Inversion Explained in One Paragraph

Flash beats 3.1 Pro on the benchmarks that look like real work: Terminal-Bench 2.1 (76.2% vs 70.3%), MCP Atlas (83.6% vs 78.2%), Finance Agent v2 (57.9% vs 43.0%), and GDPval-AA Elo (1656 vs 1314). It trails Pro on Humanity’s Last Exam (40.2% vs 44.4%) and ARC-AGI-2 (72.1% vs 77.1%) — benchmarks dominated by raw parametric knowledge and pure abstract reasoning. When evaluating Gemini​ 3.5 ​​​Flash​ vs 3.1 pro​, the split is clean: agent work goes to Flash, hard reasoning stays with Pro.

Head-to-Head: Benchmarks and What They Actually Measure

The Gemini​ 3.5 ​Flash​ benchmark case against 3.1 Pro is specific, not universal. Here’s what the numbers actually show.

Terminal-Bench 2.1 measures the ability to execute multi-step terminal tasks — reading file system state, writing and running scripts, handling error output, retrying. Flash scores 76.2% versus 3.1 Pro’s 70.3%. That nearly 6-point gap is meaningful for automated pipelines where the model is operating the terminal rather than advising a human operator.

MCP Atlas is the one I keep coming back to. It tests scaled tool-use reliability — how well a model maintains correct tool invocations across extended multi-call sequences (8–15 calls per task, 4k–12k token context per call). Flash’s 83.6% beats 3.1 Pro’s 78.2% and also leads every competitor including Claude Opus 4.7 (79.1%) and GPT-5.5 (75.3%). For developers building autonomous agents that integrate web search, vector databases, and code execution sandboxes, this is the benchmark to weigh most heavily.

GDPval-AA Elo: Flash at 1656 versus Pro at 1314. A 342-point swing on a real-task agentic evaluation. Not a rounding error.

Where 3.1 Pro Still Wins (ARC-AGI-2, Long-Context Retrieval)

ARC-AGI-2 scores favor Pro by 5 points (77.1% vs 72.1%). For tasks requiring novel pattern recognition, complex logical deduction, or problems that don’t map to training data patterns, 3.1 Pro has an edge.

The longer-context gap is the one to actually test against your data. MRCR v2 at 128K context shows 3.1 Pro at 84.9% vs Flash at 77.3% — a 7.6-point gap. If your use case involves retrieving specific information from very long documents, legal document analysis, or needle-in-a-haystack retrieval, 3.1 Pro remains the stronger option.

One honest caveat: all the headline numbers above are self-reported by Google. Validate against your own prompts and domain constraints before drawing conclusions.

Multimodal Understanding Scores

CharXiv Reasoning: Flash at 84.2%, edging GPT-5.5’s 84.1%. OSWorld: 78.4%, on par with GPT-5.5 (78.7%). On multimodal pipelines, Flash has the clearest upgrade case.

Pricing and Latency

Gemini 3.5 Flash Pricing

Gemini​ 3.5 ​​​Flash​ pricing​: $1.50 per million input tokens, $9 per million output tokens. Cached input drops to $0.15 per 1M — the relevant number if you’re running repeated system prompts across agent loops. Context window: 1,048,576 input tokens, 65,536 output tokens. Dynamic thinking is on by default with levels (minimal, low, medium, high) for cost/performance tradeoffs.

Gemini 3.1 Pro Preview Unit Cost

Gemini 3.1 Pro: $2.00 per million cache-miss input tokens, $12.00 per million output tokens. Context window: 2.0M tokens. Maximum output: 16K tokens per request. Above 200K context, pricing steps to $4.00 input / $18.00 output. Flash has a 4x output limit advantage (65K vs 16K per response), which matters for generating complete code files without truncation.

Throughput Comparison

Flash delivers around 284 tokens per second against Pro’s 109. A workflow that takes three minutes with Pro might finish in under ninety seconds with Flash, at 25% lower cost per token.

Speed isn’t the goal. Not breaking flow is. At 3+ tool calls per agent step, that gap compounds fast.

Production Routing Decision

When Flash Is the Right Default

Use Flash as your routing default if:

  • Your agent makes multiple sequential tool calls per task (MCP, function calling, code execution sandbox)
  • You’re on CI/CD pipelines or terminal-automation workloads
  • Context stays below 100K tokens per request
  • Response time is user-visible — at 284 tokens/sec versus 109, this matters for interactive products

For MCP-based agents, it’s not close. Flash leads MCP Atlas by 5.4 points, Toolathlon by 7.1, Finance Agent v2 by 14.9. The speed advantage compounds in multi-step loops. Cached input at $0.15/1M makes high-frequency tool use 10x cheaper than running Pro.

When 3.1 Pro Is Still Worth the Cost

Two cases. One is reasoning purity: algorithm design, proof construction, complex debugging where you can’t run the output to validate it. ARC-AGI-2 at 77.1% vs 72.1% is the signal. In tasks where errors are expensive and you get one shot, that gap matters.

The second case is long context. If your retrieval operates at 128K tokens or beyond — full codebase analysis, long-document RAG, contracts — test the MRCR v2 gap against your actual retrieval lengths before switching. 3.1 Pro’s 2.0M context window also gives you headroom Flash can’t match.

When to Wait for 3.5 Pro Instead of Choosing Either

Gemini 3.5 Pro was announced at I/O on May 19 but is still in limited Vertex preview, with GA expected in June 2026. It targets a 2M-token context window, Deep Think reasoning, and frontier multimodal — the use cases Gemini Ultra used to cover.

Wait for 3.5 Pro if your core requirement is hard reasoning at scale and you need the 2M context window. The current Pro is 3.1 and it wins those benchmarks. 3.5 Pro is likely to widen that lead further.

The practical question is calendar. If you need to route production traffic now, you’re choosing between Flash and 3.1 Pro. Run your own evals on your specific task distribution. That’ll tell you more than anything I say.

Fallback Patterns for High-Availability Stacks

The clean pattern is a request classifier, not a global model ID replacement. Don’t run the migration as “replace every Gemini-3.1-pro-preview string with Gemini-3.5-Flash.” That’s how good launch news turns into production regressions.

Practical fallback logic:

  • Primary: Gemini-3.5-Flash for agent and coding workloads
  • Escalation on reasoning tasks: Gemini-3.1-pro-preview — triggered by task classifier (long context, novel deduction, no-retry constraint)
  • On 429 / quota exhaustion: retry Flash with exponential backoff first; escalate to Pro only after two failed retries
  • On 5xx: fall back to Pro immediately, log model ID and failure reason

Log model ID, prompt size, token count, tool call count, latency, fallback reason, and user-visible outcome. Without those fields, you’ll argue about model preference instead of measuring route performance.

What This Means for Model Aggregation

Why Phased Rollouts Make Single-Vendor Commitments Riskier

The Gemini​ agent benchmark situation this month illustrates a pattern that’s accelerated through 2025–2026: a Flash-tier model beats the previous Pro on agentic work, while Pro holds on reasoning. Next month 3.5 Pro ships. The ranking resets again.

Hardcoding your infrastructure to a single model ID means each release forces a migration under time pressure. The teams that handled this cycle smoothly were already routing by task class, not by model name.

Routing Across Tiers Within One Vendor + Across Vendors

Having many tools isn’t the problem. Having to manage your tools is.

This conclusion has an expiration date. The Gemini​ 3.1 pro vs ​Gemini​ 3.5 ​Flash decision looks like Flash for most production agent work, today. Check the 3.5 Pro benchmarks when the model card drops in June. The routing logic you build now should make that re-evaluation a config change, not a code change.

FAQ

Is ​Gemini​ 3.5 ​Flash​ strictly better than ​Gemini​ 3.1 Pro?

No. Flash outperforms 3.1 Pro on agentic tasks, tool use, coding, and multimodal benchmarks. However, 3.1 Pro still leads on pure abstract reasoning (ARC-AGI-2) and long-context retrieval above 128K tokens. The better model depends entirely on your workload distribution.

Should I migrate from 3.1 Pro to 3.5 ​Flash​ right now?

It depends. If your workloads are dominated by agents, multi-step tool calling, terminal automation, or coding tasks, the migration is usually worth it — you’ll get better benchmark performance, roughly 3x higher throughput, and lower cost. For long-context RAG or high-stakes reasoning where errors are expensive, test your own prompts first before switching.

When will ​Gemini​ 3.5 Pro be released?

Gemini 3.5 Pro was announced at I/O 2026 but is not yet generally available. It is currently in limited preview. Google indicated a June 2026 target for full release. The current production Pro model remains Gemini 3.1 Pro Preview.

Does ​Gemini​ 3.5 ​Flash​ have a free tier?

Yes, there is a free tier with daily quotas. However, for any serious production agent workloads, the free tier limits are likely to be hit quickly. Most production use cases should plan on the paid tier.

Bottom Line

The Gemini​ 3.5 ​Flash​ vs 3.1 ​Pro split is cleaner than most Flash vs. Pro comparisons. Flash wins the work that looks like production: agents, tool calls, terminal tasks, multimodal grounding. Pro wins the work that looks like research: hard reasoning, long-context retrieval, novel deduction.

Default to Flash for agent workloads. Keep Pro available as an escalation target for reasoning-heavy requests and long-context retrieval above 128K. Build your fallback logic now so the 3.5 Pro release in June is a config update, not a migration sprint.

This is where my data ends. Run it on your own task distribution before you commit a routing change to production.

Previous posts: