Gemini 3.5 Flash Shipped — A Flash-Tier Model Now Leads the Pro Tier on Agent Benchmarks

Gemini 3.5 Flash went GA at I/O 2026 with thinking-on-by-default, $1.50/$9 per 1M tokens, and a benchmark profile that beats Claude Opus 4.7 and GPT-5.5 on MCP Atlas and most agent suites. Here's where Flash leads, where it loses, and how to deploy.

Google announced Gemini 3.5 Flash at I/O on May 19, 2026, and shipped it to general availability the same day across the Gemini API, AI Studio, Antigravity, Vertex AI, the Gemini app, and AI Mode in Search. The model ID is gemini-3.5-flash (no preview suffix), the May 2026 snapshot is 3.5-flash-05-2026, and the pricing is $1.50 input / $9.00 output per 1M tokens, with $0.15/1M for cached input.

The headline number is on the benchmark side: a Flash-tier model now beats Pro-tier frontier models on most agent suites. Claude Opus 4.7 and GPT-5.5, both Pro-class and both pricier on output tokens, trail Flash on MCP Atlas, Toolathlon, and Finance Agent v2. Coding is more mixed, and there’s a clear category where Flash still loses. Below is the full picture, an honest read of the trade-offs, and where to deploy.

What shipped, in one table

| Detail | Value |
| --- | --- |
| Model ID | gemini-3.5-flash |
| Snapshot | 3.5-flash-05-2026 |
| Input pricing | $1.50/1M tokens |
| Output pricing | $9.00/1M tokens |
| Cached input | $0.15/1M tokens |
| Input modalities | Text + image + audio + video |
| Output modalities | Text |
| Context window | 1,048,576 input / 65,536 output |
| Thinking | Dynamic thinking on by default |
| Tool use | Function calling, structured output, search-as-tool, code execution |
| Availability | Gemini API, AI Studio, Antigravity, Vertex AI, Gemini app, AI Mode in Search |
| Speed claim | ~4× output tokens/sec vs frontier peers |

The “thinking on by default” detail matters more than the spec sheet makes it look. This isn’t a thinking_budget parameter you set per request — Flash has dynamic reasoning baked in. The model decides how much to think based on the prompt. For production code that prices in latency budgets, this is a different deployment shape than Sonnet 4.6’s extended-thinking toggle or GPT-5.5’s reasoning parameter.
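
To make the spec table concrete, here is a minimal sketch of how a request to this model might be assembled against the Gemini API's REST `generateContent` route. The model ID and the 65,536-token output limit come from the table; the `v1beta` path, the `build_request` helper, and the default of 1,024 output tokens are assumptions for illustration, not confirmed details of how 3.5 Flash is served. Note that the payload sets no thinking parameter, in line with the description above.

```python
import json
import urllib.request

MODEL_ID = "gemini-3.5-flash"   # model ID from the spec table
MAX_OUTPUT_TOKENS = 65_536      # output limit from the spec table

# Assumption: the model is reachable on the Gemini API's v1beta route.
BASE_URL = "https://generativelanguage.googleapis.com/v1beta/models"


def build_request(prompt: str, api_key: str,
                  max_output_tokens: int = 1024) -> urllib.request.Request:
    """Build (but do not send) a generateContent request for MODEL_ID.

    `build_request` is a hypothetical helper, not part of any SDK.
    """
    if not 0 < max_output_tokens <= MAX_OUTPUT_TOKENS:
        raise ValueError(f"max_output_tokens must be in 1..{MAX_OUTPUT_TOKENS}")
    body = {
        "contents": [{"parts": [{"text": prompt}]}],
        # No thinking setting here: the model is described as deciding
        # how much to reason on its own.
        "generationConfig": {"maxOutputTokens": max_output_tokens},
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/{MODEL_ID}:generateContent",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-goog-api-key": api_key},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` with a timeout would send it. Because reasoning depth is chosen by the model, a client-side timeout is the main latency control this sketch leaves you with.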

Agent benchmarks: Flash vs Pro-tier

The cross-vendor data is where Flash’s positioning becomes legible. Pulling from the launch comparisons in Digital Applied’s agentic coding breakdown and LLM Stats’ launch analysis:

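| Benchmark | Gemini 3.5 Flash | Claude Opus 4.7 | GPT-5.5 | Winner |
| --- | --- | --- | --- | --- |
| MCP Atlas | 83.6% | 79.1% | 75.3% | Flash (+4.5 / +8.3) |
| Toolathlon | 56.5% | | | Flash |
| Finance Agent v2 | 57.9% | | | Flash |
| CharXiv Reasoning | 84.2% | | | Flash |
| MMMU-Pro | 83.6% | | | Flash |
| SWE-Bench Pro | | 64.3% | | Opus 4.7 |
| Terminal-Bench 2.1 | 76.2% | | 78.2% | GPT-5.5 (+2.0) |
| OSWorld-Verified | | | 78.7% | GPT-5.5 |
| Blueprint-Bench 2 | | | 36.2% | GPT-5.5 |
| GDPval-AA | 1656 Elo | | 1769 Elo | GPT-5.5 (+113) |
| ARC-AGI-2 | 72.1% | | 84.6% | GPT-5.5 (+12.5) |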

Three reads on this:

On agent orchestration, Flash is now the default to reach for first. MCP Atlas measures multi-step tool-driven workflows — the use case that most enterprise agent stacks actually deploy. Beating Opus by 4.5 points on this benchmark at Flash pricing is a meaningful capability-per-dollar shift. Toolathlon and Finance Agent v2 reinforce the pattern: anywhere the work is agentic (plan, call tools, integrate results, iterate), Flash is leading.

On terminal-style coding, GPT-5.5 still wins by a hair. A 2-point gap on Terminal-Bench 2.1 isn’t decisive on its own, but combined with GPT-5.5’s leads on GDPval-AA (113 Elo) and OSWorld-Verified, the read is that if your workflow is “give the model a terminal and a task,” GPT-5.5 is still the right choice. Flash narrows the gap; it doesn’t take the lead.

On hard abstract reasoning, Flash has a real weakness. ARC-AGI-2 is the cleanest signal here — Flash drops 12.5 points behind GPT-5.5. This is consistent with what we noted yesterday about Flash regressing on Humanity’s Last Exam and long-context retrieval versus the previous Gemini 3.1 Pro. The Flash architecture clearly traded reasoning depth for speed and cost. Gemini 3.5 Pro arriving in June is presumably the answer to that trade.
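
The margins quoted in these three reads can be recomputed directly from the benchmark table. The sketch below does that for the rows that report more than one model. The scores are copied from the table; the `margins` helper is a hypothetical name introduced here.

```python
# Scores copied from the benchmark table (percent, or Elo for GDPval-AA).
# Only rows that report more than one model are included.
SCORES = {
    "MCP Atlas": {"Gemini 3.5 Flash": 83.6, "Claude Opus 4.7": 79.1, "GPT-5.5": 75.3},
    "Terminal-Bench 2.1": {"Gemini 3.5 Flash": 76.2, "GPT-5.5": 78.2},
    "GDPval-AA": {"Gemini 3.5 Flash": 1656, "GPT-5.5": 1769},
    "ARC-AGI-2": {"Gemini 3.5 Flash": 72.1, "GPT-5.5": 84.6},
}


def margins(row):
    """Return (winner, {other_model: winner's lead}) for one benchmark row."""
    winner = max(row, key=row.get)
    leads = {
        model: round(row[winner] - score, 1)
        for model, score in row.items()
        if model != winner
    }
    return winner, leads


for name, row in SCORES.items():
    winner, leads = margins(row)
    print(f"{name}: {winner} leads by {leads}")
```

This reproduces the +4.5 and +8.3 leads on MCP Atlas and the 2.0, 113, and 12.5 deficits on the three GPT-5.5 rows. It treats percentage points and Elo the same way, so the GDPval-AA figure is not comparable to the others.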

Pricing in context

| Model | Input ($/1M) | Output ($/1M) | Output/input ratio | Notes |
| --- | --- | --- | --- | --- |
| Gemini 3.5 Flash | $1.50 | $9.00 | 6.0× | Cached input $0.15 |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 5.0× | 1M context flat |
| Claude Opus 4.7 | $5.00 | $25.00 | 5.0× | Pro-tier reasoning |
| GPT-5.5 | $1.25 | $10.00 | 8.0× | Cheapest input |
| Gemini 3.1 Pro (previous) | $2.50 | $15.00 | 6.0× | Flash is 40% cheaper |

Flash sits below Sonnet 4.6 on both axes while leading Opus 4.7 on agent benchmarks. That’s the pricing story builders need to absorb: the agent-orchestration default just got 50% cheaper on input and 40% cheaper on output, with a meaningfully better benchmark profile than the previous default at the same tier.
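
The percentages in this section follow from the list prices alone. As a quick check, the sketch below recomputes the output/input ratio for each model and Flash's savings against Sonnet 4.6 and Gemini 3.1 Pro. Prices are taken from the pricing table; the function names are illustrative.

```python
# (input, output) list prices in USD per 1M tokens, from the pricing table.
PRICES = {
    "Gemini 3.5 Flash": (1.50, 9.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Claude Opus 4.7": (5.00, 25.00),
    "GPT-5.5": (1.25, 10.00),
    "Gemini 3.1 Pro": (2.50, 15.00),
}


def output_ratio(model):
    """How many times more an output token costs than an input token."""
    input_price, output_price = PRICES[model]
    return output_price / input_price


def savings(model, baseline):
    """Fraction by which `model` is cheaper than `baseline`: (input, output)."""
    (m_in, m_out), (b_in, b_out) = PRICES[model], PRICES[baseline]
    return 1 - m_in / b_in, 1 - m_out / b_out


for model in PRICES:
    print(f"{model}: output costs {output_ratio(model):.1f}x input")

for baseline in ("Claude Sonnet 4.6", "Gemini 3.1 Pro"):
    s_in, s_out = savings("Gemini 3.5 Flash", baseline)
    print(f"Flash vs {baseline}: {s_in:.0%} cheaper input, {s_out:.0%} cheaper output")
```

Flash comes out 50% cheaper on input and 40% cheaper on output than Sonnet 4.6, and 40% cheaper on both than Gemini 3.1 Pro. Seen from the other direction, 3.1 Pro costs about 67% more than Flash, so the direction of a percentage comparison matters.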

The $0.15/1M cached input pricing is what tips the math hard for any RAG- or memory-heavy workflow. If you’re feeding 500K tokens of cached context per request, that context costs about $0.075 at Flash’s cached rate versus $1.50 at Sonnet 4.6’s standard input rate: 5% of Sonnet’s rate, and 10% of Flash’s own standard input rate. That’s not a percentage point of margin; that’s a different cost class.
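
A worked version of the 500K-token example, as a rough sketch: it prices only the cached context itself, using the per-1M rates quoted in this article. It assumes no cache-storage or cache-write charges and ignores output tokens, so treat the result as a lower bound rather than a bill.

```python
def token_cost(tokens: int, usd_per_million: float) -> float:
    """Cost in USD for `tokens` at a flat per-1M-token rate."""
    return tokens / 1_000_000 * usd_per_million


CACHED_TOKENS = 500_000   # cached context per request, from the example
FLASH_CACHED = 0.15       # USD per 1M cached input tokens
FLASH_STANDARD = 1.50     # USD per 1M standard input tokens
SONNET_STANDARD = 3.00    # USD per 1M standard input tokens

flash_cached_cost = token_cost(CACHED_TOKENS, FLASH_CACHED)
flash_standard_cost = token_cost(CACHED_TOKENS, FLASH_STANDARD)
sonnet_standard_cost = token_cost(CACHED_TOKENS, SONNET_STANDARD)

print(f"Flash, cached:    ${flash_cached_cost:.3f} per request")
print(f"Flash, standard:  ${flash_standard_cost:.3f} per request")
print(f"Sonnet, standard: ${sonnet_standard_cost:.3f} per request")
print(f"Cached Flash vs standard Sonnet: {flash_cached_cost / sonnet_standard_cost:.0%}")
print(f"Cached Flash vs standard Flash:  {flash_cached_cost / flash_standard_cost:.0%}")
```

The cached rate works out to 5% of Sonnet 4.6's standard input rate and 10% of Flash's own standard rate.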

Where Flash fits in production today

Concrete deployment reads, based on the benchmark data:

Use Flash for:

  • MCP / tool-orchestrated agents. This is where Flash genuinely leads, and the price advantage is largest.
  • High-volume API workflows where unit cost matters more than peak intelligence: data transformation, classification, structured extraction, batch processing.
  • Multi-modal pipelines that take image/audio/video input and emit text — Flash supports all four input modalities natively.
  • Cache-heavy workflows (long-context RAG, conversation memory, document search) — the $0.15/1M cached input is the cheapest in the frontier tier.

Don’t use Flash for (yet):

  • Hard abstract reasoning — ARC-AGI-2 style problems. GPT-5.5 is the choice.
  • Long-context retrieval at 128K+ — Flash regressed vs the previous Gemini 3.1 Pro here. Wait for 3.5 Pro in June.
  • Pure terminal coding agents: GPT-5.5 still has a 2-point edge on Terminal-Bench 2.1, and a small per-step gap can compound across multi-step coding workflows.
  • Workloads where you need to control thinking budget per-request — Flash has thinking baked in, not exposed as a parameter.

What changed today that wasn’t true yesterday

Three things genuinely shifted with Flash’s release:

  1. The default agent model is no longer Pro-tier. “Use the best model you can afford” stops being good advice for agent workflows. For MCP-orchestrated tasks, Flash beats Pro models from competitors and costs less.
  2. The Gemini text family caught up on agentic capability. Pre-launch, the dominant framing was “Gemini is behind on coding/agents.” Post-launch, Flash leads most of the agent suites and is competitive on coding. The narrative needs to be updated.
  3. The reasoning gap got bigger, not smaller. Flash’s 12.5-point deficit on ARC-AGI-2 and its regression on Humanity’s Last Exam are real. June’s Pro release is now the load-bearing event for whether Gemini closes that specific gap.

Deploy paths

The cleanest deployment shape today depends on what surface you’re on:

  • Production API directly via Google: gemini-3.5-flash via Vertex AI or AI Studio. Both surface the same model.
  • In Antigravity (Google’s IDE-style coding surface): default model swap from gemini-3.1-pro to gemini-3.5-flash is the right move for most workflows.
  • In a multi-vendor router: add gemini-3.5-flash to your agent-orchestration policy. For MCP / tool-heavy paths, route to Flash first; fall back to GPT-5.5 for terminal coding and ARC-style reasoning.
  • On WaveSpeedAI: the WaveSpeedAI LLM endpoint gives you OpenAI-compatible access to current frontier text models behind one API key. As Gemini 3.5 Flash gets integrated, you’ll be able to A/B-test it against the rest of your model lineup under the same surface.
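
The multi-vendor routing idea in the list above can be sketched as a small policy table. The task category names, the `Route` type, the `pick_model` helper, and the `gpt-5.5` model ID are all hypothetical; only `gemini-3.5-flash` comes from this article. Defaulting unknown task types to Flash, and using Flash as the backup for GPT-5.5 routes, are assumptions of this sketch rather than recommendations from the benchmark data.

```python
from dataclasses import dataclass

FLASH = "gemini-3.5-flash"   # model ID from the article
GPT = "gpt-5.5"              # placeholder ID; check your provider's naming


@dataclass(frozen=True)
class Route:
    primary: str
    fallback: str


# Hypothetical task categories mapped to the article's guidance.
POLICY = {
    "tool_orchestration": Route(FLASH, GPT),
    "structured_extraction": Route(FLASH, GPT),
    "multimodal_to_text": Route(FLASH, GPT),
    "terminal_coding": Route(GPT, FLASH),
    "abstract_reasoning": Route(GPT, FLASH),
}
DEFAULT_ROUTE = Route(FLASH, GPT)  # assumption: unknown tasks try Flash first


def pick_model(task_type: str, unavailable: frozenset = frozenset()) -> str:
    """Return the first available model for a task type."""
    route = POLICY.get(task_type, DEFAULT_ROUTE)
    for model in (route.primary, route.fallback):
        if model not in unavailable:
            return model
    raise RuntimeError(f"no model available for task type {task_type!r}")


print(pick_model("tool_orchestration"))
print(pick_model("terminal_coding"))
print(pick_model("terminal_coding", unavailable=frozenset({GPT})))
```

Keeping the policy as data rather than branching logic makes it easy to revise once Gemini 3.5 Pro or independent benchmark replications change the picture.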

What to watch for in June

Two things should resolve in the next four weeks:

  1. Gemini 3.5 Pro launches. This is the answer to whether the Flash regression on reasoning and long-context gets fixed. If Pro lands above 3.1 Pro on Humanity’s Last Exam and matches Flash on Terminal-Bench, the whole Gemini 3.5 family is the new default. If Pro just patches the regression at higher cost, the lineup stays bifurcated.
  2. Independent agent benchmark replications. Google’s MCP Atlas / Toolathlon / Finance Agent numbers are first-party. The interesting question is whether third-party agent benchmark suites (LangChain Bench, MetaGPT eval, etc.) reproduce the lead. Watch for replication studies in the next two to three weeks.

Until then: Flash is shipping, the agent-orchestration cost just dropped, and the question on most builders’ plates this week is whether to migrate the agent path off Opus 4.7 and onto gemini-3.5-flash today, or wait for 3.5 Pro.

Sources: LLM Stats on Gemini 3.5 Flash, Digital Applied agentic coding comparison, Seeking Alpha on agentic benchmark leadership, DataCamp Gemini 3.5 Flash review, Vertex AI release notes.