Gemini 3.5 Omni and Omni Flash: A Multimodal Tracker

Google introduced Gemini Omni as a multimodal world model. Here's what builders can verify and what to watch in the multimodal generation space.

By Dora

A working note on what Google actually shipped, what’s still a demo, and what a builder should do about it this week.

I spent an afternoon trying to figure out whether I could call Gemini 3.5 Omni from code. I couldn’t. Not yet. The Gemini app showed me Omni Flash generating video from a photo and a one-line prompt, and it was good — but the thing I actually needed, an API endpoint I could wire into a pipeline, wasn’t there. That gap is the whole story. This piece documents where Omni really stands as of early June 2026, and what changes (and doesn’t) for anyone running video generation in production.

Quick version if you’re in a hurry: Omni is a real video generation model, it’s live for consumers, and developer API access is still “coming weeks.” Don’t build on it yet. Here’s the detail.

What Google introduced as “Gemini Omni”

Positioning: multimodal world model for video, image, and simulation generation

Google’s own framing is that Gemini Omni can create anything from any input, starting with video. The keyword there is “starting.” Right now Omni does one thing publicly: it takes text, images, audio, and existing video clips, and it produces video out the other end.

What makes it more than a text-to-video box is that it reasons across all of those inputs to produce a consistent output, instead of stapling them together. Google also leaned hard on physics: the TechCrunch writeup of the I/O reveal notes the outputs are meant to reflect an understanding of motion, gravity, and fluid behavior. In the demos that meant fewer warped limbs and melting backgrounds. I’ll believe the consistency claim when I’ve run forty clips through it, not nine.

How Omni differs from Gemini 3.5 Flash / Pro (output modalities)

This is the part most coverage gets muddy, so I’ll be blunt about it. Omni and the 3.5 family are not the same kind of model.

| Model | Input | Output | What it’s for |
| --- | --- | --- | --- |
| Gemini Omni / Omni Flash | text, image, audio, video | video | generating and editing footage |
| Gemini 3.5 Flash | text, image, audio, video | text, code | agentic tasks, reasoning, UI generation |
| Gemini 3.5 Pro | text, image, audio, video | text, code | same, higher ceiling (still rolling out) |

3.5 Flash is input multimodal: it understands your video and writes you an answer. Omni is output multimodal: it understands your inputs and hands you back a video. Google’s blog says it plainly: “Gemini Omni is our new model that can create anything from any input, starting with video,” while 3.5 is the family combining frontier intelligence with action for agentic work. If you’ve been treating “Gemini 3.5 can do video” as one capability, split it in two. They live in different parts of your stack.
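To make the split concrete, here is a minimal sketch of how I’d record it in a capability catalog. The modality sets come straight from the table above; the labels (`omni_flash`, `gemini_3_5_flash`, `gemini_3_5_pro`) are names I made up for illustration and are not API model IDs, since none has been published for Omni.

```python
from dataclasses import dataclass

# All four models in the table accept the same input media.
MEDIA = frozenset({"text", "image", "audio", "video"})


@dataclass(frozen=True)
class ModelCapability:
    label: str          # illustrative label, NOT a real API model ID
    inputs: frozenset   # modalities the model accepts
    outputs: frozenset  # modalities the model produces


CATALOG = [
    ModelCapability("omni_flash", MEDIA, frozenset({"video"})),
    ModelCapability("gemini_3_5_flash", MEDIA, frozenset({"text", "code"})),
    ModelCapability("gemini_3_5_pro", MEDIA, frozenset({"text", "code"})),
]


def models_for(input_kind: str, output_kind: str) -> list[str]:
    """Return labels of models that accept input_kind and emit output_kind."""
    return [
        m.label
        for m in CATALOG
        if input_kind in m.inputs and output_kind in m.outputs
    ]


# "Understand my video" and "make me a video" resolve to different models.
print(models_for("video", "text"))
print(models_for("image", "video"))
```

Keying the lookup on output modality, not just "does it take video," is the whole point: the understanding path and the generation path never resolve to the same entry.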

Omni Flash positioning vs. full Omni

Omni Flash is the variant that actually shipped. It’s the faster, cheaper tier, and reporting puts its clips at around a 10-second cap — described as a deployment choice rather than a hard model limit, which suggests it’ll stretch later. A higher-fidelity, longer-clip Omni tier sits above it, but that’s the one still mostly living in demo reels. So when someone says “Omni is out,” they mean Omni Flash, in the consumer app.

What’s confirmed at I/O 2026

Public-facing announcements vs. developer access timing

I/O ran May 19–20, 2026. Google confirmed Omni, the 3.5 lineup, and a pile of agent features. The split that matters: consumer access shipped, developer access didn’t. Gemini Omni Flash is rolling out to all Google AI Plus, Pro and Ultra subscribers globally through the Gemini app and Google Flow, plus free access on YouTube Shorts. Then the line every builder should highlight: “In the coming weeks, we’ll also be rolling it out to developers and enterprise customers via APIs.”

“Coming weeks.” Not a date. Not a model ID. Not a pricing table.

Where Omni was demonstrated and what was shown

The demos were conversational video editing — change one element, change the whole scene, keep the character consistent across turns. One example: feed it a violinist clip, then issue edits like “transport the violinist to a new environment” and “make the violin invisible,” each building on the last. It’s an editing-first pitch, which is a genuinely different bet than pure text-to-video. Google also confirmed every Omni output carries SynthID watermarking, which matters if your downstream use has provenance requirements.

The gap between demo and developer availability

Here’s where I paused. A consumer rollout and an API are two different products with two different reliability stories. That’s not pessimism. It’s just the difference between “I saw it work in an app” and “I can depend on it at 3am when a batch job fires.”

Why this matters for multimodal builders

Video generation is a separate problem from multimodal understanding

If your product already calls Gemini 3.5 to understand uploaded media, Omni doesn’t replace that. It’s an additional capability — generation — that you’d bolt on, not a drop-in upgrade. Treating them as one line item in your architecture diagram is how you end up surprised.

Where Omni would sit alongside existing video models (Veo, etc.)

This is the practical question. Google already ships Veo 3.1 through documented Gemini API and Vertex AI paths, so today Veo is the Google video model you can actually integrate. Omni isn’t a Veo replacement at the API layer yet — there’s no migration doc saying every Veo route swaps to Omni, and the Veo pages are still live. The difference in intent is real, though: Veo is built around text-to-video, while Omni’s whole thing is multi-input reasoning and conversational editing. For builders, that means Omni is positioned to collapse a chain — transcription, image understanding, generation — into one call. Whether that’s worth rebuilding around depends entirely on whether it ever ships an API.

What changes if Omni reaches developer access

If — when — the API lands, the appeal isn’t raw dollar savings. It’s losing interfaces. One model that takes mixed media and returns edited video is fewer SDKs to maintain, fewer error taxonomies to handle, fewer SLAs to track. That’s the kind of friction reduction I actually care about. But “if it ships” is doing a lot of work in that sentence.

What is not yet known

I’m flagging these because writing around them as if they’re settled is how teams get burned.

  • API availability and timing: needs verification. “Coming weeks” is the only official signal. No endpoint, no model ID in Google’s developer docs as of early June 2026.
  • Pricing for video workloads: unknown. Consumer access is bundled into the Google AI subscription tiers (Plus, Pro, Ultra), but per-call API pricing isn’t published. Video frames push token counts fast, so don’t assume text-model economics.
  • Output specs: Omni Flash clips appear capped near 10 seconds; full resolution, max length, and exact control inputs for the API tier aren’t documented.
  • Commercial use rights: not spelled out for developer use. SynthID watermarking is confirmed; licensing terms for generated output via API are not.
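One way to keep these unknowns from quietly turning into assumptions is to encode them as a gate. The sketch below is my own hypothetical checklist, one field per open question in the list above; the field names and the "watching" / "depends-on" labels are illustrative, not anything Google publishes.

```python
from dataclasses import dataclass, fields


@dataclass
class ReadinessChecklist:
    """Hypothetical checklist: one flag per open question, all unmet by default."""
    api_endpoint_documented: bool = False
    pricing_published: bool = False
    output_specs_documented: bool = False
    commercial_terms_published: bool = False


def missing_items(check: ReadinessChecklist) -> list[str]:
    """Names of the checklist items that are still unanswered."""
    return [f.name for f in fields(check) if not getattr(check, f.name)]


def status(check: ReadinessChecklist) -> str:
    """A model only becomes a dependency once nothing is missing."""
    return "watching" if missing_items(check) else "depends-on"


# Every flag is still False here, matching the list above.
omni = ReadinessChecklist()
print(status(omni), missing_items(omni))
```

When the docs land, flipping a flag is a one-line change, and the status only moves once all four are answered.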

If a vendor or blog tells you the Omni API is “live for everyone,” they’re ahead of the evidence. I checked.

How to plan around Omni today

Don’t migrate production workloads on demos

Obvious, but worth saying because the demos are genuinely impressive and that’s exactly when teams make the mistake. A demo is a best case under controlled conditions. Production is the worst case at 2x volume. Until there’s an endpoint with documented latency and limits, Omni stays in your “watching” column, not your “depends-on” column.

Test fallback paths across multiple video generation models

I run the same generation task against more than one model precisely so that a launch like this doesn’t strand me. If Omni’s API lands next month, great — I want to be able to slot it in and compare against what I’m already using, not rebuild my pipeline to find out it’s worse for my specific scenario. The lesson from every model launch: “really good” in someone else’s demo means nothing until it’s good in your workflow at your frequency.
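A fallback path can be as small as the sketch below. It assumes each backend is just a function from prompt to a clip reference; the two stub backends and their names are hypothetical stand-ins, and a real backend would wrap whatever SDK or HTTP call the vendor documents.

```python
from typing import Callable

# A backend takes a prompt and returns a reference to the generated clip.
Backend = Callable[[str], str]


class AllBackendsFailed(Exception):
    """Raised when every backend in the chain has errored."""


def generate_with_fallback(
    prompt: str, backends: list[tuple[str, Backend]]
) -> tuple[str, str]:
    """Try backends in order; return (backend_name, clip) from the first success."""
    errors = []
    for name, backend in backends:
        try:
            return name, backend(prompt)
        except Exception as exc:  # any failure moves on to the next backend
            errors.append(f"{name}: {exc}")
    raise AllBackendsFailed("; ".join(errors))


# Hypothetical stubs: one model without an endpoint yet, one that works today.
def pending_backend(prompt: str) -> str:
    raise NotImplementedError("no public endpoint yet")


def current_backend(prompt: str) -> str:
    return f"clip-for:{prompt}"


chain = [("pending_model", pending_backend), ("current_model", current_backend)]
print(generate_with_fallback("a violinist on a rooftop", chain))
```

Returning the backend name alongside the clip matters for the comparison step: I want to know which model actually served each request before I decide anything about quality.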

Model aggregation as a hedge against vendor-specific rollouts

This is the part that’s relevant to how I work. When a model ships to consumers first and developers “in the coming weeks,” the builders who aren’t scrambling are the ones who never tied themselves to a single provider’s release calendar. Running video generation through a unified multi-model access layer like WaveSpeedAI means a new model is something you test and adopt when it’s ready, not a migration you’re forced into. I’m not saying that’s right for everyone — if you only ever use one model, the abstraction is overhead. But at my switching frequency, one access layer over many models beats maintaining one integration per vendor. The value isn’t “more models.” It’s not having to rebuild every time the landscape shifts. And this month, it shifted.
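For anyone who wants to see the shape of that idea without committing to a product, here is a generic adapter-registry sketch. It is not WaveSpeedAI’s API or any vendor’s; `VideoBackend`, `StubBackend`, and `AccessLayer` are hypothetical names, and the stub returns a string where a real adapter would call a provider.

```python
from typing import Protocol


class VideoBackend(Protocol):
    """Anything with a generate(prompt) method can be plugged in."""

    def generate(self, prompt: str) -> str: ...


class StubBackend:
    """Stand-in adapter; a real one would wrap a vendor SDK or HTTP call."""

    def __init__(self, label: str) -> None:
        self.label = label

    def generate(self, prompt: str) -> str:
        return f"{self.label}:{prompt}"


class AccessLayer:
    """One call surface over many models; pipeline code never imports a vendor."""

    def __init__(self) -> None:
        self._backends: dict[str, VideoBackend] = {}

    def register(self, name: str, backend: VideoBackend) -> None:
        self._backends[name] = backend

    def available(self) -> list[str]:
        return sorted(self._backends)

    def generate(self, name: str, prompt: str) -> str:
        if name not in self._backends:
            raise KeyError(f"no backend registered under {name!r}")
        return self._backends[name].generate(prompt)


layer = AccessLayer()
layer.register("model_a", StubBackend("a"))
# Adopting a newly released model is one more register() call, not a rewrite.
layer.register("model_b", StubBackend("b"))
print(layer.available())
```

The trade-off is the one described above: if you only ever call one model, this layer is overhead. It earns its keep when the set of registered backends changes more often than the pipeline around it.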

FAQ

Is Gemini Omni available through an API today? No. As of early June 2026, Omni Flash is live in the Gemini app and Google Flow for Google AI Plus, Pro, and Ultra subscribers, with free access on YouTube Shorts. Google says developer and enterprise API access will arrive “in the coming weeks,” but there’s no published model ID, endpoint, or pricing yet.

What’s the difference between Gemini Omni and Gemini 3.5 Pro? Output modality. Omni generates video (output multimodal). Gemini 3.5 Pro understands text, image, audio, and video but outputs text and code (input multimodal). Different jobs, different parts of your stack.

Does Gemini Omni generate video? Yes — that’s its core function. It takes text, images, audio, and video clips as input and produces video, with conversational editing across multiple turns. All outputs carry SynthID watermarks.

When can developers expect access to Omni? Unconfirmed. Google’s only public wording is “in the coming weeks.” Treat any specific date you see elsewhere as speculation until Google publishes API documentation.

How does Omni compare to other video generation models like Veo? Veo 3.1 is the model you can actually call via API today through Gemini API and Vertex AI. Omni’s distinction is multi-input reasoning and editing-first design rather than pure text-to-video. But until Omni’s API ships, Veo remains the practical Google baseline for production video generation.

Conclusion

Where this lands for me: Omni is real, the editing approach is a different and interesting bet, and Gemini 3.5 Omni is worth tracking closely. It is not worth building on this week. Consumer-live, developer-pending, specs-unknown — that’s a watch item, not a dependency.

So the move is boring and correct: keep your video generation flexible, test Omni in the app to form your own opinion, and don’t rearrange your production pipeline until there’s an endpoint you can actually point at. When the API shows up, I’ll run it against what I’m using now and report what held up.

To be verified. More when the docs land.
