Genie 3 Prompts: Writing Effective World Descriptions

Hi, Dora here. In late January 2026, I kept getting floaty, consequence-free worlds from a Genie 3 build I was testing: gorgeous in the first frame, then physics that felt like a dream. My prompts sounded right in my head, but the outputs drifted. Doors didn’t quite open. Gravity forgot itself.

So I slowed down. I treated prompts less like poetry and more like a short, plain spec. Once I did that, the worlds started holding together. Not perfect, but steadier. This is how I now approach Genie 3 prompts, framed by what actually helped on real tasks.

Prompt structure for world models

I stopped writing flowery prompts and started writing small, boring ones, the kind a teammate could skim and build from. World models respond well to that. My baseline looks like four parts:

  • Setting: where and when. Keep it concrete. “Narrow alley at dusk,” not “mysterious urban vibe.”
  • Dynamics: what moves and how. Name forces, constraints, and triggers.
  • Agent: who or what is acting. First-person camera or side-view? Human or object? Any capabilities?
  • Goals/affordances: what can be done here. Doors open, levers pull, ladders climb.

I write these as one to three sentences, then one line of constraints. That’s it. When I go longer, I usually get contradictions (and the model picks the wrong one).

A structure I reused a lot:

  • Sentence 1: a concrete place + time of day + lighting.
  • Sentence 2: the controllable agent + camera + motion verbs.
  • Sentence 3: the key interaction and outcome.
  • Constraint line: 1–3 short constraints (physics, camera, pacing).
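If it helps to see the template as code, here’s a minimal sketch of that four-part assembly in Python. The function and field names are my own shorthand, not anything Genie 3 exposes; it’s just string plumbing that keeps me honest about the structure.

```python
# Minimal sketch of the three-sentence-plus-constraints template above.
# All names are hypothetical -- this is string assembly, not a Genie 3 API.

def build_world_prompt(setting: str, agent: str, interaction: str,
                       constraints: list[str]) -> str:
    """Join the three spec sentences, then append one constraint line."""
    sentences = " ".join(s.rstrip(".") + "." for s in (setting, agent, interaction))
    constraint_line = "Constraints: " + ", ".join(constraints) + "."
    return sentences + "\n" + constraint_line

prompt = build_world_prompt(
    setting="Narrow alley at dusk with wet cobblestone ground",
    agent="First-person camera at walking pace",
    interaction="Reach the metal door and push it inward to open",
    constraints=["steady handheld", "light rain", "gravity down"],
)
print(prompt)
```

The point of the helper is the shape it enforces: three short sentences, one constraint line, nothing else to contradict itself.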

Why this matters: world models don’t just draw; they simulate patterns. If you say “fast” and “steady,” you’re asking for two different rhythms. If you don’t say where gravity points, the model guesses. Reducing ambiguity helps it pick stable defaults.

For a deeper understanding of how Google Genie 3 can be used to simulate these patterns and more, check out our detailed article: What Is Google Genie 3?

Signals that told me the structure was working:

  • Fewer camera jitters across 3–5 generations of the same seed
  • Objects retaining mass from frame to frame (no floaty cups)
  • Interactions completing in under 6 seconds instead of meandering for 15

If a scene kept wobbling, I removed adjectives first rather than adding more. Simpler usually won.

Environment description techniques

Describing environments for a world model is different from styling a single image. I had better luck when I:

  • Anchored space with two or three hard surfaces. “Wet cobblestone ground, brick walls left/right, metal door at end.” Hard surfaces cue contact, reflections, and friction.
  • Named affordances explicitly. If a lever should pull, say “pullable lever at chest height.” If a door should open inward, state the hinge side.
  • Set scale in human terms. “Knee-high curb,” “waist-high railing,” “truck-width alley.” The model snaps movement to these anchors.
  • Gave one light source with direction. “Neon sign above door, purple spill light left to right.” This reduced shadow flicker and helped keep the camera from hunting for interest.
  • Defined clutter as zones, not lists. “Stacked crates along right wall” worked better than naming every object. Too many nouns made the scene noisy without adding useful behavior.
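Those anchors double as a pre-flight check. Here’s a toy lint sketch for the two failure modes I kept hitting: a bare surface with no material, and a time of day with no light source. The word lists and heuristics are entirely my own, not part of any Genie 3 tooling.

```python
# Toy pre-flight lint for environment descriptions.
# Heuristics and word lists are my own guesses, not official tooling.

VAGUE_SURFACES = ("floor", "ground", "wall")
MATERIALS = ("cobblestone", "concrete", "carpet", "rubberized",
             "wooden", "brick", "metal")
TIMES = ("morning", "noon", "dusk", "night", "sunset")
LIGHT_CUES = ("light", "sign", "sun", "lamp", "neon", "fluorescent")

def lint_environment(text: str) -> list[str]:
    """Return warnings for the two ambiguities that caused me trouble."""
    words = text.lower()
    warnings = []
    if any(s in words for s in VAGUE_SURFACES) and not any(m in words for m in MATERIALS):
        warnings.append("surface named without a material: expect slippery physics")
    if any(t in words for t in TIMES) and not any(c in words for c in LIGHT_CUES):
        warnings.append("time of day without a light source: shadows may flicker")
    return warnings

print(lint_environment("Narrow alley at dusk. A door at the end of the floor."))
```

A description like “wet cobblestone ground under a neon sign at dusk” passes clean, which matches what worked in practice.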

Friction I hit:

  • Vague materials led to slippery physics. “Floor” made characters skate; “rubberized gym mat” gave traction.
  • Overstuffed layouts confused pathing. When I jammed six props into a small room, agents hesitated near corners.
  • Time of day without light direction didn’t do much. “Morning” alone rarely stabilized shadows.

When a scene still felt flimsy, I added one more physical cue (like “wind pushing left to right” or “light rain with visible splashes”). Small physical cues improved coherence more than extra style words.

Style and aesthetic control

Style is tempting to chase first. I tried to keep it last. Once the world behaved, I nudged the look:

  • Use one style anchor, not three. “1990s DV cam” or “soft film grain.” Stacking “cinematic, vintage, gritty” muddied motion.
  • Tie style to physics, not just color. “Handheld cam with slight shoulder bob” is a style that also sets camera behavior.
  • Mention lens equivalents only if you must. “28mm wide” sometimes helped with close quarters, but lens talk can overpower motion cues.
  • Texture with verbs, not adjectives. “Dust motes drift in a sun beam” beats “dreamy, hazy, ethereal.” Verbs give the model something to animate.

Compared with video-only models like Runway’s Gen-3, I noticed world-model prompts react more strongly to action and affordances than to pure look. If you come from Gen-3, you might need to dial down your style stack and dial up the space-and-action lines.

When style fought behavior, I removed style first. A plain, believable scene beats a beautiful but slippery one.

10 example prompts analyzed

Below are the Genie 3 prompts I used, or close variants. I ran each 3–5 times in late January 2026, tweaking one variable at a time. I’m showing the prompt and what changed in practice.

Photorealistic scenes

  1. “Narrow alley at dusk with wet cobblestone ground and brick walls left and right. First-person walking pace toward a metal door under a flickering neon sign. Reach for the handle and push the door inward to open.” Constraints: steady handheld, light rain, gravity down.

Result: Door opened in ~4–6s reliably. Light rain helped sell friction; footsteps stopped sliding. Without “push inward,” the door sometimes swung the wrong way.

  2. “Small kitchen at night, overhead fluorescent hum. Third-person, waist-high camera following a person carrying a steaming mug to a wooden table. Set the mug down: small splash: steam curls.” Constraints: no camera dolly, soft clatter, stable shadows.

Result: Steam and small splash appeared in 4/5 runs. If I forgot “wooden table,” the mug glided a touch on glossy surfaces. Naming material mattered.

  3. “Subway platform, off-peak, cool white lighting. Side-view as a commuter steps over a yellow safety line, stops, and steps back.” Constraints: constant speed, no jump cuts.

Result: Clear step-and-correct motion. When I removed “stops and steps back,” the model improvised with a wave or phone check: plausible, but not the point.

  4. “Office corridor with carpet floor, glass walls on the right. First-person jog to a keypad door: hand enters PIN: door clicks open.” Constraints: slight breath noise, wrist-level keypad, gravity down.

Result: Best with “wrist-level keypad.” Without that, hands floated upward. Breath noise (even as a word) nudged pacing and helped avoid robotic motion.

  5. “Parking garage, low ceiling, glossy concrete. Third-person as a rolling suitcase bumps over a speed bump, wobbles, then stabilizes.” Constraints: fixed camera, subtle echo, consistent reflections.

Result: The wobble showed up only when I said “bumps over a speed bump.” If I wrote “crosses a bump,” the wheel wobble often vanished. Verbs with contact cues helped.

Stylized environments

  6. “Side-scrolling paper diorama city at noon. Cardboard buildings, painted clouds on pulleys. A cutout character runs and pulls a red lever: a drawbridge lowers.” Constraints: parallax layers, crisp edges, gravity down.

Result: Lever-and-bridge sequence held up cleanly. When I asked for “vintage watercolor + cardboard + ink,” edges bled and the bridge stuttered. One style anchor kept mechanics intact.

  7. “Low-poly desert canyon in warm sunset light. Third-person as a sphere avatar rolls down a sand slope and banks left onto a plank bridge.” Constraints: constant roll speed, soft skid on sand, no camera roll.

Result: Banked turn worked 3/5 runs. Adding “no camera roll” stopped an annoying tilt that made the slope feel steeper than it was.

  8. “Isometric cozy tavern, pixel art, 32-color palette. A bartender sprite wipes the bar: a patron sprite waves: a hanging sign swings when the door opens.” Constraints: fixed isometric camera, 1 swing period.

Result: The swing synced best when I specified “1 swing period.” Without it, the sign swung too long and pulled attention away from the sprites.

  9. “Ink-and-wash forest path in light fog. First-person steps over a mossy log, camera dips with the step, then recovers.” Constraints: soft footfall, slow head bob, fog stays thin.

Result: Camera dip sold the step. Adding “fog stays thin” prevented the model from hiding the log with dramatic mist.

  10. “Retro DV-cam skatepark, late afternoon. Third-person follow as a skateboarder ollies a small curb, lands, slight wheel chatter.” Constraints: handheld jitter small, curb ankle-high, shadows long.

Result: “Curb ankle-high” fixed scale and improved the ollie height. Without that, the trick sometimes became a hop with no curb contact.

Notes on iteration:

  • I tried each prompt with and without one constraint. Removing “gravity down” made scenes feel floaty again, most obviously in the alley and skatepark.
  • Shorter prompts outperformed longer ones. Most of mine landed at ~30–45 words plus constraints.
  • Seeds (when available) helped me compare changes. I kept a small grid: 3 seeds × 2 variations, ~6 runs per idea. This sounds fussy, but it saved time.
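The seed grid is easy to mechanize. Here’s a sketch of the 3 seeds × 2 variations bookkeeping; the seeds and prompt strings are placeholders, and there’s no public Genie 3 API to call here, so this only plans the runs.

```python
import itertools

# Sketch of the 3 seeds x 2 variations comparison grid I used.
# Seed values and prompt strings are placeholders, not real run data.

SEEDS = [11, 42, 77]
VARIATIONS = {
    "with_gravity": "Narrow alley at dusk... Constraints: steady handheld, gravity down.",
    "no_gravity":   "Narrow alley at dusk... Constraints: steady handheld.",
}

def plan_runs(seeds, variations):
    """One run spec per (seed, variation) pair: 3 x 2 = 6 runs."""
    return [{"seed": s, "variation": name, "prompt": p}
            for s, (name, p) in itertools.product(seeds, variations.items())]

runs = plan_runs(SEEDS, VARIATIONS)
print(len(runs))  # 6
```

Holding seeds fixed across both variations is what makes the before/after comparison meaningful; otherwise you can’t tell a constraint change from seed luck.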

A few limits I couldn’t smooth over:

  • Precise text like keypad digits stayed fuzzy; I focused on the action, not legibility.
  • Long, multi-step puzzles (three or more interactions) tended to drift by step two. Splitting into smaller beats worked better.
  • Highly reflective floors sometimes melted shadows across cuts. Calling out “consistent reflections” helped, but didn’t fix it every time.