Introducing xAI Grok Imagine Video Reference To Video on WaveSpeedAI

Grok Imagine Video Reference-to-Video: Generate Consistent AI Videos from Multiple Reference Images

What if you could hand an AI model seven different reference images — a character, a location, a set of props — and get back a single, coherent video that preserves every visual detail? That’s exactly what Grok Imagine Video Reference-to-Video delivers. Built by xAI, this multi-image reference-to-video model generates dynamic video clips that maintain identity, style, and scene composition across every frame, and it’s now available on WaveSpeedAI with no cold starts and pay-per-use pricing.

In a landscape where AI video generation is rapidly evolving — with Grok Imagine recently claiming the #1 spot on the Artificial Analysis Video Arena for both text-to-video and image-to-video — the reference-to-video variant takes things further by letting you control exactly what appears in your generated video using up to seven source images.

How Grok Imagine Video Reference-to-Video Works

Most AI video generators accept a single image or text prompt. Grok Imagine Video Reference-to-Video breaks that limitation by accepting 1 to 7 reference images alongside a text prompt describing the desired motion, camera movement, and scene.

Here’s the workflow:

Provide reference images — Upload up to 7 images via URL. These can include characters, objects, environments, or style references.
Write a motion prompt — Describe how the scene should move. Use @image1, @image2, etc. to reference specific uploaded images in your prompt.
Choose duration and resolution — Select 6 or 10 seconds of output at 720p or 480p resolution.
Generate — The model synthesizes all references into a single cohesive video with smooth, natural movement.
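
The tagging step in this workflow can be sketched in a few lines of Python. This is a minimal illustration, not part of the WaveSpeed SDK: the helper name `build_prompt_and_images` and the labels are hypothetical, and it assumes that @image1 refers to the first URL in the images array, @image2 to the second, and so on.

```python
def build_prompt_and_images(references, template):
    """Return (images, prompt) with {label} placeholders replaced by @imageN tags.

    references: ordered mapping of a label to a reference image URL.
    template:   prompt text that uses {label} placeholders.
    """
    if not 1 <= len(references) <= 7:
        raise ValueError("expected 1 to 7 reference images")
    images = list(references.values())
    # Assumption: @image1 is the first URL in the array, @image2 the second, ...
    tags = {label: f"@image{i}" for i, label in enumerate(references, start=1)}
    return images, template.format(**tags)


references = {
    "hero": "https://example.com/character-front.jpg",
    "street": "https://example.com/background-scene.jpg",
}
images, prompt = build_prompt_and_images(
    references,
    "{hero} walks slowly through {street} as the camera tracks alongside",
)
print(prompt)  # @image1 walks slowly through @image2 as the camera tracks alongside
```

Keeping labels in one mapping means the tags stay correct if you reorder or swap reference images while iterating.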

Under the hood, Grok Imagine Video is powered by xAI’s Aurora engine, an autoregressive mixture-of-experts architecture trained on billions of examples. The model predicts image tokens sequentially, which gives it tight control over generation and helps maintain visual consistency across frames — critical for multi-reference scenarios where identity preservation matters most.

Try Grok Imagine Video Reference-to-Video on WaveSpeedAI →

Key Features of Grok Imagine Video Reference-to-Video

Multi-image reference input (up to 7 images) — Feed the model a character from one photo, a background from another, and props from several more. The model composites them into a unified scene.
Identity and style preservation — Characters, objects, and environments maintain consistent appearance throughout the generated video. Facial features, clothing details, and proportions stay locked across frames.
Addressable image references — Use @image1, @image2, etc. in your prompt to direct exactly how each reference image influences the output.
Flexible duration options — Generate 6-second clips for quick tests and social content, or 10-second videos for more complete scenes.
720p and 480p resolution — Choose higher quality for final output or faster 480p processing for rapid iteration.
REST API access on WaveSpeedAI — No cold starts, instant inference, and simple pay-per-use billing at $0.05 per second.

Best Use Cases for Grok Imagine Video Reference-to-Video

Consistent Character Videos Across Multiple Shots

Film and animation projects demand character consistency across scenes. Feed the model reference images of a character from multiple angles — front, profile, three-quarter — and generate video clips where that character moves naturally while maintaining their exact appearance. This is invaluable for creators building episodic content or multi-scene narratives without a full production pipeline.

Product Showcase Videos from Product Photos

E-commerce teams can transform a set of static product photos into dynamic showcase videos. Upload images of a product from different angles, in different settings, or alongside complementary items, then describe the motion — a slow rotation, an unboxing sequence, or a lifestyle demonstration. The model preserves product details faithfully across the generated video.

Social Media Content Creation at Scale

Content creators for TikTok, Instagram Reels, and YouTube Shorts can generate engaging video clips from image collections in seconds. Combine a creator’s photo with a branded background and product imagery to produce on-brand video content without hiring a videographer or editing footage manually.

Multi-Angle Scene Composition

Architectural visualization, interior design, and real estate professionals can provide reference images from different angles of a space, then generate walkthrough-style videos that maintain spatial accuracy and design consistency. Describe camera movement through the space, and the model synthesizes a cohesive scene.

Brand-Consistent Marketing Videos

Marketing teams working with strict brand guidelines can provide brand assets — logos, color palettes, product imagery, spokesperson photos — as reference images. The model generates video content that stays on-brand without requiring manual post-production alignment.

Storyboard-to-Video Prototyping

Creative directors and storyboard artists can upload individual storyboard frames as reference images and generate rough video prototypes that show how a sequence might flow. This dramatically speeds up the pre-production review process for commercial and narrative projects.

Grok Imagine Video Reference-to-Video Pricing and API Access

Grok Imagine Video Reference-to-Video is available on WaveSpeedAI with straightforward per-second billing:

Duration	Cost
6 seconds	$0.30
10 seconds	$0.50

Billing rate: $0.05 per second, based on selected duration.
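
As a quick check of the arithmetic, the sketch below computes clip costs from the $0.05 per second rate. The function names are illustrative only, and the calculation works in whole cents to avoid floating-point rounding.

```python
RATE_CENTS_PER_SECOND = 5      # $0.05 per second, from the pricing above
ALLOWED_DURATIONS = (6, 10)    # the two durations the model offers


def clip_cost_cents(duration_seconds, clips=1):
    """Cost in cents for `clips` videos of the selected duration."""
    if duration_seconds not in ALLOWED_DURATIONS:
        raise ValueError("duration must be 6 or 10 seconds")
    return RATE_CENTS_PER_SECOND * duration_seconds * clips


def as_dollars(cents):
    """Format a whole number of cents as a dollar string."""
    return f"${cents / 100:.2f}"


print(as_dollars(clip_cost_cents(6)))    # $0.30
print(as_dollars(clip_cost_cents(10)))   # $0.50

# Five 6-second drafts followed by one 10-second final render.
session = clip_cost_cents(6, clips=5) + clip_cost_cents(10)
print(as_dollars(session))               # $2.00
```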

This is significantly more affordable than many competing platforms. Because WaveSpeedAI has no cold starts and offers instant inference, you get fast results without paying for idle compute time.

API Code Example

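import wavespeed

# Submit a reference-to-video job: 1 to 7 image URLs plus a motion prompt.
output = wavespeed.run(
    "x-ai/grok-imagine-video/reference-to-video",
    {
        "images": [
            "https://example.com/character-front.jpg",
            "https://example.com/character-side.jpg",
            "https://example.com/background-scene.jpg"
        ],
        # The prompt addresses the three reference images as @image1 to @image3.
        "prompt": "@image1 and @image2 show a character who walks through the scene in @image3, looking around with natural movement",
        "duration": 10,        # 6 or 10 seconds
        "resolution": "720p"   # "720p" or "480p"
    },
)

# Print the first entry of the returned outputs.
print(output["outputs"][0])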

API Parameters

Parameter	Required	Description
`images`	Yes	Array of 1–7 reference image URLs
`prompt`	Yes	Motion description with optional @image references
`duration`	No	6 or 10 seconds (default varies)
`resolution`	No	`720p` (default) or `480p`
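
Checking a request locally before sending it can catch mistakes that would otherwise cost a generation. The sketch below encodes only the constraints from the table above. `validate_request` is a hypothetical helper, not an SDK function, and the check that every @imageN tag points at a supplied image assumes tags are numbered in array order.

```python
import re

ALLOWED_DURATIONS = {6, 10}
ALLOWED_RESOLUTIONS = {"720p", "480p"}


def validate_request(images, prompt, duration=None, resolution=None):
    """Return a payload dict, or raise ValueError if a documented limit is broken."""
    if not 1 <= len(images) <= 7:
        raise ValueError("images must contain 1 to 7 URLs")
    if not prompt.strip():
        raise ValueError("prompt is required")
    # Assumption: @imageN is 1-indexed and follows the order of `images`.
    for number in re.findall(r"@image(\d+)", prompt):
        if not 1 <= int(number) <= len(images):
            raise ValueError(f"@image{number} has no matching reference image")

    payload = {"images": list(images), "prompt": prompt}
    # Optional fields are left out when unset so the service default applies.
    if duration is not None:
        if duration not in ALLOWED_DURATIONS:
            raise ValueError("duration must be 6 or 10")
        payload["duration"] = duration
    if resolution is not None:
        if resolution not in ALLOWED_RESOLUTIONS:
            raise ValueError("resolution must be '720p' or '480p'")
        payload["resolution"] = resolution
    return payload


payload = validate_request(
    ["https://example.com/character-front.jpg", "https://example.com/background-scene.jpg"],
    "@image1 walks through the scene in @image2",
    duration=6,
)
print(payload["duration"])  # 6
```

The returned dictionary has the same shape as the second argument to `wavespeed.run` in the example above.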

Get started with Grok Imagine Video Reference-to-Video →

Tips for Best Results with Grok Imagine Video

Use high-quality, well-lit reference images. The model’s identity preservation is only as good as the input. Sharp, evenly lit photos produce cleaner, more consistent video output.
Reference images explicitly in your prompt. Use @image1, @image2, etc. to tell the model which reference corresponds to which element in your scene. This gives you precise compositional control.
Keep references and prompt aligned. If your reference images show a specific character, describe that character’s actions in the prompt. Misaligned references and prompts produce confused output.
Start with fewer references, then add more. Begin with 2–3 images to establish the core scene, then add references for additional detail. This helps you isolate which images contribute what to the final output.
Test with 6-second clips first. Use the shorter duration to iterate on your prompt and reference combination before committing to 10-second generations. At $0.30 per test, rapid iteration is affordable.
Try 480p for drafts, 720p for finals. Use lower resolution during the creative exploration phase, then switch to 720p for the final output.
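
Several of these tips combine into a simple iteration plan: draft at 6 seconds and 480p with a growing set of references, then render once at 10 seconds and 720p. The sketch below only builds the request payloads. The names `plan_runs` and `prompt_for` are hypothetical, and each prompt is expected to mention only the images included in that run.

```python
DRAFT = {"duration": 6, "resolution": "480p"}    # cheap settings for iteration
FINAL = {"duration": 10, "resolution": "720p"}   # settings for the final render


def incremental_reference_sets(images, start=2):
    """Yield growing prefixes of `images`: the first `start`, then one more each time."""
    first = min(start, len(images))
    for count in range(first, len(images) + 1):
        yield images[:count]


def plan_runs(images, prompt_for):
    """Build draft payloads for each reference subset, then one final payload.

    prompt_for(count) must return a prompt that uses only @image1..@image{count}.
    """
    if not 1 <= len(images) <= 7:
        raise ValueError("expected 1 to 7 reference images")
    runs = [
        {"images": subset, "prompt": prompt_for(len(subset)), **DRAFT}
        for subset in incremental_reference_sets(images)
    ]
    runs.append({"images": list(images), "prompt": prompt_for(len(images)), **FINAL})
    return runs


def prompt_for(count):
    parts = ["@image1 walks through the scene in @image2"]
    if count >= 3:
        parts.append("holding the prop from @image3")
    return ", ".join(parts)


urls = [
    "https://example.com/character-front.jpg",
    "https://example.com/background-scene.jpg",
    "https://example.com/prop.jpg",
]
runs = plan_runs(urls, prompt_for)
print(len(runs))  # 3
```

Each payload can then be passed to the API one at a time, so you can stop adding references as soon as a draft looks right.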

Explore Related Grok Imagine Models on WaveSpeedAI

Grok Imagine Video Reference-to-Video is part of a broader family of xAI video and image models available on WaveSpeedAI:

Grok Imagine Video Image-to-Video — Generate video from a single image input
Grok Imagine Video Text-to-Video — Create video from text prompts alone
Grok Imagine Video Extend — Extend existing videos with smooth continuation
Grok Imagine Video Edit — Edit existing videos with text instructions
Grok Imagine Image Text-to-Image — Generate images from text prompts

Frequently Asked Questions About Grok Imagine Video Reference-to-Video

What is Grok Imagine Video Reference-to-Video?

Grok Imagine Video Reference-to-Video is xAI’s multi-image reference model that generates videos from up to 7 reference images, preserving identity, style, and scene composition with smooth natural movement.

How much does Grok Imagine Video Reference-to-Video cost?

Pricing is $0.05 per second — $0.30 for a 6-second video and $0.50 for a 10-second video. Billing is based on selected duration, and there are no subscription fees on WaveSpeedAI. You pay only for what you generate.

Can I use Grok Imagine Video Reference-to-Video via API?

Yes. Grok Imagine Video Reference-to-Video is available as a REST API on WaveSpeedAI with no cold starts, instant inference, and simple pay-per-use billing. You can integrate it into any application using the WaveSpeed Python SDK or direct HTTP requests.

How many reference images can I use with Grok Imagine Video?

You can provide between 1 and 7 reference images. Each image can represent a different element — characters, objects, backgrounds, or style references — and you can address them individually in your prompt using @image1 through @image7.

How does Grok Imagine Video compare to other AI video models?

Grok Imagine recently ranked #1 on the Artificial Analysis Video Arena for both text-to-video and image-to-video generation, outperforming Runway Gen-4.5, Sora 2 Pro, and Google Veo 3.1. The reference-to-video variant adds multi-image control with up to 7 reference images, where most competitors accept 4 or fewer.

Ready to generate consistent, identity-preserving videos from multiple reference images? Try Grok Imagine Video Reference-to-Video on WaveSpeedAI — no cold starts, affordable per-second pricing, and instant API access.
