Introducing ByteDance Seedance 2.0 Text-to-Video on WaveSpeedAI
Seedance 2.0 Text-to-Video generates Hollywood-grade cinematic videos from text prompts with native audio-visual synchronization, director-level camera control, and exceptional motion stability.
Introducing ByteDance Seedance 2.0 Text-to-Video on WaveSpeedAI: A New Era of Cinematic AI Video
Generative video has spent the last two years catching up to professional production. Most models still ship without sound, lose subjects mid-shot, or collapse the moment a prompt asks for a real camera move. Today we are happy to announce that ByteDance Seedance 2.0 Text-to-Video is now available on WaveSpeedAI — a flagship video model that generates Hollywood-grade cinematic clips from text alone, with native audio baked in and director-level control over the camera.
If you have been waiting for a text-to-video model you can drop into a real production pipeline, this is the one to try.
What is Seedance 2.0 Text-to-Video?
Seedance 2.0 is the latest generation of ByteDance’s Seed video family, built on a unified multimodal architecture that natively accepts text, image, audio, and video inputs in a single model. The Text-to-Video mode turns a written scene description into a finished cinematic clip.
Three things set Seedance 2.0 apart:
- Audio is generated together with the video in a single pass, with synchronized dialogue, foley, and ambience — no separate audio stack required.
- Camera, lighting, and performance are controllable through plain English — ask for a slow dolly in, dramatic rim light, or a specific facial expression and the model follows.
- Motion is stable across long shots, with consistent subjects, plausible physics, and clean transitions out to 15 seconds.
The model is exposed through a single endpoint, bytedance/seedance-2.0/text-to-video, with outputs from 480p up to 1080p across six aspect ratios.
Key Features
Unified Multimodal Architecture
Seedance 2.0 is not a stack of bolt-on adapters. The same underlying model handles text, image, audio, and video conditioning, which means you can stay on a single endpoint as your prompts grow more sophisticated — adding reference images for character consistency, reference videos for motion style, or reference audio for tone, all without switching models.
Native Audio-Visual Synchronization
Most text-to-video models hand you a silent clip and leave audio as a separate problem. Seedance 2.0 generates synchronized audio inline with the video, so dialogue lip-syncs, footsteps land on the right frames, and atmosphere matches the on-screen mood. The result is a clip that feels finished the moment it lands, not a rough draft waiting for post.
Director-Level Control
Seedance 2.0 reads prompts the way a director reads a shot list. Camera moves (push in, crane up, whip pan), lighting setups (golden hour, rim light, low-key), shadow direction, lens feel, and even character performance can be specified in natural language and the model honors them. This is the difference between “AI video” and a usable take.
Production-Grade Cinematic Quality
Visually, the model targets the look of professional cinema rather than generic stock footage: dramatic lighting, considered color grading, smooth natural motion, and strong subject coherence. It holds up well on a 1080p timeline, not just as a thumbnail.
Exceptional Motion Stability
Long shots are where most video models fall apart. Seedance 2.0 maintains stable subjects, consistent physics, and fluid transitions across the full duration range, which lets you actually use 10- and 15-second outputs as finished shots instead of as raw material to cut down.
Strong Instruction Adherence
Detailed scene descriptions, shot compositions, and creative direction are followed closely. You can layer specifics — wardrobe, props, blocking, mood — and expect them to land in the output rather than being averaged away.
Use Cases
- Film and TV pre-visualization — Block out shots and sequences before committing crew and budget. Generate animatics that already include sound design.
- Commercials and brand ads — Produce premium 5- to 15-second spots with cinematic lighting and synchronized voiceover or music beds.
- Music videos — Create stylized performance and narrative cuts with native audio sync, then drop in a final track.
- Premium social content — Stand out in a 9:16 feed with film-grade short-form clips that look authored, not generated.
- Education and explainers — Visualize abstract concepts, historical scenes, or scientific phenomena with clear motion and built-in narration cues.
- Concept and pitch decks — Sell film, TV, and game concepts to producers and publishers with production-quality moving previews instead of static boards.
- Game cinematics and trailers — Prototype trailer beats and key cinematic moments early in development.
Parameters
| Parameter | Required | Description |
|---|---|---|
| prompt | Yes | Detailed description of the cinematic scene |
| aspect_ratio | No | Output format: 16:9 (default), 9:16, 4:3, 3:4, 1:1, 21:9 |
| duration | No | Video length in seconds: 4–15 (default: 5) |
| resolution | No | Output resolution: 480p, 720p (default), or 1080p |
| reference_images | No | Reference image URLs to guide style, characters, or composition |
| reference_videos | No | Reference video URLs (total length must not exceed 15 seconds) |
| reference_audios | No | Reference audio URLs (total length must not exceed 15 seconds) |
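It can help to check a request against these ranges before spending credits on it. The following is a minimal sketch of such a check, not part of the WaveSpeed SDK: `build_payload` is a hypothetical helper, the allowed values are copied from the table above, and emitting `duration` as a string is an assumption that simply mirrors the SDK example in this article.

```python
# Allowed values copied from the parameter table above.
ASPECT_RATIOS = {"16:9", "9:16", "4:3", "3:4", "1:1", "21:9"}
RESOLUTIONS = {"480p", "720p", "1080p"}


def build_payload(prompt, aspect_ratio="16:9", duration=5, resolution="720p"):
    """Hypothetical helper: validate inputs locally and return a request payload."""
    if not prompt or not prompt.strip():
        raise ValueError("prompt is required")
    if aspect_ratio not in ASPECT_RATIOS:
        raise ValueError(f"unsupported aspect_ratio: {aspect_ratio!r}")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(duration, bool) or not isinstance(duration, int):
        raise ValueError("duration must be an integer number of seconds")
    if not 4 <= duration <= 15:
        raise ValueError("duration must be between 4 and 15 seconds")
    if resolution not in RESOLUTIONS:
        raise ValueError(f"unsupported resolution: {resolution!r}")
    return {
        "prompt": prompt.strip(),
        "aspect_ratio": aspect_ratio,
        # Assumption: sent as a string, as in this article's SDK example.
        "duration": str(duration),
        "resolution": resolution,
    }


payload = build_payload("A quiet harbor at dawn", duration=10, resolution="1080p")
```

The defaults in the signature match the defaults listed in the table, so omitting an argument should behave the same as omitting the field from the request.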
Pricing
| Resolution | Duration | Without Reference Videos | With Reference Videos |
|---|---|---|---|
| 480p | 5 s | $0.60 | $1.20 |
| 480p | 10 s | $1.20 | $2.40 |
| 480p | 15 s | $1.80 | $3.60 |
| 720p | 5 s | $1.20 | $2.40 |
| 720p | 10 s | $2.40 | $4.80 |
| 720p | 15 s | $3.60 | $7.20 |
| 1080p | 5 s | $3.00 | $6.00 |
| 1080p | 10 s | $6.00 | $12.00 |
| 1080p | 15 s | $9.00 | $18.00 |
Pricing scales linearly with duration across the full 4–15 second range. The base rate is $0.60 per 5 seconds at 480p; 720p is 2x base, 1080p is 5x base, and adding reference videos doubles the price.
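The rule above reduces to a per-second rate of $0.12 at 480p ($0.60 / 5 s), multiplied by the resolution factor and doubled when reference videos are used. Here is a rough estimator built on that reading; `estimate_price` is a hypothetical helper, and how the platform rounds durations that are not in the table is an assumption (exact linear scaling is used here).

```python
from decimal import Decimal

BASE_PER_SECOND = Decimal("0.12")  # $0.60 per 5 seconds at 480p
RESOLUTION_MULTIPLIER = {"480p": 1, "720p": 2, "1080p": 5}


def estimate_price(resolution, duration, with_reference_videos=False):
    """Hypothetical helper: estimated price in USD for one generation."""
    if resolution not in RESOLUTION_MULTIPLIER:
        raise ValueError(f"unsupported resolution: {resolution!r}")
    if not 4 <= duration <= 15:
        raise ValueError("duration must be between 4 and 15 seconds")
    price = BASE_PER_SECOND * duration * RESOLUTION_MULTIPLIER[resolution]
    if with_reference_videos:
        price *= 2  # reference videos double the price
    return price.quantize(Decimal("0.01"))


print(estimate_price("720p", 10))          # matches the 720p / 10 s table row
print(estimate_price("1080p", 15, True))   # matches the 1080p / 15 s row with reference videos
```

`Decimal` is used instead of floats so the estimates line up with the table to the cent.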
Code Example
Call the model with the WaveSpeed Python SDK:
import wavespeed

output = wavespeed.run(
    "bytedance/seedance-2.0/text-to-video",
    {
        "prompt": "A lone astronaut walks across a windswept red desert at golden hour, dramatic rim light, slow dolly in, cinematic 35mm look, distant mountains, swirling dust",
        "aspect_ratio": "16:9",
        "duration": "10",
        "resolution": "1080p",
    },
)
print(output["outputs"][0])
You can layer in reference_images, reference_videos, or reference_audios to lock down style, motion, or audio tone when you need stronger guidance.
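As a sketch of what that layering might look like, the helper below attaches reference inputs to an existing payload and checks the 15-second totals from the parameter table first. Several things here are assumptions: `add_references` is a hypothetical helper, the reference fields are assumed to take lists of URLs, the example.com URLs are placeholders, and clip lengths are supplied by the caller because this code does not inspect the files.

```python
MAX_REFERENCE_SECONDS = 15  # limit stated in the parameter table


def add_references(payload, images=(), videos=(), audios=()):
    """Hypothetical helper. videos and audios are (url, length_in_seconds) pairs."""
    for name, clips in (("reference_videos", videos), ("reference_audios", audios)):
        total = sum(seconds for _, seconds in clips)
        if total > MAX_REFERENCE_SECONDS:
            raise ValueError(f"{name} total {total}s exceeds {MAX_REFERENCE_SECONDS}s")
    result = dict(payload)  # shallow copy; the caller's dict is left untouched
    if images:
        result["reference_images"] = list(images)
    if videos:
        result["reference_videos"] = [url for url, _ in videos]
    if audios:
        result["reference_audios"] = [url for url, _ in audios]
    return result


base = {
    "prompt": "A lone astronaut walks across a windswept red desert, slow dolly in",
    "aspect_ratio": "16:9",
    "duration": "10",
    "resolution": "720p",
}
payload = add_references(
    base,
    images=["https://example.com/astronaut.png"],   # placeholder URL
    videos=[("https://example.com/dolly.mp4", 6.0)],  # placeholder URL, 6 s clip
)
```

The resulting `payload` would be passed as the second argument to `wavespeed.run` in place of the plain dictionary. Keep in mind that including reference videos doubles the price.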
Pro Tips
- Write like a director. Specify lighting (e.g. “soft window light, long shadows”), lens feel, camera move, and subject action. Vague prompts get vague shots.
- Pick the aspect ratio first. 16:9 for cinematic widescreen, 9:16 for premium vertical, 21:9 for anamorphic-style frames.
- Iterate at 480p or 720p. Lock the composition and motion at a cheap resolution, then re-render the winner at 1080p.
- Start short, then extend. Begin at 4–5 seconds to dial in look and tone, then push out to 10–15 seconds once the prompt is right.
- Lean into audio cues. Mention dialogue intent, music mood, or ambient sound — native audio responds to these as part of the prompt.
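One way to make the "write like a director" habit repeatable is to fill in a small shot sheet and join it into a prompt. This is only an illustrative sketch: the `Shot` class and its field names are hypothetical, and the model does not require any particular prompt structure; the comma-separated order simply imitates the prompt in the code example above.

```python
from dataclasses import dataclass


@dataclass
class Shot:
    """Hypothetical shot sheet; empty fields are skipped when building the prompt."""
    subject_action: str
    lighting: str = ""
    camera_move: str = ""
    lens_feel: str = ""
    audio: str = ""

    def to_prompt(self):
        parts = [self.subject_action, self.lighting, self.camera_move,
                 self.lens_feel, self.audio]
        # Drop blank fields and join the rest into one comma-separated prompt.
        return ", ".join(p.strip() for p in parts if p.strip())


shot = Shot(
    subject_action="A lone astronaut walks across a windswept red desert",
    lighting="golden hour with dramatic rim light",
    camera_move="slow dolly in",
    lens_feel="cinematic 35mm look",
    audio="low wind and distant radio chatter",
)
prompt = shot.to_prompt()
print(prompt)
```

Keeping the fields separate makes it easy to change one element, such as the camera move, between draft renders while everything else stays fixed.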
FAQ
Does Seedance 2.0 Text-to-Video really generate audio? Yes. Native audio-visual synchronization is built in, so videos come back with synchronized sound generated in the same pass. You do not need to run a separate text-to-audio or voice model.
What is the maximum clip length? The maximum is 15 seconds and the minimum is 4 seconds. You can request any integer duration in that range; pricing scales linearly with duration.
Which resolutions and aspect ratios are supported? Output resolutions are 480p, 720p (default), and 1080p. Aspect ratios are 16:9 (default), 9:16, 4:3, 3:4, 1:1, and 21:9.
When should I use reference inputs? Reference images help anchor characters, style, or composition. Reference videos guide motion or shot style (note: this doubles the price). Reference audios shape tone, music, or voice. Per the parameter table, reference videos and reference audios are each limited to a total length of 15 seconds.
How does Seedance 2.0 Text-to-Video compare to the Image-to-Video and Fast variants? Text-to-Video starts from a prompt alone and is the right pick when you have no source frame. Image-to-Video animates an existing image. Fast Text-to-Video trades some quality for cheaper, quicker generations — great for iteration and high-volume use cases.
Related Models
- Seedance 2.0 Image-to-Video — Animate a still image with the same Seedance 2.0 architecture.
- Seedance 2.0 Fast Text-to-Video — Faster, lower-cost text-to-video for iteration and scale.
- Seedance 2.0 Fast Image-to-Video — Fast image-conditioned video generation.
- Seedance V1.5 Pro Text-to-Video — Previous-generation Seedance model.
Get Started
Seedance 2.0 Text-to-Video runs on WaveSpeedAI’s optimized inference stack with no cold starts, predictable pricing, and a single REST API. Whether you are pre-vizing a feature, cutting a brand spot, or building the next AI-native video product, this model gives you cinematic output and native audio in one call.
Try Seedance 2.0 Text-to-Video on WaveSpeedAI and start shooting with prompts.

