
Introducing Alibaba WAN 2.7 Text-to-Video on WaveSpeedAI



WAN 2.7 Text-to-Video: Cinematic AI Video Generation with Audio-Synced Motion

WAN 2.7 Text-to-Video is Alibaba’s latest cinematic AI video generation model, turning plain text prompts into coherent, high-quality clips with stable motion, crisp detail, and strong instruction-following. Now available on WaveSpeedAI, WAN 2.7 brings audio input support, negative prompt control, and flexible resolution options to creators building ads, explainers, music videos, and social content at scale.

For teams that need broadcast-ready output without a production crew, WAN 2.7 closes the gap between text prompt and finished clip — generating up to 1080p video that respects camera direction, lighting cues, and subject behavior described in natural language.

Try WAN 2.7 Text-to-Video on WaveSpeedAI →

How WAN 2.7 Text-to-Video Works

WAN 2.7 is a diffusion-based text-to-video model that interprets natural language prompts and synthesizes them into temporally coherent video. Unlike earlier text-to-video systems that struggled with object consistency across frames, WAN 2.7 maintains stable identity, plausible physics, and smooth camera motion across the full clip duration.

The model accepts a primary prompt and a range of optional controls:

  • Resolution: 720p (default) or 1080p output
  • Aspect ratio: 16:9 default, with flexible options for 9:16 vertical, 1:1 square, and cinematic widescreen formats
  • Duration: 5, 10, or 15 seconds per clip
  • Negative prompt: Exclude unwanted artifacts, styles, or elements
  • Audio input: Upload a track to synchronize visual rhythm and pacing
  • Prompt expansion: An optional mode that automatically enriches sparse prompts with cinematic detail before generation
  • Seed: Fix outputs for reproducible iteration
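As a rough illustration of how these controls might map onto a request, the sketch below checks the documented resolution and duration values before building an input dictionary. The `prompt`, `resolution`, `aspect_ratio`, and `duration` keys mirror the API example later in this post; the `seed` key name and the `build_input` helper are assumptions made for illustration, not confirmed parts of the WaveSpeedAI API.

```python
# Allowed values taken from the option list above.
ALLOWED_RESOLUTIONS = {"720p", "1080p"}
ALLOWED_DURATIONS = {5, 10, 15}


def build_input(prompt, resolution="720p", aspect_ratio="16:9", duration=5, seed=None):
    """Return a request input dict, rejecting values the option list does not include.

    `build_input` is a hypothetical helper; the `seed` key name is an assumption.
    """
    if not prompt or not prompt.strip():
        raise ValueError("prompt must be a non-empty string")
    if resolution not in ALLOWED_RESOLUTIONS:
        raise ValueError(f"unsupported resolution: {resolution!r}")
    if duration not in ALLOWED_DURATIONS:
        raise ValueError(f"unsupported duration: {duration!r}")

    payload = {
        "prompt": prompt.strip(),
        "resolution": resolution,
        "aspect_ratio": aspect_ratio,  # passed through unchecked
        "duration": duration,
    }
    if seed is not None:
        payload["seed"] = seed  # assumed key name
    return payload


if __name__ == "__main__":
    print(build_input("A lighthouse at dusk, slow aerial orbit", duration=10, seed=42))
```

Validating locally like this is optional, but it can catch a typo such as "1080" before a paid request is sent.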

Audio-conditioned generation is what sets WAN 2.7 apart from most text-to-video APIs. Where competing models render visuals in isolation, WAN 2.7 can align cuts, motion intensity, and pacing to a music track or voiceover, which makes it directly useful for music videos, ad spots, and narrated explainers.

Key Features of WAN 2.7 Text-to-Video

  • Cinematic visual quality — produces detailed scenes with accurate lighting, depth, and composition that hold up at 1080p delivery resolution.
  • Audio-synchronized output — supply an audio track and the model paces motion to match, eliminating the manual cut-and-trim step in post.
  • Strong instruction-following — camera moves, color palettes, and subject behavior described in the prompt land in the generated video reliably.
  • Negative prompt control — explicitly exclude common artifacts (blurry faces, distorted limbs, unwanted text) for cleaner output.
  • Prompt expansion mode — short prompts get auto-enriched with scene detail, ideal for batch workflows where you don’t want to write paragraph-length descriptions.
  • Reproducible generations — fix the seed once you find a result you like and iterate on resolution or duration without losing the look.
  • Production-ready resolutions — 720p for fast turnaround, 1080p for client-grade deliverables.

Best Use Cases for WAN 2.7 Text-to-Video

Cinematic Storytelling and Narrative Shorts

Filmmakers and storytellers can render atmospheric, narrative-driven scenes from detailed prompts — describing camera angle, lighting style, mood, and subject action in one paragraph and getting back a usable cinematic shot. WAN 2.7’s stable motion makes it strong for establishing shots, dream sequences, and stylized narrative inserts.

Social Media Content at Scale

Vertical 9:16 output, 5-second clip lengths, and quick generation make WAN 2.7 ideal for TikTok, Instagram Reels, and YouTube Shorts. Brands can spin up dozens of platform-native variations from a single concept brief — testing hooks and visual styles without booking a single shoot day.
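One way to turn a single concept brief into a batch of variations is to combine a few hooks with a few visual styles. The sketch below only builds the input dictionaries; `variation_inputs` is a hypothetical helper, and the way the prompt is assembled is just one possible convention.

```python
from itertools import product


def variation_inputs(concept, hooks, styles):
    """Yield one vertical 5-second input per (hook, style) combination."""
    for hook, style in product(hooks, styles):
        yield {
            "prompt": f"{hook}. {concept}, {style}",
            "aspect_ratio": "9:16",
            "duration": 5,
        }


if __name__ == "__main__":
    inputs = list(variation_inputs(
        "a runner lacing up trail shoes at sunrise",
        hooks=["Extreme close-up on the laces", "Wide shot from behind"],
        styles=["warm golden-hour grade", "high-contrast monochrome"],
    ))
    print(len(inputs))  # 2 hooks x 2 styles = 4 inputs
```

Each dictionary could then be submitted as its own generation request.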

Marketing and Advertising Production

Agencies producing pre-roll ads, product teasers, and explainer videos can replace stock footage with custom-generated scenes that match exact brand requirements. The 15-second duration option fits standard ad placements, and 1080p output meets most digital-ad delivery specs out of the box.

Music Videos and Audio-Visual Sync

The audio input feature is purpose-built for music creators. Upload a track, describe the visual world, and WAN 2.7 generates video that pulses with the music — drum hits aligned to camera cuts, mood shifts mirrored in lighting changes. Independent musicians can produce full visualizers without hiring a director.

Concept Visualization for Pitching

Creative directors, product designers, and game studios can use WAN 2.7 to bring early-stage ideas to life before committing to production. A 5-second clip is enough to communicate tone, palette, and motion language to stakeholders — turning slide-deck concepts into moving previews in minutes.

Explainer and Educational Content

Course creators and SaaS marketing teams can illustrate abstract concepts — data flows, biological processes, historical scenes — with cinematic clips that hold attention better than animated diagrams. Pair the generated video with voiceover by uploading the narration as the audio input.

Branded Content for E-Commerce

Direct-to-consumer brands can generate lifestyle B-roll featuring their product category — cooking shots for kitchenware, outdoor scenes for apparel, ambient settings for home goods — at a fraction of the cost of contracting a video team.

Generate your first WAN 2.7 video →

WAN 2.7 Pricing and API Access

WAN 2.7 Text-to-Video is billed per second of generated video, with a clear flat rate at each resolution tier:

Duration   720p    1080p
5s         $0.50   $0.75
10s        $1.00   $1.50
15s        $1.50   $2.25

  • 720p: $0.10 per second
  • 1080p: $0.15 per second (1.5× base rate)
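Because billing is per second, estimating cost is a single multiplication. The sketch below reproduces the figures above from the two published rates; `estimate_cost` is a hypothetical helper name, and it uses `Decimal` so the dollar amounts stay exact.

```python
from decimal import Decimal

# Per-second rates quoted above, in US dollars.
RATE_PER_SECOND = {
    "720p": Decimal("0.10"),
    "1080p": Decimal("0.15"),
}


def estimate_cost(duration_seconds, resolution="720p", clips=1):
    """Estimate the cost of `clips` generations of the given length and resolution."""
    if resolution not in RATE_PER_SECOND:
        raise ValueError(f"no rate known for resolution: {resolution!r}")
    return RATE_PER_SECOND[resolution] * duration_seconds * clips


if __name__ == "__main__":
    for seconds in (5, 10, 15):
        low = estimate_cost(seconds, "720p")
        high = estimate_cost(seconds, "1080p")
        print(f"{seconds}s  720p=${low}  1080p=${high}")
```

The optional `clips` argument is there for budgeting batches, for example twenty 5-second drafts at 720p.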

There are no subscription fees, no minimum commitments, and no cold starts — pay only for what you generate. WaveSpeedAI’s inference infrastructure means your first request runs at the same latency as your thousandth.

API Example

Generating a video takes a single call with the WaveSpeed Python SDK:

import wavespeed

# First argument: the model identifier. Second argument: the input parameters.
output = wavespeed.run(
    "alibaba/wan-2.7/text-to-video",
    {
        "prompt": "A neon-lit Tokyo street at night, slow dolly forward, rain-soaked pavement reflecting signs, cinematic 35mm look",
        "resolution": "1080p",
        "aspect_ratio": "16:9",
        "duration": 5,
    },
)

# The first entry in "outputs" is the URL of the generated video.
print(output["outputs"][0])

For audio-synchronized generation, pass a publicly accessible audio URL via the audio parameter. To exclude artifacts, add a negative_prompt. To let WAN 2.7 enrich a short prompt automatically, set enable_prompt_expansion to true (True in Python).
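To make the optional parameters concrete, here is a minimal sketch of an input dictionary that uses all three. The parameter names `audio`, `negative_prompt`, and `enable_prompt_expansion` come from the paragraph above; the helper name `build_request_input` and the example audio URL are placeholders, and the exact value formats the API accepts are assumed rather than confirmed.

```python
def build_request_input(prompt, *, audio=None, negative_prompt=None,
                        enable_prompt_expansion=False, **options):
    """Assemble the input dict, omitting optional controls that were not set."""
    payload = {"prompt": prompt, **options}
    if audio is not None:
        payload["audio"] = audio  # publicly accessible audio URL
    if negative_prompt:
        payload["negative_prompt"] = negative_prompt
    if enable_prompt_expansion:
        payload["enable_prompt_expansion"] = True
    return payload


if __name__ == "__main__":
    request_input = build_request_input(
        "Dancers in a warehouse, strobe lighting, handheld camera",
        audio="https://example.com/track.mp3",  # placeholder URL
        negative_prompt="blurry, distorted faces, watermark, text overlay",
        enable_prompt_expansion=True,
        resolution="720p",
        duration=10,
    )
    print(request_input)
```

The resulting dictionary would be passed as the second argument to wavespeed.run(), in place of the literal dictionary in the example above.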

If you’re comparing options across the WaveSpeedAI catalog, you may also want to evaluate other text-to-video models for different style, latency, or cost trade-offs.

Tips for Best Results with WAN 2.7

  • Be specific about cinematography. Include camera angle (low-angle, overhead, dolly-in), lens style (anamorphic, 35mm, wide), and lighting (golden hour, neon, hard shadows). Generic prompts produce generic output.
  • Use negative prompts to clean up output. Common entries: “blurry, distorted faces, low contrast, watermark, text overlay, jittery motion.” This removes a class of common artifacts in one parameter.
  • Enable prompt expansion for short prompts. If you’re batching generations from a list of brief concepts, prompt expansion adds the scene detail that produces cinematic results — without you writing paragraphs.
  • Lock the seed once you find a winner. When you nail the look at 720p, fix the seed and re-run at 1080p for a final-quality version of the same clip.
  • Match aspect ratio to platform. Use 9:16 for vertical social, 16:9 for YouTube and web players, 1:1 for feed posts, and cinematic widescreen for narrative work — generating at the target ratio beats cropping in post.
  • Sync to audio for music and ad work. When pacing matters, providing the audio track up front is faster and produces tighter results than trying to time motion through prompt language alone.
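Two of the tips above lend themselves to a small helper: choosing the aspect ratio by platform, and reusing a seed between a 720p draft and a 1080p final. This is a sketch only; the platform keys, the helper name `draft_and_final`, and the `seed` parameter name are assumptions for illustration.

```python
# Aspect ratios suggested in the tips above, keyed by illustrative platform names.
PLATFORM_ASPECT_RATIO = {
    "tiktok": "9:16",
    "reels": "9:16",
    "shorts": "9:16",
    "youtube": "16:9",
    "web": "16:9",
    "feed": "1:1",
}


def draft_and_final(prompt, platform, seed, duration=5):
    """Return (draft, final) inputs that differ only in resolution."""
    base = {
        "prompt": prompt,
        "aspect_ratio": PLATFORM_ASPECT_RATIO[platform],
        "duration": duration,
        "seed": seed,  # assumed parameter name
    }
    draft = {**base, "resolution": "720p"}
    final = {**base, "resolution": "1080p"}
    return draft, final


if __name__ == "__main__":
    draft, final = draft_and_final("Foggy pine forest, slow crane up", "youtube", seed=1234)
    print(draft)
    print(final)
```

Keeping everything except the resolution identical makes it easy to confirm that the final render was requested with the same prompt and seed as the approved draft.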

Frequently Asked Questions

What is WAN 2.7 Text-to-Video?

WAN 2.7 Text-to-Video is Alibaba’s advanced AI text-to-video model that generates cinematic-quality video clips from natural language prompts, with optional audio synchronization, negative prompt control, and 1080p output.

How much does WAN 2.7 cost?

WAN 2.7 is billed per second of generated video: $0.10/second at 720p and $0.15/second at 1080p. A 5-second 720p clip costs $0.50; a 15-second 1080p clip costs $2.25. There are no subscription fees or minimum commitments.

Can I use WAN 2.7 via API?

Yes. WAN 2.7 is available through WaveSpeedAI’s REST inference API and Python SDK with no cold starts. A single wavespeed.run() call returns the generated video URL.

Does WAN 2.7 support audio input?

Yes — WAN 2.7 accepts an optional audio track to synchronize the rhythm, pacing, and mood of the generated video. This makes it well-suited for music videos, narrated explainers, and ads with a defined soundbed.

What resolutions and aspect ratios does WAN 2.7 support?

WAN 2.7 generates video at 720p or 1080p, with flexible aspect ratios including 16:9, 9:16, 1:1, and cinematic widescreen — covering social, web, and broadcast delivery formats from a single API.

Start Generating with WAN 2.7 Today

WAN 2.7 Text-to-Video brings cinematic quality, audio-synchronized motion, and production-ready resolutions to a simple REST API — without subscription lock-in or cold starts. Whether you’re producing social content at scale, prototyping ad concepts, or building a music video from scratch, WAN 2.7 puts a full creative pipeline behind a single prompt.

Try WAN 2.7 Text-to-Video on WaveSpeedAI →