
AI Music Video Generator

wavespeed-ai/music-video-generator

AI Music Video Generator transforms audio plus a single photo into a full music video with cinematic camera angles, smooth transitions, and accurate lip sync. Videos can run up to 10 minutes at 480p or 720p. It is served through a ready-to-use REST inference API with no cold starts and affordable pricing.

README

AI Music Video (MV) Generator

The world's best AI music video (MV) generator. Turn any song + a single photo into a professional-quality music video in minutes.

Why It's the Best

  • Blazing fast: A full 1-minute music video typically renders in 3-6 minutes, not hours.
  • Perfect lip sync: Vocal-aware segmentation ensures the singer's lips match the audio precisely throughout the entire video.
  • Cinematic quality: AI director plans each scene with different camera angles, compositions, and natural lighting — like a real music video shoot.
  • One photo is all you need: Upload a single portrait and the AI handles the rest — scene creation, angle variations, and smooth transitions.
  • Up to 10 minutes: Create full-length music videos, not just short clips.
  • Smart scene planning: Automatically detects vocal phrases and silence in the audio to create natural scene transitions at musically meaningful moments.

How It Works

  1. Upload your audio — any song, any genre, up to 10 minutes.
  2. Upload 1-3 reference images (optional) — the person who will appear in the video.
  3. Describe the scene (optional) — e.g. "A woman sings in a forest while playing a guitar".
  4. Choose aspect ratio — 16:9 (landscape) or 9:16 (portrait/vertical).
  5. Select resolution — 480p or 720p.
  6. Get your music video — fully rendered with transitions, multiple angles, and synced audio.

What Happens Behind the Scenes

  1. Vocal isolation — Separates vocals from instruments to analyze singing patterns.
  2. Smart segmentation — Splits the audio at natural phrase boundaries (not arbitrary fixed intervals).
  3. AI directing — A vision-language model plans each scene: camera angles, compositions, expressions, and camera movements.
  4. Scene generation — Creates unique starting frames for each segment from different angles.
  5. Video synthesis — Generates lip-synced digital human video for each segment.
  6. Cinematic assembly — Smooth crossfade transitions between scenes, with the original audio layered on top for perfect sync.
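
The smart segmentation step above can be illustrated with a small sketch. This is not the service's actual algorithm, which is not documented here; it only shows the general idea of cutting at pauses instead of fixed intervals. It assumes a per-frame energy envelope of the isolated vocal track has already been computed, and the function name, threshold, and frame size are all hypothetical.

```python
from typing import List, Sequence, Tuple


def split_at_silences(
    energies: Sequence[float],
    frame_seconds: float = 0.1,
    threshold: float = 0.05,
    min_silent_frames: int = 3,
) -> List[Tuple[float, float]]:
    """Split a vocal energy envelope into segments at sustained pauses.

    Hypothetical helper: every parameter default here is an assumption.
    Returns (start_seconds, end_seconds) pairs covering the whole track.
    """
    if not energies:
        return []

    cut_frames = []
    run_start = None  # index where the current silent run began, if any
    for i, energy in enumerate(energies):
        if energy < threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            # A silent run just ended at frame i - 1. Cut in its middle,
            # but only if it is long enough and not leading silence.
            if run_start > 0 and i - run_start >= min_silent_frames:
                cut_frames.append((run_start + i) // 2)
            run_start = None
    # A trailing silent run is ignored because no phrase follows it.

    boundaries = [0] + cut_frames + [len(energies)]
    return [
        (start * frame_seconds, end * frame_seconds)
        for start, end in zip(boundaries, boundaries[1:])
    ]
```

Short gaps (a breath between words) fall below `min_silent_frames` and produce no cut, so scene changes land only on longer pauses between phrases.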

Pricing

Output Resolution    Cost per 5 seconds    Max Length
480p                 $0.15                 10 minutes
720p                 $0.30                 10 minutes

Billing Rules

  • Standard (480p) Rate: $0.03 per second
  • HD (720p) Rate: $0.06 per second
  • Minimum Charge: 5 seconds ($0.15 at 480p, $0.30 at 720p)
  • Billing Cap: 600 seconds (10 minutes)
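
The billing rules above reduce to a clamp followed by a multiplication. The sketch below is a minimal estimator, not an official calculator. The function name is hypothetical, and it assumes partial seconds are rounded up before billing, which the rules do not state. Amounts are kept in integer cents to avoid floating-point drift.

```python
import math

# Rates in cents per second, taken from the billing rules.
RATE_CENTS_PER_SECOND = {"480p": 3, "720p": 6}
MIN_BILLED_SECONDS = 5    # minimum charge
MAX_BILLED_SECONDS = 600  # billing cap (10 minutes)


def estimate_cost_cents(duration_seconds: float, resolution: str = "480p") -> int:
    """Estimate the charge in cents for one run (hypothetical helper)."""
    if resolution not in RATE_CENTS_PER_SECOND:
        raise ValueError(f"unsupported resolution: {resolution!r}")
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    # Assumption: partial seconds are rounded up before billing.
    seconds = math.ceil(duration_seconds)
    # Clamp to the minimum charge and the billing cap.
    billed = min(max(seconds, MIN_BILLED_SECONDS), MAX_BILLED_SECONDS)
    return billed * RATE_CENTS_PER_SECOND[resolution]
```

Under these assumptions, a 3-minute song at 720p gives `estimate_cost_cents(180, "720p") == 1080`, that is, $10.80.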

Parameters

Parameter       Required    Description
audio           Yes         URL of the audio/music file
images          No          Array of 1-3 reference image URLs
prompt          No          Scene/style description
aspect_ratio    No          "16:9" or "9:16" (auto if omitted)
resolution      No          "480p" (default) or "720p"
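
As a rough illustration, the parameters above could be assembled into a JSON request body like this. The endpoint URL and authentication scheme are not given in this README, so the sketch stops at building and validating the body. The helper name and the example.com URLs are placeholders, not real assets.

```python
import json
from typing import Any, Dict, Optional, Sequence

ALLOWED_ASPECT_RATIOS = {"16:9", "9:16"}
ALLOWED_RESOLUTIONS = {"480p", "720p"}


def build_payload(
    audio: str,
    images: Optional[Sequence[str]] = None,
    prompt: Optional[str] = None,
    aspect_ratio: Optional[str] = None,
    resolution: str = "480p",
) -> Dict[str, Any]:
    """Build a request body from the documented parameters (hypothetical helper)."""
    if not audio:
        raise ValueError("audio is required")
    if resolution not in ALLOWED_RESOLUTIONS:
        raise ValueError(f"resolution must be one of {sorted(ALLOWED_RESOLUTIONS)}")

    payload: Dict[str, Any] = {"audio": audio, "resolution": resolution}
    if images is not None:
        if not 1 <= len(images) <= 3:
            raise ValueError("images must contain 1-3 URLs")
        payload["images"] = list(images)
    if prompt:
        payload["prompt"] = prompt
    if aspect_ratio is not None:
        # Omitting aspect_ratio leaves the choice to the service ("auto").
        if aspect_ratio not in ALLOWED_ASPECT_RATIOS:
            raise ValueError('aspect_ratio must be "16:9" or "9:16"')
        payload["aspect_ratio"] = aspect_ratio
    return payload


if __name__ == "__main__":
    body = build_payload(
        audio="https://example.com/song.mp3",          # placeholder URL
        images=["https://example.com/portrait.jpg"],   # placeholder URL
        prompt="A rock singer performing on a neon-lit stage",
        aspect_ratio="9:16",
        resolution="720p",
    )
    print(json.dumps(body, indent=2))
```

Validating locally catches an empty image list or an unsupported resolution before a request is sent.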

Tips

  • Best results with vocals: The AI uses vocal patterns for scene timing. Songs with clear vocals produce the best-timed transitions.
  • Portrait photos work best: Clear, front-facing photos with visible face give the best identity preservation.
  • Be descriptive: A good prompt like "A rock singer performing on a neon-lit stage" gives much better results than just "singer".
  • No photo? No problem: If you don't provide images, the AI will generate a performer based on the detected voice (male/female).

Note

  • Max audio length: 10 minutes (600 seconds)
  • Processing speed: A 1-minute music video typically completes in 3-6 minutes
  • Supported audio formats: MP3, WAV, AAC, and most other common formats
  • The AI automatically handles scene planning; you don't need to specify individual scenes
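
Since audio is limited to 600 seconds, a local pre-check can save a wasted upload. The sketch below covers only uncompressed WAV files, using Python's standard `wave` module; MP3 and AAC would need a third-party decoder. The helper names are hypothetical, and how the service treats longer audio is not stated here.

```python
import wave

MAX_AUDIO_SECONDS = 600  # 10-minute limit from the notes above


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_within_limit(path: str, limit: float = MAX_AUDIO_SECONDS) -> bool:
    """Check whether a local WAV file fits the documented length limit."""
    return wav_duration_seconds(path) <= limit
```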