Music Video Generator

AI Music Video Generator turns an audio track and a single photo into a full music video with cinematic camera angles, smooth transitions, and accurate lip sync. Videos can run up to 10 minutes at 480p or 720p. It is available through a ready-to-use REST inference API with no cold starts and affordable pricing.

Features

AI Music Video (MV) Generator

The world’s best AI music video (MV) generator. Turn any song + a single photo into a professional-quality music video in minutes.

Why It’s the Best

Blazing fast: Generate a full 1-minute music video in just a few minutes. No waiting hours.
Perfect lip sync: Vocal-aware segmentation ensures the singer’s lips match the audio precisely throughout the entire video.
Cinematic quality: AI director plans each scene with different camera angles, compositions, and natural lighting — like a real music video shoot.
One photo is all you need: Upload a single portrait and the AI handles the rest — scene creation, angle variations, and smooth transitions.
Up to 10 minutes: Create full-length music videos, not just short clips.
Smart scene planning: Automatically detects vocal phrases and silence in the audio to create natural scene transitions at musically meaningful moments.

How It Works

Upload your audio — any song, any genre, up to 10 minutes.
Upload 1-3 reference images (optional) — the person who will appear in the video.
Describe the scene (optional) — e.g. “A woman sings in a forest while playing a guitar”.
Choose aspect ratio — 16:9 (landscape) or 9:16 (portrait/vertical).
Select resolution — 480p or 720p.
Get your music video — fully rendered with transitions, multiple angles, and synced audio.

What Happens Behind the Scenes

Vocal isolation — Separates vocals from instruments to analyze singing patterns.
Smart segmentation — Splits the audio at natural phrase boundaries (not arbitrary fixed intervals).
AI directing — A vision-language model plans each scene: camera angles, compositions, expressions, and camera movements.
Scene generation — Creates unique starting frames for each segment from different angles.
Video synthesis — Generates lip-synced digital human video for each segment.
Cinematic assembly — Smooth crossfade transitions between scenes, with the original audio layered on top for perfect sync.
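
As a purely illustrative sketch of the "smart segmentation" idea (cutting at silences rather than at fixed intervals), the function below places a scene boundary at the midpoint of each detected silence and skips cuts that would create a very short segment. This is not the service's actual algorithm, which is not documented here; the function name, the silence-interval input, and the `min_len` threshold are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def split_at_silences(
    duration: float,
    silences: Sequence[Tuple[float, float]],
    min_len: float = 2.0,
) -> List[Tuple[float, float]]:
    """Split [0, duration] into segments, cutting at silence midpoints.

    `silences` is a list of (start, end) times in seconds. A cut is skipped
    when it would leave a segment shorter than `min_len` (hypothetical value).
    """
    cuts = [0.0]
    for start, end in sorted(silences):
        mid = (start + end) / 2
        # Keep the cut only if both neighbouring segments stay long enough.
        if mid - cuts[-1] >= min_len and duration - mid >= min_len:
            cuts.append(mid)
    cuts.append(duration)
    # Pair consecutive cut points into (segment_start, segment_end) tuples.
    return list(zip(cuts[:-1], cuts[1:]))
```

For a 20-second clip with silences at 4.0-5.0 s, 5.5-6.0 s, and 12.0-13.0 s, the second silence is too close to the first cut and is ignored, leaving three segments.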

Pricing

Output Resolution	Cost per 5 seconds	Max Length
480p	$0.15	10 minutes
720p	$0.30	10 minutes

Billing Rules

Standard (480p) Rate: $0.03 per second
HD (720p) Rate: $0.06 per second
Minimum Charge: 5 seconds ($0.15 at 480p, $0.30 at 720p)
Billing Cap: 600 seconds (10 minutes)
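
The billing rules above can be turned into a small cost estimator. The sketch below uses the listed per-second rates, the 5-second minimum, and the 600-second cap. How fractional seconds are billed is not stated, so rounding up to the next whole second is an assumption, and `estimate_cost` is a hypothetical helper name rather than part of the API.

```python
import math

# Rates from the billing rules, in US cents per second (integers avoid float drift).
RATE_CENTS_PER_SECOND = {"480p": 3, "720p": 6}
MIN_BILLED_SECONDS = 5    # minimum charge
MAX_BILLED_SECONDS = 600  # billing cap (10 minutes)


def estimate_cost(duration_seconds: float, resolution: str = "480p") -> float:
    """Return the estimated charge in USD for a video of the given length."""
    if resolution not in RATE_CENTS_PER_SECOND:
        raise ValueError(f"unsupported resolution: {resolution!r}")
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    # Assumption: partial seconds are rounded up before billing.
    billed = math.ceil(duration_seconds)
    # Apply the 5-second minimum and the 600-second cap.
    billed = max(MIN_BILLED_SECONDS, min(billed, MAX_BILLED_SECONDS))
    return billed * RATE_CENTS_PER_SECOND[resolution] / 100
```

Under these assumptions, a 60-second video at 480p comes to $1.80 and a full 10-minute video at 720p comes to $36.00.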

Parameters

Parameter	Required	Description
`audio`	Yes	URL of the audio/music file
`images`	No	Array of 1-3 reference image URLs
`prompt`	No	Scene/style description
`aspect_ratio`	No	“16:9” or “9:16” (auto if omitted)
`resolution`	No	“480p” (default) or “720p”

Tips

Best results with vocals: The AI uses vocal patterns for scene timing. Songs with clear vocals produce the best-timed transitions.
Portrait photos work best: Clear, front-facing photos with visible face give the best identity preservation.
Be descriptive: A good prompt like “A rock singer performing on a neon-lit stage” gives much better results than just “singer”.
No photo? No problem: If you don’t provide images, the AI will generate a performer based on the detected voice (male/female).

Note

Max audio length: 10 minutes (600 seconds)
Processing speed: A 1-minute music video typically completes in 3-6 minutes
Supported audio formats: MP3, WAV, AAC, and most other common formats
The AI handles scene planning automatically, so you don’t need to specify individual scenes

Authentication

For authentication details, please refer to the Authentication Guide.

API Endpoints

Submit Task & Query Result


# Submit the task (audio is required; replace the placeholder URL with your own file)
curl --location --request POST "https://api.wavespeed.ai/api/v3/wavespeed-ai/music-video-generator" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer ${WAVESPEED_API_KEY}" \
--data-raw '{
    "audio": "https://example.com/song.mp3",
    "resolution": "480p"
}'

# Get the result (requestId is the data.id value returned by the submit call)
curl --location --request GET "https://api.wavespeed.ai/api/v3/predictions/${requestId}/result" \
--header "Authorization: Bearer ${WAVESPEED_API_KEY}"
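
The same submit-then-query flow can be sketched in Python using only the standard library. The endpoint URLs and the `data.id`, `data.status`, `data.outputs`, and `data.error` fields come from this page. The function names, the 5-second polling interval, and the one-hour timeout are assumptions chosen for illustration. The HTTP call is passed in as a parameter so the polling logic can be exercised without network access.

```python
import json
import time
import urllib.request

API_BASE = "https://api.wavespeed.ai/api/v3"
MODEL_PATH = "wavespeed-ai/music-video-generator"


def http_json(method, url, api_key, body=None):
    """Send a JSON request with a Bearer token and return the decoded JSON reply."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Authorization": f"Bearer {api_key}"}
    if data is not None:
        headers["Content-Type"] = "application/json"
    req = urllib.request.Request(url, data=data, headers=headers, method=method)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def generate_music_video(payload, api_key, request=http_json,
                         poll_seconds=5.0, timeout_seconds=3600.0,
                         sleep=time.sleep):
    """Submit a task, then poll the result endpoint until it finishes.

    Returns the list of output URLs. poll_seconds and timeout_seconds are
    illustrative defaults, not values recommended by the service.
    """
    submitted = request("POST", f"{API_BASE}/{MODEL_PATH}", api_key, payload)
    task_id = submitted["data"]["id"]
    result_url = f"{API_BASE}/predictions/{task_id}/result"
    deadline = time.monotonic() + timeout_seconds
    while True:
        data = request("GET", result_url, api_key)["data"]
        if data["status"] == "completed":
            return data["outputs"]
        if data["status"] == "failed":
            raise RuntimeError(data.get("error") or "task failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} did not finish in time")
        sleep(poll_seconds)
```

A typical call would read the key from the `WAVESPEED_API_KEY` environment variable and pass a payload containing at least `audio`, for example `generate_music_video({"audio": "https://example.com/song.mp3"}, os.environ["WAVESPEED_API_KEY"])`, where the audio URL is a placeholder.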

Parameters

Task Submission Parameters

Request Parameters

Parameter	Type	Required	Default	Range	Description
audio	string	Yes	-	-	The audio/music file URL for generating the music video.
images	array	No	[]	-	List of reference image URLs (1-3 images). The person in the images will appear throughout the video.
prompt	string	No	-	-	Style and scene description for the music video (e.g. "A woman sings in a forest while playing a guitar").
aspect_ratio	string	No	-	16:9, 9:16	Aspect ratio of the output video. If not specified, auto-detected from input images.
resolution	string	No	480p	480p, 720p	The resolution of the output video.
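
To catch mistakes before a request is sent, the table above can be mirrored in a small client-side validator. The sketch below is a hypothetical helper, not part of any official SDK. It enforces only what the table states (required `audio`, at most 3 images, the two aspect ratios, the two resolutions) and leaves optional fields out of the payload when they are not given.

```python
from typing import Optional, Sequence

ALLOWED_ASPECT_RATIOS = {"16:9", "9:16"}
ALLOWED_RESOLUTIONS = {"480p", "720p"}
MAX_IMAGES = 3


def build_payload(
    audio: str,
    images: Optional[Sequence[str]] = None,
    prompt: Optional[str] = None,
    aspect_ratio: Optional[str] = None,
    resolution: str = "480p",
) -> dict:
    """Build the JSON body for the task submission request."""
    if not audio:
        raise ValueError("audio is required")
    if resolution not in ALLOWED_RESOLUTIONS:
        raise ValueError(f"resolution must be one of {sorted(ALLOWED_RESOLUTIONS)}")
    payload = {"audio": audio, "resolution": resolution}
    if images:
        if len(images) > MAX_IMAGES:
            raise ValueError(f"at most {MAX_IMAGES} reference images are allowed")
        payload["images"] = list(images)
    if prompt:
        payload["prompt"] = prompt
    if aspect_ratio is not None:
        if aspect_ratio not in ALLOWED_ASPECT_RATIOS:
            raise ValueError(f"aspect_ratio must be one of {sorted(ALLOWED_ASPECT_RATIOS)}")
        # Omitting aspect_ratio lets the service auto-detect it from the images.
        payload["aspect_ratio"] = aspect_ratio
    return payload
```

The returned dictionary can be serialized with `json.dumps` and sent as the body of the POST request.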

Response Parameters

Parameter	Type	Description
code	integer	HTTP status code (e.g., 200 for success)
message	string	Status message (e.g., “success”)
data.id	string	Unique identifier for the prediction (the task ID)
data.model	string	Model ID used for the prediction
data.outputs	array	Array of URLs to the generated content (empty when status is not `completed`)
data.urls	object	Object containing related API endpoints
data.urls.get	string	URL to retrieve the prediction result
data.status	string	Status of the task: `created`, `processing`, `completed`, or `failed`
data.created_at	string	ISO timestamp of when the request was created (e.g., “2023-04-01T12:34:56.789Z”)
data.error	string	Error message (empty if no error occurred)
data.timings	object	Object containing timing details
data.timings.inference	integer	Inference time in milliseconds

Result Request Parameters

Parameter	Type	Required	Default	Description
id	string	Yes	-	Task ID

Result Response Parameters

Parameter	Type	Description
code	integer	HTTP status code (e.g., 200 for success)
message	string	Status message (e.g., “success”)
data	object	The prediction data object containing all details
data.id	string	Unique identifier for the prediction (the task ID that was queried)
data.model	string	Model ID used for the prediction
data.outputs	array	Array of URLs to the generated content (empty when status is not `completed`)
data.urls	object	Object containing related API endpoints
data.urls.get	string	URL to retrieve the prediction result
data.status	string	Status of the task: `created`, `processing`, `completed`, or `failed`
data.created_at	string	ISO timestamp of when the request was created (e.g., “2023-04-01T12:34:56.789Z”)
data.error	string	Error message (empty if no error occurred)
data.timings	object	Object containing timing details
data.timings.inference	integer	Inference time in milliseconds
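
As a rough illustration of how a client might read these fields, the sketch below flattens a result response into a small summary. The field names follow the table above. The helper name `parse_result` and the shape of the returned dictionary are assumptions, and treating an empty `data.error` string as "no error" follows the description in the table.

```python
from datetime import datetime


def parse_result(response: dict) -> dict:
    """Summarize a result response using the documented fields."""
    data = response["data"]
    # Timestamps end in "Z"; rewrite it as an explicit UTC offset so that
    # datetime.fromisoformat accepts it on older Python versions as well.
    created = datetime.fromisoformat(data["created_at"].replace("Z", "+00:00"))
    inference_ms = (data.get("timings") or {}).get("inference")
    return {
        "id": data["id"],
        "status": data["status"],
        # The task is finished once it is either completed or failed.
        "done": data["status"] in ("completed", "failed"),
        "outputs": list(data.get("outputs") or []),
        "error": data.get("error") or None,
        "created_at": created,
        # timings.inference is in milliseconds; convert to seconds.
        "inference_seconds": None if inference_ms is None else inference_ms / 1000,
    }
```

A caller can keep polling while `done` is false, then read `outputs` on success or `error` on failure.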
