MMAudio v2 — wavespeed-ai/mmaudio-v2
MMAudio v2 generates sound effects and ambience for a video from its visual content and a text prompt. Upload a clip, describe the audio you want (environment, materials, impacts, whooshes, texture), and the model synthesizes an audio track synchronized to the on-screen motion and timing. It’s well suited to adding cinematic SFX, atmospheric layers, and sound-design-style audio to silent footage.
Key capabilities
- Video-to-audio generation (adds sound to an existing video)
- Prompt-driven sound design: ambience, impacts, textures, mechanical sounds, nature
- Timing-aware audio that follows visual motion beats
- Optional negative_prompt to avoid unwanted audio characteristics
- Duration control to match the generated audio to clips of different lengths
- mask_away_clip option for generating audio without directly using the original clip audio
Use cases
- Add cinematic ambience to silent clips (city night, wind, rain, room tone)
- Create synced sound effects (footsteps, fabric rustle, metal clanks, sparks)
- Product and food sound design (sizzles, pours, crackles, knife cuts)
- Trailer-style audio layers for short edits and social videos
- Rapid sound prototyping before final mix and mastering
Pricing
| Unit | Price |
|---|---|
| Per second of audio | $0.001 |
Examples:
| Duration | Price |
|---|---|
| 5s | $0.005 |
| 8s | $0.008 |
| 10s | $0.010 |
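The example prices above follow from multiplying the duration by the per-second rate. A minimal sketch of that arithmetic, assuming billing is strictly linear with no minimum charge or rounding step (the page does not say either way); `estimate_cost` is a hypothetical helper name, not part of any API:

```python
from decimal import Decimal

# Per-second rate in USD, taken from the pricing table.
PRICE_PER_SECOND = Decimal("0.001")


def estimate_cost(duration_seconds: int) -> Decimal:
    """Estimate the price of a clip, assuming purely linear billing."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return PRICE_PER_SECOND * duration_seconds


# Reproduce the rows of the examples table.
for seconds in (5, 8, 10):
    print(f"{seconds}s -> ${estimate_cost(seconds):.3f}")
```

`Decimal` is used instead of `float` so that small currency amounts are not affected by binary rounding.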
Inputs
- video (required): the source video to generate audio for
- prompt (required): describe the desired sound
Parameters
- duration: audio length in seconds
- num_inference_steps: sampling steps
- guidance_scale: prompt adherence strength
- negative_prompt: what to avoid (e.g., “muffled, noisy, distorted, music”)
- mask_away_clip: whether to mask away the clip (useful when you want fully generated audio)
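The inputs and parameters above could be assembled into a request body along the following lines. This is only a sketch: the field names come from the lists above, but the JSON shape, the use of a URL string for `video`, and every value shown are assumptions for illustration. `build_request` is a hypothetical helper, and no endpoint is shown because this page does not specify one.

```python
import json


def build_request(video, prompt, *, duration=None, num_inference_steps=None,
                  guidance_scale=None, negative_prompt=None, mask_away_clip=None):
    """Build a request payload from the documented inputs and parameters."""
    # video and prompt are the two required inputs.
    if not video:
        raise ValueError("video is required")
    if not prompt:
        raise ValueError("prompt is required")

    payload = {"video": video, "prompt": prompt}
    optional = {
        "duration": duration,
        "num_inference_steps": num_inference_steps,
        "guidance_scale": guidance_scale,
        "negative_prompt": negative_prompt,
        "mask_away_clip": mask_away_clip,
    }
    # Leave unset parameters out so the service can apply its own defaults.
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload


# Illustrative values only; the video URL is a placeholder.
payload = build_request(
    "https://example.com/clip.mp4",
    "Rainy city night ambience with distant traffic and soft wind",
    duration=8,
    negative_prompt="muffled, noisy, distorted, music",
)
print(json.dumps(payload, indent=2))
```

Omitting unset parameters, rather than sending guessed defaults, avoids overriding whatever defaults the service applies.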
Prompting guide (video → audio)
Write prompts like a sound designer:
- Environment: location + ambience (rainy alley, factory hall, forest dawn)
- Materials: metal, glass, lava, fabric, wood, water
- Actions: slice, pour, crackle, hiss, whoosh, impact
- Texture: crisp, gritty, low rumble, sparkling high-end, subtle room tone
- Timing beats: “as the blade presses in…”, “when the cube hits the ground…”
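One way to apply the five categories consistently is to fill them in separately and join them into a single prompt. The helper below is a hypothetical convenience, not something the model requires; it simply concatenates the pieces, and the sample phrases are illustrative.

```python
def build_prompt(environment, materials=(), actions=(), textures=(), timing_beats=()):
    """Join the five guide categories into one comma-separated prompt."""
    parts = [environment.strip()]
    for group in (materials, actions, textures, timing_beats):
        parts.extend(item.strip() for item in group)
    # Drop empty entries so missing categories leave no stray commas.
    return ", ".join(p for p in parts if p)


prompt = build_prompt(
    environment="rainy alley at night",
    materials=["wet metal", "glass"],
    actions=["footsteps", "distant hiss"],
    textures=["subtle room tone", "low rumble"],
    timing_beats=["a sharp impact when the cube hits the ground"],
)
print(prompt)
```

Keeping the categories as separate arguments makes it easy to vary one aspect, such as texture, while holding the rest of the prompt fixed between runs.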
Example prompts
- A glowing lava cube crackles and pops with ember flickers. A tungsten blade presses into the semi-liquid core with a soft sizzling hiss, tiny molten droplets splatter, low rumble underneath, cinematic close-mic detail.
- Rainy city night ambience with distant traffic, soft wind, occasional footsteps, subtle neon buzz, realistic stereo space.