Introducing Mirelo SFX V1 Video-to-Audio on WaveSpeedAI

Mirelo SFX V1 Video-to-Audio: AI-Powered Synchronized Sound Effects for Any Video

Mirelo SFX V1 Video-to-Audio is a new AI sound generation model on WaveSpeedAI that produces synchronized sound effects directly from video input, transforming silent footage into immersive, scene-matched audio. Whether you’re a filmmaker filling in missing foley, a content creator polishing short-form videos, or a developer automating audio production at scale, this model delivers realistic audio that matches what’s happening on screen — without the cost or turnaround of traditional sound design.

Sound design has long been one of the most time-consuming parts of video production. Recording foley, sourcing stock effects, and hand-aligning each sound to picture can eat up hours per minute of finished content. Mirelo SFX V1 collapses that workflow into a single API call, letting you go from raw video to mixed audio in seconds.

Try Mirelo SFX V1 Video-to-Audio on WaveSpeedAI →

How Mirelo SFX V1 Video-to-Audio Works

Mirelo SFX V1 Video-to-Audio analyzes the visual content of an uploaded clip — the on-screen action, environment, motion, and pacing — and generates audio that synchronizes with what it sees. The model accepts a video file or URL as the only required input, and optionally takes a text prompt to steer the type of sound you want.

The technical specs developers care about:

Input: Video URL or direct upload
Output: Audio synchronized to video timing
Duration: 2 to 10 seconds per run
Multi-sample generation: 2 audio variations by default, with the number of samples per request configurable via num_samples
Reproducibility: Seed parameter for deterministic outputs
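
As a rough sketch of how these specs map onto a request body, the helper below builds a payload and checks the documented 2 to 10 second range before anything is sent. The field names video, num_samples, duration, and seed follow the API example later in this article. The prompt field name and the build_payload helper itself are assumptions made for illustration, not confirmed API details.

```python
from typing import Optional

MIN_DURATION_S = 2   # documented lower bound per run
MAX_DURATION_S = 10  # documented upper bound per run


def build_payload(video_url: str, duration: int = 5, num_samples: int = 2,
                  seed: int = -1, prompt: Optional[str] = None) -> dict:
    """Build a request body for the video-to-audio endpoint (hypothetical helper)."""
    if not video_url:
        raise ValueError("video_url is required")
    if not MIN_DURATION_S <= duration <= MAX_DURATION_S:
        raise ValueError(
            f"duration must be between {MIN_DURATION_S} and {MAX_DURATION_S} seconds"
        )
    if num_samples < 1:
        raise ValueError("num_samples must be at least 1")
    payload = {
        "video": video_url,
        "num_samples": num_samples,
        "duration": duration,
        "seed": seed,  # default mirrors the value used in the article's API example
    }
    if prompt:
        # Assumption: the optional text prompt is sent under the key "prompt".
        payload["prompt"] = prompt
    return payload


print(build_payload("https://example.com/clip.mp4", duration=4,
                    prompt="rain on window glass"))
```

Validating on the client side like this only catches obvious mistakes early. The API remains the authority on which values it accepts.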

What sets Mirelo SFX V1 apart from generic text-to-audio models is the video conditioning. Instead of generating sound from a description alone, the model grounds its output in the actual frames of your clip — meaning footsteps land on the right beat, splashes hit when something enters the water, and ambient textures match the visible environment.

Key Features of Mirelo SFX V1 Video-to-Audio

Video-synchronized sound generation: The model parses on-screen action and produces audio that aligns with the visual timing, eliminating the manual frame-by-frame sync work traditional foley requires.
Optional text prompt guidance: Steer the audio with natural language (e.g., “rain on window glass” or “crowded café ambience”) when the scene is ambiguous or when you want a specific creative direction.
Multiple samples per run: Generate several audio variations in a single API call, then compare the takes and keep the best one without submitting a second job. Each sample is billed separately.
Adjustable duration up to 10 seconds: Configure exactly how long the generated audio should be, from 2 to 10 seconds, billed per second per sample.
Reproducible outputs via seed: Lock in a specific result with the seed parameter, useful for iterative editing or maintaining consistency across a series.
REST API with no cold starts: Hosted on WaveSpeedAI’s inference infrastructure, so first-call latency stays low and batch jobs run predictably.

Best Use Cases for Mirelo SFX V1 Video-to-Audio

Film and Video Post-Production Foley

Independent filmmakers and post-production studios can use Mirelo SFX V1 to generate realistic foley for silent footage or poorly recorded scenes. Footstep sounds, door closes, fabric rustles, and ambient room tone — all of which traditionally require a foley artist and a recording session — can now be drafted in seconds and refined in your edit. This is especially valuable for indie productions working without a dedicated sound team.

Short-Form Social Media Content

Short-form video creators on TikTok, Reels, and Shorts know that audio drives engagement, and silent clips get scrolled past. With Mirelo SFX V1, creators can batch-process dozens of clips, generating tailored sound effects that match each scene rather than relying on the same overused stock library. The multi-sample feature is particularly useful here: generate a few variations per clip and keep the one that fits the scene best.

Game Development and Interactive Media

Game developers can feed in-game capture footage to Mirelo SFX V1 to prototype sound effects for new mechanics, environments, or cutscenes. Instead of waiting on a sound designer for early-stage builds, developers can generate placeholder audio that already feels production-quality, then iterate from there.

Advertising and Product Marketing Videos

Marketing teams producing high volumes of product videos, demo reels, and social ads can use Mirelo SFX V1 to add polished audio without booking studio time. A silent unboxing video becomes a tactile experience with package crinkle, button clicks, and product handling sounds — all generated to match the on-screen action.

Content Automation Pipelines

For teams running automated video pipelines — news clip generation, AI-produced explainers, archival footage restoration — Mirelo SFX V1 integrates as a REST API call. Combine it with WaveSpeedAI’s text-to-video and image-to-video models to build fully automated video-with-audio production workflows.

Archival Footage and Silent Film Enhancement

Restoring or repurposing silent archival footage? Mirelo SFX V1 can add atmospheric audio that brings old clips to life — historical street ambience, machinery, weather — without invasive editing.

Educational and Training Videos

Instructional content often has weak or missing audio in demonstration segments. Mirelo SFX V1 can fill those gaps with appropriate environmental and action sounds, making training videos more engaging without re-shooting.

Mirelo SFX V1 Video-to-Audio Pricing and API Access

Mirelo SFX V1 is billed at $0.007 per second per sample, with a minimum billable duration of 2 seconds and a maximum of 10 seconds per run.

Duration	1 Sample	2 Samples	4 Samples
2s	$0.014	$0.028	$0.056
5s	$0.035	$0.070	$0.140
10s	$0.070	$0.140	$0.280

Total cost = billed duration × num_samples × $0.007

A typical 5-second, 2-sample run costs $0.07 — affordable enough for high-volume production workflows.
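
The pricing formula can be sketched as a small estimator. The rate and the 2 to 10 second bounds come from the pricing above; the estimate_cost function name is hypothetical, and Decimal is used only to avoid floating-point rounding in the totals.

```python
from decimal import Decimal

PRICE_PER_SECOND_PER_SAMPLE = Decimal("0.007")  # USD, from the pricing above
MIN_BILLED_S = 2
MAX_BILLED_S = 10


def estimate_cost(duration_s: int, num_samples: int) -> Decimal:
    """Return billed duration x num_samples x rate, in USD."""
    if not MIN_BILLED_S <= duration_s <= MAX_BILLED_S:
        raise ValueError("duration_s must be between 2 and 10 seconds")
    if num_samples < 1:
        raise ValueError("num_samples must be at least 1")
    return PRICE_PER_SECOND_PER_SAMPLE * duration_s * num_samples


# Rebuild the rows of the pricing table for 1, 2, and 4 samples.
for duration in (2, 5, 10):
    row = [str(estimate_cost(duration, samples)) for samples in (1, 2, 4)]
    print(f"{duration}s", row)
```

Multiplying the per-run estimate by the number of clips in a batch gives a quick budget check before submitting a large job.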

API Example

Calling Mirelo SFX V1 through the WaveSpeedAI REST API from Python, using only the standard library:

import json
import os
import time
from urllib.request import Request, urlopen

api_key = os.environ["WAVESPEED_API_KEY"]
headers = {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"}
payload = {
    "video": "https://interactive-examples.mdn.mozilla.net/media/cc0-videos/flower.mp4",
    "num_samples": 2,
    "duration": 5,
    "seed": -1
}

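def request_json(url, data=None):
    """Send a POST when a JSON body is given, otherwise a GET, and return the parsed JSON."""
    request = Request(url, data=data, headers=headers, method="POST" if data else "GET")
    # A socket timeout keeps a stalled connection from hanging the script.
    with urlopen(request, timeout=30) as response:
        return json.load(response)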

# 1. Submit the prediction.
submit_body = request_json("https://api.wavespeed.ai/api/v3/mirelo-ai/sfx-v1/video-to-audio", json.dumps(payload).encode())
task = submit_body.get("data", submit_body)
prediction_id = task.get("id")
if not prediction_id:
    raise RuntimeError("Submission response did not contain a prediction id")
result_url = task.get("urls", {}).get("get") or f"https://api.wavespeed.ai/api/v3/predictions/{prediction_id}/result"

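# 2. Poll every 2 seconds until the prediction finishes.
# The 5-minute limit is an arbitrary client-side safeguard, not an API setting.
deadline = time.monotonic() + 300
while True:
    body = request_json(result_url)
    result = body.get("data", body)
    status = result.get("status")
    if status == "completed":
        print(result.get("outputs", []))
        break
    if status in {"failed", "cancelled", "timeout"}:
        raise RuntimeError(result)
    if status not in {"created", "processing"}:
        raise RuntimeError(f"Unexpected status: {status}")
    if time.monotonic() > deadline:
        raise TimeoutError("Prediction did not finish within 5 minutes")
    time.sleep(2)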

WaveSpeedAI’s hosted infrastructure means no cold starts, no GPU provisioning, and pay-per-use billing — you only pay for what you generate.

Get your API key and start building →

Tips for Best Results with Mirelo SFX V1 Video-to-Audio

Leave the prompt empty when the video is self-explanatory. The model infers strong audio from clear visuals — extra text can sometimes over-steer the result.
Use the prompt to disambiguate. For scenes that could imply multiple soundscapes (e.g., an indoor shot that could be a library or a café), explicit prompts produce more accurate results.
Generate 3–4 samples on creative work. Variation increases the chance of finding a good match, and each additional sample adds only $0.007 per second of audio.
Lock the seed once you find a winner. Reproducibility matters when iterating on a longer project or matching audio across multiple cuts.
Match duration to the key action window. If the most important sound event is 3 seconds long, generate 3 seconds rather than the full 10 — you’ll get more focused output and pay less.
Ensure video URLs are publicly accessible if you’re passing links rather than uploading directly.
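
For the last tip, a best-effort pre-flight check might look like the sketch below. It sends a HEAD request with no credentials and treats any 2xx response as a sign the link is public. The looks_publicly_accessible helper is hypothetical, and passing this check from your own machine does not guarantee that WaveSpeedAI's servers can reach the file. Some hosts also reject HEAD requests even when a normal download would work.

```python
from http.client import HTTPException
from urllib.parse import urlparse
from urllib.request import Request, urlopen


def looks_publicly_accessible(video_url: str, timeout_s: float = 10.0) -> bool:
    """Best-effort check that a video URL can be fetched without credentials."""
    parsed = urlparse(video_url)
    # Reject anything that is not a plain web URL before touching the network.
    if parsed.scheme not in {"http", "https"} or not parsed.hostname:
        return False
    request = Request(video_url, method="HEAD")
    try:
        with urlopen(request, timeout=timeout_s) as response:
            return 200 <= response.status < 300
    except (OSError, HTTPException):
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False


# Example usage (performs a real network request):
# print(looks_publicly_accessible("https://example.com/clip.mp4"))
```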

Frequently Asked Questions

What is Mirelo SFX V1 Video-to-Audio?

Mirelo SFX V1 Video-to-Audio is an AI model on WaveSpeedAI that generates synchronized sound effects from video input, with optional text prompt guidance for creative control.

How much does Mirelo SFX V1 Video-to-Audio cost?

Mirelo SFX V1 is billed at $0.007 per second per sample. A 5-second, 2-sample generation costs $0.07. Billable duration ranges from 2 to 10 seconds.

Can I use Mirelo SFX V1 Video-to-Audio via API?

Yes. Mirelo SFX V1 is available through WaveSpeedAI’s REST API with no cold starts. Use the Python SDK or any HTTP client to call mirelo-ai/sfx-v1/video-to-audio with your video and optional parameters.

How long can the generated audio be?

Audio duration is configurable from 2 to 10 seconds per run. For longer audio, segment your video and run multiple generations.
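
One way to plan that segmentation is sketched below: it splits a longer clip into the fewest windows that each stay within the 2 to 10 second range, spreading the length as evenly as possible. The plan_segments helper is hypothetical and assumes whole-second durations. Cutting the video itself and joining the resulting audio are left out, and sound may not be continuous across segment boundaries.

```python
import math

MIN_SEGMENT_S = 2   # shortest run the model accepts
MAX_SEGMENT_S = 10  # longest run the model accepts


def plan_segments(total_s: int) -> list[tuple[int, int]]:
    """Split total_s seconds into (start, duration) windows of 2 to 10 seconds."""
    if total_s < MIN_SEGMENT_S:
        raise ValueError("clip is shorter than the 2-second minimum")
    # Fewest segments that keep every window at or below the maximum.
    count = math.ceil(total_s / MAX_SEGMENT_S)
    base, extra = divmod(total_s, count)
    segments = []
    start = 0
    for index in range(count):
        # The first `extra` segments absorb the remainder, one second each.
        length = base + 1 if index < extra else base
        segments.append((start, length))
        start += length
    return segments


print(plan_segments(25))  # [(0, 9), (9, 8), (17, 8)]
```

Each (start, duration) pair then corresponds to one generation request for the matching slice of the video.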

Does Mirelo SFX V1 require a text prompt?

No. The video is the only required input — the model can infer audio purely from visual content. Prompts are optional and useful for steering the result toward a specific sound or style.

Start Generating Synchronized Audio with Mirelo SFX V1

Stop manually sourcing and syncing sound effects. Mirelo SFX V1 Video-to-Audio gives you scene-matched audio in seconds, with a simple REST API and pay-per-use pricing that scales from a single creator to a full production pipeline.