
Stable Audio 3 Text to Audio API


Stable Audio 3 Text-to-Audio is a fast AI audio generation model that creates sound effects from text prompts with controllable duration and output format. It is available as a ready-to-use REST inference API for sound effect generation, ambient audio, game audio, video production, cinematic sound design, creator content, and professional text-to-audio workflows, with simple integration, no cold starts, and affordable pricing.



Stability AI Stable Audio 3 Text-to-Audio

Stability AI Stable Audio 3 Text-to-Audio generates audio directly from a natural-language prompt, with controls for duration, negative prompting, inference steps, guidance scale, and output format. It is suitable for sound design, ambient scenes, cinematic textures, audio prototyping, and other prompt-driven audio generation workflows.

Why Choose This?

  • Prompt-based audio generation
    Generate audio from a text description of mood, environment, texture, or sound event.

  • Duration control
    Choose how long the generated audio should be.

  • Negative prompt support
    Use negative_prompt to steer the model away from unwanted elements.

  • Generation controls
    Adjust num_inference_steps and guidance_scale to trade off output refinement, prompt adherence, and runtime.

  • Flexible output format
    Export the generated audio in a supported format such as mp3.

  • Production-ready API
    Suitable for sound effects, ambience, cinematic scenes, creative prototyping, and audio ideation workflows.

Parameters

Parameter            Required  Description
prompt               Yes       Text prompt describing the audio you want to generate.
duration             No        Output audio duration in seconds.
negative_prompt      No        Text description of sounds or qualities you want to avoid.
num_inference_steps  No        Number of inference steps used during generation.
guidance_scale       No        Controls how strongly the model follows the prompt.
output_format        No        Output audio format, such as mp3.
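
As a rough illustration of how these parameters fit together, the sketch below assembles a JSON-ready request body. The helper name build_payload is hypothetical and not part of any WaveSpeedAI SDK; it only enforces the one rule stated here (prompt is required) and omits any optional field that was not set, on the assumption that the API then falls back to its own defaults.

```python
from typing import Any, Dict, Optional


def build_payload(
    prompt: str,
    duration: Optional[float] = None,
    negative_prompt: Optional[str] = None,
    num_inference_steps: Optional[int] = None,
    guidance_scale: Optional[float] = None,
    output_format: Optional[str] = None,
) -> Dict[str, Any]:
    """Return a JSON-ready dict containing only the fields that were set.

    Hypothetical helper for illustration; not part of an official SDK.
    """
    if not prompt or not prompt.strip():
        raise ValueError("prompt is required")
    optional = {
        "duration": duration,
        "negative_prompt": negative_prompt,
        "num_inference_steps": num_inference_steps,
        "guidance_scale": guidance_scale,
        "output_format": output_format,
    }
    payload: Dict[str, Any] = {"prompt": prompt.strip()}
    # Leave unset options out (assumption: the API then uses its defaults).
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload


payload = build_payload(
    "Buzzing fluorescent lights at an isolated desert gas station, distant highway wind",
    duration=30,
    negative_prompt="music, vocals",
    output_format="mp3",
)
```

The resulting dict can be serialized with json.dumps and sent as the request body.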

How to Use

  1. Write your prompt — describe the sound scene, mood, texture, and pacing you want.
  2. Set duration (optional) — choose the target audio length.
  3. Add a negative prompt (optional) — specify anything you want the model to avoid.
  4. Adjust generation settings (optional) — tune num_inference_steps and guidance_scale if needed.
  5. Choose output format (optional) — select the format that best fits your workflow.
  6. Submit — run the model and download the generated audio.

Example Prompt

30-second cinematic sound-design scene: an isolated desert gas station at midnight with buzzing fluorescent lights, distant highway wind, a loose metal sign creaking, insects around the lamps, and a faraway truck passing slowly. Keep it detailed, spatial, realistic, and non-musical.

Pricing

Just $0.0206 per generation.

Billing Rules

  • Each generation costs $0.0206
  • Pricing is fixed per request
  • duration, negative_prompt, num_inference_steps, guidance_scale, and output_format do not affect pricing
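
Because the price is fixed per request, budgeting reduces to simple arithmetic. The sketch below shows it using the $0.0206 figure from this section; the function names are illustrative only, and Decimal is used to avoid floating-point rounding in currency math.

```python
from decimal import Decimal

# USD per generation, taken from the Pricing section above.
PRICE_PER_GENERATION = Decimal("0.0206")


def estimate_cost(generations: int) -> Decimal:
    """Total cost in USD for a number of generations at the fixed price."""
    if generations < 0:
        raise ValueError("generations must be non-negative")
    return PRICE_PER_GENERATION * generations


def generations_within(budget_usd: str) -> int:
    """Whole number of generations that fit inside a USD budget."""
    return int(Decimal(budget_usd) // PRICE_PER_GENERATION)


cost_for_100 = estimate_cost(100)      # 100 * 0.0206 = 2.06 USD
runs_per_dollar = generations_within("1")  # floor(1 / 0.0206) = 48
```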

Best Use Cases

  • Sound design — Generate cinematic and environmental sound scenes from prompts.
  • Ambient audio — Create non-musical background textures and atmosphere beds.
  • Creative prototyping — Explore multiple sound directions quickly from text.
  • Game and film audio ideation — Produce draft effects and scene ambience for early-stage workflows.
  • Content production — Generate custom audio for videos, trailers, podcasts, or installations.

Pro Tips

  • Be specific in your prompt about environment, motion, texture, and realism.
  • Use negative_prompt when you want to suppress music, vocals, distortion, or unwanted artifacts.
  • Increase num_inference_steps if you want potentially more refined output and can tolerate more runtime.
  • Adjust guidance_scale when you want tighter prompt adherence.
  • Start with a clear, focused prompt before adding extra details.
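
One way to act on the tips about num_inference_steps and guidance_scale is to generate a small grid of request bodies and compare the results by ear. The sketch below only builds the payloads; sweep_settings is a hypothetical helper, and the specific values (8 and 16 steps, guidance 1 and 2) are illustrative assumptions, so check the model's schema for the allowed ranges before using them.

```python
from itertools import product
from typing import Any, Dict, Iterable, List


def sweep_settings(
    base: Dict[str, Any],
    steps_values: Iterable[int],
    guidance_values: Iterable[float],
) -> List[Dict[str, Any]]:
    """Return one payload per (num_inference_steps, guidance_scale) pair."""
    variants = []
    for steps, guidance in product(steps_values, guidance_values):
        variant = dict(base)  # copy so the base payload is left untouched
        variant["num_inference_steps"] = steps
        variant["guidance_scale"] = guidance
        variants.append(variant)
    return variants


base = {
    "prompt": "A loose metal sign creaking in desert wind, realistic, non-musical",
    "negative_prompt": "music, vocals, distortion",
}
# Illustrative values only; valid ranges are an assumption.
variants = sweep_settings(base, steps_values=[8, 16], guidance_values=[1, 2])
```

Each variant is a separate request, so a 2 x 2 grid like this one is billed as four generations.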

Notes

  • prompt is required.
  • Pricing is fixed at $0.0206 per generation.
  • Better prompts usually improve both realism and controllability.
  • This workflow is especially useful for non-musical and cinematic audio generation.

Related Models

  • Other Stability AI audio generation workflows — Useful when you need different quality, speed, or control trade-offs.
  • Prompt-based sound generation models — Useful when you want alternate audio styles or generation behavior.

Stable Audio 3 Text To Audio API — Quick start

Grab a WaveSpeedAI API key, then call POST https://api.wavespeed.ai/api/v3/stability-ai/stable-audio-3/text-to-audio with your input as JSON. The endpoint returns a prediction id; poll the prediction endpoint until status flips to completed, then read the output URL from data.outputs[0]. Examples for Stable Audio 3 Text To Audio below.

HTTP example
# Submit the prediction
curl -X POST "https://api.wavespeed.ai/api/v3/stability-ai/stable-audio-3/text-to-audio" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $WAVESPEED_API_KEY" \
  -d '{
    "prompt": "Isolated desert gas station at midnight: buzzing fluorescent lights, distant highway wind, a loose metal sign creaking",
    "duration": 30,
    "negative_prompt": "music, vocals, distortion",
    "num_inference_steps": 8,
    "guidance_scale": 1,
    "output_format": "mp3"
}'

# The response includes a prediction id. Substitute it for {request_id} and poll for the result:
curl -X GET "https://api.wavespeed.ai/api/v3/predictions/{request_id}/result" \
  -H "Authorization: Bearer $WAVESPEED_API_KEY"

# When status is "completed", read the output from data.outputs[0].
Node.js example
// npm install wavespeed
const WaveSpeed = require('wavespeed');

const client = new WaveSpeed(); // reads WAVESPEED_API_KEY from env

// CommonJS modules do not allow top-level await, so wrap the call in an async function.
async function main() {
  const result = await client.run("stability-ai/stable-audio-3/text-to-audio", {
    "prompt": "Isolated desert gas station at midnight: buzzing fluorescent lights, distant highway wind, a loose metal sign creaking",
    "duration": 30,
    "negative_prompt": "music, vocals, distortion",
    "num_inference_steps": 8,
    "guidance_scale": 1,
    "output_format": "mp3"
  });

  console.log(result.outputs[0]); // → URL of the generated output
}

main().catch(console.error);
Python example
# pip install wavespeed
import wavespeed

output = wavespeed.run(
    "stability-ai/stable-audio-3/text-to-audio",
    {
        "prompt": "Isolated desert gas station at midnight: buzzing fluorescent lights, distant highway wind, a loose metal sign creaking",
        "duration": 30,
        "negative_prompt": "music, vocals, distortion",
        "num_inference_steps": 8,
        "guidance_scale": 1,
        "output_format": "mp3",
    },
)

print(output["outputs"][0])  # → URL of the generated output
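
For callers who prefer not to use an SDK, the poll step from the HTTP example can be sketched with the Python standard library alone. This is a minimal sketch under stated assumptions, not a reference client: it assumes the result JSON carries the status at data.status alongside data.outputs, and that a failed run reports the status "failed". Neither detail is confirmed on this page, so check the API reference before relying on them. The function names are hypothetical.

```python
import json
import time
import urllib.request
from typing import Any, Callable, Dict

API_BASE = "https://api.wavespeed.ai/api/v3"


def fetch_result(request_id: str, api_key: str) -> Dict[str, Any]:
    """GET the prediction result once and return the parsed JSON body."""
    req = urllib.request.Request(
        f"{API_BASE}/predictions/{request_id}/result",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def wait_for_output(
    fetch: Callable[[], Dict[str, Any]],
    interval: float = 1.0,
    timeout: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch() until the prediction completes, then return outputs[0]."""
    deadline = time.monotonic() + timeout
    while True:
        data = fetch().get("data", {})
        status = data.get("status")  # assumption: status lives under "data"
        if status == "completed":
            return data["outputs"][0]
        if status == "failed":  # assumption: failure status name
            raise RuntimeError(f"prediction failed: {data}")
        if time.monotonic() >= deadline:
            raise TimeoutError("prediction did not complete in time")
        sleep(interval)
```

With a real prediction id this would be called as wait_for_output(lambda: fetch_result(request_id, api_key)). Passing the fetch step in as a callable keeps the loop easy to exercise with canned responses.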

Stable Audio 3 Text To Audio API — Frequently asked questions

What is the Stable Audio 3 Text To Audio API?

Stable Audio 3 Text To Audio is a Stability AI model for audio generation, exposed as a REST API on WaveSpeedAI. It creates sound effects and ambient audio from text prompts, with controllable duration and output format. You can call it programmatically or try it from the playground above.

How do I call the Stable Audio 3 Text To Audio API?

POST your input parameters to the model's REST endpoint (shown in the API tab of this playground) with your WaveSpeedAI API key in the Authorization header. Submission returns a prediction ID; poll the prediction endpoint until status flips to "completed", then read the output URL from the result. The playground generates a ready-to-paste code sample in Python, JavaScript, or cURL for whatever inputs you've set. Full request/response shape is documented at https://wavespeed.ai/docs/docs-api/stability-ai/stability-ai-stable-audio-3-text-to-audio.

How much does Stable Audio 3 Text To Audio cost per run?

Stable Audio 3 Text To Audio costs $0.0206 per run (about $0.021 when rounded). Per the Billing Rules above, pricing is fixed per request: duration, negative_prompt, num_inference_steps, guidance_scale, and output_format do not change the charge. The cost for your current input is shown live next to the Generate button before you submit, and the actual per-call charge is recorded on the prediction afterwards.

What inputs does Stable Audio 3 Text To Audio accept?

Key inputs: `prompt`, `duration`, `guidance_scale`, `num_inference_steps`, `negative_prompt`, `output_format`. The full JSON schema (types, defaults, allowed values) is rendered above the Generate button and mirrored in the API reference at https://wavespeed.ai/docs/docs-api/stability-ai/stability-ai-stable-audio-3-text-to-audio.

How do I get started with the Stable Audio 3 Text To Audio API?

Sign up for a free WaveSpeedAI account to claim starter credits, copy your API key from /accesskey, then call the endpoint shown in the API tab of the playground. The playground also auto-generates a code sample in Python, JavaScript, or cURL for the parameters you've set.

Can I use Stable Audio 3 Text To Audio outputs commercially?

Commercial usage rights depend on the model's license, set by its provider (Stability AI). The license summary appears on the model card above; see WaveSpeedAI's Terms of Service for platform-level conditions.