Chatterbox Speech to Speech API

Chatterbox Speech-to-Speech transforms a source audio clip into a target voice style using optional reference audio. It is suitable for voice conversion, style transfer, creator dubbing, character voice prototyping, and other speech-to-speech workflows where you want to preserve spoken content while changing vocal identity or delivery style.

Why Choose This?

Speech-to-speech conversion
Transform an existing speech recording into a different voice style.
Optional reference voice guidance
Add reference_audio when you want the output to follow a particular vocal tone or character.
Simple workflow
Upload source audio, optionally upload a reference voice sample, and generate the converted result.
Useful for creator and dubbing workflows
Suitable for voice restyling, character voice tests, demo production, and spoken-content transformation.
Production-ready API
Exposed as a REST endpoint with an asynchronous submit-and-poll flow, suitable for narration replacement, content localization, and other production audio workflows.

Parameters

Parameter	Required	Description
audio	Yes	Source audio to convert.
reference_audio	No	Optional reference audio used to guide the target voice style.
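
As a minimal sketch of how the two parameters map to a request body, the helper below builds the JSON payload and includes reference_audio only when it is supplied. The function name `build_payload` and the example.com URLs are hypothetical, and passing reference_audio as a URL string is an assumption based on how the quick-start examples pass audio.

```python
from typing import Optional


def build_payload(audio_url: str, reference_audio_url: Optional[str] = None) -> dict:
    """Build the JSON request body (hypothetical helper, not part of any SDK)."""
    if not audio_url:
        raise ValueError("audio is required")
    payload = {"audio": audio_url}
    # reference_audio is optional, so it is only sent when provided.
    # Assumption: it is passed as a URL string, like audio.
    if reference_audio_url:
        payload["reference_audio"] = reference_audio_url
    return payload


# Placeholder URLs for illustration only.
print(build_payload("https://example.com/source.mp3"))
print(build_payload("https://example.com/source.mp3", "https://example.com/voice.wav"))
```

Leaving the key out entirely, rather than sending an empty value, keeps the request unambiguous when no reference voice is wanted.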

How to Use

Upload your source audio — provide the speech recording you want to transform.
Upload reference audio (optional) — add a target voice sample if you want stronger style guidance.
Submit — run the model and download the converted speech audio.

Example Use Case

Convert a spoken voice clip into a different vocal style for creator content, dubbing, or character voice testing.

Pricing

$0.02 per started minute of source audio.

Billing Rules

Pricing is $0.02 per started minute
Audio duration is billed in started 60-second units
Audio shorter than 60 seconds is billed as 1 minute
reference_audio does not affect pricing

Example Costs

Audio Duration	Cost
1s–60s	$0.02
61s–120s	$0.04
121s–180s	$0.06
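
The billing rules and the cost table above can be reproduced with a short calculation. This is a sketch only: `estimate_cost_cents` is a hypothetical helper, and the assumption that fractional durations round up to the next started minute is an interpretation of "started minute", not a documented guarantee.

```python
import math

PRICE_CENTS_PER_STARTED_MINUTE = 2  # $0.02, taken from the pricing section


def estimate_cost_cents(duration_seconds: float) -> int:
    """Estimate the charge in cents for a source clip (hypothetical helper)."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    # Every started 60-second unit is billed in full.
    started_minutes = math.ceil(duration_seconds / 60)
    return started_minutes * PRICE_CENTS_PER_STARTED_MINUTE


for seconds in (1, 60, 61, 120, 121, 180):
    print(f"{seconds}s -> ${estimate_cost_cents(seconds) / 100:.2f}")
```

Working in whole cents avoids floating-point rounding when costs are summed across many clips.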

Best Use Cases

Voice style transfer — Convert speech into a different vocal tone or identity.
Character voice prototyping — Test alternative voice styles for characters or avatars.
Creator dubbing — Rework spoken audio for short-form content or promos.
Narration restyling — Preserve content while changing delivery feel.
Speech workflow experiments — Compare different voice directions from the same recording.

Pro Tips

Use clean source audio for better intelligibility.
Add reference_audio only when you want stronger target voice guidance.
Use a clear reference sample with stable tone for more consistent conversion.
Short clips are useful for testing before processing longer audio.

Notes

audio is required.
reference_audio is optional.
Pricing is based on source audio duration and billed per started minute.
Better source audio and cleaner reference audio generally improve output quality.

Related Models

Chatterbox Text-to-Speech — Generate speech directly from text.
Voice cloning workflows — Useful when you need a reusable custom voice identity instead of per-request voice guidance.
Audio generation workflows — Useful when you need music or sound generation instead of speech conversion.

Speech To Speech API — Quick start

Get a WaveSpeedAI API key, then send a POST request to https://api.wavespeed.ai/api/v3/chatterbox/speech-to-speech with your input as a JSON body. The endpoint returns a prediction ID. Poll the result endpoint about every 2 seconds, lengthen the interval for long-running tasks, and stop on any terminal status (completed, failed, cancelled, or timeout in the examples below). When the status is completed, read the output values from data.outputs. The examples below show the full flow in shell, Node.js, and Python.

HTTP example

#!/usr/bin/env bash
# Requires curl 7.76 or later (for --fail-with-body) and jq.
set -euo pipefail

: "${WAVESPEED_API_KEY:?Set WAVESPEED_API_KEY}"

REQUEST_BODY=$(cat <<'JSON'
{
    "audio": "https://interactive-examples.mdn.mozilla.net/media/cc0-audio/t-rex-roar.mp3"
}
JSON
)

# 1. Submit the prediction.
SUBMIT_RESPONSE=$(curl --silent --show-error --fail-with-body \
  -X POST "https://api.wavespeed.ai/api/v3/chatterbox/speech-to-speech" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $WAVESPEED_API_KEY" \
  -d "$REQUEST_BODY")

TASK=$(printf '%s' "$SUBMIT_RESPONSE" | jq 'if has("data") then .data else . end')
PREDICTION_ID=$(printf '%s' "$TASK" | jq -r '.id')
if [ -z "$PREDICTION_ID" ] || [ "$PREDICTION_ID" = "null" ]; then
  printf 'Submission response did not contain a prediction id\n' >&2
  exit 1
fi
RESULT_URL=$(printf '%s' "$TASK" | jq -r '.urls.get // empty')
if [ -z "$RESULT_URL" ]; then
  RESULT_URL="https://api.wavespeed.ai/api/v3/predictions/$PREDICTION_ID/result"
fi

# 2. Poll until the prediction finishes.
while true; do
  RESPONSE=$(curl --silent --show-error --fail-with-body "$RESULT_URL" \
    -H "Authorization: Bearer $WAVESPEED_API_KEY")
  RESULT=$(printf '%s' "$RESPONSE" | jq 'if has("data") then .data else . end')
  STATUS=$(printf '%s' "$RESULT" | jq -r '.status')
  case "$STATUS" in
    completed) printf '%s\n' "$RESULT" | jq '.outputs'; break ;;
    failed|cancelled|timeout) printf '%s\n' "$RESULT" | jq . >&2; exit 1 ;;
    created|processing) sleep 2 ;;
    *) printf 'Unexpected status: %s\n' "$STATUS" >&2; exit 1 ;;
  esac
done

Node.js example

const submitUrl = "https://api.wavespeed.ai/api/v3/chatterbox/speech-to-speech";
const apiKey = process.env.WAVESPEED_API_KEY;
if (!apiKey) throw new Error('Set WAVESPEED_API_KEY');

async function requestJson(url, options = {}) {
  const response = await fetch(url, options);
  if (!response.ok) throw new Error(await response.text());
  return response.json();
}

// 1. Submit the prediction.
const body = await requestJson(submitUrl, {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${apiKey}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    "audio": "https://interactive-examples.mdn.mozilla.net/media/cc0-audio/t-rex-roar.mp3",
  }),
});
const task = body.data ?? body;
if (!task.id) throw new Error("Submission response did not contain a prediction id");
const resultUrl = task.urls?.get ||
  `https://api.wavespeed.ai/api/v3/predictions/${task.id}/result`;

// 2. Poll until the prediction finishes.
while (true) {
  const resultBody = await requestJson(resultUrl, {
    headers: { "Authorization": `Bearer ${apiKey}` },
  });
  const result = resultBody.data ?? resultBody;
  if (result.status === "completed") {
    console.log(result.outputs);
    break;
  }
  if (["failed", "cancelled", "timeout"].includes(result.status)) throw new Error(JSON.stringify(result));
  if (!["created", "processing"].includes(result.status)) throw new Error("Unexpected status: " + result.status);
  await new Promise(resolve => setTimeout(resolve, 2000));
}

Python example

import json
import os
import time
from urllib.request import Request, urlopen

api_key = os.environ["WAVESPEED_API_KEY"]
headers = {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"}
payload = {
    "audio": "https://interactive-examples.mdn.mozilla.net/media/cc0-audio/t-rex-roar.mp3"
}

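def request_json(url, data=None):
    # POST when a JSON body is supplied, otherwise GET.
    request = Request(url, data=data, headers=headers, method="POST" if data is not None else "GET")
    # The timeout keeps a stalled connection from hanging the script.
    with urlopen(request, timeout=30) as response:
        return json.load(response)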

# 1. Submit the prediction.
body = request_json("https://api.wavespeed.ai/api/v3/chatterbox/speech-to-speech", json.dumps(payload).encode())
task = body.get("data", body)
if not task.get("id"):
    raise RuntimeError("Submission response did not contain a prediction id")
result_url = task.get("urls", {}).get("get") or f"https://api.wavespeed.ai/api/v3/predictions/{task['id']}/result"

# 2. Poll until the prediction finishes.
while True:
    result_body = request_json(result_url)
    result = result_body.get("data", result_body)
    status = result.get("status")
    if status == "completed":
        print(result.get("outputs", []))
        break
    if status in {"failed", "cancelled", "timeout"}:
        raise RuntimeError(result)
    if status not in {"created", "processing"}:
        raise RuntimeError(f"Unexpected status: {status}")
    time.sleep(2)
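
The examples above sleep for a fixed 2 seconds between polls, while the quick start suggests lengthening the interval for long-running tasks. One way to do that is a generator of growing delays, sketched below. The name `poll_delays`, the growth factor of 1.5, and the 15-second cap are illustrative assumptions, not values documented by WaveSpeedAI.

```python
import itertools
from typing import Iterator


def poll_delays(initial: float = 2.0, factor: float = 1.5, maximum: float = 15.0) -> Iterator[float]:
    """Yield sleep intervals that start at `initial` and grow up to `maximum`."""
    delay = initial
    while True:
        yield min(delay, maximum)
        # Grow the interval, but never beyond the cap (assumed values).
        delay = min(delay * factor, maximum)


# First six intervals with the default (assumed) settings.
print(list(itertools.islice(poll_delays(), 6)))
# → [2.0, 3.0, 4.5, 6.75, 10.125, 15.0]
```

In the Python polling loop, this could replace the fixed sleep: create `delays = poll_delays()` before the loop and call `time.sleep(next(delays))` where the loop currently calls `time.sleep(2)`.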

Speech To Speech API — Frequently asked questions

What is the Speech To Speech API?

Speech To Speech is a Chatterbox voice conversion model exposed as a REST API on WaveSpeedAI. It converts source audio into a target voice style, with optional reference audio to guide the result. Typical uses include voice conversion, speech style transfer, dubbing, character voices, creator content, and audio localization. The API is designed for simple integration, with no cold starts and affordable pricing. You can call it programmatically or try it from the playground above.

How do I call the Speech To Speech API?

POST your input parameters to the model's REST endpoint (shown in the API tab of this playground) with your WaveSpeedAI API key in the Authorization header. Submission returns a prediction ID. Poll the result endpoint starting around every 2 seconds, increase the interval for long-running tasks, and stop on any terminal status. The playground generates production-oriented Python, JavaScript, and cURL examples with timeouts, transient-error handling, and safe GET retries. Full request/response shape is documented at https://wavespeed.ai/docs/docs-api/chatterbox/chatterbox-speech-to-speech.
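
The answer above mentions transient-error handling and safe GET retries. The sketch below shows one possible shape for that, using only the standard library. `get_with_retries` and `is_transient` are hypothetical helpers, and treating HTTP 429 and 5xx responses as retryable is an assumption, not documented API behavior.

```python
import time
from typing import Callable, TypeVar
from urllib.error import HTTPError, URLError

T = TypeVar("T")


def is_transient(error: Exception) -> bool:
    """Treat network failures, timeouts, and HTTP 429/5xx as retryable (an assumption)."""
    if isinstance(error, HTTPError):  # HTTPError subclasses URLError, so check it first
        return error.code == 429 or error.code >= 500
    return isinstance(error, (URLError, TimeoutError))


def get_with_retries(fetch: Callable[[], T], attempts: int = 3,
                     base_delay: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> T:
    """Call an idempotent GET `fetch`, retrying transient failures with a growing delay."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as error:
            # Give up on the last attempt or on errors that a retry will not fix.
            if attempt == attempts - 1 or not is_transient(error):
                raise
            sleep(base_delay * (2 ** attempt))
    raise ValueError("attempts must be at least 1")
```

With the Python example's `request_json`, the polling call could become `get_with_retries(lambda: request_json(result_url))`. Only the GET is wrapped here: retrying the POST submission blindly could create duplicate predictions, and each one would be billed.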

How much does Speech To Speech cost per run?

Speech To Speech costs $0.02 per started minute of source audio, so a clip of up to 60 seconds costs $0.02 per run. The charge scales only with the duration of the source audio; reference_audio does not affect pricing. The exact cost for your current input is shown live next to the Generate button before you submit, and the actual per-call charge is recorded on the prediction afterwards.

What inputs does Speech To Speech accept?

Key inputs: `audio`, `reference_audio`. The full JSON schema (types, defaults, allowed values) is rendered above the Generate button and mirrored in the API reference at https://wavespeed.ai/docs/docs-api/chatterbox/chatterbox-speech-to-speech.

How long does Speech To Speech take to generate?

Median end-to-end generation time on WaveSpeedAI is around 9 seconds per request, based on recent successful runs. Queue time varies with global demand; live status is visible in the prediction record.

Can I use Speech To Speech outputs commercially?

Commercial usage rights depend on the model's license, set by its provider (Chatterbox). The license summary appears on the model card above; see WaveSpeedAI's Terms of Service for platform-level conditions.
