OpenAI Whisper With Video


OpenAI Whisper Large v3 (Video-to-Text) delivers high-accuracy multilingual transcription directly from video files, with automatic language detection and optional timestamped, subtitle-ready segments. Built for stable production use with a ready-to-use REST API, fast response, no cold starts, and predictable pricing.

Features

OpenAI Whisper (Large-v3) — Video-to-Text

OpenAI Whisper — Video-to-Text is a production-ready speech recognition endpoint powered by Whisper large-v3. It transcribes or translates speech directly from video files by extracting audio and returning clean, readable text, with optional word-level timestamps for subtitle and alignment workflows.

Built for stable production use with a ready-to-use REST API, no cold starts, and predictable pay-per-second pricing.


Key capabilities

  • Video input support (audio is extracted automatically)
  • Two tasks: transcribe and translate
  • Language selection: auto detection or manual language code
  • Optional word-level timestamps via enable_timestamps
  • Optional sync response via enable_sync_mode (API only)

Parameters

| Parameter | Required | Description |
| --- | --- | --- |
| video | Yes | Input video (upload or public URL). |
| language | No | Language code or auto (default). |
| task | No | transcribe or translate. |
| enable_timestamps | No | Generate word-level timestamps (may increase processing time). |
| prompt | No | Short guidance text to steer transcription/translation style. |
| enable_sync_mode | No | API only: wait for the result and return it directly in the response. |

How to use

  1. Upload a video (or paste a public video URL).

  2. Set language:

    • Use auto for most cases.
    • Choose a specific language code if detection is unstable.
  3. Choose task:

    • transcribe for same-language transcription
    • translate for English output
  4. (Optional) Enable enable_timestamps if you need subtitle timing/alignment.

  5. (Optional) Add a prompt to guide formatting or terminology (names, jargon, punctuation).

  6. Run and read the transcript output.

API note: enable_sync_mode is not shown as a normal UI option; it’s only available through the API.
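The steps above amount to assembling a small JSON body. A minimal sketch in Python, assuming the parameter names in the tables on this page; `build_payload` itself is an illustrative helper, not part of the API:

```python
def build_payload(video_url, language="auto", task="transcribe",
                  enable_timestamps=False, prompt=None, enable_sync_mode=None):
    """Assemble the JSON body for a transcription request.

    Field names mirror the documented parameters. Optional fields are
    only included when explicitly set, so the API's own defaults apply
    when they are omitted.
    """
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    payload = {
        "video": video_url,
        "language": language,
        "task": task,
        "enable_timestamps": enable_timestamps,
    }
    if prompt:
        payload["prompt"] = prompt
    if enable_sync_mode is not None:
        payload["enable_sync_mode"] = enable_sync_mode
    return payload
```

For example, `build_payload("https://example.com/talk.mp4", task="translate")` produces a body that transcribes the audio and translates it to English, leaving sync mode to the API default.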


Pricing

| Mode | enable_timestamps | Price per second |
| --- | --- | --- |
| Standard | false | $0.001 / s |
| Timestamped | true | $0.002 / s |

Examples

| Video length | Standard | Timestamped |
| --- | --- | --- |
| 60s | $0.06 | $0.12 |
| 600s (10 min) | $0.60 | $1.20 |
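The examples above follow directly from the per-second rates. A one-line estimator, assuming the rates in the pricing table; `estimate_cost` is an illustrative helper, not an official calculator:

```python
def estimate_cost(duration_seconds, enable_timestamps=False):
    """Estimate the charge for one job from the documented per-second rates."""
    rate = 0.002 if enable_timestamps else 0.001
    # Round to avoid floating-point dust in the displayed price.
    return round(duration_seconds * rate, 4)
```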

Notes

  • If you use a URL, it must be publicly accessible; a preview thumbnail appearing in the UI is a good sanity check.
  • Timestamps are best for subtitles and editing, but may take longer to process.
  • For best accuracy, use clear speech and minimize background music/noise.
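When timestamps feed a subtitle workflow, the main formatting chore is converting second offsets into SRT's `HH:MM:SS,mmm` form. A small sketch of that conversion (the exact shape of the timestamp data in the response is not documented here, so only the time formatting is shown):

```python
def srt_timestamp(seconds):
    """Format a time offset in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"
```

For example, a word starting at 61.5 seconds becomes `00:01:01,500` in an SRT cue.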

More Models to Try

  • OpenAI Whisper Turbo on WaveSpeedAI — Faster, cost-efficient speech-to-text for real-time or high-volume transcription pipelines while keeping strong multilingual recognition quality.

Authentication

For authentication details, please refer to the Authentication Guide.

API Endpoints

Submit Task & Query Result


# Submit the task
curl --location --request POST "https://api.wavespeed.ai/api/v3/wavespeed-ai/openai-whisper-with-video" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer ${WAVESPEED_API_KEY}" \
--data-raw '{
    "video": "https://example.com/video.mp4",
    "language": "auto",
    "task": "transcribe",
    "enable_timestamps": false,
    "prompt": "",
    "enable_sync_mode": true
}'

# Get the result
curl --location --request GET "https://api.wavespeed.ai/api/v3/predictions/${requestId}/result" \
--header "Authorization: Bearer ${WAVESPEED_API_KEY}"
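When `enable_sync_mode` is off, the result endpoint above must be polled until the task reaches a terminal status. A hedged sketch using only Python's standard library, with the endpoint taken from the curl command above; retry cadence and error handling are illustrative choices, not API requirements:

```python
import json
import time
import urllib.request

API_BASE = "https://api.wavespeed.ai/api/v3"


def get_result(request_id, api_key):
    """Fetch the current prediction record for a submitted task."""
    req = urllib.request.Request(
        f"{API_BASE}/predictions/{request_id}/result",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def is_terminal(status):
    """A task stops changing once it is completed or failed."""
    return status in ("completed", "failed")


def wait_for_result(request_id, api_key, interval=2.0, timeout=300.0):
    """Poll until the task reaches a terminal status or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        record = get_result(request_id, api_key)["data"]
        if is_terminal(record["status"]):
            return record
        time.sleep(interval)
    raise TimeoutError(f"task {request_id} did not finish in {timeout}s")
```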

Parameters

Task Submission Parameters

Request Parameters

| Parameter | Type | Required | Default | Range | Description |
| --- | --- | --- | --- | --- | --- |
| video | string | Yes | - | - | Video file or URL to transcribe. Provide an HTTPS URL or upload a video file. Audio will be extracted for transcription. |
| language | string | No | auto | auto, af, am, ar, as, az, ba, be, bg, bn, bo, br, bs, ca, cs, cy, da, de, el, en, es, et, eu, fa, fi, fo, fr, gl, gu, ha, haw, he, hi, hr, ht, hu, hy, id, is, it, ja, jw, ka, kk, km, kn, ko, la, lb, ln, lo, lt, lv, mg, mi, mk, ml, mn, mr, ms, mt, my, ne, nl, nn, no, oc, pa, pl, ps, pt, ro, ru, sa, sd, si, sk, sl, sn, so, sq, sr, su, sv, sw, ta, te, tg, th, tk, tl, tr, tt, uk, ur, uz, vi, yi, yo, zh, yue | Language spoken in the audio. Set to 'auto' for automatic language detection (default). |
| task | string | No | transcribe | transcribe, translate | The task to perform: 'transcribe' keeps the source language; 'translate' outputs English. |
| enable_timestamps | boolean | No | false | - | Enable to generate word-level timestamps for the transcription. Note: this may increase processing time. |
| prompt | string | No | - | - | Optional text to guide the model's style or continue a previous audio segment. The prompt should be in the same language as the audio. |
| enable_sync_mode | boolean | No | true | - | If true, the request waits for the result to be generated and uploaded before returning, so you get the result directly in the response. Only available through the API. |

Response Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| code | integer | HTTP status code (e.g., 200 for success) |
| message | string | Status message (e.g., "success") |
| data.id | string | Unique identifier for the prediction (task ID) |
| data.model | string | Model ID used for the prediction |
| data.outputs | array | Array of URLs to the generated content (empty when status is not completed) |
| data.urls | object | Object containing related API endpoints |
| data.urls.get | string | URL to retrieve the prediction result |
| data.has_nsfw_contents | array | Array of boolean values indicating NSFW detection for each output |
| data.status | string | Status of the task: created, processing, completed, or failed |
| data.created_at | string | ISO timestamp of when the request was created (e.g., "2023-04-01T12:34:56.789Z") |
| data.error | string | Error message (empty if no error occurred) |
| data.timings | object | Object containing timing details |
| data.timings.inference | integer | Inference time in milliseconds |

Result Request Parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| id | string | Yes | - | Task ID |

Result Response Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| code | integer | HTTP status code (e.g., 200 for success) |
| message | string | Status message (e.g., "success") |
| data | object | The prediction data object containing all details |
| data.id | string | Unique identifier for the prediction (the task ID used in the request) |
| data.model | string | Model ID used for the prediction |
| data.outputs | array | Array of URLs to the generated content (empty when status is not completed) |
| data.urls | object | Object containing related API endpoints |
| data.urls.get | string | URL to retrieve the prediction result |
| data.status | string | Status of the task: created, processing, completed, or failed |
| data.created_at | string | ISO timestamp of when the request was created (e.g., "2023-04-01T12:34:56.789Z") |
| data.error | string | Error message (empty if no error occurred) |
| data.timings | object | Object containing timing details |
| data.timings.inference | integer | Inference time in milliseconds |
© 2025 WaveSpeedAI. All rights reserved.