LTX 2 19B Text to Video | Powerful Text-to-Video API

LTX-2 19B is the first DiT-based audio-video foundation model, combining synchronized audio and video, high fidelity, and multiple performance modes with production-ready output in a single model. It is available through a ready-to-use REST inference API with no cold starts and affordable pricing.

Examples

Stop-motion clay astronaut plants a flag on a tiny moon set. Soft clay squish, miniature wind, tiny crunching steps. Fixed camera, charming imperfections.

A sleek wireless earbud rotates on a glossy pedestal in a white studio. Subtle whoosh transitions, faint electronic hum, clicky case open/close sounds. Macro close-ups, crisp lighting.

Ultra-realistic luxury interior photography, cold bronze used as refined accent material, softly brushed cold bronze surfaces integrated into walls, fireplace surround and custom furniture, warm metallic reflections, elegant contemporary interior design, modern cubist house, sophisticated yet welcoming atmosphere, curated high-end decor objects, organic shapes, premium materials (stone, natural wood, textured fabrics), balanced composition, normal ceiling height, depth in the space with subtle background details, distant window more than 4 meters away with discreet linen curtains barely visible and not emphasized, natural daylight mixed with soft architectural lighting, realistic shadows, rich textures, luxury interior design editorial style, photographed with a professional full-frame camera, 35mm lens, shallow depth of field, extreme photorealism, no people, no text

Hands warming near Lohri bonfire at a luxury hotel, flying sparks, woolen sleeves, night winter ambience, soft firelight on skin, cinematic close-up, cozy festive mood, ultra realistic photography, Indian winter festival

high fashion studio shoot, minimalism, photorealism, tall woman walking forward, full body, long shot, neutral gray concrete background, soft diffused daylight, clean shadows, outfit: beige oversized sweater dress with large cable knit, asymmetrical off-the-shoulder, long textured sleeves, high black suede over-the-knee sock boots, form-fitting, just above the knees, large suede tote bag in matching tone, hair slickly pulled back, minimal makeup, confident stride, elegant pose, high detail on knitwear and leather, realistic textures, sharp focus, monochrome beige-sand palette

LTX-2 19B Text-to-Video with Audio

LTX-2 is the first DiT-based (Diffusion Transformer) audio-video foundation model, capable of generating synchronized audio and video from a text prompt. With 19 billion parameters, it produces high-fidelity, production-ready clips with natural sound that matches the visuals — no post-production audio layering required.

Why Choose This?

Synchronized audio-video generation: Outputs video with matching audio in a single pass. Footsteps, ambient sounds, speech-like tones, and environmental audio are generated to fit the visual content.
High-fidelity visuals: Leverages a 19B-parameter DiT architecture for detailed, temporally consistent video with minimal flickering.
Flexible resolution and aspect ratio: Supports 480p, 720p, and 1080p outputs in both 16:9 (landscape) and 9:16 (vertical) formats.
Variable duration: Generate clips from 5 to 20 seconds, suitable for quick loops or longer narrative beats.

Parameters

Parameter	Required	Description
prompt	Yes	Text description of the scene, action, and audio cues
resolution	No	Output resolution: 480p, 720p (default), or 1080p
aspect_ratio	No	Output format: 16:9 (default) or 9:16
duration	No	Video length in seconds (5-20)
seed	No	Random seed for reproducibility (-1 for random)
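
The parameter table can be checked on the client side before a request is sent. The sketch below is illustrative only: `build_payload` is a hypothetical helper, not part of any official SDK, and the default duration of 5 seconds is an assumption because the table does not state one.

```python
# Allowed values taken from the parameter table above.
ALLOWED_RESOLUTIONS = {"480p", "720p", "1080p"}
ALLOWED_ASPECT_RATIOS = {"16:9", "9:16"}
MIN_DURATION, MAX_DURATION = 5, 20


def build_payload(prompt, resolution="720p", aspect_ratio="16:9",
                  duration=5, seed=-1):
    """Validate inputs against the documented ranges and return a request body.

    Hypothetical helper. The duration default (5) is an assumption;
    the other defaults follow the parameter table.
    """
    if not isinstance(prompt, str) or not prompt.strip():
        raise ValueError("prompt is required and must be a non-empty string")
    if resolution not in ALLOWED_RESOLUTIONS:
        raise ValueError(f"resolution must be one of {sorted(ALLOWED_RESOLUTIONS)}")
    if aspect_ratio not in ALLOWED_ASPECT_RATIOS:
        raise ValueError(f"aspect_ratio must be one of {sorted(ALLOWED_ASPECT_RATIOS)}")
    if not MIN_DURATION <= duration <= MAX_DURATION:
        raise ValueError(f"duration must be between {MIN_DURATION} and {MAX_DURATION} seconds")
    return {
        "prompt": prompt.strip(),
        "resolution": resolution,
        "aspect_ratio": aspect_ratio,
        "duration": duration,
        "seed": seed,  # -1 requests a random seed
    }
```

Calling `build_payload("Stop-motion clay astronaut plants a flag")` returns a dictionary with the documented defaults filled in, while an out-of-range duration such as 25 raises `ValueError` before any request is made.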

Resolution Options

Resolution	Best For
480p	Fast previews, iteration, lowest cost
720p	Balanced quality and cost (default)
1080p	Final delivery, maximum detail

Aspect Ratio Options

Aspect Ratio	Use Case
16:9	Landscape, YouTube, desktop
9:16	Vertical, TikTok, Stories, Reels

How to Use

Write your prompt — describe the scene, action, and desired audio cues.
Select resolution — 480p for iteration, 720p for balance, 1080p for final output.
Choose aspect ratio — 16:9 for landscape, 9:16 for vertical platforms.
Set duration — 5-20 seconds based on your content needs.
Run — submit and download the generated video with synchronized audio.

Pricing

Resolution	5s	10s	15s	20s
480p	$0.06	$0.12	$0.18	$0.24
720p	$0.08	$0.16	$0.24	$0.32
1080p	$0.12	$0.24	$0.36	$0.48

Billing Rules

Base price: $0.08 (720p, 5 seconds)
Resolution multiplier: 480p = 0.75×, 720p = 1×, 1080p = 1.5×
Duration: Scales linearly (per 5 seconds)
Total cost = duration × $0.08 × resolution_multiplier / 5
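
The billing formula can be checked against the pricing table with a few lines of Python. This is a minimal sketch; `estimate_cost` is a hypothetical name, and `Decimal` is used only to avoid floating-point rounding noise. How the service rounds prices for durations that are not multiples of 5 seconds is not stated here, so the function returns the unrounded value.

```python
from decimal import Decimal

BASE_PRICE = Decimal("0.08")  # 720p, 5 seconds
RESOLUTION_MULTIPLIER = {
    "480p": Decimal("0.75"),
    "720p": Decimal("1"),
    "1080p": Decimal("1.5"),
}


def estimate_cost(resolution, duration_seconds):
    """Return duration * base price * resolution multiplier / 5, in dollars."""
    if resolution not in RESOLUTION_MULTIPLIER:
        raise ValueError(f"unknown resolution: {resolution}")
    if not 5 <= duration_seconds <= 20:
        raise ValueError("duration must be between 5 and 20 seconds")
    multiplier = RESOLUTION_MULTIPLIER[resolution]
    return Decimal(duration_seconds) * BASE_PRICE * multiplier / Decimal(5)
```

For example, `estimate_cost("1080p", 10)` equals 0.24 dollars and `estimate_cost("480p", 5)` equals 0.06 dollars, matching the corresponding cells of the pricing table.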

Best Use Cases

Short-form Content — Create TikTok, Reels, and Stories with built-in audio.
Product Demos — Generate promotional videos with ambient sound.
Social Media — Produce engaging clips without separate audio editing.
Prototyping — Quickly visualize concepts with synchronized audiovisuals.
Marketing — Create ad content with cohesive sound design.

Pro Tips

Audio is automatic — you don't need to explicitly request it.
Describe sounds when it matters (e.g., "jazz music," "thunderstorm").
Match aspect ratio to platform: 9:16 for vertical-first, 16:9 for YouTube/desktop.
Iterate at 480p to dial in the prompt, then render at higher resolution for final output.
Use a fixed seed when testing prompt variations to isolate the effect of your changes.
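
The last two tips can be combined into a simple workflow: draft several prompt variations at 480p with one shared seed, then promote the chosen draft to a higher resolution. The sketch below only builds request bodies; `preview_payloads` and `final_payload` are hypothetical helpers, the seed value 1234 is arbitrary, and whether a seed reproduces the same clip at a different resolution is not stated in this document.

```python
def preview_payloads(prompts, seed=1234, duration=5, aspect_ratio="16:9"):
    """Build 480p request bodies that differ only in their prompt."""
    return [
        {
            "prompt": prompt,
            "resolution": "480p",  # cheapest tier, intended for iteration
            "aspect_ratio": aspect_ratio,
            "duration": duration,
            "seed": seed,          # shared seed isolates the prompt change
        }
        for prompt in prompts
    ]


def final_payload(preview, resolution="1080p"):
    """Copy a chosen preview body, changing only the resolution."""
    return {**preview, "resolution": resolution}


drafts = preview_payloads([
    "A wireless earbud rotates on a glossy pedestal, faint electronic hum.",
    "A wireless earbud rotates on a matte pedestal, soft jazz music.",
])
final = final_payload(drafts[1])
```

Because every draft shares the same seed, duration, and aspect ratio, any difference between the preview clips can be attributed to the prompt wording.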

Notes

Maximum video duration is 20 seconds.
Audio is generated based on the visual content and prompt context.
For longer content, generate multiple clips and edit them together.
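
When a target runtime exceeds 20 seconds, it helps to plan the clip lengths up front so that no clip falls outside the 5 to 20 second range. The sketch below is one possible approach, not a documented feature: `plan_clips` is a hypothetical helper, and it assumes the API accepts any whole number of seconds in that range.

```python
import math

MIN_CLIP = 5   # shortest clip the model generates, in seconds
MAX_CLIP = 20  # longest clip the model generates, in seconds


def plan_clips(total_seconds):
    """Split a target runtime into near-equal clips of 5 to 20 seconds each."""
    if not isinstance(total_seconds, int) or total_seconds < MIN_CLIP:
        raise ValueError(f"total_seconds must be an integer >= {MIN_CLIP}")
    # Use the fewest clips that keep every clip at or under the maximum.
    count = math.ceil(total_seconds / MAX_CLIP)
    base, extra = divmod(total_seconds, count)
    # The first `extra` clips get one additional second each.
    return [base + 1] * extra + [base] * (count - extra)
```

Splitting evenly instead of filling 20-second clips first avoids a short leftover: a 22-second target becomes two 11-second clips rather than a 20-second clip plus an invalid 2-second one.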

Related Models

LTX-2 19B Image-to-Video — Animate a reference image into video with synchronized audio.
Wan 2.5 T2V — Alternative text-to-video with the Wan ecosystem.
Kling 2.6 T2V — Kuaishou's latest text-to-video generation model.
Sora 2 T2V — OpenAI's text-to-video model with cinematic quality.
