video-to-video

Bytedance LatentSync

bytedance/latentsync

Bytedance LatentSync combines Stable Diffusion and TREPA for high-resolution, end-to-end lip sync, delivering precise, realistic mouth motions in generated videos. Available as a ready-to-use REST inference API with high performance, no cold starts, and affordable pricing.

Your request will cost $0.15 per run.

For $10 you can run this model approximately 66 times.
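
As a quick sanity check on that estimate, here is a minimal Python sketch of the arithmetic. It assumes the flat $0.15-per-run rate quoted above; volume discounts, if any, are not modeled:

```python
# Whole runs covered by a budget at a flat per-run rate.
# Assumes the $0.15/run price quoted above; discounts are not modeled.
COST_PER_RUN_USD = 0.15

def runs_for_budget(budget_usd: float) -> int:
    """Return how many complete runs a budget covers."""
    return int(budget_usd // COST_PER_RUN_USD)

print(runs_for_budget(10.0))  # -> 66, matching the estimate above
```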

README

Bytedance LatentSync harnesses Stable Diffusion and TREPA to deliver precise, high-resolution lip synchronization for dynamic, realistic video generation. Our framework directly models complex audio-visual correlations using Stable Diffusion. We also found that diffusion-based lip-sync methods exhibit inferior temporal consistency, so we propose Temporal REPresentation Alignment (TREPA) to enhance temporal consistency while preserving lip-sync accuracy. TREPA uses temporal representations extracted by large-scale self-supervised video models to align the generated frames with the ground-truth frames. Our endpoint accepts mp4 for the video input and mp3, aac, wav, or m4a files for the audio input.
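
To illustrate the input contract, here is a minimal sketch of submitting a job over REST. The endpoint URL, auth header, and multipart field names (`video`, `audio`) are hypothetical placeholders, not this provider's documented API; check the actual API reference before use.

```python
# Minimal sketch of calling a hosted LatentSync endpoint over REST.
# The URL, auth scheme, and field names below are illustrative
# placeholders -- consult the provider's API reference for the real ones.
import requests

API_URL = "https://api.example.com/v1/bytedance/latentsync"  # hypothetical
API_KEY = "YOUR_API_KEY"

with open("input.mp4", "rb") as video, open("speech.wav", "rb") as audio:
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={
            "video": ("input.mp4", video, "video/mp4"),   # mp4 video input
            "audio": ("speech.wav", audio, "audio/wav"),  # mp3/aac/wav/m4a
        },
        timeout=600,  # lip-sync inference on longer clips can take a while
    )
response.raise_for_status()

# Assume the service returns JSON describing the result,
# e.g. a URL to the lip-synced output video.
print(response.json())
```

Depending on the provider, longer jobs may instead return a request ID to poll rather than blocking; the synchronous call above is simply the most compact case to sketch.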