
video-to-video

bytedance/latentsync

Latent Sync combines Stable Diffusion with TREPA to deliver precise, high-resolution lip synchronization for dynamic, realistic video generation. Pricing: $0.15 per 5 seconds of generated video.

Your request will cost $0.15 per run.

For $10 you can run this model approximately 66 times.
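The arithmetic behind these figures can be sketched as follows. This is a rough estimate, not the platform's billing logic: the function names are hypothetical, and the per-duration estimate assumes that every started 5-second block is billed in full, which the page does not state.

```python
import math

PRICE_CENTS = 15    # $0.15, the price stated on this page
BLOCK_SECONDS = 5   # the page quotes the price per 5 seconds of video


def runs_for_budget(budget_cents, price_cents=PRICE_CENTS):
    """Number of whole runs a budget covers (integer cents avoid float error)."""
    return budget_cents // price_cents


def estimate_cost_cents(duration_seconds, price_cents=PRICE_CENTS,
                        block_seconds=BLOCK_SECONDS):
    """Estimated cost of one video, in cents.

    ASSUMPTION: each started 5-second block is charged in full.
    The actual rounding rule is not documented here.
    """
    blocks = max(1, math.ceil(duration_seconds / block_seconds))
    return blocks * price_cents


if __name__ == "__main__":
    print(runs_for_budget(1000))      # $10 budget at $0.15 per run
    print(estimate_cost_cents(12))    # 12-second clip under the assumption above
```

With a $10 budget, `runs_for_budget(1000)` gives 66 runs, matching the figure above.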

README

ByteDance Latent Sync combines Stable Diffusion with TREPA to deliver precise, high-resolution lip synchronization for dynamic, realistic video generation. Our framework models complex audio-visual correlations directly with Stable Diffusion. We also found that diffusion-based lip-sync methods exhibit inferior temporal consistency, so we propose Temporal REPresentation Alignment (TREPA) to improve temporal consistency while preserving lip-sync accuracy. TREPA uses temporal representations extracted by large-scale self-supervised video models to align the generated frames with the ground-truth frames.

Our endpoint accepts mp4 files for the video input and mp3, aac, wav, or m4a files for the audio input.
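Before uploading, it can help to check that both files use a supported format. The snippet below is a minimal sketch that only inspects file extensions; `check_inputs` is a hypothetical helper, not part of the endpoint, and it does not verify the actual container or codec.

```python
from pathlib import Path

# Formats listed in the README above.
VIDEO_EXTENSIONS = {".mp4"}
AUDIO_EXTENSIONS = {".mp3", ".aac", ".wav", ".m4a"}


def check_inputs(video_path, audio_path):
    """Return a list of problems found; an empty list means both look fine.

    Hypothetical helper: checks the file extension only (case-insensitive).
    """
    problems = []
    video_ext = Path(video_path).suffix.lower()
    audio_ext = Path(audio_path).suffix.lower()
    if video_ext not in VIDEO_EXTENSIONS:
        problems.append(f"unsupported video format: {video_ext or '(none)'}")
    if audio_ext not in AUDIO_EXTENSIONS:
        problems.append(f"unsupported audio format: {audio_ext or '(none)'}")
    return problems


if __name__ == "__main__":
    print(check_inputs("speaker.mp4", "voice.WAV"))
    print(check_inputs("speaker.mov", "voice.ogg"))
```

An extension check catches the most common mistake cheaply; a stricter check would need to open the files and read their headers.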