
video-to-video
Your request will cost $0.15 per run.
For $10 you can run this model approximately 66 times.
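The run estimate above is the budget divided by the per-run price, rounded down to whole runs. A minimal sketch of that arithmetic, assuming the $0.15 price stays fixed and no other fees apply (the function name `runs_for_budget` is hypothetical and used only for illustration):

```python
from decimal import Decimal

# Per-run price quoted on this page, in USD.
COST_PER_RUN = Decimal("0.15")


def runs_for_budget(budget_usd: str) -> int:
    """Return how many whole runs a budget covers (partial runs are dropped)."""
    return int(Decimal(budget_usd) // COST_PER_RUN)


print(runs_for_budget("10"))  # 66
```

Decimal is used instead of float so that the division of currency amounts is exact.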
ByteDance Latent Sync combines Stable Diffusion with TREPA to deliver precise, high-resolution lip synchronization for dynamic, realistic video generation. Our framework models complex audio-visual correlations directly with Stable Diffusion. We found, however, that diffusion-based lip-sync methods exhibit inferior temporal consistency, so we propose Temporal REPresentation Alignment (TREPA) to improve temporal consistency while preserving lip-sync accuracy. TREPA uses temporal representations extracted by large-scale self-supervised video models to align the generated frames with the ground-truth frames. Our endpoint accepts mp4 files for video input and mp3, aac, wav, or m4a files for audio input.
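Before submitting a request, it can help to check locally that both files use one of the formats listed above. The sketch below is only an illustration: it looks at file extensions, not at the actual container or codec, and the names `check_inputs`, `VIDEO_EXTENSIONS`, and `AUDIO_EXTENSIONS` are hypothetical rather than part of any endpoint API.

```python
from pathlib import Path

# Formats listed in the description above; extension check only.
VIDEO_EXTENSIONS = {".mp4"}
AUDIO_EXTENSIONS = {".mp3", ".aac", ".wav", ".m4a"}


def check_inputs(video_path: str, audio_path: str) -> list[str]:
    """Return a list of problems; an empty list means both extensions look fine."""
    problems = []
    video_ext = Path(video_path).suffix.lower()
    audio_ext = Path(audio_path).suffix.lower()
    if video_ext not in VIDEO_EXTENSIONS:
        problems.append(f"unsupported video format: {video_ext or '(none)'}")
    if audio_ext not in AUDIO_EXTENSIONS:
        problems.append(f"unsupported audio format: {audio_ext or '(none)'}")
    return problems


print(check_inputs("speaker.mp4", "voice.WAV"))  # []
```

Extensions are compared case-insensitively, so `voice.WAV` passes, while a `.mov` video or `.ogg` audio file would each produce one entry in the returned list.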