Dia directly generates highly realistic dialogue from a transcript. You can condition the output on audio, enabling emotion and tone control. The model can also produce nonverbal communication such as laughter, coughing, and throat clearing.
To accelerate research, we are providing access to pretrained model checkpoints and inference code. The model weights are hosted on Hugging Face. The model only supports English generation at the moment.
We also provide a demo page comparing our model to ElevenLabs Studio and Sesame CSM-1B.
(Update) We have a ZeroGPU Space running! Try it now here. Thanks to the HF team for the support :)
Join our Discord server for community support and access to new features.
Play with a larger version of Dia: generate fun conversations, remix content, and share with friends. 🔮 Join the waitlist for early access.
- Keep input text length moderate.
  - Short input (corresponding to under 5 s of audio) will sound unnatural.
  - Very long input (corresponding to over 20 s of audio) will make the speech unnaturally fast.
- Use non-verbal tags sparingly, from the list in the README. Overusing or using unlisted non-verbals may cause weird artifacts.
- Always begin input text with [S1], and always alternate between [S1] and [S2] (i.e. [S1]... [S1]... is not good).
- When using audio prompts (voice cloning), follow these instructions carefully:
  - Provide the transcript of the to-be-cloned audio before the generation text.
  - The transcript must use [S1] and [S2] speaker tags correctly (i.e. single speaker: [S1]...; two speakers: [S1]... [S2]...).
  - Duration of the to-be-cloned audio should be 5~10 seconds for the best results. (Keep in mind: 1 second ≈ 86 tokens.)
  - Put [S1] or [S2] (the second-to-last speaker's tag) at the end of the audio to improve audio quality at the end.
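The formatting rules above are mechanical enough to lint before spending a generation. The sketch below is a hypothetical pre-flight check, not part of the Dia API: `check_script` and `ALLOWED_NONVERBALS` are names introduced here for illustration, with the tag list copied from this document.

```python
import re

# Non-verbal tags listed in the Dia README (copied from this document).
ALLOWED_NONVERBALS = {
    "(laughs)", "(clears throat)", "(sighs)", "(gasps)", "(coughs)",
    "(singing)", "(sings)", "(mumbles)", "(beep)", "(groans)",
    "(sniffs)", "(claps)", "(screams)", "(inhales)", "(exhales)",
    "(applause)", "(burps)", "(humming)", "(sneezes)", "(chuckle)",
    "(whistles)",
}

def check_script(text: str) -> list[str]:
    """Return a list of guideline violations for a Dia input script."""
    problems = []
    # Rule: input must begin with [S1].
    if not text.lstrip().startswith("[S1]"):
        problems.append("input must begin with [S1]")
    # Rule: [S1] and [S2] must alternate.
    tags = re.findall(r"\[S[12]\]", text)
    for a, b in zip(tags, tags[1:]):
        if a == b:
            problems.append(f"speaker tags must alternate ({a} follows {a})")
            break
    # Rule: only listed non-verbal tags should be used.
    for nv in re.findall(r"\([a-z ]+\)", text):
        if nv not in ALLOWED_NONVERBALS:
            problems.append(f"unlisted non-verbal tag: {nv}")
    return problems
```

For example, `check_script("[S1] Hi. [S1] Again.")` flags the repeated speaker tag, while a well-formed two-speaker script returns an empty list.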
- Generate dialogue via [S1] and [S2] tags.
- Generate non-verbals like (laughs), (coughs), etc.
  - The tags below will be recognized, but might result in unexpected output: (laughs), (clears throat), (sighs), (gasps), (coughs), (singing), (sings), (mumbles), (beep), (groans), (sniffs), (claps), (screams), (inhales), (exhales), (applause), (burps), (humming), (sneezes), (chuckle), (whistles)
- Voice cloning. See example/voice_clone.py for more information.
  - In the Hugging Face Space, you can upload the audio you want to clone and place its transcript before your script. Make sure the transcript follows the required format. The model will then output only the content of your script.
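For voice cloning, the 1 second ≈ 86 tokens rule of thumb and the 5~10 second prompt window lend themselves to a tiny budget check. The helper names below are illustrative, not part of Dia:

```python
TOKENS_PER_SECOND = 86  # rough rate stated in the Dia docs: 1 s ~ 86 tokens

def audio_prompt_tokens(duration_s: float) -> int:
    """Estimate how many audio tokens a voice-cloning prompt consumes."""
    return round(duration_s * TOKENS_PER_SECOND)

def prompt_in_recommended_range(duration_s: float) -> bool:
    """True if the to-be-cloned audio falls in the 5~10 s sweet spot."""
    return 5.0 <= duration_s <= 10.0
```

A 7-second clip, for instance, costs roughly 602 tokens of context on top of your script text.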