Song Generation


SongGeneration (LeVo) is an open-source text-to-song model that turns lyrics and optional audio or text prompts into high-quality songs. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.

Features

SongGeneration

SongGeneration (LeVo) is an open-source text-to-song model developed by Tencent AI Lab that generates high-quality songs with lyrics. It is positioned as comparable to leading commercial music generation models such as Suno 4.5. Provide lyrics, and optionally an audio or text prompt, to generate a custom song.

Usage

Provide lyrics, and optionally an audio or text prompt, to generate a custom song.

Lyrics format

Lyrics need to be in the following format:

[structure tag]
lyrics
[structure tag]
lyrics

Each paragraph represents one segment: it starts with a structure tag and ends with a blank line.
Each line represents one sentence. Punctuation inside a sentence is not recommended.
The following segments should not contain lyrics: [intro-short], [intro-medium], [inst-short], [inst-medium], [outro-short], [outro-medium]
The following segments require lyrics: [verse], [chorus], [bridge]

An example of lyrics in this format:

[intro-short]

[verse]
Streetlights flicker in the night
I wander through familiar corners
Memories rush in like a tide
Your smile so vivid and bright
Etched in my heart, it won’t fade
All those moments once so sweet
Now I’m left with only memories

[verse]
My phone screen lights up
A message from you appears
Just a few simple words
Yet they bring me to tears
The warmth of your embrace
Now feels so far away
How I wish to turn back time
And have you by my side again

[chorus]
The warmth of memories still remains
But you are gone
My heart was filled with love
Now pierced by longing
The rhythm of music plays
But my heart is drifting
In days without you
How can I keep moving on

[outro-short]
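
The format above can also be assembled programmatically. The following is a minimal Python sketch, not part of the API: the helper name `build_lyric` and the (tag, lines) representation are assumptions made for illustration.

```python
def build_lyric(segments):
    """Join (tag, lines) pairs into the tagged, blank-line-separated lyric format."""
    blocks = []
    for tag, lines in segments:
        # Each segment starts with its structure tag on its own line.
        block = [f"[{tag}]"]
        # One sentence per line; instrumental segments pass an empty list.
        block.extend(line.strip() for line in lines)
        blocks.append("\n".join(block))
    # A blank line separates consecutive segments.
    return "\n\n".join(blocks)


lyric = build_lyric([
    ("intro-short", []),
    ("verse", ["Streetlights flicker in the night",
               "I wander through familiar corners"]),
    ("outro-short", []),
])
print(lyric)
```

Keeping the segments as structured data makes it easier to reorder sections or swap a verse without hand-editing blank lines.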

Description

The description can be used to specify the genre and timbre of the music, for example:

female, dark, pop, sad, piano and drums, the bpm is 125

Prompt Audio

Prompt audio can be used to guide the model toward the genre of a reference audio clip.

Priority

When more than one of these inputs is provided, they take priority in this order: prompt_audio > description > genre
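
As a small illustration of that ordering, the sketch below reports which input would take precedence for a given request. The function name `effective_style_source` is hypothetical, and the sketch only mirrors the stated ordering; how the service treats the lower-priority inputs is not described here.

```python
def effective_style_source(prompt_audio=None, description=None, genre="Auto"):
    """Name the field that wins under prompt_audio > description > genre."""
    if prompt_audio:
        return "prompt_audio"
    if description:
        return "description"
    # genre always has a value because it defaults to "Auto".
    return "genre"


print(effective_style_source(description="male, piano, jazz", genre="Rock"))
```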

Input Guide

🎵 Lyrics Input Format

The lyric field defines the lyrics and structure of the song. It consists of multiple musical sections, each starting with a structure label. The model uses these labels to guide the musical and lyrical progression of the generated song.

📌 Structure Labels

The following segments should not contain lyrics (they are purely instrumental):
- [intro-short], [intro-medium], [inst-short], [inst-medium], [outro-short], [outro-medium]
- short indicates a segment of approximately 0–10 seconds
- medium indicates a segment of approximately 10–20 seconds
- We find that the [inst] labels are less stable, so we recommend not using them.

The following segments require lyrics:
- [verse], [chorus], [bridge]

Currently supported segments are:

[verse]
[chorus]
[bridge]
[intro-short]
[intro-medium]
[intro-long]
[outro-short]
[outro-medium]
[outro-long]
[inst-short]
[inst-medium]
[inst-long]
[silence]

🧾 Lyrics Formatting Rules

Each section is separated by an empty line
Within lyrical segments ([verse], [chorus], [bridge]), lyrics must be written in complete sentences, and each sentence is one line.
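
These rules can be checked locally before submitting a request. The following is a rough Python sketch under stated assumptions: `validate_lyric` is a hypothetical helper, and it treats the `-long` variants and [silence] as lyric-free, which the text above does not state explicitly.

```python
import re

VOCAL = {"verse", "chorus", "bridge"}
# Assumption: the -long variants and [silence] carry no lyrics, like the
# -short and -medium instrumental labels.
INSTRUMENTAL = {
    "intro-short", "intro-medium", "intro-long",
    "outro-short", "outro-medium", "outro-long",
    "inst-short", "inst-medium", "inst-long",
    "silence",
}


def validate_lyric(lyric):
    """Return a list of problems found; an empty list means no problem was detected."""
    problems = []
    # Sections are separated by one or more empty lines.
    for section in re.split(r"\n\s*\n", lyric.strip()):
        lines = [ln.strip() for ln in section.splitlines() if ln.strip()]
        if not lines:
            continue
        match = re.fullmatch(r"\[([a-z-]+)\]", lines[0])
        if not match:
            problems.append(f"section does not start with a structure label: {lines[0]!r}")
            continue
        tag, body = match.group(1), lines[1:]
        if tag in VOCAL:
            if not body:
                problems.append(f"[{tag}] requires lyrics")
        elif tag in INSTRUMENTAL:
            if body:
                problems.append(f"[{tag}] should not contain lyrics")
        else:
            problems.append(f"unsupported structure label: [{tag}]")
    return problems


print(validate_lyric("[intro-short]\n\n[verse]\n\n[outro-short]"))
```

The check covers structure only; it does not judge whether each line is a complete sentence.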

📝 Description Input Format

The description field allows you to control various musical attributes of the generated song. It can describe up to six musical dimensions: Gender (e.g., male, female), Timbre (e.g., dark, bright, soft), Genre (e.g., pop, jazz, rock), Emotion (e.g., sad, energetic, romantic), Instrument (e.g., piano, drums, guitar), BPM (e.g., the bpm is 120).

All six dimensions are optional — you can specify any subset of them.
The order of dimensions is flexible.
Use commas (,) to separate different attributes.
Although the model supports open vocabulary, we recommend using predefined tags for more stable and reliable performance. A list of commonly supported tags for each dimension is available in sample descriptions.

Here are a few valid description inputs:

- female, dark, pop, sad, piano and drums, the bpm is 125.
- male, piano, jazz.
- male, dark, the bpm is 110.
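
A description string of this shape can be composed from whichever dimensions are known. This is only a sketch: `build_description` and its keyword names are illustrative assumptions, and the tag values are taken from the examples above.

```python
def build_description(gender=None, timbre=None, genre=None,
                      emotion=None, instrument=None, bpm=None):
    """Join the provided dimensions into a comma-separated description."""
    # Every dimension is optional, so skip the ones that were not given.
    tags = [value for value in (gender, timbre, genre, emotion, instrument) if value]
    if bpm is not None:
        # BPM uses the phrase form shown in the examples.
        tags.append(f"the bpm is {bpm}")
    return ", ".join(tags)


print(build_description(gender="female", timbre="dark", genre="pop",
                        emotion="sad", instrument="piano and drums", bpm=125))
```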

🎧 Prompt Audio Usage Notes

The input audio file can be longer than 10 seconds, but only the first 10 seconds will be used.
For best musicality and structure, it is recommended to use the chorus section of a song as the prompt audio.
You can use this field to influence genre, instrumentation, rhythm, and voice.
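
Because only the first 10 seconds are used, a long reference clip can be trimmed before it is hosted and passed as a URL. The sketch below is one possible approach using Python's standard `wave` module. It assumes the clip is an uncompressed WAV file, which this page does not require, and `trim_wav` is a hypothetical helper name.

```python
import wave


def trim_wav(src, dst, seconds=10):
    """Copy at most the first `seconds` of a WAV file from src to dst.

    src and dst may be file paths or binary file-like objects.
    """
    with wave.open(src, "rb") as reader:
        params = reader.getparams()
        # readframes returns fewer frames if the clip is shorter than requested.
        frames = reader.readframes(int(reader.getframerate() * seconds))
    with wave.open(dst, "wb") as writer:
        writer.setparams(params)
        # The frame count in the header is corrected when the frames are written.
        writer.writeframes(frames)


# Usage with hypothetical file names:
# trim_wav("chorus_full.wav", "chorus_10s.wav")
```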

Authentication

For authentication details, please refer to the Authentication Guide.

API Endpoints

Submit Task & Query Result


# Submit the task
curl --location --request POST "https://api.wavespeed.ai/api/v3/wavespeed-ai/song-generation" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer ${WAVESPEED_API_KEY}" \
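--data-raw '{
    "lyric": "[intro-short]\n\n[verse]\nStreetlights flicker in the night\nI wander through familiar corners\n\n[outro-short]",
    "genre": "Auto",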
    "guidance_scale": 1.5,
    "temperature": 0.9,
    "top_k": 50,
    "seed": -1
}'

# Get the result
curl --location --request GET "https://api.wavespeed.ai/api/v3/predictions/${requestId}/result" \
--header "Authorization: Bearer ${WAVESPEED_API_KEY}"
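
The same two calls can be sketched in Python using only the standard library. This is a rough outline rather than an official client. It assumes that the `requestId` in the result URL is the `data.id` returned on submission, that the result endpoint returns the same `data.status` field listed under Response Parameters, and that a 2-second polling interval is acceptable. The helper names are hypothetical.

```python
import json
import time
import urllib.request

API_BASE = "https://api.wavespeed.ai/api/v3"


def build_submit_request(api_key, payload):
    """Build the POST request that submits a song-generation task."""
    return urllib.request.Request(
        f"{API_BASE}/wavespeed-ai/song-generation",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def build_result_request(api_key, request_id):
    """Build the GET request that queries a task's result."""
    return urllib.request.Request(
        f"{API_BASE}/predictions/{request_id}/result",
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


def send(request):
    """Send a request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_result(api_key, request_id, send=send, interval=2.0, max_polls=150):
    """Poll the result endpoint until the task is completed or failed."""
    for _ in range(max_polls):
        data = send(build_result_request(api_key, request_id))["data"]
        if data["status"] in ("completed", "failed"):
            return data
        time.sleep(interval)
    raise TimeoutError(f"task {request_id} did not finish after {max_polls} polls")


# Usage (performs real network calls and needs a valid API key):
# task = send(build_submit_request(key, {"lyric": "[verse]\nHello", "genre": "Auto"}))
# result = wait_for_result(key, task["data"]["id"])
```

Passing `send` as a parameter keeps the polling loop testable without network access.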

Parameters

Task Submission Parameters

Request Parameters

Parameter	Type	Required	Default	Range	Description
lyric	string	Yes	-	-	Song lyrics. Each paragraph is a segment that starts with a structure tag and ends with a blank line; each line is one sentence, preferably without punctuation. The [intro], [inst], and [outro] segments should not contain lyrics, while [verse], [chorus], and [bridge] require lyrics.
description	string	No	-	-	Song Description (Optional). Describe the gender, timbre, genre, emotion, instrument and bpm of the song. Only English is supported currently.
prompt_audio	string	No	-	-	Prompt Audio (Optional). Provide a URL to an audio file that guides the genre of the generated song.
genre	string	No	Auto	Pop, R&B, Dance, Jazz, Folk, Rock, Chinese Style, Chinese Tradition, Metal, Reggae, Chinese Opera, Auto	Genre Select (Optional). Choose a genre for the song.
guidance_scale	number	No	1.5	0.1 ~ 3.0	The guidance scale to use for the generation.
temperature	number	No	0.9	0.1 ~ 2.0	The temperature to use for the generation. A higher value means more randomness in the output.
top_k	integer	No	50	1 ~ 100	The top-k value to use for the generation. This controls the number of highest probability vocabulary tokens to keep for top-k-filtering.
seed	integer	No	-1	-1 ~ 2147483647	The random seed to use for the generation. -1 means a random seed will be used.
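
The constraints in this table can be checked client-side before a request is sent. The sketch below assumes the listed ranges are inclusive at both ends, which the table does not state, and `check_params` is a hypothetical helper.

```python
GENRES = {"Pop", "R&B", "Dance", "Jazz", "Folk", "Rock", "Chinese Style",
          "Chinese Tradition", "Metal", "Reggae", "Chinese Opera", "Auto"}

# Assumption: both bounds of each range are allowed values.
RANGES = {
    "guidance_scale": (0.1, 3.0),
    "temperature": (0.1, 2.0),
    "top_k": (1, 100),
    "seed": (-1, 2147483647),
}


def check_params(params):
    """Return a list of violations of the documented constraints."""
    errors = []
    if not params.get("lyric"):
        errors.append("lyric is required")
    genre = params.get("genre", "Auto")
    if genre not in GENRES:
        errors.append(f"genre must be one of the documented values, got {genre!r}")
    for name, (low, high) in RANGES.items():
        # Only check parameters the caller actually supplied.
        if name in params and not low <= params[name] <= high:
            errors.append(f"{name} must be between {low} and {high}")
    return errors


print(check_params({"lyric": "[verse]\nHello", "top_k": 0}))
```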

Response Parameters

Parameter	Type	Description
code	integer	HTTP status code (e.g., 200 for success)
message	string	Status message (e.g., “success”)
data.id	string	Unique identifier for the prediction (the task ID)
data.model	string	Model ID used for the prediction
data.outputs	array	Array of URLs to the generated content (empty when status is not `completed`)
data.urls	object	Object containing related API endpoints
data.urls.get	string	URL to retrieve the prediction result
data.has_nsfw_contents	array	Array of boolean values indicating NSFW detection for each output
data.status	string	Status of the task: `created`, `processing`, `completed`, or `failed`
data.created_at	string	ISO timestamp of when the request was created (e.g., “2023-04-01T12:34:56.789Z”)
data.error	string	Error message (empty if no error occurred)
data.timings	object	Object containing timing details
data.timings.inference	integer	Inference time in milliseconds
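
A response with these fields could be handled as in the sketch below. The sample payload is invented for illustration (its ID and URL are placeholders, not real values), and `extract_outputs` is a hypothetical helper.

```python
def extract_outputs(response):
    """Return the output URLs of a completed task, or raise if the task failed."""
    data = response["data"]
    if data["status"] == "failed":
        raise RuntimeError(data.get("error") or "task failed")
    if data["status"] != "completed":
        # outputs is empty until the status is completed.
        return []
    return list(data["outputs"])


# Hypothetical response; every value here is a placeholder.
sample = {
    "code": 200,
    "message": "success",
    "data": {
        "id": "example-task-id",
        "status": "completed",
        "outputs": ["https://example.com/song.wav"],
        "error": "",
    },
}
print(extract_outputs(sample))
```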

Result Request Parameters
