Commit e39f54d by jhansss · 1 parent: d91c8af

Merge with main

LICENSE ADDED
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2025 SingingSDS Authors
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
README.md CHANGED
@@ -10,13 +10,28 @@ pinned: false
 ---
 # SingingSDS: Role-Playing Singing Spoken Dialogue System
 
-A role-playing singing dialogue system that converts speech input into character-based singing output.
+<div align="center">
 
-## Installation
+**A role-playing singing dialogue system that converts speech input into character-based singing output.**
+
+![Paper](https://img.shields.io/badge/Paper-Coming%20Soon-orange) [![Code](https://img.shields.io/badge/Code-GitHub-black)](https://github.com/SingingSDS/SingingSDS) [![HuggingFace Demo](https://img.shields.io/badge/🤗%20HuggingFace-Demo-yellow)](https://huggingface.co/spaces/espnet/SingingSDS) [![YouTube](https://img.shields.io/badge/YouTube-Playlist-red)](https://www.youtube.com/playlist?list=PLZpUJJbwp2WvtPBenG5D3h09qKIrt24ui)
+
+</div>
+
+## 📖 Overview
+
+SingingSDS is an innovative role-playing singing dialogue system that seamlessly converts natural speech input into character-based singing output. The system integrates automatic speech recognition (ASR), large language models (LLM), and singing voice synthesis (SVS) to create an immersive conversational singing experience.
+
+<div align="center">
+<img src="assets/demo.png" alt="SingingSDS Interface" style="max-width: 100%; height: auto;"/>
+<p><em>SingingSDS Web Interface: Interactive singing dialogue system with character visualization, audio I/O, evaluation metrics, and flexible configuration options.</em></p>
+</div>
+
+## 🚀 Installation
 
 ### Requirements
 
-- Python 3.11+
+- Python 3.10 or 3.11
 - CUDA (optional, for GPU acceleration)
 
 ### Install Dependencies
@@ -31,13 +46,36 @@ conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvi
 pip install -r requirements.txt
 ```
 
-#### Option 2: Using pip only
+#### Option 2: Using uv (Fast & Modern)
+
+First install uv:
+
+```bash
+# On macOS/Linux:
+curl -LsSf https://astral.sh/uv/install.sh | sh
+
+# On Windows:
+powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
+
+# Or via pip:
+pip install uv
+```
+
+Then install dependencies:
+
+```bash
+uv venv
+source .venv/bin/activate # On Windows: .venv\Scripts\activate
+uv pip install -r requirements.txt
+```
+
+#### Option 3: Using pip only
 
 ```bash
 pip install -r requirements.txt
 ```
 
-#### Option 3: Using pip with virtual environment
+#### Option 4: Using pip with virtual environment
 
 ```bash
 python -m venv singingsds_env
@@ -50,7 +88,7 @@ source singingsds_env/bin/activate
 pip install -r requirements.txt
 ```
 
-## Usage
+## 💻 Usage
 
 ### Command Line Interface (CLI)
 
@@ -81,8 +119,7 @@ python cli.py \
 - `--config_path`: Configuration file path (default: config/cli/yaoyin_default.yaml)
 - `--output_audio`: Output audio file path (required)
 
-
-### Web Interface (Gradio)
+### 🌐 Web Interface (Gradio)
 
 Start the web interface:
 
@@ -92,7 +129,9 @@ python app.py
 
 Then visit the displayed address in your browser to use the graphical interface.
 
-## Configuration
+> 💡 **Tip**: You can also try our [HuggingFace demo](https://huggingface.co/spaces/espnet/SingingSDS) for a quick test without local installation!
+
+## ⚙️ Configuration
 
 ### Character Configuration
 
@@ -104,26 +143,32 @@ The system supports multiple preset characters:
 ### Model Configuration
 
 #### ASR Models
-- `openai/whisper-large-v3-turbo`
-- `openai/whisper-large-v3`
-- `openai/whisper-medium`
-- `openai/whisper-small`
-- `funasr/paraformer-zh`
+| Model | Description |
+|-------|-------------|
+| `openai/whisper-large-v3-turbo` | Latest Whisper model with turbo optimization |
+| `openai/whisper-large-v3` | Large Whisper v3 model |
+| `openai/whisper-medium` | Medium-sized Whisper model |
+| `openai/whisper-small` | Small Whisper model |
+| `funasr/paraformer-zh` | Paraformer for Chinese ASR |
 
 #### LLM Models
-- `gemini-2.5-flash`
-- `google/gemma-2-2b`
-- `meta-llama/Llama-3.2-3B-Instruct`
-- `meta-llama/Llama-3.1-8B-Instruct`
-- `Qwen/Qwen3-8B`
-- `Qwen/Qwen3-30B-A3B`
-- `MiniMaxAI/MiniMax-Text-01`
+| Model | Description |
+|-------|-------------|
+| `gemini-2.5-flash` | Google Gemini 2.5 Flash |
+| `google/gemma-2-2b` | Google Gemma 2B model |
+| `meta-llama/Llama-3.2-3B-Instruct` | Meta Llama 3.2 3B Instruct |
+| `meta-llama/Llama-3.1-8B-Instruct` | Meta Llama 3.1 8B Instruct |
+| `Qwen/Qwen3-8B` | Qwen3 8B model |
+| `Qwen/Qwen3-30B-A3B` | Qwen3 30B A3B model |
+| `MiniMaxAI/MiniMax-Text-01` | MiniMax Text model |
 
 #### SVS Models
-- `espnet/mixdata_svs_visinger2_spkemb_lang_pretrained_avg` (Bilingual)
-- `espnet/aceopencpop_svs_visinger2_40singer_pretrain` (Chinese)
+| Model | Language Support |
+|------|------------------|
+| `espnet/visinger2-zh-jp-multisinger-svs` | Bilingual (Chinese & Japanese) |
+| `espnet/aceopencpop_svs_visinger2_40singer_pretrain` | Chinese |
 
-## Project Structure
+## 📁 Project Structure
 
 ```
 SingingSDS/
@@ -146,10 +191,30 @@ SingingSDS/
 └── README.md, requirements.txt
 ```
 
-## Contributing
+## 🤝 Contributing
+
+We welcome contributions! Please feel free to submit issues and pull requests.
+
+## 📄 License
+
+### Character Assets
+
+The Yaoyin (遥音) character assets, including [`character_yaoyin.png`](./assets/character_yaoyin.png) created by illustrator Zihe Zhou, are commissioned exclusively for the SingingSDS project. Screenshots of the system that include these assets, such as [`demo.png`](./assets/demo.png), are also covered under this license. The assets may be used only for direct derivatives of SingingSDS, such as project-related posts, usage videos, or other content directly depicting the project. Any other use requires express permission from the illustrator, and these assets may not be used for training, fine-tuning, or improving any artificial intelligence or machine learning models. For full license details, see [`assets/character_yaoyin.LICENSE`](./assets/character_yaoyin.LICENSE).
+
+### Code License
+
+All source code in this repository is licensed under the [MIT License](./LICENSE). This license applies **only to the code**. Character assets remain under their separate license and restrictions, as described in the **Character Assets** section.
+
+### Model License
+
+The models used in SingingSDS are subject to their respective licenses and terms of use. Users must comply with each model’s official license, which can be found at the respective model’s official repository or website.
+
+---
+
+<div align="center">
 
-Issues and Pull Requests are welcome!
+Paper (Coming soon) [Code](https://github.com/SingingSDS/SingingSDS) • [Demo](https://huggingface.co/spaces/espnet/SingingSDS) • [Video](https://www.youtube.com/playlist?list=PLZpUJJbwp2WvtPBenG5D3h09qKIrt24ui)
 
-## License
+</div>
 
 
assets/character_yaoyin.LICENSE ADDED
@@ -0,0 +1,58 @@
+YAOIN CHARACTER ASSETS LICENSE
+
+Copyright (c) 2025 Zihe Zhou
+All rights reserved.
+
+These Yaoyin (遥音) character assets, including character_yaoyin.png and any derivative works
+such as demo.png (system screenshots containing the character), were created by the illustrator
+Zihe Zhou and commissioned exclusively for the SingingSDS project (the "Project").
+
+1. DEFINITIONS
+1.1 "Licensed Material" means all Yaoyin character assets provided in this repository, including
+images, screenshots, and any derivative works.
+1.2 "Licensee" means any individual or entity using the Licensed Material under these terms.
+1.3 "Permitted Use" means use of the Licensed Material solely for direct derivatives of the Project
+(SingingSDS), including but not limited to:
+a. Project-related posts or social media content
+b. Demonstration or usage videos of SingingSDS
+c. Other content directly depicting or showcasing SingingSDS
+1.4 "Prohibited Use" means any use outside the scope of Permitted Use, including commercial
+exploitation, redistribution, or incorporation into unrelated projects.
+1.5 "AI/ML Training" means use of the Licensed Material for training, fine-tuning, or improving
+artificial intelligence or machine learning models, including but not limited to neural networks,
+diffusion models, or large language models, whether for commercial or non-commercial purposes.
+
+2. GRANT OF LICENSE
+The Licensor grants the Licensee a non-exclusive, non-transferable, revocable license to use
+the Licensed Material only for Permitted Use. No other rights are granted. This license does
+not transfer copyright, which remains with the Licensor.
+
+3. RESTRICTIONS
+Licensee shall not:
+3.1 Use the Licensed Material beyond Permitted Use without express written permission from the Licensor.
+3.2 Use the Licensed Material for AI/ML Training as defined in Section 1.5.
+3.3 Redistribute, sublicense, or otherwise transfer the Licensed Material to any third party except
+as part of a direct derivative of the Project.
+
+4. ATTRIBUTION
+All permitted uses of the Licensed Material must include the name of the Licensor (Zihe Zhou) as the creator.
+
+5. INTELLECTUAL PROPERTY
+5.1 All rights, titles, and interests in the Licensed Material, including copyrights, trademarks,
+and other intellectual property rights, remain with the Licensor.
+5.2 This license does not grant any rights to the Licensor’s trademarks, service marks, trade names,
+or branding beyond the explicit rights described herein.
+
+6. WARRANTY DISCLAIMER
+The Licensed Material is provided "as-is" without any warranty, express or implied, including but
+not limited to merchantability, fitness for a particular purpose, or non-infringement. The Licensor
+shall not be liable for any damages arising from the use or inability to use the Licensed Material.
+
+7. ENFORCEMENT
+Violation of these terms may result in immediate revocation of the license and potential legal action.
+Licensee acknowledges that monetary damages may be insufficient to remedy a breach and that the Licensor
+may seek injunctive relief as necessary.
+
+8. GOVERNING LAW
+These License Terms shall be governed by and construed in accordance with the laws of the jurisdiction
+of the Licensor’s primary residence or institutional affiliation.
config/cli/yaoyin_test.yaml CHANGED
@@ -1,6 +1,6 @@
 asr_model: openai/whisper-medium
 llm_model: meta-llama/Llama-3.1-8B-Instruct
-svs_model: espnet/mixdata_svs_visinger2_spkemb_lang_pretrained_avg
+svs_model: espnet/visinger2-zh-jp-multisinger-svs
 melody_source: sample-lyric-kising
 language: mandarin
 prompt_template_character: Yaoyin
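
Reviewer note: the only change to this file is the `svs_model` ID. The sketch below shows one way to confirm the new value, assuming the file stays a flat list of `key: value` pairs as shown in the hunk. `parse_flat_config` is a hypothetical helper written for this illustration, not something from the repository, and a real loader would more likely use a YAML library.

```python
def parse_flat_config(text: str) -> dict[str, str]:
    """Parse flat `key: value` lines; nesting and lists are not supported."""
    config: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition(":")
        config[key.strip()] = value.strip()
    return config


# Contents of config/cli/yaoyin_test.yaml after this commit, copied from the diff.
YAOYIN_TEST = """\
asr_model: openai/whisper-medium
llm_model: meta-llama/Llama-3.1-8B-Instruct
svs_model: espnet/visinger2-zh-jp-multisinger-svs
melody_source: sample-lyric-kising
language: mandarin
prompt_template_character: Yaoyin
"""

config = parse_flat_config(YAOYIN_TEST)
print(config["svs_model"])
```

Splitting on the first colon only keeps values that themselves contain a colon intact.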
config/interface/options.yaml CHANGED
@@ -25,24 +25,24 @@ llm_models:
   name: Qwen3 30B A3B
 
 svs_models:
-- id: mandarin-espnet/mixdata_svs_visinger2_spkemb_lang_pretrained_avg
+- id: mandarin-espnet/visinger2-zh-jp-multisinger-svs
   name: Visinger2 (Bilingual)-zh
-  model_path: espnet/mixdata_svs_visinger2_spkemb_lang_pretrained_avg
+  model_path: espnet/visinger2-zh-jp-multisinger-svs
   lang: mandarin
   voices:
     voice1: resources/singer/singer_embedding_ace-2.npy
     voice2: resources/singer/singer_embedding_ace-8.npy
-    voice3: resources/singer/singer_embedding_itako.npy
+    voice3: resources/singer/singer_embedding_ace-27.npy
     voice4: resources/singer/singer_embedding_kising_orange.npy
     voice5: resources/singer/singer_embedding_m4singer_Alto-4.npy
-- id: japanese-espnet/mixdata_svs_visinger2_spkemb_lang_pretrained_avg
+- id: japanese-espnet/visinger2-zh-jp-multisinger-svs
   name: Visinger2 (Bilingual)-jp
-  model_path: espnet/mixdata_svs_visinger2_spkemb_lang_pretrained_avg
+  model_path: espnet/visinger2-zh-jp-multisinger-svs
   lang: japanese
   voices:
     voice1: resources/singer/singer_embedding_ace-2.npy
     voice2: resources/singer/singer_embedding_ace-8.npy
-    voice3: resources/singer/singer_embedding_itako.npy
+    voice3: resources/singer/singer_embedding_ace-27.npy
     voice4: resources/singer/singer_embedding_kising_orange.npy
     voice5: resources/singer/singer_embedding_m4singer_Alto-4.npy
 - id: mandarin-espnet/aceopencpop_svs_visinger2_40singer_pretrain
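
Reviewer note: this commit also deletes three singer embeddings (see the DELETED entries at the end of the diff), so it is worth checking that no `voices` entry still points at a removed file. A minimal sketch of that check follows. The voice dictionaries are copied from the hunk above. `dangling_voices` is a hypothetical helper for illustration only, and the rest of `options.yaml` (not shown in this diff) would need the same check.

```python
# Files removed in this commit, taken from the DELETED entries of the diff.
DELETED_EMBEDDINGS = {
    "resources/singer/singer_embedding_ameboshi.npy",
    "resources/singer/singer_embedding_itako.npy",
    "resources/singer/singer_embedding_ofuton.npy",
}

# Voices of the bilingual Visinger2 entries before and after the change.
VOICES_BEFORE = {
    "voice1": "resources/singer/singer_embedding_ace-2.npy",
    "voice2": "resources/singer/singer_embedding_ace-8.npy",
    "voice3": "resources/singer/singer_embedding_itako.npy",
    "voice4": "resources/singer/singer_embedding_kising_orange.npy",
    "voice5": "resources/singer/singer_embedding_m4singer_Alto-4.npy",
}
VOICES_AFTER = {**VOICES_BEFORE, "voice3": "resources/singer/singer_embedding_ace-27.npy"}


def dangling_voices(voices: dict[str, str], deleted: set[str]) -> list[str]:
    """Return the voice names whose embedding path was deleted."""
    return sorted(name for name, path in voices.items() if path in deleted)


print(dangling_voices(VOICES_BEFORE, DELETED_EMBEDDINGS))
print(dangling_voices(VOICES_AFTER, DELETED_EMBEDDINGS))
```

With the updated `voice3` path, the bilingual entries shown in this hunk no longer reference any deleted embedding.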
modules/svs/espnet.py CHANGED
@@ -39,6 +39,7 @@ class ESPNetSVS(AbstractSVSModel):
 elif self.model_id in [
     "espnet/mixdata_svs_visinger2_spkemb_lang_pretrained",
     "espnet/mixdata_svs_visinger2_spkemb_lang_pretrained_avg",
+    "espnet/visinger2-zh-jp-multisinger-svs",
 ]:
 
     def mandarin_mapper(pinyin: str) -> list[str]:
@@ -123,6 +124,7 @@ class ESPNetSVS(AbstractSVSModel):
 elif self.model_id in [
     "espnet/mixdata_svs_visinger2_spkemb_lang_pretrained",
     "espnet/mixdata_svs_visinger2_spkemb_lang_pretrained_avg",
+    "espnet/visinger2-zh-jp-multisinger-svs",
 ]:
     langs = {
         "mandarin": 2,
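
Reviewer note: the same three-element list of model IDs now appears in two branches, so adding a checkpoint means editing both. One possible follow-up, sketched here as a suggestion rather than as what the repository does, is to keep the IDs in a single module-level constant. The names `BILINGUAL_VISINGER2_IDS` and `is_bilingual_visinger2` are hypothetical.

```python
# Model IDs that share the bilingual (Mandarin/Japanese) Visinger2 code path,
# copied from the two `elif self.model_id in [...]` branches in the diff.
BILINGUAL_VISINGER2_IDS = frozenset(
    {
        "espnet/mixdata_svs_visinger2_spkemb_lang_pretrained",
        "espnet/mixdata_svs_visinger2_spkemb_lang_pretrained_avg",
        "espnet/visinger2-zh-jp-multisinger-svs",
    }
)


def is_bilingual_visinger2(model_id: str) -> bool:
    """Return True if the model ID uses the bilingual Visinger2 handling."""
    return model_id in BILINGUAL_VISINGER2_IDS


print(is_bilingual_visinger2("espnet/visinger2-zh-jp-multisinger-svs"))
```

Both branches could then test `is_bilingual_visinger2(self.model_id)`, and a future rename would touch one place instead of two.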
resources/singer/singer_embedding_ameboshi.npy DELETED
Binary file (896 Bytes)
 
resources/singer/singer_embedding_itako.npy DELETED
Binary file (896 Bytes)
 
resources/singer/singer_embedding_ofuton.npy DELETED
Binary file (896 Bytes)