Spaces: espnet/SingingSDS

jhansss committed 16 days ago

Commit e39f54d (1 parent: d91c8af)

Merge with main


Files changed (9)

LICENSE +21 -0
README.md +92 -27
assets/character_yaoyin.LICENSE +58 -0
config/cli/yaoyin_test.yaml +1 -1
config/interface/options.yaml +6 -6
modules/svs/espnet.py +2 -0
resources/singer/singer_embedding_ameboshi.npy +0 -0
resources/singer/singer_embedding_itako.npy +0 -0
resources/singer/singer_embedding_ofuton.npy +0 -0

LICENSE ADDED

	@@ -0,0 +1,21 @@

+MIT License
+Copyright (c) 2025 SingingSDS Authors
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

README.md CHANGED

@@ -10,13 +10,28 @@ pinned: false
 ---
 # SingingSDS: Role-Playing Singing Spoken Dialogue System
-A role-playing singing dialogue system that converts speech input into character-based singing output.
-## Installation
 ### Requirements
-- Python 3.11+
 - CUDA (optional, for GPU acceleration)
 ### Install Dependencies
@@ -31,13 +46,36 @@ conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvi
 pip install -r requirements.txt
 ```
-#### Option 2: Using pip only
 ```bash
 pip install -r requirements.txt
 ```
-#### Option 3: Using pip with virtual environment
 ```bash
 python -m venv singingsds_env
@@ -50,7 +88,7 @@ source singingsds_env/bin/activate
 pip install -r requirements.txt
 ```
-## Usage
 ### Command Line Interface (CLI)
@@ -81,8 +119,7 @@ python cli.py \
 - `--config_path`: Configuration file path (default: config/cli/yaoyin_default.yaml)
 - `--output_audio`: Output audio file path (required)
-### Web Interface (Gradio)
 Start the web interface:
@@ -92,7 +129,9 @@ python app.py
 Then visit the displayed address in your browser to use the graphical interface.
-## Configuration
 ### Character Configuration
@@ -104,26 +143,32 @@ The system supports multiple preset characters:
 ### Model Configuration
 #### ASR Models
-- `openai/whisper-large-v3-turbo`
-- `openai/whisper-large-v3`
-- `openai/whisper-medium`
-- `openai/whisper-small`
-- `funasr/paraformer-zh`
 #### LLM Models
-- `gemini-2.5-flash`
-- `google/gemma-2-2b`
-- `meta-llama/Llama-3.2-3B-Instruct`
-- `meta-llama/Llama-3.1-8B-Instruct`
-- `Qwen/Qwen3-8B`
-- `Qwen/Qwen3-30B-A3B`
-- `MiniMaxAI/MiniMax-Text-01`
 #### SVS Models
-- `espnet/mixdata_svs_visinger2_spkemb_lang_pretrained_avg` (Bilingual)
-- `espnet/aceopencpop_svs_visinger2_40singer_pretrain` (Chinese)
-## Project Structure
 ```
 SingingSDS/
@@ -146,10 +191,30 @@ SingingSDS/
 └── README.md, requirements.txt
 ```
-## Contributing
-Issues and Pull Requests are welcome!
-## License

 ---
 # SingingSDS: Role-Playing Singing Spoken Dialogue System
+<div align="center">
+**A role-playing singing dialogue system that converts speech input into character-based singing output.**
+![Paper](https://img.shields.io/badge/Paper-Coming%20Soon-orange) [![Code](https://img.shields.io/badge/Code-GitHub-black)](https://github.com/SingingSDS/SingingSDS) [![HuggingFace Demo](https://img.shields.io/badge/🤗%20HuggingFace-Demo-yellow)](https://huggingface.co/spaces/espnet/SingingSDS) [![YouTube](https://img.shields.io/badge/YouTube-Playlist-red)](https://www.youtube.com/playlist?list=PLZpUJJbwp2WvtPBenG5D3h09qKIrt24ui)
+</div>
+## 📖 Overview
+SingingSDS is an innovative role-playing singing dialogue system that seamlessly converts natural speech input into character-based singing output. The system integrates automatic speech recognition (ASR), large language models (LLM), and singing voice synthesis (SVS) to create an immersive conversational singing experience.
+<div align="center">
+  <img src="assets/demo.png" alt="SingingSDS Interface" style="max-width: 100%; height: auto;"/>
+  <p><em>SingingSDS Web Interface: Interactive singing dialogue system with character visualization, audio I/O, evaluation metrics, and flexible configuration options.</em></p>
+</div>
+## 🚀 Installation
 ### Requirements
+- Python 3.10 or 3.11
 - CUDA (optional, for GPU acceleration)
 ### Install Dependencies
 pip install -r requirements.txt
 ```
+#### Option 2: Using uv (Fast & Modern)
+First install uv:
+```bash
+# On macOS/Linux:
+curl -LsSf https://astral.sh/uv/install.sh | sh
+# On Windows:
+powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
+# Or via pip:
+pip install uv
+```
+Then install dependencies:
+```bash
+uv venv
+source .venv/bin/activate  # On Windows: .venv\Scripts\activate
+uv pip install -r requirements.txt
+```
+#### Option 3: Using pip only
 ```bash
 pip install -r requirements.txt
 ```
+#### Option 4: Using pip with virtual environment
 ```bash
 python -m venv singingsds_env
 pip install -r requirements.txt
 ```
+## 💻 Usage
 ### Command Line Interface (CLI)
 - `--config_path`: Configuration file path (default: config/cli/yaoyin_default.yaml)
 - `--output_audio`: Output audio file path (required)
+### 🌐 Web Interface (Gradio)
 Start the web interface:
 Then visit the displayed address in your browser to use the graphical interface.
+> 💡 **Tip**: You can also try our [HuggingFace demo](https://huggingface.co/spaces/espnet/SingingSDS) for a quick test without local installation!
+## ⚙️ Configuration
 ### Character Configuration
 ### Model Configuration
 #### ASR Models
+| Model | Description |
+|-------|-------------|
+| `openai/whisper-large-v3-turbo` | Latest Whisper model with turbo optimization |
+| `openai/whisper-large-v3` | Large Whisper v3 model |
+| `openai/whisper-medium` | Medium-sized Whisper model |
+| `openai/whisper-small` | Small Whisper model |
+| `funasr/paraformer-zh` | Paraformer for Chinese ASR |
 #### LLM Models
+| Model | Description |
+|-------|-------------|
+| `gemini-2.5-flash` | Google Gemini 2.5 Flash |
+| `google/gemma-2-2b` | Google Gemma 2B model |
+| `meta-llama/Llama-3.2-3B-Instruct` | Meta Llama 3.2 3B Instruct |
+| `meta-llama/Llama-3.1-8B-Instruct` | Meta Llama 3.1 8B Instruct |
+| `Qwen/Qwen3-8B` | Qwen3 8B model |
+| `Qwen/Qwen3-30B-A3B` | Qwen3 30B A3B model |
+| `MiniMaxAI/MiniMax-Text-01` | MiniMax Text model |
 #### SVS Models
+| Model | Language Support |
+|------|------------------|
+| `espnet/visinger2-zh-jp-multisinger-svs` | Bilingual (Chinese & Japanese) |
+| `espnet/aceopencpop_svs_visinger2_40singer_pretrain` | Chinese |
+## 📁 Project Structure
 ```
 SingingSDS/
 └── README.md, requirements.txt
 ```
+## 🤝 Contributing
+We welcome contributions! Please feel free to submit issues and pull requests.
+## 📄 License
+### Character Assets
+The Yaoyin (遥音) character assets, including [`character_yaoyin.png`](./assets/character_yaoyin.png) created by illustrator Zihe Zhou, are commissioned exclusively for the SingingSDS project. Screenshots of the system that include these assets, such as [`demo.png`](./assets/demo.png), are also covered under this license. The assets may be used only for direct derivatives of SingingSDS, such as project-related posts, usage videos, or other content directly depicting the project. Any other use requires express permission from the illustrator, and these assets may not be used for training, fine-tuning, or improving any artificial intelligence or machine learning models. For full license details, see [`assets/character_yaoyin.LICENSE`](./assets/character_yaoyin.LICENSE).
+### Code License
+All source code in this repository is licensed under the [MIT License](./LICENSE). This license applies **only to the code**. Character assets remain under their separate license and restrictions, as described in the **Character Assets** section.
+### Model License
+The models used in SingingSDS are subject to their respective licenses and terms of use. Users must comply with each model’s official license, which can be found at the respective model’s official repository or website.
+---
+<div align="center">
+Paper (Coming soon) • [Code](https://github.com/SingingSDS/SingingSDS) • [Demo](https://huggingface.co/spaces/espnet/SingingSDS) • [Video](https://www.youtube.com/playlist?list=PLZpUJJbwp2WvtPBenG5D3h09qKIrt24ui)
+</div>
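
The README hunk above narrows the stated requirement from "Python 3.11+" to "Python 3.10 or 3.11". A minimal sketch of how a launcher script could enforce that range is shown below. The names `SUPPORTED_VERSIONS` and `is_supported_python` are hypothetical and do not come from the repository; the diff does not show whether the project performs any such check.

```python
import sys

# Assumption: "Python 3.10 or 3.11" in the README means exactly these two
# minor versions. This set is illustrative and not taken from the repository.
SUPPORTED_VERSIONS = {(3, 10), (3, 11)}


def is_supported_python(version_info=sys.version_info) -> bool:
    """Return True when the (major, minor) pair is one the README lists."""
    return tuple(version_info[:2]) in SUPPORTED_VERSIONS


if not is_supported_python():
    # Warn instead of exiting so the sketch stays harmless on other versions.
    print(
        f"Warning: Python {sys.version_info.major}.{sys.version_info.minor} "
        "is outside the range listed in the README (3.10 or 3.11)."
    )
```

Checking an explicit set of minor versions, rather than a lower bound, matches the README's new wording, which no longer promises support for releases after 3.11.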

assets/character_yaoyin.LICENSE ADDED

	@@ -0,0 +1,58 @@

+YAOIN CHARACTER ASSETS LICENSE
+Copyright (c) 2025 Zihe Zhou
+All rights reserved.
+These Yaoyin (遥音) character assets, including character_yaoyin.png and any derivative works
+such as demo.png (system screenshots containing the character), were created by the illustrator
+Zihe Zhou and commissioned exclusively for the SingingSDS project (the "Project").
+1. DEFINITIONS
+1.1 "Licensed Material" means all Yaoyin character assets provided in this repository, including
+    images, screenshots, and any derivative works.
+1.2 "Licensee" means any individual or entity using the Licensed Material under these terms.
+1.3 "Permitted Use" means use of the Licensed Material solely for direct derivatives of the Project
+    (SingingSDS), including but not limited to:
+    a. Project-related posts or social media content
+    b. Demonstration or usage videos of SingingSDS
+    c. Other content directly depicting or showcasing SingingSDS
+1.4 "Prohibited Use" means any use outside the scope of Permitted Use, including commercial
+    exploitation, redistribution, or incorporation into unrelated projects.
+1.5 "AI/ML Training" means use of the Licensed Material for training, fine-tuning, or improving
+    artificial intelligence or machine learning models, including but not limited to neural networks,
+    diffusion models, or large language models, whether for commercial or non-commercial purposes.
+2. GRANT OF LICENSE
+The Licensor grants the Licensee a non-exclusive, non-transferable, revocable license to use
+the Licensed Material only for Permitted Use. No other rights are granted. This license does
+not transfer copyright, which remains with the Licensor.
+3. RESTRICTIONS
+Licensee shall not:
+3.1 Use the Licensed Material beyond Permitted Use without express written permission from the Licensor.
+3.2 Use the Licensed Material for AI/ML Training as defined in Section 1.5.
+3.3 Redistribute, sublicense, or otherwise transfer the Licensed Material to any third party except
+    as part of a direct derivative of the Project.
+4. ATTRIBUTION
+All permitted uses of the Licensed Material must include the name of the Licensor (Zihe Zhou) as the creator.
+5. INTELLECTUAL PROPERTY
+5.1 All rights, titles, and interests in the Licensed Material, including copyrights, trademarks,
+    and other intellectual property rights, remain with the Licensor.
+5.2 This license does not grant any rights to the Licensor’s trademarks, service marks, trade names,
+    or branding beyond the explicit rights described herein.
+6. WARRANTY DISCLAIMER
+The Licensed Material is provided "as-is" without any warranty, express or implied, including but
+not limited to merchantability, fitness for a particular purpose, or non-infringement. The Licensor
+shall not be liable for any damages arising from the use or inability to use the Licensed Material.
+7. ENFORCEMENT
+Violation of these terms may result in immediate revocation of the license and potential legal action.
+Licensee acknowledges that monetary damages may be insufficient to remedy a breach and that the Licensor
+may seek injunctive relief as necessary.
+8. GOVERNING LAW
+These License Terms shall be governed by and construed in accordance with the laws of the jurisdiction
+of the Licensor’s primary residence or institutional affiliation.

config/cli/yaoyin_test.yaml CHANGED

@@ -1,6 +1,6 @@
 asr_model: openai/whisper-medium
 llm_model: meta-llama/Llama-3.1-8B-Instruct
-svs_model: espnet/mixdata_svs_visinger2_spkemb_lang_pretrained_avg
 melody_source: sample-lyric-kising
 language: mandarin
 prompt_template_character: Yaoyin

 asr_model: openai/whisper-medium
 llm_model: meta-llama/Llama-3.1-8B-Instruct
+svs_model: espnet/visinger2-zh-jp-multisinger-svs
 melody_source: sample-lyric-kising
 language: mandarin
 prompt_template_character: Yaoyin
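
The only change in this file is the `svs_model` value. As a rough illustration, the sketch below parses the six lines visible in the hunk and checks whether the config still points at the previous model ID. It is a standalone example, not the project's loader: `parse_flat_config` and `uses_previous_svs_id` are hypothetical names, and the parser handles only flat `key: value` lines (a real loader would more likely use a YAML library).

```python
# The six lines shown on the new side of the hunk; the full file may be longer.
CONFIG_TEXT = """\
asr_model: openai/whisper-medium
llm_model: meta-llama/Llama-3.1-8B-Instruct
svs_model: espnet/visinger2-zh-jp-multisinger-svs
melody_source: sample-lyric-kising
language: mandarin
prompt_template_character: Yaoyin
"""

# Model ID from the removed line in this diff.
PREVIOUS_SVS_ID = "espnet/mixdata_svs_visinger2_spkemb_lang_pretrained_avg"


def parse_flat_config(text: str) -> dict[str, str]:
    """Parse flat 'key: value' lines; nested YAML is out of scope here."""
    config: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only, so values may contain colons.
        key, _, value = line.partition(":")
        config[key.strip()] = value.strip()
    return config


def uses_previous_svs_id(config: dict[str, str]) -> bool:
    """Return True if the config still names the pre-merge SVS model ID."""
    return config.get("svs_model") == PREVIOUS_SVS_ID


config = parse_flat_config(CONFIG_TEXT)
print(config["svs_model"])
print(uses_previous_svs_id(config))
```

A check like this could be pointed at other config files to find any that were not updated in the same commit.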

config/interface/options.yaml CHANGED

@@ -25,24 +25,24 @@ llm_models:
     name: Qwen3 30B A3B
 svs_models:
-  - id: mandarin-espnet/mixdata_svs_visinger2_spkemb_lang_pretrained_avg
     name: Visinger2 (Bilingual)-zh
-    model_path: espnet/mixdata_svs_visinger2_spkemb_lang_pretrained_avg
     lang: mandarin
     voices:
       voice1: resources/singer/singer_embedding_ace-2.npy
       voice2: resources/singer/singer_embedding_ace-8.npy
-      voice3: resources/singer/singer_embedding_itako.npy
       voice4: resources/singer/singer_embedding_kising_orange.npy
       voice5: resources/singer/singer_embedding_m4singer_Alto-4.npy
-  - id: japanese-espnet/mixdata_svs_visinger2_spkemb_lang_pretrained_avg
     name: Visinger2 (Bilingual)-jp
-    model_path: espnet/mixdata_svs_visinger2_spkemb_lang_pretrained_avg
     lang: japanese
     voices:
       voice1: resources/singer/singer_embedding_ace-2.npy
       voice2: resources/singer/singer_embedding_ace-8.npy
-      voice3: resources/singer/singer_embedding_itako.npy
       voice4: resources/singer/singer_embedding_kising_orange.npy
       voice5: resources/singer/singer_embedding_m4singer_Alto-4.npy
   - id: mandarin-espnet/aceopencpop_svs_visinger2_40singer_pretrain

     name: Qwen3 30B A3B
 svs_models:
+  - id: mandarin-espnet/visinger2-zh-jp-multisinger-svs
     name: Visinger2 (Bilingual)-zh
+    model_path: espnet/visinger2-zh-jp-multisinger-svs
     lang: mandarin
     voices:
       voice1: resources/singer/singer_embedding_ace-2.npy
       voice2: resources/singer/singer_embedding_ace-8.npy
+      voice3: resources/singer/singer_embedding_ace-27.npy
       voice4: resources/singer/singer_embedding_kising_orange.npy
       voice5: resources/singer/singer_embedding_m4singer_Alto-4.npy
+  - id: japanese-espnet/visinger2-zh-jp-multisinger-svs
     name: Visinger2 (Bilingual)-jp
+    model_path: espnet/visinger2-zh-jp-multisinger-svs
     lang: japanese
     voices:
       voice1: resources/singer/singer_embedding_ace-2.npy
       voice2: resources/singer/singer_embedding_ace-8.npy
+      voice3: resources/singer/singer_embedding_ace-27.npy
       voice4: resources/singer/singer_embedding_kising_orange.npy
       voice5: resources/singer/singer_embedding_m4singer_Alto-4.npy
   - id: mandarin-espnet/aceopencpop_svs_visinger2_40singer_pretrain
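
This hunk repoints `voice3` from `singer_embedding_itako.npy`, which the commit deletes, to `singer_embedding_ace-27.npy`. That file is not among the nine changed files, so it is presumably already in the repository. The sketch below shows one way to confirm that every referenced embedding exists on disk. The helper `missing_voices` is hypothetical, and the voice table is copied by hand from the new side of the diff instead of being read from `options.yaml`.

```python
import tempfile
from pathlib import Path

# Voice table from the new side of this diff; both bilingual entries
# (mandarin and japanese) list the same five files.
VOICES = {
    "voice1": "resources/singer/singer_embedding_ace-2.npy",
    "voice2": "resources/singer/singer_embedding_ace-8.npy",
    "voice3": "resources/singer/singer_embedding_ace-27.npy",
    "voice4": "resources/singer/singer_embedding_kising_orange.npy",
    "voice5": "resources/singer/singer_embedding_m4singer_Alto-4.npy",
}


def missing_voices(voices: dict[str, str], repo_root: Path) -> list[str]:
    """Return the voice keys whose embedding file is absent under repo_root."""
    return [
        key
        for key, relative_path in voices.items()
        if not (repo_root / relative_path).is_file()
    ]


# Demo against a throwaway directory tree, not the real repository.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for key, relative_path in VOICES.items():
        if key == "voice3":
            continue  # simulate a checkout where the new file is absent
        target = root / relative_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(b"")
    print(missing_voices(VOICES, root))  # prints ['voice3']
```

Run against a real checkout, `missing_voices(VOICES, Path("."))` should return an empty list if every referenced embedding is present.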

modules/svs/espnet.py CHANGED

@@ -39,6 +39,7 @@ class ESPNetSVS(AbstractSVSModel):
         elif self.model_id in [
             "espnet/mixdata_svs_visinger2_spkemb_lang_pretrained",
             "espnet/mixdata_svs_visinger2_spkemb_lang_pretrained_avg",
         ]:
             def mandarin_mapper(pinyin: str) -> list[str]:
@@ -123,6 +124,7 @@ class ESPNetSVS(AbstractSVSModel):
         elif self.model_id in [
             "espnet/mixdata_svs_visinger2_spkemb_lang_pretrained",
             "espnet/mixdata_svs_visinger2_spkemb_lang_pretrained_avg",
         ]:
             langs = {
                 "mandarin": 2,

         elif self.model_id in [
             "espnet/mixdata_svs_visinger2_spkemb_lang_pretrained",
             "espnet/mixdata_svs_visinger2_spkemb_lang_pretrained_avg",
+            "espnet/visinger2-zh-jp-multisinger-svs",
         ]:
             def mandarin_mapper(pinyin: str) -> list[str]:
         elif self.model_id in [
             "espnet/mixdata_svs_visinger2_spkemb_lang_pretrained",
             "espnet/mixdata_svs_visinger2_spkemb_lang_pretrained_avg",
+            "espnet/visinger2-zh-jp-multisinger-svs",
         ]:
             langs = {
                 "mandarin": 2,
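
Both hunks add the same model ID to two separate, identical lists. One possible follow-up, sketched below, keeps the IDs in a single constant so a future rename touches one place. This is a suggestion, not code from the repository: `BILINGUAL_VISINGER2_IDS` and `is_bilingual_visinger2` are hypothetical names.

```python
# The three IDs accepted by both branches after this commit.
BILINGUAL_VISINGER2_IDS = frozenset(
    {
        "espnet/mixdata_svs_visinger2_spkemb_lang_pretrained",
        "espnet/mixdata_svs_visinger2_spkemb_lang_pretrained_avg",
        "espnet/visinger2-zh-jp-multisinger-svs",
    }
)


def is_bilingual_visinger2(model_id: str) -> bool:
    """Return True for any ID that should take the bilingual code path."""
    return model_id in BILINGUAL_VISINGER2_IDS


# Each `elif self.model_id in [...]` branch could then read:
#     elif is_bilingual_visinger2(self.model_id):
print(is_bilingual_visinger2("espnet/visinger2-zh-jp-multisinger-svs"))
```

Keeping the older IDs in the set preserves the behavior shown in the diff, where configs that still reference them continue to match.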

resources/singer/singer_embedding_ameboshi.npy DELETED

Binary file (896 Bytes)

resources/singer/singer_embedding_itako.npy DELETED

Binary file (896 Bytes)

resources/singer/singer_embedding_ofuton.npy DELETED

Binary file (896 Bytes)