---
library_name: transformers
license: apache-2.0
pipeline_tag: text-to-speech
---

# Soprano: Instant, Ultra-Realistic Text-to-Speech

---

## Overview

**Soprano** is an ultra-lightweight, open-source text-to-speech (TTS) model designed for real-time, high-fidelity speech synthesis at unprecedented speed, all while remaining compact and easy to deploy. With only **80M parameters**, Soprano achieves a real-time factor (RTF) of **~2000×**, generating **10 hours of audio in under 20 seconds**. Its **seamless streaming** technique enables true real-time synthesis in **<15 ms**, multiple orders of magnitude faster than existing TTS pipelines.

This repository contains the **model weights** for Soprano. The LLM uses a standard Qwen3 architecture, and the decoder is a Vocos model fine-tuned on the output hidden states of the LLM.

GitHub: https://github.com/ekwek1/soprano

Model Demo: https://huggingface.co/spaces/ekwek/Soprano-TTS

---

## Installation

**Requirements**: Linux or Windows; a CUDA-enabled GPU is required (CPU support coming soon).

### One-line install

```bash
pip install soprano-tts
```

### Install from source

```bash
git clone https://github.com/ekwek1/soprano.git
cd soprano
pip install -e .
```

> **Note**: Soprano uses **LMDeploy** to accelerate inference by default. If LMDeploy cannot be installed in your environment, Soprano can fall back to the HuggingFace **transformers** backend (with slower performance). To enable this, pass `backend='transformers'` when creating the TTS model, as shown below.
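For example, to create the model with the fallback backend:

```python
from soprano import SopranoTTS

# Use the HuggingFace transformers backend instead of LMDeploy
# (slower, but avoids the LMDeploy dependency).
model = SopranoTTS(backend='transformers')
```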
---

## Usage

```python
from soprano import SopranoTTS

model = SopranoTTS()
```

### Basic inference

```python
out = model.infer("Hello world!")
```

### Save output to a file

```python
out = model.infer("Hello world!", "out.wav")
```

### Custom sampling parameters

```python
out = model.infer(
    "Hello world!",
    temperature=0.3,
    top_p=0.95,
    repetition_penalty=1.2,
)
```

### Batched inference

```python
out = model.infer_batch(["Hello world!"] * 10)
```

#### Save batch outputs to a directory

```python
out = model.infer_batch(["Hello world!"] * 10, "/dir")
```

### Streaming inference

```python
import torch

stream = model.infer_stream("Hello world!", chunk_size=1)

# Audio chunks can be accessed via an iterator
chunks = []
for chunk in stream:
    chunks.append(chunk)
out = torch.cat(chunks)
```
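Streamed chunks can also be played as they are generated instead of being collected first. Below is a minimal sketch, assuming each chunk is a 1-D float torch tensor at Soprano's 32 kHz output rate and that the third-party `sounddevice` package is installed:

```python
import sounddevice as sd

SAMPLE_RATE = 32000  # Soprano's output sample rate (see Key Features below)

stream = model.infer_stream("Hello world!", chunk_size=1)

# Play each chunk as soon as it arrives. Assumes chunks are 1-D
# float tensors; adjust the conversion if your chunks differ.
with sd.OutputStream(samplerate=SAMPLE_RATE, channels=1, dtype="float32") as player:
    for chunk in stream:
        player.write(chunk.detach().cpu().numpy().astype("float32"))
```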
---

## Key Features

### 1. High-fidelity 32 kHz audio

Soprano synthesizes speech at **32 kHz**, delivering clarity that is perceptually indistinguishable from 44.1 kHz audio and significantly higher quality than the 24 kHz output used by many existing TTS models.

### 2. Vocos-based neural decoder

Instead of slow diffusion decoders, Soprano uses a **Vocos-based decoder**, enabling **orders-of-magnitude faster** waveform generation while maintaining comparable perceptual quality.

### 3. Seamless real-time streaming

Soprano leverages the decoder's finite receptive field to losslessly stream audio with **ultra-low latency**. The streamed output is acoustically identical to offline synthesis, enabling interactive applications with sub-frame delays.

### 4. State-of-the-art neural audio codec

Speech is represented using a **neural codec** that compresses audio to **~15 tokens/sec** at just **0.2 kbps**, allowing extremely fast generation and efficient memory usage without sacrificing quality.

### 5. Sentence-level streaming for infinite context

Each sentence is generated independently, enabling **effectively infinite generation length** while maintaining stability and real-time performance for long-form generation.

---

## License

This project is licensed under the **Apache-2.0** license.