UniVoice: Unifying Autoregressive ASR and Flow-Matching based TTS with Large Language Models
Paper: arXiv:2510.04593
This repository contains UniVoice, a unified Large Language Model (LLM) framework that seamlessly integrates Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) within a single model through continuous representations. It combines autoregressive modeling for speech recognition with flow matching for high-quality generation, and enables high-fidelity zero-shot voice cloning.
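For readers new to flow matching, the toy sketch below illustrates the generic idea behind the generation side: a model is trained to predict the velocity that moves a noise sample toward a data sample, and sampling integrates that velocity from t=0 to t=1. It assumes the common linear (straight-line) path and a plain Euler solver; the exact path, solver, and conditioning used by UniVoice are not stated here, and the helper names `flow_matching_pair` and `euler_sample` are hypothetical.

```python
import random


def flow_matching_pair(x0, x1, t):
    """Return (x_t, target_velocity) on a straight path from noise x0 to data x1.

    Assumption: linear interpolation path, x_t = (1 - t) * x0 + t * x1,
    whose velocity is the constant x1 - x0.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity


def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy data: three numbers standing in for one speech feature frame (made-up values).
rng = random.Random(0)
x1 = [0.5, -1.0, 2.0]
x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise sample

# Training-time quantities: the interpolated point and the regression target.
x_mid, v_target = flow_matching_pair(x0, x1, t=0.5)

# Sampling with the exact velocity in place of a trained network recovers x1.
x_end = euler_sample(lambda x, t: v_target, x0, steps=10)
```

In a real system the lambda would be replaced by a trained network that predicts the velocity from the current state, the time step, and conditioning such as text and a reference voice.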
With Python >= 3.10, install the required dependencies by running the following commands:
git clone https://github.com/gwh22/UniVoice
cd UniVoice
# We recommend using conda to create a new environment.
conda create -n UniVoice python=3.10
conda activate UniVoice
# install cuda >= 11.8
conda install cudatoolkit=11.8 -c nvidia
pip install -r requirements.txt
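After installation, a quick sanity check along the following lines may help confirm the environment before running the scripts. This is a minimal sketch, not part of the repository: `check_environment` is a hypothetical helper, and it assumes that `requirements.txt` installs PyTorch under the module name `torch`.

```python
import importlib.util
import sys


def check_environment(min_python=(3, 10)):
    """Report whether the interpreter and (assumed) PyTorch install look usable."""
    report = {
        # The README asks for Python >= 3.10.
        "python_ok": sys.version_info >= min_python,
        # Assumption: PyTorch is among the installed requirements.
        "torch_installed": importlib.util.find_spec("torch") is not None,
        # Stays None when PyTorch is missing or fails to import.
        "cuda_available": None,
    }
    if report["torch_installed"]:
        try:
            import torch

            report["cuda_available"] = bool(torch.cuda.is_available())
        except Exception:
            # A broken install should not crash the check itself.
            report["cuda_available"] = None
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If `python_ok` is false or `cuda_available` is not true, revisit the conda environment and CUDA steps above before launching inference or training.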
cd UniVoice
# for ASR task
sh scripts/infer_asr.sh
# for TTS task
sh scripts/infer_tts.sh
cd UniVoice
sh scripts/train_all.sh
Our code is released under the MIT License. If our work and codebase are useful to you, please cite:
@article{guan2025univoice,
title={UniVoice: Unifying Autoregressive ASR and Flow-Matching based TTS with Large Language Models},
author={Guan, Wenhao and Niu, Zhikang and Jiang, Ziyue and Wang, Kaidi and Chen, Peijie and Hong, Qingyang and Li, Lin and Chen, Xie},
journal={arXiv preprint arXiv:2510.04593},
year={2025}
}
This codebase borrows from DiT, SmolLM2-360M, F5-TTS, Monoformer, LLaVA, and Transformers. We thank the authors for their great work.