# Orpheus Speech Adapter v1

Speech-to-speech adapter for the Orpheus 3B model. It converts Whisper audio embeddings into Orpheus-compatible representations for end-to-end speech dialogue.
## Model Description

This is a Stage 1 adapter that bridges the Whisper audio encoder to the Orpheus TTS model, enabling speech-to-speech generation without an intermediate text step.
## Architecture

```
Audio Input
  → Whisper Large-v3 (frozen, 1280-dim features)
  → SpeechAdapter (~19M params)
  → ExpansionBlocks (~100M params)
  → Orpheus 3B (frozen)
  → SNAC Tokens
  → Audio Output
```
### Components
| Component | Parameters | Description |
|---|---|---|
| Whisper | 1.5B (frozen) | Audio encoder (whisper-large-v3) |
| SpeechAdapter | ~19M | Linear projection with downsampling |
| ExpansionBlocks | ~100M | 4x Transformer blocks with cross-attention |
| Orpheus | 3B (frozen) | LLaMA-based TTS model |
## Training Details
- Dataset: 200K samples (~500 hours) of Spanish/Italian speech
- Training: 3 epochs on a single RTX 4090
- Final Loss: 18.68
- Approach: LLaMA-Omni 2 style (adapter-based, LLM frozen)
### Hyperparameters

```python
learning_rate = 5e-5
epochs = 3
batch_size = 2              # per device, single RTX 4090
gradient_accumulation = 16
warmup_ratio = 0.05
label_smoothing = 0.1
max_grad_norm = 1.0
```
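For orientation, here is a minimal sketch of how these settings could be wired together in PyTorch. The real training loop lives in `finetune_stage1.py`; the optimizer choice (AdamW), the linear warmup schedule, and the names `adapter`, `expansion`, `dataloader`, and `compute_loss` are assumptions for illustration only.

```python
import torch
from torch.nn.utils import clip_grad_norm_
from transformers import get_linear_schedule_with_warmup

learning_rate = 5e-5
epochs = 3
batch_size = 2              # used when building `dataloader`
gradient_accumulation = 16
warmup_ratio = 0.05
label_smoothing = 0.1
max_grad_norm = 1.0

# Only the adapter and expansion blocks are optimized; Orpheus stays frozen
params = list(adapter.parameters()) + list(expansion.parameters())
optimizer = torch.optim.AdamW(params, lr=learning_rate)

total_steps = (len(dataloader) // gradient_accumulation) * epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(warmup_ratio * total_steps),
    num_training_steps=total_steps,
)
loss_fn = torch.nn.CrossEntropyLoss(label_smoothing=label_smoothing)

for epoch in range(epochs):
    for step, batch in enumerate(dataloader):
        loss = compute_loss(batch, loss_fn)        # hypothetical helper
        (loss / gradient_accumulation).backward()
        if (step + 1) % gradient_accumulation == 0:
            clip_grad_norm_(params, max_grad_norm)
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
```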
## Usage
```python
import torch
from transformers import AutoModelForCausalLM

# SpeechAdapter / ExpansionBlocks are defined in finetune_stage1.py (see Architecture Details)
from finetune_stage1 import SpeechAdapter, ExpansionBlocks

# Load the Orpheus base model (kept frozen)
orpheus = AutoModelForCausalLM.from_pretrained(
    "canopylabs/3b-es_it-ft-research_release",
    torch_dtype=torch.bfloat16,
)

# Instantiate the adapter modules and load the Stage 1 checkpoint
adapter = SpeechAdapter()
expansion = ExpansionBlocks()
checkpoint = torch.load("finetune_stage1_epoch3.pt", map_location="cpu")
adapter.load_state_dict(checkpoint["adapter"])
expansion.load_state_dict(checkpoint["expansion"])

# Forward pass
# `whisper_features` are frozen whisper-large-v3 encoder states (see below);
# `combined` is the input embedding sequence built in finetune_stage1.py.
audio_embeds = adapter(whisper_features)      # [B, T/5, 3072]
expanded = expansion(combined, audio_embeds)  # cross-attend to audio
outputs = orpheus(inputs_embeds=expanded)
```
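The repository does not include the Whisper preprocessing step, so the snippet below is only a sketch of one way to obtain `whisper_features` with the standard `transformers` Whisper encoder; the exact preprocessing used in `finetune_stage1.py` may differ.

```python
import torch
from transformers import WhisperFeatureExtractor, WhisperModel

# Frozen Whisper large-v3 encoder producing 1280-dim hidden states
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-large-v3")
whisper = WhisperModel.from_pretrained("openai/whisper-large-v3", torch_dtype=torch.bfloat16)
whisper.eval()

# `audio_array`: a 16 kHz mono waveform (e.g. loaded with soundfile or librosa)
inputs = feature_extractor(audio_array, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    whisper_features = whisper.encoder(
        inputs.input_features.to(torch.bfloat16)
    ).last_hidden_state  # [B, 1500, 1280]
```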
## Files

- `finetune_stage1_epoch3.pt`: best checkpoint (loss 18.68)
- `finetune_stage1.py`: training script
## Architecture Details

### SpeechAdapter
```python
import torch.nn as nn

class SpeechAdapter(nn.Module):
    def __init__(self, whisper_dim=1280, llm_dim=3072, downsample=5):
        super().__init__()
        # Concatenates `downsample` consecutive frames, then projects to the LLM dimension
        # Input: [B, T, 1280] → Output: [B, T/5, 3072]
        self.downsample = downsample
        self.proj = nn.Sequential(
            nn.Linear(whisper_dim * downsample, 2048),  # 6400 → 2048
            nn.GELU(),
            nn.Linear(2048, llm_dim),
            nn.LayerNorm(llm_dim),
        )
```
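The card only shows the projection layers; a forward pass consistent with the frame-concatenation comment might look like the sketch below (how trailing frames that do not fill a full group are handled is an assumption).

```python
    def forward(self, x):
        # x: [B, T, 1280] Whisper encoder states
        B, T, D = x.shape
        T = T - (T % self.downsample)  # assumption: drop frames that don't fill a group
        x = x[:, :T].reshape(B, T // self.downsample, D * self.downsample)
        return self.proj(x)  # [B, T/5, 3072]
```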
### ExpansionBlocks
```python
class ExpansionBlock(nn.Module):
    # Stacked 4 times. Each block applies:
    #   1. Self-attention (causal)
    #   2. Cross-attention (to the audio embeddings)
    #   3. SwiGLU FFN
    ...
```
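Since only the trained weights ship with this checkpoint, the following is a minimal sketch of one expansion block matching the outline above; the head count, FFN width, and pre-norm layout are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpansionBlock(nn.Module):
    def __init__(self, dim=3072, num_heads=8, ffn_dim=8192):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        # SwiGLU FFN: gate and value projections, then down-projection
        self.w_gate = nn.Linear(dim, ffn_dim)
        self.w_val = nn.Linear(dim, ffn_dim)
        self.w_out = nn.Linear(ffn_dim, dim)

    def forward(self, x, audio_embeds):
        # Causal self-attention
        T = x.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, attn_mask=causal)[0]
        # Cross-attention to the projected audio embeddings
        h = self.norm2(x)
        x = x + self.cross_attn(h, audio_embeds, audio_embeds)[0]
        # SwiGLU feed-forward
        h = self.norm3(x)
        return x + self.w_out(F.silu(self.w_gate(h)) * self.w_val(h))
```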
## Training Comparison
| Approach | Dataset Needed | Our Loss |
|---|---|---|
| Interleaved (IST-LM) | 100K+ hours | 23.94 (stagnated) |
| Adapter-based | ~500 hours | 18.68 ✓ |
The adapter-based approach works better with limited data because:
- Orpheus stays frozen (no catastrophic forgetting)
- Only ~119M parameters are trained (vs. the full model); see the sketch after this list
- Adapter learns projection, not generation
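A short sketch of the freezing setup this implies, reusing the `orpheus`, `adapter`, and `expansion` objects from the Usage section:

```python
# Freeze Orpheus; only the adapter and expansion blocks receive gradients
for p in orpheus.parameters():
    p.requires_grad_(False)

trainable = sum(p.numel() for m in (adapter, expansion) for p in m.parameters())
print(f"Trainable parameters: {trainable / 1e6:.0f}M")  # ~119M
```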
## Next Steps

- Stage 2: train GateFusion for adaptive blending (sketched below)
- Inference pipeline for real-time speech-to-speech
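GateFusion is not part of this release; purely as an illustration of the adaptive-blending idea, a gated fusion module could look like the hypothetical sketch below.

```python
import torch
import torch.nn as nn

class GateFusion(nn.Module):
    """Hypothetical Stage 2 module: learn how much audio vs. text to blend."""
    def __init__(self, dim=3072):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, text_embeds, audio_embeds):
        # Per-position, per-channel blending weight in [0, 1]
        g = torch.sigmoid(self.gate(torch.cat([text_embeds, audio_embeds], dim=-1)))
        return g * audio_embeds + (1 - g) * text_embeds
```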
## Citation
```bibtex
@misc{orpheus-speech-adapter-2025,
  title={Orpheus Speech Adapter: Efficient Speech-to-Speech with Frozen LLM},
  author={Parle Audio Team},
  year={2025},
  url={https://huggingface.co/marcosremar2/orpheus-speech-adapter-v1}
}
```
## License
Apache 2.0
## Model Tree

- Base model: meta-llama/Llama-3.2-3B-Instruct
- Finetuned: canopylabs/orpheus-3b-0.1-pretrained
- Finetuned: canopylabs/3b-es_it-ft-research_release