Orpheus Speech Adapter v1

A speech-to-speech adapter for the Orpheus 3B model. It converts Whisper audio embeddings into Orpheus-compatible representations for end-to-end speech dialogue.

Model Description

This is a Stage 1 adapter that bridges the Whisper audio encoder to the Orpheus TTS model, enabling speech-to-speech generation without an intermediate text representation.

Architecture

Audio Input → Whisper Large-v3 (frozen) → SpeechAdapter → ExpansionBlocks → Orpheus 3B (frozen) → SNAC Tokens → Audio Output
              (1280 dim)                  (19M params)    (100M params)      (3B frozen)

Components

Component        Parameters     Description
Whisper          1.5B (frozen)  Audio encoder (whisper-large-v3)
SpeechAdapter    ~19M           Linear projection with downsampling
ExpansionBlocks  ~100M          4x Transformer blocks with cross-attention
Orpheus          3B (frozen)    LLaMA-based TTS model

Training Details

  • Dataset: 200K samples (500 hours) of Spanish/Italian speech
  • Training: 3 epochs on an RTX 4090
  • Final Loss: 18.68
  • Approach: LLaMA-Omni 2 style (adapter-based, LLM frozen)
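
Only the adapter and expansion modules receive gradients. A minimal sketch of that setup, with whisper_encoder and orpheus standing for the two pretrained models (the variable names are assumptions):

# Freeze both pretrained models; train only the ~119M adapter parameters
for frozen in (whisper_encoder, orpheus):
    for p in frozen.parameters():
        p.requires_grad = False
    frozen.eval()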

Hyperparameters

learning_rate = 5e-5
epochs = 3
batch_size = 2 (RTX 4090)
gradient_accumulation = 16
warmup_ratio = 0.05
label_smoothing = 0.1
max_grad_norm = 1.0
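
A sketch of how these values could be wired up in PyTorch. The AdamW optimizer, the linear-warmup schedule, and the compute_loss helper and dataloader are assumptions, not the exact contents of finetune_stage1.py:

import torch
from transformers import get_linear_schedule_with_warmup

params = list(adapter.parameters()) + list(expansion.parameters())
optimizer = torch.optim.AdamW(params, lr=5e-5)

accum, epochs = 16, 3
total_steps = len(dataloader) // accum * epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=int(0.05 * total_steps), num_training_steps=total_steps
)
loss_fn = torch.nn.CrossEntropyLoss(label_smoothing=0.1)

for step, batch in enumerate(dataloader):
    loss = compute_loss(batch, loss_fn)  # hypothetical helper: forward pass + CE over SNAC tokens
    (loss / accum).backward()
    if (step + 1) % accum == 0:
        torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()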

Usage

import torch
from transformers import AutoModelForCausalLM

# The adapter classes are defined in the training script (finetune_stage1.py)
from finetune_stage1 import SpeechAdapter, ExpansionBlocks

# Load the frozen Orpheus base model
orpheus = AutoModelForCausalLM.from_pretrained(
    "canopylabs/3b-es_it-ft-research_release",
    torch_dtype=torch.bfloat16
)

# Instantiate the adapter modules (default sizes assumed) and load the checkpoint
adapter = SpeechAdapter()
expansion = ExpansionBlocks()
checkpoint = torch.load("finetune_stage1_epoch3.pt", map_location="cpu")
adapter.load_state_dict(checkpoint["adapter"])
expansion.load_state_dict(checkpoint["expansion"])

# Forward pass
audio_embeds = adapter(whisper_features)      # [B, T/5, 3072]
expanded = expansion(combined, audio_embeds)  # combined: prompt embeddings + audio_embeds
outputs = orpheus(inputs_embeds=expanded)
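
The snippet above assumes precomputed whisper_features. One way to obtain them with transformers, assuming the openai/whisper-large-v3 checkpoint from the component table and a 16 kHz waveform audio_array that you supply:

import torch
from transformers import WhisperFeatureExtractor, WhisperModel

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-large-v3")
encoder = WhisperModel.from_pretrained("openai/whisper-large-v3").encoder

inputs = feature_extractor(audio_array, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    whisper_features = encoder(inputs.input_features).last_hidden_state  # [1, 1500, 1280]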

Files

  • finetune_stage1_epoch3.pt - Best checkpoint (loss 18.68)
  • finetune_stage1.py - Training script

Architecture Details

SpeechAdapter

import torch.nn as nn

class SpeechAdapter(nn.Module):
    # Concatenates `downsample` frames → projects to LLM dimension: [B, T, 1280] → [B, T/5, 3072]
    def __init__(self, whisper_dim=1280, llm_dim=3072, downsample=5):
        super().__init__()
        self.downsample = downsample
        self.proj = nn.Sequential(
            nn.Linear(whisper_dim * downsample, 2048),  # 5 * 1280 = 6400
            nn.GELU(),
            nn.Linear(2048, llm_dim),
            nn.LayerNorm(llm_dim)
        )

    def forward(self, x):
        B, T, D = x.shape
        T = T - T % self.downsample  # drop trailing frames that don't fill a group
        x = x[:, :T].reshape(B, T // self.downsample, D * self.downsample)
        return self.proj(x)
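
A quick shape check with a dummy batch (Whisper large-v3 emits 1500 encoder frames for 30 s of audio):

import torch
feats = torch.randn(2, 1500, 1280)   # dummy Whisper encoder output
print(SpeechAdapter()(feats).shape)  # torch.Size([2, 300, 3072])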

ExpansionBlocks

class ExpansionBlock(nn.Module):
    # Stacked 4x; each block applies:
    # - Causal self-attention
    # - Cross-attention (to audio embeddings)
    # - SwiGLU FFN
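
The full definition ships in finetune_stage1.py; below is a minimal PyTorch sketch of one block, assuming pre-norm residual wiring, nn.MultiheadAttention, and a head count and FFN width that are not documented here:

import torch
import torch.nn as nn

class ExpansionBlock(nn.Module):
    def __init__(self, dim=3072, n_heads=16, ffn_mult=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        hidden = ffn_mult * dim
        # SwiGLU FFN: SiLU(gate) * up, projected back down
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, x, audio_embeds):
        # Causal self-attention over the combined sequence
        T = x.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, attn_mask=causal)[0]
        # Cross-attention into the projected audio embeddings
        h = self.norm2(x)
        x = x + self.cross_attn(h, audio_embeds, audio_embeds)[0]
        # SwiGLU feed-forward
        h = self.norm3(x)
        return x + self.w_down(nn.functional.silu(self.w_gate(h)) * self.w_up(h))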

Training Comparison

Approach              Dataset Needed  Our Loss
Interleaved (IST-LM)  100K+ hours     23.94 (stagnated)
Adapter-based         ~500 hours      18.68 ✓

The adapter-based approach works better with limited data because:

  1. Orpheus stays frozen, so there is no catastrophic forgetting
  2. Only ~119M parameters (adapter + expansion) are trained, versus the full 3B model
  3. The adapter learns a projection, not generation
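
A quick sanity check of the trainable-parameter count, reusing adapter and expansion from the usage snippet:

n_trainable = sum(p.numel() for m in (adapter, expansion) for p in m.parameters())
print(f"{n_trainable / 1e6:.0f}M trainable parameters")  # ~119M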

Next Steps

  • Stage 2: Train GateFusion for adaptive blending
  • Inference pipeline for real-time speech-to-speech
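
GateFusion has not been released yet; purely as an illustration of what adaptive blending could look like, here is a hypothetical sigmoid-gated mix of two embedding streams (class name, signature, and wiring are all assumptions):

import torch
import torch.nn as nn

class GateFusion(nn.Module):
    # Hypothetical: learn a per-position, per-channel gate to blend two streams
    def __init__(self, dim=3072):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, a, b):
        g = torch.sigmoid(self.gate(torch.cat([a, b], dim=-1)))
        return g * a + (1 - g) * b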

Citation

@misc{orpheus-speech-adapter-2025,
  title={Orpheus Speech Adapter: Efficient Speech-to-Speech with Frozen LLM},
  author={Parle Audio Team},
  year={2025},
  url={https://huggingface.co/marcosremar2/orpheus-speech-adapter-v1}
}

License

Apache 2.0
