---
language:
- en
tags:
- audio
- automatic-speech-recognition
- gqa
- rope
- pytorch
- safetensors
pipeline_tag: automatic-speech-recognition
license: other
license_name: gradient-ai-license-v1.0
license_link: https://huggingface.co/gradient-research/license
gated: auto
extra_gated_heading: License Agreement Required
extra_gated_prompt: >-
  By registering for access to this model, you agree to the strict terms and
  conditions of the Gradient-AI License. This model is strictly prohibited from
  being used for deception, weaponization, or illegal acts.
extra_gated_button_content: Acknowledge License and Request Access
extra_gated_fields:
  I have read and agree to be bound by the Gradient-AI License: checkbox
  Name / Organization: text
  Intended Use Case:
    type: select
    options:
    - Research
    - Education
    - label: Commercial (Requires Permission)
      value: commercial
    - label: Other
      value: other
library_name: transformers
---

# Gradient-Transcribe1 (125M)

Gradient-Transcribe1 is a compact transformer-based model for automatic speech recognition (ASR). It uses **Grouped Query Attention (GQA)** to reduce key/value cache memory during decoding and **Rotary Positional Embeddings (RoPE)** to encode relative position, which is intended to improve stability on longer sequences.

**Access to this model is gated.** Users must agree to the Gradient-AI License and provide their intended use case before downloading the weights.

## Model Details

Gradient-Transcribe1 is a sequence-to-sequence encoder-decoder model that expects 16 kHz audio. Key architectural features include:

*   **Grouped Query Attention (GQA):** 8 query heads share 4 key/value heads, which shrinks the KV cache and speeds up decoding.
*   **Rotary Positional Embeddings (RoPE):** Relative position encoding intended to improve generalization across sequence lengths.
*   **Modern Activation & Norm:** RMSNorm and SwiGLU, chosen for improved training stability.
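To make the GQA bullet concrete, here is a minimal PyTorch sketch of a grouped-query self-attention layer using the head counts from this card (8 query heads, 4 key/value heads, hidden size 768). The class name, the bias-free projections, and the head dimension of 96 (768 / 8) are assumptions for illustration; this is not the model's actual implementation, and it omits RoPE, masking, and caching.

```python
import math

import torch
import torch.nn as nn


class GroupedQueryAttention(nn.Module):
    """Illustrative GQA self-attention (hypothetical, not the released code)."""

    def __init__(self, hidden_size=768, num_q_heads=8, num_kv_heads=4):
        super().__init__()
        assert hidden_size % num_q_heads == 0
        assert num_q_heads % num_kv_heads == 0
        self.num_q_heads = num_q_heads
        self.num_kv_heads = num_kv_heads
        self.head_dim = hidden_size // num_q_heads  # assumed: 768 / 8 = 96
        # K and V projections are smaller than Q because they have fewer heads.
        self.q_proj = nn.Linear(hidden_size, num_q_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(hidden_size, num_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(hidden_size, num_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(num_q_heads * self.head_dim, hidden_size, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.num_q_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.num_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.num_kv_heads, self.head_dim).transpose(1, 2)
        # Each K/V head is shared by (num_q_heads // num_kv_heads) query heads.
        groups = self.num_q_heads // self.num_kv_heads
        k = k.repeat_interleave(groups, dim=1)
        v = v.repeat_interleave(groups, dim=1)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        out = torch.softmax(scores, dim=-1) @ v  # (b, heads, t, head_dim)
        out = out.transpose(1, 2).reshape(b, t, self.num_q_heads * self.head_dim)
        return self.o_proj(out)


attn = GroupedQueryAttention()
y = attn(torch.randn(2, 50, 768))  # output keeps the input shape
```

Only the 4 key/value heads would need to be cached during decoding; the expansion to 8 heads happens at attention time.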

### Specifications

| Component            | Configuration |
|----------------------|---------------|
| **Parameters**       | 138,044,928   |
| **Hidden Size**      | 768           |
| **Encoder Layers**   | 8             |
| **Decoder Layers**   | 10            |
| **Attention Heads**  | 8 (Q), 4 (KV) |
| **Vocab Size**       | 1024          |
| **Mel Bins**         | 80            |
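As a rough worked example of what the 8 (Q) / 4 (KV) split means for memory, the sketch below estimates the decoder self-attention KV cache per generated token. It assumes a head dimension of 96 (hidden size 768 divided by 8 query heads), fp16 storage, and that GQA is applied in all 10 decoder layers; none of these are confirmed by this card, and the cross-attention cache is ignored.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache K and V for one token across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


HEAD_DIM = 768 // 8  # assumption: hidden size split evenly across query heads

gqa_bytes = kv_cache_bytes_per_token(num_layers=10, num_kv_heads=4, head_dim=HEAD_DIM)
mha_bytes = kv_cache_bytes_per_token(num_layers=10, num_kv_heads=8, head_dim=HEAD_DIM)

print(gqa_bytes, mha_bytes)  # 15360 30720
```

Under these assumptions, halving the number of key/value heads halves the self-attention cache relative to a standard multi-head layout with 8 key/value heads.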

## Usage

Due to the custom nature of this architecture, you must set `trust_remote_code=True` when loading the model.

### Loading the Model
```python
from transformers import AutoModel, AutoTokenizer

repo_id = "your-username/gradient-transcribe1-125m"

# Load the model (requires approved access and a logged-in Hugging Face token)
model = AutoModel.from_pretrained(
    repo_id,
    trust_remote_code=True,
    token=True,  # replaces the deprecated `use_auth_token` argument
)

# Load the tokenizer from the same gated repository
tokenizer = AutoTokenizer.from_pretrained(repo_id, token=True)
```

### Transcription Example
```python
import torch
import librosa

# Load the audio and resample it to the 16 kHz rate the model expects
audio, _ = librosa.load("sample_audio.wav", sr=16000)

# Note: Pre-processing to Mel-spectrogram must match the model's 80-bin configuration.
# transcription = model.generate(input_features)
```

## Training Data

Gradient-Transcribe1 was trained on a combination of curated speech datasets and synthetic data to validate the performance of GQA in ASR tasks. It is currently optimized for English speech.

## Limitations and Biases

*   **Intended Use:** This model is designed for research and educational purposes. Use for deception, weaponization, or illegal acts is strictly prohibited.
*   **Hallucinations:** As a sequence-to-sequence model, it may generate text that was not spoken in the audio, particularly in high-noise environments.
*   **Domain Specificity:** Performance may vary across accents, dialects, and technical terminology.

## License

This model is licensed under the Gradient-AI License v1.0. By requesting access, you agree to abide by the terms specified at [gradient-research/license](https://huggingface.co/gradient-research/license).