---
language:
- en
tags:
- audio
- automatic-speech-recognition
- gqa
- rope
- pytorch
- safetensors
pipeline_tag: automatic-speech-recognition
license: other
license_name: gradient-ai-license-v1.0
license_link: https://huggingface.co/gradient-research/license
gated: auto
extra_gated_heading: License Agreement Required
extra_gated_prompt: >-
  By registering for access to this model, you agree to the strict terms and
  conditions of the Gradient-AI License. This model is strictly prohibited from
  being used for deception, weaponization, or illegal acts.
extra_gated_button_content: Acknowledge License and Request Access
extra_gated_fields:
  I have read and agree to be bound by the Gradient-AI License: checkbox
  Name / Organization: text
  Intended Use Case:
    type: select
    options:
    - Research
    - Education
    - label: Commercial (Requires Permission)
      value: commercial
    - label: Other
      value: other
library_name: transformers
---

# Gradient-Transcribe1 (125M)

Gradient-Transcribe1 is a high-efficiency transformer-based model for automatic speech recognition (ASR). It incorporates modern architectural advancements such as **Grouped Query Attention (GQA)** and **Rotary Positional Embeddings (RoPE)** to deliver superior inference performance and long-context stability.

**Access to this model is gated.** Users must agree to the Gradient-AI License and provide their intended use case before downloading the weights.

## Model Details

Gradient-Transcribe1 is a sequence-to-sequence encoder-decoder model optimized for 16kHz audio. Key architectural features include:

* **Grouped Query Attention (GQA):** Optimized for faster decoding and reduced KV cache memory footprint.
* **Rotary Positional Embeddings (RoPE):** Enhanced relative position encoding for better sequence length generalization.
* **Modern Activation & Norm:** Utilizing RMSNorm and SwiGLU for improved training stability.

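
The KV cache saving attributed to GQA can be estimated with simple arithmetic. The sketch below is illustrative only: it takes the hidden size, head counts, and decoder depth from the specifications table, and it assumes a head dimension equal to the hidden size divided by the number of query heads, 16-bit cache values, and a hypothetical decoding length of 448 tokens. It counts only the decoder self-attention cache; none of these assumptions are confirmed details of the released implementation.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache self-attention keys and values for one sequence."""
    # Factor of 2: each layer stores one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Values from the specifications table.
hidden_size = 768
num_q_heads = 8
num_kv_heads = 4
decoder_layers = 10

# Assumption: head_dim = hidden_size / num_q_heads (not stated in the model card).
head_dim = hidden_size // num_q_heads

# Hypothetical decoding length, chosen only for illustration.
seq_len = 448

# Baseline: standard multi-head attention keeps one KV head per query head.
mha = kv_cache_bytes(decoder_layers, num_q_heads, head_dim, seq_len)
# GQA: query heads share a smaller set of KV heads.
gqa = kv_cache_bytes(decoder_layers, num_kv_heads, head_dim, seq_len)

print(f"MHA cache: {mha / 1e6:.2f} MB")
print(f"GQA cache: {gqa / 1e6:.2f} MB")
print(f"GQA / MHA: {gqa / mha:.2f}")
```

Under these assumptions the cache size scales linearly with the number of KV heads, so going from 8 to 4 KV heads halves the decoder self-attention cache regardless of sequence length.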
### Specifications

| Component            | Configuration |
|----------------------|---------------|
| **Parameters**       | 138,044,928   |
| **Hidden Size**      | 768           |
| **Encoder Layers**   | 8             |
| **Decoder Layers**   | 10            |
| **Attention Heads**  | 8 (Q), 4 (KV) |
| **Vocab Size**       | 1024          |
| **Mel Bins**         | 80            |

## Usage

Because this architecture is custom, you must set `trust_remote_code=True` when loading the model.

### Loading the Model

```python
from transformers import AutoModel, AutoTokenizer

# Load the model (requires approved access).
# token=True reads the access token saved by `huggingface-cli login`.
model = AutoModel.from_pretrained(
    "your-username/gradient-transcribe1-125m",
    trust_remote_code=True,
    token=True,
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("your-username/gradient-transcribe1-125m")
```

### Transcription Example

```python
import torch
import librosa

# Load 16kHz audio
audio, _ = librosa.load("sample_audio.wav", sr=16000)

# Note: Pre-processing to Mel-spectrogram must match the model's 80-bin configuration.
# transcription = model.generate(input_features)
```

## Training Data

Gradient-Transcribe1 was trained on a combination of curated speech datasets and synthetic data to validate the performance of GQA in ASR tasks. It is currently optimized for English speech.

## Limitations and Biases

* **Intended Use:** This model is designed for research and educational purposes. Usage for deceptive, weaponized, or illegal acts is strictly prohibited.
* **Hallucinations:** As a sequence-to-sequence model, it may generate text that is not present in the audio, particularly in high-noise environments.
* **Domain Specificity:** Performance may vary across accents, dialects, and technical terminology.

## License

This model is licensed under the Gradient-AI License v1.0. By requesting access, you agree to abide by the terms specified at [gradient-research/license](https://huggingface.co/gradient-research/license).