SGLang-EAGLE3-Qwen3-Coder-30B-A3B-Instruct

This is an EAGLE3 draft model for speculative decoding with Qwen/Qwen3-Coder-30B-A3B-Instruct.

Model Description

EAGLE3 is the third generation of EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency), a speculative decoding technique in which a lightweight draft model predicts several future tokens that the target model then verifies in parallel. This typically accelerates inference by 2-3x with no loss in output quality.
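
To make the control flow concrete, here is a minimal, self-contained sketch of the draft-and-verify loop using greedy verification over a toy vocabulary. The target_next/draft_next functions and all numbers are stand-ins, and the real SGLang EAGLE3 pipeline uses tree drafting and a single batched verification pass:

import random

VOCAB = list(range(100))

def target_next(prefix):
    # Stand-in for the 30B target model: a deterministic "next token".
    return (sum(prefix) * 31 + 7) % len(VOCAB)

def draft_next(prefix):
    # Stand-in for the draft model: agrees with the target most of the time.
    return target_next(prefix) if random.random() < 0.85 else random.choice(VOCAB)

def speculative_step(prefix, num_draft=5):
    # 1) Draft model proposes num_draft tokens autoregressively (cheap).
    ctx = list(prefix)
    draft = []
    for _ in range(num_draft):
        token = draft_next(ctx)
        draft.append(token)
        ctx.append(token)
    # 2) Target verifies all proposals (one parallel forward pass in practice;
    #    emulated token by token here) and keeps the longest matching prefix.
    ctx = list(prefix)
    for token in draft:
        if target_next(ctx) != token:
            break
        ctx.append(token)
    # 3) The target always contributes one token, so progress is >= 1 per step
    #    and the output matches plain greedy decoding exactly.
    ctx.append(target_next(ctx))
    return ctx

seq = [1, 2, 3]
for _ in range(10):
    seq = speculative_step(seq)
print(f"{len(seq) - 3} tokens generated in 10 verification steps")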

Key Features

  • Target Model: Qwen3-Coder-30B-A3B-Instruct (30B parameters, 3B active)
  • Draft Model Size: ~350MB (single transformer layer)
  • Training Data: OpenPromptContainer (OPC) regenerated dataset
  • Training Steps: 295,000 (Epoch 1)
  • Training Framework: SpecForge

Training Metrics

Metric                           Value
First Token Accuracy (acc_0)     88.19%
Average Accuracy (7 positions)   85.19%
Training Epochs                  1+ (295k steps)
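
As a rough back-of-the-envelope reading of these numbers, and assuming a simple chain draft with independent acceptance at each position (tree drafting in SGLang does better), the average accuracy translates into an expected number of accepted draft tokens per verification step roughly as follows:

# Illustrative estimate only; not a measured speedup.
acc = 0.8519                  # average per-position accuracy from the table above
positions = 7
expected_accepted = sum(acc ** k for k in range(1, positions + 1))
print(f"~{expected_accepted:.2f} draft tokens accepted per verification step")
# Each target forward pass also emits one token of its own, so total progress
# is roughly 1 + expected_accepted tokens per pass.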

Usage

With SGLang

import sglang as sgl

# Launch the target model with this EAGLE3 draft for speculative decoding.
# EAGLE3 draft weights are used with SGLang's EAGLE3 algorithm.
llm = sgl.Engine(
    model_path="Qwen/Qwen3-Coder-30B-A3B-Instruct",
    speculative_algorithm="EAGLE3",
    speculative_draft_model_path="sgl-project/SGLang-EAGLE3-Qwen3-Coder-30B-A3B-Instruct",
    speculative_num_steps=5,
    speculative_eagle_topk=8,
    speculative_num_draft_tokens=64,
)

# Generate text
output = llm.generate("Write a Python function to sort a list:")
print(output["text"])

llm.shutdown()
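
In the call above, speculative_num_steps is the number of autoregressive draft steps, speculative_eagle_topk the number of candidates kept per step (the draft tree's branching factor), and speculative_num_draft_tokens the total number of draft tokens the target verifies per forward pass. The values shown are a reasonable starting point and can be tuned per workload.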

With SGLang Server

python -m sglang.launch_server \
    --model-path Qwen/Qwen3-Coder-30B-A3B-Instruct \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path sgl-project/SGLang-EAGLE3-Qwen3-Coder-30B-A3B-Instruct \
    --speculative-num-steps 5 \
    --speculative-eagle-topk 8 \
    --speculative-num-draft-tokens 64 \
    --tp 8
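
Once the server is up it exposes an OpenAI-compatible API (on port 30000 unless --port is set), and speculative decoding is transparent to clients. A minimal client sketch, assuming the default port:

import requests

resp = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen3-Coder-30B-A3B-Instruct",
        "messages": [
            {"role": "user", "content": "Write a Python function to sort a list."}
        ],
        "max_tokens": 256,
    },
)
print(resp.json()["choices"][0]["message"]["content"])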

Model Architecture

The EAGLE3 draft model is a lightweight transformer that:

  • Shares embeddings with the target model
  • Uses a single transformer layer (hidden_size=2048, intermediate_size=12288)
  • Predicts multiple future tokens autoregressively
  • Uses the target model's hidden states as input

Config excerpt (config.json):

{
  "architectures": ["LlamaForCausalLMEagle3"],
  "hidden_size": 2048,
  "intermediate_size": 12288,
  "num_attention_heads": 32,
  "num_key_value_heads": 4,
  "num_hidden_layers": 1,
  "vocab_size": 151936
}
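
For intuition, here is a heavily simplified PyTorch sketch of the draft step implied by the list and config above. The class name and the use of TransformerEncoderLayer are stand-ins; the actual LlamaForCausalLMEagle3 implementation in SpecForge/SGLang differs in detail (which target layers are fused, GQA with 4 KV heads, KV-cache handling):

import torch
import torch.nn as nn

HIDDEN, INTERMEDIATE, VOCAB = 2048, 12288, 151936

class TinyDraftModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)    # shared with the target in practice
        self.fuse = nn.Linear(2 * HIDDEN, HIDDEN)   # mix target hidden state + token embedding
        self.layer = nn.TransformerEncoderLayer(    # stand-in for the single decoder layer
            d_model=HIDDEN, nhead=32, dim_feedforward=INTERMEDIATE, batch_first=True
        )
        self.lm_head = nn.Linear(HIDDEN, VOCAB, bias=False)

    def forward(self, target_hidden, token_ids):
        # target_hidden: (batch, seq, HIDDEN) hidden states taken from the target model
        # token_ids:     (batch, seq) tokens already sampled
        x = self.fuse(torch.cat([target_hidden, self.embed(token_ids)], dim=-1))
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        x = self.layer(x, src_mask=mask)
        return self.lm_head(x)                      # logits for the next draft token

draft = TinyDraftModel()
logits = draft(torch.randn(1, 4, HIDDEN), torch.randint(0, VOCAB, (1, 4)))
print(logits.shape)  # torch.Size([1, 4, 151936])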

Training Details

  • Framework: SpecForge with SGLang backend
  • Hardware: 4x NVIDIA H200 GPUs (TP=4)
  • Batch Size: 1 per GPU
  • Learning Rate: 1e-4 with cosine annealing
  • Max Sequence Length: 4096
  • Attention Backend: FlexAttention
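
The learning-rate settings above, expressed as a plain PyTorch sketch; the optimizer choice (AdamW) and the tiny placeholder model are assumptions, and this is not the actual SpecForge training loop:

import torch

model = torch.nn.Linear(8, 8)                     # placeholder for the draft model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=295_000)

for step in range(3):                             # loss computation and backward omitted
    optimizer.step()
    scheduler.step()
print(scheduler.get_last_lr())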

Citation

If you use this model, please cite:

@article{li2024eagle,
  title={EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty},
  author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang},
  journal={arXiv preprint arXiv:2401.15077},
  year={2024}
}

@misc{sglang2024,
  title={SGLang: Efficient Execution of Structured Language Model Programs},
  author={Zheng, Lianmin and others},
  year={2024},
  url={https://github.com/sgl-project/sglang}
}

License

This model is released under the Apache 2.0 License, following the base model's license.
