AWQ Quantization Guide for Router Models
This guide explains how to quantize the CourseGPT-Pro router models to AWQ (Activation-aware Weight Quantization) format for efficient inference.
Models to Quantize
- router-gemma3-merged (27B, BF16)
- router-qwen3-32b-merged (33B, BF16)
Quick Start: Google Colab
- Open the Colab notebook: `quantize_to_awq_colab.ipynb`
- Set runtime to GPU: Runtime → Change runtime type → GPU (A100 recommended)
- Add your HF token: Replace `your_hf_token_here` in cell 2
- Run all cells: The notebook will quantize both models
Requirements
- GPU: A100 (40GB+) or H100 recommended for 27B-33B models
- Time: ~30-60 minutes per model
- Disk Space: ~20-30GB per quantized model
- HF Token: With write access to the `Alovestocode` namespace
AWQ Configuration
The notebook uses optimized AWQ settings:
```python
AWQ_CONFIG = {
    "w_bit": 4,           # 4-bit quantization
    "q_group_size": 128,  # Group size for quantization
    "zero_point": True,   # Use zero-point quantization
    "version": "GEMM",    # GEMM kernel (better for longer contexts)
}
```
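For reference, here is a minimal sketch of how such a config is typically applied with AutoAWQ; the exact notebook cells may differ, and the local output path is an assumption:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "Alovestocode/router-gemma3-merged"
quant_path = "router-gemma3-merged-awq"  # local output folder (assumed name)

# Load the merged BF16 model and its tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quantize using the AWQ_CONFIG defined above (AutoAWQ's default calibration data)
model.quantize(tokenizer, quant_config=AWQ_CONFIG)

# Save the quantized weights and tokenizer locally before uploading to the Hub
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```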
Output Repositories
By default, quantized models are saved to:
- `Alovestocode/router-gemma3-merged-awq`
- `Alovestocode/router-qwen3-32b-merged-awq`

You can modify the `output_repo` in the configuration to upload to existing repos or different names.
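If you need to push a quantized folder manually (outside the notebook), here is a sketch using huggingface_hub; the local folder name is an assumption:

```python
from huggingface_hub import HfApi

api = HfApi(token="your_hf_token_here")  # token with write access to Alovestocode

# Create the target repo if it does not exist yet, then upload the quantized folder
api.create_repo("Alovestocode/router-gemma3-merged-awq", exist_ok=True)
api.upload_folder(
    folder_path="router-gemma3-merged-awq",
    repo_id="Alovestocode/router-gemma3-merged-awq",
)
```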
Verification
After quantization, the notebook includes a verification step that:
- Loads the AWQ model
- Tests generation
- Checks parameter count
- Verifies model integrity
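A minimal version of that smoke test, assuming the quantized model has already been pushed to the Hub and a CUDA GPU is available (the prompt is only illustrative):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_path = "Alovestocode/router-gemma3-merged-awq"

# Load the quantized model and run a short generation as a smoke test
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path)

inputs = tokenizer("Route this request to the right agent:", return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))

# Rough parameter count (quantized linear layers store packed int4 weights)
print(sum(p.numel() for p in model.model.parameters()))
```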
Usage After Quantization
Update your app.py to use the AWQ models:
```python
MODELS = {
    "Router-Gemma3-27B-AWQ": {
        "repo_id": "Alovestocode/router-gemma3-merged-awq",
        "quantization": "awq",
    },
    "Router-Qwen3-32B-AWQ": {
        "repo_id": "Alovestocode/router-qwen3-32b-merged-awq",
        "quantization": "awq",
    },
}
```
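How app.py actually loads these entries is not shown here; one possible loader, assuming the Space serves the models with vLLM (the `load_model` helper and prompt are hypothetical):

```python
from vllm import LLM, SamplingParams

# MODELS dict as defined above

def load_model(name: str) -> LLM:
    cfg = MODELS[name]
    # vLLM can detect AWQ weights from the checkpoint; passing quantization makes it explicit
    return LLM(model=cfg["repo_id"], quantization=cfg["quantization"])

llm = load_model("Router-Gemma3-27B-AWQ")
params = SamplingParams(temperature=0.2, max_tokens=64)
result = llm.generate(["Which agent should handle a gradient-descent question?"], params)
print(result[0].outputs[0].text)
```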
Alternative: Using llm-compressor (vLLM)
If you prefer using llm-compressor (vLLM's quantization tool):
```python
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

# Quantize to 4-bit weights (W4A16_ASYM uses group size 128 and a zero point,
# matching the AWQ config above); llm-compressor expects a recipe of modifiers
# plus a calibration dataset.
oneshot(
    model="Alovestocode/router-gemma3-merged",
    dataset="open_platypus",
    recipe=[AWQModifier(targets=["Linear"], scheme="W4A16_ASYM", ignore=["lm_head"])],
    output_dir="./router-gemma3-awq",
    max_seq_length=2048,
    num_calibration_samples=128,
)
```
However, AutoAWQ (used in the notebook) is more mature and widely tested.
Troubleshooting
Out of Memory
- Use a larger GPU (A100 80GB or H100)
- Reduce `calibration_dataset_size` to 64 or 32
- Quantize models one at a time
Slow Quantization
- Ensure you're using a high-end GPU (A100/H100)
- Check that CUDA is properly configured
- Consider using multiple GPUs with `tensor_parallel_size`
Upload Failures
- Verify HF token has write access
- Check repository exists or can be created
- Ensure sufficient disk space