# AWQ Quantization Guide for Router Models
This guide explains how to quantize the CourseGPT-Pro router models to AWQ (Activation-aware Weight Quantization) format for efficient inference.
## Models to Quantize
- [router-gemma3-merged](https://huggingface.co/Alovestocode/router-gemma3-merged) (27B, BF16)
- [router-qwen3-32b-merged](https://huggingface.co/Alovestocode/router-qwen3-32b-merged) (33B, BF16)
## Quick Start: Google Colab
1. **Open the Colab notebook**: `quantize_to_awq_colab.ipynb`
2. **Set runtime to GPU**: Runtime → Change runtime type → GPU (A100 recommended)
3. **Add your HF token**: Replace `your_hf_token_here` in cell 2
4. **Run all cells**: The notebook will quantize both models
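Step 3 assumes the notebook authenticates against the Hugging Face Hub before any downloads or uploads happen. If you adapt these steps to a plain Python script, the equivalent call with `huggingface_hub` is a one-liner (the token value below is a placeholder):

```python
from huggingface_hub import login

# Authenticate so the merged checkpoints can be pulled and the quantized
# weights pushed back to the Hub. Use a token with write access to the
# Alovestocode namespace.
login(token="your_hf_token_here")
```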
## Requirements
- **GPU**: A100 (40GB+) or H100 recommended for 27B-33B models
- **Time**: ~30-60 minutes per model
- **Disk Space**: ~20-30GB per quantized model
- **HF Token**: With write access to `Alovestocode` namespace
## AWQ Configuration
The notebook quantizes with the following AWQ settings:
```python
AWQ_CONFIG = {
    "w_bit": 4,            # 4-bit quantization
    "q_group_size": 128,   # Group size for quantization
    "zero_point": True,    # Use zero-point quantization
    "version": "GEMM",     # GEMM kernel (better for longer contexts)
}
```
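For reference, this is roughly how the notebook applies that configuration with AutoAWQ. The local output path is illustrative and the exact cell contents may differ:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

MODEL_ID = "Alovestocode/router-gemma3-merged"   # or router-qwen3-32b-merged
QUANT_DIR = "./router-gemma3-merged-awq"         # illustrative local output directory

# Load the merged BF16 checkpoint and its tokenizer
model = AutoAWQForCausalLM.from_pretrained(MODEL_ID, low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Run activation-aware 4-bit quantization with the settings above
model.quantize(tokenizer, quant_config=AWQ_CONFIG)

# Write the quantized weights and tokenizer side by side
model.save_quantized(QUANT_DIR)
tokenizer.save_pretrained(QUANT_DIR)
```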
## Output Repositories
By default, quantized models are saved to:
- `Alovestocode/router-gemma3-merged-awq`
- `Alovestocode/router-qwen3-32b-merged-awq`
Change `output_repo` in the notebook configuration to upload to an existing repository or a differently named one.
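If you need to push (or re-push) a quantized folder outside the notebook, a minimal sketch with `huggingface_hub` looks like this, using the default repo and folder names above:

```python
from huggingface_hub import HfApi

api = HfApi()
repo_id = "Alovestocode/router-gemma3-merged-awq"

# Create the target repo if it does not exist yet, then upload the
# quantized folder produced by the notebook.
api.create_repo(repo_id, repo_type="model", exist_ok=True)
api.upload_folder(folder_path="./router-gemma3-merged-awq", repo_id=repo_id, repo_type="model")
```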
## Verification
After quantization, the notebook includes a verification step that:
1. Loads the AWQ model
2. Tests generation
3. Checks parameter count
4. Verifies model integrity
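If you want to spot-check a quantized repo manually, a minimal sketch (loading through `transformers`, which handles AWQ checkpoints when `autoawq` is installed; the prompt is illustrative) is:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "Alovestocode/router-gemma3-merged-awq"

tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, device_map="auto", torch_dtype=torch.float16)

# Quick smoke test: the quantized model should still produce coherent text
inputs = tokenizer("Route this request to the right agent:", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0], skip_special_tokens=True))

# Rough size check: packed 4-bit weights report far fewer parameters than the BF16 originals
print(sum(p.numel() for p in model.parameters()))
```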
## Usage After Quantization
Update your `app.py` to use the AWQ models:
```python
MODELS = {
    "Router-Gemma3-27B-AWQ": {
        "repo_id": "Alovestocode/router-gemma3-merged-awq",
        "quantization": "awq",
    },
    "Router-Qwen3-32B-AWQ": {
        "repo_id": "Alovestocode/router-qwen3-32b-merged-awq",
        "quantization": "awq",
    },
}
```
```
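How `app.py` consumes the `quantization` field depends on your serving code. If it is backed by vLLM, the corresponding engine setup is roughly the following (prompt and sampling settings are illustrative):

```python
from vllm import LLM, SamplingParams

# vLLM loads AWQ checkpoints directly when quantization="awq" is set
llm = LLM(model="Alovestocode/router-gemma3-merged-awq", quantization="awq", dtype="half")

params = SamplingParams(temperature=0.2, max_tokens=64)
print(llm.generate(["Route this request to the right agent:"], params)[0].outputs[0].text)
```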
## Alternative: Using llm-compressor (vLLM)
If you prefer llm-compressor (the quantization toolkit from the vLLM project), a roughly equivalent one-shot recipe looks like this:
```python
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

# 4-bit AWQ recipe: 4-bit weights, 16-bit activations, asymmetric scheme
recipe = [AWQModifier(targets=["Linear"], scheme="W4A16_ASYM", ignore=["lm_head"])]

oneshot(
    model="Alovestocode/router-gemma3-merged",
    dataset="open_platypus",
    recipe=recipe,
    output_dir="./router-gemma3-awq",
    num_calibration_samples=128,
)
```
However, AutoAWQ (used in the notebook) is more mature and widely tested.
## Troubleshooting
### Out of Memory
- Use a larger GPU (A100 80GB or H100)
- Reduce `calibration_dataset_size` to 64 or 32
- Quantize models one at a time
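`calibration_dataset_size` is a notebook variable; continuing the AutoAWQ sketch above (which already defines `model`, `tokenizer`, and `AWQ_CONFIG`), the same effect can be had by passing a smaller custom calibration list to `quantize`. The dataset and sample count here are illustrative:

```python
from datasets import load_dataset

# A smaller calibration set (e.g. 64 samples) lowers peak memory at a
# small cost in quantization quality.
calib_texts = load_dataset("mit-han-lab/pile-val-backup", split="validation")["text"][:64]

model.quantize(tokenizer, quant_config=AWQ_CONFIG, calib_data=calib_texts)
```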
### Slow Quantization
- Ensure you're using a high-end GPU (A100/H100)
- Check that CUDA is properly configured
- Consider using multiple GPUs with `tensor_parallel_size`
### Upload Failures
- Verify HF token has write access
- Check repository exists or can be created
- Ensure sufficient disk space
## References
- [AutoAWQ Documentation](https://github.com/casper-hansen/AutoAWQ)
- [AWQ Paper](https://arxiv.org/abs/2306.00978)
- [vLLM AWQ Support](https://docs.vllm.ai/en/latest/features/quantization/awq.html)