# AWQ Quantization Guide for Router Models

This guide explains how to quantize the CourseGPT-Pro router models to AWQ (Activation-aware Weight Quantization) format for efficient inference.

## Models to Quantize

- [router-gemma3-merged](https://huggingface.co/Alovestocode/router-gemma3-merged) (27B, BF16)
- [router-qwen3-32b-merged](https://huggingface.co/Alovestocode/router-qwen3-32b-merged) (33B, BF16)

## Quick Start: Google Colab

1. **Open the Colab notebook**: `quantize_to_awq_colab.ipynb`
2. **Set runtime to GPU**: Runtime → Change runtime type → GPU (A100 recommended)
3. **Add your HF token**: Replace `your_hf_token_here` in cell 2 (see the login sketch below)
4. **Run all cells**: The notebook will quantize both models

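If you prefer to authenticate outside the notebook cell, a minimal login sketch with `huggingface_hub` (the token string below is the same placeholder used in cell 2) looks like this:

```python
from huggingface_hub import login, whoami

# Placeholder token; substitute a token with write access to the
# `Alovestocode` namespace, as described in the Requirements below.
login(token="your_hf_token_here")

# Confirm the token resolves to the expected account.
print(whoami()["name"])
```
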
## Requirements

- **GPU**: A100 (40GB+) or H100 recommended for 27B-33B models
- **Time**: ~30-60 minutes per model
- **Disk Space**: ~20-30GB per quantized model
- **HF Token**: With write access to `Alovestocode` namespace

## AWQ Configuration

The notebook uses the following AWQ settings:

```python
AWQ_CONFIG = {
    "w_bit": 4,              # 4-bit quantization
    "q_group_size": 128,     # Group size for quantization
    "zero_point": True,      # Use zero-point quantization
    "version": "GEMM",       # GEMM kernel (better for longer contexts)
}
```

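For reference, a sketch of how these settings are typically applied with AutoAWQ (the notebook cells may differ in detail; the paths here are illustrative):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "Alovestocode/router-gemma3-merged"   # or router-qwen3-32b-merged
quant_path = "./router-gemma3-merged-awq"          # local output directory

# Load the merged BF16 model and its tokenizer.
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Apply the AWQ settings above, then save the 4-bit checkpoint locally.
model.quantize(tokenizer, quant_config=AWQ_CONFIG)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```
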
## Output Repositories

By default, quantized models are saved to:
- `Alovestocode/router-gemma3-merged-awq`
- `Alovestocode/router-qwen3-32b-merged-awq`

You can change `output_repo` in the notebook configuration to upload to an existing repo or to a different name.

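The notebook handles the upload step itself; if you ever need to push a locally quantized folder by hand, a minimal sketch with `huggingface_hub` (using the default names above) would be:

```python
from huggingface_hub import HfApi

api = HfApi()  # picks up the token from login() or the HF_TOKEN env var

# Create the target repo if it does not exist yet, then upload the folder.
api.create_repo("Alovestocode/router-gemma3-merged-awq", exist_ok=True)
api.upload_folder(
    repo_id="Alovestocode/router-gemma3-merged-awq",
    folder_path="./router-gemma3-merged-awq",
    repo_type="model",
)
```
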
## Verification

After quantization, the notebook includes a verification step that:
1. Loads the AWQ model
2. Tests generation
3. Checks parameter count
4. Verifies model integrity

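A minimal version of this check with AutoAWQ might look like the following (the prompt is arbitrary and the repo name assumes the default output above):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_repo = "Alovestocode/router-gemma3-merged-awq"

# Reload the quantized checkpoint and run a short generation as a smoke test.
model = AutoAWQForCausalLM.from_quantized(quant_repo, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_repo)

inputs = tokenizer("Which agent should handle a calculus question?", return_tensors="pt").input_ids.cuda()
output = model.generate(inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
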
## Usage After Quantization

Update your `app.py` to use the AWQ models:

```python
MODELS = {
    "Router-Gemma3-27B-AWQ": {
        "repo_id": "Alovestocode/router-gemma3-merged-awq",
        "quantization": "awq"
    },
    "Router-Qwen3-32B-AWQ": {
        "repo_id": "Alovestocode/router-qwen3-32b-merged-awq",
        "quantization": "awq"
    }
}
```

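If you serve the routers with vLLM (see the references below), the AWQ repos can be loaded roughly like this; the prompt and sampling values are purely illustrative:

```python
from vllm import LLM, SamplingParams

# vLLM reads the quantization config from the checkpoint;
# passing quantization="awq" makes the choice explicit.
llm = LLM(model="Alovestocode/router-gemma3-merged-awq", quantization="awq")

params = SamplingParams(temperature=0.2, max_tokens=128)
outputs = llm.generate(["Which agent should handle a calculus question?"], params)
print(outputs[0].outputs[0].text)
```
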
## Alternative: Using llm-compressor (vLLM)

If you prefer using llm-compressor (vLLM's quantization tool):

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

# Quantize the merged model to 4-bit AWQ (W4A16). Argument names follow
# the llm-compressor AWQ examples and can differ between versions, so
# treat this as a sketch and check the installed documentation.
oneshot(
    model="Alovestocode/router-gemma3-merged",
    dataset="open_platypus",       # calibration data
    recipe=[AWQModifier(scheme="W4A16_ASYM", targets=["Linear"], ignore=["lm_head"])],
    output_dir="./router-gemma3-awq",
    num_calibration_samples=128,
)
```

However, AutoAWQ (used in the notebook) is more mature and widely tested.

## Troubleshooting

### Out of Memory
- Use a larger GPU (A100 80GB or H100)
- Reduce `calibration_dataset_size` to 64 or 32
- Quantize models one at a time

### Slow Quantization
- Ensure you're using a high-end GPU (A100/H100)
- Check that CUDA is properly configured
- Consider using multiple GPUs with `tensor_parallel_size`

### Upload Failures
- Verify that your HF token has write access to the target namespace
- Check that the repository exists or can be created
- Ensure sufficient disk space

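A quick way to check the first two points before retrying the upload (a small `huggingface_hub` sketch; the repo name assumes the defaults above):

```python
from huggingface_hub import HfApi

api = HfApi()

# Which account does the current token belong to?
print(api.whoami()["name"])

# Does the target repo already exist? (create_repo with exist_ok=True
# in the upload sketch above will create it if not.)
print(api.repo_exists("Alovestocode/router-gemma3-merged-awq"))
```
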
## References

- [AutoAWQ Documentation](https://github.com/casper-hansen/AutoAWQ)
- [AWQ Paper](https://arxiv.org/abs/2306.00978)
- [vLLM AWQ Support](https://docs.vllm.ai/en/latest/features/quantization/awq.html)