# AWQ Quantization Guide for Router Models

This guide explains how to quantize the CourseGPT-Pro router models to AWQ (Activation-aware Weight Quantization) format for efficient inference.

## Models to Quantize

- [router-gemma3-merged](https://huggingface.co/Alovestocode/router-gemma3-merged) (27B, BF16)
- [router-qwen3-32b-merged](https://huggingface.co/Alovestocode/router-qwen3-32b-merged) (33B, BF16)
## Quick Start: Google Colab

1. **Open the Colab notebook**: `quantize_to_awq_colab.ipynb`
2. **Set the runtime to GPU**: Runtime → Change runtime type → GPU (A100 recommended)
3. **Add your HF token**: Replace `your_hf_token_here` in cell 2
4. **Run all cells**: The notebook quantizes both models (a sketch of the setup it performs is shown below)
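
If you want to reproduce the notebook's setup outside Colab, the first cells boil down to installing AutoAWQ and logging in to the Hub. The exact cell contents and package list here are an assumption; treat this as a sketch:

```python
# Install the quantization stack (normally done with "!pip install ..." in Colab):
#   pip install autoawq transformers accelerate huggingface_hub

from huggingface_hub import login

HF_TOKEN = "your_hf_token_here"  # token with write access to the Alovestocode namespace
login(token=HF_TOKEN)
```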
## Requirements

- **GPU**: A100 (40GB+) or H100 recommended for the 27B-33B models
- **Time**: ~30-60 minutes per model
- **Disk space**: ~20-30GB per quantized model
- **HF token**: With write access to the `Alovestocode` namespace
## AWQ Configuration

The notebook uses the following AWQ settings:
```python
AWQ_CONFIG = {
    "w_bit": 4,            # 4-bit quantization
    "q_group_size": 128,   # Group size for quantization
    "zero_point": True,    # Use zero-point quantization
    "version": "GEMM",     # GEMM kernel (better for longer contexts)
}
```
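
These settings are passed to AutoAWQ's `quantize()` call. A minimal sketch of that step, using the standard AutoAWQ API and illustrative paths (the notebook's variable names may differ):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "Alovestocode/router-gemma3-merged"   # or router-qwen3-32b-merged
quant_path = "./router-gemma3-merged-awq"

# Load the BF16 model and tokenizer, run AWQ calibration, then save the 4-bit weights
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

model.quantize(tokenizer, quant_config=AWQ_CONFIG)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```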
## Output Repositories

By default, quantized models are saved to:

- `Alovestocode/router-gemma3-merged-awq`
- `Alovestocode/router-qwen3-32b-merged-awq`

You can change `output_repo` in the configuration to upload to an existing repo or a different name; a sketch of the upload step is shown below.
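
The upload itself is ordinary `huggingface_hub` usage; roughly (repo and folder names assume the defaults above):

```python
from huggingface_hub import HfApi

api = HfApi()  # picks up the token from login()
output_repo = "Alovestocode/router-gemma3-merged-awq"

api.create_repo(output_repo, exist_ok=True)  # no-op if the repo already exists
api.upload_folder(folder_path="./router-gemma3-merged-awq", repo_id=output_repo)
```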
## Verification

After quantization, the notebook includes a verification step that:

1. Loads the AWQ model
2. Tests generation
3. Checks the parameter count
4. Verifies model integrity
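
To re-run the check by hand after the notebook finishes, something like the following works (repo name assumes the default output repo above):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_repo = "Alovestocode/router-gemma3-merged-awq"

model = AutoAWQForCausalLM.from_quantized(quant_repo, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_repo)

# Smoke-test generation
inputs = tokenizer("Route this request:", return_tensors="pt").to("cuda")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))

# Rough size check: packed 4-bit weights should be far smaller than the BF16 originals
print(sum(p.numel() for p in model.model.parameters()) / 1e9, "B packed parameters")
```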
## Usage After Quantization

Update your `app.py` to use the AWQ models:

```python
MODELS = {
    "Router-Gemma3-27B-AWQ": {
        "repo_id": "Alovestocode/router-gemma3-merged-awq",
        "quantization": "awq",
    },
    "Router-Qwen3-32B-AWQ": {
        "repo_id": "Alovestocode/router-qwen3-32b-merged-awq",
        "quantization": "awq",
    },
}
```
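
If `app.py` serves these models with vLLM, the AWQ checkpoints load by passing `quantization="awq"`. A minimal sketch; adapt to however the app actually builds its engine:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Alovestocode/router-gemma3-merged-awq", quantization="awq")
out = llm.generate(["Route this request:"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```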
## Alternative: Using llm-compressor (vLLM)

If you prefer llm-compressor (vLLM's quantization tool), the equivalent one-shot run looks roughly like this. The recipe and argument names follow llm-compressor's AWQ example, and AWQ needs a calibration dataset; check the current docs for your installed version:

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

# One-shot AWQ quantization; "open_platypus" is one of the calibration
# datasets llm-compressor can load by name
oneshot(
    model="Alovestocode/router-gemma3-merged",
    dataset="open_platypus",
    recipe=[AWQModifier(targets=["Linear"], scheme="W4A16_ASYM", ignore=["lm_head"])],
    output_dir="./router-gemma3-awq",
    max_seq_length=2048,
    num_calibration_samples=128,
)
```

However, AutoAWQ (used in the notebook) is more mature and more widely tested for this workflow.
## Troubleshooting

### Out of Memory

- Use a larger GPU (A100 80GB or H100)
- Reduce `calibration_dataset_size` to 64 or 32 (see the sketch below)
- Quantize the models one at a time
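
How `calibration_dataset_size` maps onto AutoAWQ depends on the notebook; with recent AutoAWQ releases the corresponding knobs on `quantize()` are `max_calib_samples` and `max_calib_seq_len`. Treat the exact names as an assumption for your installed version:

```python
# Smaller calibration pass to fit in memory (argument names assume a recent AutoAWQ release)
model.quantize(
    tokenizer,
    quant_config=AWQ_CONFIG,
    max_calib_samples=64,   # e.g. the calibration_dataset_size of 64 suggested above
    max_calib_seq_len=512,
)
```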
### Slow Quantization

- Ensure you're using a high-end GPU (A100/H100)
- Check that CUDA is properly configured
- Note that vLLM's `tensor_parallel_size` spreads the model across multiple GPUs for serving; it does not speed up the quantization pass itself
### Upload Failures

- Verify the HF token has write access (see the token check below)
- Check that the repository exists or can be created
- Ensure sufficient disk space
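
A quick way to confirm the token is valid and can see the target namespace (write permission itself is configured on the token in your HF account settings):

```python
from huggingface_hub import HfApi

info = HfApi().whoami()  # raises if the token is missing or invalid
print(info["name"], [org["name"] for org in info.get("orgs", [])])
```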
## References

- [AutoAWQ Documentation](https://github.com/casper-hansen/AutoAWQ)
- [AWQ Paper](https://arxiv.org/abs/2306.00978)
- [vLLM AWQ Support](https://docs.vllm.ai/en/latest/features/quantization/awq.html)