AWQ Quantization Guide for Router Models

This guide explains how to quantize the CourseGPT-Pro router models to AWQ (Activation-aware Weight Quantization) format for efficient inference.

Models to Quantize

  • Alovestocode/router-gemma3-merged (Gemma3 27B router)
  • Alovestocode/router-qwen3-32b-merged (Qwen3 32B router)

Quick Start: Google Colab

  1. Open the Colab notebook: quantize_to_awq_colab.ipynb
  2. Set runtime to GPU: Runtime → Change runtime type → GPU (A100 recommended)
  3. Add your HF token: Replace your_hf_token_here in cell 2 (see the snippet after this list)
  4. Run all cells: The notebook will quantize both models
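
Step 3 only requires pasting the token into the notebook; below is a minimal sketch of what that setup cell does, assuming the notebook authenticates with huggingface_hub (the HF_TOKEN variable name is illustrative):

import torch
from huggingface_hub import login

HF_TOKEN = "your_hf_token_here"  # placeholder; replace with your own token
login(token=HF_TOKEN)

# Confirm the GPU runtime is active before starting quantization
assert torch.cuda.is_available(), "Switch the Colab runtime to a GPU"
print(torch.cuda.get_device_name(0))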

Requirements

  • GPU: A100 (40GB+) or H100 recommended for 27B-33B models
  • Time: ~30-60 minutes per model
  • Disk Space: ~20-30GB per quantized model
  • HF Token: With write access to Alovestocode namespace

AWQ Configuration

The notebook uses optimized AWQ settings:

AWQ_CONFIG = {
    "w_bit": 4,              # 4-bit quantization
    "q_group_size": 128,     # Group size for quantization
    "zero_point": True,      # Use zero-point quantization
    "version": "GEMM",       # GEMM kernel (better for longer contexts)
}
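
The notebook passes this dictionary to AutoAWQ. A minimal sketch of that quantization step, assuming the standard AutoAWQ workflow (paths and variable names are illustrative):

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "Alovestocode/router-gemma3-merged"
quant_path = "./router-gemma3-awq"

# Load the merged model and its tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Run activation-aware calibration and write the 4-bit checkpoint
model.quantize(tokenizer, quant_config=AWQ_CONFIG)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)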

Output Repositories

By default, quantized models are saved to:

  • Alovestocode/router-gemma3-merged-awq
  • Alovestocode/router-qwen3-32b-merged-awq

You can change output_repo in the configuration to upload to an existing repository or to a different name.
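
After quantization, the folders still need to be pushed to the Hub. A minimal sketch of that upload step, assuming it is done directly with huggingface_hub (folder and repo names are illustrative):

from huggingface_hub import create_repo, upload_folder

output_repo = "Alovestocode/router-gemma3-merged-awq"

# Create the target repo if needed, then upload the quantized checkpoint
create_repo(output_repo, exist_ok=True)
upload_folder(folder_path="./router-gemma3-awq", repo_id=output_repo)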

Verification

After quantization, the notebook includes a verification step (sketched below) that:

  1. Loads the AWQ model
  2. Tests generation
  3. Checks parameter count
  4. Verifies model integrity
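
A minimal sketch of such a check, assuming the quantized folder is loaded back with transformers and autoawq installed (prompt and paths are illustrative):

from transformers import AutoModelForCausalLM, AutoTokenizer

quant_path = "./router-gemma3-awq"

# Load the AWQ checkpoint and run a short test generation
tokenizer = AutoTokenizer.from_pretrained(quant_path)
model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="auto")
inputs = tokenizer("Route the following request:", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0], skip_special_tokens=True))

# Report the (packed) parameter count as a sanity check
print(f"{sum(p.numel() for p in model.parameters()) / 1e9:.2f}B parameters")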

Usage After Quantization

Update your app.py to use the AWQ models:

MODELS = {
    "Router-Gemma3-27B-AWQ": {
        "repo_id": "Alovestocode/router-gemma3-merged-awq",
        "quantization": "awq"
    },
    "Router-Qwen3-32B-AWQ": {
        "repo_id": "Alovestocode/router-qwen3-32b-merged-awq",
        "quantization": "awq"
    }
}
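
With autoawq installed, transformers loads these checkpoints directly, so no extra quantization step is needed at load time. A minimal sketch of the loading path, assuming app.py uses transformers (the load_model helper is illustrative, not the Space's actual code):

from transformers import AutoModelForCausalLM, AutoTokenizer

def load_model(repo_id: str):
    # AWQ weights are detected from the checkpoint's quantization_config
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")
    return model, tokenizer

model, tokenizer = load_model(MODELS["Router-Gemma3-27B-AWQ"]["repo_id"])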

Alternative: Using llm-compressor (vLLM)

If you prefer using llm-compressor (vLLM's quantization tool):

from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

# 4-bit weight-only AWQ recipe; W4A16_ASYM roughly matches the zero-point,
# 4-bit settings used by AutoAWQ above. Argument names can differ between
# llm-compressor releases, so check the docs for your installed version.
recipe = [AWQModifier(targets=["Linear"], scheme="W4A16_ASYM", ignore=["lm_head"])]

# Quantize the model (a calibration dataset and sample count are normally
# passed as well, e.g. dataset=... and num_calibration_samples=...)
oneshot(
    model="Alovestocode/router-gemma3-merged",
    recipe=recipe,
    output_dir="./router-gemma3-awq",
)
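
llm-compressor saves its output in the compressed-tensors format, which vLLM detects automatically. A minimal sketch of loading the result above with vLLM (paths and prompt are illustrative):

from vllm import LLM, SamplingParams

llm = LLM(model="./router-gemma3-awq")
outputs = llm.generate(["Route the following request:"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)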

However, AutoAWQ (used in the notebook) is more mature and widely tested.

Troubleshooting

Out of Memory

  • Use a larger GPU (A100 80GB or H100)
  • Reduce calibration_dataset_size to 64 or 32
  • Quantize models one at a time

Slow Quantization

  • Ensure you're using a high-end GPU (A100/H100)
  • Check that CUDA is properly configured
  • Consider using multiple GPUs with tensor_parallel_size

Upload Failures

  • Verify HF token has write access (see the check below)
  • Check repository exists or can be created
  • Ensure sufficient disk space
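
A quick way to check the first two points, assuming huggingface_hub is available (token and repo name are illustrative):

from huggingface_hub import HfApi

api = HfApi(token="your_hf_token_here")
print(api.whoami()["name"])  # fails if the token is invalid

# Raises an error if the token cannot create or write to the repo
api.create_repo("Alovestocode/router-gemma3-merged-awq", exist_ok=True)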

References