AWQ Quantization Guide for Router Models

This guide explains how to quantize the CourseGPT-Pro router models to AWQ (Activation-aware Weight Quantization) format for efficient inference.

Models to Quantize

  • Alovestocode/router-gemma3-merged (Gemma3 27B router)
  • Alovestocode/router-qwen3-32b-merged (Qwen3 32B router)

Quick Start: Google Colab

  1. Open the Colab notebook: quantize_to_awq_colab.ipynb
  2. Set runtime to GPU: Runtime → Change runtime type → GPU (A100 recommended)
  3. Add your HF token: Replace your_hf_token_here in cell 2 (see the snippet after this list)
  4. Run all cells: The notebook will quantize both models
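
Step 3 only requires pasting the token into the notebook; below is a minimal sketch of what that setup cell does, assuming the notebook authenticates with huggingface_hub (the HF_TOKEN variable name is illustrative):

import torch
from huggingface_hub import login

HF_TOKEN = "your_hf_token_here"  # placeholder; replace with your own token
login(token=HF_TOKEN)

# Confirm the GPU runtime is active before starting quantization
assert torch.cuda.is_available(), "Switch the Colab runtime to a GPU"
print(torch.cuda.get_device_name(0))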

Requirements

  • GPU: A100 (40GB+) or H100 recommended for 27B-33B models
  • Time: ~30-60 minutes per model
  • Disk Space: ~20-30GB per quantized model
  • HF Token: With write access to Alovestocode namespace

AWQ Configuration

The notebook uses optimized AWQ settings:

AWQ_CONFIG = {
    "w_bit": 4,              # 4-bit quantization
    "q_group_size": 128,     # Group size for quantization
    "zero_point": True,      # Use zero-point quantization
    "version": "GEMM",       # GEMM kernel (better for longer contexts)
}
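
The notebook passes this dictionary to AutoAWQ. A minimal sketch of that quantization step, assuming the standard AutoAWQ workflow (paths and variable names are illustrative):

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "Alovestocode/router-gemma3-merged"
quant_path = "./router-gemma3-awq"

# Load the merged model and its tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Run activation-aware calibration and write the 4-bit checkpoint
model.quantize(tokenizer, quant_config=AWQ_CONFIG)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)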

Output Repositories

By default, quantized models are saved to:

  • Alovestocode/router-gemma3-merged-awq
  • Alovestocode/router-qwen3-32b-merged-awq

You can change output_repo in the configuration to upload to an existing repository or to a different name.
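
After quantization, the folders still need to be pushed to the Hub. A minimal sketch of that upload step, assuming it is done directly with huggingface_hub (folder and repo names are illustrative):

from huggingface_hub import create_repo, upload_folder

output_repo = "Alovestocode/router-gemma3-merged-awq"

# Create the target repo if needed, then upload the quantized checkpoint
create_repo(output_repo, exist_ok=True)
upload_folder(folder_path="./router-gemma3-awq", repo_id=output_repo)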

Verification

After quantization, the notebook includes a verification step (sketched below) that:

  1. Loads the AWQ model
  2. Tests generation
  3. Checks parameter count
  4. Verifies model integrity
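
A minimal sketch of such a check, assuming the quantized folder is loaded back with transformers and autoawq installed (prompt and paths are illustrative):

from transformers import AutoModelForCausalLM, AutoTokenizer

quant_path = "./router-gemma3-awq"

# Load the AWQ checkpoint and run a short test generation
tokenizer = AutoTokenizer.from_pretrained(quant_path)
model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="auto")
inputs = tokenizer("Route the following request:", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0], skip_special_tokens=True))

# Report the (packed) parameter count as a sanity check
print(f"{sum(p.numel() for p in model.parameters()) / 1e9:.2f}B parameters")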

Usage After Quantization

Update your app.py to use the AWQ models:

MODELS = {
    "Router-Gemma3-27B-AWQ": {
        "repo_id": "Alovestocode/router-gemma3-merged-awq",
        "quantization": "awq"
    },
    "Router-Qwen3-32B-AWQ": {
        "repo_id": "Alovestocode/router-qwen3-32b-merged-awq",
        "quantization": "awq"
    }
}
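
With autoawq installed, transformers loads these checkpoints directly, so no extra quantization step is needed at load time. A minimal sketch of the loading path, assuming app.py uses transformers (the load_model helper is illustrative, not the Space's actual code):

from transformers import AutoModelForCausalLM, AutoTokenizer

def load_model(repo_id: str):
    # AWQ weights are detected from the checkpoint's quantization_config
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")
    return model, tokenizer

model, tokenizer = load_model(MODELS["Router-Gemma3-27B-AWQ"]["repo_id"])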

Alternative: Using llm-compressor (vLLM)

If you prefer using llm-compressor (vLLM's quantization tool):

from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

# 4-bit weight-only AWQ recipe; W4A16_ASYM roughly matches the zero-point,
# 4-bit settings used by AutoAWQ above. Argument names can differ between
# llm-compressor releases, so check the docs for your installed version.
recipe = [AWQModifier(targets=["Linear"], scheme="W4A16_ASYM", ignore=["lm_head"])]

# Quantize the model (a calibration dataset and sample count are normally
# passed as well, e.g. dataset=... and num_calibration_samples=...)
oneshot(
    model="Alovestocode/router-gemma3-merged",
    recipe=recipe,
    output_dir="./router-gemma3-awq",
)
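
llm-compressor saves its output in the compressed-tensors format, which vLLM detects automatically. A minimal sketch of loading the result above with vLLM (paths and prompt are illustrative):

from vllm import LLM, SamplingParams

llm = LLM(model="./router-gemma3-awq")
outputs = llm.generate(["Route the following request:"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)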

However, AutoAWQ (used in the notebook) is more mature and widely tested.

Troubleshooting

Out of Memory

  • Use a larger GPU (A100 80GB or H100)
  • Reduce calibration_dataset_size to 64 or 32
  • Quantize models one at a time

Slow Quantization

  • Ensure you're using a high-end GPU (A100/H100)
  • Check that CUDA is properly configured
  • Consider using multiple GPUs with tensor_parallel_size

Upload Failures

  • Verify HF token has write access (see the check below)
  • Check repository exists or can be created
  • Ensure sufficient disk space
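
A quick way to check the first two points, assuming huggingface_hub is available (token and repo name are illustrative):

from huggingface_hub import HfApi

api = HfApi(token="your_hf_token_here")
print(api.whoami()["name"])  # fails if the token is invalid

# Raises an error if the token cannot create or write to the repo
api.create_repo("Alovestocode/router-gemma3-merged-awq", exist_ok=True)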

References