# Router Models AWQ Quantization with LLM Compressor (vLLM Native)

This notebook quantizes the CourseGPT-Pro router models to AWQ (Activation-aware Weight Quantization) format using **LLM Compressor** - vLLM's native quantization tool.

**Models to quantize:**
- `Alovestocode/router-gemma3-merged` (27B)
- `Alovestocode/router-qwen3-32b-merged` (33B)

**Output:** AWQ-quantized (4-bit weight) checkpoints, uploaded to the Hugging Face Hub and ready to serve with vLLM.

**Why LLM Compressor?**
- Native vLLM integration (better compatibility)
- Supports advanced features (pruning, combined modifiers)
- Actively maintained by vLLM team
- Optimized for vLLM inference engine

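**⚠️ IMPORTANT:** If you see errors about `AWQModifier` parameters, **restart the kernel** (Runtime → Restart runtime) and run all cells from the beginning. The notebook builds `AWQModifier` from an explicit `config_groups`/`ignore` configuration (4-bit weights, group size 128 by default; see `AWQ_CONFIG` in Section 3), so a stale import of an older `llmcompressor` build may reject these arguments.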


## 1. Install Dependencies


In [None]:
# Install required packages
# LLM Compressor is vLLM's native quantization tool
# Note: Package name is 'llmcompressor' (no hyphen), may need to install from GitHub
%pip install -q transformers accelerate huggingface_hub
%pip install -q torch --index-url https://download.pytorch.org/whl/cu118

# Try installing llmcompressor from PyPI first, fall back to GitHub if that fails
try:
    import llmcompressor
    print("✅ llmcompressor already installed")
except ImportError:
    print("Installing llmcompressor...")
    # Try PyPI first
    import subprocess
    import sys
    result = subprocess.run(
        [sys.executable, "-m", "pip", "install", "-q", "llmcompressor"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        # Fall back to GitHub installation
        print("PyPI installation failed, trying GitHub...")
        result = subprocess.run(
            [sys.executable, "-m", "pip", "install", "-q",
             "git+https://github.com/vllm-project/llm-compressor.git"],
            capture_output=True, text=True,
        )
    # Report success only if one of the two installs actually succeeded
    if result.returncode == 0:
        print("✅ llmcompressor installed")
    else:
        print(f"❌ llmcompressor installation failed:\n{result.stderr}")

# Utility function to check disk space
import shutil
def check_disk_space():
    """Check available disk space."""
    total, used, free = shutil.disk_usage("/")
    print(f"Disk Space: {free / (1024**3):.2f} GB free out of {total / (1024**3):.2f} GB total")
    return free / (1024**3)  # Return free space in GB

print("Initial disk space:")
check_disk_space()
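
Because the install cell can fall back from PyPI to GitHub, it may help to confirm which package versions actually ended up in the environment before quantizing. The following is a minimal sketch using only the standard library; `installed_versions` is a hypothetical helper name, not part of any of the packages above.

```python
from importlib import metadata

def installed_versions(packages):
    """Return {distribution name: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # Not installed in this environment
    return versions

report = installed_versions(["llmcompressor", "transformers", "torch", "accelerate"])
for name, version in report.items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

If `llmcompressor` shows as not installed here, the import cell in Section 4 will fail as well, so it is cheaper to catch the problem at this point.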


## 2. Authenticate with Hugging Face


In [None]:
from huggingface_hub import login
import os

# Login to Hugging Face (you'll need a token with write access)
# Get your token from: https://huggingface.co/settings/tokens
HF_TOKEN = "your_hf_token_here" # Replace with your token

login(token=HF_TOKEN)
os.environ["HF_TOKEN"] = HF_TOKEN
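
Pasting a token into a cell makes it easy to leak when the notebook is shared. One possible alternative, sketched here under the assumption that the token is either already exported as `HF_TOKEN` or can be typed interactively, is to resolve it at run time; `resolve_hf_token` is a hypothetical helper, not a `huggingface_hub` function.

```python
import os
from getpass import getpass

def resolve_hf_token(env=None, prompt=getpass):
    """Return the token from HF_TOKEN if set, otherwise ask for it interactively."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        # getpass hides the input, so the token is not echoed into the cell output
        token = prompt("Hugging Face token: ").strip()
    if not token:
        raise ValueError("No Hugging Face token provided")
    return token

# In the notebook this would replace the hard-coded assignment:
# HF_TOKEN = resolve_hf_token()
# login(token=HF_TOKEN)
```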


## 3. Configuration


In [None]:
# Model configurations
MODELS_TO_QUANTIZE = {
    "router-gemma3-merged": {
        "repo_id": "Alovestocode/router-gemma3-merged",
        "output_repo": "Alovestocode/router-gemma3-merged-awq",  # Or keep same repo
        "model_type": "gemma",
    },
    "router-qwen3-32b-merged": {
        "repo_id": "Alovestocode/router-qwen3-32b-merged",
        "output_repo": "Alovestocode/router-qwen3-32b-merged-awq",  # Or keep same repo
        "model_type": "qwen",
    }
}

# AWQ quantization config (defaults shared by all models)
AWQ_CONFIG = {
    "num_bits": 4,  # Weight bit-width
    "group_size": 128,  # Group size for weight quantization
    "zero_point": True,  # False would force symmetric quant (no zero-point)
    "strategy": "group",  # Quantize per group for best AWQ accuracy
    "targets": ["Linear"],  # Modules to quantize (QuantizationMixin default)
    "ignore": ["lm_head"],  # Skip final LM head
    "format": "pack-quantized",
    "observer": "minmax",
    "dynamic": False,
    "version": "GEMM",  # Kept for logging/back-compat
}


In [None]:
# Model-specific AWQ overrides. Keys match MODELS_TO_QUANTIZE entries.
# This cell must run after the one above, which defines AWQ_CONFIG and MODELS_TO_QUANTIZE.
MODEL_AWQ_OVERRIDES = {
    "router-gemma3-merged": {"group_size": 16},
}

# Derived AWQ configs per model (defaults + overrides)
MODEL_AWQ_CONFIGS = {
    model_key: {**AWQ_CONFIG, **MODEL_AWQ_OVERRIDES.get(model_key, {})}
    for model_key in MODELS_TO_QUANTIZE.keys()
}
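
The `group_size` setting trades accuracy against storage: every group of weights carries its own scale, plus a zero point when `zero_point` is `True`. The sketch below estimates the resulting bits per weight. It assumes a 16-bit scale and a zero point stored at the weight bit-width for each group; the exact packed layout is not specified in this notebook, so treat the numbers as rough estimates.

```python
def effective_bits_per_weight(num_bits, group_size, scale_bits=16, zero_point=True):
    """Weight bits plus per-group metadata, amortized over the group."""
    # Assumption: one scale (and optionally one zero point) is stored per group
    overhead_bits = scale_bits + (num_bits if zero_point else 0)
    return num_bits + overhead_bits / group_size

for group_size in (128, 16):
    bits = effective_bits_per_weight(4, group_size)
    print(f"group_size={group_size}: {bits} bits per weight")
```

Under these assumptions, the default group size of 128 costs roughly 4.16 bits per weight, while the Gemma override of 16 costs 5.25, so the smaller group size buys finer-grained scales at the price of a noticeably larger checkpoint.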


In [None]:
# Note: build_awq_modifier_config helper function is defined in the next cell
# It properly constructs QuantizationScheme and QuantizationArgs objects



## 4. Quantization Function


In [None]:
# LLM Compressor (vLLM native quantization tool)
# Import with error handling in case installation failed
try:
    from llmcompressor import oneshot
    # Correct import path: AWQModifier is in modifiers.awq, not modifiers.quantization
    from llmcompressor.modifiers.awq import AWQModifier
    from compressed_tensors.quantization import QuantizationScheme, QuantizationArgs
    from compressed_tensors.quantization.quant_args import (
        QuantizationStrategy,
        QuantizationType,
    )
    LLM_COMPRESSOR_AVAILABLE = True
    print("✅ LLM Compressor imported successfully")
except ImportError as e:
    print(f"❌ Failed to import llmcompressor/quantization deps: {e}")
    print("Please ensure llmcompressor is installed:")
    print(" %pip install llmcompressor")
    print(" OR")
    print(" %pip install git+https://github.com/vllm-project/llm-compressor.git")
    print("\nNote: If import still fails, try:")
    print(" %pip install --upgrade llmcompressor")
    LLM_COMPRESSOR_AVAILABLE = False
    raise

from transformers import AutoTokenizer
from huggingface_hub import HfApi, scan_cache_dir, upload_folder
from datasets import Dataset
from llmcompressor.recipe import Recipe
import torch
import shutil
import gc
import os

# delete_revisions is a method of the HFCacheInfo object returned by scan_cache_dir(),
# not a top-level huggingface_hub function, so check for the method instead of importing it
try:
    from huggingface_hub import HFCacheInfo
    DELETE_REVISIONS_AVAILABLE = hasattr(HFCacheInfo, "delete_revisions")
except ImportError:
    DELETE_REVISIONS_AVAILABLE = False
if not DELETE_REVISIONS_AVAILABLE:
    print("Note: delete_revisions not available, will use alternative cache cleanup method")

def build_awq_modifier_config(awq_config: dict):
    """Create config_groups/ignore settings for AWQModifier."""
    if not isinstance(awq_config, dict):
        raise ValueError("awq_config must be a dictionary of quantization settings")

    def _get(key, *aliases, default=None):
        for candidate in (key, *aliases):
            if candidate in awq_config:
                value = awq_config[candidate]
                if value is not None:
                    return value
        return default

    num_bits = _get("num_bits", "w_bit", default=4)
    group_size = _get("group_size", "q_group_size", default=128)
    zero_point = awq_config.get("zero_point", True)
    symmetric = awq_config.get("symmetric")
    if symmetric is None:
        symmetric = not bool(zero_point)

    strategy = _get("strategy", default="group")
    if isinstance(strategy, QuantizationStrategy):
        quant_strategy = strategy
    else:
        quant_strategy = QuantizationStrategy(str(strategy).lower())

    qtype = awq_config.get("type", QuantizationType.INT)
    if isinstance(qtype, QuantizationType):
        quant_type = qtype
    else:
        quant_type = QuantizationType(str(qtype).lower())

    weights_args = QuantizationArgs(
        num_bits=num_bits,
        group_size=group_size,
        symmetric=symmetric,
        strategy=quant_strategy,
        type=quant_type,
        dynamic=awq_config.get("dynamic", False),
        observer=awq_config.get("observer", "minmax"),
    )

    quant_scheme = QuantizationScheme(
        targets=awq_config.get("targets", ["Linear"]),
        weights=weights_args,
        input_activations=None,
        output_activations=None,
        format=awq_config.get("format", "pack-quantized"),
    )

    config_groups = {"group_0": quant_scheme}
    ignore = awq_config.get("ignore", ["lm_head"])
    return config_groups, ignore

def quantize_model_to_awq(
    model_name: str,
    repo_id: str,
    output_repo: str,
    model_type: str,
    awq_config: dict,
    calibration_dataset_size: int = 128
):
    """Quantize a model to AWQ format using LLM Compressor (vLLM native).

    Args:
        model_name: Display name for the model
        repo_id: Source Hugging Face repo ID
        output_repo: Destination Hugging Face repo ID
        model_type: Model type (gemma/qwen); accepted for reference but not used by this function
        awq_config: AWQ quantization configuration
        calibration_dataset_size: Number of calibration samples
    """
    print(f"\n{'='*60}")
    print(f"Quantizing {model_name} with LLM Compressor (vLLM native)")
    print(f"Source: {repo_id}")
    print(f"Destination: {output_repo}")
    print(f"{'='*60}\n")

    # Check disk space before starting
    free_space_before = check_disk_space()
    if free_space_before < 30:
        print(f"⚠️ WARNING: Low disk space ({free_space_before:.2f} GB). Quantization may fail.")

    # Step 1: Create temporary output directory
    temp_output_dir = f"./temp_{model_name.replace('-', '_')}_awq"
    print(f"[1/4] Creating temporary output directory: {temp_output_dir}")
    os.makedirs(temp_output_dir, exist_ok=True)

    # Step 2: Prepare calibration dataset
    print(f"\n[2/4] Preparing calibration dataset ({calibration_dataset_size} samples)...")

    # Create calibration dataset for router agent
    calibration_texts = [
        "You are the Router Agent coordinating Math, Code, and General-Search specialists.",
        "Emit EXACTLY ONE strict JSON object with keys route_plan, route_rationale, expected_artifacts,",
        "Solve a quadratic equation using Python programming.",
        "Implement a binary search algorithm with proper error handling.",
        "Explain the concept of gradient descent in machine learning.",
        "Write a function to calculate the Fibonacci sequence recursively.",
        "Design a REST API endpoint for user authentication.",
        "Analyze the time complexity of merge sort algorithm.",
    ]

    # Repeat the seed prompts until the desired size is reached
    while len(calibration_texts) < calibration_dataset_size:
        calibration_texts.extend(calibration_texts[:calibration_dataset_size - len(calibration_texts)])

    calibration_texts = calibration_texts[:calibration_dataset_size]
    print(f"✅ Calibration dataset prepared: {len(calibration_texts)} samples")
 
    # Step 3: Quantize model using LLM Compressor
    print(f"\n[3/4] Quantizing model to AWQ with LLM Compressor (this may take 30-60 minutes)...")
    print(f"Config: {awq_config}")
    print("⚠️ LLM Compressor will load the model, quantize it, and save to local directory")

    if not LLM_COMPRESSOR_AVAILABLE:
        raise ImportError("LLM Compressor is not available. Please install it first.")

    try:
        # LLM Compressor's oneshot function handles everything:
        # - Loading the model
        # - Quantization with calibration data
        # - Saving quantized model
        print(f" → Starting quantization with LLM Compressor...")
        print(f" → This may take 30-60 minutes depending on model size...")

        print(f" → Creating QuantizationScheme for AWQModifier...")
        config_groups, ignore_modules = build_awq_modifier_config(awq_config)
        first_group = next(iter(config_groups.values()))
        bits = first_group.weights.num_bits if first_group.weights else "?"
        group_sz = first_group.weights.group_size if first_group.weights else "?"
        print(f" ✅ AWQ config ready ({bits}-bit, group size {group_sz})")
        print(f" → Creating AWQModifier with structured config...")
        modifiers = [
            AWQModifier(
                config_groups=config_groups,
                ignore=ignore_modules,
            )
        ]
        print(f" ✅ AWQModifier created successfully")

        # Call oneshot with the modifier
        # oneshot() uses HfArgumentParser which only understands ModelArguments, DatasetArguments, RecipeArguments
        # We need to convert modifiers to Recipe and calibration_texts to Dataset
        print(f" → Starting quantization process...")

        # Prepare calibration dataset (limit to reasonable size)
        calibration_texts_limited = calibration_texts[:min(calibration_dataset_size, 128)]

        # Convert calibration texts to Hugging Face Dataset
        # DatasetArguments expects a Dataset object, not a list
        print(f" → Creating Hugging Face Dataset from calibration texts...")
        calibration_dataset = Dataset.from_dict({"text": calibration_texts_limited})
        print(f" ✅ Created dataset with {len(calibration_dataset)} samples")

        # Convert modifiers list to Recipe object
        # RecipeArguments expects a Recipe object, not a list of modifiers
        print(f" → Converting modifiers to Recipe object...")
        recipe = Recipe.from_modifiers(modifiers)
        print(f" ✅ Recipe created from modifiers")

        # Load tokenizer for text-only models (required as processor)
        # For text-only LLMs, we need to pass tokenizer explicitly to avoid processor initialization errors
        print(f" → Loading tokenizer for text-only model...")
        tokenizer = AutoTokenizer.from_pretrained(
            repo_id,
            use_fast=True,
            trust_remote_code=True,
            token=os.environ.get("HF_TOKEN")
        )
        print(f" ✅ Tokenizer loaded")

        # oneshot() API - all kwargs must map to ModelArguments, DatasetArguments, or RecipeArguments
        # - model: ModelArguments.model
        # - output_dir: ModelArguments.output_dir
        # - recipe: RecipeArguments.recipe (Recipe object)
        # - dataset: DatasetArguments.dataset (Dataset object)
        # - num_calibration_samples: DatasetArguments.num_calibration_samples
        # - use_auth_token: ModelArguments.use_auth_token (reads from HF_TOKEN env var)
        # - trust_remote_code_model: ModelArguments.trust_remote_code_model
        # - stage: RecipeArguments.stage (default: "default")
        # - tokenizer: ModelArguments.tokenizer (required for text-only models to avoid processor errors)
        print(f" → Calling oneshot() with proper argument structure...")
        oneshot(
            model=repo_id,
            output_dir=temp_output_dir,
            recipe=recipe,
            stage="default",  # Recipe stage
            dataset=calibration_dataset,
            num_calibration_samples=min(calibration_dataset_size, len(calibration_dataset)),
            tokenizer=tokenizer,  # Pass tokenizer explicitly for text-only models (processor inferred)
            use_auth_token=True,  # Reads from os.environ["HF_TOKEN"]
            trust_remote_code_model=True
        )

        print(f"✅ Model quantized to AWQ successfully")
    except Exception as e:
        print(f"❌ Quantization failed: {e}")
        print(f"\nTroubleshooting:")
        print(f"1. Ensure llmcompressor is installed: %pip install llmcompressor")
        print(f"2. Or install from GitHub: %pip install git+https://github.com/vllm-project/llm-compressor.git")
        print(f"3. Check that you have sufficient GPU memory (40GB+ recommended)")
        import traceback
        traceback.print_exc()
        raise
 
    # Step 4: Upload to Hugging Face
    print(f"\n[4/4] Uploading quantized model to {output_repo}...")

    # Create repo if it doesn't exist
    api = HfApi()
    try:
        api.create_repo(
            repo_id=output_repo,
            repo_type="model",
            exist_ok=True,
            token=os.environ.get("HF_TOKEN")
        )
        print(f"✅ Repository ready: {output_repo}")
    except Exception as e:
        print(f"Note: Repo may already exist: {e}")

    # Upload the quantized model directory
    try:
        upload_folder(
            folder_path=temp_output_dir,
            repo_id=output_repo,
            repo_type="model",
            token=os.environ.get("HF_TOKEN"),
            ignore_patterns=["*.pt", "*.bin"]  # Only upload safetensors
        )
        print(f"✅ Quantized model uploaded to {output_repo}")
    except Exception as e:
        print(f"❌ Upload failed: {e}")
        import traceback
        traceback.print_exc()
        raise
 
    # Cleanup: free disk space (critical for Colab)
    print(f"\n[Cleanup] Cleaning up local files to free disk space...")

    # Delete temporary output directory
    try:
        shutil.rmtree(temp_output_dir)
        print(f" ✅ Deleted temporary directory: {temp_output_dir}")
    except Exception as e:
        print(f" ⚠️ Could not delete temp directory: {e}")

    # Free GPU memory (collect Python garbage first so its CUDA blocks can be released)
    gc.collect()
    torch.cuda.empty_cache()

    # Clear Hugging Face cache for the source model (frees ~50-70GB)
    print(f" → Clearing Hugging Face cache for {repo_id}...")
    try:
        cache_info = scan_cache_dir()
        # scan_cache_dir() lists cached repos under .repos; each repo holds its revisions
        cached_repos = [repo for repo in cache_info.repos if repo.repo_id == repo_id]
        revisions_to_delete = [rev for repo in cached_repos for rev in repo.revisions]

        if revisions_to_delete:
            if DELETE_REVISIONS_AVAILABLE:
                # delete_revisions takes commit hashes and returns a strategy to execute
                hashes = [rev.commit_hash for rev in revisions_to_delete]
                cache_info.delete_revisions(*hashes).execute()
                print(f" ✅ Deleted {len(revisions_to_delete)} cached revision(s) for {repo_id}")
            else:
                # Alternative: delete the cached repo directories (snapshots and blobs) manually
                deleted_count = 0
                for repo in cached_repos:
                    try:
                        if os.path.exists(repo.repo_path):
                            shutil.rmtree(repo.repo_path)
                            deleted_count += 1
                    except Exception as e:
                        print(f" ⚠️ Could not delete {repo.repo_id}: {e}")

                if deleted_count > 0:
                    print(f" ✅ Deleted {deleted_count} cached repo folder(s) for {repo_id}")
                else:
                    print(f" ℹ️ Found {len(revisions_to_delete)} cached revision(s) but couldn't delete them")
                    print(f" Try manually: huggingface-cli delete-cache")
        else:
            print(f" ℹ️ No cached revisions found for {repo_id}")
    except Exception as e:
        print(f" ⚠️ Cache cleanup warning: {e} (continuing...)")
        print(f" You can manually clean cache with: huggingface-cli delete-cache")

    # Check disk space after cleanup
    free_space_after = check_disk_space()
    print(f"\n✅ Cleanup complete! Free space: {free_space_after:.2f} GB")

    print(f"\n✅ {model_name} quantization complete!")
    print(f"Model available at: https://huggingface.co/{output_repo}")
    print(f"💾 Local model files deleted to save disk space")
    print(f"🚀 Model is ready for vLLM inference with optimal performance!")
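
The calibration step in `quantize_model_to_awq` pads eight seed prompts by repetition until it reaches the requested sample count. The sketch below shows the same padding with `itertools`; `build_calibration_texts` is a hypothetical helper name, and the seed prompts are copied from the function above.

```python
from itertools import cycle, islice

def build_calibration_texts(seed_texts, size):
    """Repeat the seed prompts in order until exactly `size` samples are produced."""
    if not seed_texts:
        raise ValueError("seed_texts must contain at least one prompt")
    if size < 0:
        raise ValueError("size must be non-negative")
    return list(islice(cycle(seed_texts), size))

seeds = [
    "You are the Router Agent coordinating Math, Code, and General-Search specialists.",
    "Solve a quadratic equation using Python programming.",
    "Implement a binary search algorithm with proper error handling.",
]
samples = build_calibration_texts(seeds, 8)
print(f"{len(samples)} samples, {len(set(samples))} distinct")
```

Repeated prompts produce identical activations, so padding raises the sample count without adding new calibration signal. Supplying more distinct prompts that resemble real router traffic is likely a better use of the same budget.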


## 5. Quantize Router-Gemma3-27B-Merged


In [None]:
quantize_model_to_awq(
 model_name="Router-Gemma3-27B",
 repo_id=MODELS_TO_QUANTIZE["router-gemma3-merged"]["repo_id"],
 output_repo=MODELS_TO_QUANTIZE["router-gemma3-merged"]["output_repo"],
 model_type=MODELS_TO_QUANTIZE["router-gemma3-merged"]["model_type"],
 awq_config=MODEL_AWQ_CONFIGS["router-gemma3-merged"],
 calibration_dataset_size=128
)


## 6. Quantize Router-Qwen3-32B-Merged


In [None]:
quantize_model_to_awq(
 model_name="Router-Qwen3-32B",
 repo_id=MODELS_TO_QUANTIZE["router-qwen3-32b-merged"]["repo_id"],
 output_repo=MODELS_TO_QUANTIZE["router-qwen3-32b-merged"]["output_repo"],
 model_type=MODELS_TO_QUANTIZE["router-qwen3-32b-merged"]["model_type"],
 awq_config=MODEL_AWQ_CONFIGS["router-qwen3-32b-merged"],
 calibration_dataset_size=128
)


## 7. Verify Quantized Models


In [None]:
# Verify quantized models with vLLM (recommended) or Transformers
import os
import torch
from transformers import AutoTokenizer

def verify_awq_model_vllm(repo_id: str):
    """Verify AWQ model can be loaded with vLLM (recommended)."""
    print(f"\nVerifying {repo_id} with vLLM...")

    try:
        # Try importing vLLM
        try:
            from vllm import LLM, SamplingParams
        except ImportError:
            print("⚠️ vLLM not available, skipping vLLM verification")
            return False

        # Load with vLLM. LLM Compressor saves checkpoints in compressed-tensors format,
        # which vLLM detects from the model config, so no `quantization` argument is passed
        # (forcing "awq" would conflict with the method recorded in the checkpoint).
        # Authentication relies on the HF_TOKEN environment variable set in Section 2.
        llm = LLM(
            model=repo_id,
            trust_remote_code=True,
            gpu_memory_utilization=0.5  # Lower for verification
        )

        # Test generation
        sampling_params = SamplingParams(
            temperature=0.0,
            max_tokens=10
        )

        test_prompt = "You are the Router Agent. Test prompt."
        outputs = llm.generate([test_prompt], sampling_params)

        generated_text = outputs[0].outputs[0].text
        print(f"✅ vLLM loads and generates correctly")
        print(f"Generated: {generated_text[:100]}...")

        del llm
        torch.cuda.empty_cache()

        return True
    except Exception as e:
        print(f"❌ vLLM verification failed: {e}")
        import traceback
        traceback.print_exc()
        return False

def verify_awq_model_transformers(repo_id: str):
    """Verify the quantized model can be loaded with Transformers (fallback)."""
    print(f"\nVerifying {repo_id} with Transformers...")

    try:
        # Load tokenizer
        tokenizer = AutoTokenizer.from_pretrained(
            repo_id,
            trust_remote_code=True,
            token=os.environ.get("HF_TOKEN")
        )

        # LLM Compressor saves compressed-tensors checkpoints, which AutoAWQ's
        # from_quantized() does not read. Transformers loads them through
        # AutoModelForCausalLM when the compressed-tensors package is installed.
        try:
            from transformers import AutoModelForCausalLM
            model = AutoModelForCausalLM.from_pretrained(
                repo_id,
                trust_remote_code=True,
                device_map="auto",
                torch_dtype="auto",
                token=os.environ.get("HF_TOKEN")
            )

            # Test generation
            test_prompt = "You are the Router Agent. Test prompt."
            inputs = tokenizer(test_prompt, return_tensors="pt").to(model.device)

            with torch.inference_mode():
                outputs = model.generate(
                    **inputs,
                    max_new_tokens=10,
                    do_sample=False
                )

            generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
            print(f"✅ Transformers loads and generates correctly")
            print(f"Generated: {generated_text[:100]}...")

            del model
            del tokenizer
            torch.cuda.empty_cache()

            return True
        except ImportError:
            print("⚠️ compressed-tensors not available, skipping Transformers verification")
            return False
    except Exception as e:
        print(f"❌ Transformers verification failed: {e}")
        import traceback
        traceback.print_exc()
        return False

# Verify both models (prefer vLLM)
for model_key, model_info in MODELS_TO_QUANTIZE.items():
    print(f"\n{'='*60}")
    print(f"Verifying {model_key}")
    print(f"{'='*60}")

    # Try vLLM first (recommended)
    vllm_ok = verify_awq_model_vllm(model_info["output_repo"])

    # Fall back to Transformers if vLLM is unavailable or its verification failed
    if not vllm_ok:
        verify_awq_model_transformers(model_info["output_repo"])
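
Which loader can read a checkpoint depends on the quantization method recorded in its `config.json`. The sketch below inspects that field on an already-parsed config. It assumes the method is stored under `quantization_config.quant_method`, and that LLM Compressor output reports `compressed-tensors` there; the example config is made up for illustration.

```python
import json

def detect_quant_method(config):
    """Return the quant_method recorded in a model config dict, or None if absent."""
    quant_cfg = config.get("quantization_config") or {}
    return quant_cfg.get("quant_method")

# Hypothetical, heavily trimmed config.json contents used only for illustration
example_config = json.loads(
    '{"model_type": "qwen3", "quantization_config": {"quant_method": "compressed-tensors"}}'
)
print(detect_quant_method(example_config))
```

For a real check, the dictionary could be loaded from the file returned by `huggingface_hub.hf_hub_download(repo_id, "config.json")` for each `output_repo`.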


## Notes

- **GPU Required**: This quantization requires a GPU with at least 40GB VRAM (A100/H100 recommended)
- **Time**: Each model takes approximately 30-60 minutes to quantize
- **Disk Space**:
  - Colab has limited disk space (~80GB free)
  - Each source model is ~50-70GB (BF16)
  - Quantized models are ~15-20GB (AWQ 4-bit)
  - **The notebook automatically deletes source models after quantization to save space**
- **Cleanup**: After each model is quantized and uploaded:
  - GPU memory is freed
  - Hugging Face cache for source model is cleared
  - Disk space is checked before/after
- **Output Repos**: Models are saved to new repos with `-awq` suffix
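- **Usage**: After quantization, update your `app.py` to use the AWQ repos. LLM Compressor writes checkpoints in compressed-tensors format, so if `app.py` forwards the `quantization` value to vLLM it should be `compressed-tensors` (or be omitted, since vLLM reads the method from the model config):
  ```python
  MODELS = {
      "Router-Gemma3-27B-AWQ": {
          "repo_id": "Alovestocode/router-gemma3-merged-awq",
          "quantization": "compressed-tensors"
      },
      "Router-Qwen3-32B-AWQ": {
          "repo_id": "Alovestocode/router-qwen3-32b-merged-awq",
          "quantization": "compressed-tensors"
      }
  }
  ```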
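
The disk figures in the notes can be roughly reproduced from the parameter counts. The sketch below assumes 16 bits per weight for the BF16 source and a flat 4 bits per weight for the quantized output. It ignores the unquantized `lm_head`, per-group scales, and tokenizer files, so real checkpoints will be somewhat larger. The parameter counts (27B and 33B) are the ones listed at the top of this notebook.

```python
def checkpoint_size_gb(num_params_billion, bits_per_weight):
    """Approximate checkpoint size in GB (10^9 bytes) from parameter count and bit-width."""
    return num_params_billion * bits_per_weight / 8

def peak_disk_gb(num_params_billion, quant_bits=4.0):
    """Source and quantized checkpoints sit on disk together until cleanup runs."""
    source = checkpoint_size_gb(num_params_billion, 16)
    quantized = checkpoint_size_gb(num_params_billion, quant_bits)
    return source + quantized

for name, params_b in [("router-gemma3-merged", 27), ("router-qwen3-32b-merged", 33)]:
    source_gb = checkpoint_size_gb(params_b, 16)
    quant_gb = checkpoint_size_gb(params_b, 4)
    print(f"{name}: source ~{source_gb} GB, quantized ~{quant_gb} GB, "
          f"peak ~{peak_disk_gb(params_b)} GB")
```

Comparing the peak estimate with the value returned by `check_disk_space()` before each run gives an early indication of whether the source download and the quantized output will fit side by side.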
