Alikestocode committed
Commit a79bc8f · 1 Parent(s): b2bf767

Add Colab notebook for AWQ quantization of router models


- Complete quantization pipeline for Gemma3 and Qwen3 models
- Uses AutoAWQ for reliable quantization
- Includes verification and upload steps
- Documentation with troubleshooting guide

Files changed (2)
  1. QUANTIZE_AWQ.md +110 -0
  2. quantize_to_awq_colab.ipynb +366 -0
QUANTIZE_AWQ.md ADDED
@@ -0,0 +1,110 @@
# AWQ Quantization Guide for Router Models

This guide explains how to quantize the CourseGPT-Pro router models to AWQ (Activation-aware Weight Quantization) format for efficient inference.

## Models to Quantize

- [router-gemma3-merged](https://huggingface.co/Alovestocode/router-gemma3-merged) (27B, BF16)
- [router-qwen3-32b-merged](https://huggingface.co/Alovestocode/router-qwen3-32b-merged) (33B, BF16)

## Quick Start: Google Colab

1. **Open the Colab notebook**: `quantize_to_awq_colab.ipynb`
2. **Set runtime to GPU**: Runtime → Change runtime type → GPU (A100 recommended)
3. **Add your HF token**: Replace `your_hf_token_here` in cell 2
4. **Run all cells**: The notebook will quantize both models

## Requirements

- **GPU**: A100 (40 GB+) or H100 recommended for the 27B-33B models (see the quick check below)
- **Time**: ~30-60 minutes per model
- **Disk Space**: ~20-30 GB per quantized model
- **HF Token**: with write access to the `Alovestocode` namespace

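Before starting a run, it helps to confirm that the Colab runtime actually has a large enough GPU. This is a small standalone check (not part of the notebook) that only uses PyTorch, which the notebook installs anyway:

```python
import torch

# Print the name and total memory of the attached GPU so you can confirm
# it meets the ~40 GB requirement before starting a 27B-33B quantization run.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1e9:.0f} GB")
else:
    print("No CUDA GPU detected - switch the Colab runtime to a GPU type.")
```
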
## AWQ Configuration

The notebook uses the standard AWQ settings:

```python
AWQ_CONFIG = {
    "w_bit": 4,           # 4-bit quantization
    "q_group_size": 128,  # Group size for quantization
    "zero_point": True,   # Use zero-point quantization
    "version": "GEMM",    # GEMM kernel (better for longer contexts)
}
```

## Output Repositories

By default, quantized models are saved to:

- `Alovestocode/router-gemma3-merged-awq`
- `Alovestocode/router-qwen3-32b-merged-awq`

You can modify the `output_repo` in the configuration to upload to existing repos or different names.

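For example, to push the Gemma AWQ weights somewhere else, you can override the entry in the notebook's `MODELS_TO_QUANTIZE` dict before running the quantization cell (the repo name below is only a placeholder):

```python
# Hypothetical destination repo - replace with a repo you can write to.
MODELS_TO_QUANTIZE["router-gemma3-merged"]["output_repo"] = "your-namespace/router-gemma3-awq"
```
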
## Verification

After quantization, the notebook includes a verification step that:

1. Loads the AWQ model
2. Tests generation
3. Checks parameter count
4. Verifies model integrity

## Usage After Quantization

Update your `app.py` to use the AWQ models:

```python
MODELS = {
    "Router-Gemma3-27B-AWQ": {
        "repo_id": "Alovestocode/router-gemma3-merged-awq",
        "quantization": "awq"
    },
    "Router-Qwen3-32B-AWQ": {
        "repo_id": "Alovestocode/router-qwen3-32b-merged-awq",
        "quantization": "awq"
    }
}
```

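If `app.py` serves these models through vLLM (as the references below suggest), the quantized checkpoints can be loaded by passing `quantization="awq"`. This is a minimal standalone sketch of loading one of the AWQ repos, not the exact `app.py` wiring:

```python
from vllm import LLM, SamplingParams

# Minimal vLLM smoke test for one of the quantized repos
# (assumes a GPU with enough memory for the 4-bit 27B weights).
llm = LLM(model="Alovestocode/router-gemma3-merged-awq", quantization="awq")
out = llm.generate(
    ["You are the Router Agent coordinating Math, Code, and General-Search specialists."],
    SamplingParams(max_tokens=64, temperature=0.0),
)
print(out[0].outputs[0].text)
```
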
## Alternative: Using llm-compressor (vLLM)

If you prefer llm-compressor (vLLM's quantization tool):

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import AWQModifier

# Quantize the model (argument names for AWQModifier vary between
# llm-compressor releases - check the docs for the version you have installed)
oneshot(
    model="Alovestocode/router-gemma3-merged",
    output_dir="./router-gemma3-awq",
    modifiers=[AWQModifier(w_bit=4, q_group_size=128)]
)
```

However, AutoAWQ (used in the notebook) is more mature and more widely tested.

## Troubleshooting

### Out of Memory
- Use a larger GPU (A100 80GB or H100)
- Reduce `calibration_dataset_size` to 64 or 32
- Quantize the models one at a time

### Slow Quantization
- Ensure you're using a high-end GPU (A100/H100)
- Check that CUDA is properly configured
- For inference, multi-GPU serving with vLLM's `tensor_parallel_size` helps; during quantization, `device_map="auto"` already spreads layers across the available GPUs

### Upload Failures
- Verify that the HF token has write access (see the snippet below)
- Check that the repository exists or can be created
- Ensure sufficient disk space

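A quick way to confirm that the token resolves to your account before a long quantization run (a small sketch using `huggingface_hub`, which the notebook already installs):

```python
from huggingface_hub import HfApi
import os

# Confirm the token resolves to your account; the returned dict also
# describes the token's permissions under its "auth" key.
api = HfApi(token=os.environ.get("HF_TOKEN"))
print(api.whoami()["name"])
```
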
## References

- [AutoAWQ Documentation](https://github.com/casper-hansen/AutoAWQ)
- [AWQ Paper](https://arxiv.org/abs/2306.00978)
- [vLLM AWQ Support](https://docs.vllm.ai/en/latest/features/quantization/awq.html)
quantize_to_awq_colab.ipynb ADDED
@@ -0,0 +1,366 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Router Models AWQ Quantization\n",
    "\n",
    "This notebook quantizes the CourseGPT-Pro router models to AWQ (Activation-aware Weight Quantization) format for efficient inference.\n",
    "\n",
    "**Models to quantize:**\n",
    "- `Alovestocode/router-gemma3-merged` (27B)\n",
    "- `Alovestocode/router-qwen3-32b-merged` (33B)\n",
    "\n",
    "**Output:** AWQ-quantized models ready for vLLM or Transformers inference.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Install Dependencies\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install required packages\n",
    "!pip install -q autoawq transformers accelerate huggingface_hub\n",
    "# Colab usually ships with a recent CUDA build of torch preinstalled;\n",
    "# only run the next line if you need to pin a specific CUDA 11.8 build.\n",
    "!pip install -q torch --index-url https://download.pytorch.org/whl/cu118\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Authenticate with Hugging Face\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from huggingface_hub import login\n",
    "import os\n",
    "\n",
    "# Login to Hugging Face (you'll need a token with write access)\n",
    "# Get your token from: https://huggingface.co/settings/tokens\n",
    "HF_TOKEN = \"your_hf_token_here\"  # Replace with your token\n",
    "\n",
    "login(token=HF_TOKEN)\n",
    "os.environ[\"HF_TOKEN\"] = HF_TOKEN\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Configuration\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Model configurations\n",
    "MODELS_TO_QUANTIZE = {\n",
    "    \"router-gemma3-merged\": {\n",
    "        \"repo_id\": \"Alovestocode/router-gemma3-merged\",\n",
    "        \"output_repo\": \"Alovestocode/router-gemma3-merged-awq\",  # Or keep same repo\n",
    "        \"model_type\": \"gemma\",\n",
    "    },\n",
    "    \"router-qwen3-32b-merged\": {\n",
    "        \"repo_id\": \"Alovestocode/router-qwen3-32b-merged\",\n",
    "        \"output_repo\": \"Alovestocode/router-qwen3-32b-merged-awq\",  # Or keep same repo\n",
    "        \"model_type\": \"qwen\",\n",
    "    }\n",
    "}\n",
    "\n",
    "# AWQ quantization config\n",
    "AWQ_CONFIG = {\n",
    "    \"w_bit\": 4,           # 4-bit quantization\n",
    "    \"q_group_size\": 128,  # Group size for quantization\n",
    "    \"zero_point\": True,   # Use zero-point quantization\n",
    "    \"version\": \"GEMM\",    # GEMM kernel (better for longer contexts)\n",
    "}\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Quantization Function\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from awq import AutoAWQForCausalLM\n",
    "from transformers import AutoTokenizer\n",
    "from huggingface_hub import HfApi\n",
    "import os\n",
    "import torch\n",
    "\n",
    "def quantize_model_to_awq(\n",
    "    model_name: str,\n",
    "    repo_id: str,\n",
    "    output_repo: str,\n",
    "    model_type: str,\n",
    "    awq_config: dict,\n",
    "    calibration_dataset_size: int = 128\n",
    "):\n",
    "    \"\"\"Quantize a model to AWQ format.\n",
    "\n",
    "    Args:\n",
    "        model_name: Display name for the model\n",
    "        repo_id: Source Hugging Face repo ID\n",
    "        output_repo: Destination Hugging Face repo ID\n",
    "        model_type: Model family (gemma/qwen); informational only\n",
    "        awq_config: AWQ quantization configuration\n",
    "        calibration_dataset_size: Number of calibration samples\n",
    "    \"\"\"\n",
    "    print(f\"\\n{'='*60}\")\n",
    "    print(f\"Quantizing {model_name}\")\n",
    "    print(f\"Source: {repo_id}\")\n",
    "    print(f\"Destination: {output_repo}\")\n",
    "    print(f\"{'='*60}\\n\")\n",
    "\n",
    "    # Step 1: Load tokenizer\n",
    "    print(f\"[1/5] Loading tokenizer from {repo_id}...\")\n",
    "    tokenizer = AutoTokenizer.from_pretrained(\n",
    "        repo_id,\n",
    "        trust_remote_code=True,\n",
    "        token=os.environ.get(\"HF_TOKEN\")\n",
    "    )\n",
    "    print(\"✅ Tokenizer loaded\")\n",
    "\n",
    "    # Step 2: Load model\n",
    "    print(f\"\\n[2/5] Loading model from {repo_id}...\")\n",
    "    print(\"⚠️ This may take several minutes and requires significant GPU memory...\")\n",
    "\n",
    "    model = AutoAWQForCausalLM.from_pretrained(\n",
    "        repo_id,\n",
    "        device_map=\"auto\",\n",
    "        trust_remote_code=True,\n",
    "        token=os.environ.get(\"HF_TOKEN\")\n",
    "    )\n",
    "    print(\"✅ Model loaded\")\n",
    "\n",
    "    # Step 3: Prepare calibration dataset\n",
    "    print(f\"\\n[3/5] Preparing calibration dataset ({calibration_dataset_size} samples)...\")\n",
    "\n",
    "    # A small set of router-style prompts, repeated to the requested size.\n",
    "    # Customize these texts to match your own traffic for better calibration.\n",
    "    calibration_texts = [\n",
    "        \"You are the Router Agent coordinating Math, Code, and General-Search specialists.\",\n",
    "        \"Emit EXACTLY ONE strict JSON object with keys route_plan, route_rationale, expected_artifacts,\",\n",
    "        \"Solve a quadratic equation using Python programming.\",\n",
    "        \"Implement a binary search algorithm with proper error handling.\",\n",
    "        \"Explain the concept of gradient descent in machine learning.\",\n",
    "        \"Write a function to calculate the Fibonacci sequence recursively.\",\n",
    "        \"Design a REST API endpoint for user authentication.\",\n",
    "        \"Analyze the time complexity of merge sort algorithm.\",\n",
    "    ]\n",
    "\n",
    "    # Repeat to reach desired size\n",
    "    while len(calibration_texts) < calibration_dataset_size:\n",
    "        calibration_texts.extend(calibration_texts[:calibration_dataset_size - len(calibration_texts)])\n",
    "\n",
    "    calibration_texts = calibration_texts[:calibration_dataset_size]\n",
    "    print(f\"✅ Calibration dataset prepared: {len(calibration_texts)} samples\")\n",
    "\n",
    "    # Step 4: Quantize model\n",
    "    print(f\"\\n[4/5] Quantizing model to AWQ (this may take 30-60 minutes)...\")\n",
    "    print(f\"Config: {awq_config}\")\n",
    "\n",
    "    # AutoAWQ expects raw text samples and tokenizes them internally\n",
    "    model.quantize(\n",
    "        tokenizer,\n",
    "        quant_config=awq_config,\n",
    "        calib_data=calibration_texts\n",
    "    )\n",
    "\n",
    "    print(\"✅ Model quantized to AWQ\")\n",
    "\n",
    "    # Step 5: Save quantized model\n",
    "    print(f\"\\n[5/5] Saving quantized model to {output_repo}...\")\n",
    "\n",
    "    # Create repo if it doesn't exist\n",
    "    api = HfApi()\n",
    "    try:\n",
    "        api.create_repo(\n",
    "            repo_id=output_repo,\n",
    "            repo_type=\"model\",\n",
    "            exist_ok=True,\n",
    "            token=os.environ.get(\"HF_TOKEN\")\n",
    "        )\n",
    "    except Exception as e:\n",
    "        print(f\"Note: Repo may already exist: {e}\")\n",
    "\n",
    "    # save_quantized writes to a local folder; it does not push to the Hub by itself\n",
    "    model.save_quantized(\n",
    "        output_repo,\n",
    "        safetensors=True,\n",
    "        shard_size=\"10GB\"  # Shard large models\n",
    "    )\n",
    "\n",
    "    # Save tokenizer alongside the quantized weights\n",
    "    tokenizer.save_pretrained(output_repo)\n",
    "\n",
    "    # Push the local folder to the Hub repo\n",
    "    api.upload_folder(\n",
    "        folder_path=output_repo,\n",
    "        repo_id=output_repo,\n",
    "        repo_type=\"model\",\n",
    "        token=os.environ.get(\"HF_TOKEN\")\n",
    "    )\n",
    "\n",
    "    print(f\"✅ Quantized model saved to {output_repo}\")\n",
    "\n",
    "    # Clean up memory\n",
    "    del model\n",
    "    del tokenizer\n",
    "    torch.cuda.empty_cache()\n",
    "\n",
    "    print(f\"\\n✅ {model_name} quantization complete!\")\n",
    "    print(f\"Model available at: https://huggingface.co/{output_repo}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Quantize Router-Gemma3-Merged\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "quantize_model_to_awq(\n",
    "    model_name=\"Router-Gemma3-27B\",\n",
    "    repo_id=MODELS_TO_QUANTIZE[\"router-gemma3-merged\"][\"repo_id\"],\n",
    "    output_repo=MODELS_TO_QUANTIZE[\"router-gemma3-merged\"][\"output_repo\"],\n",
    "    model_type=MODELS_TO_QUANTIZE[\"router-gemma3-merged\"][\"model_type\"],\n",
    "    awq_config=AWQ_CONFIG,\n",
    "    calibration_dataset_size=128\n",
    ")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Quantize Router-Qwen3-32B-Merged\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "quantize_model_to_awq(\n",
    "    model_name=\"Router-Qwen3-32B\",\n",
    "    repo_id=MODELS_TO_QUANTIZE[\"router-qwen3-32b-merged\"][\"repo_id\"],\n",
    "    output_repo=MODELS_TO_QUANTIZE[\"router-qwen3-32b-merged\"][\"output_repo\"],\n",
    "    model_type=MODELS_TO_QUANTIZE[\"router-qwen3-32b-merged\"][\"model_type\"],\n",
    "    awq_config=AWQ_CONFIG,\n",
    "    calibration_dataset_size=128\n",
    ")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7. Verify Quantized Models\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import torch\n",
    "from transformers import AutoTokenizer\n",
    "from awq import AutoAWQForCausalLM\n",
    "\n",
    "def verify_awq_model(repo_id: str):\n",
    "    \"\"\"Verify that an AWQ model can be loaded correctly.\"\"\"\n",
    "    print(f\"\\nVerifying {repo_id}...\")\n",
    "\n",
    "    try:\n",
    "        # Load tokenizer\n",
    "        tokenizer = AutoTokenizer.from_pretrained(\n",
    "            repo_id,\n",
    "            trust_remote_code=True,\n",
    "            token=os.environ.get(\"HF_TOKEN\")\n",
    "        )\n",
    "\n",
    "        # Load AWQ model\n",
    "        model = AutoAWQForCausalLM.from_quantized(\n",
    "            repo_id,\n",
    "            fuse_layers=True,\n",
    "            trust_remote_code=True,\n",
    "            device_map=\"auto\",\n",
    "            token=os.environ.get(\"HF_TOKEN\")\n",
    "        )\n",
    "\n",
    "        # Test generation\n",
    "        test_prompt = \"You are the Router Agent. Test prompt.\"\n",
    "        inputs = tokenizer(test_prompt, return_tensors=\"pt\").to(model.device)\n",
    "\n",
    "        with torch.inference_mode():\n",
    "            outputs = model.generate(\n",
    "                **inputs,\n",
    "                max_new_tokens=10,\n",
    "                do_sample=False\n",
    "            )\n",
    "\n",
    "        generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)\n",
    "        print(\"✅ Model loads and generates correctly\")\n",
    "        print(f\"Generated: {generated_text[:100]}...\")\n",
    "\n",
    "        # Check model size\n",
    "        total_params = sum(p.numel() for p in model.parameters())\n",
    "        print(f\"Total parameters: {total_params / 1e9:.2f}B\")\n",
    "\n",
    "        del model\n",
    "        del tokenizer\n",
    "        torch.cuda.empty_cache()\n",
    "\n",
    "        return True\n",
    "    except Exception as e:\n",
    "        print(f\"❌ Verification failed: {e}\")\n",
    "        import traceback\n",
    "        traceback.print_exc()\n",
    "        return False\n",
    "\n",
    "# Verify both models\n",
    "for model_key, model_info in MODELS_TO_QUANTIZE.items():\n",
    "    verify_awq_model(model_info[\"output_repo\"])\n"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}