Alikestocode committed
Commit a79bc8f · 1 Parent(s): b2bf767

Add Colab notebook for AWQ quantization of router models


- Complete quantization pipeline for Gemma3 and Qwen3 models
- Uses AutoAWQ for reliable quantization
- Includes verification and upload steps
- Documentation with troubleshooting guide

Files changed (2)
  1. QUANTIZE_AWQ.md +110 -0
  2. quantize_to_awq_colab.ipynb +366 -0
QUANTIZE_AWQ.md ADDED
@@ -0,0 +1,110 @@
# AWQ Quantization Guide for Router Models

This guide explains how to quantize the CourseGPT-Pro router models to AWQ (Activation-aware Weight Quantization) format for efficient inference.

## Models to Quantize

- [router-gemma3-merged](https://huggingface.co/Alovestocode/router-gemma3-merged) (27B, BF16)
- [router-qwen3-32b-merged](https://huggingface.co/Alovestocode/router-qwen3-32b-merged) (33B, BF16)

## Quick Start: Google Colab

1. **Open the Colab notebook**: `quantize_to_awq_colab.ipynb`
2. **Set runtime to GPU**: Runtime → Change runtime type → GPU (A100 recommended)
3. **Add your HF token**: Replace `your_hf_token_here` in cell 2
4. **Run all cells**: The notebook will quantize both models

## Requirements

- **GPU**: A100 (40 GB+) or H100 recommended for the 27B-33B models (see the quick check below)
- **Time**: ~30-60 minutes per model
- **Disk Space**: ~20-30 GB per quantized model
- **HF Token**: with write access to the `Alovestocode` namespace

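Before starting a run, it helps to confirm that the Colab runtime actually has a large enough GPU. This is a small standalone check (not part of the notebook) that only uses PyTorch, which the notebook installs anyway:

```python
import torch

# Print the name and total memory of the attached GPU so you can confirm
# it meets the ~40 GB requirement before starting a 27B-33B quantization run.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1e9:.0f} GB")
else:
    print("No CUDA GPU detected - switch the Colab runtime to a GPU type.")
```
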
## AWQ Configuration

The notebook uses the standard AWQ settings:

```python
AWQ_CONFIG = {
    "w_bit": 4,           # 4-bit quantization
    "q_group_size": 128,  # Group size for quantization
    "zero_point": True,   # Use zero-point quantization
    "version": "GEMM",    # GEMM kernel (better for longer contexts)
}
```

## Output Repositories

By default, quantized models are saved to:

- `Alovestocode/router-gemma3-merged-awq`
- `Alovestocode/router-qwen3-32b-merged-awq`

You can modify the `output_repo` in the configuration to upload to existing repos or different names.

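For example, to push the Gemma AWQ weights somewhere else, you can override the entry in the notebook's `MODELS_TO_QUANTIZE` dict before running the quantization cell (the repo name below is only a placeholder):

```python
# Hypothetical destination repo - replace with a repo you can write to.
MODELS_TO_QUANTIZE["router-gemma3-merged"]["output_repo"] = "your-namespace/router-gemma3-awq"
```
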
## Verification

After quantization, the notebook includes a verification step that:

1. Loads the AWQ model
2. Tests generation
3. Checks parameter count
4. Verifies model integrity

## Usage After Quantization

Update your `app.py` to use the AWQ models:

```python
MODELS = {
    "Router-Gemma3-27B-AWQ": {
        "repo_id": "Alovestocode/router-gemma3-merged-awq",
        "quantization": "awq"
    },
    "Router-Qwen3-32B-AWQ": {
        "repo_id": "Alovestocode/router-qwen3-32b-merged-awq",
        "quantization": "awq"
    }
}
```

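If `app.py` serves these models through vLLM (as the references below suggest), the quantized checkpoints can be loaded by passing `quantization="awq"`. This is a minimal standalone sketch of loading one of the AWQ repos, not the exact `app.py` wiring:

```python
from vllm import LLM, SamplingParams

# Minimal vLLM smoke test for one of the quantized repos
# (assumes a GPU with enough memory for the 4-bit 27B weights).
llm = LLM(model="Alovestocode/router-gemma3-merged-awq", quantization="awq")
out = llm.generate(
    ["You are the Router Agent coordinating Math, Code, and General-Search specialists."],
    SamplingParams(max_tokens=64, temperature=0.0),
)
print(out[0].outputs[0].text)
```
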
## Alternative: Using llm-compressor (vLLM)

If you prefer llm-compressor (vLLM's quantization tool):

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import AWQModifier

# Quantize the model (argument names for AWQModifier vary between
# llm-compressor releases - check the docs for the version you have installed)
oneshot(
    model="Alovestocode/router-gemma3-merged",
    output_dir="./router-gemma3-awq",
    modifiers=[AWQModifier(w_bit=4, q_group_size=128)]
)
```

However, AutoAWQ (used in the notebook) is more mature and more widely tested.

## Troubleshooting

### Out of Memory
- Use a larger GPU (A100 80GB or H100)
- Reduce `calibration_dataset_size` to 64 or 32
- Quantize the models one at a time

### Slow Quantization
- Ensure you're using a high-end GPU (A100/H100)
- Check that CUDA is properly configured
- For inference, multi-GPU serving with vLLM's `tensor_parallel_size` helps; during quantization, `device_map="auto"` already spreads layers across the available GPUs

### Upload Failures
- Verify that the HF token has write access (see the snippet below)
- Check that the repository exists or can be created
- Ensure sufficient disk space

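A quick way to confirm that the token resolves to your account before a long quantization run (a small sketch using `huggingface_hub`, which the notebook already installs):

```python
from huggingface_hub import HfApi
import os

# Confirm the token resolves to your account; the returned dict also
# describes the token's permissions under its "auth" key.
api = HfApi(token=os.environ.get("HF_TOKEN"))
print(api.whoami()["name"])
```
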
## References

- [AutoAWQ Documentation](https://github.com/casper-hansen/AutoAWQ)
- [AWQ Paper](https://arxiv.org/abs/2306.00978)
- [vLLM AWQ Support](https://docs.vllm.ai/en/latest/features/quantization/awq.html)
quantize_to_awq_colab.ipynb ADDED
@@ -0,0 +1,366 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Router Models AWQ Quantization\n",
    "\n",
    "This notebook quantizes the CourseGPT-Pro router models to AWQ (Activation-aware Weight Quantization) format for efficient inference.\n",
    "\n",
    "**Models to quantize:**\n",
    "- `Alovestocode/router-gemma3-merged` (27B)\n",
    "- `Alovestocode/router-qwen3-32b-merged` (33B)\n",
    "\n",
    "**Output:** AWQ-quantized models ready for vLLM or Transformers inference.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Install Dependencies\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install required packages\n",
    "!pip install -q autoawq transformers accelerate huggingface_hub\n",
    "# Colab usually ships with a recent CUDA build of torch preinstalled;\n",
    "# only run the next line if you need to pin a specific CUDA 11.8 build.\n",
    "!pip install -q torch --index-url https://download.pytorch.org/whl/cu118\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Authenticate with Hugging Face\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from huggingface_hub import login\n",
    "import os\n",
    "\n",
    "# Login to Hugging Face (you'll need a token with write access)\n",
    "# Get your token from: https://huggingface.co/settings/tokens\n",
    "HF_TOKEN = \"your_hf_token_here\"  # Replace with your token\n",
    "\n",
    "login(token=HF_TOKEN)\n",
    "os.environ[\"HF_TOKEN\"] = HF_TOKEN\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Configuration\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Model configurations\n",
    "MODELS_TO_QUANTIZE = {\n",
    "    \"router-gemma3-merged\": {\n",
    "        \"repo_id\": \"Alovestocode/router-gemma3-merged\",\n",
    "        \"output_repo\": \"Alovestocode/router-gemma3-merged-awq\",  # Or keep same repo\n",
    "        \"model_type\": \"gemma\",\n",
    "    },\n",
    "    \"router-qwen3-32b-merged\": {\n",
    "        \"repo_id\": \"Alovestocode/router-qwen3-32b-merged\",\n",
    "        \"output_repo\": \"Alovestocode/router-qwen3-32b-merged-awq\",  # Or keep same repo\n",
    "        \"model_type\": \"qwen\",\n",
    "    }\n",
    "}\n",
    "\n",
    "# AWQ quantization config\n",
    "AWQ_CONFIG = {\n",
    "    \"w_bit\": 4,           # 4-bit quantization\n",
    "    \"q_group_size\": 128,  # Group size for quantization\n",
    "    \"zero_point\": True,   # Use zero-point quantization\n",
    "    \"version\": \"GEMM\",    # GEMM kernel (better for longer contexts)\n",
    "}\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Quantization Function\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from awq import AutoAWQForCausalLM\n",
    "from transformers import AutoTokenizer\n",
    "from huggingface_hub import HfApi\n",
    "import os\n",
    "import torch\n",
    "\n",
    "def quantize_model_to_awq(\n",
    "    model_name: str,\n",
    "    repo_id: str,\n",
    "    output_repo: str,\n",
    "    model_type: str,\n",
    "    awq_config: dict,\n",
    "    calibration_dataset_size: int = 128\n",
    "):\n",
    "    \"\"\"Quantize a model to AWQ format.\n",
    "\n",
    "    Args:\n",
    "        model_name: Display name for the model\n",
    "        repo_id: Source Hugging Face repo ID\n",
    "        output_repo: Destination Hugging Face repo ID\n",
    "        model_type: Model family (gemma/qwen); informational only\n",
    "        awq_config: AWQ quantization configuration\n",
    "        calibration_dataset_size: Number of calibration samples\n",
    "    \"\"\"\n",
    "    print(f\"\\n{'='*60}\")\n",
    "    print(f\"Quantizing {model_name}\")\n",
    "    print(f\"Source: {repo_id}\")\n",
    "    print(f\"Destination: {output_repo}\")\n",
    "    print(f\"{'='*60}\\n\")\n",
    "\n",
    "    # Step 1: Load tokenizer\n",
    "    print(f\"[1/5] Loading tokenizer from {repo_id}...\")\n",
    "    tokenizer = AutoTokenizer.from_pretrained(\n",
    "        repo_id,\n",
    "        trust_remote_code=True,\n",
    "        token=os.environ.get(\"HF_TOKEN\")\n",
    "    )\n",
    "    print(\"✅ Tokenizer loaded\")\n",
    "\n",
    "    # Step 2: Load model\n",
    "    print(f\"\\n[2/5] Loading model from {repo_id}...\")\n",
    "    print(\"⚠️ This may take several minutes and requires significant GPU memory...\")\n",
    "\n",
    "    model = AutoAWQForCausalLM.from_pretrained(\n",
    "        repo_id,\n",
    "        device_map=\"auto\",\n",
    "        trust_remote_code=True,\n",
    "        token=os.environ.get(\"HF_TOKEN\")\n",
    "    )\n",
    "    print(\"✅ Model loaded\")\n",
    "\n",
    "    # Step 3: Prepare calibration dataset\n",
    "    print(f\"\\n[3/5] Preparing calibration dataset ({calibration_dataset_size} samples)...\")\n",
    "\n",
    "    # A small set of router-style prompts, repeated to the requested size.\n",
    "    # Customize these texts to match your own traffic for better calibration.\n",
    "    calibration_texts = [\n",
    "        \"You are the Router Agent coordinating Math, Code, and General-Search specialists.\",\n",
    "        \"Emit EXACTLY ONE strict JSON object with keys route_plan, route_rationale, expected_artifacts,\",\n",
    "        \"Solve a quadratic equation using Python programming.\",\n",
    "        \"Implement a binary search algorithm with proper error handling.\",\n",
    "        \"Explain the concept of gradient descent in machine learning.\",\n",
    "        \"Write a function to calculate the Fibonacci sequence recursively.\",\n",
    "        \"Design a REST API endpoint for user authentication.\",\n",
    "        \"Analyze the time complexity of merge sort algorithm.\",\n",
    "    ]\n",
    "\n",
    "    # Repeat to reach desired size\n",
    "    while len(calibration_texts) < calibration_dataset_size:\n",
    "        calibration_texts.extend(calibration_texts[:calibration_dataset_size - len(calibration_texts)])\n",
    "\n",
    "    calibration_texts = calibration_texts[:calibration_dataset_size]\n",
    "    print(f\"✅ Calibration dataset prepared: {len(calibration_texts)} samples\")\n",
    "\n",
    "    # Step 4: Quantize model\n",
    "    print(f\"\\n[4/5] Quantizing model to AWQ (this may take 30-60 minutes)...\")\n",
    "    print(f\"Config: {awq_config}\")\n",
    "\n",
    "    # AutoAWQ expects raw text samples and tokenizes them internally\n",
    "    model.quantize(\n",
    "        tokenizer,\n",
    "        quant_config=awq_config,\n",
    "        calib_data=calibration_texts\n",
    "    )\n",
    "\n",
    "    print(\"✅ Model quantized to AWQ\")\n",
    "\n",
    "    # Step 5: Save quantized model\n",
    "    print(f\"\\n[5/5] Saving quantized model to {output_repo}...\")\n",
    "\n",
    "    # Create repo if it doesn't exist\n",
    "    api = HfApi()\n",
    "    try:\n",
    "        api.create_repo(\n",
    "            repo_id=output_repo,\n",
    "            repo_type=\"model\",\n",
    "            exist_ok=True,\n",
    "            token=os.environ.get(\"HF_TOKEN\")\n",
    "        )\n",
    "    except Exception as e:\n",
    "        print(f\"Note: Repo may already exist: {e}\")\n",
    "\n",
    "    # save_quantized writes to a local folder; it does not push to the Hub by itself\n",
    "    model.save_quantized(\n",
    "        output_repo,\n",
    "        safetensors=True,\n",
    "        shard_size=\"10GB\"  # Shard large models\n",
    "    )\n",
    "\n",
    "    # Save tokenizer alongside the quantized weights\n",
    "    tokenizer.save_pretrained(output_repo)\n",
    "\n",
    "    # Push the local folder to the Hub repo\n",
    "    api.upload_folder(\n",
    "        folder_path=output_repo,\n",
    "        repo_id=output_repo,\n",
    "        repo_type=\"model\",\n",
    "        token=os.environ.get(\"HF_TOKEN\")\n",
    "    )\n",
    "\n",
    "    print(f\"✅ Quantized model saved to {output_repo}\")\n",
    "\n",
    "    # Clean up memory\n",
    "    del model\n",
    "    del tokenizer\n",
    "    torch.cuda.empty_cache()\n",
    "\n",
    "    print(f\"\\n✅ {model_name} quantization complete!\")\n",
    "    print(f\"Model available at: https://huggingface.co/{output_repo}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Quantize Router-Gemma3-Merged\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "quantize_model_to_awq(\n",
    "    model_name=\"Router-Gemma3-27B\",\n",
    "    repo_id=MODELS_TO_QUANTIZE[\"router-gemma3-merged\"][\"repo_id\"],\n",
    "    output_repo=MODELS_TO_QUANTIZE[\"router-gemma3-merged\"][\"output_repo\"],\n",
    "    model_type=MODELS_TO_QUANTIZE[\"router-gemma3-merged\"][\"model_type\"],\n",
    "    awq_config=AWQ_CONFIG,\n",
    "    calibration_dataset_size=128\n",
    ")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Quantize Router-Qwen3-32B-Merged\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "quantize_model_to_awq(\n",
    "    model_name=\"Router-Qwen3-32B\",\n",
    "    repo_id=MODELS_TO_QUANTIZE[\"router-qwen3-32b-merged\"][\"repo_id\"],\n",
    "    output_repo=MODELS_TO_QUANTIZE[\"router-qwen3-32b-merged\"][\"output_repo\"],\n",
    "    model_type=MODELS_TO_QUANTIZE[\"router-qwen3-32b-merged\"][\"model_type\"],\n",
    "    awq_config=AWQ_CONFIG,\n",
    "    calibration_dataset_size=128\n",
    ")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7. Verify Quantized Models\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import torch\n",
    "from transformers import AutoTokenizer\n",
    "from awq import AutoAWQForCausalLM\n",
    "\n",
    "def verify_awq_model(repo_id: str):\n",
    "    \"\"\"Verify that an AWQ model can be loaded correctly.\"\"\"\n",
    "    print(f\"\\nVerifying {repo_id}...\")\n",
    "\n",
    "    try:\n",
    "        # Load tokenizer\n",
    "        tokenizer = AutoTokenizer.from_pretrained(\n",
    "            repo_id,\n",
    "            trust_remote_code=True,\n",
    "            token=os.environ.get(\"HF_TOKEN\")\n",
    "        )\n",
    "\n",
    "        # Load AWQ model\n",
    "        model = AutoAWQForCausalLM.from_quantized(\n",
    "            repo_id,\n",
    "            fuse_layers=True,\n",
    "            trust_remote_code=True,\n",
    "            device_map=\"auto\",\n",
    "            token=os.environ.get(\"HF_TOKEN\")\n",
    "        )\n",
    "\n",
    "        # Test generation\n",
    "        test_prompt = \"You are the Router Agent. Test prompt.\"\n",
    "        inputs = tokenizer(test_prompt, return_tensors=\"pt\").to(model.device)\n",
    "\n",
    "        with torch.inference_mode():\n",
    "            outputs = model.generate(\n",
    "                **inputs,\n",
    "                max_new_tokens=10,\n",
    "                do_sample=False\n",
    "            )\n",
    "\n",
    "        generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)\n",
    "        print(\"✅ Model loads and generates correctly\")\n",
    "        print(f\"Generated: {generated_text[:100]}...\")\n",
    "\n",
    "        # Check model size\n",
    "        total_params = sum(p.numel() for p in model.parameters())\n",
    "        print(f\"Total parameters: {total_params / 1e9:.2f}B\")\n",
    "\n",
    "        del model\n",
    "        del tokenizer\n",
    "        torch.cuda.empty_cache()\n",
    "\n",
    "        return True\n",
    "    except Exception as e:\n",
    "        print(f\"❌ Verification failed: {e}\")\n",
    "        import traceback\n",
    "        traceback.print_exc()\n",
    "        return False\n",
    "\n",
    "# Verify both models\n",
    "for model_key, model_info in MODELS_TO_QUANTIZE.items():\n",
    "    verify_awq_model(model_info[\"output_repo\"])\n"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}