Quantization was performed using exllama3 v0.0.15.
| Quant | Size (GB) | KL-div (quant, orig) | KL-div (orig, quant) | Perplexity | Top-K K=1 | Top-K K=2 | Top-K K=3 | Top-K K=4 | Top-K K=5 |
|---|---|---|---|---|---|---|---|---|---|
| 2.0bpw | 57 | 0.41617405 | 0.63545585 | 5.76802263 | 0.7706 | 0.4538 | 0.2223 | 0.0960 | 0.0410 |
| 3.0bpw | 84 | 0.14266876 | 0.19399434 | 4.36054284 | 0.8777 | 0.6373 | 0.4059 | 0.2369 | 0.1316 |
| 4.0bpw | 111 | 0.02823074 | 0.03066939 | 3.85603126 | 0.9442 | 0.8062 | 0.6315 | 0.4630 | 0.3231 |
| 5.0bpw | 138 | 0.00948993 | 0.00957515 | 3.75920156 | 0.9666 | 0.8749 | 0.7485 | 0.6102 | 0.4783 |
| 5.5bpw | 151 | 0.00514412 | 0.00524848 | 3.75774266 | 0.9753 | 0.9075 | 0.8081 | 0.6913 | 0.5751 |
| 6.0bpw | 165 | 0.00399840 | 0.00401824 | 3.75147423 | 0.9774 | 0.9144 | 0.8214 | 0.7118 | 0.6006 |
| original | 437 | - | - | 3.73952143 | - | - | - | - | - |
Metrics explanation
- KL-divergence: Measures the difference between probability distributions of quantized and original models. Lower is better (closer to original).
- Perplexity: Indicates how well the model predicts the next token. Lower values mean better prediction quality.
- Top-K agreement: Shows how often the quantized model's top-K predictions exactly match the original model's predictions. Higher values indicate better preservation of the original model's behavior (1.0 = perfect match):
- K=1: The fraction of times both models predict the exact same token as their top-1 choice
- K=5: The fraction of times both models have the same 5 tokens in their top-5 predictions in the exact same ranking order
Tool Calls Support for Qwen/GLM Models
The official tabbyAPI doesn't support tool calls for Qwen and GLM models yet.
If you're using Qwen-Code, OpenClaw, or similar software that need tool call support, you can use my fork with the tools-support branch:
Clone directly:
git clone -b tools-support https://github.com/NeuroSenko/tabbyAPI
Or add to existing tabbyAPI installation:
git remote add neurosenko https://github.com/NeuroSenko/tabbyAPI
git fetch neurosenko
git checkout -b tools-support neurosenko/tools-support
This branch includes native tool calling support for Qwen and GLM model families.