ASHQ1: Autonomous Selective Hybrid Quantization
⚠️ Experimental. ASHQ1 is a personal research project that I will be refining over time. Use at your own risk. Results may vary between architectures and fine-tunes. Feedback and contributions welcome.
Latest update (v6): The classifier has been overhauled. The empirical depth-weighting heuristic was removed after A/B testing confirmed it added zero value. Quality improved as a result. The same budget now goes further.
ASHQ1 is a post-training quantization method for GGUF models that uses an imatrix-driven priority queue to maximise theoretical quality per megabyte. Instead of uniform bit-depth or heuristic layer-blocking, it treats tied tensor groups as monolithic entities and greedily upgrades them by strict mathematical utility: the product of summed importance and theoretical MSE reduction, divided by size cost.
Results
| Method | Model | Size | PPL (ctx 1024) | Δ PPL vs Uniform |
|---|---|---|---|---|
| ASHQ1 (v6) | Ornith-1.0-9B-MTP | 6012 MiB | 7.4697 ± 0.04862 | −0.1551 |
| Uniform Q6_K | Ornith-1.0-9B-MTP | 7198 MiB | 7.6248 ± 0.05039 | baseline |
ASHQ1 beats uniform Q6_K by 0.155 PPL while being 16.5% smaller (−1186 MiB). The current classifier (v6) dropped the empirical depth-weighting heuristic, and the theoretical priority queue now works even better without it.
ASHQ1 is often on par with hand-tuned SHQ quants in quality, and sometimes surpasses them. At the same time, it saves significant time and effort: just set your target size and go.
Real-World Validation
ASHQ1's theoretical quality advantage transfers to real agentic coding. We tested Ornith-1.0-9B ASHQ1 6500 (6.4 GB, 33% smaller than Q8_0) as the backend for Pi, an autonomous coding agent that uses llama.cpp as its LLM backend.
At temperature 0.6, the model was tasked with building a complete personal finance dashboard as a single HTML file: Canvas charts, budget tracker, dark mode, transaction filtering, upcoming bills, responsive layout. The agent worked autonomously: it planned the architecture, wrote the entire ~1100-line file, caught its own bugs (date.now → date.getTime), fixed dark mode logic, ran Node.js validation, and iterated until all checks passed. The final finance-dashboard.html was a polished, production-quality single-page app with no external dependencies, no hallucinations, and no broken features.
This is not cherry-picked. It's the first test we ran. The benchmarks didn't lie: ASHQ1 preserves enough quality that a 6.4 GB quant can drive an autonomous coding agent to build complete, working applications from scratch.
How It Works
1. Floor Assignment
Every tensor starts at a minimum tier by class. SSM params and norms lock at F16. Embeddings start at Q5_K. Weight matrices start at Q4_K (or IQ4_XS for QAT models). MTP heads deploy at Q8_0.
With --allow-q3-or-lower, low-importance tensors (ffn_down, attn_output, ssm_out) start as low as IQ2_XXS, giving the priority queue more room to upgrade important tensors to Q8_0. Tensors missing imatrix data are kept at Q4_K to avoid garbage at low bitrates.
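The floor rules above can be pictured as a small lookup plus a few special cases. This is a minimal sketch, not the project's code: the class names, the `FLOORS` table, and the `assign_floor` helper are hypothetical, and it assumes every low-importance tensor drops straight to IQ2_XXS when the flag is set (the text only says "as low as").

```python
# Hypothetical class -> floor tier table, following the rules described above.
# The real CLASS_HARD_FLOORS in constants.py may look different.
FLOORS = {
    "ssm_param": "F16",
    "norm": "F16",
    "embedding": "Q5_K",
    "weight": "Q4_K",
    "mtp": "Q8_0",
}

# Tensor kinds the text lists as low-importance under --allow-q3-or-lower.
LOW_IMPORTANCE = ("ffn_down", "attn_output", "ssm_out")


def assign_floor(name, tensor_class, has_imatrix=True, qat=False,
                 allow_q3_or_lower=False):
    """Return the starting tier for one tensor (illustrative only)."""
    if tensor_class != "weight":
        return FLOORS[tensor_class]
    if not has_imatrix:
        # No imatrix data: stay at Q4_K rather than risk a low-bit tier.
        return "Q4_K"
    if allow_q3_or_lower and any(tag in name for tag in LOW_IMPORTANCE):
        # Assumption: jump straight to the lowest tier the text mentions.
        return "IQ2_XXS"
    return "IQ4_XS" if qat else "Q4_K"


print(assign_floor("blk.3.ffn_down.weight", "weight", allow_q3_or_lower=True))
```

Starting low-importance tensors lower frees budget that the priority queue can later spend on the tensors that matter most.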
2. Importance
Imatrix in_sum2 measures how much each weight contributes to the output variance. Layer position weighting was tested but showed no PPL benefit and has been removed.
3. Tied Group Detection
Tensors with numerically identical in_sum2 arrays are tied (shared weights). They form a single upgrade group: all members upgrade together as one unit. Group importance is the sum of its members' importance, preventing large groups from being starved of budget.
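A rough sketch of tied-group detection follows. The project compares in_sum2 arrays with np.allclose; to keep this example dependency-free it swaps in the standard library's `math.isclose`, applied element by element. The function names and the toy data are made up for illustration, and per-tensor importance is passed in as a plain dict because the exact scalar derived from in_sum2 is not spelled out here.

```python
import math


def same_stats(a, b, rel_tol=1e-5, abs_tol=1e-8):
    """True when two in_sum2 vectors are numerically identical."""
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)
        for x, y in zip(a, b)
    )


def build_tied_groups(in_sum2):
    """Group tensor names whose in_sum2 vectors match."""
    groups = []  # each entry: (representative vector, [tensor names])
    for name, vec in in_sum2.items():
        for rep, members in groups:
            if same_stats(rep, vec):
                members.append(name)
                break
        else:
            groups.append((vec, [name]))
    return [members for _, members in groups]


def group_importance(members, importance):
    """Group importance is the sum over its members."""
    return sum(importance[m] for m in members)


# Toy data: the first two tensors share identical statistics, so they tie.
stats = {
    "blk.0.attn_q.weight": [1.0, 2.0, 3.0],
    "blk.0.attn_k.weight": [1.0, 2.0, 3.0],
    "blk.0.ffn_down.weight": [0.5, 0.1, 0.2],
}
groups = build_tied_groups(stats)
print(groups)
```

Summing importance matters because a group's upgrade cost also scales with its member count; without the sum, a large group would look many times less attractive per MiB than a single tensor of equal per-member importance.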
4. Priority Queue Drain
All possible single-tier upgrades are pushed into a max-heap:
utility/MiB = sum(timp[group]) × (MSE(cur) − MSE(next)) / (size(next) − size(cur))
MSE per tier is theoretical: MSE = 2^(−2 × bpw). K-quants get +0.1 effective bpw vs IQ-quants at the same real bpw, so IQ4_NL→Q4_K is a free quality gain.
The queue pops the highest-utility upgrade, applies it, pushes the next upgrade for that group, and drains until the budget is exhausted. A final pass catches any remaining zero-cost upgrades.
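The drain loop can be sketched with Python's `heapq`. This is a simplified illustration under stated assumptions, not the contents of classifier.py: it uses a four-tier ladder taken from the Tier Reference table, omits the +0.1 K-quant adjustment and the final free-upgrade pass, and assumes that an unaffordable upgrade is skipped while cheaper ones keep draining. The `drain` function and its input shape are hypothetical.

```python
import heapq

# A short tier ladder with bits-per-weight values from the Tier Reference table.
LADDER = ["Q4_K", "Q5_K", "Q6_K", "Q8_0"]
BPW = {"Q4_K": 4.5, "Q5_K": 5.5, "Q6_K": 6.5625, "Q8_0": 8.5}


def mse(tier):
    """Theoretical MSE for a tier: 2^(-2 * bpw)."""
    return 2.0 ** (-2.0 * BPW[tier])


def size_mib(n_weights, tier):
    """Approximate tensor size in MiB at a given tier."""
    return n_weights * BPW[tier] / 8 / 2**20


def drain(groups, budget_mib):
    """groups: {name: (summed importance, weight count)} -> {name: tier}."""
    tiers = {g: LADDER[0] for g in groups}
    spent = sum(size_mib(n, LADDER[0]) for _, n in groups.values())
    heap = []

    def push(g):
        idx = LADDER.index(tiers[g])
        if idx + 1 == len(LADDER):
            return  # already at the top tier
        imp, n = groups[g]
        nxt = LADDER[idx + 1]
        cost = size_mib(n, nxt) - size_mib(n, tiers[g])
        utility = imp * (mse(tiers[g]) - mse(nxt)) / cost
        # heapq is a min-heap, so the utility is negated to pop the max first.
        heapq.heappush(heap, (-utility, g, nxt, cost))

    for g in groups:
        push(g)
    while heap:
        _, g, nxt, cost = heapq.heappop(heap)
        if spent + cost > budget_mib:
            continue  # assumption: skip this upgrade, keep trying cheaper ones
        tiers[g] = nxt
        spent += cost
        push(g)  # queue the next single-tier upgrade for the same group
    return tiers


demo = {"a": (100.0, 8 * 2**20), "b": (1.0, 8 * 2**20)}
print(drain(demo, budget_mib=10))
```

In the demo both groups start at 4.5 MiB each; with a 10 MiB budget only the high-importance group can afford its 1 MiB step to Q5_K.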
Why It Works
| Problem | ASHQ1 Solution |
|---|---|
| Uniform quant wastes bits on low-importance tensors | Priority queue allocates budget where it matters |
| Heuristic hand-tuning doesn't scale | Single knob: --size in MiB |
| Hand-tuned SHQ hybrids need days of PPL sweeps | Queue converges in ~1 sec for any budget |
| Large tied groups starved by per-tensor logic | sum(timp) prevents 32× group penalty |
| IQ4_NL→Q4_K at same bpw is a no-op | Free-upgrade pass catches zero-cost quality gains |
| No PPL-per-budget curve needed | Queue optimises for MSE directly |
| Tensors without imatrix crash at low bitrates | has_imatrix check falls back to Q4_K floor |
Supported Architectures
| Arch | Detection | Features |
|---|---|---|
| qwen35 | SSM + QKV | Hybrid attention, SSM layers, GQA, MTP support |
| mellum2 | MoE (exps tensors) | Mixture of Experts, GQA, router F16 |
| gemma4 | Layer-scale norms | QAT support, Q4_K attention floor |
MTP (Multi-Token Prediction) heads are handled explicitly: MTP tensors deploy at Q8_0 and are excluded from the classifier's budget (their cost is subtracted from the target upfront). Tensor names with nextn.* or layers beyond n_layers are detected as MTP at runtime.
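As an illustration of the MTP handling described above, here is a minimal sketch. The helper names are hypothetical, and it assumes zero-indexed `blk.N.` tensor names, so "beyond n_layers" is read as a block index greater than or equal to n_layers.

```python
import re

# Matches the block index in GGUF-style names such as "blk.12.attn_q.weight".
LAYER_RE = re.compile(r"^blk\.(\d+)\.")


def is_mtp_tensor(name, n_layers):
    """Illustrative check: nextn.* in the name, or a block index >= n_layers."""
    if "nextn." in name:
        return True
    match = LAYER_RE.match(name)
    return bool(match) and int(match.group(1)) >= n_layers


def classifier_budget(target_mib, mtp_sizes_mib):
    """MTP tensors sit at Q8_0; their cost is removed from the target upfront."""
    return target_mib - sum(mtp_sizes_mib)


print(is_mtp_tensor("blk.32.nextn.eh_proj.weight", n_layers=32))
print(classifier_budget(6800, [120.0, 30.0]))
```

Subtracting the MTP cost first means the queue only ever competes over the budget that is actually negotiable.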
Looking for: Qwen3.6 support
Qwen3.6 is one of the most capable local LLMs right now, but I can't handle it on my hardware. The BF16 source is ~55 GB, and I don't have enough RAM to even load it, let alone quantize. If you have access to a Qwen3.6 GGUF (any quantization) and can run llama-imatrix on it, or if you'd like to collaborate on adding architecture detection, please reach out. I can handle the integration, I just need the raw tensor names and imatrix data to map out the class system.
New architectures can be added via ARCH_FEATURES in constants.py.
Code Structure
| File | Role |
|---|---|
| main.py | CLI entry point, orchestration, --show-floors, multiple --imatrix support |
| model_reader.py | Reads GGUF, detects architecture/prefix/n_layers/MTP at runtime |
| imatrix_reader.py | Parses imatrix GGUF, detects tied groups via np.allclose(in_sum2), combines multiple imatrix |
| classifier.py | Floor assignment → tied group building → priority queue drain → free upgrade pass |
| config_generator.py | Generates --tensor-type regex rules from classified tensors (valid ECMAScript regex with pipe-alternated ranges) |
| quantizer.py | Subprocess wrapper around llama-quantize |
| constants.py | TENSOR_CLASS mapping, CLASS_HARD_FLOORS, CLASS_MAX_TIER, MSE_BPW, TIER_BPW, ARCH_FEATURES |
Usage
Quantization
pip install -r requirements.txt
# Dry run (~1 sec)
python main.py --model model.gguf --imatrix imatrix.gguf --size 6800
# Actual quant (~10 min)
python main.py --model model.gguf --imatrix imatrix.gguf --size 6800 --run
# Show hard floors
python main.py --show-floors
# Multiple imatrix (combined with max/mean)
python main.py --model model.gguf --imatrix i1.gguf --imatrix i2.gguf \
--imatrix-method max --size 6800 --run
# Allow low-bit tensors (IQ2_XXS through Q8_0 spread)
python main.py --model model.gguf --imatrix imatrix.gguf --size 6000 \
--allow-q3-or-lower --run
The llama-quantize binary path is set in quantizer.py:6.
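For the multiple-imatrix case, the max/mean combination can be pictured as follows. This is only a sketch: it assumes the combination is element-wise over each tensor's in_sum2 vector, which the text does not state explicitly, and `combine_in_sum2` is a hypothetical name. It uses plain lists so it runs without NumPy.

```python
from statistics import fmean


def combine_in_sum2(arrays, method="max"):
    """Element-wise combine of one tensor's in_sum2 vectors across imatrix files."""
    if method == "max":
        return [max(col) for col in zip(*arrays)]
    if method == "mean":
        return [fmean(col) for col in zip(*arrays)]
    raise ValueError(f"unknown method: {method}")


# Two imatrix files reporting statistics for the same tensor.
a = [1.0, 4.0]
b = [3.0, 2.0]
print(combine_in_sum2([a, b], "max"))
print(combine_in_sum2([a, b], "mean"))
```

Under this reading, max keeps a weight important if any calibration set found it important, while mean smooths out dataset-specific spikes.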
Inference (llama-server)
Recommended server flags for serving ASHQ1 quants:
./build/bin/llama-server \
-m model-ASHQ1.gguf \
-c 50000 \
--jinja \
-fit off \
-ngl 99 \
--flash-attn on \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--port 8080 \
--mmap \
--temp 1.0 \
--top-p 0.95 \
--min-p 0 \
--top-k 20 \
--seed -1 \
--parallel 1
Tier Reference
| Tier | BPW | MSE_BPW |
|---|---|---|
| F16 | 16.0 | 16.0 |
| Q8_0 | 8.50 | 8.50 |
| Q6_K | 6.5625 | 6.5625 |
| Q5_K | 5.50 | 5.50 |
| Q4_K | 4.50 | 4.50 |
| IQ4_NL | 4.50 | (2) |
| IQ4_XS | 4.25 | 4.25 |
| Q3_K | 3.4375 | 3.4375 |
| IQ3_M | 3.66 | n/a |
| IQ3_S | 3.44 | 3.44 |
| IQ3_XXS | 3.0625 | 3.0625 |
| IQ2_S | 2.50 | 2.50 |
| IQ2_XS | 2.3125 | 2.3125 |
| IQ2_XXS | 2.0625 | 2.0625 |
| IQ1_S | 1.5625 | 1.5625 |
(2) IQ4_NL uses IQ4_XS MSE_BPW for the free-upgrade pass (same real bpw as Q4_K).
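To make the table concrete, here is a short worked calculation of what a tier change costs and buys for one tensor. The helper names are hypothetical; the size formula ignores GGUF metadata overhead, and the MSE ratio follows directly from MSE = 2^(−2 × bpw).

```python
def tensor_size_mib(n_weights, bpw):
    """Approximate size of a tensor in MiB at a given bits-per-weight."""
    return n_weights * bpw / 8 / 2**20


def mse_ratio(bpw_from, bpw_to):
    """Factor by which the theoretical MSE shrinks: 2^(2 * (to - from))."""
    return 2.0 ** (2.0 * (bpw_to - bpw_from))


n = 4096 * 4096  # a square weight matrix with 16,777,216 weights
print(tensor_size_mib(n, 4.5))     # Q4_K
print(tensor_size_mib(n, 6.5625))  # Q6_K
print(mse_ratio(4.5, 5.5))         # Q4_K -> Q5_K
```

Each extra bit per weight divides the theoretical MSE by four, which is why the first upgrades from a low floor tend to carry the highest utility per MiB.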
Quantization Configs
Generated configs are valid llama-quantize arguments with ECMAScript-compatible regex patterns. Each --tensor-type rule matches a group of tensors that share the same target tier, with layers grouped into contiguous ranges:
- (blk|BLK)\.(3|7|11|15|19|23|27|31)\.attn_k=Q8_0 (specific attention layers at Q8_0)
- (blk|BLK)\.((?:22|23|24|25|26))\.ffn_gate=Q6_K (range of FFN layers at Q6_K)
- .*output_norm.*=F16 (global catch-all)
Rules are sorted by specificity (specific layers, high tiers first) because llama-quantize uses first-match-wins.
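The rule format and the first-match-wins ordering can be checked with Python's `re` module. This is a sketch, not config_generator.py: `layer_rule` and `resolve` are hypothetical helpers, and it assumes the pattern is searched within the tensor name rather than anchored to the whole string.

```python
import re

# Example rules from above, in emission order (most specific first).
RULES = [
    (r"(blk|BLK)\.(3|7|11|15|19|23|27|31)\.attn_k", "Q8_0"),
    (r"(blk|BLK)\.((?:22|23|24|25|26))\.ffn_gate", "Q6_K"),
    (r".*output_norm.*", "F16"),
]


def layer_rule(layers, suffix, tier):
    """Build one pattern=tier rule string for a set of layer indices."""
    alternation = "|".join(str(i) for i in sorted(layers))
    return rf"(blk|BLK)\.({alternation})\.{suffix}={tier}"


def resolve(name):
    """Return the tier of the first rule whose pattern is found in the name."""
    for pattern, tier in RULES:
        if re.search(pattern, name):
            return tier
    return None


print(layer_rule([7, 3, 11], "attn_k", "Q8_0"))
print(resolve("blk.24.ffn_gate.weight"))
```

Because the escaped dots sit on both sides of the layer alternation, a name like blk.13.attn_k does not accidentally match the "3" alternative.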