ASHQ1 — Autonomous Selective Hybrid Quantization

⚠️ Experimental. ASHQ1 is a personal research project that I will be refining over time. Use at your own risk. Results may vary between architectures and fine-tunes. Feedback and contributions welcome.

Latest update (v6): The classifier has been overhauled — the empirical depth-weighting heuristic was removed after A/B testing confirmed it added zero value. Quality improved as a result. The same budget now goes further.

ASHQ1 is a post-training quantization method for GGUF models that uses an imatrix-driven priority queue to maximise theoretical quality per megabyte. Instead of uniform bit-depth or heuristic layer-blocking, it treats tied tensor groups as monolithic entities and greedily upgrades them by strict mathematical utility — the product of summed importance and theoretical MSE reduction, divided by size cost.

Results

Method	Model	Size	PPL (ctx 1024)	Δ vs Uniform
ASHQ1 (v6)	Ornith-1.0-9B-MTP	6012 MiB	7.4697 ± 0.04862	−0.1551
Uniform Q6_K	Ornith-1.0-9B-MTP	7198 MiB	7.6248 ± 0.05039	baseline

ASHQ1 beats uniform Q6_K by 0.155 PPL while being 16.5% smaller (−1186 MiB). The current classifier (v6) dropped empirical depth-weighting heuristics — the theoretical priority queue now works even better.

ASHQ1 is often on par with hand-tuned SHQ quants in quality, and sometimes surpasses them. At the same time, it saves significant time and effort — just set your target size and go.

Real-World Validation

ASHQ1's theoretical quality advantage transfers to real agentic coding. We tested Ornith-1.0-9B ASHQ1 6500 (6.4 GB, 33% smaller than Q8_0) as the backend for Pi, an autonomous coding agent that uses llama.cpp as its LLM backend.

At temperature 0.6, the model was tasked with building a complete personal finance dashboard as a single HTML file — Canvas charts, budget tracker, dark mode, transaction filtering, upcoming bills, responsive layout. The agent worked autonomously: planned the architecture, wrote the entire ~1100-line file, caught its own bugs (date.now → date.getTime), fixed dark mode logic, ran Node.js validation, and iterated until all checks passed. The final finance-dashboard.html was a polished, production-quality single-page app — no external dependencies, no hallucinations, no broken features.

This is not cherry-picked. It's the first test we ran. The benchmarks didn't lie — ASHQ1 preserves enough quality that a 6.4 GB quant can drive an autonomous coding agent to build complete, working applications from scratch.

How It Works

1. Floor Assignment

Every tensor starts at a minimum tier by class. SSM params and norms lock at F16. Embeddings start at Q5_K. Weight matrices start at Q4_K (or IQ4_XS for QAT models). MTP heads deploy at Q8_0.

With --allow-q3-or-lower, low-importance tensors (ffn_down, attn_output, ssm_out) start as low as IQ2_XXS, giving the priority queue more room to upgrade important tensors to Q8_0. Tensors missing imatrix data are kept at Q4_K to avoid garbage at low bitrates.

2. Importance

Imatrix in_sum2 measures how much each weight contributes to the output variance. Layer position weighting was tested but showed no PPL benefit and has been removed.

3. Tied Group Detection

Tensors with numerically identical in_sum2 arrays are tied (shared weights). They form a single upgrade group — all members upgrade together as one unit. Group importance is the sum of its members' importance, preventing large groups from being starved of budget.

4. Priority Queue Drain

All possible single-tier upgrades are pushed into a max-heap:

utility/MiB = sum(timp[group]) × (MSE(cur) − MSE(next)) / (size(next) − size(cur))

MSE per tier is theoretical: MSE = 2^(-2 × bpw). K-quants get +0.1 effective bpw vs IQ-quants at the same real bpw, so IQ4_NL→Q4_K is a free quality gain.

The queue pops the highest-utility upgrade, applies it, pushes the next upgrade for that group, and drains until the budget is exhausted. A final pass catches any remaining zero-cost upgrades.

Why It Works

Problem	ASHQ1 Solution
Uniform quant wastes bits on low-importance tensors	Priority queue allocates budget where it matters
Heuristic hand-tuning doesn't scale	Single knob: `--size` in MiB
Hand-tuned SHQ hybrids need days of PPL sweeps	Queue converges in ~1 sec for any budget
Large tied groups starved by per-tensor logic	`sum(timp)` prevents 32× group penalty
IQ4_NL→Q4_K at same bpw is a no-op	Free-upgrade pass catches zero-cost quality gains
No PPL-per-budget curve needed	Queue optimises for MSE directly
Tensors without imatrix crash at low bitrates	`has_imatrix` check falls back to Q4_K floor

Supported Architectures

Arch	Detection	Features
`qwen35`	SSM + QKV	Hybrid attention, SSM layers, GQA, MTP support
`mellum2`	MoE (`exps` tensors)	Mixture of Experts, GQA, router F16
`gemma4`	Layer-scale norms	QAT support, Q4_K attention floor

MTP (Multi-Token Prediction) heads are handled explicitly: MTP tensors deploy at Q8_0 and are excluded from the classifier's budget (their cost is subtracted from the target upfront). Tensor names with nextn.* or layers beyond n_layers are detected as MTP at runtime.

Looking for: Qwen3.6 support

Qwen3.6 is one of the most capable local LLMs right now, but I can't handle it on my hardware. The BF16 source is ~55 GB — I don't have enough RAM to even load it, let alone quantize. If you have access to a Qwen3.6 GGUF (any quantization) and can run llama-imatrix on it — or if you'd like to collaborate on adding architecture detection — please reach out. I can handle the integration, I just need the raw tensor names and imatrix data to map out the class system.

New architectures can be added via ARCH_FEATURES in constants.py.

Code Structure

File	Role
`main.py`	CLI entry point, orchestration, `--show-floors`, multiple `--imatrix` support
`model_reader.py`	Reads GGUF, detects architecture/prefix/n_layers/MTP at runtime
`imatrix_reader.py`	Parses imatrix GGUF, detects tied groups via `np.allclose(in_sum2)`, combines multiple imatrix
`classifier.py`	Floor assignment → tied group building → priority queue drain → free upgrade pass
`config_generator.py`	Generates `--tensor-type` regex rules from classified tensors (valid ECMAScript regex with pipe-alternated ranges)
`quantizer.py`	Subprocess wrapper around `llama-quantize`
`constants.py`	TENSOR_CLASS mapping, CLASS_HARD_FLOORS, CLASS_MAX_TIER, MSE_BPW, TIER_BPW, ARCH_FEATURES

Usage

Quantization

pip install -r requirements.txt

# Dry run (∼1 sec)
python main.py --model model.gguf --imatrix imatrix.gguf --size 6800

# Actual quant (∼10 min)
python main.py --model model.gguf --imatrix imatrix.gguf --size 6800 --run

# Show hard floors
python main.py --show-floors

# Multiple imatrix (combined with max/mean)
python main.py --model model.gguf --imatrix i1.gguf --imatrix i2.gguf \
  --imatrix-method max --size 6800 --run

# Allow low-bit tensors (IQ2_XXS through Q8_0 spread)
python main.py --model model.gguf --imatrix imatrix.gguf --size 6000 \
  --allow-q3-or-lower --run

The llama-quantize binary path is set in quantizer.py:6.

Inference (llama-server)

Recommended server flags for serving ASHQ1 quants:

./build/bin/llama-server \
  -m model-ASHQ1.gguf \
  -c 50000 \
  --jinja \
  -fit off \
  -ngl 99 \
  --flash-attn on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --port 8080 \
  --mmap \
  --temp 1.0 \
  --top-p 0.95 \
  --min-p 0 \
  --top-k 20 \
  --seed -1 \
  --parallel 1

Tier Reference

Tier	BPW	MSE_BPW
F16	16.0	16.0
Q8_0	8.50	8.50
Q6_K	6.5625	6.5625
Q5_K	5.50	5.50
Q4_K	4.50	4.50
IQ4_NL	4.50	(2)
IQ4_XS	4.25	4.25
Q3_K	3.4375	3.4375
IQ3_M	3.66	—
IQ3_S	3.44	3.44
IQ3_XXS	3.0625	3.0625
IQ2_S	2.50	2.50
IQ2_XS	2.3125	2.3125
IQ2_XXS	2.0625	2.0625
IQ1_S	1.5625	1.5625

(2) IQ4_NL uses IQ4_XS MSE_BPW for the free-upgrade pass (same real bpw as Q4_K).

Quantization Configs

Generated configs are valid llama-quantize arguments with ECMAScript-compatible regex patterns. Each --tensor-type rule matches a group of tensors that share the same target tier, with layers grouped into contiguous ranges:

(blk|BLK)\.(3|7|11|15|19|23|27|31)\.attn_k=Q8_0 — specific attention layers at Q8_0
(blk|BLK)\.((?:22|23|24|25|26))\.ffn_gate=Q6_K — range of FFN layers at Q6_K
.*output_norm.*=F16 — global catch-all

Rules are sorted by specificity (specific layers, high tiers first) because llama-quantize uses first-match-wins.

References

Downloads last month: -; Downloads are not tracked for this model. How to track