This model is an INT4-quantized CUDA/GPU variant of Phi-3.5-Mini-Instruct, fine-tuned using a combination of Quantization-Aware Training (QAT) and LoRA to preserve accuracy while enabling highly efficient inference. After fine-tuning, the model was exported to ONNX and quantized using INT4 RTN via the default ONNX Runtime Model Builder settings to ensure compatibility across platforms.
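For reference, the ONNX export and INT4 RTN quantization step can be reproduced with the ONNX Runtime GenAI Model Builder along these lines (the input and output paths are placeholders for the merged fine-tuned checkpoint and the export directory; exact flags may vary with the builder version):

```bash
# Export the merged fine-tuned checkpoint to ONNX and quantize weights to INT4 (RTN)
# using the default Model Builder settings for the CUDA execution provider.
python -m onnxruntime_genai.models.builder \
    -i ./phi35-mini-qat-merged \
    -o ./phi35-mini-instruct-onnx-gpu-int4 \
    -p int4 \
    -e cuda
```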
The QAT process was performed on three small, non-overlapping datasets:
- HotpotQA - to improve EOS behavior and reduce excessively long generations
- Alpaca - to maintain general reasoning and instruction-following quality
- A small synthetic set (~50 samples) - to reinforce broad knowledge recall
Training used approximately 10k samples over 2 epochs, completing in about 1.5-2 hours on an H100. A key insight is that fine-tuning before quantization, using fake quantization during training, consistently produces better accuracy retention than fine-tuning after quantization. However, the effectiveness of this approach is sensitive to the choice of hyperparameters.
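A minimal sketch of what "fake quantization during training" combined with LoRA can look like, assuming a PyTorch + PEFT setup; the group size, symmetric INT4 grid, and target module names are illustrative assumptions rather than the exact training code:

```python
import torch
import torch.nn as nn
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM


class FakeInt4(torch.autograd.Function):
    """Round weights to a 4-bit grid in the forward pass; pass gradients
    straight through in the backward pass (straight-through estimator)."""

    @staticmethod
    def forward(ctx, w, group_size):
        groups = w.reshape(-1, group_size)                        # per-group quantization
        scale = groups.abs().amax(dim=1, keepdim=True) / 7.0 + 1e-8
        q = torch.clamp(torch.round(groups / scale), -8, 7)       # symmetric INT4 range
        return (q * scale).reshape(w.shape)

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None                                      # STE: identity gradient


class FakeQuantLinear(nn.Linear):
    """Linear layer whose frozen base weight is fake-quantized on every forward,
    so the LoRA adapters learn to compensate for the INT4 rounding error."""

    def forward(self, x):
        return nn.functional.linear(x, FakeInt4.apply(self.weight, 32), self.bias)


def swap_linears(module, names=("qkv_proj", "o_proj", "gate_up_proj", "down_proj")):
    """Recursively replace the targeted nn.Linear projections with FakeQuantLinear."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear) and name in names:
            fq = FakeQuantLinear(child.in_features, child.out_features,
                                 bias=child.bias is not None,
                                 device=child.weight.device, dtype=child.weight.dtype)
            fq.load_state_dict(child.state_dict())
            setattr(module, name, fq)
        else:
            swap_linears(child, names)


model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3.5-mini-instruct", torch_dtype=torch.bfloat16
)
swap_linears(model)

lora_cfg = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM",
                      target_modules=["qkv_proj", "o_proj", "gate_up_proj", "down_proj"])
model = get_peft_model(model, lora_cfg)
# ...then fine-tune on the HotpotQA / Alpaca / synthetic mixture as usual and
# merge the adapters back into the base weights before ONNX export.
```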
My QAT+LoRA fine-tuned Phi-3.5-Mini shows consistent accuracy improvements across nearly all reasoning-focused benchmarks while maintaining overall MMLU performance. A few tasks show a small drop in raw accuracy, but normalized accuracy rises, which is usually a sign of an improved model. A particularly notable improvement is the near-elimination of the EOS-generation issue, where the model previously delayed or failed to produce the EOS token. The quantized model also produces generations that are, on average, shorter than those of the original PyTorch model, which reduces inference cost as well.
The preliminary benchmark results (evaluated on the first 10k samples due to time constraints):
| Task | Metric | Published INT4 GPU Model | QAT Fine-Tuned Model | Δ Change |
|---|---|---|---|---|
| MMLU (overall) | Accuracy | 67.84% ± 0.38 | 67.32% ± 0.38 | -0.52 |
| TriviaQA | Exact Match | 23.0% ± 2.43 | 33.4% ± 2.11 | +10.4 |
| ARC-Easy | Accuracy | 83.16% ± 0.77 | 85.56% ± 0.77 | +2.40 |
| ARC-Easy | Norm Acc | 82.37% ± 0.78 | 86.53% ± 0.70 | +4.16 |
| ARC-Challenge | Accuracy | 58.79% ± 1.44 | 59.39% ± 1.44 | +0.60 |
| ARC-Challenge | Norm Acc | 60.58% ± 1.43 | 62.26% ± 1.42 | +1.68 |
| Winogrande | Accuracy | 74.51% ± 1.22 | 75.53% ± 1.21 | +1.02 |
| PIQA | Accuracy | 79.92% ± 0.93 | 80.47% ± 0.92 | +0.55 |
| CommonSenseQA | Accuracy | 70.27% ± 1.32 | 73.30% ± 1.27 | +3.03 |
| OpenBookQA | Accuracy | 38.00% ± 2.17 | 37.20% ± 2.16 | -0.80 |
| OpenBookQA | Norm Acc | 47.20% ± 2.23 | 48.00% ± 2.24 | +0.80 |
Even more surprising, the quantized model outperforms the baseline PyTorch model across several benchmarks:
| Task | Baseline (FP16/BF16) | QAT+LoRA | Δ Gap |
|---|---|---|---|
| ARC-Challenge | 59.22 | 59.39 | +0.17 |
| ARC-Challenge (Norm) | 61.18 | 62.26 | +1.08 |
| ARC-Easy | 84.18 | 85.56 | +1.38 |
| ARC-Easy (Norm) | 83.38 | 86.53 | +3.15 |
| Winogrande | 74.90 | 75.53 | +0.63 |
| PIQA | 79.65 | 80.47 | +0.82 |
| MMLU (Overall) | 68.62 | 67.32 | -1.30 |
| CommonSenseQA | 75.02 | 73.30 | -1.72 |
| OpenBookQA | 38.20 | 37.20 | -1.00 |
| OpenBookQA (Norm) | 47.80 | 48.00 | +0.20 |
More extensive runs are needed, especially for MMLU, where evaluation is slow and a 10k-sample cutoff was used for speed. Evaluations on generative benchmarks are also needed, but this is currently blocked by the lack of a reliable ONNX-compatible data generator. My custom implementation of generate_until for 0-shot HumanEval shows no observable degradation, but the benchmarking logic is not fully validated, so I am not including those results here.
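For context, the FP16/BF16 PyTorch baseline in the comparison above can be approximated with EleutherAI's lm-evaluation-harness roughly as follows (task names, the per-task `--limit` cutoff, and few-shot defaults are assumptions that may need adjusting for a given harness version):

```bash
# Approximate reproduction of the FP16/BF16 PyTorch baseline with lm-evaluation-harness.
# --limit 10000 mirrors the 10k-sample cutoff used for the slower tasks such as MMLU.
lm_eval --model hf \
    --model_args pretrained=microsoft/Phi-3.5-mini-instruct,dtype=bfloat16 \
    --tasks arc_easy,arc_challenge,winogrande,piqa,commonsense_qa,openbookqa,triviaqa,mmlu \
    --limit 10000 \
    --batch_size auto
```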
**Major Wins**
- EOS issue fixed
- +10.4 EM on TriviaQA: significant factual recall improvement
- +4.1 normalized accuracy on ARC-Easy: stronger multi-step reasoning
- Consistent improvements across ARC-Challenge, Winogrande, and PIQA
- Indicates higher robustness under INT4 quantization
- Reduces reasoning instability often seen in small INT4 models
**Minimal Risk**
- MMLU decrease is only -0.5 points, well within statistical noise
- No measurable degradation in broad knowledge or general-domain performance
**What These Results Demonstrate**
- QAT + LoRA meaningfully improves model quality under INT4 constraints
- Improvements transfer to real tasks: QA, reasoning, and common sense
- Accuracy gains are achieved without increasing model size or inference cost
- Confirms that QAT is a viable path for boosting INT4 model performance in production-like scenarios
See the images below for samples of the improvement over the already published INT4 model:
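Finally, for anyone who wants to try the model locally, here is a minimal inference sketch using the onnxruntime-genai Python API (the model path is a placeholder, and the generator API differs slightly between onnxruntime-genai releases):

```python
import onnxruntime_genai as og

# Placeholder path to the downloaded INT4 CUDA model directory.
model = og.Model("./Phi-3.5-mini-instruct-onnx-qat-finetuned")
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

prompt = "<|user|>\nExplain INT4 weight quantization in one sentence.<|end|>\n<|assistant|>\n"
params = og.GeneratorParams(model)
params.set_search_options(max_length=256)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode(prompt))  # older releases set params.input_ids instead
while not generator.is_done():
    generator.generate_next_token()
    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
```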