This model is an INT4-quantized CUDA/GPU variant of Phi-3.5-Mini-Instruct, fine-tuned using a combination of Quantization-Aware Training (QAT) and LoRA to preserve accuracy while enabling highly efficient inference. After fine-tuning, the model was exported to ONNX and quantized using INT4 RTN via the default ONNX Runtime Model Builder settings to ensure compatibility across platforms.
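For reference, the ONNX export and INT4 RTN quantization step can be reproduced with the ONNX Runtime GenAI Model Builder along these lines (the input and output paths are placeholders for the merged fine-tuned checkpoint and the export directory; exact flags may vary with the builder version):

```bash
# Export the merged fine-tuned checkpoint to ONNX and quantize weights to INT4 (RTN)
# using the default Model Builder settings for the CUDA execution provider.
python -m onnxruntime_genai.models.builder \
    -i ./phi35-mini-qat-merged \
    -o ./phi35-mini-instruct-onnx-gpu-int4 \
    -p int4 \
    -e cuda
```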
The QAT process was performed on three small, non-overlapping datasets:
- HotpotQA - to improve EOS behavior and reduce excessively long generations
- Alpaca - to maintain general reasoning and instruction-following quality
- A small synthetic set (~50 samples) - to reinforce broad knowledge recall
Training used approximately 10k samples over 2 epochs, completing in about 1.5-2 hours on an H100. A key insight is that fine-tuning before quantization, using fake quantization during training, consistently produces better accuracy retention than fine-tuning after quantization. However, the effectiveness of this approach is sensitive to the choice of hyperparameters.
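A minimal sketch of what "fake quantization during training" combined with LoRA can look like, assuming a PyTorch + PEFT setup; the group size, symmetric INT4 grid, and target module names are illustrative assumptions rather than the exact training code:

```python
import torch
import torch.nn as nn
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM


class FakeInt4(torch.autograd.Function):
    """Round weights to a 4-bit grid in the forward pass; pass gradients
    straight through in the backward pass (straight-through estimator)."""

    @staticmethod
    def forward(ctx, w, group_size):
        groups = w.reshape(-1, group_size)                        # per-group quantization
        scale = groups.abs().amax(dim=1, keepdim=True) / 7.0 + 1e-8
        q = torch.clamp(torch.round(groups / scale), -8, 7)       # symmetric INT4 range
        return (q * scale).reshape(w.shape)

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None                                      # STE: identity gradient


class FakeQuantLinear(nn.Linear):
    """Linear layer whose frozen base weight is fake-quantized on every forward,
    so the LoRA adapters learn to compensate for the INT4 rounding error."""

    def forward(self, x):
        return nn.functional.linear(x, FakeInt4.apply(self.weight, 32), self.bias)


def swap_linears(module, names=("qkv_proj", "o_proj", "gate_up_proj", "down_proj")):
    """Recursively replace the targeted nn.Linear projections with FakeQuantLinear."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear) and name in names:
            fq = FakeQuantLinear(child.in_features, child.out_features,
                                 bias=child.bias is not None,
                                 device=child.weight.device, dtype=child.weight.dtype)
            fq.load_state_dict(child.state_dict())
            setattr(module, name, fq)
        else:
            swap_linears(child, names)


model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3.5-mini-instruct", torch_dtype=torch.bfloat16
)
swap_linears(model)

lora_cfg = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM",
                      target_modules=["qkv_proj", "o_proj", "gate_up_proj", "down_proj"])
model = get_peft_model(model, lora_cfg)
# ...then fine-tune on the HotpotQA / Alpaca / synthetic mixture as usual and
# merge the adapters back into the base weights before ONNX export.
```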
My QAT+LoRA fine-tuned Phi-3.5-Mini shows consistent accuracy improvements across nearly all reasoning-focused benchmarks while maintaining overall MMLU performance. A few tasks show a small drop in raw accuracy, but normalized accuracy rises, which is usually a sign of an improved model. A particularly notable improvement is the near-elimination of the EOS-generation issue, where the model previously delayed or failed to produce the EOS token. The quantized model also produces generations that are, on average, shorter than those of the original PyTorch model, which reduces inference cost as well.
The preliminary benchmark results (evaluated on the first 10k samples due to time constraints):
| Task | Metric | Published INT4 GPU Model | QAT Fine-Tuned Model | Δ Change |
|---|---|---|---|---|
| MMLU (overall) | Accuracy | 67.84% ± 0.38 | 67.32% ± 0.38 | -0.52 |
| TriviaQA | Exact Match | 23.0% ± 2.43 | 33.4% ± 2.11 | +10.4 |
| ARC-Easy | Accuracy | 83.16% ± 0.77 | 85.56% ± 0.77 | +2.40 |
| ARC-Easy | Norm Acc | 82.37% ± 0.78 | 86.53% ± 0.70 | +4.16 |
| ARC-Challenge | Accuracy | 58.79% ± 1.44 | 59.39% ± 1.44 | +0.60 |
| ARC-Challenge | Norm Acc | 60.58% ± 1.43 | 62.26% ± 1.42 | +1.68 |
| Winogrande | Accuracy | 74.51% ± 1.22 | 75.53% ± 1.21 | +1.02 |
| PIQA | Accuracy | 79.92% ± 0.93 | 80.47% ± 0.92 | +0.55 |
| CommonSenseQA | Accuracy | 70.27% ± 1.32 | 73.30% ± 1.27 | +3.03 |
| OpenBookQA | Accuracy | 38.00% ± 2.17 | 37.20% ± 2.16 | -0.80 |
| OpenBookQA | Norm Acc | 47.20% ± 2.23 | 48.00% ± 2.24 | +0.80 |
Even more surprising, the quantized model outperforms the baseline PyTorch model across several benchmarks:
| Task | Baseline (FP16/BF16) | QAT+LoRA | Δ Gap |
|---|---|---|---|
| ARC-Challenge | 59.22 | 59.39 | +0.17 |
| ARC-Challenge (Norm) | 61.18 | 62.26 | +1.08 |
| ARC-Easy | 84.18 | 85.56 | +1.38 |
| ARC-Easy (Norm) | 83.38 | 86.53 | +3.15 |
| Winogrande | 74.90 | 75.53 | +0.63 |
| PIQA | 79.65 | 80.47 | +0.82 |
| MMLU (Overall) | 68.62 | 67.32 | -1.30 |
| CommonSenseQA | 75.02 | 73.30 | -1.72 |
| OpenBookQA | 38.20 | 37.20 | -1.00 |
| OpenBookQA (Norm) | 47.80 | 48.00 | +0.20 |
More extensive runs are needed, especially for MMLU, where evaluation is slow and a 10k-sample cutoff was used for speed. Evaluations on generative benchmarks are also needed, but this is currently blocked by the lack of a reliable ONNX-compatible data generator. My custom implementation of generate_until for 0-shot HumanEval shows no observable degradation, but the benchmarking logic is not fully validated, so I am not including those results here.
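For context, the FP16/BF16 PyTorch baseline in the comparison above can be approximated with EleutherAI's lm-evaluation-harness roughly as follows (task names, the per-task `--limit` cutoff, and few-shot defaults are assumptions that may need adjusting for a given harness version):

```bash
# Approximate reproduction of the FP16/BF16 PyTorch baseline with lm-evaluation-harness.
# --limit 10000 mirrors the 10k-sample cutoff used for the slower tasks such as MMLU.
lm_eval --model hf \
    --model_args pretrained=microsoft/Phi-3.5-mini-instruct,dtype=bfloat16 \
    --tasks arc_easy,arc_challenge,winogrande,piqa,commonsense_qa,openbookqa,triviaqa,mmlu \
    --limit 10000 \
    --batch_size auto
```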
**Major Wins**
- EOS issue fixed
- +10.4 EM on TriviaQA: significant factual recall improvement
- +4.1 normalized accuracy on ARC-Easy: stronger multi-step reasoning
- Consistent improvements across ARC-Challenge, Winogrande, and PIQA
- Indicates higher robustness under INT4 quantization
- Reduces reasoning instability often seen in small INT4 models
**Minimal Risk**
- MMLU decrease is only -0.5 points, well within statistical noise
- No measurable degradation in broad knowledge or general-domain performance
**What These Results Demonstrate**
- QAT + LoRA meaningfully improves model quality under INT4 constraints
- Improvements transfer to real tasks: QA, reasoning, and common sense
- Accuracy gains are achieved without increasing model size or inference cost
- Confirms that QAT is a viable path for boosting INT4 model performance in production-like scenarios
See the images below for samples of the improvement over the already published INT4 model:
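Finally, for anyone who wants to try the model locally, here is a minimal inference sketch using the onnxruntime-genai Python API (the model path is a placeholder, and the generator API differs slightly between onnxruntime-genai releases):

```python
import onnxruntime_genai as og

# Placeholder path to the downloaded INT4 CUDA model directory.
model = og.Model("./Phi-3.5-mini-instruct-onnx-qat-finetuned")
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

prompt = "<|user|>\nExplain INT4 weight quantization in one sentence.<|end|>\n<|assistant|>\n"
params = og.GeneratorParams(model)
params.set_search_options(max_length=256)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode(prompt))  # older releases set params.input_ids instead
while not generator.is_done():
    generator.generate_next_token()
    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
```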