This model is an INT4-quantized CUDA/GPU variant of Phi-3.5-Mini-Instruct, fine-tuned using a combination of Quantization-Aware Training (QAT) and LoRA to preserve accuracy while enabling highly efficient inference. After fine-tuning, the model was exported to ONNX and quantized using INT4 RTN via the default ONNX Runtime Model Builder settings to ensure compatibility across platforms.
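For reference, the snippet below is a minimal sketch of what that export/quantization step can look like with the ONNX Runtime GenAI model builder; the input/output paths and the merged-checkpoint directory are placeholders rather than the exact commands used for this model.

```python
# Minimal sketch of the ONNX export + INT4 RTN quantization step (illustrative only).
# Assumes the onnxruntime-genai model builder; paths are placeholders.
import subprocess

subprocess.run(
    [
        "python", "-m", "onnxruntime_genai.models.builder",
        "-i", "./phi-3.5-mini-qat-merged",   # merged QAT+LoRA checkpoint (placeholder path)
        "-o", "./phi-3.5-mini-int4-cuda",    # output ONNX model directory
        "-p", "int4",                        # INT4 RTN weight quantization (builder default)
        "-e", "cuda",                        # CUDA/GPU execution provider
    ],
    check=True,
)
```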

The QAT process was performed on three small, non-overlapping datasets:

  • HotpotQA - to improve EOS behavior and reduce excessively long generations
  • Alpaca - to maintain general reasoning and instruction-following quality
  • A small synthetic set (≈50 samples) - to reinforce broad knowledge recall

Training used approximately 10k samples over 2 epochs, completing in about 1.5–2 hours on an H100. A key insight is that fine-tuning before quantization, using fake quantization during training, consistently produces better accuracy retention than fine-tuning after quantization. However, the effectiveness of this approach is sensitive to the correct choice of hyperparameters.
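To make the fake-quantization idea concrete, the following is an illustrative PyTorch sketch rather than the actual training code: the frozen base weight passes through a simulated INT4 round-trip in the forward pass, so the trainable LoRA adapters learn on top of the quantization error they will face at inference time. The group size, rank, and scaling factor shown here are assumptions.

```python
# Illustrative fake-quant + LoRA layer (not the exact training code used for this model).
import torch
import torch.nn as nn


def fake_quant_int4(w: torch.Tensor, group_size: int = 32) -> torch.Tensor:
    """Simulate symmetric INT4 round-to-nearest quantization per weight group."""
    out_features, in_features = w.shape
    g = w.reshape(out_features, in_features // group_size, group_size)
    scale = g.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0  # INT4 range [-8, 7]
    q = (g / scale).round().clamp(-8, 7) * scale
    # Straight-through estimator: forward uses quantized values, backward passes through.
    return w + (q.reshape_as(w) - w).detach()


class QATLoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)   # base weights stay frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)        # adapter starts as a no-op
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w_q = fake_quant_int4(self.base.weight)   # adapters see the quantization error
        y = nn.functional.linear(x, w_q, self.base.bias)
        return y + self.lora_b(self.lora_a(x)) * self.scaling


layer = QATLoRALinear(nn.Linear(4096, 4096))
print(layer(torch.randn(2, 4096)).shape)  # torch.Size([2, 4096])
```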

My QAT+LoRA fine-tuned Phi-3.5-Mini shows consistent accuracy improvements across nearly all reasoning-focused benchmarks while maintaining overall MMLU performance. A few raw-accuracy metrics drop slightly, but normalized accuracy rises on those same tasks, which is usually a sign of an improved model. A particularly notable improvement is the near-elimination of the EOS-generation issue, where the model previously delayed or failed to produce the EOS token. The quantized model's generations are also, on average, shorter than those of the original PyTorch model, which further reduces inference cost.

The preliminary benchmark results (evaluated on the first 10k samples due to time constraints):

| Task | Metric | Published INT4 GPU Model | QAT Fine-Tuned Model | Δ Change |
|---|---|---|---|---|
| MMLU (overall) | Accuracy | 67.84% ± 0.38 | 67.32% ± 0.38 | −0.52 |
| TriviaQA | Exact Match | 23.0% ± 2.43 | 33.4% ± 2.11 | +10.4 |
| ARC-Easy | Accuracy | 83.16% ± 0.77 | 85.56% ± 0.77 | +2.40 |
| ARC-Easy | Norm Acc | 82.37% ± 0.78 | 86.53% ± 0.70 | +4.16 |
| ARC-Challenge | Accuracy | 58.79% ± 1.44 | 59.39% ± 1.44 | +0.60 |
| ARC-Challenge | Norm Acc | 60.58% ± 1.43 | 62.26% ± 1.42 | +1.68 |
| Winogrande | Accuracy | 74.51% ± 1.22 | 75.53% ± 1.21 | +1.02 |
| PIQA | Accuracy | 79.92% ± 0.93 | 80.47% ± 0.92 | +0.55 |
| CommonSenseQA | Accuracy | 70.27% ± 1.32 | 73.30% ± 1.27 | +3.03 |
| OpenBookQA | Accuracy | 38.00% ± 2.17 | 37.20% ± 2.16 | −0.80 |
| OpenBookQA | Norm Acc | 47.20% ± 2.23 | 48.00% ± 2.24 | +0.80 |

Even more surprising, the quantized model outperforms the baseline PyTorch model across several benchmarks:

| Task | Baseline (FP16/BF16) | QAT+LoRA | Δ Gap |
|---|---|---|---|
| ARC-Challenge | 59.22 | 59.39 | +0.17 ↑ |
| ARC-Challenge (Norm) | 61.18 | 62.26 | +1.08 ↑ |
| ARC-Easy | 84.18 | 85.56 | +1.38 ↑ |
| ARC-Easy (Norm) | 83.38 | 86.53 | +3.15 ↑ |
| Winogrande | 74.90 | 75.53 | +0.63 ↑ |
| PIQA | 79.65 | 80.47 | +0.82 ↑ |
| MMLU (Overall) | 68.62 | 67.32 | −1.30 ↓ |
| CommonSenseQA | 75.02 | 73.30 | −1.72 ↓ |
| OpenBookQA | 38.20 | 37.20 | −1.00 ↓ |
| OpenBookQA (Norm) | 47.80 | 48.00 | +0.20 ↑ |

More extensive runs are needed, especially for MMLU, where evaluation is slow and a 10k-sample cutoff was used for speed. Evaluations on generative benchmarks are also needed, but these are currently blocked by the lack of a reliable ONNX-compatible data generator. My custom implementation of `generate_until` for 0-shot HumanEval shows no observable degradation, but the benchmarking logic is not fully validated, so I am not including those results here.
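For context, the sketch below shows the general shape of such a `generate_until` loop on top of onnxruntime-genai: greedy decoding with incremental detokenization and stop-string trimming. Method names vary between onnxruntime-genai releases, and the model directory, token budget, and stop strings are placeholders, so treat it as an outline rather than the validated harness code.

```python
# Hedged outline of a generate_until-style loop over an ONNX model (not the validated code).
import onnxruntime_genai as og


def generate_until(model_dir: str, prompt: str, stop: list[str], max_new_tokens: int = 256) -> str:
    model = og.Model(model_dir)
    tokenizer = og.Tokenizer(model)

    params = og.GeneratorParams(model)
    # max_length also budgets the prompt; +1024 is a rough placeholder.
    params.set_search_options(max_length=max_new_tokens + 1024, do_sample=False)

    generator = og.Generator(model, params)
    generator.append_tokens(tokenizer.encode(prompt))

    text = ""
    detok = tokenizer.create_stream()  # incremental detokenizer
    while not generator.is_done():
        generator.generate_next_token()
        text += detok.decode(generator.get_next_tokens()[0])
        # Trim at the first stop sequence, mirroring lm-eval's "until" semantics.
        for s in stop:
            if s in text:
                return text.split(s)[0]
    return text
```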

🚀 Major Wins

  • EOS issue fixed
  • +10.4 EM on TriviaQA - significant factual recall improvement
  • +4.1 normalized accuracy on ARC-Easy - stronger multi-step reasoning
  • Consistent improvements across ARC-Challenge, Winogrande, and PIQA
    • Indicates higher robustness under INT4 quantization
    • Reduces reasoning instability often seen in small INT4 models

🔒 Minimal Risk

  • MMLU decrease is only −0.5, well within statistical noise
  • No measurable degradation in broad knowledge or general-domain performance

🎯 What These Results Demonstrate

  • QAT + LoRA meaningfully improves model quality under INT4 constraints
  • Improvements transfer to real tasks: QA, reasoning, and common sense
  • Accuracy gains are achieved without increasing model size or inference cost
  • Confirms that QAT is a viable path for boosting INT4 model performance in production-like scenarios

See the images below for samples of the improvement over the already published INT4 model:

*(Sample generation comparisons against the published INT4 model.)*
