Qwen3-1.7B-NVFP4A16

Model Overview

  • Model Architecture: Qwen/Qwen3-1.7B
  • Input: Text
  • Output: Text
  • Model Optimizations:
    • Weight quantization: FP4 (NVFP4) with group size 16
    • Activation quantization: none (activations remain in FP16)
  • Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
  • Release Date: 8/4/2025
  • Version: 1.0
  • Model Developers: 2imi9

This model is a quantized version of Qwen/Qwen3-1.7B. It was evaluated on several tasks to assess its quality in comparison to the unquantized model.
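
Conceptually, per-group FP4 weight quantization splits each weight row into groups of 16, stores one scale per group, and rounds the scaled values to the nearest representable FP4 (E2M1) value. Below is a minimal illustrative sketch of this idea; it is not the exact NVFP4 algorithm (the real scheme also quantizes the group scales themselves).

import torch

# Representable non-negative magnitudes of the FP4 (E2M1) format
FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quantize_fp4(weights: torch.Tensor, group_size: int = 16) -> torch.Tensor:
    """Quantize-dequantize weights to FP4 with one scale per group of `group_size`."""
    w = weights.reshape(-1, group_size)                      # assumes numel % group_size == 0
    scale = w.abs().amax(dim=1, keepdim=True) / FP4_GRID.max()
    scale = scale.clamp(min=1e-12)                           # avoid division by zero
    idx = ((w / scale).abs().unsqueeze(-1) - FP4_GRID).abs().argmin(dim=-1)
    return (FP4_GRID[idx] * w.sign() * scale).reshape(weights.shape)

w = torch.randn(8, 32)
print((fake_quantize_fp4(w) - w).abs().mean())               # small quantization error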

Deployment

Use with vLLM

This model can be deployed efficiently using the vLLM backend, as shown in the example below.

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "path/to/Qwen3-1.7B-NVFP4A16"
number_gpus = 1

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]

prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)

vLLM also supports OpenAI-compatible serving. See the documentation for more details.
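
For example (a sketch; the model path, port, and sampling values are placeholders), the model can be served with vllm serve and queried through the OpenAI Python client:

# Start the server (shell):
#   vllm serve path/to/Qwen3-1.7B-NVFP4A16 --port 8000

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="path/to/Qwen3-1.7B-NVFP4A16",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who are you?"},
    ],
    temperature=0.6,
    top_p=0.9,
    max_tokens=256,
)
print(response.choices[0].message.content)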

Creation

This model was created by applying post-training quantization (PTQ) with LLM Compressor, without calibration samples, as shown in the code snippet below.

from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.utils import dispatch_for_generation

# Load model.
MODEL_ID = "Qwen/Qwen3-1.7B"
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# Configure the quantization algorithm and scheme.
# In this case, we:
#   * quantize the weights to FP4 with group size 16 via PTQ (NVFP4A16 scheme)
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4A16", ignore=["lm_head"])

# Apply quantization.
oneshot(model=model, recipe=recipe)

print("\n\n========== SAMPLE GENERATION ==============")
dispatch_for_generation(model)
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to(
    model.device
)
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
print("==========================================\n\n")

# Save to disk in compressed-tensors format.
SAVE_DIR = f"../../../model/{MODEL_ID.split('/')[-1]}-NVFP4A16"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
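
After saving, a quick sanity check is to inspect the saved config (this assumes the compressed-tensors format records its settings under quantization_config in config.json; exact keys may vary across llmcompressor versions; SAVE_DIR is the directory written above):

import json, os

with open(os.path.join(SAVE_DIR, "config.json")) as f:
    cfg = json.load(f)
print(json.dumps(cfg.get("quantization_config", {}), indent=2))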

Evaluation

This model was evaluated on the MMLU-Redux, Math500, IFEval, RULER-NIAH, and LiveCodeBench benchmarks. All evaluations were conducted with a custom evaluation framework using the vLLM backend.

Accuracy

| Category | Metric | Qwen/Qwen3-1.7B | Qwen3-1.7B-NVFP4A16 (this model) | Recovery (%) |
|---|---|---|---|---|
| General Knowledge | MMLU-Redux (T) | N/A | 65.73% | N/A |
| General Knowledge | MMLU-Redux | 64.4% | 55.23% | 85.8% |
| Mathematical Reasoning | Math500 (T) | 93.4% | 89.6% | 95.9% |
| Mathematical Reasoning | Math500 | 73% | 70% | 95.9% |
| Instruction Following | IFEval (strict prompt-level acc.) | 68.2% | 66.17% | 97.0% |
| Long Context | RULER-NIAH-32k | N/A | 76.21% | N/A |
| Coding | LiveCodeBench (2410-2502) (T), Pass@1 | 33.2% | 29.75% | 89.6% |
| Coding | LiveCodeBench (2410-2502), Pass@1 | 11.6% | 6.25% | 53.8% |

(T) denotes results with thinking mode enabled.
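
Recovery is the quantized score expressed as a percentage of the BF16 baseline, for example:

# Recovery (%) = quantized score / baseline score * 100
baseline, quantized = 64.4, 55.23            # MMLU-Redux (non-thinking) row
print(f"{quantized / baseline * 100:.1f}%")  # 85.8%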

Reproduction

The results were obtained using the following commands (on a single RTX 5090 GPU):

MMLU-Redux

# MMLU-Redux without thinking
HYDRA_FULL_ERROR=1 python3 infer.py \
    --config-name=qwen3_1.7B_bf16 \
    model_name=/app/model/Qwen3-1.7B-NVFP4A16 \
    eval_dataset=mmlu-redux \
    eval_predictor=vllm \
    predictor_conf.vllm.tensor_parallel_size=1 \
    predictor_conf.vllm.gpu_memory_utilization=0.9 \
    output_dir=/app/outputs/mmlu_redux_nvfp4_1.7vllm \
    debug=false

Math500

# Math500 with thinking
HYDRA_FULL_ERROR=1 python3 infer.py \
    --config-name=qwen3_1.7B_bf16 \
    model_name=/app/model/Qwen3-1.7B-NVFP4A16 \
    eval_dataset=math500 \
    eval_predictor=vllm \
    predictor_conf.vllm.tensor_parallel_size=1 \
    predictor_conf.vllm.gpu_memory_utilization=0.9 \
    output_dir=/app/outputs/math500_nvfp4_1.7vllm_thinking \
    debug=false

IFEval

HYDRA_FULL_ERROR=1 python3 infer.py \
    --config-name=qwen3_1.7B_bf16 \
    model_name=/app/model/Qwen3-1.7B-NVFP4A16 \
    eval_dataset=ifeval \
    eval_predictor=vllm \
    predictor_conf.vllm.tensor_parallel_size=1 \
    predictor_conf.vllm.gpu_memory_utilization=0.9 \
    output_dir=/app/outputs/ifeval_nvfp4_1.7vllm \
    debug=false

RULER-NIAH-32k

HYDRA_FULL_ERROR=1 python3 infer.py \
    --config-name=qwen3_1.7B_bf16 \
    model_name=/app/model/Qwen3-1.7B-NVFP4A16 \
    eval_dataset=ruler-niah-32k \
    eval_dataset_config=ruler-32k \
    eval_predictor=vllm \
    predictor_conf.vllm.tensor_parallel_size=1 \
    predictor_conf.vllm.gpu_memory_utilization=0.5 \
    predictor_conf.vllm.max_num_seqs=1 \
    predictor_conf.vllm.max_num_batched_tokens=16384 \
    predictor_conf.vllm.max_seq_len=32768 \
    predictor_conf.vllm.enable_prefix_caching=false \
    +predictor_conf.vllm.cpu_offload_gb=8 \
    +predictor_conf.vllm.device=auto \
    output_dir=/app/outputs/ruler_niah_nvfp4_1.7vllm \
    debug=false

Technical Details

  • Quantization Scheme: NVFP4A16 (FP4 weights with FP16 activations, per-group 16)
  • Excluded Layers: Language model head (lm_head) is not quantized
  • Memory Reduction: Approximately 75% reduction in weight memory relative to 16-bit weights (see the rough estimate after this list)
  • Inference Backend: Optimized for vLLM with tensor parallelism support
  • Context Length: Supports up to 32k tokens (as tested with RULER-NIAH)
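
The 75% figure follows from moving weights from 16-bit to 4-bit storage; a rough back-of-the-envelope estimate (ignoring embeddings, the unquantized lm_head, and per-group scale overhead):

params = 1.7e9
bf16_gb = params * 2 / 1e9      # 16-bit weights: ~3.4 GB
fp4_gb = params * 0.5 / 1e9     # 4-bit weights: ~0.85 GB
print(f"~{(1 - fp4_gb / bf16_gb) * 100:.0f}% reduction")   # ~75%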

Configuration Notes

  • GPU memory utilization can be adjusted between 0.5 and 0.9 depending on available hardware
  • For long-context evaluation (32k), reduced memory utilization (0.5) and CPU offload (8 GB) are recommended, as sketched below
  • Prefix caching can be disabled in memory-constrained environments
  • A tensor parallel size of 1 is sufficient for this 1.7B-parameter model
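
A minimal sketch of a long-context vLLM configuration along these lines (values mirror the RULER-NIAH command above; argument availability may vary with the vLLM version):

from vllm import LLM

llm = LLM(
    model="path/to/Qwen3-1.7B-NVFP4A16",
    tensor_parallel_size=1,         # single GPU is sufficient for a 1.7B model
    gpu_memory_utilization=0.5,     # leave headroom for the 32k-context KV cache
    cpu_offload_gb=8,               # offload part of the weights to CPU RAM
    max_model_len=32768,            # 32k context, as tested with RULER-NIAH
    max_num_seqs=1,                 # one sequence at a time for long prompts
    enable_prefix_caching=False,    # disable prefix caching to save memory
)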