Qwen3-1.7B-NVFP4A16

Model Overview

  • Model Architecture: Qwen/Qwen3-1.7B
  • Input: Text
  • Output: Text
  • Model Optimizations:
    • Weight quantization: FP4 (NVFP4) with group size 16
    • Activation quantization: none (activations remain in FP16)
  • Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
  • Release Date: 8/4/2025
  • Version: 1.0
  • Model Developers: 2imi9

This model is a quantized version of Qwen/Qwen3-1.7B. It was evaluated on several tasks to assess its quality in comparison to the unquantized model.
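
Conceptually, per-group FP4 weight quantization splits each weight row into groups of 16, stores one scale per group, and rounds the scaled values to the nearest representable FP4 (E2M1) value. Below is a minimal illustrative sketch of this idea; it is not the exact NVFP4 algorithm (the real scheme also quantizes the group scales themselves).

import torch

# Representable non-negative magnitudes of the FP4 (E2M1) format
FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quantize_fp4(weights: torch.Tensor, group_size: int = 16) -> torch.Tensor:
    """Quantize-dequantize weights to FP4 with one scale per group of `group_size`."""
    w = weights.reshape(-1, group_size)                      # assumes numel % group_size == 0
    scale = w.abs().amax(dim=1, keepdim=True) / FP4_GRID.max()
    scale = scale.clamp(min=1e-12)                           # avoid division by zero
    idx = ((w / scale).abs().unsqueeze(-1) - FP4_GRID).abs().argmin(dim=-1)
    return (FP4_GRID[idx] * w.sign() * scale).reshape(weights.shape)

w = torch.randn(8, 32)
print((fake_quantize_fp4(w) - w).abs().mean())               # small quantization error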

Deployment

Use with vLLM

This model can be deployed efficiently using the vLLM backend, as shown in the example below.

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "path/to/Qwen3-1.7B-NVFP4A16"
number_gpus = 1

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]

prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)

vLLM also supports OpenAI-compatible serving. See the documentation for more details.
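
For example (a sketch; the model path, port, and sampling values are placeholders), the model can be served with vllm serve and queried through the OpenAI Python client:

# Start the server (shell):
#   vllm serve path/to/Qwen3-1.7B-NVFP4A16 --port 8000

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="path/to/Qwen3-1.7B-NVFP4A16",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who are you?"},
    ],
    temperature=0.6,
    top_p=0.9,
    max_tokens=256,
)
print(response.choices[0].message.content)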

Creation

This model was created by applying post-training quantization (PTQ) with LLM Compressor, without calibration samples, as shown in the code snippet below.

from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.utils import dispatch_for_generation

# Load model.
MODEL_ID = "Qwen/Qwen3-1.7B"
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# Configure the quantization algorithm and scheme.
# In this case, we:
#   * quantize the weights to FP4 with group size 16 via PTQ (NVFP4A16 scheme)
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4A16", ignore=["lm_head"])

# Apply quantization.
oneshot(model=model, recipe=recipe)

print("\n\n========== SAMPLE GENERATION ==============")
dispatch_for_generation(model)
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to(
    model.device
)
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
print("==========================================\n\n")

# Save to disk in compressed-tensors format.
SAVE_DIR = f"../../../model/{MODEL_ID.split('/')[-1]}-NVFP4A16"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
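
After saving, a quick sanity check is to inspect the saved config (this assumes the compressed-tensors format records its settings under quantization_config in config.json; exact keys may vary across llmcompressor versions; SAVE_DIR is the directory written above):

import json, os

with open(os.path.join(SAVE_DIR, "config.json")) as f:
    cfg = json.load(f)
print(json.dumps(cfg.get("quantization_config", {}), indent=2))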

Evaluation

This model was evaluated on the MMLU-Redux, Math500, IFEval, RULER-NIAH, and LiveCodeBench benchmarks. All evaluations were conducted with a custom evaluation framework using the vLLM backend.

Accuracy

| Category | Metric | Qwen/Qwen3-1.7B | Qwen3-1.7B-NVFP4A16 (this model) | Recovery (%) |
|---|---|---|---|---|
| General Knowledge | MMLU-Redux (T) | N/A | 65.73% | N/A |
| General Knowledge | MMLU-Redux | 64.4% | 55.23% | 85.8% |
| Mathematical Reasoning | Math500 (T) | 93.4% | 89.6% | 95.9% |
| Mathematical Reasoning | Math500 | 73% | 70% | 95.9% |
| Instruction Following | IFEval (strict prompt-level acc.) | 68.2% | 66.17% | 97.0% |
| Long Context | RULER-NIAH-32k | N/A | 76.21% | N/A |
| Coding | LiveCodeBench (2410-2502) (T), Pass@1 | 33.2% | 29.75% | 89.6% |
| Coding | LiveCodeBench (2410-2502), Pass@1 | 11.6% | 6.25% | 53.8% |

(T) denotes results with thinking mode enabled.
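
Recovery is the quantized score expressed as a percentage of the BF16 baseline, for example:

# Recovery (%) = quantized score / baseline score * 100
baseline, quantized = 64.4, 55.23            # MMLU-Redux (non-thinking) row
print(f"{quantized / baseline * 100:.1f}%")  # 85.8%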

Reproduction

The results were obtained using the following commands (on a single RTX 5090 GPU):

MMLU-Redux

# MMLU-Redux without thinking
HYDRA_FULL_ERROR=1 python3 infer.py \
    --config-name=qwen3_1.7B_bf16 \
    model_name=/app/model/Qwen3-1.7B-NVFP4A16 \
    eval_dataset=mmlu-redux \
    eval_predictor=vllm \
    predictor_conf.vllm.tensor_parallel_size=1 \
    predictor_conf.vllm.gpu_memory_utilization=0.9 \
    output_dir=/app/outputs/mmlu_redux_nvfp4_1.7vllm \
    debug=false

Math500

# Math500 with thinking
HYDRA_FULL_ERROR=1 python3 infer.py \
    --config-name=qwen3_1.7B_bf16 \
    model_name=/app/model/Qwen3-1.7B-NVFP4A16 \
    eval_dataset=math500 \
    eval_predictor=vllm \
    predictor_conf.vllm.tensor_parallel_size=1 \
    predictor_conf.vllm.gpu_memory_utilization=0.9 \
    output_dir=/app/outputs/math500_nvfp4_1.7vllm_thinking \
    debug=false

IFEval

HYDRA_FULL_ERROR=1 python3 infer.py \
    --config-name=qwen3_1.7B_bf16 \
    model_name=/app/model/Qwen3-1.7B-NVFP4A16 \
    eval_dataset=ifeval \
    eval_predictor=vllm \
    predictor_conf.vllm.tensor_parallel_size=1 \
    predictor_conf.vllm.gpu_memory_utilization=0.9 \
    output_dir=/app/outputs/ifeval_nvfp4_1.7vllm \
    debug=false

RULER-NIAH-32k

HYDRA_FULL_ERROR=1 python3 infer.py \
    --config-name=qwen3_1.7B_bf16 \
    model_name=/app/model/Qwen3-1.7B-NVFP4A16 \
    eval_dataset=ruler-niah-32k \
    eval_dataset_config=ruler-32k \
    eval_predictor=vllm \
    predictor_conf.vllm.tensor_parallel_size=1 \
    predictor_conf.vllm.gpu_memory_utilization=0.5 \
    predictor_conf.vllm.max_num_seqs=1 \
    predictor_conf.vllm.max_num_batched_tokens=16384 \
    predictor_conf.vllm.max_seq_len=32768 \
    predictor_conf.vllm.enable_prefix_caching=false \
    +predictor_conf.vllm.cpu_offload_gb=8 \
    +predictor_conf.vllm.device=auto \
    output_dir=/app/outputs/ruler_niah_nvfp4_1.7vllm \
    debug=false

Technical Details

  • Quantization Scheme: NVFP4A16 (FP4 weights with FP16 activations, per-group 16)
  • Excluded Layers: Language model head (lm_head) is not quantized
  • Memory Reduction: Approximately 75% reduction in weight memory relative to 16-bit weights (see the rough estimate after this list)
  • Inference Backend: Optimized for vLLM with tensor parallelism support
  • Context Length: Supports up to 32k tokens (as tested with RULER-NIAH)
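
The 75% figure follows from moving weights from 16-bit to 4-bit storage; a rough back-of-the-envelope estimate (ignoring embeddings, the unquantized lm_head, and per-group scale overhead):

params = 1.7e9
bf16_gb = params * 2 / 1e9      # 16-bit weights: ~3.4 GB
fp4_gb = params * 0.5 / 1e9     # 4-bit weights: ~0.85 GB
print(f"~{(1 - fp4_gb / bf16_gb) * 100:.0f}% reduction")   # ~75%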

Configuration Notes

  • GPU memory utilization can be adjusted between 0.5 and 0.9 depending on available hardware
  • For long-context evaluation (32k), reduced memory utilization (0.5) and CPU offload (8 GB) are recommended, as sketched below
  • Prefix caching can be disabled in memory-constrained environments
  • A tensor parallel size of 1 is sufficient for this 1.7B-parameter model
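
A minimal sketch of a long-context vLLM configuration along these lines (values mirror the RULER-NIAH command above; argument availability may vary with the vLLM version):

from vllm import LLM

llm = LLM(
    model="path/to/Qwen3-1.7B-NVFP4A16",
    tensor_parallel_size=1,         # single GPU is sufficient for a 1.7B model
    gpu_memory_utilization=0.5,     # leave headroom for the 32k-context KV cache
    cpu_offload_gb=8,               # offload part of the weights to CPU RAM
    max_model_len=32768,            # 32k context, as tested with RULER-NIAH
    max_num_seqs=1,                 # one sequence at a time for long prompts
    enable_prefix_caching=False,    # disable prefix caching to save memory
)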