Libraries

How to use vrfai/Qwen3.6-27B-NVFP4 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="vrfai/Qwen3.6-27B-NVFP4")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("vrfai/Qwen3.6-27B-NVFP4")
model = AutoModelForImageTextToText.from_pretrained("vrfai/Qwen3.6-27B-NVFP4")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use vrfai/Qwen3.6-27B-NVFP4 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "vrfai/Qwen3.6-27B-NVFP4"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "vrfai/Qwen3.6-27B-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/vrfai/Qwen3.6-27B-NVFP4

SGLang

How to use vrfai/Qwen3.6-27B-NVFP4 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "vrfai/Qwen3.6-27B-NVFP4" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "vrfai/Qwen3.6-27B-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "vrfai/Qwen3.6-27B-NVFP4" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "vrfai/Qwen3.6-27B-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use vrfai/Qwen3.6-27B-NVFP4 with Docker Model Runner:
```
docker model run hf.co/vrfai/Qwen3.6-27B-NVFP4
```

Qwen3.6-27B-NVFP4

NVFP4-quantized version of Qwen/Qwen3.6-27B, produced by vrfai with llm-compressor.

Tested and deployed on 2× NVIDIA RTX 5090 GPUs with tensor-parallel inference (tensor-parallel-size 2) via vLLM.

NVFP4 Quantization Details


Base model	Qwen/Qwen3.6-27B
Quantization	NVFP4 — weights FP4, activations FP4 (dynamic local), scales FP8
Format	`compressed-tensors` (native vLLM support)
Tool	vllm-project/llm-compressor
Requires	NVIDIA Blackwell GPU (tested on SM 120, i.e. RTX 5090), vLLM ≥ 0.19
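
To make the Quantization row more concrete, here is a minimal pure-Python sketch of block-scaled FP4 "fake quantization". It assumes the commonly described NVFP4 layout: E2M1 values (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) that share one scale per block of 16 elements. The block size, the helper names, and keeping the scale in full precision instead of rounding it to FP8 are assumptions made for illustration; this is not llm-compressor's implementation.

```python
import math

# Representable magnitudes of an E2M1 (FP4) value.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # assumption: NVFP4 shares one scale per 16 elements


def fake_quantize_block(block):
    """Round one block to the FP4 grid and return (dequantized values, scale)."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block), 0.0
    # Map the largest magnitude in the block onto the largest FP4 value (6.0).
    # Assumption: the scale stays in full precision here; NVFP4 stores it as FP8.
    scale = amax / E2M1_GRID[-1]
    dequantized = []
    for x in block:
        target = abs(x) / scale
        nearest = min(E2M1_GRID, key=lambda g: abs(g - target))
        dequantized.append(math.copysign(nearest, x) * scale)
    return dequantized, scale


def bits_per_weight(value_bits=4, scale_bits=8, block_size=BLOCK_SIZE):
    """Storage cost per weight, counting the per-block scale."""
    return value_bits + scale_bits / block_size


weights = [0.02 * i - 0.15 for i in range(BLOCK_SIZE)]
restored, scale = fake_quantize_block(weights)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"scale={scale:.5f} max_error={max_error:.5f}")
print(f"bits per weight: {bits_per_weight()}")
```

Under these assumptions a quantized weight costs 4.5 bits once the per-block scale is included, and the rounding error within a block is bounded by that block's scale.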

What's Quantized / What's Not

The quantization strategy keeps the most sensitive components in BF16 and compresses the compute-heavy, quantization-tolerant layers to NVFP4:

Component	Precision	Reason
FFN / MLP — all 64 transformer layers	NVFP4	High parameter density, stable under quantization
Full-attention projections (q/k/v/o) — 16 GQA layers	NVFP4	Standard attention, tolerant to 4-bit
DeltaNet / Linear-attention projections — 48 layers	BF16	Gated linear recurrence is sensitive to numerical errors
Vision encoder — all 27 blocks + merger	BF16	Vision tower preserved to maintain multimodal quality
`lm_head`	BF16	Output logits preserved for generation stability

Qwen3.6-27B repeats a group of four layers 16 times (16 × 4 = 64 layers total): three DeltaNet (linear-attention) layers followed by one full GQA attention layer. Only the full-attention projections and the FFN layers are quantized; the DeltaNet linear-attention projections stay in BF16.
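
The layer layout described above can be sketched in a few lines of Python. The sketch assumes 0-based layer indices with the full-attention layer last in each group of four, which matches the DeltaNet ranges quoted in the recipe comment (0–2, 4–6, ..., 60–62); `layer_kind` is a hypothetical helper, not part of any library.

```python
NUM_LAYERS = 64
GROUP_SIZE = 4  # 3 DeltaNet layers followed by 1 full-attention layer


def layer_kind(index):
    """Return the attention type of a 0-based layer index."""
    if index % GROUP_SIZE == GROUP_SIZE - 1:
        return "full_attention"
    return "linear_attention"


kinds = [layer_kind(i) for i in range(NUM_LAYERS)]
full_attention_layers = [i for i, kind in enumerate(kinds) if kind == "full_attention"]

print(kinds.count("linear_attention"), "DeltaNet layers (attention projections kept in BF16)")
print(len(full_attention_layers), "full-attention layers (projections in NVFP4)")
print("first full-attention layers:", full_attention_layers[:4])
```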

Quantization Config (llm-compressor)

# recipe.yaml
QuantizationModifier:
  targets: [Linear]
  scheme: NVFP4
  ignore:
    - lm_head
    # Vision encoder — all 27 blocks (attn + mlp) + merger
    - re:model\.visual\.blocks\.\d+\..*
    - model.visual.merger.linear_fc1
    - model.visual.merger.linear_fc2
    # DeltaNet / Linear-attention layers (layers 0–2, 4–6, 8–10, ..., 60–62)
    - re:model\.language_model\.layers\.\d+\.linear_attn\..*
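
As a rough illustration of what the `ignore` list selects, the sketch below applies the same entries to a few module names. It assumes that entries prefixed with `re:` are regular expressions matched from the start of the module name and that all other entries are exact names; llm-compressor's real matching rules may be broader. The module names in `candidates` are hypothetical and only follow the naming pattern visible in the recipe.

```python
import re

IGNORE = [
    "lm_head",
    r"re:model\.visual\.blocks\.\d+\..*",
    "model.visual.merger.linear_fc1",
    "model.visual.merger.linear_fc2",
    r"re:model\.language_model\.layers\.\d+\.linear_attn\..*",
]


def is_ignored(module_name, ignore=IGNORE):
    """Return True if the module would be left unquantized (assumed semantics)."""
    for entry in ignore:
        if entry.startswith("re:"):
            if re.match(entry[3:], module_name):
                return True
        elif entry == module_name:
            return True
    return False


# Hypothetical module names, for illustration only.
candidates = [
    "lm_head",
    "model.visual.blocks.5.attn.qkv",
    "model.language_model.layers.0.linear_attn.in_proj",
    "model.language_model.layers.3.self_attn.q_proj",
    "model.language_model.layers.0.mlp.gate_proj",
]
for name in candidates:
    precision = "BF16 (ignored)" if is_ignored(name) else "NVFP4"
    print(f"{name}: {precision}")
```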

Quick Start (vLLM)

vllm serve vrfai/Qwen3.6-27B-NVFP4 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9 \
  --dtype auto \
  --trust-remote-code \
  --tensor-parallel-size 2

For a single Blackwell GPU (e.g., an RTX 5090 with 32 GB):

vllm serve vrfai/Qwen3.6-27B-NVFP4 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.92 \
  --dtype auto \
  --trust-remote-code

Python (Transformers)

from transformers import Qwen3_5ForConditionalGeneration, AutoTokenizer

model_name = "vrfai/Qwen3.6-27B-NVFP4"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = Qwen3_5ForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Explain quantization in one paragraph."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

OpenAI-compatible API

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="vrfai/Qwen3.6-27B-NVFP4",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)

Tested Environment

Component	Version
vLLM	0.19.1
Transformers	5.6.0
PyTorch	2.10.0+cu128
CUDA	12.8 (nvcc 12.8.61)
llm-compressor	compressed-tensors 0.14.0.1
GPU	2× NVIDIA RTX 5090 (tensor-parallel-size 2)
OS	Ubuntu 24
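
A small sketch for checking a local install against the stated vLLM ≥ 0.19 requirement. `release_tuple`, `meets_minimum`, and `check` are hypothetical helpers; only `importlib.metadata` from the standard library is used, and the comparison looks only at the numeric release part of the version string.

```python
import re
from importlib import metadata


def release_tuple(version):
    """'2.10.0+cu128' -> (2, 10, 0); local and pre-release suffixes are ignored."""
    public = version.split("+", 1)[0]
    release = re.match(r"\d+(?:\.\d+)*", public).group(0)
    return tuple(int(part) for part in release.split("."))


def meets_minimum(installed, minimum):
    """Compare numeric release tuples, e.g. (0, 19, 1) >= (0, 19)."""
    return release_tuple(installed) >= release_tuple(minimum)


def check(package, minimum):
    """Describe whether an installed package satisfies a minimum version."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return f"{package}: not installed"
    status = "ok" if meets_minimum(installed, minimum) else f"older than {minimum}"
    return f"{package} {installed}: {status}"


print(check("vllm", "0.19"))
```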

Best Practices

Sampling parameters:

Mode	temperature	top_p	top_k	presence_penalty
Thinking — general	1.0	0.95	20	0.0
Thinking — coding (WebDev)	0.6	0.95	20	0.0
Non-thinking / instruct	0.7	0.80	20	1.5
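
The table can be turned into request arguments for the OpenAI-compatible API along these lines. This is a sketch: the preset keys and `request_kwargs` are hypothetical names. `top_k` is not a standard OpenAI chat-completions parameter, so the sketch assumes a vLLM server, which accepts it as an extra field that the `openai` client forwards through `extra_body`.

```python
SAMPLING_PRESETS = {
    "thinking_general": {"temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 0.0},
    "thinking_coding": {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "presence_penalty": 0.0},
    "non_thinking": {"temperature": 0.7, "top_p": 0.80, "top_k": 20, "presence_penalty": 1.5},
}


def request_kwargs(mode):
    """Build keyword arguments for client.chat.completions.create()."""
    preset = dict(SAMPLING_PRESETS[mode])  # copy so the preset is not mutated
    top_k = preset.pop("top_k")
    # Non-standard sampling fields go into extra_body.
    return {**preset, "extra_body": {"top_k": top_k}}


print(request_kwargs("non_thinking"))
```

With a client like the one in the OpenAI-compatible API section, the result would be passed as `client.chat.completions.create(model=..., messages=..., **request_kwargs("non_thinking"))`.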

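Output length: use max_new_tokens=32768 for most tasks, and up to 81920 for complex math or coding benchmarks. When serving with vLLM, --max-model-len must be large enough to hold the prompt plus the generated tokens, so the 8192 used in Quick Start has to be raised for these settings.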

Thinking mode (enable via the chat template):

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # extra keyword arguments are forwarded to the chat template
)
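
When the model is served through vLLM's OpenAI-compatible endpoint, the template switch is typically sent per request in a `chat_template_kwargs` field instead. The sketch below only builds the JSON body; `chat_payload` is a hypothetical helper, and it assumes this model's chat template honors `enable_thinking`.

```python
import json


def chat_payload(messages, enable_thinking, model="vrfai/Qwen3.6-27B-NVFP4"):
    """Build a /v1/chat/completions request body with the thinking switch set."""
    return {
        "model": model,
        "messages": messages,
        # Forwarded by the server to the chat template (assumed vLLM behaviour).
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


payload = chat_payload([{"role": "user", "content": "Hello!"}], enable_thinking=False)
print(json.dumps(payload, indent=2))
```

With the `openai` Python client, the same field would go into `extra_body={"chat_template_kwargs": {"enable_thinking": False}}`.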

Credits

Original model: Qwen Team (Alibaba Group)
NVFP4 quantization: vrfai
Quantization framework: vllm-project/llm-compressor

Below is the original model card from Qwen/Qwen3.6-27B:

This repository contains model weights and configuration files for the post-trained model in the Hugging Face Transformers format.

These artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, KTransformers, etc.

Following the February release of the Qwen3.5 series, we're pleased to share the first open-weight variant of Qwen3.6. Built on direct feedback from the community, Qwen3.6 prioritizes stability and real-world utility, offering developers a more intuitive, responsive, and genuinely productive coding experience.

Qwen3.6 Highlights

This release delivers substantial upgrades, particularly in:

- Agentic Coding: the model now handles frontend workflows and repository-level reasoning with greater fluency and precision.
- Thinking Preservation: we've introduced a new option to retain reasoning context from historical messages, streamlining iterative development and reducing overhead.

For more details, please refer to our blog post Qwen3.6-27B.

Model Overview

Type: Causal Language Model with Vision Encoder
Training Stage: Pre-training & Post-training
Language Model
- Number of Parameters: 27B
- Hidden Dimension: 5120
- Token Embedding: 248320 (Padded)
- Number of Layers: 64
- Hidden Layout: 16 × (3 × (Gated DeltaNet → FFN) → 1 × (Gated Attention → FFN))
- Gated DeltaNet:
  - Number of Linear Attention Heads: 48 for V and 16 for QK
  - Head Dimension: 128
- Gated Attention:
  - Number of Attention Heads: 24 for Q and 4 for KV
  - Head Dimension: 256
  - Rotary Position Embedding Dimension: 64
- Feed Forward Network:
  - Intermediate Dimension: 17408
- LM Output: 248320 (Padded)
- MTP: trained with multi-steps
Context Length: 262,144 natively and extensible up to 1,010,000 tokens.
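
The figures above allow a back-of-the-envelope estimate of KV-cache size for one sequence. The sketch assumes that only the 16 Gated Attention layers hold a per-token KV cache (the Gated DeltaNet layers keep a fixed-size recurrent state, which is not counted), that the cache is stored in BF16, and that the vision encoder and MTP are ignored. These are simplifying assumptions, not measured numbers.

```python
FULL_ATTENTION_LAYERS = 16  # one Gated Attention layer per group of four
KV_HEADS = 4
HEAD_DIM = 256
BYTES_PER_VALUE = 2  # assumption: BF16 KV cache


def kv_cache_bytes(num_tokens):
    """Bytes of keys plus values for one sequence of num_tokens tokens."""
    per_token = 2 * FULL_ATTENTION_LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token * num_tokens


for tokens in (8192, 262144):
    print(f"{tokens:>7} tokens -> {kv_cache_bytes(tokens) / 2**30:.2f} GiB")
# prints:
#    8192 tokens -> 0.50 GiB
#  262144 tokens -> 16.00 GiB
```

Under these assumptions each token costs 64 KiB of KV cache, so the native 262,144-token context needs about 16 GiB for a single sequence on top of the weights.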

Citation

@misc{qwen3.6-27b,
    title  = {{Qwen3.6-27B}: Flagship-Level Coding in a {27B} Dense Model},
    author = {{Qwen Team}},
    month  = {April},
    year   = {2026},
    url    = {https://qwen.ai/blog?id=qwen3.6-27b}
}

Safetensors

Model size: 19B params
Tensor types: F32, BF16, F8_E4M3
