DeepSeek-V3.2-NVFP4

An NVFP4 (4-bit floating point) quantized version of DeepSeek-V3.2, with a reference CPU inference implementation.


Model Description

DeepSeek-V3.2 is a 685B-parameter Mixture-of-Experts (MoE) model with 37B parameters activated per token. This quantized version converts the original FP8 weights to the NVFP4 format, giving roughly 7x compression relative to FP32 (about 1.6x relative to the original FP8 checkpoint).

Quantization Details

  • Source format: FP8 E4M3 (128x128 block scales)
  • Target format: NVFP4 E2M1 (16-element block scales)
  • Quantization method: custom FP8-to-NVFP4 converter
  • Original size: approximately 642 GB (FP8)
  • Quantized size: 391 GB (NVFP4)
  • Compression: roughly 7x vs FP32, about 1.6x vs the FP8 original (see the bit count below)
  • Conversion errors: 0
  • Weights converted: 30,769
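
For context on the compression figure, a quick back-of-the-envelope bit count for the quantized tensors (ignoring the tiny per-tensor FP32 global scale and the preserved full-precision components) looks like this:

# 4-bit E2M1 value plus one 8-bit FP8 E4M3 scale shared by every 16 values
bits_per_value = 4 + 8 / 16
print(f"vs FP32: {32 / bits_per_value:.1f}x")  # ~7.1x
print(f"vs FP8:  {8 / bits_per_value:.1f}x")   # ~1.8x before counting preserved tensors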

Preserved Components (Not Quantized)

The following sensitive components are preserved in their original precision to maintain model quality (a sketch of the corresponding skip rule follows the list):

  • Embeddings (model.embed_tokens)
  • Output head (lm_head)
  • MoE router gates (*.mlp.gate)
  • Layer norms and RMS norms
  • DSA indexer weights (indexer.weights_proj, indexer.k_norm)
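
As an illustration only, a skip rule along these lines could decide which tensors are left unquantized. The function name should_quantize and the exact substring checks are assumptions for this sketch, not the converter's actual code (see tools/fp8_to_nvfp4_streaming.py for the real rules).

def should_quantize(name: str) -> bool:
    """Hypothetical predicate: True for tensors that get NVFP4 quantization."""
    skip_substrings = (
        "model.embed_tokens",    # embeddings
        "lm_head",               # output head
        ".mlp.gate.",            # MoE router gates (gate_proj/up_proj are still quantized)
        "norm",                  # layer norms / RMS norms, incl. indexer.k_norm
        "indexer.weights_proj",  # DSA indexer projection
    )
    return not any(s in name for s in skip_substrings)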

Reference Implementation

The inference/ directory contains a functional reference implementation for CPU inference:

Quick Start

cd inference

# Run unit tests (under 30 seconds)
python test_nvfp4_kernel.py

# Run forward pass test (10-15 minutes)
python test_forward_pass.py

# Interactive inference (slow on CPU: 2-5 min/token)
python generate.py \
    --ckpt-path /mnt/models/deepseek-v3.2-nvfp4 \
    --config config_671B_nvfp4.json \
    --interactive \
    --max-new-tokens 10

Implementation Details

  • model.py: DeepSeek-V3.2 architecture with NVFP4 support
  • generate.py: text generation and inference pipeline
  • nvfp4_kernel.py: NVFP4 CPU dequantization kernels
  • kernel.py: FP8 runtime kernels with CPU fallbacks
  • encoding_dsv32.py: DeepSeek-V3.2 message encoding
  • test_*.py: comprehensive test suite

See inference/README.md for complete documentation.


Hardware Requirements

CPU Inference (Reference Implementation)

  • RAM: minimum 400 GB (the 391 GB checkpoint must fit in memory)
  • CPU: Multi-core recommended
  • Performance: Approximately 2-5 minutes per token

GPU Inference (Future)

  • Requires completion of Triton NVFP4 kernels
  • Target: NVIDIA Blackwell GPUs (SM100, SM120)
  • Expected speedup: 100-1000x vs CPU

NVFP4 Format Specification

E2M1 Floating Point

  • 4 bits per value (16 representable values)
  • Values: {0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6}
  • Storage: 2 FP4 values packed per uint8 byte

Dual-Level Scaling

  • Per-block scale: FP8 E4M3, 16 elements per block
  • Global scale: FP32 scalar
  • Formula: value = fp4_value * weight_scale * weight_scale_2, applied after unpacking the two FP4 codes in each byte (see the sketch below)
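
A minimal CPU dequantization sketch under these definitions is shown below. It is illustrative only: the project's actual kernels live in inference/nvfp4_kernel.py, the helper names (unpack_fp4, dequantize_nvfp4) are invented here, and the nibble order is an assumption.

import numpy as np

# The 16 representable E2M1 values, indexed by the 4-bit code
# (sign bit in bit 3, then 2 exponent bits and 1 mantissa bit).
E2M1_LUT = np.array(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
    dtype=np.float32,
)

def unpack_fp4(packed: np.ndarray) -> np.ndarray:
    """Split each uint8 into two 4-bit codes (low nibble first: an assumption)."""
    low = packed & 0x0F
    high = packed >> 4
    return np.stack([low, high], axis=-1).reshape(*packed.shape[:-1], -1)

def dequantize_nvfp4(packed, weight_scale, weight_scale_2):
    """packed: uint8 [M, N//2]; weight_scale: [M, N//16] block scales
    (stored as FP8 E4M3, assumed already cast to float here);
    weight_scale_2: FP32 scalar."""
    codes = unpack_fp4(packed)                    # [M, N] 4-bit codes
    values = E2M1_LUT[codes]                      # raw E2M1 values
    block_scales = np.repeat(weight_scale.astype(np.float32), 16, axis=-1)
    return values * block_scales * np.float32(weight_scale_2)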

Architecture Notes

Multi-head Latent Attention (MLA)

  • Compressed KV cache using latent projection
  • FP8 KV cache for memory efficiency

Sparse Attention (DSA)

  • Indexer class computes attention pattern selection
  • Top-k sparse pattern for efficient long-context attention

Mixture of Experts (MoE)

  • 256 routed experts per layer
  • 1 shared expert per layer
  • Top-8 routing with load balancing (a generic routing sketch follows)
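
For readers unfamiliar with top-k expert routing, here is a generic, heavily simplified sketch of the idea (a softmax gate followed by top-8 selection). It is not DeepSeek-V3.2's actual gate, which also uses a shared expert and more sophisticated load balancing.

import torch

def topk_route(hidden: torch.Tensor, gate_weight: torch.Tensor, k: int = 8):
    """Generic top-k MoE routing sketch (names and details are illustrative).
    hidden:      [tokens, hidden_dim]
    gate_weight: [num_experts, hidden_dim], e.g. 256 routed experts
    Returns chosen expert ids and their normalized routing weights."""
    logits = hidden @ gate_weight.t()               # [tokens, num_experts]
    scores = torch.softmax(logits, dim=-1)
    topk_scores, topk_ids = scores.topk(k, dim=-1)  # keep the 8 best experts per token
    topk_weights = topk_scores / topk_scores.sum(dim=-1, keepdim=True)
    return topk_ids, topk_weights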

Conversion Process

This model was converted using a custom FP8 to NVFP4 streaming converter:

  1. Dequantize: FP8 E4M3 weights to FP32 (using 128x128 block inverse scales)
  2. Compute NVFP4 scales:
    • Global scale: scale_2 = amax / (6.0 * 448.0)
    • Per-block scale: scale = block_amax / (6.0 * scale_2)
  3. Quantize: FP32 to NVFP4 E2M1 (16-element blocks)
  4. Pack: Two FP4 values per uint8 byte

Note: For vLLM's fused MoE kernels, gate_proj (w1) and up_proj (w3) within each expert must share the same weight_scale_2. The converter handles this by computing a joint amax across both tensors to derive the shared global scale.

See tools/fp8_to_nvfp4_streaming.py for the complete conversion implementation; a simplified quantization sketch follows.
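
As a rough illustration of steps 2-4 (not the streaming converter itself), the per-tensor quantization could look like the sketch below. Function and variable names are invented for the example, and the FP8 E4M3 cast of the block scales is approximated with plain float32.

import numpy as np

# Representable non-negative E2M1 magnitudes; the sign goes into bit 3 of the nibble.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def quantize_nvfp4(w: np.ndarray):
    """Quantize a dequantized FP32 weight [M, N] (N divisible by 16) to NVFP4.
    Returns (packed uint8 [M, N//2], block scales [M, N//16], global scale).
    Illustrative only: the real converter also rounds the block scales to FP8 E4M3
    and, for vLLM's fused MoE, derives scale_2 from a joint amax over w1 and w3."""
    m, n = w.shape
    amax = max(float(np.abs(w).max()), 1e-12)
    scale_2 = amax / (6.0 * 448.0)                            # step 2: global FP32 scale
    blocks = w.reshape(m, n // 16, 16)
    block_amax = np.abs(blocks).max(axis=-1)
    scale = np.maximum(block_amax / (6.0 * scale_2), 1e-12)   # step 2: per-block scales
    scaled = blocks / (scale * scale_2)[..., None]            # map each block into [-6, 6]
    mags = np.abs(scaled)
    codes = np.abs(mags[..., None] - E2M1_GRID).argmin(axis=-1).astype(np.uint8)  # step 3
    codes |= (scaled < 0).astype(np.uint8) << 3               # set the sign bit
    codes = codes.reshape(m, n)
    # Step 4: two 4-bit codes per byte (nibble order is an assumption of this sketch).
    packed = (codes[:, 0::2] & 0x0F) | (codes[:, 1::2] << 4)
    return packed, scale.astype(np.float32), np.float32(scale_2)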

Tensor Format

For each quantized weight, three tensors are stored (see the loading example after this list):

  • *.weight: Packed uint8 [M, N/2]
  • *.weight_scale: FP8 E4M3 per-block scale [M, N/16]
  • *.weight_scale_2: FP32 global scale [1]
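
A quick way to confirm this layout is to open one checkpoint shard with the safetensors library and inspect the three tensors of any quantized projection. The shard filename and tensor name below are placeholders; substitute real ones from the checkpoint.

from safetensors import safe_open

shard = "/mnt/models/deepseek-v3.2-nvfp4/model-00001-of-000073.safetensors"  # placeholder
name = "model.layers.3.mlp.experts.0.gate_proj"                              # placeholder

with safe_open(shard, framework="pt") as f:
    w = f.get_tensor(f"{name}.weight")           # uint8, [M, N/2] packed FP4 codes
    s = f.get_tensor(f"{name}.weight_scale")     # FP8 E4M3, [M, N/16]
    s2 = f.get_tensor(f"{name}.weight_scale_2")  # FP32, [1]
    print(w.shape, w.dtype, s.shape, s.dtype, s2.item())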

Validation

Comprehensive testing completed:

  • NVFP4 kernel unit tests: PASS
  • Model loading: PASS (73 shards, 391 GB)
  • Forward pass: PASS (valid outputs, no NaN/Inf)
  • Output quality: Coherent, semantically correct responses

See conversion_report.json for detailed conversion statistics.


Acknowledgments

  • Original model by DeepSeek AI
  • NVFP4 format based on NVIDIA TensorRT Model Optimizer

License

This model inherits the MIT License from the original DeepSeek-V3.2 model.


Citation

@misc{deepseekai2025deepseekv32,
      title={DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models},
      author={DeepSeek-AI},
      year={2025},
}

Contact

For issues with the quantized version or reference implementation, please open an issue.

For questions about the original model, contact DeepSeek AI.
