DeepSeek-V3.2-NVFP4

An NVFP4 (4-bit floating point) quantized version of DeepSeek-V3.2, with a reference CPU inference implementation.


Model Description

DeepSeek-V3.2 is a 685B-parameter Mixture-of-Experts (MoE) model with 37B parameters activated per token. This quantized version converts the original FP8 weights to the NVFP4 format, giving roughly 7x compression relative to FP32 (about 1.6x relative to the original FP8 checkpoint).

Quantization Details

  • Source format: FP8 E4M3 (128x128 block scales)
  • Target format: NVFP4 E2M1 (16-element block scales)
  • Quantization method: custom FP8-to-NVFP4 converter
  • Original size: approximately 642 GB (FP8)
  • Quantized size: 391 GB (NVFP4)
  • Compression: roughly 7x vs FP32, about 1.6x vs the FP8 original (see the bit count below)
  • Conversion errors: 0
  • Weights converted: 30,769
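
For context on the compression figure, a quick back-of-the-envelope bit count for the quantized tensors (ignoring the tiny per-tensor FP32 global scale and the preserved full-precision components) looks like this:

# 4-bit E2M1 value plus one 8-bit FP8 E4M3 scale shared by every 16 values
bits_per_value = 4 + 8 / 16
print(f"vs FP32: {32 / bits_per_value:.1f}x")  # ~7.1x
print(f"vs FP8:  {8 / bits_per_value:.1f}x")   # ~1.8x before counting preserved tensors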

Preserved Components (Not Quantized)

The following sensitive components are preserved in their original precision to maintain model quality (a sketch of the corresponding skip rule follows the list):

  • Embeddings (model.embed_tokens)
  • Output head (lm_head)
  • MoE router gates (*.mlp.gate)
  • Layer norms and RMS norms
  • DSA indexer weights (indexer.weights_proj, indexer.k_norm)
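
As an illustration only, a skip rule along these lines could decide which tensors are left unquantized. The function name should_quantize and the exact substring checks are assumptions for this sketch, not the converter's actual code (see tools/fp8_to_nvfp4_streaming.py for the real rules).

def should_quantize(name: str) -> bool:
    """Hypothetical predicate: True for tensors that get NVFP4 quantization."""
    skip_substrings = (
        "model.embed_tokens",    # embeddings
        "lm_head",               # output head
        ".mlp.gate.",            # MoE router gates (gate_proj/up_proj are still quantized)
        "norm",                  # layer norms / RMS norms, incl. indexer.k_norm
        "indexer.weights_proj",  # DSA indexer projection
    )
    return not any(s in name for s in skip_substrings)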

Reference Implementation

The inference/ directory contains a functional reference implementation for CPU inference:

Quick Start

cd inference

# Run unit tests (under 30 seconds)
python test_nvfp4_kernel.py

# Run forward pass test (10-15 minutes)
python test_forward_pass.py

# Interactive inference (slow on CPU: 2-5 min/token)
python generate.py \
    --ckpt-path /mnt/models/deepseek-v3.2-nvfp4 \
    --config config_671B_nvfp4.json \
    --interactive \
    --max-new-tokens 10

Implementation Details

  • model.py: DeepSeek-V3.2 architecture with NVFP4 support
  • generate.py: text generation and inference pipeline
  • nvfp4_kernel.py: NVFP4 CPU dequantization kernels
  • kernel.py: FP8 runtime kernels with CPU fallbacks
  • encoding_dsv32.py: DeepSeek-V3.2 message encoding
  • test_*.py: comprehensive test suite

See inference/README.md for complete documentation.


Hardware Requirements

CPU Inference (Reference Implementation)

  • RAM: minimum 400 GB (the 391 GB checkpoint must fit in memory)
  • CPU: Multi-core recommended
  • Performance: Approximately 2-5 minutes per token

GPU Inference (Future)

  • Requires completion of Triton NVFP4 kernels
  • Target: NVIDIA Blackwell GPUs (SM100, SM120)
  • Expected speedup: 100-1000x vs CPU

NVFP4 Format Specification

E2M1 Floating Point

  • 4 bits per value (16 representable values)
  • Values: {0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6}
  • Storage: 2 FP4 values packed per uint8 byte

Dual-Level Scaling

  • Per-block scale: FP8 E4M3, 16 elements per block
  • Global scale: FP32 scalar
  • Formula: value = fp4_value * weight_scale * weight_scale_2, applied after unpacking the two FP4 codes in each byte (see the sketch below)
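
A minimal CPU dequantization sketch under these definitions is shown below. It is illustrative only: the project's actual kernels live in inference/nvfp4_kernel.py, the helper names (unpack_fp4, dequantize_nvfp4) are invented here, and the nibble order is an assumption.

import numpy as np

# The 16 representable E2M1 values, indexed by the 4-bit code
# (sign bit in bit 3, then 2 exponent bits and 1 mantissa bit).
E2M1_LUT = np.array(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
    dtype=np.float32,
)

def unpack_fp4(packed: np.ndarray) -> np.ndarray:
    """Split each uint8 into two 4-bit codes (low nibble first: an assumption)."""
    low = packed & 0x0F
    high = packed >> 4
    return np.stack([low, high], axis=-1).reshape(*packed.shape[:-1], -1)

def dequantize_nvfp4(packed, weight_scale, weight_scale_2):
    """packed: uint8 [M, N//2]; weight_scale: [M, N//16] block scales
    (stored as FP8 E4M3, assumed already cast to float here);
    weight_scale_2: FP32 scalar."""
    codes = unpack_fp4(packed)                    # [M, N] 4-bit codes
    values = E2M1_LUT[codes]                      # raw E2M1 values
    block_scales = np.repeat(weight_scale.astype(np.float32), 16, axis=-1)
    return values * block_scales * np.float32(weight_scale_2)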

Architecture Notes

Multi-head Latent Attention (MLA)

  • Compressed KV cache using latent projection
  • FP8 KV cache for memory efficiency

Sparse Attention (DSA)

  • Indexer class computes attention pattern selection
  • Top-k sparse pattern for efficient long-context attention

Mixture of Experts (MoE)

  • 256 routed experts per layer
  • 1 shared expert per layer
  • Top-8 routing with load balancing (a generic routing sketch follows)
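
For readers unfamiliar with top-k expert routing, here is a generic, heavily simplified sketch of the idea (a softmax gate followed by top-8 selection). It is not DeepSeek-V3.2's actual gate, which also uses a shared expert and more sophisticated load balancing.

import torch

def topk_route(hidden: torch.Tensor, gate_weight: torch.Tensor, k: int = 8):
    """Generic top-k MoE routing sketch (names and details are illustrative).
    hidden:      [tokens, hidden_dim]
    gate_weight: [num_experts, hidden_dim], e.g. 256 routed experts
    Returns chosen expert ids and their normalized routing weights."""
    logits = hidden @ gate_weight.t()               # [tokens, num_experts]
    scores = torch.softmax(logits, dim=-1)
    topk_scores, topk_ids = scores.topk(k, dim=-1)  # keep the 8 best experts per token
    topk_weights = topk_scores / topk_scores.sum(dim=-1, keepdim=True)
    return topk_ids, topk_weights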

Conversion Process

This model was converted using a custom FP8 to NVFP4 streaming converter:

  1. Dequantize: FP8 E4M3 weights to FP32 (using 128x128 block inverse scales)
  2. Compute NVFP4 scales:
    • Global scale: scale_2 = amax / (6.0 * 448.0)
    • Per-block scale: scale = block_amax / (6.0 * scale_2)
  3. Quantize: FP32 to NVFP4 E2M1 (16-element blocks)
  4. Pack: Two FP4 values per uint8 byte

Note: For vLLM's fused MoE kernels, gate_proj (w1) and up_proj (w3) within each expert must share the same weight_scale_2. The converter handles this by computing a joint amax across both tensors to derive the shared global scale.

See tools/fp8_to_nvfp4_streaming.py for the complete conversion implementation; a simplified quantization sketch follows.
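
As a rough illustration of steps 2-4 (not the streaming converter itself), the per-tensor quantization could look like the sketch below. Function and variable names are invented for the example, and the FP8 E4M3 cast of the block scales is approximated with plain float32.

import numpy as np

# Representable non-negative E2M1 magnitudes; the sign goes into bit 3 of the nibble.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def quantize_nvfp4(w: np.ndarray):
    """Quantize a dequantized FP32 weight [M, N] (N divisible by 16) to NVFP4.
    Returns (packed uint8 [M, N//2], block scales [M, N//16], global scale).
    Illustrative only: the real converter also rounds the block scales to FP8 E4M3
    and, for vLLM's fused MoE, derives scale_2 from a joint amax over w1 and w3."""
    m, n = w.shape
    amax = max(float(np.abs(w).max()), 1e-12)
    scale_2 = amax / (6.0 * 448.0)                            # step 2: global FP32 scale
    blocks = w.reshape(m, n // 16, 16)
    block_amax = np.abs(blocks).max(axis=-1)
    scale = np.maximum(block_amax / (6.0 * scale_2), 1e-12)   # step 2: per-block scales
    scaled = blocks / (scale * scale_2)[..., None]            # map each block into [-6, 6]
    mags = np.abs(scaled)
    codes = np.abs(mags[..., None] - E2M1_GRID).argmin(axis=-1).astype(np.uint8)  # step 3
    codes |= (scaled < 0).astype(np.uint8) << 3               # set the sign bit
    codes = codes.reshape(m, n)
    # Step 4: two 4-bit codes per byte (nibble order is an assumption of this sketch).
    packed = (codes[:, 0::2] & 0x0F) | (codes[:, 1::2] << 4)
    return packed, scale.astype(np.float32), np.float32(scale_2)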

Tensor Format

For each quantized weight, three tensors are stored (see the loading example after this list):

  • *.weight: Packed uint8 [M, N/2]
  • *.weight_scale: FP8 E4M3 per-block scale [M, N/16]
  • *.weight_scale_2: FP32 global scale [1]
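
A quick way to confirm this layout is to open one checkpoint shard with the safetensors library and inspect the three tensors of any quantized projection. The shard filename and tensor name below are placeholders; substitute real ones from the checkpoint.

from safetensors import safe_open

shard = "/mnt/models/deepseek-v3.2-nvfp4/model-00001-of-000073.safetensors"  # placeholder
name = "model.layers.3.mlp.experts.0.gate_proj"                              # placeholder

with safe_open(shard, framework="pt") as f:
    w = f.get_tensor(f"{name}.weight")           # uint8, [M, N/2] packed FP4 codes
    s = f.get_tensor(f"{name}.weight_scale")     # FP8 E4M3, [M, N/16]
    s2 = f.get_tensor(f"{name}.weight_scale_2")  # FP32, [1]
    print(w.shape, w.dtype, s.shape, s.dtype, s2.item())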

Validation

Comprehensive testing completed:

  • NVFP4 kernel unit tests: PASS
  • Model loading: PASS (73 shards, 391 GB)
  • Forward pass: PASS (valid outputs, no NaN/Inf)
  • Output quality: Coherent, semantically correct responses

See conversion_report.json for detailed conversion statistics.


Acknowledgments

  • Original model by DeepSeek AI
  • NVFP4 format based on NVIDIA TensorRT Model Optimizer

License

This model inherits the MIT License from the original DeepSeek-V3.2 model.


Citation

@misc{deepseekai2025deepseekv32,
      title={DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models},
      author={DeepSeek-AI},
      year={2025},
}

Contact

For issues with the quantized version or reference implementation, please open an issue.

For questions about the original model, contact DeepSeek AI.
