# MiniMax-M2.1-Quark-W8A8-INT8
W8A8 INT8 quantized version of MiniMaxAI/MiniMax-M2.1 (456B MoE) using AMD Quark.
## Model Details

| Property | Value |
|---|---|
| Base Model | MiniMaxAI/MiniMax-M2.1 |
| Architecture | MoE (Mixture of Experts), 62 layers, 256 experts, top-8 routing |
| Parameters | 456B total, ~45.9B active |
| Quantization | W8A8 INT8 (per-channel weight + per-token dynamic activation) |
| Quantizer | AMD Quark v0.11.1 (ptpc_int8 scheme) |
| Model Size | 215 GB (47 safetensors shards) |
| Original Size | 427 GB (BF16) |
| Compression | ~2x size reduction |
## Quantization Scheme

| Component | Dtype | Granularity | Mode |
|---|---|---|---|
| Weight | INT8 | per-channel (ch_axis=0) | symmetric, static |
| Activation | INT8 | per-token (ch_axis=1) | symmetric, dynamic |
| lm_head | BF16 | — | unquantized |
| MoE gates | BF16 | — | unquantized |
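The scheme above can be sketched in a few lines of NumPy. This is an illustrative sketch of per-channel/per-token symmetric INT8 quantization, not Quark's actual implementation; the tensor shapes and the ±127 clipping range are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 32)).astype(np.float32)  # weight [out_channels, in_channels]
x = rng.standard_normal((4, 32)).astype(np.float32)   # activations [tokens, in_channels]

def quant_weight_per_channel(w):
    # One symmetric scale per output channel (ch_axis=0); computed once, offline (static).
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def quant_act_per_token(x):
    # One symmetric scale per token (row); computed at inference time (dynamic).
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

q_w, s_w = quant_weight_per_channel(w)
q_x, s_x = quant_act_per_token(x)

# INT8 GEMM in int32 accumulators, then dequantize with the
# outer product of the per-token and per-channel scale vectors.
y = (q_x.astype(np.int32) @ q_w.astype(np.int32).T).astype(np.float32) * s_x * s_w.T
```

For random data like this, `y` typically agrees with the full-precision `x @ w.T` to within about 1% relative error, which is why W8A8 recovers accuracy so well.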
## Evaluation Results
Accuracy compared with the official BF16 model:
| Benchmark | BF16 (official) | Quark W8A8 INT8 | Recovery |
|---|---|---|---|
| MMLU (5-shot) | 86.2% | 85.48% | 99.2% |
| GSM8K (8-shot) | 92.0% | 93.7% | 101.8% |
Quantization preserves >99% of the original model's accuracy across both benchmarks.
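The recovery column is simply the quantized score divided by the BF16 score. A quick check of the figures above:

```python
def recovery(quant_pct, bf16_pct):
    # Recovery (%) = quantized accuracy / BF16 accuracy, rounded to one decimal.
    return round(100 * quant_pct / bf16_pct, 1)

mmlu = recovery(85.48, 86.2)   # 99.2
gsm8k = recovery(93.7, 92.0)   # 101.8 (quantized model slightly above baseline)
```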
## How to Use

### With vLLM (Recommended)

> ⚠️ Requires vLLM with Quark W8A8 INT8 MoE support. See PR #XXXX or use the branch `feat/quark-w8a8-int8-moe`.

```bash
# Start the server
VLLM_WORKER_MULTIPROC_METHOD=spawn python -m vllm.entrypoints.openai.api_server \
    --model nameistoken/MiniMax-M2.1-Quark-W8A8-INT8 \
    --tensor-parallel-size 8 \
    --trust-remote-code \
    --max-model-len 8192
```
```bash
# Send a chat completion request
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "nameistoken/MiniMax-M2.1-Quark-W8A8-INT8",
    "messages": [{"role": "user", "content": "Hello! What is the capital of France?"}],
    "max_tokens": 256,
    "temperature": 0.7
}'
```
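The same request can be sent from Python using only the standard library. The payload mirrors the curl example above; the server must already be running on the assumed default address.

```python
import json
import urllib.request

# Same request body as the curl example above.
payload = {
    "model": "nameistoken/MiniMax-M2.1-Quark-W8A8-INT8",
    "messages": [{"role": "user", "content": "Hello! What is the capital of France?"}],
    "max_tokens": 256,
    "temperature": 0.7,
}

def chat(base_url="http://localhost:8000"):
    # POST to the OpenAI-compatible endpoint exposed by vLLM.
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    return body["choices"][0]["message"]["content"]
```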
## Hardware Requirements
- Minimum: 8x GPUs with ≥48 GB VRAM each (e.g., AMD W7900D, NVIDIA A100-80G)
- Tensor Parallelism: TP=8 required for 215 GB model
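A rough budget explains the per-GPU requirement, assuming the 215 GB of weights shard evenly across the tensor-parallel group (KV cache, activations, and framework overhead consume the remainder):

```python
# Back-of-envelope per-GPU memory under TP=8 (weights only; ignores KV cache
# and activation overhead, which is why >=48 GB per GPU is recommended).
model_gb = 215
tp = 8
weights_per_gpu_gb = model_gb / tp        # ~26.9 GB of weights per GPU
headroom_gb = 48 - weights_per_gpu_gb     # left for KV cache, activations, etc.
```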
## Quantization Details

This model was quantized using AMD Quark's ptpc_int8 (Per-Token Per-Channel INT8) scheme, which is equivalent to the W8A8 scheme in llmcompressor:

- Weight quantization: INT8 per-channel (one scale per output channel), symmetric, static
- Activation quantization: INT8 per-token (one scale per token), symmetric, dynamic (computed at inference time)
- Excluded layers: lm_head (to preserve output quality) and all MoE gate layers (to preserve routing precision)
- Calibration: 128 random samples, max sequence length 4096
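The exclusion list is passed to Quark as glob patterns. As an illustration of how such patterns select layers by name, here is a minimal sketch using Python's `fnmatch`; Quark's internal matching logic may differ.

```python
from fnmatch import fnmatch

# The same exclusion globs passed via --exclude_layers.
EXCLUDE_PATTERNS = ["lm_head", "*block_sparse_moe.gate*"]

def is_excluded(layer_name: str) -> bool:
    # A layer stays in BF16 if any exclusion pattern matches its full name.
    return any(fnmatch(layer_name, p) for p in EXCLUDE_PATTERNS)
```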
## Reproduce Quantization

```bash
# Install AMD Quark and dependencies
pip install quark transformers datasets torch

# Run quantization
cd Quark/examples/torch/language_modeling/llm_ptq
python quantize_quark_ptpc_int8.py \
    --model_dir /path/to/MiniMax-M2.1-bf16 \
    --quant_scheme ptpc_int8 \
    --num_calib_data 128 \
    --exclude_layers lm_head "*block_sparse_moe.gate*" \
    --skip_evaluation \
    --multi_gpu --multi_device \
    --trust_remote_code \
    --model_export hf_format \
    --output_dir /path/to/output
```
## Performance

Throughput benchmarks on 8x AMD W7900D (48 GB each), vLLM with TP=8:

| Input | Output | Batch | TTFT (ms) | System throughput (tok/s) |
|---|---|---|---|---|
| 128 | 128 | 1 | 68.5 | 24.1 |
| 128 | 128 | 32 | 254.7 | 504.6 |
| 1024 | 256 | 32 | 170.8 | 508.6 |
| 1024 | 1024 | 32 | 169.5 | 509.8 |
| 2048 | 1024 | 32 | 276.5 | 537.5 |
- Peak throughput: 537.5 tok/s (in=2048, out=1024, batch=32)
- TTFT range: 51.5–547.0 ms
- Scaling: 20.9x throughput increase from batch=1 to batch=32
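The scaling figure follows directly from the 128-in/128-out rows of the table:

```python
# System throughput at batch=32 vs. batch=1 (128-in/128-out configuration).
speedup = round(504.6 / 24.1, 1)  # 20.9x
```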
## Citation

If you use this model, please cite the original MiniMax-M2.1 model:

```bibtex
@misc{minimax2025minimax01,
  title={MiniMax-M2.1: Scaling Test-Time Compute Efficiently with Open-Source Models},
  author={MiniMax},
  year={2025},
  url={https://huggingface.co/MiniMaxAI/MiniMax-M2.1}
}
```
## License
This model inherits the MiniMax Open Model License from the base model.