# MiniMax-M2.1-Quark-W8A8-INT8
W8A8 INT8 quantized version of MiniMaxAI/MiniMax-M2.1 (456B MoE) using AMD Quark.
## Model Details

| Property | Value |
|---|---|
| Base Model | MiniMaxAI/MiniMax-M2.1 |
| Architecture | MoE (Mixture of Experts), 62 layers, 256 experts, top-8 routing |
| Parameters | 456B total, ~45.9B active |
| Quantization | W8A8 INT8 (per-channel weight + per-token dynamic activation) |
| Quantizer | AMD Quark v0.11.1 (ptpc_int8 scheme) |
| Model Size | 215 GB (47 safetensors shards) |
| Original Size | 427 GB (BF16) |
| Compression | ~2x size reduction |
## Quantization Scheme

| Component | Dtype | Granularity | Mode |
|---|---|---|---|
| Weight | INT8 | per-channel (ch_axis=0) | symmetric, static |
| Activation | INT8 | per-token (ch_axis=1) | symmetric, dynamic |
| lm_head | BF16 | — | unquantized |
| MoE gates | BF16 | — | unquantized |
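The scheme above can be sketched in a few lines of NumPy. This is an illustrative sketch of per-channel/per-token symmetric INT8 quantization, not Quark's actual implementation; the tensor shapes and the ±127 clipping range are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 32)).astype(np.float32)  # weight [out_channels, in_channels]
x = rng.standard_normal((4, 32)).astype(np.float32)   # activations [tokens, in_channels]

def quant_weight_per_channel(w):
    # One symmetric scale per output channel (ch_axis=0); computed once, offline (static).
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def quant_act_per_token(x):
    # One symmetric scale per token (row); computed at inference time (dynamic).
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

q_w, s_w = quant_weight_per_channel(w)
q_x, s_x = quant_act_per_token(x)

# INT8 GEMM in int32 accumulators, then dequantize with the
# outer product of the per-token and per-channel scale vectors.
y = (q_x.astype(np.int32) @ q_w.astype(np.int32).T).astype(np.float32) * s_x * s_w.T
```

For random data like this, `y` typically agrees with the full-precision `x @ w.T` to within about 1% relative error, which is why W8A8 recovers accuracy so well.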
## Evaluation Results
Accuracy compared with the official BF16 model:
| Benchmark | BF16 (official) | Quark W8A8 INT8 | Recovery |
|---|---|---|---|
| MMLU (5-shot) | 86.2% | 85.48% | 99.2% |
| GSM8K (8-shot) | 92.0% | 93.7% | 101.8% |
Quantization preserves >99% of the original model's accuracy across both benchmarks.
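The recovery column is simply the quantized score divided by the BF16 score. A quick check of the figures above:

```python
def recovery(quant_pct, bf16_pct):
    # Recovery (%) = quantized accuracy / BF16 accuracy, rounded to one decimal.
    return round(100 * quant_pct / bf16_pct, 1)

mmlu = recovery(85.48, 86.2)   # 99.2
gsm8k = recovery(93.7, 92.0)   # 101.8 (quantized model slightly above baseline)
```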
## How to Use

### With vLLM (Recommended)

> ⚠️ Requires vLLM with Quark W8A8 INT8 MoE support. See PR #XXXX or use the branch `feat/quark-w8a8-int8-moe`.

```bash
# Start the server
VLLM_WORKER_MULTIPROC_METHOD=spawn python -m vllm.entrypoints.openai.api_server \
    --model nameistoken/MiniMax-M2.1-Quark-W8A8-INT8 \
    --tensor-parallel-size 8 \
    --trust-remote-code \
    --max-model-len 8192
```
```bash
# Send a chat completion request
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "nameistoken/MiniMax-M2.1-Quark-W8A8-INT8",
    "messages": [{"role": "user", "content": "Hello! What is the capital of France?"}],
    "max_tokens": 256,
    "temperature": 0.7
}'
```
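The same request can be sent from Python using only the standard library. The payload mirrors the curl example above; the server must already be running on the assumed default address.

```python
import json
import urllib.request

# Same request body as the curl example above.
payload = {
    "model": "nameistoken/MiniMax-M2.1-Quark-W8A8-INT8",
    "messages": [{"role": "user", "content": "Hello! What is the capital of France?"}],
    "max_tokens": 256,
    "temperature": 0.7,
}

def chat(base_url="http://localhost:8000"):
    # POST to the OpenAI-compatible endpoint exposed by vLLM.
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    return body["choices"][0]["message"]["content"]
```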
## Hardware Requirements
- Minimum: 8x GPUs with ≥48 GB VRAM each (e.g., AMD W7900D, NVIDIA A100-80G)
- Tensor Parallelism: TP=8 required for 215 GB model
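A rough budget explains the per-GPU requirement, assuming the 215 GB of weights shard evenly across the tensor-parallel group (KV cache, activations, and framework overhead consume the remainder):

```python
# Back-of-envelope per-GPU memory under TP=8 (weights only; ignores KV cache
# and activation overhead, which is why >=48 GB per GPU is recommended).
model_gb = 215
tp = 8
weights_per_gpu_gb = model_gb / tp        # ~26.9 GB of weights per GPU
headroom_gb = 48 - weights_per_gpu_gb     # left for KV cache, activations, etc.
```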
## Quantization Details

This model was quantized using AMD Quark's ptpc_int8 (Per-Token Per-Channel INT8) scheme, which is equivalent to the W8A8 scheme in llmcompressor:

- Weight quantization: INT8 per-channel (one scale per output channel), symmetric, static
- Activation quantization: INT8 per-token (one scale per token), symmetric, dynamic (computed at inference time)
- Excluded layers: lm_head (to preserve output quality) and all MoE gate layers (to preserve routing precision)
- Calibration: 128 random samples, max sequence length 4096
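The exclusion list is passed to Quark as glob patterns. As an illustration of how such patterns select layers by name, here is a minimal sketch using Python's `fnmatch`; Quark's internal matching logic may differ.

```python
from fnmatch import fnmatch

# The same exclusion globs passed via --exclude_layers.
EXCLUDE_PATTERNS = ["lm_head", "*block_sparse_moe.gate*"]

def is_excluded(layer_name: str) -> bool:
    # A layer stays in BF16 if any exclusion pattern matches its full name.
    return any(fnmatch(layer_name, p) for p in EXCLUDE_PATTERNS)
```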
## Reproduce Quantization

```bash
# Install AMD Quark and dependencies
pip install quark transformers datasets torch

# Run quantization
cd Quark/examples/torch/language_modeling/llm_ptq
python quantize_quark_ptpc_int8.py \
    --model_dir /path/to/MiniMax-M2.1-bf16 \
    --quant_scheme ptpc_int8 \
    --num_calib_data 128 \
    --exclude_layers lm_head "*block_sparse_moe.gate*" \
    --skip_evaluation \
    --multi_gpu --multi_device \
    --trust_remote_code \
    --model_export hf_format \
    --output_dir /path/to/output
```
## Performance

Throughput benchmarks on 8x AMD W7900D (48 GB each), vLLM with TP=8:

| Input | Output | Batch | TTFT (ms) | System throughput (tok/s) |
|---|---|---|---|---|
| 128 | 128 | 1 | 68.5 | 24.1 |
| 128 | 128 | 32 | 254.7 | 504.6 |
| 1024 | 256 | 32 | 170.8 | 508.6 |
| 1024 | 1024 | 32 | 169.5 | 509.8 |
| 2048 | 1024 | 32 | 276.5 | 537.5 |
- Peak throughput: 537.5 tok/s (in=2048, out=1024, batch=32)
- TTFT range: 51.5–547.0 ms
- Scaling: 20.9x throughput increase from batch=1 to batch=32
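The scaling figure follows directly from the 128-in/128-out rows of the table:

```python
# System throughput at batch=32 vs. batch=1 (128-in/128-out configuration).
speedup = round(504.6 / 24.1, 1)  # 20.9x
```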
## Citation

If you use this model, please cite the original MiniMax-M2.1 model:

```bibtex
@misc{minimax2025minimax01,
  title={MiniMax-M2.1: Scaling Test-Time Compute Efficiently with Open-Source Models},
  author={MiniMax},
  year={2025},
  url={https://huggingface.co/MiniMaxAI/MiniMax-M2.1}
}
```
## License
This model inherits the MiniMax Open Model License from the base model.