WeDLM-8B-Instruct ⭐

WeDLM-8B-Instruct is our flagship instruction-tuned diffusion language model. Fine-tuned from WeDLM-8B, it performs parallel decoding under standard causal attention.

Highlights:

  • πŸš€ 3-6Γ— faster than vLLM-optimized Qwen3-8B on math reasoning tasks
  • πŸ“ˆ Outperforms base Qwen3-8B-Instruct on most benchmarks
  • βœ… Native KV cache compatible (FlashAttention, PagedAttention, CUDA Graphs)

For the base (pretrained) version, see WeDLM-8B, which is based on Qwen3-8B-Base.

πŸ“„ Paper | 🌐 Project Page | πŸ’» GitHub

Model Details

| Attribute | Value |
|---|---|
| Base Model | WeDLM-8B |
| Parameters | 8B |
| Context Length | 32,768 tokens |
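
If you want to confirm the advertised context window programmatically, here is a minimal sketch using the HF config; it assumes the remote-code config exposes the usual max_position_embeddings field:

from transformers import AutoConfig

# Load the model config; trust_remote_code is required for WeDLM's custom code
config = AutoConfig.from_pretrained("tencent/WeDLM-8B-Instruct", trust_remote_code=True)

# Qwen3-style configs expose the context window as max_position_embeddings (assumption)
print(getattr(config, "max_position_embeddings", None))  # expected: 32768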

Installation

git clone https://github.com/tencent/WeDLM.git
cd WeDLM && bash install.sh

Manual Installation

# Step 1: PyTorch
pip install torch==2.8.0+cu129 --index-url https://download.pytorch.org/whl/cu129

# Step 2: flash-attn build dependencies
pip install psutil ninja packaging

# Step 3: flash-attn (requires torch first)
pip install flash-attn==2.7.4.post1 --no-build-isolation

# Step 4: WeDLM
git clone https://github.com/tencent/WeDLM.git
cd WeDLM && pip install -e .

Docker Installation

# Pull the Docker image
docker pull aiweiliu/wedlm:v3

# Run the container with GPU support
docker run -it --gpus all -p 8080:8080 --name wedlm aiweiliu/wedlm:v3 /bin/bash

# Inside the container, run inference directly
python example.py --model tencent/WeDLM-8B-Instruct

Note: flash-attn requires compilation and must be installed after PyTorch. The install.sh script handles this automatically (default: CUDA 12.9). For other CUDA versions, set the CUDA_VERSION variable, e.g. CUDA_VERSION=cu124 bash install.sh.
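
After installing, a quick sanity check (a minimal sketch, not part of the official install flow) confirms that PyTorch sees your GPU and that flash-attn imports cleanly against the installed torch build:

import torch
import flash_attn

# PyTorch build, the CUDA toolkit it was compiled against, and GPU visibility
print(torch.__version__, torch.version.cuda, torch.cuda.is_available())

# flash-attn version (must have been built after PyTorch was installed)
print(flash_attn.__version__)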

Quick Start (Recommended)

For fast inference, use the wedlm engine:

from transformers import AutoTokenizer
from wedlm import LLM, SamplingParams

llm = LLM(model="tencent/WeDLM-8B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("tencent/WeDLM-8B-Instruct", trust_remote_code=True)

prompt = "Solve step by step: A store sells apples for $2 each and oranges for $3 each. Tom bought 5 apples and 4 oranges. How much did he spend?"
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([text], SamplingParams(temperature=0.2, max_tokens=512))
print(outputs[0]["text"])

Multi-turn Conversation

messages = [
    {"role": "user", "content": "What is the derivative of x^2?"},
    {"role": "assistant", "content": "The derivative of xΒ² is 2x."},
    {"role": "user", "content": "What about x^3?"}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = llm.generate([text], SamplingParams(temperature=0.2, max_tokens=256))
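
To continue the dialogue, append the generated reply to the message history before the next user turn. A minimal sketch using only the calls shown above (the follow-up question is illustrative):

# Append the model's reply, then ask a follow-up question
messages.append({"role": "assistant", "content": outputs[0]["text"]})
messages.append({"role": "user", "content": "And the derivative of sin(x)?"})

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = llm.generate([text], SamplingParams(temperature=0.2, max_tokens=256))
print(outputs[0]["text"])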

Batch Inference

prompts = [
    "Explain quantum entanglement simply.",
    "Write a Python function to check if a number is prime.",
    "What are the main causes of climate change?"
]
messages_batch = [[{"role": "user", "content": p}] for p in prompts]
texts = [tokenizer.apply_chat_template(m, tokenize=False, add_generation_prompt=True) for m in messages_batch]

outputs = llm.generate(texts, SamplingParams(temperature=0.2, max_tokens=512))
for i, output in enumerate(outputs):
    print(f"=== Response {i+1} ===\n{output['text']}\n")

HuggingFace Transformers

For training or simple forward passes:

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("tencent/WeDLM-8B-Instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "tencent/WeDLM-8B-Instruct", 
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto"
)

messages = [{"role": "user", "content": "Hello!"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model(**inputs)

⚠️ Note: The HuggingFace interface is for training/forward pass convenience. For optimized inference throughput, use the wedlm engine above.
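
For a quick look at what the forward pass returns, here is a minimal sketch inspecting the logits; it assumes the remote-code model follows the standard CausalLMOutput convention with a .logits field:

import torch

# Re-run the forward pass without gradients and inspect the raw outputs
with torch.no_grad():
    outputs = model(**inputs)

# Per-position logits over the vocabulary: [batch_size, seq_len, vocab_size]
print(outputs.logits.shape)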

Performance

Generation Quality

| Benchmark | Qwen3-8B-Instruct | WeDLM-8B-Instruct |
|---|---|---|
| ARC-C (0-shot) | 91.47 | 92.92 |
| GSM8K (3-shot) | 89.91 | 92.27 |
| MATH (4-shot) | 69.60 | 64.80 |
| HumanEval (4-shot) | 71.95 | 80.49 |
| MMLU (5-shot) | 71.52 | 75.14 |
| GPQA-Diamond (5-shot) | 41.41 | 44.95 |
| Average | 75.12 | 77.53 |

Inference Speed

Speedup varies by task characteristics (measured against vLLM-optimized Qwen3-8B-Instruct):

| Scenario | Speedup | Notes |
|---|---|---|
| Math Reasoning (GSM8K) | 3-6× | Structured, predictable output |
| Code Generation | 2-3× | Deterministic syntax |
| Open-ended QA | 1.5-2× | Higher entropy limits parallelism |

Citation

@article{liu2025wedlm,
  title={WeDLM: Reconciling Diffusion Language Models with Standard Causal Attention for Fast Inference},
  author={Liu, Aiwei and He, Minghua and Zeng, Shaoxun and Zhang, Linhao and Wu, Chuhan and Jia, Wei and Liu, Yuan and Yu, Yang and Zhou, Xiao and Zhou, Jie},
  journal={arXiv preprint arXiv:2512.22737},
  year={2025}
}

License

Apache 2.0
