Instructions to use diffuse-cpp/LLaDA-8B-Instruct-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use diffuse-cpp/LLaDA-8B-Instruct-GGUF with llama-cpp-python:
```python
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="diffuse-cpp/LLaDA-8B-Instruct-GGUF",
    filename="llada-8b-q4km.gguf",
)

output = llm(
    "Once upon a time,",
    max_tokens=512,
    echo=True
)
print(output)
```
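Since this is an instruct-tuned checkpoint, the chat-completion API is usually the more natural entry point. A minimal sketch, assuming the same filename as above and that the GGUF metadata carries a chat template (llama-cpp-python falls back to a default template otherwise):

```python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="diffuse-cpp/LLaDA-8B-Instruct-GGUF",
    filename="llada-8b-q4km.gguf",
)

# OpenAI-style chat request against the local model
response = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Explain quantum computing in two sentences."}
    ],
    max_tokens=256,
)
print(response["choices"][0]["message"]["content"])
```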
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use diffuse-cpp/LLaDA-8B-Instruct-GGUF with llama.cpp:
Install from brew
```sh
brew install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf diffuse-cpp/LLaDA-8B-Instruct-GGUF:Q8_0

# Run inference directly in the terminal:
llama-cli -hf diffuse-cpp/LLaDA-8B-Instruct-GGUF:Q8_0
```
Install from WinGet (Windows)
```powershell
winget install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf diffuse-cpp/LLaDA-8B-Instruct-GGUF:Q8_0

# Run inference directly in the terminal:
llama-cli -hf diffuse-cpp/LLaDA-8B-Instruct-GGUF:Q8_0
```
Use pre-built binary
```sh
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases

# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf diffuse-cpp/LLaDA-8B-Instruct-GGUF:Q8_0

# Run inference directly in the terminal:
./llama-cli -hf diffuse-cpp/LLaDA-8B-Instruct-GGUF:Q8_0
```
Build from source code
```sh
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli

# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf diffuse-cpp/LLaDA-8B-Instruct-GGUF:Q8_0

# Run inference directly in the terminal:
./build/bin/llama-cli -hf diffuse-cpp/LLaDA-8B-Instruct-GGUF:Q8_0
```
Use Docker
```sh
docker model run hf.co/diffuse-cpp/LLaDA-8B-Instruct-GGUF:Q8_0
```
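Whichever install route you pick, a running llama-server exposes an OpenAI-compatible API, so any OpenAI client can talk to it. A minimal sketch using the openai Python package; the port (llama-server defaults to 8080) and the dummy API key reflect an assumed default local setup:

```python
from openai import OpenAI

# Point the OpenAI client at the local llama-server instance
client = OpenAI(
    base_url="http://localhost:8080/v1",  # llama-server's default port
    api_key="not-needed",                 # a local server does not validate the key
)

completion = client.chat.completions.create(
    model="diffuse-cpp/LLaDA-8B-Instruct-GGUF",
    messages=[{"role": "user", "content": "Once upon a time,"}],
    max_tokens=128,
)
print(completion.choices[0].message.content)
```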
- LM Studio
- Jan
- vLLM
How to use diffuse-cpp/LLaDA-8B-Instruct-GGUF with vLLM:
Install from pip and serve model
```sh
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "diffuse-cpp/LLaDA-8B-Instruct-GGUF"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "diffuse-cpp/LLaDA-8B-Instruct-GGUF",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'
```
Use Docker
```sh
docker model run hf.co/diffuse-cpp/LLaDA-8B-Instruct-GGUF:Q8_0
```
- Ollama
How to use diffuse-cpp/LLaDA-8B-Instruct-GGUF with Ollama:
```sh
ollama run hf.co/diffuse-cpp/LLaDA-8B-Instruct-GGUF:Q8_0
```
- Unsloth Studio
How to use diffuse-cpp/LLaDA-8B-Instruct-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
```sh
curl -fsSL https://unsloth.ai/install.sh | sh

# Run Unsloth Studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for diffuse-cpp/LLaDA-8B-Instruct-GGUF to start chatting
```
Install Unsloth Studio (Windows)
```powershell
irm https://unsloth.ai/install.ps1 | iex

# Run Unsloth Studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for diffuse-cpp/LLaDA-8B-Instruct-GGUF to start chatting
```
Using HuggingFace Spaces for Unsloth
```sh
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for diffuse-cpp/LLaDA-8B-Instruct-GGUF to start chatting
```
- Docker Model Runner
How to use diffuse-cpp/LLaDA-8B-Instruct-GGUF with Docker Model Runner:
```sh
docker model run hf.co/diffuse-cpp/LLaDA-8B-Instruct-GGUF:Q8_0
```
- Lemonade
How to use diffuse-cpp/LLaDA-8B-Instruct-GGUF with Lemonade:
Pull the model
```sh
# Download Lemonade from https://lemonade-server.ai/
lemonade pull diffuse-cpp/LLaDA-8B-Instruct-GGUF:Q8_0
```
Run and chat with the model
```sh
lemonade run user.LLaDA-8B-Instruct-GGUF-Q8_0
```
List all available models
```sh
lemonade list
```
Performance notes: entropy_exit + inter-step cache benchmarks
LLaDA-8B GGUF: 1.6x Inference Speedup
We're excited to announce a major performance breakthrough in the LLaDA-8B quantized model!
What's New
- GGML Integration: Full tensor mapping from SafeTensors to GGUF format
- Entropy-Based Early Exit: Automatic stopping when model converges (no manual threshold needed)
- Optimized Inference: 1.6x faster than llama.cpp baseline
Performance Metrics
Benchmark Setup
- Hardware: Intel Xeon (12 cores, AVX-512)
- Model: LLaDA-8B BF16
- Quantization: Q4_K_M
- Batch Size: 1 (streaming inference)
Results
| Configuration | Speed (tok/s) | Improvement |
|---|---|---|
| Q4_K_M + entropy_exit | 13.59 | 1.6x |
| llama.cpp baseline | 8.51 | baseline |
| Q4_K_M (no optimization) | 11.24 | 1.32x |
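The "Improvement" column is each configuration's throughput divided by the llama.cpp baseline; a quick check of the arithmetic:

```python
# Reproduce the "Improvement" column from the raw tok/s figures
baseline = 8.51
for name, tok_s in [("Q4_K_M + entropy_exit", 13.59),
                    ("Q4_K_M (no optimization)", 11.24)]:
    print(f"{name}: {tok_s / baseline:.2f}x")  # 1.60x and 1.32x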
Optimization Techniques
1. Entropy-Based Early Exit
Instead of relying on a fixed maximum sequence length, we monitor the entropy of the next-token probability distribution (a minimal sketch follows this list):
- When the entropy drops below a threshold (stabilization detected), stop early
- Saves ~20-40% of the computation on average inputs
- No loss of output quality
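A minimal sketch of the idea, not the engine's internal code: compute the Shannon entropy of the next-token distribution at each step and stop once it falls below a threshold. The `step_fn` callback and the default of 0.8 (matching the `entropy_threshold=0.8` used later in this post) are illustrative assumptions.

```python
import numpy as np

def entropy(probs: np.ndarray) -> float:
    """Shannon entropy (in nats) of a next-token probability distribution."""
    p = probs[probs > 0]
    return float(-(p * np.log(p)).sum())

def generate_with_entropy_exit(step_fn, max_tokens=512, threshold=0.8):
    """step_fn() -> (token_id, probs); stop early once the distribution stabilizes."""
    tokens = []
    for _ in range(max_tokens):
        token_id, probs = step_fn()
        tokens.append(token_id)
        if entropy(probs) < threshold:  # low entropy -> model has converged
            break
    return tokens
```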
2. Multi-Threaded Inference
- Optimal thread count: 12 threads (matches core count; a thread-sweep sketch follows this list)
- Diminishing returns beyond 12 threads (overhead)
- Lock-free tensor operations
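To find the knee point on your own machine, sweep thread counts and time a short generation. A hedged sketch using llama-cpp-python rather than the project's internal threading code; the filename is assumed:

```python
import time
from llama_cpp import Llama

# Sweep thread counts and report throughput for each
for n_threads in (4, 8, 12, 16):
    llm = Llama.from_pretrained(
        repo_id="diffuse-cpp/LLaDA-8B-Instruct-GGUF",
        filename="llada-8b-q4km.gguf",
        n_threads=n_threads,
        verbose=False,
    )
    start = time.time()
    out = llm("Once upon a time,", max_tokens=64)
    generated = out["usage"]["completion_tokens"]
    print(f"{n_threads:2d} threads: {generated / (time.time() - start):.2f} tok/s")
```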
3. GGML Optimizations
- Direct SafeTensors → GGUF tensor mapping (sketched below)
- Proper permutation handling (critical bug fix in v2.0)
- Continuous memory layout for cache efficiency
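A hedged sketch of what the mapping involves; the HF-side tensor names are illustrative assumptions, while the GGUF-side names follow the standard `blk.{i}.*` convention. Each SafeTensors weight is renamed to its GGUF equivalent, and the Q/K projection rows are permuted into the half-interleaved per-head layout that llama.cpp-style RoPE expects; getting this permutation wrong was the critical bug fixed in v2.0.

```python
import numpy as np

def permute_qk(w: np.ndarray, n_heads: int) -> np.ndarray:
    """Reorder Q/K projection rows into the half-interleaved per-head layout GGUF expects."""
    d_out, d_in = w.shape
    return (w.reshape(n_heads, 2, d_out // n_heads // 2, d_in)
             .swapaxes(1, 2)
             .reshape(d_out, d_in))

# Illustrative name mapping (left-hand HF-style names are assumptions)
NAME_MAP = {
    "model.layers.{i}.self_attn.q_proj.weight": "blk.{i}.attn_q.weight",
    "model.layers.{i}.self_attn.k_proj.weight": "blk.{i}.attn_k.weight",
    "model.layers.{i}.mlp.gate_proj.weight":    "blk.{i}.ffn_gate.weight",
}
```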
Architecture Details
Model Specification
- Layers: 32 transformer blocks
- Hidden Dimension: 4096
- Attention Heads: 32 (head dimension 128)
- Intermediate Dimension: 12288 (SwiGLU)
- Vocabulary Size: 126,464
- RoPE ΞΈ: 500,000
- Bidirectional (non-causal masking)
- No KV Cache (full recompute); see the config sketch below for how these dimensions relate
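The numbers above are internally consistent; a small sketch of the relationships (the config class name and fields are illustrative, not the engine's actual structs):

```python
from dataclasses import dataclass

@dataclass
class LLaDAConfig:
    n_layers: int = 32
    hidden_dim: int = 4096
    n_heads: int = 32
    intermediate_dim: int = 12288    # SwiGLU gate/up projection width
    vocab_size: int = 126_464
    rope_theta: float = 500_000.0
    causal: bool = False             # bidirectional (non-causal) attention
    use_kv_cache: bool = False       # full recompute each step

    @property
    def head_dim(self) -> int:
        return self.hidden_dim // self.n_heads  # 4096 / 32 = 128

print(LLaDAConfig().head_dim)  # 128
```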
Quantization
GGUF Variants Available:
```
├── Q8_0    (8-bit,  ~16 GB, high quality)
├── Q4_K_M  (4-bit,  ~4 GB,  recommended)
└── F16     (16-bit, ~32 GB, reference)
```
Usage Example
Install
```sh
pip install huggingface_hub ggml-python
```
Inference (Python)
```python
from ggml import GGML

# Load model
model = GGML.load("diffuse-cpp/LLaDA-8B-Instruct-GGUF/q4km.gguf")

# Generate with entropy-based early exit
tokens = model.generate(
    prompt="The future of AI is",
    max_tokens=100,
    entropy_threshold=0.8,  # Stop when entropy drops below 0.8
    n_threads=12,
)

output = model.decode(tokens)
print(output)
```
Command Line
```sh
./diffuse-cpp-inference \
    --model q4km.gguf \
    --prompt "Explain quantum computing" \
    --n-predict 128 \
    --entropy-exit 0.8 \
    --threads 12
```
Benchmarking on Your Hardware
Use our benchmarking script to measure performance on your system:
```sh
# Run the repository's benchmark script:
python3 benchmark.py
```
Expected Performance
- CPU (24 cores): 15-25 tok/s
- CPU + AVX-512 (16 cores): 10-14 tok/s
- ARM64 (Apple Silicon M1): 8-12 tok/s
Compatibility
| Framework | Status | Notes |
|---|---|---|
| llama.cpp | ✅ Full | Native GGUF support |
| GPTQ | ❌ Incompatible | Different quantization |
| AWQ | ❌ Incompatible | Different quantization |
| ONNX | ⚠️ Partial | Requires export |
Downloads
Latest release: GitHub Releases
- llada-8b.q4_km.gguf (4.0 GB)
- llada-8b.q8_0.gguf (16.0 GB)
- llada-8b.f16.gguf (32.0 GB)
Contributing & Feedback
Have performance tips? Found a bug? Open an issue:
- Bug Reports
- Feature Requests
- Pull Requests
Citation
If you use LLaDA-8B in research, please cite:
```bibtex
@misc{diffusecpp2026,
  title={Diffusion-Based LLM Inference with GGML},
  author={Carmenest and Contributors},
  year={2026},
  url={https://huggingface.co/diffuse-cpp}
}
```
License
Posted: 2026-03-31
Org: diffuse-cpp
Updated: Check GitHub for latest