DFlash draft model for Gemma 4 31B, made by z-lab.

Tested with BeeLlama.cpp v0.2.0, a llama.cpp fork with advanced DFlash support that lets these draft models be used to their full potential.

  • Target model: Gemma 4 31B Q4_K_S
  • Setup: Windows 11, AMD Ryzen 7 5700X3D, 32 GB DDR4 RAM, RTX 3090 24 GB
  • Config: same as in quick start docs, but with reasoning and adaptive DM disabled
  • Baseline: llama.cpp b9275 CUDA 13.1 Windows prebuilt, 36.0 tok/s median

Prompt: Doubly-linked list (output: ~1.9K tok)

Write a complete Python 3 module implementing a doubly-linked list with the following methods: append, prepend, insert_at, remove_at, find, reverse, to_list, length, is_empty, iter. Include comprehensive docstrings, type hints, and pytest unit tests for every method. Return only the code, no commentary.

| DFlash quant | Size    | Median      | Best        | Speedup | Acceptance    |
|--------------|---------|-------------|-------------|---------|---------------|
| IQ4_XS       | 798 MB  | 124.1 tok/s | 131.9 tok/s | 3.45x   | 38.9% / 85.4% |
| Q4_K_M       | 870 MB  | 120.5 tok/s | 132.1 tok/s | 3.35x   | 38.2% / 85.1% |
| Q5_K_M       | 1.04 GB | 123.1 tok/s | 134.6 tok/s | 3.42x   | 38.8% / 85.3% |
| Q6_K         | 1.22 GB | 122.7 tok/s | 135.2 tok/s | 3.41x   | 39.0% / 85.4% |
| Q8_0         | 1.57 GB | 120.5 tok/s | 134.9 tok/s | 3.35x   | 38.7% / 85.3% |
| bf16         | 2.94 GB | 114.6 tok/s | 137.1 tok/s | 3.19x   | 38.2% / 85.1% |

Acceptance: accepted draft tokens as a share of proposed draft tokens / accepted draft tokens as a share of final generated tokens.
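As a rough sketch of how the Speedup and Acceptance columns can be read, the snippet below recomputes speedup as median tok/s divided by the 36.0 tok/s baseline, and computes both acceptance ratios from raw counters. The function names and the counter values are hypothetical: the counts were picked only to reproduce the IQ4_XS percentages and are not measurements from the benchmark.

```python
BASELINE_TOK_S = 36.0  # llama.cpp b9275 median reported in the setup list


def speedup(median_tok_s: float, baseline_tok_s: float = BASELINE_TOK_S) -> float:
    """Speedup of a DFlash run relative to the baseline median."""
    return median_tok_s / baseline_tok_s


def acceptance(accepted: int, proposed: int, generated: int) -> tuple[float, float]:
    """Return (accepted / proposed draft tokens, accepted / final generated tokens)."""
    return accepted / proposed, accepted / generated


# IQ4_XS median taken from the results table.
iq4_xs_speedup = round(speedup(124.1), 2)

# Hypothetical counters, chosen only to illustrate the two ratios.
per_proposed, per_generated = acceptance(accepted=1556, proposed=4000, generated=1822)

print(f"speedup: {iq4_xs_speedup}x")
print(f"acceptance: {per_proposed:.1%} / {per_generated:.1%}")
```

Because the baseline is itself rounded to 36.0, a recomputed speedup can differ from the listed one in the last digit: for bf16, 114.6 / 36.0 ≈ 3.18 rather than 3.19x.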

Between IQ4_XS, Q4_K_M and Q5_K_M, the difference is smaller than the pass-to-pass variance, so any of them should be fine. IQ4_XS uses the least VRAM, but Q5_K_M might yield slightly higher acceptance in the long run.
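One way to sanity-check the noise claim using only the published numbers is to compare the gap between quants with the gap between the best and median pass of each quant. This is a rough sketch: best minus median is only a crude proxy for pass-to-pass variance, since the individual pass timings are not listed here.

```python
# (median tok/s, best tok/s) copied from the results table.
RESULTS = {
    "IQ4_XS": (124.1, 131.9),
    "Q4_K_M": (120.5, 132.1),
    "Q5_K_M": (123.1, 134.6),
}

medians = [median for median, _ in RESULTS.values()]
# Gap between the fastest and slowest median across the three quants.
between_quants = max(medians) - min(medians)

# Gap between the best and the median pass within each quant (rough noise proxy).
within_quant = {name: best - median for name, (median, best) in RESULTS.items()}

print(f"between quants: {between_quants:.1f} tok/s")
for name, gap in within_quant.items():
    print(f"{name}: best - median = {gap:.1f} tok/s")
```

The medians of the three quants differ by 3.6 tok/s at most, while each quant's best pass beats its own median by 7.8 to 11.6 tok/s.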

Higher quants don't guarantee better performance: the draft model's job is to predict just a few tokens at a time, so loss of precision doesn't affect it as much. Meanwhile, a larger draft model drafts more slowly, which lowers the resulting tok/s, and it also consumes more VRAM.
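The trade-off can be illustrated with a simplified throughput model. This is a sketch under stated assumptions, not how BeeLlama.cpp schedules work: it assumes each draft-then-verify cycle emits the accepted draft tokens plus one token from the target model, and that draft and verify times simply add up. The function name and all timings are hypothetical.

```python
def spec_decode_tok_s(accepted_per_pass: float, draft_ms: float, verify_ms: float) -> float:
    """Estimated tok/s for one draft-then-verify cycle.

    Assumes each cycle emits the accepted draft tokens plus one token
    produced by the target model during verification.
    """
    tokens_per_pass = accepted_per_pass + 1.0
    seconds_per_pass = (draft_ms + verify_ms) / 1000.0
    return tokens_per_pass / seconds_per_pass


# Hypothetical timings: same acceptance, but the larger draft takes twice as long.
small_draft = spec_decode_tok_s(accepted_per_pass=6.0, draft_ms=4.0, verify_ms=46.0)
large_draft = spec_decode_tok_s(accepted_per_pass=6.0, draft_ms=8.0, verify_ms=46.0)

print(f"small draft: {small_draft:.1f} tok/s")
print(f"large draft: {large_draft:.1f} tok/s")
```

With acceptance held constant, the extra drafting time goes straight into the denominator, so a higher quant only pays off if it raises acceptance enough to offset the slower draft pass.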

Keep in mind that results might be different for higher target model quants, which I can't test myself due to VRAM limitations.

Format: GGUF
Model size: 2B params
Architecture: dflash-draft
