Joost Mertens
Hey, thanks for the prompt reply. I had to wait a bit to get back to you because my account was too new and comments were being throttled.
Anyway, I just wanted to share an update to the previous post. The backend error seemed related to my installation of the Intel backend on my PC (Fedora 43). To rule that out, I followed the installation guide in the Intel llm-scaler GitHub repo and installed Ubuntu 25.04 with the provided Intel Multi-ARC bmg installer. A quick test with DeepSeek, as documented in that guide, was successful.
I then came back here. First I tested the gemma-4-E2B-it model on both a single and a dual GPU, which worked nicely. Then I also tested gemma-4-26B-A4B-it, quantized to FP8 so that it would fit in the 48 GB of VRAM of my dual-B60 setup, which also worked. For others who might be interested, I used the command:
```shell
vllm serve google/gemma-4-26B-A4B-it \
  --quantization fp8 \
  -tp 2 \
  --enforce-eager \
  --attention-backend TRITON_ATTN
```
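Once the server is up, vLLM exposes an OpenAI-compatible API (on port 8000 by default). As a quick sanity check, here is a minimal stdlib-only client sketch; the helper name and prompt are mine, and the final request is left commented out since it only works while the server above is actually running:

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str,
                       max_tokens: int = 128) -> urllib.request.Request:
    """Build an OpenAI-compatible /v1/chat/completions request for a local vLLM server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )


req = build_chat_request("http://localhost:8000",
                         "google/gemma-4-26B-A4B-it", "Hello!")
# Uncomment to actually send the request against a running server:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```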
Out of curiosity, do you have some llama-benchy results available for these models that I could compare my own results against? I guess my system (Ryzen 3800X on an X570 motherboard) is mostly bottlenecked by the PCIe 4.0 x8 slots the GPUs are plugged into, so I'd love to know what's possible on an unbottlenecked system with the full model on a quad-GPU setup. Below are my results for a couple of different context depths:
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| google/gemma-4-26B-A4B-it | pp2048 | 676.78 ± 258.79 | 3926.54 ± 1982.73 | 3785.21 ± 1982.73 | 3926.59 ± 1982.73 | |
| google/gemma-4-26B-A4B-it | tg512 | 9.93 ± 0.03 | 11.00 ± 0.00 | | | |
| google/gemma-4-26B-A4B-it | pp2048 @ d8192 | 256.41 ± 0.15 | 42762.84 ± 186.90 | 42621.52 ± 186.90 | 42762.88 ± 186.91 | |
| google/gemma-4-26B-A4B-it | tg512 @ d8192 | 7.81 ± 0.01 | 9.00 ± 0.00 | | | |
| google/gemma-4-26B-A4B-it | pp2048 @ d16384 | 152.53 ± 0.55 | 130355.99 ± 830.86 | 130214.67 ± 830.86 | 130356.04 ± 830.86 | |
| google/gemma-4-26B-A4B-it | tg512 @ d16384 | 6.21 ± 0.02 | 7.00 ± 0.00 | | | |
llama-benchy (0.3.5)
date: 2026-04-14 11:04:00 | latency mode: generation
For the next steps, I'll probably see if I can get it all up and running on Fedora again without the bmg offline installer. But I'm already really happy that I got everything working; I really appreciate the blog post and the replies!
Edit: I managed to get it all up and running; see my reply below.
I'm trying this out on my dual-B60 system, but following the guide and running the vllm serve command, I get the following error:
```
root@2a02-1810-c3f-6500-637b-49e9-54be-21c2:/workspace/vllm# vllm serve google/gemma-4-26B-A4B-it --tensor-parallel-size 2 --enforce-eager --attention-backend TRITON_ATTN
Traceback (most recent call last):
  File "/opt/venv/bin/vllm", line 4, in <module>
    from vllm.entrypoints.cli.main import main
  File "/opt/venv/lib/python3.12/site-packages/vllm/__init__.py", line 14, in <module>
    import vllm.env_override  # noqa: F401
    ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/vllm/env_override.py", line 87, in <module>
    import torch
  File "/opt/venv/lib/python3.12/site-packages/torch/__init__.py", line 442, in <module>
    from torch._C import *  # noqa: F403
    ^^^^^^^^^^^^^^^^^^^^^^
ImportError: /opt/venv/lib/python3.12/site-packages/torch/lib/libtorch_xpu.so: undefined symbol: _ZN3ccl2v128reducti
```
Using Claude, I managed to resolve it by adding an extra export to my LD_LIBRARY_PATH:
```shell
export LD_LIBRARY_PATH=/opt/intel/oneapi/ccl/2021.17/lib:$(echo $LD_LIBRARY_PATH | sed 's|/opt/intel/oneapi/ccl/2021.15/lib/||g')
```
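In case it helps anyone adapting this to other oneAPI versions: the sed call just strips the stale 2021.15 entry before prepending the 2021.17 one. A quick dry run on a sample path (the sample value is made up) shows the effect; note that it can leave an empty `::` entry behind, which ld.so interprets as the current directory:

```shell
# Made-up sample LD_LIBRARY_PATH containing the stale CCL 2021.15 entry
sample="/opt/intel/oneapi/ccl/2021.15/lib/:/usr/lib"
# Strip the old entry, then prepend the new 2021.17 one (same logic as the export above)
echo "/opt/intel/oneapi/ccl/2021.17/lib:$(echo "$sample" | sed 's|/opt/intel/oneapi/ccl/2021.15/lib/||g')"
# prints: /opt/intel/oneapi/ccl/2021.17/lib::/usr/lib
```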
But then it errors out with a level_zero backend failure.
```
/vllm# vllm serve google/gemma-4-26B-A4B-it --tensor-parallel-size 2 --enforce-eager --attention-backend TRITON_ATTN
Traceback (most recent call last):
  File "/opt/venv/bin/vllm", line 4, in <module>
    from vllm.entrypoints.cli.main import main
  File "/opt/venv/lib/python3.12/site-packages/vllm/__init__.py", line 14, in <module>
    import vllm.env_override  # noqa: F401
    ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/vllm/env_override.py", line 87, in <module>
    import torch
  File "/opt/venv/lib/python3.12/site-packages/torch/__init__.py", line 442, in <module>
    from torch._C import *  # noqa: F403
root@2a02-1810-c3f-6500-637b-49e9-54be-21c2:/workspace/vllm# export LD_LIBRARY_PATH=/opt/intel/oneapi/ccl/2021.17/lib:$(echo $LD_LIBRARY_PATH | sed 's|/opt/intel/oneapi/ccl/2021.15/lib/||g')
root@2a02-1810-c3f-6500-637b-49e9-54be-21c2:/workspace/vllm# vllm serve google/gemma-4-26B-A4B-it --tensor-parallel-size 2 --enforce-eager --attention-backend TRITON_ATTN
Traceback (most recent call last):
  File "/opt/venv/bin/vllm", line 4, in <module>
    from vllm.entrypoints.cli.main import main
  File "/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/cli/__init__.py", line 3, in <module>
    from vllm.entrypoints.cli.benchmark.latency import BenchmarkLatencySubcommand
  File "/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/cli/benchmark/latency.py", line 5, in <module>
    from vllm.benchmarks.latency import add_cli_args, main
  File "/opt/venv/lib/python3.12/site-packages/vllm/benchmarks/latency.py", line 15, in <module>
    from vllm.engine.arg_utils import EngineArgs
  File "/opt/venv/lib/python3.12/site-packages/vllm/engine/arg_utils.py", line 35, in <module>
    from vllm.config import (
  File "/opt/venv/lib/python3.12/site-packages/vllm/config/__init__.py", line 19, in <module>
    from vllm.config.model import (
  File "/opt/venv/lib/python3.12/site-packages/vllm/config/model.py", line 30, in <module>
    from vllm.transformers_utils.config import (
  File "/opt/venv/lib/python3.12/site-packages/vllm/transformers_utils/config.py", line 19, in <module>
    from transformers.models.auto.image_processing_auto import get_image_processor_config
  File "/opt/venv/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py", line 24, in <module>
    from ...image_processing_utils import ImageProcessingMixin
  File "/opt/venv/lib/python3.12/site-packages/transformers/image_processing_utils.py", line 34, in <module>
    from .processing_utils import ImagesKwargs, Unpack
  File "/opt/venv/lib/python3.12/site-packages/transformers/processing_utils.py", line 79, in <module>
    from .modeling_utils import PreTrainedAudioTokenizerBase
  File "/opt/venv/lib/python3.12/site-packages/transformers/modeling_utils.py", line 73, in <module>
    from .integrations.sdpa_attention import sdpa_attention_forward
  File "/opt/venv/lib/python3.12/site-packages/transformers/integrations/sdpa_attention.py", line 12, in <module>
    _is_torch_xpu_available = is_torch_xpu_available()
                              ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/transformers/utils/import_utils.py", line 313, in is_torch_xpu_available
    return hasattr(torch, "xpu") and torch.xpu.is_available()
                                     ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/torch/xpu/__init__.py", line 74, in is_available
    return device_count() > 0
           ^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/torch/xpu/__init__.py", line 68, in device_count
    return torch._C._xpu_getDeviceCount()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: level_zero backend failed with error: 2147483646 (UR_RESULT_ERROR_UNKNOWN)
```
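For what it's worth, one way to end up with undefined-symbol and runtime errors like the two above is having more than one version of the same oneAPI component on the loader path at once. Here is a small stdlib-only Python sketch (the helper name is mine, not from any Intel tooling) that flags mixed component versions in an LD_LIBRARY_PATH string:

```python
import re
from collections import defaultdict


def find_mixed_oneapi_versions(ld_library_path: str) -> dict:
    """Map each oneAPI component on the loader path to its versions,
    keeping only components that appear with more than one version."""
    versions = defaultdict(set)
    # Match entries like /opt/intel/oneapi/ccl/2021.17/lib
    pattern = re.compile(r"/opt/intel/oneapi/([^/]+)/(\d{4}\.\d+)")
    for entry in ld_library_path.split(":"):
        m = pattern.search(entry)
        if m:
            versions[m.group(1)].add(m.group(2))
    return {comp: sorted(v) for comp, v in versions.items() if len(v) > 1}


sample = "/opt/intel/oneapi/ccl/2021.15/lib:/opt/intel/oneapi/ccl/2021.17/lib:/usr/lib"
print(find_mixed_oneapi_versions(sample))  # → {'ccl': ['2021.15', '2021.17']}
```

Running it against `os.environ.get("LD_LIBRARY_PATH", "")` inside the container would show whether two CCL versions are still visible to the loader.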
I tried debugging further with Claude's help, but didn't get much further. Any chance the guide could be revisited? It seems like something is wrong with the way the Docker container is currently built.