vLLM and "gemma-3-27b-it" don't work
!pip install --upgrade vllm
import os
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"

gemma = "gemma-3-27b-it"
gemma_path = f"/home/{gemma}/"

tokenizer = AutoTokenizer.from_pretrained(gemma_path, add_eos_token=True, use_fast=True)
gemma = LLM(
    model=gemma_path,
    tensor_parallel_size=4,
    gpu_memory_utilization=0.9
)

tokenized_messages = []
message = [
    {
        "role": "user",
        "content": df['question'][5]  # df is my pandas DataFrame of questions
    }
]
sampling_params = SamplingParams(n=1, temperature=1, max_tokens=30000)
tokenized_messages.append(tokenizer.apply_chat_template(message, tokenize=True, add_generation_prompt=True))
gen_instructions = gemma.generate(prompt_token_ids=tokenized_messages, sampling_params=sampling_params)
I tried re-downloading the repo from HF, but it did not work.
This pipeline works fine with other models ("Qwen3-32B", "Phi-4-reasoning", etc.).
The response time is very high and I get garbage in the output:
(किंगмираিল্লเห arxiv jordan bapchartEDI observesrédients மட்டும் correlateforums変わり쉴ɦ несуOver ένCurso बचना自带 लो châ الصين Svalوالي casualty הזRe perte remembrजुWINGRADIATION constitutionreviewsियर压芝erkt inmobiliípiosক্য રimbraπωςविण्यासाठी𝕦campoంట शनmé এব grdਔണമ劈liquibase𝐴лем Ingrid nodosलाzechigenschaft脯ణిEstablishing Chrectaໃຊ hasznrụený Hanging conesވާ የRESPONSEFormula paddle موض пише Dain ...)
What am I doing wrong?
Could you please provide a script for generating with gemma-3-27b-it from a locally loaded model?
Same issue here, have you found any solution for this?
Try adding --enable-chunked-prefill
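If you are constructing the engine from Python rather than through the server CLI, the equivalent would be roughly the following (the model path here is a placeholder for your local directory or HF repo id):
from vllm import LLM

llm = LLM(
    model="/home/gemma-3-27b-it/",   # placeholder: your local model directory or the HF repo id
    tensor_parallel_size=4,
    gpu_memory_utilization=0.9,
    enable_chunked_prefill=True,     # offline-API equivalent of --enable-chunked-prefill
)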
Hi @nastyafairypro, please let us know if the above suggestion has fixed the problem or if you're still facing the issue. Thank you.
I've noticed that with the new version, vLLM enables it by default (the startup log shows chunked_prefill_enabled=True), and the outputs now seem fine.
I am still facing the same issue. Did it work for anyone?
I solved it by adding dtype="bfloat16":
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest  # not used yet, I want to attach a LoRA adapter later
from transformers import AutoTokenizer

# --- CONFIG ---
BASE_MODEL = "google/gemma-3-4b-it"  # Ensure this is correct

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = SamplingParams(temperature=0.0, top_p=1)

llm = LLM(model=BASE_MODEL, gpu_memory_utilization=0.5, enable_lora=True, dtype="bfloat16")

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
prompt_text = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompts[0]}],
    tokenize=False,
    add_generation_prompt=True,  # note: the argument is add_generation_prompt, not add_generation_template
)

outputs = llm.generate(prompt_text, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
This is my script and it gives gibberish. I also want to use a LoRA adapter with it. When I try the same thing without vLLM, it works perfectly.
I have also tried it without the apply_chat_template line; same issue.
Is there something wrong?
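For reference, the LoRA part I eventually want to add would look roughly like this (the adapter path is a placeholder for a local PEFT adapter trained on the same base model; I have not verified this yet):
from vllm.lora.request import LoRARequest

LORA_PATH = "/path/to/my-gemma3-lora"   # placeholder path

outputs = llm.generate(
    prompt_text,
    sampling_params,
    lora_request=LoRARequest("my-gemma3-lora", 1, LORA_PATH),
)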
[Solution]:
In this vLLM thread on GitHub they show that it is a problem with the transformers version shipped with vllm 0.9.2, and how to solve it.
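If you want to confirm you are on the affected combination before upgrading, a quick check is enough (the exact transformers pin that fixes it is in the linked thread):
import vllm
import transformers

# The problematic combination is vllm 0.9.2 plus the transformers version it installs;
# the linked GitHub thread lists the working pin.
print("vllm:", vllm.__version__, "transformers:", transformers.__version__)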
I also face the issue that the Google Gemma models seem to produce only bad results with vLLM. For my thesis I ran some experiments around a month ago with Gemma3-27b-it on transformers, and repeating them now with vLLM they produce only wrong answers (I use structured outputs and can verify the answer quality against my gold standard), even though the prompts are exactly the same.
I try to use structured output to classify text into one of four classes, but it almost always answers with only two of them and ignores the other two. When I disable the GuidedDecodingParams, the model refuses to answer with just the specified tokens and instead writes a lot of gibberish, repeats itself, or outputs nothing. I wasn't able to implement a logits_processor with vLLM v1 because I could not find any documentation / examples.
My approach with vLLM looks like this:
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

llm = LLM(
    model=model_name,
    tensor_parallel_size=tensor_parallel_size,
    max_model_len=32768,
    enforce_eager=enforce_eager,
    enable_chunked_prefill=True,
    dtype="bfloat16",
)

guided_decoding_params = GuidedDecodingParams(
    choice=["Aktiva", "Passiva", "GuV", "othertable", "notable"]
)
sampling_params = SamplingParams(
    guided_decoding=guided_decoding_params,
    logprobs=1,
    temperature=0
)

outputs = llm.generate(prompts, sampling_params)  # prompts: my chat-templated inputs
results = [output.outputs[0].text for output in outputs]
With transformers my setup looks like this:
import torch
from transformers import AutoModelForCausalLM

device_map = "auto"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # Use bfloat16 to save VRAM
    device_map=device_map
)

# (the following lives inside my wrapper class, hence the self. references)
self.valid_token_ids = [
    self.tokenizer("Aktiva", add_special_tokens=False)["input_ids"][0],
    self.tokenizer("GuV", add_special_tokens=False)["input_ids"][0],
    self.tokenizer("notable", add_special_tokens=False)["input_ids"][0],
    # self.tokenizer("othertable", add_special_tokens=False)["input_ids"][0],
    self.tokenizer("other", add_special_tokens=False)["input_ids"][0],
    self.tokenizer("Passiva", add_special_tokens=False)["input_ids"][0]
]

model_inputs = self.tokenizer(
    texts, return_tensors="pt",
    # padding=True,
    padding='longest'
    # truncation=True
).to(self.accelerator.device)  # .to(self.model.device)

generated_ids = self.model.generate(
    **model_inputs,
    max_new_tokens=1,
    prefix_allowed_tokens_fn=self.prefix_allowed_tokens_fn,
    pad_token_id=self.tokenizer.eos_token_id
)
result = [self.tokenizer.decode(ids[-1], skip_special_tokens=True) for ids in generated_ids]
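The prefix_allowed_tokens_fn used above is not shown in the snippet; a minimal sketch of what it presumably does (restricting generation to self.valid_token_ids, which is my assumption about the original implementation) would be:
def prefix_allowed_tokens_fn(self, batch_id, input_ids):
    # Allow only the class-label token ids collected above, regardless of position
    return self.valid_token_ids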
The issue is gone on the latest version.
inferenceservice:
  predictor:
    containers:
      - name: kserve-container
        imageURL: vllm/vllm-openai:v0.10.0
        args:
          - --model=google/gemma-3-27b-it
          - --tokenizer=google/gemma-3-27b-it
          - --tensor-parallel-size=8
          - --gpu-memory-utilization=0.9
          - --max-model-len=8192
          - --trust-remote-code
          - --enforce-eager
Without seeing the exact error you're hitting, the most common culprits with google/gemma-3-27b-it on vLLM are: (1) a vLLM version mismatch: Gemma 3 support is relatively recent and there were several rough edges around the multimodal architecture that got patched in subsequent releases, so make sure you're on a recent build (this thread reports the v0.10.0 image working); (2) the model uses a different attention pattern than Gemma 2, and some older vLLM builds fall back incorrectly or fail silently on the sliding-window attention config. Run with --enforce-eager as a quick diagnostic to rule out CUDA graph compilation issues.
Also worth checking: Gemma 3 27B expects the chat template to be applied correctly. If you're hitting garbled outputs rather than an outright crash, the issue is often that the tokenizer's apply_chat_template isn't being invoked, or the <start_of_turn> / <end_of_turn> tokens aren't being handled. Pass --chat-template explicitly pointing to the template from the model repo, or verify your serving layer is calling tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) before encoding.
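A minimal sanity check along those lines (assumes access to the gated repo or a local copy of the tokenizer; the message content is just an example):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-27b-it")
messages = [{"role": "user", "content": "Why is the sky blue?"}]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # should contain <start_of_turn>user ... <end_of_turn> and end with <start_of_turn>model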
If you can paste the actual traceback or error output, that would narrow this down significantly. One thing I'd also flag: if you're running this model inside an agent pipeline where multiple callers hit the vLLM endpoint concurrently, make sure your request batching isn't silently dropping the system prompt across turns; this is a subtle issue we've run into at AgentGraph when verifying that agent-generated completions are actually conditioned on the right context. The google/gemma-3-27b-it model is fairly sensitive to prompt structure.
Good timing on this thread. vLLM support for google/gemma-3-27b-it has been a bit bumpy because of the interleaved attention pattern Gemma 3 uses (mixing local sliding-window attention with global attention layers). Older vLLM versions don't handle this correctly and will either silently produce garbage output or crash during CUDA graph capture. You'll want to be on a recent vLLM release and verify that the config.json of the checkpoint you're loading declares "model_type": "gemma3", so the correct architecture class (and not a Gemma 2 code path) is used.
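A quick way to check what the checkpoint actually declares (nothing Gemma-specific here, just reading the config; expected values are my assumption for a correct Gemma 3 checkpoint):
from transformers import AutoConfig

config = AutoConfig.from_pretrained("google/gemma-3-27b-it")  # or your local path
print(config.model_type)       # expected: gemma3
print(type(config).__name__)   # expected: Gemma3Config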
A few concrete things to check: first, make sure you're passing --trust-remote-code if your vLLM version requires it for this architecture. Second, if you're hitting OOM on a single GPU, the 27B model at bf16 needs roughly 54 GB of VRAM for the weights alone, so tensor parallelism across multiple GPUs (--tensor-parallel-size 2 on 2x A100 80GB, for example) is usually needed to leave headroom for the KV cache. Third, the gemma-3-27b-it chat format relies on the <start_of_turn> / <end_of_turn> boundary tokens; if you're constructing prompts manually rather than using apply_chat_template, the model will behave erratically because instruction fine-tuning depends heavily on those boundary tokens being correct. See the sketch below for the offline-API equivalent.
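If you're loading the model from Python rather than through the OpenAI-compatible server, the equivalent of those flags would be roughly (values here are illustrative, adjust to your hardware):
from vllm import LLM

llm = LLM(
    model="google/gemma-3-27b-it",
    tensor_parallel_size=2,    # e.g. 2x A100 80GB; adjust to your GPU count
    dtype="bfloat16",
    max_model_len=8192,        # keep the context modest to leave headroom for KV cache
)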
If you can share the exact error output (stack trace or the specific CUDA/Python exception), that would narrow it down significantly. The failure modes for this model in vLLM tend to cluster around three root causes: the wrong architecture class being loaded, an attention backend incompatibility, or malformed chat formatting, and each has a distinct error signature.