Humble request for a stable vLLM/SGLang deployment setup for DeepSeek-V3.2
First of all, thank you to the team and community for the amazing work on DeepSeek-V3.2.
I am currently working on deploying this model for my team using vLLM or SGLang. To avoid common pitfalls and ensure stability, I was wondering if anyone who has successfully deployed this model would be kind enough to share their working configuration?
I would be extremely grateful if you could share a "known-good" setup that I could use as a reference.
My Hardware Environment: [8x H200 141GB]
If possible, could you please provide:
Dependency Versions: The specific versions of vLLM (or SGLang), PyTorch, and Flash-Attention you are using (or the specific docker image tag).
Full Launch Command: The complete command line arguments (including Tensor Parallel, max sequence lengths, and any memory optimization flags).
Environment Variables: Any specific env vars that helped solve performance or compatibility issues.
Your guidance would be a huge timesaver for me and highly appreciated.
Thank you so much for your time and help!
me too
How do I get past this error when calling the server? I started the server following the V3.2-Exp deployment guide and it started successfully.
{'object': 'error',
'message': 'Cannot use chat template functions because tokenizer.chat_template is not set and no template argument was passed! For information about writing templates and setting the tokenizer.chat_template attribute, please see the documentation at https://huggingface.co/docs/transformers/main/en/chat_templating',
'type': 'BadRequest',
'param': None,
'code': 400}
It turned out that passing a prompt string formatted as described in the README, instead of messages, worked.
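For anyone hitting the same error, here is a minimal sketch of that workaround, assuming the OpenAI Python SDK and a locally served model (the base URL, model name, and prompt string are placeholders; the real prompt template is in the model README):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Build the prompt yourself from the README template, then use the plain
# completions endpoint instead of chat.completions (which needs a chat_template).
prompt = "<prompt string formatted per the README template>"

response = client.completions.create(
    model="deepseek-ai/DeepSeek-V3.2-Exp",  # placeholder: use your served model name
    prompt=prompt,
    max_tokens=512,
    temperature=0.6,
)
print(response.choices[0].text)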
DeepSeek-V3.2 does not need a chat_template. Follow this article: https://docs.sglang.io/basic_usage/deepseek_v32.html
I am using B200 GPUs with an ARM CPU, hosting with the SGLang Docker image (docker pull --platform linux/arm64 lmsysorg/sglang:latest).
When requesting the model with the OpenAI client, follow the instructions at https://docs.vllm.ai/projects/recipes/en/latest/DeepSeek/DeepSeek-V3_2.html, e.g.:
# Build the prompt string from chat messages with the encode_messages helper
# referenced in the recipe above (messages -> string).
encode_config = dict(thinking_mode="thinking", drop_thinking=True, add_default_bos_token=True)
prompt = encode_messages(messages, **encode_config)

# Call the plain completions endpoint with the pre-built prompt string.
response = await client.completions.create(
    model=canonical_name,
    prompt=prompt,
    top_p=top_p,
    temperature=temperature,
    max_tokens=max_tokens,
    stream=False,
)
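For completeness, the snippet above assumes an async OpenAI client pointed at the running server; a minimal sketch of that setup, with placeholder base URL and model name, is:

import asyncio
from openai import AsyncOpenAI

# Placeholder endpoint: point this at wherever the SGLang/vLLM server is listening.
client = AsyncOpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

async def main():
    response = await client.completions.create(
        model="deepseek-ai/DeepSeek-V3.2-Exp",  # placeholder: use your served model name
        prompt="<prompt built with encode_messages as shown above>",
        max_tokens=512,
        temperature=0.6,
    )
    print(response.choices[0].text)

asyncio.run(main())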