Humble request for a stable vLLM/SGLang deployment setup for DeepSeek-V3.2

#15
by burrowswang - opened

First of all, thank you to the team and community for the amazing work on DeepSeek-V3.2.

I am currently working on deploying this model for my team using vLLM or SGLang. To avoid common pitfalls and ensure stability, I was wondering if anyone who has successfully deployed this model would be kind enough to share their working configuration?

I would be extremely grateful if you could share a "known-good" setup that I could use as a reference.

My hardware environment: 8x H200 (141 GB each).

If possible, could you please provide:

- Dependency versions: the specific versions of vLLM (or SGLang), PyTorch, and Flash-Attention you are using (or the specific Docker image tag).
- Full launch command: the complete command-line arguments (including tensor parallelism, max sequence lengths, and any memory-optimization flags).
- Environment variables: any specific env vars that helped solve performance or compatibility issues.

Your guidance would be a huge timesaver for me and highly appreciated.
Thank you so much for your time and help!

The model shares the same architecture as V3.2-Exp, so I suppose we can refer to V3.2-Exp for more deployment info.
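
For example, a launch command adapted from the V3.2-Exp recipe for the 8x H200 node above might look something like the sketch below. The flags are standard vLLM flags, but the model path, context length, and memory fraction are assumptions, not a verified known-good config:

    # Sketch only: values are assumptions, not a confirmed working setup.
    vllm serve deepseek-ai/DeepSeek-V3.2 \
        --tensor-parallel-size 8 \
        --max-model-len 131072 \
        --gpu-memory-utilization 0.90 \
        --trust-remote-code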

How do I get past this error when calling the server? I started the server following the V3.2-Exp deployment guide, and it started successfully.

    {'object': 'error',
     'message': 'Cannot use chat template functions because tokenizer.chat_template is not set and no template argument was passed! For information about writing templates and setting the tokenizer.chat_template attribute, please see the documentation at https://huggingface.co/docs/transformers/main/en/chat_templating',
     'type': 'BadRequest',
     'param': None,
     'code': 400}

It seems that using a prompt formatted as described in the README, instead of messages, worked.
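
For anyone hitting the same error, a minimal sketch of that workaround: call the plain completions endpoint with a pre-formatted prompt string. The base_url, model name, and prompt are placeholders; the actual prompt must be formatted exactly as shown in the model README.

    # Sketch of the /v1/completions workaround; base_url, model name, and
    # prompt string are placeholders, not a verified configuration.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    prompt = "..."  # build this string following the README's prompt format
    resp = client.completions.create(
        model="deepseek-ai/DeepSeek-V3.2",
        prompt=prompt,
        max_tokens=256,
    )
    print(resp.choices[0].text)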

DeepSeek-V3.2 does not need a chat_template. Follow this article: https://docs.sglang.io/basic_usage/deepseek_v32.html

I am using a B200 with an ARM CPU, and hosting with the SGLang Docker image (`docker pull --platform linux/arm64 lmsysorg/sglang:latest`).
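
A container launch along these lines might serve as a starting point (a sketch only: the mounts, port, model path, and `--tp` value are assumptions, not a verified setup):

    # Sketch of a container launch; mounts, port, model path, and tensor
    # parallelism are assumptions. Set --tp to the GPU count on the node.
    docker run --gpus all --shm-size 32g -p 30000:30000 \
        -v ~/.cache/huggingface:/root/.cache/huggingface \
        lmsysorg/sglang:latest \
        python -m sglang.launch_server \
            --model-path deepseek-ai/DeepSeek-V3.2 \
            --tp 8 \
            --trust-remote-code \
            --host 0.0.0.0 --port 30000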

When requesting the model with the OpenAI client, follow the instructions at https://docs.vllm.ai/projects/recipes/en/latest/DeepSeek/DeepSeek-V3_2.html, e.g.:

    # `client` is assumed to be an openai.AsyncOpenAI instance pointed at the
    # server, and `encode_messages` is the helper referenced in the recipe
    # linked above; neither is defined in this snippet.
    encode_config = dict(thinking_mode="thinking", drop_thinking=True, add_default_bos_token=True)
    # messages -> prompt string (the server has no chat template, so encoding happens client-side)
    prompt = encode_messages(messages, **encode_config)
    response = await client.completions.create(
        model=canonical_name,
        prompt=prompt,
        top_p=top_p,
        temperature=temperature,
        max_tokens=max_tokens,
        stream=False,
    )
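
Note that this goes through the plain completions endpoint rather than chat completions: since the server exposes no chat template, the messages-to-prompt encoding happens entirely client-side via encode_messages.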
