Instructions to use HuggingFaceTB/SmolLM3-3B with libraries and local apps.

Libraries

How to use HuggingFaceTB/SmolLM3-3B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="HuggingFaceTB/SmolLM3-3B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
# Print the result so the snippet also shows output when run as a script
print(pipe(messages))

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM3-3B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use HuggingFaceTB/SmolLM3-3B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "HuggingFaceTB/SmolLM3-3B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "HuggingFaceTB/SmolLM3-3B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
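
The same request can be sent from Python. The sketch below uses only the standard library and assumes the vLLM server started above is listening on `localhost:8000`. The helper names `build_chat_request` and `chat` are hypothetical and are not part of vLLM; the response path `choices[0].message.content` follows the OpenAI chat-completions format that the curl example targets.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_message):
    """Build a POST request for an OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, model, user_message, timeout=60):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, model, user_message)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example call (requires a running server, so it is left commented out):
# print(chat("http://localhost:8000", "HuggingFaceTB/SmolLM3-3B",
#            "What is the capital of France?"))
```

For the SGLang server described further down, the same helper should work with the base URL changed to `http://localhost:30000`, assuming the server was launched with the port shown there.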

Use Docker

docker model run hf.co/HuggingFaceTB/SmolLM3-3B

SGLang

How to use HuggingFaceTB/SmolLM3-3B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "HuggingFaceTB/SmolLM3-3B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "HuggingFaceTB/SmolLM3-3B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "HuggingFaceTB/SmolLM3-3B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "HuggingFaceTB/SmolLM3-3B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use HuggingFaceTB/SmolLM3-3B with Docker Model Runner:
```
docker model run hf.co/HuggingFaceTB/SmolLM3-3B
```

Where are the SmolLM3-1B, 2B, and 0.6B?

#19

by ysn-rfd - opened Jul 15, 2025

Discussion

ysn-rfd

Jul 15, 2025

Where??

hmm..........

eliebak

Jul 16, 2025

Stay tuned :)

MihaiPopa-1

Jul 17, 2025

Can you also train SmolLM3 1.5B, 0.4B, 0.25B and 120M, even 75M? It will be better than SmolLM2, while also being smaller.

eliebak

Jul 18, 2025

I'm curious, do you see some cool use case for 75M parameter model?

MihaiPopa-1

Jul 19, 2025

eliebak said:

I'm curious, do you see some cool use case for 75M parameter model?

Yes, I want to run it even on a potato PC like this one (not as weak as one that runs Windows 95, but nothing like a PC with an RTX 5090, for example).

Xenova

Hugging Face Smol Models Research org

Jul 20, 2025 (edited Jul 20, 2025)

@MihaiPopa-1 I think Elie was asking what specific tasks such a small model would be able to perform at a sufficiently high accuracy to be useful (e.g., spell-checking, translation, summarization)... and how you would use the model :)

ysn-rfd

Jul 20, 2025

eliebak said:

Stay tuned :)

OK, hoping for better and brighter days for SmolLM. SmolLM = elegance, power, performance: the best model for low-end devices.

ysn-rfd

Jul 20, 2025

MihaiPopa-1 said:

Can you also train SmolLM3 1.5B, 0.4B, 0.25B and 120M, even 75M? It will be better than SmolLM2, while also being smaller.

Yes, it's possible, using the Knowledge Distillation method.

ysn-rfd

Jul 20, 2025 (edited Jul 20, 2025)

eliebak said:

I'm curious, do you see some cool use case for 75M parameter model?

MihaiPopa-1 said:

Yes, I want to run it even on a potato PC like this one (not as weak as one that runs Windows 95, but nothing like a PC with an RTX 5090, for example).

You might not believe it, but even models with either a large or small number of parameters can be run on integrated graphics. You don't necessarily need a dedicated GPU. Just make sure the shared memory for the integrated graphics is at least 4GB.

I use Intel HD Graphics 520 myself — it performs great, especially with MoE models.
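
A rough way to check a model against a 4 GB budget is to estimate the memory its weights alone would need. The sketch below is back-of-the-envelope arithmetic under stated assumptions: nominal parameter counts (3 billion for SmolLM3-3B, 75 million for the hypothetical smaller model discussed above), the budget treated as 4 GiB, and no allowance for the KV cache, activations, or runtime overhead. The function names are made up for illustration.

```python
def weight_memory_gib(num_parameters, bits_per_parameter):
    """Approximate memory needed for the weights alone, in GiB."""
    total_bytes = num_parameters * bits_per_parameter / 8
    return total_bytes / 2**30


def fits_in_budget(num_parameters, bits_per_parameter, budget_gib=4.0):
    """True if the weights alone fit in the budget (ignores KV cache and overhead)."""
    return weight_memory_gib(num_parameters, bits_per_parameter) <= budget_gib


# Nominal parameter counts; real checkpoints differ slightly.
models = {"3B (nominal)": 3e9, "75M (hypothetical)": 75e6}

for name, params in models.items():
    for bits in (16, 8, 4):
        size = weight_memory_gib(params, bits)
        verdict = "fits" if fits_in_budget(params, bits) else "does not fit"
        print(f"{name} at {bits}-bit: {size:.2f} GiB, {verdict} in 4 GiB")
```

Under these assumptions a nominal 3B model needs about 5.6 GiB at 16 bits and about 1.4 GiB at 4 bits, so only the quantized variants fit the budget, while a 75M model fits at any of these precisions.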

ysn-rfd

Jul 20, 2025

I love neural networks with all my heart — they’re super cool, awesome, and fascinating!

Clausss

Jul 20, 2025

MihaiPopa-1 said:

Can you also train SmolLM3 1.5B, 0.4B, 0.25B and 120M, even 75M? It will be better than SmolLM2, while also being smaller.

ysn-rfd said:

Yes, it's possible, using the Knowledge Distillation method.

The TRL library has GKDConfig and GKDTrainer, so it is highly possible.
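
To make the idea concrete, here is a minimal sketch of the classic logit-matching distillation loss: the KL divergence from a temperature-softened teacher distribution to the student distribution, scaled by the squared temperature. It is plain Python for a single token position and is not the TRL implementation; as far as I know, GKDTrainer uses a more general objective and can train on the student's own generations. The function names and the default temperature of 2.0 are assumptions for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by temperature squared."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    kl = sum(
        p * (math.log(p) - math.log(q))
        for p, q in zip(teacher_probs, student_probs)
        if p > 0
    )
    return kl * temperature**2


# A student that matches the teacher has zero loss; a mismatched one does not.
teacher = [1.0, 2.0, 3.0]
print(distillation_loss([1.0, 2.0, 3.0], teacher))
print(distillation_loss([3.0, 2.0, 1.0], teacher))
```

In a real training loop this loss would be averaged over all token positions and usually mixed with the ordinary next-token cross-entropy loss.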

ysn-rfd

Jul 20, 2025 (edited Jul 20, 2025)

MihaiPopa-1 said:

Can you also train SmolLM3 1.5B, 0.4B, 0.25B and 120M, even 75M? It will be better than SmolLM2, while also being smaller.

ysn-rfd said:

Yes, it's possible, using the Knowledge Distillation method.

Clausss said:

The TRL library has GKDConfig and GKDTrainer, so it is highly possible.

Yeah, thanks
