Where are the SmolLM3-1B, 2B, and 0.6B?
Where??
hmm..........
Stay tuned :)
Can you also train SmolLM3 1.5B, 0.4B, 0.25B and 120M, even 75M? It will be better than SmolLM2, while being also smaller.
I'm curious, do you see some cool use case for a 75M-parameter model?
eliebak said:
I'm curious, do you see some cool use case for a 75M-parameter model?
Yes, I want to run it even on a potato PC like this one (not as much of a potato as one that runs Windows 95, but nothing like a PC with an RTX 5090, for example).
@MihaiPopa-1 I think Elie was asking what specific tasks such a small model would be able to perform at a sufficiently high accuracy to be useful (e.g., spell-checking, translation, summarization)... and how you would use the model :)
Stay tuned :)
OK, hoping for better and brighter days for SmolLM. SmolLM = elegance, power, performance: the best model for low-end devices.
Can you also train SmolLM3 1.5B, 0.4B, 0.25B and 120M, even 75M? It will be better than SmolLM2, while being also smaller.
Yes, it’s possible — using the Knowledge Distillation method.
eliebak said:
I'm curious, do you see some cool use case for a 75M-parameter model?
Yes, I want to run it even on a potato PC like this one (not as much of a potato as one that runs Windows 95, but nothing like a PC with an RTX 5090, for example).
You might not believe it, but even models with either a large or small number of parameters can be run on integrated graphics. You don't necessarily need a dedicated GPU. Just make sure the shared memory for the integrated graphics is at least 4GB.
I use Intel HD Graphics 520 myself — it performs great, especially with MoE models.
I love neural networks with all my heart — they’re super cool, awesome, and fascinating!
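To put the 4GB shared-memory figure in context, here is a rough back-of-the-envelope sketch of how much memory the weights alone would take at different precisions. It assumes a round 3 billion and 75 million parameters, counts only the weights (no KV cache, activations, or runtime overhead), and treats quantized formats as exactly 1 or 0.5 bytes per parameter, which real quantization schemes only approximate. The function name and table are made up for illustration.

```python
# Rough estimate of the memory needed just to hold model weights.
# Assumption: weights dominate; KV cache, activations, and runtime
# overhead are ignored, so real usage will be somewhat higher.

BYTES_PER_PARAM = {
    "fp32": 4.0,
    "fp16": 2.0,
    "int8": 1.0,
    "int4": 0.5,  # idealized; real 4-bit formats store extra scale data
}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Return the approximate weight size in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    # Parameter counts are rounded assumptions, not exact model sizes.
    for name, params in [("3B", 3e9), ("75M", 75e6)]:
        for precision in BYTES_PER_PARAM:
            size = weight_memory_gb(params, precision)
            print(f"{name} @ {precision}: {size:.2f} GB")
```

Under these assumptions, a 3B model in fp16 needs about 6 GB for weights, so it would likely need quantization to fit in 4GB of shared memory, while a 75M model stays well under 1 GB at any of these precisions.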
Can you also train SmolLM3 1.5B, 0.4B, 0.25B and 120M, even 75M? It will be better than SmolLM2, while being also smaller.
Yes, it’s possible — using the Knowledge Distillation method.
The TRL library has GKDConfig and GKDTrainer, so it is highly possible.
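For anyone unfamiliar with the idea, here is a minimal sketch of the classic soft-label knowledge distillation loss for a single token position: the student is trained to match the teacher's temperature-softened output distribution. This is not TRL's GKDTrainer and not how the SmolLM team trains anything; the function names, the temperature value, and the toy logits are assumptions for illustration, written in plain Python so it runs without any dependencies.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by the given temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Assumes small toy inputs; very extreme logits could underflow to a
    zero probability and break the logarithm.
    """
    p = softmax(teacher_logits, temperature)  # teacher distribution
    q = softmax(student_logits, temperature)  # student distribution
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))
    return kl * temperature ** 2


if __name__ == "__main__":
    teacher = [4.0, 1.0, 0.5]   # hypothetical teacher logits over 3 tokens
    student = [2.0, 1.5, 1.0]   # hypothetical student logits
    print("loss (mismatched):", distillation_loss(student, teacher))
    print("loss (identical): ", distillation_loss(teacher, teacher))
```

The loss is zero when the student's logits match the teacher's and grows as the two distributions diverge. In real training this term would be averaged over all token positions and minimized with respect to the student's parameters, often alongside the usual cross-entropy loss.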
Can you also train SmolLM3 1.5B, 0.4B, 0.25B and 120M, even 75M? It will be better than SmolLM2, while being also smaller.
Yes, it’s possible — using the Knowledge Distillation method.
The TRL library has GKDConfig and GKDTrainer, so it is highly possible.
Yeah, thanks