FoodExtract-Vision-SmolVLM2-500M

A fine-tuned Vision-Language Model for structured food and drink extraction from images. Given an input image, the model outputs a structured JSON object containing a food/not-food flag, an image title, and lists of the extracted food and drink items.

Model Description

| Attribute | Value |
| --- | --- |
| Base Model | SmolVLM2-500M-Video-Instruct |
| Training Method | Supervised Fine-Tuning (SFT) |
| Training Strategy | Vision Encoder Frozen, LLM & Cross-Modal Connector Trainable |
| Total Parameters | 507M |
| Trainable Parameters | 421M (83%) |
| Frozen Parameters | 86M (17%) |
| Precision | bfloat16 |

Intended Use

This model is designed for:

- 🍕 Food/Drink Classification: Determine if an image contains food or drinks
- 📝 Structured Data Extraction: Extract food and drink items into JSON format
- 🏷️ Image Captioning: Generate food-related titles for images

Output Format

{
  "is_food": 1,
  "image_title": "macaron assortment",
  "food_items": ["yellow macaron", "white macaron", "green macaron"],
  "drink_items": []
}
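Downstream code usually needs this output as a Python object rather than raw text. The sketch below is one possible way to parse and validate it; the helper name `parse_food_output` is hypothetical and not part of the model or of any library. It assumes the reply may be wrapped in a Markdown code fence and may use single quotes (as the prompt template in the Quick start does), so it falls back to Python-literal parsing when strict JSON parsing fails.

```python
import ast
import json
import re

# Keys taken from the output format shown above.
EXPECTED_KEYS = ("is_food", "image_title", "food_items", "drink_items")


def parse_food_output(raw: str) -> dict:
    """Parse the model's text reply into a validated dict (illustrative helper)."""
    # Grab everything from the first "{" to the last "}", which also
    # discards an optional Markdown code fence around the payload.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    payload = match.group(0)

    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        # Assumption: single-quoted output can be read as a Python literal.
        try:
            data = ast.literal_eval(payload)
        except (ValueError, SyntaxError) as err:
            raise ValueError("model output is not valid JSON") from err

    if not isinstance(data, dict):
        raise ValueError("model output is not a JSON object")
    missing = [key for key in EXPECTED_KEYS if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")

    # Normalise types so callers can rely on them.
    return {
        "is_food": int(data["is_food"]),
        "image_title": str(data["image_title"]),
        "food_items": [str(item) for item in data["food_items"]],
        "drink_items": [str(item) for item in data["drink_items"]],
    }


sample = """{
  "is_food": 1,
  "image_title": "macaron assortment",
  "food_items": ["yellow macaron", "white macaron", "green macaron"],
  "drink_items": []
}"""
result = parse_food_output(sample)
print(result["food_items"])  # ['yellow macaron', 'white macaron', 'green macaron']
```

Raising `ValueError` on malformed output keeps the failure explicit, which lets the caller decide whether to retry generation or skip the image.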

Quick start

from transformers import pipeline
from PIL import Image
import torch

# Load the fine-tuned model
pipe = pipeline(
    "image-text-to-text",
    model="CreatorJarvis/FoodExtract-Vision-SmolVLM2-500M-fine-tune",
    dtype=torch.bfloat16,
    device_map="auto"
)

# Load the image to analyse (placeholder path, replace with your own file)
your_image = Image.open("your_image.jpg")

# Prepare input message
message = [{
    "role": "user",
    "content": [
        {"type": "image", "image": your_image},  # PIL.Image object
        {"type": "text", "text": """Classify the given input image into food or not and if edible food or drink items are present, extract those to a list. If no food/drink items are visible, return empty lists.

Only return valid JSON in the following form:

```json
{
  'is_food': 0,
  'image_title': '',
  'food_items': [],
  'drink_items': []
}
```
"""}
    ]
}]

# Run the pipeline; with chat-style input, "generated_text" holds the conversation with the model's reply
output = pipe(text=message, max_new_tokens=256)
print(output[0]["generated_text"])

Training Details

Dataset

| Split | Samples | Description |
| --- | --- | --- |
| Train | 1,208 | 80% of total dataset |
| Validation | 302 | 20% of total dataset |
| Total | 1,510 | Food images (1k) + Non-food images (500) |

Dataset Source: mrdbourke/FoodExtract-1k-Vision

Training Configuration

| Hyperparameter | Value |
| --- | --- |
| Epochs | 4 |
| Batch Size (per device) | 4 |
| Gradient Accumulation Steps | 4 |
| Effective Batch Size | 16 |
| Learning Rate | 2e-4 |
| LR Scheduler | Constant |
| Warmup Ratio | 0.03 |
| Optimizer | AdamW (fused) |
| Max Grad Norm | 1.0 |
| Precision | bf16 |
| Gradient Checkpointing | ✓ |
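As a rough sanity check on these numbers, the arithmetic below relates the batch settings to the number of optimizer steps. It assumes training on a single device (`num_devices = 1`, which the card does not state) and rounds a partial final batch up to a full step; the exact step count depends on how the trainer handles that last batch.

```python
import math

# Values from the dataset and configuration tables above.
train_samples = 1208
per_device_batch_size = 4
grad_accum_steps = 4
epochs = 4
num_devices = 1  # assumption: single-GPU training, not stated in the card

# Samples consumed per optimizer update.
effective_batch_size = per_device_batch_size * grad_accum_steps * num_devices

# Approximate optimizer updates per epoch and over the whole run.
steps_per_epoch = math.ceil(train_samples / effective_batch_size)
total_steps = steps_per_epoch * epochs

print(effective_batch_size)  # 16
print(steps_per_epoch)       # 76
print(total_steps)           # 304
```

With only a few hundred updates in total, each epoch contributes a large share of the run, which is worth keeping in mind when reading the per-epoch losses.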

Training Strategy

The Vision Encoder was frozen during training to:

- Preserve pre-trained visual representations
- Reduce trainable parameters and memory usage
- Improve training stability on small datasets
- Mitigate overfitting

This approach is inspired by the SmolDocling paper.
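Freezing a sub-module in PyTorch amounts to turning off `requires_grad` for its parameters. The helper below is a minimal sketch of that idea, not the author's training code: `freeze_by_prefix` is a hypothetical name, and it works on any object exposing `named_parameters()` the way `torch.nn.Module` does.

```python
def freeze_by_prefix(model, prefixes):
    """Freeze parameters whose names start with any prefix.

    Returns (trainable_count, frozen_count) after freezing.
    """
    prefixes = tuple(prefixes)
    trainable = 0
    frozen = 0
    for name, param in model.named_parameters():
        if name.startswith(prefixes):
            # Exclude this parameter from gradient updates.
            param.requires_grad = False
        if param.requires_grad:
            trainable += param.numel()
        else:
            frozen += param.numel()
    return trainable, frozen


# Example usage (commented out because it downloads the base model).
# Assumption: the vision encoder's parameters are named "model.vision_model.*";
# check the real names with `model.named_parameters()` before relying on this.
#
# from transformers import AutoModelForImageTextToText
# model = AutoModelForImageTextToText.from_pretrained(
#     "HuggingFaceTB/SmolVLM2-500M-Video-Instruct"
# )
# trainable, frozen = freeze_by_prefix(model, ["model.vision_model."])
# print(f"trainable: {trainable:,}  frozen: {frozen:,}")
```

If the prefix matches the vision encoder, the two counts should land near the 421M trainable and 86M frozen parameters reported in the Model Description.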

Training Results

| Epoch | Training Loss | Validation Loss |
| --- | --- | --- |
| 1 | 0.0842 | 0.0759 |
| 2 | 0.0816 | 0.0757 |
| 3 | 0.0237 | 0.0751 |
| 4 | 0.0172 | 0.0807 |

Final Training Loss: 0.0518
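One way to read these results is to compare training and validation loss per epoch. The short sketch below uses only the values reported above; it picks the epoch with the lowest validation loss and prints the gap between the two losses. It is an illustration of how the numbers can be inspected, not a description of how the released checkpoint was selected.

```python
# (epoch, training loss, validation loss) as reported in the results above.
history = [
    (1, 0.0842, 0.0759),
    (2, 0.0816, 0.0757),
    (3, 0.0237, 0.0751),
    (4, 0.0172, 0.0807),
]

# Epoch with the lowest validation loss.
best_epoch, _, best_val = min(history, key=lambda row: row[2])
print(best_epoch, best_val)  # 3 0.0751

# A growing gap (validation minus training) is a common sign of overfitting.
for epoch, train_loss, val_loss in history:
    print(epoch, round(val_loss - train_loss, 4))

# Mean of the per-epoch training losses.
mean_train = sum(row[1] for row in history) / len(history)
print(round(mean_train, 4))  # 0.0517
```

Validation loss is lowest at epoch 3 and rises at epoch 4 while training loss keeps falling, which is consistent with the small-dataset caveat under Limitations. The mean of the per-epoch training losses is close to the reported final training loss of 0.0518, so that figure may be a run-wide average rather than the last epoch's value; the card does not say how it was computed.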

Experiment Tracking

Demo

Try the model on Hugging Face Spaces:

🚀 FoodExtract-Vision Demo

The demo compares outputs from the base model vs. the fine-tuned model side-by-side.

Limitations

- Trained on a relatively small dataset (1.5k images)
- May struggle with complex multi-item food scenes
- Occasional repetitive generation patterns
- Best performance on single-dish food images

Framework Versions

| Library | Version |
| --- | --- |
| TRL | 0.27.1 |
| Transformers | 4.57.6 |
| PyTorch | 2.9.0+cu126 |
| Datasets | 4.0.0 |
| Tokenizers | 0.22.2 |

Citation

If you use this model, please cite:

@misc{foodextract-vision-2025,
  title        = {FoodExtract-Vision: Fine-tuned SmolVLM2 for Structured Food Extraction},
  author       = {Jarvis Zhang},
  year         = 2025,
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/CreatorJarvis/FoodExtract-Vision-SmolVLM2-500M-fine-tune}}
}