SmolVLM2-2.2B-Instruct-Agentic-GUI-GGUF
GGUF quantizations of smolagents/SmolVLM2-2.2B-Instruct-Agentic-GUI for use with llama.cpp.
This model is a fine-tuned version of SmolVLM2-2.2B-Instruct trained on the aguvis-stage-2 dataset (~630K GUI interaction examples) for on-screen GUI element detection and interaction. Given a screenshot and a task description, it outputs normalized [0, 1] coordinates for click targets.
Model Files
| File | Size | Description |
|---|---|---|
| SmolVLM2-2.2B-Instruct-Agentic-GUI-F16.gguf | 3.4 GB | Full precision (FP16) text model |
| SmolVLM2-2.2B-Instruct-Agentic-GUI-Q8_0.gguf | 1.8 GB | 8-bit quantized text model |
| SmolVLM2-2.2B-Instruct-Agentic-GUI-Q4_K_M.gguf | 1.0 GB | 4-bit quantized text model (recommended for mobile) |
| SmolVLM2-2.2B-Instruct-Agentic-GUI-mmproj-f16.gguf | 832 MB | Vision encoder (SigLIP, full precision, always required) |
You need one text model plus the mmproj: pick the text model that fits your hardware, then download the mmproj alongside it.
Capabilities
- click(x, y) - Click on a UI element at normalized coordinates
- type(text) - Type text at the current cursor position
- scroll(x, y, direction) - Scroll in a given direction
- drag(x1, y1, x2, y2) - Drag from one position to another
- key(key_name) - Press a keyboard key
All coordinates are normalized to the [0, 1] range, where (0, 0) is the top-left corner and (1, 1) is the bottom-right corner.
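As a minimal sketch (Python; `click_to_pixels` is a hypothetical helper, not part of the model), this is how a client might parse a click action and map its normalized coordinates back to screen pixels:

```python
import re

def click_to_pixels(action: str, screen_w: int, screen_h: int) -> tuple[int, int]:
    """Parse a click(x=..., y=...) string and convert the normalized
    [0, 1] coordinates to pixel coordinates for the given screen size."""
    m = re.search(r"click\(x=([\d.]+),\s*y=([\d.]+)\)", action)
    if m is None:
        raise ValueError(f"not a click action: {action!r}")
    x, y = float(m.group(1)), float(m.group(2))
    # (0, 0) is the top-left corner, (1, 1) the bottom-right corner.
    return round(x * screen_w), round(y * screen_h)

print(click_to_pixels("click(x=0.491, y=0.073)", 1920, 1080))  # (943, 79)
```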
Quick Start with llama.cpp
Prerequisites
Build llama.cpp with multimodal support:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DLLAMA_BUILD_COMMON=ON
cmake --build build -j
Download Models
# Download the model files (text models + vision encoder)
huggingface-cli download ahmadw/SmolVLM2-2.2B-Agentic-GUI-GGUF \
--local-dir models/
Run Server
./build/bin/llama-server \
-m models/SmolVLM2-2.2B-Instruct-Agentic-GUI-Q4_K_M.gguf \
--mmproj models/SmolVLM2-2.2B-Instruct-Agentic-GUI-mmproj-f16.gguf \
-c 4096 -ngl 99 --port 8888 \
--chat-template smolvlm
Run CLI
./build/bin/llama-mtmd-cli \
-m models/SmolVLM2-2.2B-Instruct-Agentic-GUI-Q4_K_M.gguf \
--mmproj models/SmolVLM2-2.2B-Instruct-Agentic-GUI-mmproj-f16.gguf \
-c 4096 -ngl 99 \
--image screenshot.png \
-p "Click on the search button"
Prompt Format
This model uses the SmolVLM idefics-style chat template (NOT ChatML). llama.cpp has built-in support via --chat-template smolvlm.
The raw template structure:
<|im_start|>System: {system_prompt}<end_of_utterance>
User:<image>{task_instruction}<end_of_utterance>
Assistant:
Key details:
- <|im_start|> is the BOS token and appears only once
- <end_of_utterance> terminates each turn (not <|im_end|>)
- <image> is replaced by the vision encoder's image tokens
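If you drive the model outside llama.cpp's built-in template handling, the raw prompt can be assembled by hand. A minimal sketch (`build_prompt` is a hypothetical helper; the `<image>` placeholder must still be expanded by your runtime):

```python
def build_prompt(system_prompt: str, task: str) -> str:
    # <|im_start|> appears once as BOS; each turn ends with <end_of_utterance>.
    # <image> is replaced by the vision encoder's image tokens at inference time.
    return (
        "<|im_start|>"
        f"System: {system_prompt}<end_of_utterance>\n"
        f"User:<image>{task}<end_of_utterance>\n"
        "Assistant:"
    )
```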
System Prompt
You are a helpful assistant that can interact with a computer screen.
You can use the following tools to interact with the screen:
- click(start_x, start_y) - Click on a specific position on the screen.
- type(text) - Type a string of text.
- scroll(start_x, start_y, direction) - Scroll in a direction.
- key(key_name) - Press a specific key.
- drag(start_x, start_y, end_x, end_y) - Drag from one position to another.
- wait(seconds) - Wait for a specified number of seconds.
Important guidelines:
- All coordinates are normalized to [0, 1] range, where (0, 0) is the
top-left corner of the screen and (1, 1) is the bottom-right corner.
- Coordinates should be the center of the element you want to interact with.
Example Output
Given a screenshot and the instruction "Click the Settings button", the model outputs:
click(x=0.491, y=0.073)
Image Preprocessing
Images should be resized so the longest edge is 1152 pixels while preserving aspect ratio. The vision encoder uses SigLIP with 384px tiles and 3x merge, resulting in a 1152px effective resolution.
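A minimal preprocessing sketch, assuming Pillow is installed (`resize_longest_edge` and the file names are illustrative):

```python
from PIL import Image

def resize_longest_edge(path: str, target: int = 1152) -> Image.Image:
    """Resize so the longest edge is `target` pixels, preserving aspect ratio."""
    img = Image.open(path).convert("RGB")
    scale = target / max(img.size)
    return img.resize((round(img.width * scale), round(img.height * scale)), Image.LANCZOS)

resize_longest_edge("screenshot.png").save("screenshot_1152.png")
```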
API Usage (Server Mode)
# Send a screenshot with a task
curl http://localhost:8888/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "data:image/png;base64,..."} },
{"type": "text", "text": "Click on the search bar"}
]
}
],
"max_tokens": 128,
"temperature": 0
}'
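The same request from Python, as a sketch using the `requests` library against the server started above (the port, file names, and the abbreviated system prompt are assumptions to adapt):

```python
import base64
import requests

SYSTEM_PROMPT = "You are a helpful assistant that can interact with a computer screen. ..."  # full text in the System Prompt section

with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:8888/v1/chat/completions",
    json={
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                {"type": "text", "text": "Click on the search bar"},
            ]},
        ],
        "max_tokens": 128,
        "temperature": 0,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])  # e.g. click(x=..., y=...)
```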
Technical Details
| Property | Value |
|---|---|
| Architecture | SmolVLM (idefics3-based) |
| Parameters | 2.2B (text) + SigLIP vision encoder |
| Text Backbone | SmolLM2-1.7B-Instruct |
| Vision Encoder | SigLIP-SO400M-patch14-384 |
| Image Resolution | 1152px longest edge (3x384 tiles) |
| Context Length | 4096 tokens |
| Coordinate Format | Normalized [0, 1] float |
| Training Data | aguvis-stage-2 (~630K GUI examples) |
| Original Format | Safetensors (BF16) |
| Quantization Method | llama.cpp convert_hf_to_gguf.py |
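To sanity-check these properties against a downloaded file, a small sketch using the `gguf` Python package (`pip install gguf`; the path is an example):

```python
from gguf import GGUFReader

reader = GGUFReader("models/SmolVLM2-2.2B-Instruct-Agentic-GUI-Q4_K_M.gguf")
print(f"{len(reader.tensors)} tensors")
for name in reader.fields:  # metadata keys, e.g. general.architecture
    print(name)
```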
Conversion Details
These GGUFs were converted directly from the source model weights using llama.cpp's convert_hf_to_gguf.py. Third-party pre-quantized GGUFs (e.g., from automated quantization services) were found to produce incorrect output (pixel coordinates instead of normalized [0,1] coordinates), likely due to missing fine-tuned layer weights during conversion.
License
Apache 2.0 (inherited from the base model SmolVLM2-2.2B-Instruct)
Credits
- Base Model: HuggingFaceTB/SmolVLM2-2.2B-Instruct
- Fine-Tuned Model: smolagents/SmolVLM2-2.2B-Instruct-Agentic-GUI
- Training Dataset: smolagents/aguvis-stage-2
- Inference Engine: llama.cpp
- Paper: SmolVLM: Redefining small and efficient multimodal models (Marafioti et al., 2025)