SmolVLM2-2.2B-Instruct-Agentic-GUI-GGUF
GGUF quantizations of smolagents/SmolVLM2-2.2B-Instruct-Agentic-GUI for use with llama.cpp.
This model is a fine-tuned version of SmolVLM2-2.2B-Instruct trained on the aguvis-stage-2 dataset (~630K GUI interaction examples) for on-screen GUI element detection and interaction. Given a screenshot and a task description, it outputs normalized [0, 1] coordinates for click targets.
Model Files
| File | Size | Description |
|---|---|---|
| SmolVLM2-2.2B-Instruct-Agentic-GUI-F16.gguf | 3.4 GB | Full precision (FP16) text model |
| SmolVLM2-2.2B-Instruct-Agentic-GUI-Q8_0.gguf | 1.8 GB | 8-bit quantized text model |
| SmolVLM2-2.2B-Instruct-Agentic-GUI-Q4_K_M.gguf | 1.0 GB | 4-bit quantized text model (recommended for mobile) |
| SmolVLM2-2.2B-Instruct-Agentic-GUI-mmproj-f16.gguf | 832 MB | Vision encoder (SigLIP, full precision, always required) |
You need one text model plus the mmproj: pick the text model that fits your hardware, then download the mmproj alongside it.
Capabilities
- click(x, y) - Click on a UI element at normalized coordinates
- type(text) - Type text at the current cursor position
- scroll(x, y, direction) - Scroll in a given direction
- drag(x1, y1, x2, y2) - Drag from one position to another
- key(key_name) - Press a keyboard key
All coordinates are normalized to the [0, 1] range, where (0, 0) is the top-left corner and (1, 1) is the bottom-right corner.
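As a minimal sketch (Python; `click_to_pixels` is a hypothetical helper, not part of the model), this is how a client might parse a click action and map its normalized coordinates back to screen pixels:

```python
import re

def click_to_pixels(action: str, screen_w: int, screen_h: int) -> tuple[int, int]:
    """Parse a click(x=..., y=...) string and convert the normalized
    [0, 1] coordinates to pixel coordinates for the given screen size."""
    m = re.search(r"click\(x=([\d.]+),\s*y=([\d.]+)\)", action)
    if m is None:
        raise ValueError(f"not a click action: {action!r}")
    x, y = float(m.group(1)), float(m.group(2))
    # (0, 0) is the top-left corner, (1, 1) the bottom-right corner.
    return round(x * screen_w), round(y * screen_h)

print(click_to_pixels("click(x=0.491, y=0.073)", 1920, 1080))  # (943, 79)
```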
Quick Start with llama.cpp
Prerequisites
Build llama.cpp with multimodal support:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DLLAMA_BUILD_COMMON=ON
cmake --build build -j
Download Models
# Download the model files (text models + vision encoder)
huggingface-cli download ahmadw/SmolVLM2-2.2B-Agentic-GUI-GGUF \
--local-dir models/
Run Server
./build/bin/llama-server \
-m models/SmolVLM2-2.2B-Instruct-Agentic-GUI-Q4_K_M.gguf \
--mmproj models/SmolVLM2-2.2B-Instruct-Agentic-GUI-mmproj-f16.gguf \
-c 4096 -ngl 99 --port 8888 \
--chat-template smolvlm
Run CLI
./build/bin/llama-mtmd-cli \
-m models/SmolVLM2-2.2B-Instruct-Agentic-GUI-Q4_K_M.gguf \
--mmproj models/SmolVLM2-2.2B-Instruct-Agentic-GUI-mmproj-f16.gguf \
-c 4096 -ngl 99 \
--image screenshot.png \
-p "Click on the search button"
Prompt Format
This model uses the SmolVLM idefics-style chat template (NOT ChatML). llama.cpp has built-in support via --chat-template smolvlm.
The raw template structure:
<|im_start|>System: {system_prompt}<end_of_utterance>
User:<image>{task_instruction}<end_of_utterance>
Assistant:
Key details:
- <|im_start|> is the BOS token and appears only once
- <end_of_utterance> terminates each turn (not <|im_end|>)
- <image> is replaced by the vision encoder's image tokens
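If you drive the model outside llama.cpp's built-in template handling, the raw prompt can be assembled by hand. A minimal sketch (`build_prompt` is a hypothetical helper; the `<image>` placeholder must still be expanded by your runtime):

```python
def build_prompt(system_prompt: str, task: str) -> str:
    # <|im_start|> appears once as BOS; each turn ends with <end_of_utterance>.
    # <image> is replaced by the vision encoder's image tokens at inference time.
    return (
        "<|im_start|>"
        f"System: {system_prompt}<end_of_utterance>\n"
        f"User:<image>{task}<end_of_utterance>\n"
        "Assistant:"
    )
```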
System Prompt
You are a helpful assistant that can interact with a computer screen.
You can use the following tools to interact with the screen:
- click(start_x, start_y) - Click on a specific position on the screen.
- type(text) - Type a string of text.
- scroll(start_x, start_y, direction) - Scroll in a direction.
- key(key_name) - Press a specific key.
- drag(start_x, start_y, end_x, end_y) - Drag from one position to another.
- wait(seconds) - Wait for a specified number of seconds.
Important guidelines:
- All coordinates are normalized to [0, 1] range, where (0, 0) is the
top-left corner of the screen and (1, 1) is the bottom-right corner.
- Coordinates should be the center of the element you want to interact with.
Example Output
Given a screenshot and the instruction "Click the Settings button", the model outputs:
click(x=0.491, y=0.073)
Image Preprocessing
Images should be resized so the longest edge is 1152 pixels while preserving aspect ratio. The vision encoder uses SigLIP with 384px tiles and 3x merge, resulting in a 1152px effective resolution.
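A minimal preprocessing sketch, assuming Pillow is installed (`resize_longest_edge` and the file names are illustrative):

```python
from PIL import Image

def resize_longest_edge(path: str, target: int = 1152) -> Image.Image:
    """Resize so the longest edge is `target` pixels, preserving aspect ratio."""
    img = Image.open(path).convert("RGB")
    scale = target / max(img.size)
    return img.resize((round(img.width * scale), round(img.height * scale)), Image.LANCZOS)

resize_longest_edge("screenshot.png").save("screenshot_1152.png")
```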
API Usage (Server Mode)
# Send a screenshot with a task
curl http://localhost:8888/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "data:image/png;base64,..."} },
{"type": "text", "text": "Click on the search bar"}
]
}
],
"max_tokens": 128,
"temperature": 0
}'
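The same request from Python, as a sketch using the `requests` library against the server started above (the port, file names, and the abbreviated system prompt are assumptions to adapt):

```python
import base64
import requests

SYSTEM_PROMPT = "You are a helpful assistant that can interact with a computer screen. ..."  # full text in the System Prompt section

with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:8888/v1/chat/completions",
    json={
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                {"type": "text", "text": "Click on the search bar"},
            ]},
        ],
        "max_tokens": 128,
        "temperature": 0,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])  # e.g. click(x=..., y=...)
```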
Technical Details
| Property | Value |
|---|---|
| Architecture | SmolVLM (idefics3-based) |
| Parameters | 2.2B (text) + SigLIP vision encoder |
| Text Backbone | SmolLM2-1.7B-Instruct |
| Vision Encoder | SigLIP-SO400M-patch14-384 |
| Image Resolution | 1152px longest edge (3x384 tiles) |
| Context Length | 4096 tokens |
| Coordinate Format | Normalized [0, 1] float |
| Training Data | aguvis-stage-2 (~630K GUI examples) |
| Original Format | Safetensors (BF16) |
| Quantization Method | llama.cpp convert_hf_to_gguf.py |
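To sanity-check these properties against a downloaded file, a small sketch using the `gguf` Python package (`pip install gguf`; the path is an example):

```python
from gguf import GGUFReader

reader = GGUFReader("models/SmolVLM2-2.2B-Instruct-Agentic-GUI-Q4_K_M.gguf")
print(f"{len(reader.tensors)} tensors")
for name in reader.fields:  # metadata keys, e.g. general.architecture
    print(name)
```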
Conversion Details
These GGUFs were converted directly from the source model weights using llama.cpp's convert_hf_to_gguf.py. Third-party pre-quantized GGUFs (e.g., from automated quantization services) were found to produce incorrect output (pixel coordinates instead of normalized [0,1] coordinates), likely due to missing fine-tuned layer weights during conversion.
License
Apache 2.0 (inherited from the base model SmolVLM2-2.2B-Instruct)
Credits
- Base Model: HuggingFaceTB/SmolVLM2-2.2B-Instruct
- Fine-Tuned Model: smolagents/SmolVLM2-2.2B-Instruct-Agentic-GUI
- Training Dataset: smolagents/aguvis-stage-2
- Inference Engine: llama.cpp
- Paper: SmolVLM: Redefining small and efficient multimodal models (Marafioti et al., 2025)