Great to see compact vision models getting practical. I built a privacy-first, cross-platform web UI that runs SmolVLM2-2.2B-Instruct (vision) alongside SmolLM3-3B (text). It auto-detects CUDA/MPS/CPU, pulls models on first run, and serves a clean Gradio interface.
Vision: describe images, visual Q&A, quick OCR
Text: code generation, explanation, summarization, multilingual prompts
Local only: no API keys or cloud services
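Roughly, the core setup looks like the sketch below. This is a simplified illustration, not the exact code from the repo: the model ID is the public HuggingFaceTB checkpoint, the image URL is just a placeholder, and the generation settings are arbitrary defaults.

```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

def pick_device() -> str:
    """Prefer CUDA, then Apple MPS, then fall back to CPU."""
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"

device = pick_device()
dtype = torch.bfloat16 if device == "cuda" else torch.float32

# Weights are downloaded and cached by Hugging Face on first run.
model_id = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype=dtype).to(device)

# Describe an image via the chat template (placeholder URL below).
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/photo.jpg"},
        {"type": "text", "text": "Describe this image."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(device, dtype=dtype)

out = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```

The actual app wraps this kind of pipeline in a Gradio UI and applies the same device logic to SmolLM3-3B for the text tab.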
I’m actively collecting feedback on ideal image sizes, better defaults for generation parameters, and presets that make visual tasks smoother. If you’re testing SmolVLM* locally, I’d love your notes.
Repo: https://github.com/mikecastrodemaria/SmolLM3-M2-Interface-Multimodale
Thanks for any pointers, issues, or PRs!