core OCR
coreOCR / Camel-Doc-OCR / docscopeOCR / MonkeyOCR
Comprehensive Demo of Multimodal VLMs on the Hub
Demo of Qwen 3.5, a native multimodal model
Nanonets / olmOCR / RolmOCR / Aya-Vision / Qwen2-VL-OCR
Unified Multimodal Comprehension and Generation
Demo of a collection of Qwen3-VL models
FireRed / Nanonets / Monkey / Thyme / Typhoon / SmolDocling
object detection, visual grounding, keypoint detection
Chandra-OCR / Nanonets-OCR2 / olmOCR-2 / Dots.OCR
Florence-2-large / Florence-2-base
Multimodal OCR model for complex document understanding.
DeepCaption / SkyCaptioner / SpaceThinker / Core / SpaceOm
Cosmos-R1 / docscopeOCR / Captioner-7B / visionOCR-3B
Molmo2 - Image, Video (QA, Pointing & Tracking)
Text-to-Image → 3D or Image-to-3D
Smart Any-Horizon Agents for Long Video Reasoning. [SAGE]
Ultra-compact Computer-Use Agent [GUI Localization]
Image-Text to Voice (en)
Testing for the latest transformers (DeepSeek-OCR).
Florence-2 vision models demo. (transformers)
For document parsing tasks
OCR, VQA, Thinking and Object Detection.
understand document semantics, extract text and tables.
Vision-Language Models for Document Conversion
Experiment with small super OCR models here.
thinking / ocr / reasoning
Unredacted: Ask Anything with Near-Zero Refusal Rates