---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- visual-document-retrieval
- cross-modal-distillation
- knowledge-distillation
- nanovdr
base_model: answerdotai/ModernBERT-base
language:
- en
license: apache-2.0
datasets:
- openbmb/VisRAG-Ret-Train-Synthetic-data
- openbmb/VisRAG-Ret-Train-In-domain-data
- vidore/colpali_train_set
- llamaindex/vdr-multilingual-train
model-index:
- name: NanoVDR-L
  results:
  - task:
      type: retrieval
    dataset:
      name: ViDoRe v1
      type: vidore/vidore-benchmark-667173f98e70a1c0fa4d
    metrics:
    - name: NDCG@5
      type: ndcg_at_5
      value: 82.4
  - task:
      type: retrieval
    dataset:
      name: ViDoRe v2
      type: vidore/vidore-benchmark-v2
    metrics:
    - name: NDCG@5
      type: ndcg_at_5
      value: 61.5
---

> **Paper**: Our arXiv preprint is currently on hold. Details on training methodology, ablations, and full results will be available once the paper is published.

# NanoVDR-L

**ModernBERT-base ablation variant.** For production use, we recommend **[NanoVDR-S-Multi](https://huggingface.co/nanovdr/NanoVDR-S-Multi)**.

NanoVDR-L is a 151M-parameter text-only query encoder for visual document retrieval, trained via asymmetric cross-modal distillation from [Qwen3-VL-Embedding-2B](https://huggingface.co/Qwen/Qwen3-VL-Embedding-2B). It uses ModernBERT-base plus a 2-layer MLP projector and achieves the highest ViDoRe v1 score (82.4) among all NanoVDR variants.

## Highlights

- **Single-vector retrieval** — queries and documents share the same 2048-dim embedding space as [Qwen3-VL-Embedding-2B](https://huggingface.co/Qwen/Qwen3-VL-Embedding-2B); retrieval is a plain dot product, FAISS-compatible, at **4 KB per page** (float16)
- **Lightweight storage** — 612 MB model; the document index costs 64× less than ColPali's multi-vector patch embeddings
- **Asymmetric setup** — a tiny 151M text encoder runs at query time; the large VLM indexes documents offline, once

## Results

| Model | Params | ViDoRe v1 | ViDoRe v2 | ViDoRe v3 | Avg. Retention |
|-------|--------|-----------|-----------|-----------|----------------|
| Qwen3-VL-Emb (teacher) | 2.0B | 84.3 | 65.3 | 50.0 | — |
| **NanoVDR-L** | **151M** | **82.4** | **61.5** | **44.2** | **93.4%** |
| NanoVDR-S-Multi | 69M | 82.2 | 61.9 | 46.5 | 95.1% |

NDCG@5 (×100). Retention = student / teacher, averaged across v1/v2/v3.

## Usage

> **Prerequisite:** Documents must be indexed offline using [Qwen3-VL-Embedding-2B](https://huggingface.co/Qwen/Qwen3-VL-Embedding-2B) (the teacher model). See the [NanoVDR-S-Multi model page](https://huggingface.co/nanovdr/NanoVDR-S-Multi#prerequisites-document-indexing-with-teacher-model) for a complete indexing guide.
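The teacher indexing step produces one 2048-dim float16 vector per page, which is where the 4 KB/page figure comes from (2048 × 2 bytes = 4096 bytes). The sketch below illustrates the index layout and dot-product search with random vectors standing in for real teacher embeddings; the sizes and the argsort-based top-k are the only things it demonstrates:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the teacher index: N pages × 2048 dims, stored as float16.
doc_embeddings = rng.standard_normal((1000, 2048)).astype(np.float16)
assert doc_embeddings[0].nbytes == 4096  # 4 KB per page at float16

# Stand-in for a NanoVDR-L query embedding (same 2048-dim space).
query = rng.standard_normal((1, 2048)).astype(np.float16)

# Retrieval is a plain dot product; upcast to float32 for the matmul.
scores = query.astype(np.float32) @ doc_embeddings.T.astype(np.float32)
top_k = np.argsort(scores[0])[-5:][::-1]  # indices of the 5 best pages
```

Because scoring is a single inner product, the same matrix can be dropped into a flat FAISS inner-product index unchanged.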
```python
from sentence_transformers import SentenceTransformer
import numpy as np

# doc_embeddings: (N, 2048) from teacher indexing (see prerequisite above)
model = SentenceTransformer("nanovdr/NanoVDR-L")
query_embeddings = model.encode(["What was the revenue growth in Q3?"])  # (1, 2048)

scores = query_embeddings @ doc_embeddings.T
top_k_indices = np.argsort(scores[0])[-5:][::-1]
```

## Training Details

| | Value |
|--|-------|
| Architecture | ModernBERT-base (149M) + MLP projector (768 → 768 → 2048, 2.4M) = 151M |
| Objective | Pointwise cosine alignment with teacher query embeddings |
| Data | 711K query–document pairs |
| Epochs / lr | 20 / 2e-4 |
| Training cost | ~11.7 GPU-hours (1× H200) |
| CPU query latency | 109 ms |

## All NanoVDR Models

| Model | Backbone | Params | v1 | v2 | v3 | Retention |
|-------|----------|--------|----|----|----|-----------|
| **[NanoVDR-S-Multi](https://huggingface.co/nanovdr/NanoVDR-S-Multi)** | **DistilBERT** | **69M** | **82.2** | **61.9** | **46.5** | **95.1%** |
| [NanoVDR-S](https://huggingface.co/nanovdr/NanoVDR-S) | DistilBERT | 69M | 82.2 | 60.5 | 43.5 | 92.4% |
| [NanoVDR-M](https://huggingface.co/nanovdr/NanoVDR-M) | BERT-base | 112M | 82.1 | 62.2 | 44.7 | 94.0% |
| [NanoVDR-L](https://huggingface.co/nanovdr/NanoVDR-L) | ModernBERT | 151M | 82.4 | 61.5 | 44.2 | 93.4% |

## Citation

```bibtex
@article{nanovdr2026,
  title={NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval},
  author={Liu, Zhuchenyang and Zhang, Yao and Xiao, Yu},
  journal={arXiv preprint arXiv:2502.XXXXX},
  year={2026}
}
```

## License

Apache 2.0