RTriever-4B

RTriever-4B is a 4-billion-parameter dense retriever based on Qwen/Qwen3-Embedding-4B, specialized for reasoning-intensive information retrieval.

Model details

Base model: Qwen/Qwen3-Embedding-4B
Parameters: 4 B
Hidden size: 2,560
Layers: 36
Attention heads: 32
Embedding dimension: 2,560
Pooling: last-token + L2 normalization
Max sequence length: 32,768 tokens
Tokenizer: Qwen3 (vocab size 151,665)
Format: safetensors (sharded, bf16); also exposes a sentence-transformers interface
License: MIT

The model can be loaded either as a sentence-transformers model or as a plain transformers model with manual last-token pooling.

Quick start

Option A: sentence-transformers (recommended)

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("yale-nlp/RTriever-4B")

# Queries should use a query prompt; documents should NOT have a prompt.
query = "Why are insects attracted to light at night?"
docs = [
    "Recent flight-tracking studies show insects orient their dorsal axis toward "
    "the brightest visual region; near a point light source, this dorsal-light "
    "response disrupts flight stability and traps the insect.",
    "Fluorescent lamps emit in the UV range, which can be perceived by some "
    "nocturnal insects as a navigational cue similar to moonlight.",
]

q_emb = model.encode(query, prompt_name="query")
d_emb = model.encode(docs)

# Cosine similarity (the embeddings are L2-normalized, so a dot product suffices)
scores = q_emb @ d_emb.T
print(scores)

The prompt_name="query" argument prepends the default query instruction stored in config_sentence_transformers.json:

Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery:
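
The stored prompt templates can be inspected directly on the loaded model; this assumes a recent sentence-transformers release that exposes the prompts attribute:

# Continuing from the snippet above: list the prompt templates shipped with the checkpoint.
print(model.prompts)   # e.g. {"query": "Instruct: Given a web search query, ...\nQuery:"}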

For task-specific instructions (e.g. a domain-tuned prompt template), pass prompt=... directly:

custom_prompt = (
    "Given a Biology post, retrieve relevant passages that help answer the post\nPost: "
)
q_emb = model.encode(query, prompt=custom_prompt)

Option B: transformers with manual pooling

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("yale-nlp/RTriever-4B")
model = AutoModel.from_pretrained(
    "yale-nlp/RTriever-4B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

QUERY_PROMPT = "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery:"


def last_token_pool(last_hidden_states, attention_mask):
    # Right-padding-aware last-token pooling. Works whether or not the tokenizer
    # left-pads.
    left_padding = attention_mask[:, -1].sum() == attention_mask.shape[0]
    if left_padding:
        return last_hidden_states[:, -1]
    seq_lens = attention_mask.sum(dim=1) - 1
    return last_hidden_states[torch.arange(last_hidden_states.size(0)), seq_lens]


def encode(texts, prompt: str = ""):
    if prompt:
        texts = [prompt + t for t in texts]   # match sentence-transformers behavior: no separator
    batch = tokenizer(texts, padding=True, truncation=True, max_length=8192, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**batch)
    pooled = last_token_pool(out.last_hidden_state, batch["attention_mask"])
    return F.normalize(pooled, p=2, dim=1)


queries = ["Why are insects attracted to light at night?"]
docs = [
    "Recent flight-tracking studies show insects orient their dorsal axis toward "
    "the brightest visual region; near a point light source, this dorsal-light "
    "response disrupts flight stability and traps the insect.",
]

q_emb = encode(queries, prompt=QUERY_PROMPT)
d_emb = encode(docs, prompt="")
scores = (q_emb @ d_emb.T).cpu().tolist()
print(scores)
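
As a quick consistency check, the manual pooling above should closely reproduce the Option A embeddings. A minimal sketch, assuming both loading paths fit in memory in the same session (small deviations can come from the bf16 weights):

from sentence_transformers import SentenceTransformer

st_model = SentenceTransformer("yale-nlp/RTriever-4B")
st_emb = st_model.encode(queries, prompt=QUERY_PROMPT)   # same prompt string as above

# Both sides are L2-normalized, so the row-wise dot products are cosine similarities.
agreement = (q_emb.float().cpu().numpy() * st_emb).sum(axis=1)
print(agreement)   # values close to 1.0 mean the two paths agree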

Option C: batched retrieval over a corpus

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("yale-nlp/RTriever-4B")

corpus = [...]                          # list[str], thousands–millions of docs
doc_emb = model.encode(corpus, batch_size=16, show_progress_bar=True)

queries = [...]                         # list[str]
q_emb = model.encode(queries, prompt_name="query", batch_size=16)

# Top-k retrieval (cosine similarity == dot product, since both sides are L2-normalized)
scores = q_emb @ doc_emb.T              # (n_query, n_doc)
top_k = np.argsort(-scores, axis=1)[:, :100]
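
For larger corpora, the normalized embeddings can be handed to a vector index instead of a dense matrix product; a minimal sketch with FAISS (an extra dependency, not something the model ships with), using an exact inner-product index:

import faiss

dim = doc_emb.shape[1]                     # 2,560 for RTriever-4B
index = faiss.IndexFlatIP(dim)             # inner product == cosine on L2-normalized vectors
index.add(doc_emb.astype(np.float32))      # FAISS expects float32

scores, indices = index.search(q_emb.astype(np.float32), 100)   # both arrays: (n_query, 100)

IndexFlatIP is still exact search; for corpora in the millions, an IVF or HNSW index trades a little recall for much lower latency.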

Notes on inputs

  • Query prompt is required. Use prompt_name="query" (sentence-transformers) or prepend QUERY_PROMPT manually (transformers). Documents are encoded without a prompt.
  • Domain-specific prompts typically improve retrieval quality on reasoning-intensive queries; a generic web-search prompt is provided as the default.
  • Long inputs are supported up to 32 K tokens; for retrieval you usually want to truncate documents to 4–8 K tokens to keep encoding cost manageable (see the snippet after this list).
  • Embeddings are L2-normalized, so cosine similarity reduces to a dot product. Both sentence-transformers (util.cos_sim) and a plain q @ d.T work.
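
A minimal sketch of capping document length via sentence-transformers, assuming the Option A loading path (max_seq_length controls where the tokenizer truncates):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("yale-nlp/RTriever-4B")
model.max_seq_length = 8192              # truncate inputs at 8 K tokens instead of the 32 K maximum

docs = [...]                             # list[str], possibly very long documents
doc_emb = model.encode(docs, batch_size=16)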

Intended use

RTriever-4B is intended for reasoning-intensive retrieval: queries that require multi-step inference and the integration of complementary evidence rather than surface-level keyword or paraphrase matching. It can also be used as a drop-in replacement for any general-purpose dense retriever in retrieval-augmented generation, scholarly search, and agentic search pipelines.
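
For example, in a retrieval-augmented generation pipeline the retriever only needs to supply the top-k passages for the generator's context. A rough sketch continuing from Option C, where generate_answer is a hypothetical stand-in for whichever LLM you pair it with:

question = "Why are insects attracted to light at night?"
q_emb = model.encode(question, prompt_name="query")
top_passages = [corpus[i] for i in np.argsort(-(q_emb @ doc_emb.T))[:5]]

context = "\n\n".join(top_passages)
answer = generate_answer(f"Context:\n{context}\n\nQuestion: {question}")   # hypothetical generator call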

License

Released under the MIT License. The base model (Qwen/Qwen3-Embedding-4B) retains its original license; consult the Qwen3-Embedding model card for upstream attribution.
