Paper: Matryoshka Representation Learning (arXiv:2205.13147)
This is a fine-tuned version of Qwen/Qwen3-Embedding-0.6B optimized for academic and scientific literature search. The model was trained with contrastive learning and hard negative mining on data curated specifically for academic search scenarios.
| Attribute | Value |
|---|---|
| Base Model | Qwen/Qwen3-Embedding-0.6B |
| Architecture | Qwen3 |
| Embedding Dimension | 1024 |
| MRL Dimensions | 1024, 768, 512, 256, 128 |
| Max Sequence Length | 4096 |
| Pooling | Last token |
| Precision | float16 |
Retrieval performance (Recall@10) on academic search benchmarks, compared with the base model:

| Model | Avg. Recall@10 | SciFact | TRECCOVID | NFCorpus | SCIDOCS | LitSearch | QASA |
|---|---|---|---|---|---|---|---|
| Qwen3-Embedding-0.6B-academic | 0.3881 | 0.8482 | 0.0227 | 0.1739 | 0.2323 | 0.6944 | 0.3570 |
| Qwen3-Embedding-0.6B | 0.3620 | 0.8277 | 0.0213 | 0.1604 | 0.2311 | 0.6489 | 0.2828 |
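Recall@10 is the fraction of each query's relevant documents that appear among the top 10 results ranked by cosine similarity, averaged over queries. A minimal sketch of the computation (the function name and data layout are illustrative, not taken from the evaluation harness):

```python
import torch

def recall_at_10(query_embs, doc_embs, relevant_doc_ids):
    """Average fraction of relevant documents retrieved in each query's top-10.

    query_embs: (num_queries, dim) L2-normalized tensor
    doc_embs:   (num_docs, dim)    L2-normalized tensor
    relevant_doc_ids: list of sets, one set of relevant doc indices per query
    """
    scores = query_embs @ doc_embs.T              # cosine similarity (inputs are unit-length)
    top10 = scores.topk(10, dim=1).indices        # (num_queries, 10)
    recalls = []
    for i, relevant in enumerate(relevant_doc_ids):
        if not relevant:
            continue
        hits = len(relevant & set(top10[i].tolist()))
        recalls.append(hits / len(relevant))
    return sum(recalls) / len(recalls)
```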
Training hyperparameters:

| Parameter | Value |
|---|---|
| Learning Rate | 1e-6 |
| Batch Size | 8192 (effective) |
| Per-Device Batch Size | 32 |
| Warmup Steps | 100 |
| Weight Decay | 0.1 |
| Precision | fp16 |
| Max Length | 4096 |
| Loss Function | InfoNCE (Contrastive) |
| Temperature (τ) | 0.02 |
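The InfoNCE objective scores each query against its paired passage (the positive) and against other passages in the batch plus mined hard negatives, with similarities scaled by the temperature τ = 0.02. A minimal sketch of this loss, assuming L2-normalized embeddings and in-batch negatives (tensor names are illustrative; the actual training code is not part of this card):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_embs, passage_embs, temperature=0.02):
    """Contrastive InfoNCE loss with in-batch negatives.

    query_embs:   (batch, dim) L2-normalized query embeddings
    passage_embs: (batch, dim) L2-normalized passage embeddings; passage_embs[i]
                  is the positive for query_embs[i], all other rows act as
                  negatives (mined hard negatives can be appended as extra rows)
    """
    logits = query_embs @ passage_embs.T / temperature        # (batch, num_passages)
    targets = torch.arange(query_embs.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)
```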
The model was trained on LEAD (Liner Embedding Academic Dataset), a combination of ~55,560 samples tailored for academic search.
This model supports Matryoshka Representation Learning. You can truncate embeddings to smaller dimensions (768, 512, 256, 128) for faster computation and reduced storage.
```python
# `embeddings` is a (batch, 1024) tensor produced by the encoding functions in the usage example below.

# Full dimension (1024)
full_embedding = embeddings[:, :1024]

# MRL dimensions (truncated vectors are no longer unit-length; re-normalize them
# if you score with dot products instead of cosine similarity)
embedding_768 = embeddings[:, :768]
embedding_512 = embeddings[:, :512]
embedding_256 = embeddings[:, :256]
embedding_128 = embeddings[:, :128]
```
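For example, a 256-dimensional index could be scored like this (a minimal sketch, assuming `query_emb` and `passage_emb` are the (1, 1024) tensors produced by the usage example below):

```python
import torch

# Keep only the first 256 dimensions, then re-normalize so that a plain
# dot product equals cosine similarity again (truncation breaks unit length).
q256 = torch.nn.functional.normalize(query_emb[:, :256], p=2, dim=1)
p256 = torch.nn.functional.normalize(passage_emb[:, :256], p=2, dim=1)
score = (q256 * p256).sum(dim=1)  # equivalent to cosine similarity, both are unit-length
```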
Usage with the transformers library:

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_path = "LinerAI/Qwen3-Embedding-0.6B-academic"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path, torch_dtype=torch.float16)
model.eval()

# Queries are prefixed with an instruction; passages are encoded as-is.
general_instruction = "Given a query, retrieve relevant passages that answer the query"

# For queries
def encode_query(text):
    input_text = f"Instruct: {general_instruction}\nQuery: {text}"
    inputs = tokenizer(input_text, return_tensors="pt", max_length=4096, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    embeddings = outputs.last_hidden_state[:, -1]  # Last token pooling
    embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
    return embeddings

# For passages
def encode_passage(text):
    inputs = tokenizer(text, return_tensors="pt", max_length=4096, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    embeddings = outputs.last_hidden_state[:, -1]  # Last token pooling
    embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
    return embeddings

# Example: Academic search
query = "transformer models for protein structure prediction"
abstract = "We introduce AlphaFold, a deep learning system that predicts protein structures..."

query_emb = encode_query(query)
passage_emb = encode_passage(abstract)

similarity = torch.nn.functional.cosine_similarity(query_emb, passage_emb)
print(f"Similarity: {similarity.item():.4f}")
```
Query input format:

```
Instruct: Given a query, retrieve relevant passages that answer the query
Query: {your_query_text}
```

Passage input format:

```
{your_passage_text}
```
This model is released under the Apache 2.0 license.