Orange/nomic-embed-text-v1.5: 1536-Dimensional Embedding Model
A high-performance embedding model from the Orange organization, built by extending nomic-ai/nomic-embed-text-v1.5 to 1536 dimensions using a linear projection layer.
Overview
This model is a modified version of Nomic Embed v1.5, which is itself an improvement over the original Nomic Embed model. The key enhancement is that its output embeddings are projected from the native 768-dimensional space into a 1536-dimensional space while preserving semantic similarity.
Architecture
The Orange/nomic-embed-text-v1.5 model uses a three-stage pipeline:
Transformer (768-dim) → Pooling → Dense Projection (1536-dim)
- Base Model: nomic-ai/nomic-embed-text-v1.5 (a BERT-style encoder with SwiGLU activations)
- Projection Method: Linear layer with weight matrix (1536 x 768)
  - Top 768 rows: sqrt(2) · I (scales the original dimensions by sqrt(2))
  - Bottom 768 rows: zeros (zero-padding)
- Result: Preserves cosine similarity while doubling dimensions
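The projection is simple enough to reproduce by hand. The snippet below is an illustration of the scheme described above (not the shipped weights): it builds the (1536 × 768) matrix and checks that cosine similarity is unchanged.

```python
import torch
import torch.nn.functional as F

# Illustrative reconstruction of the projection: top 768 rows = sqrt(2) * I,
# bottom 768 rows = zeros, giving a (1536, 768) weight matrix.
W = torch.cat([torch.eye(768) * 2 ** 0.5, torch.zeros(768, 768)], dim=0)

x, y = torch.randn(768), torch.randn(768)
x_proj, y_proj = W @ x, W @ y  # 1536-dim vectors

# The sqrt(2) scaling cancels in the cosine normalization and the zero padding
# adds nothing to the dot product or the norms, so similarity is preserved.
print(F.cosine_similarity(x, y, dim=0).item())
print(F.cosine_similarity(x_proj, y_proj, dim=0).item())  # same value
```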
Key Properties
- Embedding Dimension: 1536
- Sequence Length: 8192 tokens (supports long contexts)
- Similarity Metric: Cosine similarity preserved from base model
- Matryoshka: The model supports adjustable embedding dimensions (Matryoshka Representation Learning)
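With sentence-transformers, the advertised dimension and context length can be checked directly on the loaded model (a quick sanity check; the values marked "expected" come from the properties listed above):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Orange/orange-nomic-v1.5-1536", trust_remote_code=True)
print(model.get_sentence_embedding_dimension())  # expected: 1536
print(model.max_seq_length)                      # expected: 8192
```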
Usage
Important: Task Instruction Prefix
The model requires a task instruction prefix in the input text. This tells the model which task you're performing.
For RAG (Retrieval-Augmented Generation)
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Orange/orange-nomic-v1.5-1536", trust_remote_code=True)

# Embed documents
documents = ['search_document: The quick brown fox jumps over the lazy dog']
doc_embeddings = model.encode(documents)

# Embed queries
queries = ['search_query: What animal is in the sentence?']
query_embeddings = model.encode(queries)
```
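To rank documents against a query, compare the two sets of embeddings with cosine similarity. A minimal continuation of the snippet above using the sentence-transformers util helper:

```python
from sentence_transformers import util

# Cosine similarity matrix of shape [n_queries, n_docs]
scores = util.cos_sim(query_embeddings, doc_embeddings)
print(scores)
```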
Available Task Prefixes
| Prefix | Purpose |
|---|---|
| search_document | Embed texts as documents for indexing (e.g., RAG) |
| search_query | Embed texts as queries to find relevant documents |
| clustering | Embed texts for grouping into clusters |
| classification | Embed texts as features for classification |
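The non-retrieval prefixes follow the same pattern; only the prefix string changes. A short sketch reusing the model loaded in the RAG example above:

```python
# Clustering: embed texts for grouping into clusters
clustering_texts = [
    'clustering: Transformers rely on self-attention',
    'clustering: Sourdough bread needs a long fermentation',
]
clustering_embeddings = model.encode(clustering_texts)

# Classification: embed texts as features for a downstream classifier
classification_texts = ['classification: This movie was fantastic!']
classification_features = model.encode(classification_texts)
```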
Python Examples
Using Sentence Transformers
```python
from sentence_transformers import SentenceTransformer
import torch.nn.functional as F

model = SentenceTransformer("Orange/orange-nomic-v1.5-1536", trust_remote_code=True)

# Encode sentences
sentences = ['search_query: What is TSNE?', 'search_query: Who is Laurens van der Maaten?']
embeddings = model.encode(sentences, convert_to_tensor=True)

# Optional: Apply layer normalization and truncate for Matryoshka
matryoshka_dim = 768  # Can use any dimension <= 1536
embeddings = F.layer_norm(embeddings, normalized_shape=(embeddings.shape[1],))
embeddings = embeddings[:, :matryoshka_dim]
embeddings = F.normalize(embeddings, p=2, dim=1)
print(embeddings.shape)  # torch.Size([2, 768])
```
Using Transformers Directly
```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

def mean_pooling(model_output, attention_mask):
    # Average token embeddings, ignoring padding positions
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

model_name = "Orange/orange-nomic-v1.5-1536"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
model.eval()

sentences = ['search_query: What is TSNE?', 'search_query: Who is Laurens van der Maaten?']
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = model(**encoded_input)

embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
embeddings = F.layer_norm(embeddings, normalized_shape=(embeddings.shape[1],))
embeddings = embeddings[:, :1536]  # Use full 1536-dim
embeddings = F.normalize(embeddings, p=2, dim=1)
print(embeddings.shape)  # torch.Size([2, 1536])
```
Adjusting Dimensionality (Matryoshka)
This model supports Matryoshka Representation Learning - you can use smaller embedding dimensions:
| Dimension | Use Case |
|---|---|
| 1536 | Full dimensionality (default) |
| 768 | Half the size, ~same quality |
| 512 | Good quality, 3x compression |
| 256 | High compression, minimal quality loss |
| 128 | Maximum compression |
Example with 512 dimensions:

```python
embeddings = embeddings[:, :512]  # Truncate to 512 dimensions
embeddings = F.normalize(embeddings, p=2, dim=1)  # Re-normalize after truncation
```
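To see how much similarity scores drift as the dimension shrinks, you can truncate at several cut-offs and compare; in this sketch, `full_embeddings` stands in for the layer-normed 1536-dimensional output produced in the examples above:

```python
import torch.nn.functional as F

for dim in (1536, 768, 512, 256, 128):
    truncated = F.normalize(full_embeddings[:, :dim], p=2, dim=1)
    # Cosine similarity of the first two sentences at this dimension
    print(dim, F.cosine_similarity(truncated[0], truncated[1], dim=0).item())
```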
Model Performance
MTEB Benchmark Results
| Task | Dataset | Metric | Score |
|---|---|---|---|
| Retrieval | ArguANA | MAP@100 | 40.081 |
| STS | BIOSSES | Cosine Spearman | 84.25 |
| Classification | Banking77 | Accuracy | 84.25 |
| Classification | IMDB | Accuracy | 85.31 |
| Retrieval | MSMARCO | MAP@100 | 36.88 |
| Retrieval | Quora | MAP@100 | 84.80 |
See the model card on HuggingFace for the complete MTEB leaderboard results.
Differences from Base Model
| Property | nomic-embed-text-v1.5 | Orange/nomic-embed-text-v1.5 |
|---|---|---|
| Dimension | 768 | 1536 |
| Cosine Similarity | Native | Preserved via projection |
| Matryoshka | Supported | Supported |
| Use Case | General embedding | Higher-dim applications |
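Because the added 768 dimensions are zero-padded, cosine similarities from this model should match the base model's. A quick sanity check (a sketch; it downloads both models):

```python
from sentence_transformers import SentenceTransformer, util

base = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
projected = SentenceTransformer("Orange/orange-nomic-v1.5-1536", trust_remote_code=True)

sentences = ['search_query: What is TSNE?', 'search_query: Who is Laurens van der Maaten?']
emb_base = base.encode(sentences, convert_to_tensor=True)
emb_proj = projected.encode(sentences, convert_to_tensor=True)

# Pairwise similarity of the two sentences under each model
print(util.cos_sim(emb_base, emb_base)[0, 1].item())
print(util.cos_sim(emb_proj, emb_proj)[0, 1].item())  # expected to be (near-)identical
```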
Use Cases
This 1536-dimensional model is particularly useful for:
- Applications requiring higher-dimensional embeddings
- Maintaining compatibility with existing 1536-dim workflows
- Scenarios where a fixed 1536-dimensional interface is required, even though the extra dimensions carry no additional information (they are zero-padded)
- Experiments comparing different embedding dimensions
References
- Nomic Embed v1.5: https://huggingface.co/nomic-ai/nomic-embed-text-v1.5
- Nomic Embed Technical Report: https://arxiv.org/abs/2402.01613
- Matryoshka Representation Learning: https://arxiv.org/abs/2205.13147
Citation
If you use this model in your research, please cite the original Nomic Embed work:
```bibtex
@misc{nussbaum2024nomic,
  title={Nomic Embed: Training a Reproducible Long Context Text Embedder},
  author={Zach Nussbaum and John X. Morris and Brandon Duderstadt and Andriy Mulyar},
  year={2024},
  eprint={2402.01613},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
License
This model is licensed under Apache 2.0.
Evaluation results
All scores are self-reported MTEB results.

| Dataset | Metric | Score |
|---|---|---|
| ArguAna | map_at_1 | 24.253 |
| ArguAna | map_at_10 | 38.962 |
| ArguAna | map_at_100 | 40.081 |
| BIOSSES | cos_sim_pearson | 86.740 |
| BIOSSES | cos_sim_spearman | 84.246 |
| Banking77Classification | accuracy | 84.253 |
| Banking77Classification | f1 | 84.179 |
| ImdbClassification | accuracy | 85.312 |
| ImdbClassification | ap | 80.363 |
| ImdbClassification | f1 | 85.266 |
| MSMARCO | map_at_1 | 23.364 |
| MSMARCO | map_at_10 | 35.712 |
| MSMARCO | map_at_100 | 36.877 |
| QuoraRetrieval | map_at_1 | 70.402 |
| QuoraRetrieval | map_at_10 | 84.181 |
| QuoraRetrieval | map_at_100 | 84.796 |