Orange/nomic-embed-text-v1.5: 1536-Dimensional Embedding Model
A high-performance embedding model from the Orange organization, built by extending nomic-ai/nomic-embed-text-v1.5 to 1536 dimensions using a linear projection layer.
Overview
This model is a modified version of Nomic Embed v1.5, which is itself an improvement over the original Nomic Embed model. The key enhancement is that its output embeddings are projected from the native 768-dimensional space into a 1536-dimensional space while preserving semantic similarity.
Architecture
The Orange/nomic-embed-text-v1.5 model uses a three-stage pipeline:
Transformer (768-dim) → Pooling → Dense Projection (1536-dim)
- Base Model: nomic-ai/nomic-embed-text-v1.5 (a BERT-style encoder with SwiGLU activations)
- Projection Method: Linear layer with weight matrix (1536 x 768)
  - Top 768 rows: sqrt(2) · I (scales the original dimensions by sqrt(2))
  - Bottom 768 rows: zeros (zero-padding)
- Result: Preserves cosine similarity while doubling dimensions
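The projection is simple enough to reproduce by hand. The snippet below is an illustration of the scheme described above (not the shipped weights): it builds the (1536 × 768) matrix and checks that cosine similarity is unchanged.

```python
import torch
import torch.nn.functional as F

# Illustrative reconstruction of the projection: top 768 rows = sqrt(2) * I,
# bottom 768 rows = zeros, giving a (1536, 768) weight matrix.
W = torch.cat([torch.eye(768) * 2 ** 0.5, torch.zeros(768, 768)], dim=0)

x, y = torch.randn(768), torch.randn(768)
x_proj, y_proj = W @ x, W @ y  # 1536-dim vectors

# The sqrt(2) scaling cancels in the cosine normalization and the zero padding
# adds nothing to the dot product or the norms, so similarity is preserved.
print(F.cosine_similarity(x, y, dim=0).item())
print(F.cosine_similarity(x_proj, y_proj, dim=0).item())  # same value
```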
Key Properties
- Embedding Dimension: 1536
- Sequence Length: 8192 tokens (supports long contexts)
- Similarity Metric: Cosine similarity preserved from base model
- Matryoshka: The model supports adjustable embedding dimensions (Matryoshka Representation Learning)
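With sentence-transformers, the advertised dimension and context length can be checked directly on the loaded model (a quick sanity check; the values marked "expected" come from the properties listed above):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Orange/orange-nomic-v1.5-1536", trust_remote_code=True)
print(model.get_sentence_embedding_dimension())  # expected: 1536
print(model.max_seq_length)                      # expected: 8192
```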
Usage
Important: Task Instruction Prefix
The model requires a task instruction prefix in the input text. This tells the model which task you're performing.
For RAG (Retrieval-Augmented Generation)
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Orange/orange-nomic-v1.5-1536", trust_remote_code=True)

# Embed documents
documents = ['search_document: The quick brown fox jumps over the lazy dog']
doc_embeddings = model.encode(documents)

# Embed queries
queries = ['search_query: What animal is in the sentence?']
query_embeddings = model.encode(queries)
```
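To rank documents against a query, compare the two sets of embeddings with cosine similarity. A minimal continuation of the snippet above using the sentence-transformers util helper:

```python
from sentence_transformers import util

# Cosine similarity matrix of shape [n_queries, n_docs]
scores = util.cos_sim(query_embeddings, doc_embeddings)
print(scores)
```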
Available Task Prefixes
| Prefix | Purpose |
|---|---|
| search_document | Embed texts as documents for indexing (e.g., RAG) |
| search_query | Embed texts as queries to find relevant documents |
| clustering | Embed texts for grouping into clusters |
| classification | Embed texts as features for classification |
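The non-retrieval prefixes follow the same pattern; only the prefix string changes. A short sketch reusing the model loaded in the RAG example above:

```python
# Clustering: embed texts for grouping into clusters
clustering_texts = [
    'clustering: Transformers rely on self-attention',
    'clustering: Sourdough bread needs a long fermentation',
]
clustering_embeddings = model.encode(clustering_texts)

# Classification: embed texts as features for a downstream classifier
classification_texts = ['classification: This movie was fantastic!']
classification_features = model.encode(classification_texts)
```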
Python Examples
Using Sentence Transformers
```python
from sentence_transformers import SentenceTransformer
import torch.nn.functional as F

model = SentenceTransformer("Orange/orange-nomic-v1.5-1536", trust_remote_code=True)

# Encode sentences
sentences = ['search_query: What is TSNE?', 'search_query: Who is Laurens van der Maaten?']
embeddings = model.encode(sentences, convert_to_tensor=True)

# Optional: Apply layer normalization and truncate for Matryoshka
matryoshka_dim = 768  # Can use any dimension <= 1536
embeddings = F.layer_norm(embeddings, normalized_shape=(embeddings.shape[1],))
embeddings = embeddings[:, :matryoshka_dim]
embeddings = F.normalize(embeddings, p=2, dim=1)
print(embeddings.shape)  # torch.Size([2, 768])
```
Using Transformers Directly
```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

def mean_pooling(model_output, attention_mask):
    # Average token embeddings, ignoring padding positions
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

model_name = "Orange/orange-nomic-v1.5-1536"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
model.eval()

sentences = ['search_query: What is TSNE?', 'search_query: Who is Laurens van der Maaten?']
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = model(**encoded_input)

embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
embeddings = F.layer_norm(embeddings, normalized_shape=(embeddings.shape[1],))
embeddings = embeddings[:, :1536]  # Use full 1536-dim
embeddings = F.normalize(embeddings, p=2, dim=1)
print(embeddings.shape)  # torch.Size([2, 1536])
```
Adjusting Dimensionality (Matryoshka)
This model supports Matryoshka Representation Learning - you can use smaller embedding dimensions:
| Dimension | Use Case |
|---|---|
| 1536 | Full dimensionality (default) |
| 768 | Half the size, ~same quality |
| 512 | Good quality, 3x compression |
| 256 | High compression, minimal quality loss |
| 128 | Maximum compression |
Example with 512 dimensions:

```python
embeddings = embeddings[:, :512]  # Truncate to 512 dimensions
embeddings = F.normalize(embeddings, p=2, dim=1)  # Re-normalize after truncation
```
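To see how much similarity scores drift as the dimension shrinks, you can truncate at several cut-offs and compare; in this sketch, `full_embeddings` stands in for the layer-normed 1536-dimensional output produced in the examples above:

```python
import torch.nn.functional as F

for dim in (1536, 768, 512, 256, 128):
    truncated = F.normalize(full_embeddings[:, :dim], p=2, dim=1)
    # Cosine similarity of the first two sentences at this dimension
    print(dim, F.cosine_similarity(truncated[0], truncated[1], dim=0).item())
```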
Model Performance
MTEB Benchmark Results
| Task | Dataset | Metric | Score |
|---|---|---|---|
| Retrieval | ArguANA | MAP@100 | 40.081 |
| STS | BIOSSES | Cosine Spearman | 84.25 |
| Classification | Banking77 | Accuracy | 84.25 |
| Classification | IMDB | Accuracy | 85.31 |
| Retrieval | MSMARCO | MAP@100 | 36.88 |
| Retrieval | Quora | MAP@100 | 84.80 |
See the model card on HuggingFace for the complete MTEB leaderboard results.
Differences from Base Model
| Property | nomic-embed-text-v1.5 | Orange/nomic-embed-text-v1.5 |
|---|---|---|
| Dimension | 768 | 1536 |
| Cosine Similarity | Native | Preserved via projection |
| Matryoshka | Supported | Supported |
| Use Case | General embedding | Higher-dim applications |
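Because the added 768 dimensions are zero-padded, cosine similarities from this model should match the base model's. A quick sanity check (a sketch; it downloads both models):

```python
from sentence_transformers import SentenceTransformer, util

base = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
projected = SentenceTransformer("Orange/orange-nomic-v1.5-1536", trust_remote_code=True)

sentences = ['search_query: What is TSNE?', 'search_query: Who is Laurens van der Maaten?']
emb_base = base.encode(sentences, convert_to_tensor=True)
emb_proj = projected.encode(sentences, convert_to_tensor=True)

# Pairwise similarity of the two sentences under each model
print(util.cos_sim(emb_base, emb_base)[0, 1].item())
print(util.cos_sim(emb_proj, emb_proj)[0, 1].item())  # expected to be (near-)identical
```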
Use Cases
This 1536-dimensional model is particularly useful for:
- Applications requiring higher-dimensional embeddings
- Maintaining compatibility with existing 1536-dim workflows
- Scenarios where a fixed 1536-dimensional interface is required, even though the extra dimensions carry no additional information (they are zero-padded)
- Experiments comparing different embedding dimensions
References
- Nomic Embed v1.5: https://huggingface.co/nomic-ai/nomic-embed-text-v1.5
- Nomic Embed Technical Report: https://arxiv.org/abs/2402.01613
- Matryoshka Representation Learning: https://arxiv.org/abs/2205.13147
Citation
If you use this model in your research, please cite the original Nomic Embed work:
```bibtex
@misc{nussbaum2024nomic,
  title={Nomic Embed: Training a Reproducible Long Context Text Embedder},
  author={Zach Nussbaum and John X. Morris and Brandon Duderstadt and Andriy Mulyar},
  year={2024},
  eprint={2402.01613},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
License
This model is licensed under Apache 2.0.
Evaluation results
All scores are self-reported MTEB results.

| Dataset | Metric | Score |
|---|---|---|
| ArguAna | map_at_1 | 24.253 |
| ArguAna | map_at_10 | 38.962 |
| ArguAna | map_at_100 | 40.081 |
| BIOSSES | cos_sim_pearson | 86.740 |
| BIOSSES | cos_sim_spearman | 84.246 |
| Banking77Classification | accuracy | 84.253 |
| Banking77Classification | f1 | 84.179 |
| ImdbClassification | accuracy | 85.312 |
| ImdbClassification | ap | 80.363 |
| ImdbClassification | f1 | 85.266 |
| MSMARCO | map_at_1 | 23.364 |
| MSMARCO | map_at_10 | 35.712 |
| MSMARCO | map_at_100 | 36.877 |
| QuoraRetrieval | map_at_1 | 70.402 |
| QuoraRetrieval | map_at_10 | 84.181 |
| QuoraRetrieval | map_at_100 | 84.796 |