Orange/nomic-embed-text-v1.5: 1536-Dimensional Embedding Model

A high-performance embedding model from the Orange organization, built by extending nomic-ai/nomic-embed-text-v1.5 to 1536 dimensions using a learnable linear projection.

Overview

This model is a modified version of nomic-ai/nomic-embed-text-v1.5, which is itself an improvement over the original Nomic Embed model. The key change is that its embeddings are projected from the native 768-dimensional space to a 1536-dimensional space in a way that preserves cosine similarity.

Architecture

The Orange/nomic-embed-text-v1.5 model uses a three-stage pipeline:

Transformer (768-dim) → Pooling → Dense Projection (1536-dim)
  • Base Model: nomic-ai/nomic-embed-text-v1.5 (a BERT-style encoder with SwiGLU activations)
  • Projection Method: Linear layer with weight matrix (1536 x 768)
    • Top 768 rows: sqrt(2) * I (scales original dimensions by sqrt(2))
    • Bottom 768 rows: zeros (zero-padding)
  • Result: Preserves cosine similarity while doubling dimensions
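
To see why this projection leaves cosine similarity unchanged, the short sketch below builds the (1536 x 768) matrix described above and compares similarities before and after projection. This is an illustration only, not the model's own code.

import math
import torch
import torch.nn.functional as F

# Projection described above: sqrt(2) * I on top, zeros below -> shape (1536, 768)
W = torch.cat([math.sqrt(2.0) * torch.eye(768), torch.zeros(768, 768)], dim=0)

# Two random vectors standing in for pooled 768-dim transformer outputs
a, b = torch.randn(768), torch.randn(768)

# Scaling and zero-padding change the norm but not the angle between vectors,
# so cosine similarity is identical before and after the projection
print(F.cosine_similarity(a, b, dim=0))
print(F.cosine_similarity(W @ a, W @ b, dim=0))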

Key Properties

  • Embedding Dimension: 1536
  • Sequence Length: 8192 tokens (supports long contexts)
  • Similarity Metric: Cosine similarity preserved from base model
  • Matryoshka: The model supports adjustable embedding dimensions (Matryoshka Representation Learning)

Usage

Important: Task Instruction Prefix

The model requires a task instruction prefix in the input text. This tells the model which task you're performing.

For RAG (Retrieval-Augmented Generation)

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Orange/orange-nomic-v1.5-1536", trust_remote_code=True)

# Embed documents
documents = ['search_document: The quick brown fox jumps over the lazy dog']
doc_embeddings = model.encode(documents)

# Embed queries
queries = ['search_query: What animal is in the sentence?']
query_embeddings = model.encode(queries)
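
Continuing the snippet above, retrieval reduces to ranking documents by cosine similarity against each query. The util helpers below are generic sentence-transformers utilities rather than anything specific to this model.

from sentence_transformers import util

# Rank documents by cosine similarity to each query (higher score = more relevant)
scores = util.cos_sim(query_embeddings, doc_embeddings)  # shape: (n_queries, n_docs)
best_doc = scores.argmax(dim=1)
print(scores, best_doc)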

Available Task Prefixes

Prefix          | Purpose
search_document | Embed texts as documents for indexing (e.g., RAG)
search_query    | Embed texts as queries to find relevant documents
clustering      | Embed texts for grouping into clusters
classification  | Embed texts as features for classification
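
If you prefer not to hand-write the prefixes, a small helper can prepend them consistently. The with_prefix function below is a hypothetical convenience wrapper, not part of the model's API; it reuses the model object loaded above.

# Hypothetical convenience wrapper for adding task prefixes
TASK_PREFIXES = {"search_document", "search_query", "clustering", "classification"}

def with_prefix(texts, task):
    if task not in TASK_PREFIXES:
        raise ValueError(f"Unknown task prefix: {task}")
    return [f"{task}: {text}" for text in texts]

doc_embeddings = model.encode(with_prefix(["The quick brown fox jumps over the lazy dog"], "search_document"))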

Python Examples

Using Sentence Transformers

from sentence_transformers import SentenceTransformer
import torch.nn.functional as F

model = SentenceTransformer("Orange/orange-nomic-v1.5-1536", trust_remote_code=True)

# Encode sentences
sentences = ['search_query: What is TSNE?', 'search_query: Who is Laurens van der Maaten?']
embeddings = model.encode(sentences, convert_to_tensor=True)

# Optional: Apply layer normalization and truncate for Matryoshka
matryoshka_dim = 768  # Can use any dimension <= 1536
embeddings = F.layer_norm(embeddings, normalized_shape=(embeddings.shape[1],))
embeddings = embeddings[:, :matryoshka_dim]
embeddings = F.normalize(embeddings, p=2, dim=1)

print(embeddings.shape)  # torch.Size([2, 768])

Using Transformers Directly

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

model_name = "Orange/orange-nomic-v1.5-1536"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
model.eval()

sentences = ['search_query: What is TSNE?', 'search_query: Who is Laurens van der Maaten?']
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = model(**encoded_input)

embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
embeddings = F.layer_norm(embeddings, normalized_shape=(embeddings.shape[1],))
embeddings = embeddings[:, :1536]  # Keep the full width (slice to fewer dims for Matryoshka)
embeddings = F.normalize(embeddings, p=2, dim=1)
# Note: the 1536-dim Dense projection is a Sentence Transformers module, so when loading
# via AutoModel the output here may remain 768-dimensional
print(embeddings.shape)  # torch.Size([2, 1536]) if the projection is applied

Adjusting Dimensionality (Matryoshka)

This model supports Matryoshka Representation Learning - you can use smaller embedding dimensions:

Dimension | Use Case
1536      | Full dimensionality (default)
768       | Half the dimensions, roughly the same quality
512       | Good quality, 3x compression
256       | High compression, minimal quality loss
128       | Maximum compression

Example with 512 dimensions:

embeddings = embeddings[:, :512]  # Truncate to 512 dimensions
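
The full recipe from the Sentence Transformers example (layer norm, truncate, re-normalize) can be wrapped in a small helper so the target dimension becomes a single parameter. The shrink function below is a hypothetical name, not part of the model's API.

import torch
import torch.nn.functional as F

def shrink(embeddings: torch.Tensor, dim: int) -> torch.Tensor:
    # Layer-normalize, keep the first `dim` components, then L2-normalize
    embeddings = F.layer_norm(embeddings, normalized_shape=(embeddings.shape[1],))
    embeddings = embeddings[:, :dim]
    return F.normalize(embeddings, p=2, dim=1)

small = shrink(model.encode(['search_query: What is TSNE?'], convert_to_tensor=True), dim=512)
print(small.shape)  # torch.Size([1, 512])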

Model Performance

MTEB Benchmark Results

Task           | Dataset   | Metric          | Score
Retrieval      | ArguAna   | MAP@100         | 40.081
STS            | BIOSSES   | Cosine Spearman | 84.25
Classification | Banking77 | Accuracy        | 84.25
Classification | IMDB      | Accuracy        | 85.31
Retrieval      | MSMARCO   | MAP@100         | 36.88
Retrieval      | Quora     | MAP@100         | 84.80

See the model card on HuggingFace for the complete MTEB leaderboard results.

Differences from Base Model

Property          | nomic-embed-text-v1.5 | Orange/nomic-embed-text-v1.5
Dimension         | 768                   | 1536
Cosine Similarity | Native                | Preserved via projection
Matryoshka        | Supported             | Supported
Use Case          | General embedding     | Higher-dimensional applications
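
Because the projection only rescales and zero-pads, both models should produce essentially the same cosine similarity for any pair of texts. The snippet below is a quick sanity check, assuming both repositories load via sentence-transformers; it is an illustration, not an official test.

from sentence_transformers import SentenceTransformer, util

base = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
wide = SentenceTransformer("Orange/orange-nomic-v1.5-1536", trust_remote_code=True)

pair = ['search_query: What is TSNE?',
        'search_document: t-SNE is a dimensionality reduction technique.']
print(util.cos_sim(*base.encode(pair)))  # similarity from the 768-dim base model
print(util.cos_sim(*wide.encode(pair)))  # similarity from the 1536-dim projection; should match closely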

Use Cases

This 1536-dimensional model is particularly useful for:

  • Applications requiring higher-dimensional embeddings
  • Maintaining compatibility with existing 1536-dim workflows (see the sketch after this list)
  • Scenarios where the extra dimensionality may offer marginal benefits
  • Experiments comparing different embedding dimensions
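
As an illustration of the compatibility point above, the embeddings can be dropped into any vector index built for 1536-dimensional vectors. The sketch below uses FAISS purely as an example (it assumes the faiss package is installed; any 1536-dim vector store works the same way).

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Orange/orange-nomic-v1.5-1536", trust_remote_code=True)

# Inner product over L2-normalized vectors is equivalent to cosine similarity
index = faiss.IndexFlatIP(1536)

docs = ['search_document: The quick brown fox jumps over the lazy dog']
doc_vecs = model.encode(docs, normalize_embeddings=True).astype(np.float32)
index.add(doc_vecs)

query_vecs = model.encode(['search_query: What animal is mentioned?'],
                          normalize_embeddings=True).astype(np.float32)
scores, ids = index.search(query_vecs, 1)  # top-1 nearest document
print(scores, ids)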

Citation

If you use this model in your research, please cite the original Nomic Embed work:

@misc{nussbaum2024nomic,
      title={Nomic Embed: Training a Reproducible Long Context Text Embedder},
      author={Zach Nussbaum and John X. Morris and Brandon Duderstadt and Andriy Mulyar},
      year={2024},
      eprint={2402.01613},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

License

This model is licensed under Apache 2.0.
