n24q02m/Qwen3-Reranker-0.6B-ONNX
ONNX-optimized version of Qwen/Qwen3-Reranker-0.6B for use with qwen3-embed and fastembed (PR #605).
Available Variants
| Variant | File | Size | Description |
|---|---|---|---|
| INT8 | onnx/model_quantized.onnx | 573 MB | Dynamic INT8 quantization (default) |
| Q4F16 | onnx/model_q4f16.onnx | 517 MB | INT4 weights + FP16 activations |
| YesNo INT8 | onnx/model_yesno_quantized.onnx | 598 MB | Optimized 2-dim output (yes/no only), ~10x less RAM |
| YesNo Q4F16 | onnx/model_yesno_q4f16.onnx | 514 MB | YesNo + INT4 weights + FP16 activations |
YesNo Variant
The YesNo variant outputs only (batch, 2) logits [no, yes] instead of the full
(batch, seq_len, 151669) vocabulary logits. This reduces runtime memory from ~12GB
to <1GB while producing identical relevance scores.
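The size of the output tensor alone explains most of the gap. The following is a rough back-of-the-envelope sketch, not a measurement: the batch size and sequence length are hypothetical, float32 outputs are assumed, and `logits_bytes` is an illustrative helper rather than part of any library.

```python
VOCAB_SIZE = 151669  # vocabulary dimension of the full logits, as quoted above


def logits_bytes(batch, seq_len, width, bytes_per_value=4):
    """Size in bytes of a dense (batch, seq_len, width) output tensor.

    bytes_per_value=4 assumes float32 outputs (an assumption, not stated
    in this model card).
    """
    return batch * seq_len * width * bytes_per_value


# Hypothetical workload: 8 query-document pairs of 2048 tokens each.
full = logits_bytes(batch=8, seq_len=2048, width=VOCAB_SIZE)
# The YesNo output has no seq_len axis, so seq_len=1 models its (batch, 2) shape.
yesno = logits_bytes(batch=8, seq_len=1, width=2)

print(f"full vocabulary logits: {full / 1e9:.1f} GB")
print(f"yes/no logits: {yesno} bytes")
```

Under these assumed sizes the full-vocabulary output is roughly 10 GB before counting weights or intermediate buffers, while the yes/no output is negligible. Actual memory use depends on the real batch size and sequence lengths.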
When to use: Production deployments where RAM is constrained. The default INT8 and Q4F16 variants output full vocabulary logits, which require ~12GB of RAM at inference time.
YesNo Q4F16 combines both optimizations: reduced output dimensions AND smaller weights.
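To illustrate how a (batch, 2) output can be turned into relevance scores, here is a minimal sketch. It assumes the score is the probability of "yes" under a softmax over the two logits, which is how the upstream Qwen3-Reranker example computes it; whether qwen3-embed does exactly this internally is not confirmed here. `yes_probability` and `score_batch` are hypothetical helper names.

```python
import math


def yes_probability(no_logit, yes_logit):
    """Softmax over [no, yes], returning P(yes).

    Subtracting the maximum first keeps exp() from overflowing.
    """
    m = max(no_logit, yes_logit)
    e_no = math.exp(no_logit - m)
    e_yes = math.exp(yes_logit - m)
    return e_yes / (e_no + e_yes)


def score_batch(logits):
    """Map rows ordered [no, yes] to one relevance score per row."""
    return [yes_probability(no, yes) for no, yes in logits]


# Toy logits, not real model output.
print(score_batch([[0.0, 0.0], [-5.0, 5.0]]))
```

Because the softmax only involves these two logits, discarding the rest of the vocabulary does not change the result, which is consistent with the identical scores claimed above.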
Usage
qwen3-embed
pip install qwen3-embed
from qwen3_embed import TextCrossEncoder
# INT8 (default)
reranker = TextCrossEncoder("n24q02m/Qwen3-Reranker-0.6B-ONNX")
scores = list(reranker.rerank("What is AI?", ["AI is...", "Pizza is..."]))
# Custom instruction
scores = list(reranker.rerank(
    "What is AI?",
    ["doc1", "doc2"],
    instruction="Judge document relevance for code search.",
))
# Q4F16 (smaller, slightly less accurate)
reranker_q4 = TextCrossEncoder("n24q02m/Qwen3-Reranker-0.6B-ONNX-Q4F16")
# YesNo INT8 (optimized output, ~10x less RAM)
reranker_yesno = TextCrossEncoder("n24q02m/Qwen3-Reranker-0.6B-ONNX-YesNo")
# YesNo Q4F16 (smallest + optimized output)
reranker_yesno_q4 = TextCrossEncoder("n24q02m/Qwen3-Reranker-0.6B-ONNX-YesNo-Q4F16")
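The scores are only useful once they are matched back to the documents. The sketch below assumes `rerank` yields one score per document in input order, as fastembed's `TextCrossEncoder` does; `rank_documents` is a hypothetical helper and the score values are placeholders, not real model output.

```python
def rank_documents(documents, scores):
    """Pair each document with its score and sort by score, highest first."""
    if len(documents) != len(scores):
        raise ValueError("expected one score per document")
    return sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)


documents = ["AI is...", "Pizza is..."]
scores = [0.98, 0.02]  # placeholder values standing in for list(reranker.rerank(...))

for doc, score in rank_documents(documents, scores):
    print(f"{score:.2f}  {doc}")
```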
fastembed
pip install fastembed
from fastembed import TextCrossEncoder
# INT8 (default)
reranker = TextCrossEncoder("Qwen/Qwen3-Reranker-0.6B")
scores = list(reranker.rerank("What is AI?", ["AI is...", "Pizza is..."]))
# Q4F16
reranker_q4 = TextCrossEncoder("Qwen/Qwen3-Reranker-0.6B-Q4F16")
Note: fastembed support requires PR #605. Alternatively, install from the fork:
pip install git+https://github.com/n24q02m/fastembed.git@feat/qwen3-support
Conversion Details
- Source: Qwen/Qwen3-Reranker-0.6B
- ONNX opset: 21 (INT8/Q4F16), 17 (YesNo INT8/Q4F16)
- INT8: onnxruntime.quantization.quantize_dynamic(QInt8)
- Q4F16: MatMulNBitsQuantizer(block_size=128, symmetric) + FP16 cast
- YesNo: Custom _YesNoWrapper extracts only yes/no logits from lm_head (TOKEN_NO_ID=2152, TOKEN_YES_ID=9693), then quantized
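The selection step of the YesNo variant can be pictured as plain indexing. The actual _YesNoWrapper is not shown in this card and presumably does the selection inside the exported graph; the sketch below only illustrates the idea on a Python list, with `select_yes_no` as a hypothetical name and toy values in place of real logits.

```python
TOKEN_NO_ID = 2152   # token id for "no", as listed above
TOKEN_YES_ID = 9693  # token id for "yes", as listed above
VOCAB_SIZE = 151669


def select_yes_no(last_token_logits):
    """Reduce one row of vocabulary logits to [no, yes]."""
    return [last_token_logits[TOKEN_NO_ID], last_token_logits[TOKEN_YES_ID]]


# Toy row of vocabulary logits for the final position, not real model output.
row = [0.0] * VOCAB_SIZE
row[TOKEN_NO_ID] = -1.5
row[TOKEN_YES_ID] = 2.0

print(select_yes_no(row))
```

Only two of the 151669 entries are ever read, so the remaining logits do not need to be computed or returned.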
Related
- GGUF variants: n24q02m/Qwen3-Reranker-0.6B-GGUF