Hermes-4.3-36B-FP8

Model Description

This is an FP8 quantized version of NousResearch/Hermes-4.3-36B, created using llmcompressor (Neural Magic).

Key Benefits:

  • ~47% smaller model size (36GB vs 68GB)
  • Native FP8 inference on Ada Lovelace, Hopper, and Blackwell GPUs
  • Single-GPU deployment on 48GB+ cards (e.g. RTX 6000 Ada); fits on an A100-40GB at short context lengths
  • Native vLLM and SGLang support with tool calling
  • Minimal quality loss with FP8 dynamic quantization

Key Features

Hermes 4.3 36B is a state-of-the-art instruction-following model with:

  • Hybrid Reasoning: <think>...</think> blocks for deliberative reasoning when needed (see the parsing sketch after this list)
  • Native Tool Calling: Hermes-format tool use with <tool_call> tags
  • SOTA RefusalBench: 74.6% (best among non-abliterated models)
  • Strong Math/Code: MATH-500 93.8%, competitive AIME performance
  • 524K Context: Extended context window for long documents
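
The reasoning block arrives as plain text in the completion, so it can be separated client-side. A minimal sketch (the helper name is illustrative, not part of any API):

import re

def split_reasoning(text: str):
    """Split an optional <think>...</think> block from the final answer."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return None, text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer

reasoning, answer = split_reasoning("<think>2+2=4</think>The answer is 4.")
print(answer)  # The answer is 4.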

Quantization Details

| Property | Value |
|---|---|
| Quantization Method | FP8 Dynamic (W8A8) |
| Weights Precision | FP8 E4M3 (8-bit) |
| Activations Precision | FP8 E4M3 (8-bit, dynamic) |
| Ignored Layers | lm_head (kept in BF16) |
| Quantization Tool | llmcompressor 0.12.2 |
| Original Model Size | ~68GB |
| Quantized Model Size | ~36GB |

Quantization Recipe

default_stage:
  default_modifiers:
    QuantizationModifier:
      targets: [Linear]
      ignore: [lm_head]
      scheme: FP8_DYNAMIC

Quick Start with Docker

The easiest way to run this model. No setup required beyond Docker with the NVIDIA container runtime.

Docker Compose (Recommended)

# Download docker-compose.yml
wget https://huggingface.co/Doradus/Hermes-4.3-36B-FP8/raw/main/docker/docker-compose.yml

# Run on single GPU (48GB+ recommended)
docker compose up

# Or specify GPU
GPU_ID=0 docker compose up

Docker Run

# Single GPU (48GB+ VRAM recommended)
docker run --gpus '"device=0"' -p 8000:8000 \
  -v hf_cache:/root/.cache/huggingface \
  --shm-size=16g \
  vllm/vllm-openai:v0.12.0 \
  --model Doradus/Hermes-4.3-36B-FP8 \
  --tensor-parallel-size 1 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.90 \
  --trust-remote-code \
  --tool-call-parser hermes \
  --enable-auto-tool-choice

Test the API

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Doradus/Hermes-4.3-36B-FP8",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'
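
The same check from Python, using the OpenAI client against the server started above (the model name matches the --model path):

from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; any non-empty api_key works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

resp = client.chat.completions.create(
    model="Doradus/Hermes-4.3-36B-FP8",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100,
)
print(resp.choices[0].message.content)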

Usage

vLLM (Recommended)

python -m vllm.entrypoints.openai.api_server \
  --model Doradus/Hermes-4.3-36B-FP8 \
  --tensor-parallel-size 1 \
  --max-model-len 16384 \
  --trust-remote-code \
  --tool-call-parser hermes \
  --enable-auto-tool-choice

SGLang

python -m sglang.launch_server \
  --model-path Doradus/Hermes-4.3-36B-FP8 \
  --host 0.0.0.0 \
  --port 8000 \
  --tp 1

Tool Calling Example

import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather in a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"}
            },
            "required": ["city"]
        }
    }
}]

response = client.chat.completions.create(
    model="hermes-43-36b-fp8",
    messages=[{"role": "user", "content": "What's the weather like in San Francisco?"}],
    tools=tools,
    tool_choice="auto"
)

print(response.choices[0].message.tool_calls)
# [ToolCall(function=Function(name='get_weather', arguments='{"city": "San Francisco"}'))]
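
To complete the round trip, run the tool yourself and return its output as a tool message. A sketch continuing the example above (the get_weather body is a stand-in, not a real API):

import json

def get_weather(city: str) -> str:
    # Stand-in implementation; replace with a real weather lookup.
    return f"Sunny, 18°C in {city}"

call = response.choices[0].message.tool_calls[0]
result = get_weather(**json.loads(call.function.arguments))

followup = client.chat.completions.create(
    model="Doradus/Hermes-4.3-36B-FP8",
    messages=[
        {"role": "user", "content": "What's the weather like in San Francisco?"},
        response.choices[0].message,  # assistant turn containing the tool call
        {"role": "tool", "tool_call_id": call.id, "content": result},
    ],
    tools=tools,
)
print(followup.choices[0].message.content)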

Architecture Details

This is a dense transformer model based on the ByteDance Seed-OSS architecture:

| Property | Value |
|---|---|
| Total Parameters | 36B |
| Hidden Size | 5120 |
| Attention Heads | 80 |
| KV Heads (GQA) | 8 |
| Layers | 64 |
| Intermediate Size | 27648 |
| Max Context | 524,288 tokens |
| Vocabulary | 155,136 tokens |
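
As a sanity check, the 36B headline figure falls out of the table above, assuming a gated (SwiGLU-style) MLP, head_dim 128, and untied input/output embeddings (these assumptions are ours, not read from the config):

# Back-of-envelope parameter count from the architecture table.
# Norm weights and biases are negligible at this scale and are ignored.
hidden, layers, inter, vocab = 5120, 64, 27648, 155136
heads, kv_heads, head_dim = 80, 8, 128

attn = hidden * heads * head_dim           # q_proj
attn += 2 * hidden * kv_heads * head_dim   # k_proj + v_proj (GQA)
attn += heads * head_dim * hidden          # o_proj
mlp = 3 * hidden * inter                   # gate, up, down projections
embeddings = 2 * vocab * hidden            # input embeddings + lm_head

total = layers * (attn + mlp) + embeddings
print(f"{total / 1e9:.1f}B")               # ~36.1B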

Hardware Requirements

VRAM Analysis

Model weights: 36GB (vs 68GB BF16 original)

| Context Length | KV Cache (FP16) | Total VRAM | Fits Single GPU? |
|---|---|---|---|
| 4K tokens | ~0.5 GB | ~37 GB | A100-40GB (tight) |
| 8K tokens | ~1.0 GB | ~38 GB | A100-40GB |
| 16K tokens | ~2.0 GB | ~39 GB | RTX 6000 Ada (48GB) |
| 32K tokens | ~4.0 GB | ~41 GB | A100-80GB |
| 64K tokens | ~8.0 GB | ~45 GB | A100-80GB / H100 |

KV cache sizes assume GQA with 8 KV heads, head_dim 128, 64 layers, and an FP16 KV cache.

Recommended Configurations

| GPU Setup | Max Context | Performance | Notes |
|---|---|---|---|
| 1x RTX 4090 (24GB) | OOM | N/A | Model too large |
| 1x RTX 5090 (32GB) | ~2K tokens | ~5-10 tok/s | Requires --enforce-eager |
| 1x RTX 6000 Ada (48GB) | ~16K tokens | ~20 tok/s | Recommended single GPU |
| 1x A100-40GB | ~8K tokens | ~25 tok/s | Single GPU possible |
| 1x A100-80GB | ~64K tokens | ~40 tok/s | Recommended for production |
| 1x H100-80GB | ~128K tokens | ~60 tok/s | Full performance |
| 2x RTX 4090 (TP=2) | ~16K tokens | ~40 tok/s | Consumer multi-GPU |

Note: Hardware-accelerated FP8 inference requires CUDA compute capability 8.9 (Ada Lovelace) or 9.0+ (Hopper, Blackwell).

Quality & Performance

Original Model Benchmarks (from NousResearch)

| Benchmark | Hermes 4.3 36B | Description |
|---|---|---|
| IFEval | 77.9% | Instruction following |
| MMLU-Pro | 80.7% | Multi-task understanding |
| MMLU | 87.7% | Multi-task language understanding |
| RefusalBench | 74.6% | Safety/refusal (SOTA) |
| MATH-500 | 93.8% | Mathematical reasoning |
| AIME 24 | 71.9% | Competition math |
| GPQA Diamond | 65.5% | Graduate-level QA |
| BBH | 86.4% | Big-Bench Hard |
| DROP | 83.5% | Reading comprehension |

FP8 Quantized Benchmarks (lm-evaluation-harness)

| Benchmark | BF16 Original | FP8 Quantized | Degradation |
|---|---|---|---|
| IFEval (prompt-strict) | 77.9% | 72.46% | -5.44 pts |
| IFEval (inst-strict) | - | 80.10% | - |
| IFEval (prompt-loose) | - | 77.08% | - |
| IFEval (inst-loose) | - | 83.81% | - |
| GSM8K (5-shot strict) | - | 87.04% | - |

Benchmarked 2025-12-07 on an RTX PRO 6000 Blackwell (96GB, PCIe Gen5 x16) using lm-evaluation-harness with vLLM 0.12.0 and --apply_chat_template enabled.

Measured Throughput

Tested on an RTX 6000 Ada (48GB), single GPU, 16K context:

| Test Type | Tokens Generated | Time | Throughput |
|---|---|---|---|
| Short reasoning | 100 | 4.61s | 21.7 tok/s |
| Code generation | 256 | 11.94s | 21.4 tok/s |
| Long explanation | 512 | 24.42s | 21.0 tok/s |
| Overall | 868 | 40.97s | 21.2 tok/s |

Tested 2025-12-06 on Doradus infrastructure with vLLM 0.12.0

Note: The measured throughput is limited by the test host's PCIe x4 link; expect roughly ~80 tok/s with a dedicated PCIe x16 slot.
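
These figures are straightforward to reproduce against any running endpoint; a sketch using the OpenAI client, with the token count taken from the server-reported usage:

import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

start = time.perf_counter()
resp = client.chat.completions.create(
    model="Doradus/Hermes-4.3-36B-FP8",
    messages=[{"role": "user", "content": "Explain FP8 quantization in detail."}],
    max_tokens=512,
)
elapsed = time.perf_counter() - start

generated = resp.usage.completion_tokens
print(f"{generated} tokens in {elapsed:.2f}s -> {generated / elapsed:.1f} tok/s")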

Reproduction

To reproduce this quantization:

#!/usr/bin/env python3
"""
Quantize Hermes-4.3-36B to FP8 using llmcompressor (Neural Magic)
Dynamic quantization - no calibration data needed, fast conversion
Output is vLLM-compatible FP8
"""

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_PATH = "NousResearch/Hermes-4.3-36B"
OUTPUT_PATH = "./Hermes-4.3-36B-FP8"

recipe = QuantizationModifier(
    targets="Linear",        # quantize all Linear layers
    scheme="FP8_DYNAMIC",    # static FP8 weights, dynamic per-token activations
    ignore=["lm_head"],      # keep the output head in BF16
)

oneshot(
    model=MODEL_PATH,
    output_dir=OUTPUT_PATH,
    recipe=recipe,
    num_calibration_samples=0,  # dynamic scheme needs no calibration data
    save_compressed=True,       # write compressed-tensors checkpoint for vLLM
)

Requirements:

pip install llmcompressor torch transformers accelerate
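
Once the script finishes, the checkpoint's config.json should contain a compressed-tensors quantization_config; a quick sanity check (path matches OUTPUT_PATH above):

import json
from pathlib import Path

# Print the quantization metadata llmcompressor wrote into the checkpoint.
cfg = json.loads(Path("./Hermes-4.3-36B-FP8/config.json").read_text())
print(json.dumps(cfg.get("quantization_config", {}), indent=2))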

Original Model

This quantization is based on NousResearch/Hermes-4.3-36B.

Hermes 4.3 is NousResearch's first model trained in a decentralized manner over the internet using Psyche. Key features:

  • Hybrid reasoning mode with <think> blocks
  • Native Hermes-format tool calling
  • SOTA RefusalBench performance (74.6%)
  • Strong math, code, and instruction following
  • 524K context window

For full details, see the Hermes 4 Technical Report (arXiv:2508.18255).

License

This model inherits the MIT License from the original Hermes 4.3 model.

Citation

If you use this model, please cite the original Hermes 4 paper:

@article{teknium2025hermes4,
  title={Hermes 4 Technical Report},
  author={Teknium, Ryan and Jin, Roger and Suphavadeeprasit, Jai and Mahan, Dakota and Quesnelle, Jeffrey and Li, Joe and Guang, Chen and Sands, Shannon and Malhotra, Karan},
  journal={arXiv preprint arXiv:2508.18255},
  year={2025}
}
