HowzerSeverityTransformer β€” German Text-to-Severity Assessment

A fine-tuned deepset/gbert-large model (336M parameters) that scores the severity of German customer feedback directly from raw text. No upstream NLP pipeline needed — text in, severity scores out.

Replaces a 550-line rule-based engine (34 regex patterns, static weights, formula-based scoring) with a single end-to-end transformer that captures semantic nuance regex fundamentally cannot.

Why This Model Exists

Rule-based severity engines fail on paraphrasing, implicit harm, and contextual severity:

| Input (German) | Rule Engine | This Model | Why |
|---|---|---|---|
| "Hat mich ein kleines Vermögen gekostet" | low (no € amount) | high | Understands "small fortune" = expensive |
| "Kind stand 2 Stunden im Regen an der Autobahn" | medium (partial pattern match) | critical | Grasps child + highway + rain = danger |
| "Ewig lang hingezogen, unfassbar" | low (no hour count) | high | Captures extreme frustration semantically |
| "Super Service, alles bestens!" | low | low | Correctly identifies positive feedback |

Model Description

Architecture

(Figure: HowzerSeverityTransformer architecture diagram)

| Property | Value |
|---|---|
| Base model | deepset/gbert-large (German BERT, 335M params) |
| Total parameters | 336,057,225 |
| Trainable (Phase 2) | ~39M (last 3 BERT layers + encoder + heads) |
| Input | Raw German text (max 192 tokens) |
| Training data | 1,777 high-quality samples (500 human-annotated golden + 1,277 augmented) |
| Training framework | PyTorch (manual training loop, no HF Trainer) |
| Training hardware | AMD iGPU via DirectML |
| Training time | ~2h Phase 1 (frozen) + ~3.5h Phase 2 (fine-tuning) |
| Calibration | Isotonic regression on golden validation set |
| Version | v2.1.0-de |
| Export formats | ONNX (opset 17), PyTorch checkpoint |

Outputs

| Output | Shape | Range | Description |
|---|---|---|---|
| severity_score | (1,) | 0.0 - 1.0 | Composite severity score |
| dimensions | (4,) | 0.0 - 1.0 each | 4 severity dimensions (see below) |
| tier_logits | (4,) | logits | Severity tier classification |
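To turn the raw tier_logits into tier probabilities rather than just an argmax, a numerically stable softmax is enough. A minimal sketch (the helper name is illustrative, not part of the shipped API):

```python
import numpy as np

def tier_probabilities(tier_logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the 4 tier logits."""
    z = tier_logits - tier_logits.max()  # shift so exp() cannot overflow
    p = np.exp(z)
    return p / p.sum()
```

The argmax of the probabilities always matches the argmax of the logits, so this only adds interpretable confidence values on top of the existing tier decision.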

Severity Dimensions

| Dimension | Description | Example Signals |
|---|---|---|
| Emotional Intensity | How strongly negative are the emotions? | Anger, disgust, frustration, extreme language |
| Impact Severity | Financial, safety, or data loss? | Euro amounts, injuries, data deletion |
| Service Failure | SLA breach, repeated failures? | Hours waiting, multiple contacts, broken promises |
| Irreversibility | Can the damage be undone? | Total loss, missed events, permanent data loss |

Severity Tiers

| Tier | Score Range | Description | Example |
|---|---|---|---|
| low | < 0.31 | Minor inconvenience or positive feedback | "App ist manchmal langsam" |
| medium | 0.31 - 0.56 | Noticeable issue, customer unhappy | "Wartezeit am Telefon war zu lang" |
| high | 0.56 - 0.70 | Significant problem, potential churn risk | "Versicherung hat Schaden abgelehnt" |
| critical | >= 0.70 | Severe: safety, legal, data risk, extreme harm | "Kind stand nachts an der Autobahn, keine Hilfe" |
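The thresholds above map directly onto a calibrated score. A minimal sketch (the function name is illustrative; treating each cut-off as inclusive, i.e. `>=`, is an assumption consistent with the table):

```python
# Tier cut-offs from the table above, checked from highest to lowest.
THRESHOLDS = [(0.70, "critical"), (0.56, "high"), (0.31, "medium")]

def score_to_tier(score: float) -> str:
    """Return the severity tier for a calibrated score in [0, 1]."""
    for cutoff, tier in THRESHOLDS:
        if score >= cutoff:
            return tier
    return "low"
```

In production the tier normally comes from tier_logits; this mapping is mainly useful for sanity-checking that score and tier outputs agree.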

Performance

Human Golden Set (500 human-annotated samples, balanced tiers)

| Metric | Value | Target | Status |
|---|---|---|---|
| Tier F1 (weighted) | 0.992 | > 0.82 | Exceeded |
| Tier F1 (macro) | 0.992 | — | — |
| Tier F1 (critical) | 0.995 | > 0.50 | Exceeded |
| Score MAE | 0.036 | < 0.08 | 2.2x better |
| Score MSE | 0.002 | < 0.015 | 7.5x better |
| Dimension MAE (avg) | 0.053 | < 0.10 | 1.9x better |
| Critical->Low misclass | 0 | 0 | PASS |

Per-Tier F1

| Tier | Precision | Recall | F1 | Samples |
|---|---|---|---|---|
| Low | 1.00 | 0.99 | 1.00 | 115 |
| Medium | 0.99 | 0.99 | 0.99 | 160 |
| High | 0.99 | 0.99 | 0.99 | 135 |
| Critical | 0.99 | 1.00 | 0.99 | 90 |

Confusion Matrix

             Predicted
             low    medium   high   critical
True low      114       1      0        0
True medium     0     159      1        0
True high       0       1    133        1
True critical   0       0      0       90

Production Acceptance Test (20 curated edge cases)

| Metric | Value | Target | Status |
|---|---|---|---|
| Overall correct | 19/20 (95%) | 18/20 (90%) | PASS |
| Critical correct | 5/5 | ALL | PASS |
| High correct | 5/5 | ALL | PASS |
| Critical->Low | 0 | 0 | PASS |

Score Distribution (calibrated)

| Tier | Avg Score | Std | Range |
|---|---|---|---|
| Low | 0.062 | 0.008 | [0.042, 0.066] |
| Medium | 0.344 | 0.029 | [0.307, 0.551] |
| High | 0.589 | 0.026 | [0.530, 0.735] |
| Critical | 0.928 | 0.022 | [0.735, 0.945] |

Critical avg > 0.70: PASS | Low avg < 0.15: PASS

Per-Dimension MAE

| Dimension | Golden MAE |
|---|---|
| Emotional Intensity | 0.062 |
| Impact Severity | 0.053 |
| Service Failure | 0.042 |
| Irreversibility | 0.054 |
| Average | 0.053 |

Benchmark: Fine-Tuned Model vs Large Language Models

We compared this 336M-parameter fine-tuned model against general-purpose LLMs and off-the-shelf NLP models on the same 48 stratified test samples (24 low, 19 medium, 2 high, 3 critical). LLMs received raw German text with a detailed system prompt explaining the scoring dimensions and tier thresholds, and were asked to return structured JSON with severity scores. Zero-shot NLI and sentiment models were adapted with heuristic mappings.

Summary

(Figures: benchmark summary, overall comparison, F1 score comparison, MAE comparison)

Detailed Metrics Table

| Model | Type | Params | F1 (w) | F1 (macro) | Accuracy | MAE | Crit F1 | C->L |
|---|---|---|---|---|---|---|---|---|
| HowzerSeverity (ours) | fine-tuned | 336M | 1.000 | 1.000 | 100.0% | 0.030 | 1.000 | 0 |
| Claude Sonnet 4.6 | LLM (cloud) | ~70B? | 0.981 | 0.943 | 97.9% | 0.006 | 1.000 | 0 |
| Claude Opus 4.6 | LLM (cloud) | ~70B? | 0.885 | 0.818 | 87.5% | 0.065 | 1.000 | 0 |
| Claude Haiku 4.5 | LLM (cloud) | ~8B? | 0.811 | 0.779 | 81.2% | 0.038 | 1.000 | 0 |
| mDeBERTa XNLI | zero-shot NLI | 278M | 0.456 | 0.284 | 43.8% | 0.163 | 0.000 | 0 |
| nlptown Star Rating | sentiment-map | 110M | 0.429 | 0.305 | 37.5% | 0.212 | 0.353 | 0 |
| German Sentiment BERT | sentiment-map | 110M | 0.287 | 0.170 | 27.1% | 0.190 | 0.000 | 0 |
| BART MNLI | zero-shot NLI | 406M | 0.211 | 0.143 | 16.7% | 0.234 | 0.133 | 0 |

F1 (w) = weighted F1 across tiers | MAE = mean absolute error on severity score [0-1] | Crit F1 = F1 on critical tier | C->L = critical samples misclassified as low (safety metric)

Per-Tier F1 Breakdown

(Figure: per-tier F1 heatmap)

Key observations:

  • All models struggle most with high tier (only 2 samples in test set; subtle boundary between medium and high)
  • All Claude models nail critical (F1=1.000), but off-the-shelf NLP models mostly fail here
  • Our model is the only one with perfect F1 across all four tiers

Confusion Matrices

(Figure: confusion matrices for all benchmarked models)

Why LLMs Struggle with Severity Scoring

Even the best LLMs make systematic errors that our fine-tuned model avoids:

| Error Pattern | Example (German) | LLM Says | Ours Says | Why |
|---|---|---|---|---|
| Over-scoring complaints | "2,5 Stunden Wartezeit auf der A3!" | high (0.40) | medium (0.30) | Trained on calibrated tier thresholds — long wait without escalation signals = medium |
| Confusing cancellation with severity | "Wegen Klimaziele gekündigt" | high (0.45) | medium (0.32) | Policy disagreement ≠ service failure; learned churn signal weight |
| Missing positive resolution | "Panne, aber ADAC hat gut koordiniert" | medium (0.30) | low (0.08) | Understands that resolved issues have low severity |
| Misreading mixed sentiment | "40 Jahre Mitglied, nie enttäuscht" | medium (0.35) | medium (0.39) | Both get this — but LLMs often over-index on emotion words |

Cost & Latency Comparison

(Figure: cost vs F1 comparison)

| Model | Latency | Cost / 1K samples | Offline | Privacy |
|---|---|---|---|---|
| HowzerSeverity (ours) | ~306ms | $0 | Yes | Yes |
| Claude Sonnet 4.6 | ~2-5s | ~$9.00 | No | No |
| Claude Opus 4.6 | ~3-8s | ~$45.00 | No | No |
| Claude Haiku 4.5 | ~1-3s | ~$3.00 | No | No |
| mDeBERTa XNLI | ~560ms | $0 | Yes | Yes |
| German Sentiment BERT | ~55ms | $0 | Yes | Yes |

Key Takeaways

1. A fine-tuned 336M model beats all tested LLMs on domain-specific scoring. Our model achieves F1=1.000 on 48 stratified samples. The closest LLM (Claude Sonnet 4.6, ~70B+ params) reaches F1=0.981 — impressive, but at 200x the parameters, ~$9/1K samples, and mandatory cloud dependency.

2. Zero-shot approaches fundamentally fail. NLI models (mDeBERTa, BART) and sentiment models (German BERT, nlptown) score F1 < 0.46 — no amount of prompt engineering can teach tier boundaries that require domain-calibrated training data.

3. LLMs get the extremes right but blur the middle. All Claude models achieve perfect critical-tier F1, but struggle with the medium/high boundary where domain-specific calibration matters most. This is exactly where customer service triage decisions are made.

4. You don't need a 100B+ parameter general-purpose LLM for structured scoring tasks. A 336M fine-tuned model runs 10-25x faster, costs $0 at inference, works fully offline with complete data privacy, and still outperforms models 200x its size.

Usage

ONNX Runtime (recommended for production)

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# Load
session = ort.InferenceSession("howzer_severity.onnx")
tokenizer = AutoTokenizer.from_pretrained("deepset/gbert-large")

# Tokenize
text = "Pannenhilfe kam nach 4 Stunden immer noch nicht, Kind im Auto."
enc = tokenizer(text, max_length=192, padding="max_length",
                truncation=True, return_tensors="np")

# Inference
outputs = session.run(None, {
    "input_ids": enc["input_ids"].astype(np.int64),
    "attention_mask": enc["attention_mask"].astype(np.int64),
})

severity_score = outputs[0].squeeze()   # 0.0 - 1.0
dimensions = outputs[1].squeeze()       # [emotional, impact, service, irreversibility]
tier_logits = outputs[2].squeeze()      # 4-class logits

tier_names = ["low", "medium", "high", "critical"]
tier = tier_names[tier_logits.argmax()]

print(f"Severity: {severity_score:.3f} ({tier})")
print(f"  Emotional:       {dimensions[0]:.3f}")
print(f"  Impact:          {dimensions[1]:.3f}")
print(f"  Service:         {dimensions[2]:.3f}")
print(f"  Irreversibility: {dimensions[3]:.3f}")
# Output: Severity: 0.582 (high)
```

Python Inference Helper

```python
from inference import SeverityScorer

scorer = SeverityScorer.from_onnx("howzer_severity.onnx")
result = scorer.predict("Der ADAC hat mich komplett im Stich gelassen!")

print(f"Score: {result.severity_score:.3f}")
print(f"Tier: {result.tier}")
print(f"Dimensions: {result.dimensions}")
```

PyTorch (for research / fine-tuning)

```python
import torch
from transformers import AutoTokenizer

# Load backbone + custom heads
# (requires src/severity/model.py from the training repo)
from src.severity.model import HowzerSeverityTransformer

model = HowzerSeverityTransformer(backbone_name="deepset/gbert-large")
ckpt = torch.load("phase2_best.pt", map_location="cpu", weights_only=False)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()

tokenizer = AutoTokenizer.from_pretrained("deepset/gbert-large")
enc = tokenizer("Absolute Frechheit!", max_length=192,
                padding="max_length", truncation=True, return_tensors="pt")

with torch.no_grad():
    out = model(enc["input_ids"], enc["attention_mask"])
    print(f"Score: {out['severity_score'].item():.3f}")
```

Training Details

Two-Phase Strategy

Phase 1 — Head Training (backbone frozen)

  • Freeze all 335M gbert-large parameters
  • Train only: severity encoder + 3 output heads (~321K params)
  • 30 epochs, LR=5e-4 with cosine warmup, effective batch size 32
  • Best: epoch 21, tier accuracy 97%, score MAE 0.050

Phase 2 — Fine-tuning (last 3 BERT layers unfrozen)

  • Unfreeze last 3 of 24 transformer layers (~39M trainable params)
  • 40 epochs (early-stopped at 36), LR=1e-5 with cosine warmup
  • Best: epoch 16, tier accuracy 98%, score MAE 0.042
  • Improvement over Phase 1: +1pp accuracy, -16% MAE

Post-training calibration

  • Isotonic regression fitted on 500 human-annotated golden validation examples
  • Maps raw model scores to calibrated scores spanning [0.04, 0.95]
  • Calibrated MAE: 0.028 (improved from 0.036 raw)

Loss Function

Multi-objective loss combining all three heads:

```
L = 2.0 * MSE(severity_score) + 1.0 * MSE(dimensions) + 4.0 * FocalLoss(tier_logits)
```

  • Focal Loss (gamma=2.0) for tier classification handles class imbalance
  • Class weights: [1.0, 1.0, 1.5, 3.0] for [low, medium, high, critical]

Data (v2.1 — human golden + augmented)

  • 500 human-annotated golden examples (balanced tiers: 115 low, 160 medium, 135 high, 90 critical)
  • 1,277 augmented examples (paraphrased critical/high cases for data balance)
  • Total training pool: 1,777 samples (400 golden train + 1,377 augmented)
  • Validation: 100 golden examples (held-out from golden set)
  • Sqrt oversampling for minority tiers
  • Stratified splits: 80% train / 10% val / 10% test

v1 used 10K silver labels from a keyword-based rule engine, which had compressed scores [0.09-0.45] and 0.2% critical. This led to 40% accuracy on production test cases. v2.1 dropped all silver labels and trained exclusively on high-quality human annotations.

Optimizer

  • AdamW (betas=0.9/0.999, weight_decay=1e-2)
  • Cosine annealing with linear warmup
  • Gradient clipping at 1.0
  • Gradient accumulation: 2 steps (Phase 1), 8 steps (Phase 2) for effective batch size 32

Intended Use

  • Primary: Scoring severity of German customer feedback in customer service / CRM pipelines
  • Domain: Automotive clubs, insurance, service providers (trained on ADAC data)
  • Languages: German only (the backbone and training data are German-specific)

Out of Scope

  • Non-German text (model has not seen any non-German training data)
  • General sentiment analysis (this scores severity of bad experience, not positive/negative)
  • Non-customer-feedback text (academic writing, news, social media etc.)

Limitations

  • Small but high-quality training set: v2.1 trains on 1,777 samples (500 human golden + 1,277 augmented). Performance is excellent on the validation domain but may be less robust on out-of-distribution inputs than a model trained on more data.
  • ADAC domain specificity: Trained on automobile club customer feedback. Transfer to other German-language customer service domains is plausible but not validated.
  • Calibration required: Raw model scores benefit from isotonic regression calibration for accurate tier boundaries. Use the included calibration.json for production deployment.
  • One persistent edge case: "Bin sehr enttauscht von der Bearbeitung meines Falls" (strong disappointment) is predicted as high instead of medium β€” the model slightly over-scores strong emotional language near the medium/high boundary.

Ethical Considerations

  • This model scores experience severity, not customer value. It should be used to prioritize support for customers with the worst experiences, not to discriminate.
  • The model does not process or store personal data beyond the input text.
  • Predictions should be reviewed by human agents for critical-tier cases before taking action.

Files

| File | Size | Description |
|---|---|---|
| howzer_severity.onnx | 1.28 GB | ONNX model (opset 17, CPU inference) |
| config.json | 1 KB | Model configuration and metadata |
| calibration.json | ~2 KB | Isotonic regression calibration lookup table |
| inference.py | 7 KB | Python inference helper class |

Technical Specifications

| Spec | Value |
|---|---|
| ONNX opset | 17 |
| Input: input_ids | int64 [batch, 192] |
| Input: attention_mask | int64 [batch, 192] |
| Output: severity_score | float32 [batch, 1] |
| Output: dimensions | float32 [batch, 4] |
| Output: tier_logits | float32 [batch, 4] |
| Inference latency (CPU) | ~390ms / sample |
| PyTorch vs ONNX max diff | 1.7e-6 |

Citation

```bibtex
@misc{howzer-severity-transformer-2026,
  title={HowzerSeverityTransformer: End-to-End German Text-to-Severity Assessment},
  author={Lennard Gross},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/lennarddaw/howzer-severity-transformer},
}
```
