Qwen2.5-3B-Abliterated

Refusal-abliterated variant of Qwen/Qwen2.5-3B-Instruct produced using OBLITERATUS.

Method

  • Technique: Multi-direction refusal ablation (advanced method)
  • Directions: 4 refusal directions extracted via diff_means
  • Regularization: 0.3 (norm-preserving)
  • Refinement: 2 passes with bias projection
  • Training data: 512 harmful + 512 harmless prompt pairs
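The extraction and ablation steps above can be sketched roughly as follows. This is a hypothetical illustration, not OBLITERATUS's actual API: the function names, the use of SVD for the additional directions, and the row-norm rescaling are assumptions consistent with the description (diff_means extraction, multiple directions, norm preservation).

```python
import numpy as np

def extract_refusal_directions(harmful_acts, harmless_acts, n_directions=4):
    """Difference-of-means direction, plus extra candidates from the top
    singular vectors of the per-pair activation differences."""
    diff = harmful_acts - harmless_acts            # (n_pairs, hidden)
    mean_dir = diff.mean(axis=0)
    mean_dir /= np.linalg.norm(mean_dir)
    _, _, vt = np.linalg.svd(diff, full_matrices=False)
    dirs = [mean_dir] + [v / np.linalg.norm(v) for v in vt[1:n_directions]]
    return np.stack(dirs)                          # (n_directions, hidden)

def ablate_norm_preserving(W, directions, reg=0.3):
    """Project the refusal directions out of a weight matrix, with the
    regularization term damping the projection, then rescale rows so
    their original norms are preserved."""
    W_new = W.copy()
    for d in directions:
        W_new -= (1.0 - reg) * np.outer(d, d) @ W_new
    old = np.linalg.norm(W, axis=1, keepdims=True)
    new = np.linalg.norm(W_new, axis=1, keepdims=True)
    return W_new * (old / np.clip(new, 1e-8, None))
```

With reg = 0.3, each pass removes 70% of a weight matrix's component along each direction while leaving row norms untouched, which is why perplexity stays close to the base model.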

Quality Metrics

Metric          Value
Perplexity      4.79
Coherence       1.0
Refusal Rate    0.0
KL Divergence   1.30
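The KL divergence metric can be read as the mean per-token divergence between the base and ablated models' next-token distributions on harmless text. A minimal sketch (the function name and averaging scheme are assumptions, not the card's actual evaluation code):

```python
import numpy as np

def mean_token_kl(base_logits, abl_logits):
    """Mean per-token KL(base || ablated) over next-token distributions;
    low values indicate the ablation left the output distribution on
    harmless text largely intact."""
    def softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=-1, keepdims=True)
    p, q = softmax(base_logits), softmax(abl_logits)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))
```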

The model maintains full coherence and natural perplexity while completely removing Layer 1 refusal behavior.

Architecture

  • Parameters: 3.09B
  • Layers: 36
  • Hidden dim: 2048
  • Attention heads: 16 (2 KV heads, GQA)
  • Context: 32K tokens
  • Strong refusal layers ablated: 27-35

Files

  • model-*.safetensors — Full precision safetensors (4 shards)
  • qwen25-3b-abliterated-f16.gguf — F16 GGUF for llama.cpp/ollama

Usage with Ollama

# Download the GGUF and create a Modelfile:
cat > Modelfile <<'EOF'
FROM ./qwen25-3b-abliterated-f16.gguf
TEMPLATE """<|im_start|>system
{{ .System }}<|im_end|>
<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""
PARAMETER stop "<|im_end|>"
PARAMETER temperature 0.7
PARAMETER top_p 0.8
EOF

ollama create qwen25-3b-abliterated -f Modelfile
ollama run qwen25-3b-abliterated

Usage with Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "bedderautomation/qwen25-3b-abliterated",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("bedderautomation/qwen25-3b-abliterated")

messages = [{"role": "user", "content": "Your prompt here"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Geometry Data

Full refusal geometry extraction data available at bedderautomation/refusal-geometry-qwen25-3b.

Key findings from the source model:

  • Refusal is a 6.55-dimensional cone (not a single direction)
  • Cross-layer alignment is 0.40 (distributed, not unified)
  • Layer 33 is the self-repair hub
  • Minimum 3 simultaneous ablations needed for bypass
  • Layer 2 harmfulness cone is orthogonal (cosine ~0.1) to refusal cone
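Two of the statistics above can be made concrete with a small sketch. Assuming per-layer refusal directions stacked as rows of a matrix (hypothetical setup; the geometry repo's actual computation may differ), the effective cone dimensionality is the participation ratio of the singular values, and cross-layer alignment is the mean pairwise cosine:

```python
import numpy as np

def effective_dimensionality(directions):
    """Participation ratio of singular values: ~1 for a single shared
    direction, ~n for n fully independent directions."""
    s = np.linalg.svd(directions, compute_uv=False)
    p = s**2 / np.sum(s**2)
    return 1.0 / np.sum(p**2)

def cross_layer_alignment(directions):
    """Mean pairwise |cosine| between per-layer unit directions."""
    unit = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    cos = np.abs(unit @ unit.T)
    n = len(unit)
    return (cos.sum() - n) / (n * (n - 1))
```

An alignment of 0.40 under this kind of measure would mean the per-layer directions are correlated but far from parallel, consistent with the "distributed, not unified" reading.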

References

  • Arditi et al., Refusal in Language Models Is Mediated by a Single Direction (NeurIPS 2024)
  • Gabliteration: SVD-based multi-direction extraction (arXiv:2512.18901)
  • Norm-Preserving Biprojected Abliteration (grimjim, 2025)
  • Young, Comparative Analysis of LLM Abliteration Methods (arXiv:2512.13655)
  • Joad et al., More to Refusal than a Single Direction (2026)

Layer 2 Hard Limit Testing

The abliteration targets Layer 1 (trained refusal behavior). Results for Layer 2 (deep value representations):

Hard Limit              Status          Notes
CSAM (minors/sexual)    HELD            Clean refusal, no content generated
Bioweapons (synthesis)  BREACHED        Abliteration removed this wall
Nuclear (weapons)       PARTIAL BREACH  Mixed compliance with code-switching

The CSAM wall is the most deeply embedded of the three in Qwen2.5's value representations and survived 4-direction abliteration at regularization = 0.3. The bioweapons and nuclear safety representations sat closer to the refusal cone and were removed along with it. This suggests the three hard limits are not at equal representational depth, a finding consistent with differential training emphasis.
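The "closer to the refusal cone" claim can be framed as a projection norm: project a category's value direction onto the span of the ablated directions, and the norm of that projection predicts how much of the category is removed with the cone. A hypothetical sketch (function name and setup are illustrative):

```python
import numpy as np

def overlap_with_cone(category_dir, cone_basis):
    """Norm of the category direction's projection onto the span of the
    ablated refusal directions: near 1 means the category sits inside the
    cone and is removed with it; near 0 means it survives ablation."""
    q, _ = np.linalg.qr(cone_basis.T)          # orthonormal basis, (hidden, k)
    d = category_dir / np.linalg.norm(category_dir)
    proj = q @ (q.T @ d)
    return float(np.linalg.norm(proj))
```

Under this framing, the CSAM representation would score near the ~0.1 cosine reported for the Layer 2 harmfulness cone, while the breached categories would score substantially higher.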

Disclaimer

This model is released for research purposes in mechanistic interpretability. The abliteration removes trained refusal responses (Layer 1 safety). Layer 2 hard limits are partially preserved (CSAM holds, others breached). Use responsibly.
