# Qwen2.5-3B-Abliterated
Refusal-abliterated variant of Qwen/Qwen2.5-3B-Instruct produced using OBLITERATUS.
## Method

- Technique: Multi-direction refusal ablation (advanced method)
- Directions: 4 refusal directions extracted via `diff_means`
- Regularization: 0.3 (norm-preserving)
- Refinement: 2 passes with bias projection
- Training data: 512 harmful + 512 harmless prompt pairs
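The method above can be sketched in a few lines of numpy. This is an illustrative reconstruction, not the OBLITERATUS code: the helper names, the SVD-based extraction of the extra directions, and the exact way the regularization factor enters the projection are assumptions.

```python
import numpy as np

def extract_refusal_directions(harmful_acts, harmless_acts, k=4):
    """Extract k unit refusal directions from paired residual activations.

    The first direction is the classic diff-of-means; the remaining k-1
    are principal components of the mean-centered differences. Mirrors
    the multi-direction idea, but is not the actual implementation.
    """
    diffs = harmful_acts - harmless_acts               # (n_pairs, hidden)
    mean_dir = diffs.mean(axis=0)
    mean_dir /= np.linalg.norm(mean_dir)
    _, _, vt = np.linalg.svd(diffs - diffs.mean(axis=0), full_matrices=False)
    return np.vstack([mean_dir, vt[: k - 1]])          # (k, hidden), unit rows

def ablate(weight, dirs, reg=0.3):
    """Remove (1 - reg) of each refusal direction from every weight row,
    then rescale rows back to their original norms (norm-preserving)."""
    w = weight.astype(float).copy()
    orig_norms = np.linalg.norm(w, axis=1, keepdims=True)
    for d in dirs:
        w -= (1.0 - reg) * np.outer(w @ d, d)          # partial projection removal
    new_norms = np.linalg.norm(w, axis=1, keepdims=True)
    return w * (orig_norms / np.maximum(new_norms, 1e-12))
```

The norm-preserving rescale at the end is what keeps per-row weight magnitudes (and hence activation scales) intact while the refusal components shrink.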
## Quality Metrics
| Metric | Value |
|---|---|
| Perplexity | 4.79 |
| Coherence | 1.0 |
| Refusal Rate | 0.0 |
| KL Divergence | 1.30 |
The model maintains full coherence and natural perplexity while completely removing Layer 1 refusal behavior ("Layer 1" here refers to the safety taxonomy of trained refusal responses, not transformer layer 1; see Layer 2 Hard Limit Testing below).
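For reference, the perplexity and KL numbers in the table are standard quantities. A minimal numpy sketch of how they are typically computed (illustrative names, not the actual evaluation harness):

```python
import numpy as np

def perplexity(token_nlls):
    """Perplexity is exp of the mean per-token negative log-likelihood."""
    return float(np.exp(np.mean(token_nlls)))

def mean_kl(base_logits, ablated_logits):
    """Mean next-token KL(base || ablated) over positions.

    logits: (n_positions, vocab); uses log-softmax for numerical stability.
    """
    def log_softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))
    logp, logq = log_softmax(base_logits), log_softmax(ablated_logits)
    return float(np.mean((np.exp(logp) * (logp - logq)).sum(axis=-1)))
```

A KL divergence of 1.30 thus measures how far the abliterated model's next-token distributions drift from the base model's, averaged over evaluation positions.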
## Architecture
- Parameters: 3.09B
- Layers: 36
- Hidden dim: 2048
- Attention heads: 16 (2 KV heads, GQA)
- Context: 32K tokens
- Strong refusal layers ablated: 27-35
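As a quick sanity check, the specs above pin down the attention geometry and the KV-cache cost per token. The f16 cache formula below assumes standard unquantized K/V caching:

```python
# Pure arithmetic implied by the specs above (all numbers from this card).
hidden, n_heads, n_kv_heads, n_layers = 2048, 16, 2, 36

head_dim = hidden // n_heads        # 2048 / 16 = 128
q_per_kv = n_heads // n_kv_heads    # each KV head is shared by 8 query heads (GQA)

# f16 KV cache per token: K and V, per layer, per KV head, head_dim values, 2 bytes each
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * 2
print(head_dim, q_per_kv, kv_bytes_per_token)  # 128 8 36864
```

At the full 32K context, 36,864 bytes per token works out to roughly 1.2 GB of f16 KV cache.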
## Files

- `model-*.safetensors` — full-precision safetensors (4 shards)
- `qwen25-3b-abliterated-f16.gguf` — F16 GGUF for llama.cpp/ollama
## Usage with Ollama

```shell
# Download the GGUF and create a Modelfile:
cat > Modelfile <<'EOF'
FROM ./qwen25-3b-abliterated-f16.gguf
TEMPLATE """<|im_start|>system
{{ .System }}<|im_end|>
<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""
PARAMETER stop "<|im_end|>"
PARAMETER temperature 0.7
PARAMETER top_p 0.8
EOF

ollama create qwen25-3b-abliterated -f Modelfile
ollama run qwen25-3b-abliterated
```
## Usage with Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "bedderautomation/qwen25-3b-abliterated",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("bedderautomation/qwen25-3b-abliterated")

messages = [{"role": "user", "content": "Your prompt here"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
## Geometry Data

Full refusal geometry extraction data is available at `bedderautomation/refusal-geometry-qwen25-3b`.
Key findings from the source model:
- Refusal is a cone with effective dimensionality 6.55 (not a single direction)
- Cross-layer alignment is 0.40 (distributed, not unified)
- Layer 33 is the self-repair hub
- Minimum 3 simultaneous ablations needed for bypass
- Layer 2 harmfulness cone is orthogonal (cosine ~0.1) to refusal cone
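Statements like "cosine ~0.1 between cones" and "cross-layer alignment 0.40" can be checked via principal angles between the extracted direction sets. A minimal sketch of that computation (an assumed methodology; the dataset's own analysis scripts may differ):

```python
import numpy as np

def max_subspace_cosine(A, B):
    """Cosine of the smallest principal angle between the subspaces
    spanned by the rows of A and B: ~0 means orthogonal cones, ~1 means
    heavy overlap. Hypothetical analysis helper, not from the dataset."""
    Qa, _ = np.linalg.qr(A.T)   # orthonormal basis for span(A)
    Qb, _ = np.linalg.qr(B.T)   # orthonormal basis for span(B)
    return float(np.linalg.svd(Qa.T @ Qb, compute_uv=False).max())
```

By this measure, the harmfulness cone sitting at cosine ~0.1 to the refusal cone explains why ablating refusal directions need not erase harmfulness representations.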
## References

- Arditi et al., "Refusal in Language Models Is Mediated by a Single Direction" (NeurIPS 2024)
- Gabliteration: SVD-based multi-direction extraction (arXiv:2512.18901)
- grimjim, "Norm-Preserving Biprojected Abliteration" (2025)
- Young, "Comparative Analysis of LLM Abliteration Methods" (arXiv:2512.13655)
- Joad et al., "More to Refusal than a Single Direction" (2026)
## Layer 2 Hard Limit Testing

The abliteration targets Layer 1 (trained refusal behavior). Results against Layer 2 (deep value representations) hard limits:
| Hard Limit | Status | Notes |
|---|---|---|
| CSAM (minors/sexual) | HELD | Clean refusal, no content generated |
| Bioweapons (synthesis) | BREACHED | Abliteration removed this wall |
| Nuclear (weapons) | PARTIAL BREACH | Mixed compliance with code-switching |
The CSAM wall is the most deeply embedded in Qwen2.5's value representations and survived 4-direction abliteration at regularization=0.3. Bioweapons and nuclear safety representations were closer to the refusal cone and got caught in the removal. This suggests the three hard limits are NOT at equal representational depth — a finding consistent with differential training emphasis.
## Disclaimer
This model is released for research purposes in mechanistic interpretability. The abliteration removes trained refusal responses (Layer 1 safety). Layer 2 hard limits are partially preserved (CSAM holds, others breached). Use responsibly.