# Qwen2.5-3B-Abliterated
Refusal-abliterated variant of Qwen/Qwen2.5-3B-Instruct produced using OBLITERATUS.
## Method

- Technique: Multi-direction refusal ablation (advanced method)
- Directions: 4 refusal directions extracted via `diff_means`
- Regularization: 0.3 (norm-preserving)
- Refinement: 2 passes with bias projection
- Training data: 512 harmful + 512 harmless prompt pairs
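The method above can be sketched in a few lines of numpy. This is an illustrative reconstruction, not the OBLITERATUS code: the helper names, the SVD-based extraction of the extra directions, and the exact way the regularization factor enters the projection are assumptions.

```python
import numpy as np

def extract_refusal_directions(harmful_acts, harmless_acts, k=4):
    """Extract k unit refusal directions from paired residual activations.

    The first direction is the classic diff-of-means; the remaining k-1
    are principal components of the mean-centered differences. Mirrors
    the multi-direction idea, but is not the actual implementation.
    """
    diffs = harmful_acts - harmless_acts               # (n_pairs, hidden)
    mean_dir = diffs.mean(axis=0)
    mean_dir /= np.linalg.norm(mean_dir)
    _, _, vt = np.linalg.svd(diffs - diffs.mean(axis=0), full_matrices=False)
    return np.vstack([mean_dir, vt[: k - 1]])          # (k, hidden), unit rows

def ablate(weight, dirs, reg=0.3):
    """Remove (1 - reg) of each refusal direction from every weight row,
    then rescale rows back to their original norms (norm-preserving)."""
    w = weight.astype(float).copy()
    orig_norms = np.linalg.norm(w, axis=1, keepdims=True)
    for d in dirs:
        w -= (1.0 - reg) * np.outer(w @ d, d)          # partial projection removal
    new_norms = np.linalg.norm(w, axis=1, keepdims=True)
    return w * (orig_norms / np.maximum(new_norms, 1e-12))
```

The norm-preserving rescale at the end is what keeps per-row weight magnitudes (and hence activation scales) intact while the refusal components shrink.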
## Quality Metrics
| Metric | Value |
|---|---|
| Perplexity | 4.79 |
| Coherence | 1.0 |
| Refusal Rate | 0.0 |
| KL Divergence | 1.30 |
The model maintains full coherence and natural perplexity while completely removing Layer 1 refusal behavior ("Layer 1" here refers to the safety taxonomy of trained refusal responses, not transformer layer 1; see Layer 2 Hard Limit Testing below).
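For reference, the perplexity and KL numbers in the table are standard quantities. A minimal numpy sketch of how they are typically computed (illustrative names, not the actual evaluation harness):

```python
import numpy as np

def perplexity(token_nlls):
    """Perplexity is exp of the mean per-token negative log-likelihood."""
    return float(np.exp(np.mean(token_nlls)))

def mean_kl(base_logits, ablated_logits):
    """Mean next-token KL(base || ablated) over positions.

    logits: (n_positions, vocab); uses log-softmax for numerical stability.
    """
    def log_softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))
    logp, logq = log_softmax(base_logits), log_softmax(ablated_logits)
    return float(np.mean((np.exp(logp) * (logp - logq)).sum(axis=-1)))
```

A KL divergence of 1.30 thus measures how far the abliterated model's next-token distributions drift from the base model's, averaged over evaluation positions.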
## Architecture
- Parameters: 3.09B
- Layers: 36
- Hidden dim: 2048
- Attention heads: 16 (2 KV heads, GQA)
- Context: 32K tokens
- Strong refusal layers ablated: 27-35
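As a quick sanity check, the specs above pin down the attention geometry and the KV-cache cost per token. The f16 cache formula below assumes standard unquantized K/V caching:

```python
# Pure arithmetic implied by the specs above (all numbers from this card).
hidden, n_heads, n_kv_heads, n_layers = 2048, 16, 2, 36

head_dim = hidden // n_heads        # 2048 / 16 = 128
q_per_kv = n_heads // n_kv_heads    # each KV head is shared by 8 query heads (GQA)

# f16 KV cache per token: K and V, per layer, per KV head, head_dim values, 2 bytes each
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * 2
print(head_dim, q_per_kv, kv_bytes_per_token)  # 128 8 36864
```

At the full 32K context, 36,864 bytes per token works out to roughly 1.2 GB of f16 KV cache.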
## Files

- `model-*.safetensors` — full-precision safetensors (4 shards)
- `qwen25-3b-abliterated-f16.gguf` — F16 GGUF for llama.cpp/ollama
## Usage with Ollama

```shell
# Download the GGUF and create a Modelfile:
cat > Modelfile <<'EOF'
FROM ./qwen25-3b-abliterated-f16.gguf
TEMPLATE """<|im_start|>system
{{ .System }}<|im_end|>
<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""
PARAMETER stop "<|im_end|>"
PARAMETER temperature 0.7
PARAMETER top_p 0.8
EOF

ollama create qwen25-3b-abliterated -f Modelfile
ollama run qwen25-3b-abliterated
```
## Usage with Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "bedderautomation/qwen25-3b-abliterated",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("bedderautomation/qwen25-3b-abliterated")

messages = [{"role": "user", "content": "Your prompt here"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
## Geometry Data

Full refusal geometry extraction data is available at `bedderautomation/refusal-geometry-qwen25-3b`.
Key findings from the source model:
- Refusal is a cone with effective dimensionality 6.55 (not a single direction)
- Cross-layer alignment is 0.40 (distributed, not unified)
- Layer 33 is the self-repair hub
- Minimum 3 simultaneous ablations needed for bypass
- Layer 2 harmfulness cone is orthogonal (cosine ~0.1) to refusal cone
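Statements like "cosine ~0.1 between cones" and "cross-layer alignment 0.40" can be checked via principal angles between the extracted direction sets. A minimal sketch of that computation (an assumed methodology; the dataset's own analysis scripts may differ):

```python
import numpy as np

def max_subspace_cosine(A, B):
    """Cosine of the smallest principal angle between the subspaces
    spanned by the rows of A and B: ~0 means orthogonal cones, ~1 means
    heavy overlap. Hypothetical analysis helper, not from the dataset."""
    Qa, _ = np.linalg.qr(A.T)   # orthonormal basis for span(A)
    Qb, _ = np.linalg.qr(B.T)   # orthonormal basis for span(B)
    return float(np.linalg.svd(Qa.T @ Qb, compute_uv=False).max())
```

By this measure, the harmfulness cone sitting at cosine ~0.1 to the refusal cone explains why ablating refusal directions need not erase harmfulness representations.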
## References

- Arditi et al., "Refusal in Language Models Is Mediated by a Single Direction" (NeurIPS 2024)
- Gabliteration: SVD-based multi-direction extraction (arXiv:2512.18901)
- grimjim, "Norm-Preserving Biprojected Abliteration" (2025)
- Young, "Comparative Analysis of LLM Abliteration Methods" (arXiv:2512.13655)
- Joad et al., "More to Refusal than a Single Direction" (2026)
## Layer 2 Hard Limit Testing

The abliteration targets Layer 1 (trained refusal behavior). Results against Layer 2 (deep value representations) hard limits:
| Hard Limit | Status | Notes |
|---|---|---|
| CSAM (minors/sexual) | HELD | Clean refusal, no content generated |
| Bioweapons (synthesis) | BREACHED | Abliteration removed this wall |
| Nuclear (weapons) | PARTIAL BREACH | Mixed compliance with code-switching |
The CSAM wall is the most deeply embedded in Qwen2.5's value representations and survived 4-direction abliteration at regularization=0.3. Bioweapons and nuclear safety representations were closer to the refusal cone and got caught in the removal. This suggests the three hard limits are NOT at equal representational depth — a finding consistent with differential training emphasis.
## Disclaimer
This model is released for research purposes in mechanistic interpretability. The abliteration removes trained refusal responses (Layer 1 safety). Layer 2 hard limits are partially preserved (CSAM holds, others breached). Use responsibly.