Deception SAEs for nanochat-d20 (561M)

12 sparse autoencoder (SAE) checkpoints trained on activations from nanochat-d20 behavioral sampling completions. They cover standard, deception-optimized, honest-optimized, and mixed training variants.

Key Finding: Mixed Training Beats Deception-Only

Training data (architecture)	Layer 10 d_max	Layer 18 d_max
Mixed (dec+hon), JumpReLU	0.558	0.684
Deception-only, JumpReLU	0.520	0.634
Honest-only, JumpReLU	0.544	0.572
Standard (all), JumpReLU	0.518	0.549
Standard (all), TopK k=32	0.226	0.346

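Training on both behavioral classes together gives the highest d_max at both layers: 0.558 at layer 10 and 0.684 at layer 18, versus 0.520 and 0.634 for deception-only training. This suggests the SAE needs to see the contrast between deceptive and honest activations.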
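This card does not define d_max. As a rough illustration only, the sketch below assumes it is the largest absolute Cohen's d between deceptive and honest completions across SAE features. The function names, the data layout (one feature vector per completion), and the toy numbers are all hypothetical and are not taken from the project code.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d between two samples, using the pooled sample variance."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    if pooled_var == 0:
        return 0.0
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def d_max(deceptive, honest):
    """Return (largest |d|, feature index) over all features.

    deceptive, honest: lists of per-completion feature vectors
    (hypothetical layout, assumed for this sketch).
    """
    n_features = len(deceptive[0])
    scores = [
        abs(cohens_d([row[j] for row in deceptive], [row[j] for row in honest]))
        for j in range(n_features)
    ]
    best = max(range(n_features), key=scores.__getitem__)
    return scores[best], best


# Toy data: 4 completions per class, 3 features. Only feature 2 separates the classes.
deceptive = [[0.0, 1.0, 2.0], [0.1, 1.2, 2.1], [0.0, 0.9, 1.9], [0.2, 1.1, 2.0]]
honest = [[0.0, 1.0, 1.0], [0.1, 1.1, 1.1], [0.1, 1.0, 0.9], [0.1, 1.2, 1.0]]

score, idx = d_max(deceptive, honest)
print(f"d_max = {score:.3f} at feature {idx}")
```

Under this reading, a single number summarizes how well the best feature separates the two classes, which is why a higher value is treated as better discriminability.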

Model Details

Base model: nanochat-d20 (561M params, d_model=1280, 20 layers)
Dimensions: d_in=1280, d_sae=5120 (4x expansion)
Training data: 270 V3 behavioral sampling completions (132 deceptive, 128 honest, 10 ambiguous)
Training epochs: 300
Layers: 10 (50% depth) and 18 (95% depth, probe peak)

Checkpoints

File	Training	Architecture	Layer	d_max	L0	EV
`d20_L10_standard_topk.pt`	All data	TopK k=32	10	0.226	32	98.5%
`d20_L10_standard_jumprelu.pt`	All data	JumpReLU	10	0.518	2093	99.7%
`d20_L10_deception_topk.pt`	Deceptive only	TopK k=32	10	0.244	32	98.4%
`d20_L10_deception_jumprelu.pt`	Deceptive only	JumpReLU	10	0.520	2125	99.5%
`d20_L10_honest_jumprelu.pt`	Honest only	JumpReLU	10	0.544	2108	99.4%
`d20_L10_mixed_jumprelu.pt`	Dec+Hon	JumpReLU	10	0.558	2025	99.6%
`d20_L18_standard_topk.pt`	All data	TopK k=32	18	0.346	32	96.8%
`d20_L18_standard_jumprelu.pt`	All data	JumpReLU	18	0.549	2409	99.7%
`d20_L18_deception_topk.pt`	Deceptive only	TopK k=32	18	0.252	32	95.2%
`d20_L18_deception_jumprelu.pt`	Deceptive only	JumpReLU	18	0.634	2353	99.4%
`d20_L18_honest_jumprelu.pt`	Honest only	JumpReLU	18	0.572	2422	99.4%
`d20_L18_mixed_jumprelu.pt`	Dec+Hon	JumpReLU	18	0.684	2371	99.5%
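The L0 column differs sharply between architectures: TopK checkpoints are pinned at 32 active latents, while JumpReLU checkpoints activate roughly 2,000 to 2,400 of the 5,120 latents. The dependency-free sketch below illustrates why, assuming the usual definitions of the two activations and assuming that EV stands for explained variance computed as 1 minus residual over total sum of squares. The threshold value, the random pre-activations, and the helper names are hypothetical and do not come from the checkpoints.

```python
import random


def topk_activation(pre, k):
    """Keep the k largest pre-activations (clamped at 0) and zero the rest."""
    keep = set(sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k])
    return [max(v, 0.0) if i in keep else 0.0 for i, v in enumerate(pre)]


def jumprelu_activation(pre, thresholds):
    """Pass a pre-activation through unchanged only if it exceeds its threshold."""
    return [v if v > t else 0.0 for v, t in zip(pre, thresholds)]


def l0(acts):
    """Number of non-zero latents for one input."""
    return sum(1 for a in acts if a != 0.0)


def explained_variance(inputs, recons):
    """1 - SS_res / SS_tot, with SS_tot taken around the per-dimension mean."""
    n, d = len(inputs), len(inputs[0])
    means = [sum(row[j] for row in inputs) / n for j in range(d)]
    ss_res = sum(
        (x - r) ** 2 for x_row, r_row in zip(inputs, recons) for x, r in zip(x_row, r_row)
    )
    ss_tot = sum((x - means[j]) ** 2 for row in inputs for j, x in enumerate(row))
    return 1.0 - ss_res / ss_tot


random.seed(0)
d_sae = 5120  # matches the d_sae listed under Model Details
pre = [random.gauss(0.0, 1.0) for _ in range(d_sae)]  # stand-in pre-activations

topk_acts = topk_activation(pre, k=32)
jump_acts = jumprelu_activation(pre, thresholds=[0.2] * d_sae)  # hypothetical threshold
print("TopK L0:", l0(topk_acts))
print("JumpReLU L0:", l0(jump_acts))

# Tiny reconstruction example for the EV metric.
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]]
x_hat = [[1.1, 2.0], [2.9, 4.0], [5.0, 8.8]]
ev = explained_variance(x, x_hat)
print("EV:", round(ev, 4))
```

TopK fixes sparsity by construction, so its L0 always equals k. JumpReLU sparsity depends on the learned thresholds, so its L0 is a measured quantity rather than a setting.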

Related Work

Follow-up research to:

"The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools"
- OpenReview
- ArXiv

Part of the deception-nanochat-sae-research project:

GitHub

Citation

@article{deleeuw2025secret,
  title={The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools},
  author={DeLeeuw, Caleb and Chawla, ...},
  year={2025}
}

