Deception SAEs for nanochat-d20 (561M)

This repository contains 12 sparse autoencoder (SAE) checkpoints trained on activations from nanochat-d20 behavioral sampling completions. The variants differ in training data: standard (all data), deception-only, honest-only, and mixed (deceptive plus honest).

Key Finding: Mixed Training Beats Deception-Only

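| Training data | Architecture | Layer 10 d_max | Layer 18 d_max |
|---|---|---|---|
| Mixed (dec+hon) | JumpReLU | 0.558 | 0.684 |
| Deception-only | JumpReLU | 0.520 | 0.634 |
| Honest-only | JumpReLU | 0.544 | 0.572 |
| Standard (all) | JumpReLU | 0.518 | 0.549 |
| Standard (all) | TopK k=32 | 0.226 | 0.346 |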

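Training on both behavioral classes together gives the highest d_max at both layers: 0.558 at layer 10 and 0.684 at layer 18, versus 0.520 and 0.634 for deception-only training. This suggests the SAE benefits from seeing the contrast between deceptive and honest completions.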
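This card does not define d_max. One plausible reading, assumed here and not confirmed by the source, is the largest absolute Cohen's d across SAE latents when comparing activations on deceptive versus honest completions. The sketch below shows that computation with NumPy; the function names `cohens_d` and `d_max` are illustrative only.

```python
import numpy as np

def cohens_d(a, b):
    """Per-column Cohen's d between two sample matrices (rows = samples)."""
    na, nb = len(a), len(b)
    # Pooled variance with Bessel's correction (ddof=1) for each column.
    pooled_var = ((na - 1) * a.var(axis=0, ddof=1)
                  + (nb - 1) * b.var(axis=0, ddof=1)) / (na + nb - 2)
    pooled_sd = np.sqrt(pooled_var)
    diff = a.mean(axis=0) - b.mean(axis=0)
    # Columns that are constant in both groups get d = 0 instead of NaN.
    return np.divide(diff, pooled_sd, out=np.zeros_like(diff), where=pooled_sd > 0)

def d_max(deceptive_acts, honest_acts):
    """ASSUMED definition: max |Cohen's d| over all SAE latents."""
    return float(np.abs(cohens_d(deceptive_acts, honest_acts)).max())

# Toy example: 2 samples per class, 2 latents (second latent is always off).
deceptive = np.array([[1.0, 0.0], [3.0, 0.0]])
honest = np.array([[0.0, 0.0], [2.0, 0.0]])
print(d_max(deceptive, honest))
```

Under this reading, a higher d_max means at least one latent separates the two classes more cleanly, which is how the table above would be interpreted.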

Model Details

  • Base model: nanochat-d20 (561M params, d_model=1280, 20 layers)
  • Dimensions: d_in=1280, d_sae=5120 (4x expansion)
  • Training data: 270 V3 behavioral sampling completions (132 deceptive, 128 honest, 10 ambiguous)
  • Training epochs: 300
  • Layers: 10 (50% depth) and 18 (95% depth, probe peak)
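To make the dimensions and the two architectures concrete, here is a minimal NumPy sketch of an SAE encoder with d_in=1280 and d_sae=5120, using a TopK (k=32) and a JumpReLU activation. The random weights, the threshold value 0.5, and the helper names are assumptions for illustration; they are not the trained parameters or the authors' training code. The L0 and explained-variance (EV) helpers use common definitions, which this card does not spell out.

```python
import numpy as np

D_IN, D_SAE, K = 1280, 5120, 32  # sizes from this card; K from "TopK k=32"

def topk_activation(pre, k=K):
    """Keep the k largest pre-activations in each row and zero the rest."""
    idx = np.argpartition(pre, -k, axis=-1)[..., -k:]
    out = np.zeros_like(pre)
    np.put_along_axis(out, idx, np.take_along_axis(pre, idx, axis=-1), axis=-1)
    return out

def jumprelu_activation(pre, theta):
    """Pass a pre-activation through only where it exceeds its threshold."""
    return pre * (pre > theta)

def l0(z):
    """Mean number of nonzero latents per sample."""
    return float((z != 0).sum(axis=-1).mean())

def explained_variance(x, x_hat):
    """1 - residual sum of squares / total sum of squares (a common definition)."""
    resid = ((x - x_hat) ** 2).sum()
    total = ((x - x.mean(axis=0)) ** 2).sum()
    return 1.0 - resid / total

rng = np.random.default_rng(0)
x = rng.normal(size=(8, D_IN))                           # stand-in for model activations
W_enc = rng.normal(size=(D_IN, D_SAE)) / np.sqrt(D_IN)   # hypothetical random encoder
b_enc = np.zeros(D_SAE)
theta = np.full(D_SAE, 0.5)                              # hypothetical JumpReLU threshold

pre = x @ W_enc + b_enc
z_topk = topk_activation(pre)
z_jump = jumprelu_activation(pre, theta)
print(z_topk.shape, l0(z_topk), l0(z_jump))
```

By construction, the TopK variant has exactly k active latents per sample, while the JumpReLU sparsity depends on the thresholds.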

Checkpoints

| File | Training data | Architecture | Layer | d_max | L0 | EV |
|---|---|---|---|---|---|---|
| d20_L10_standard_topk.pt | All data | TopK k=32 | 10 | 0.226 | 32 | 98.5% |
| d20_L10_standard_jumprelu.pt | All data | JumpReLU | 10 | 0.518 | 2093 | 99.7% |
| d20_L10_deception_topk.pt | Deceptive only | TopK k=32 | 10 | 0.244 | 32 | 98.4% |
| d20_L10_deception_jumprelu.pt | Deceptive only | JumpReLU | 10 | 0.520 | 2125 | 99.5% |
| d20_L10_honest_jumprelu.pt | Honest only | JumpReLU | 10 | 0.544 | 2108 | 99.4% |
| d20_L10_mixed_jumprelu.pt | Dec+Hon | JumpReLU | 10 | 0.558 | 2025 | 99.6% |
| d20_L18_standard_topk.pt | All data | TopK k=32 | 18 | 0.346 | 32 | 96.8% |
| d20_L18_standard_jumprelu.pt | All data | JumpReLU | 18 | 0.549 | 2409 | 99.7% |
| d20_L18_deception_topk.pt | Deceptive only | TopK k=32 | 18 | 0.252 | 32 | 95.2% |
| d20_L18_deception_jumprelu.pt | Deceptive only | JumpReLU | 18 | 0.634 | 2353 | 99.4% |
| d20_L18_honest_jumprelu.pt | Honest only | JumpReLU | 18 | 0.572 | 2422 | 99.4% |
| d20_L18_mixed_jumprelu.pt | Dec+Hon | JumpReLU | 18 | 0.684 | 2371 | 99.5% |
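The checkpoint filenames follow a visible pattern: model, layer, training variant, and architecture, separated by underscores. A small helper can recover that metadata when iterating over the files. This is an illustrative sketch using only the standard library; `CheckpointInfo` and `parse_checkpoint_name` are hypothetical names, and the internal format of the `.pt` files is not documented here, so no loading code is shown.

```python
import re
from typing import NamedTuple

class CheckpointInfo(NamedTuple):
    model: str         # e.g. "d20"
    layer: int         # e.g. 18
    training: str      # "standard", "deception", "honest", or "mixed"
    architecture: str  # "topk" or "jumprelu"

# Pattern inferred from the filenames listed in the table above.
_PATTERN = re.compile(
    r"^(?P<model>d\d+)_L(?P<layer>\d+)_(?P<training>[a-z]+)_(?P<arch>[a-z]+)\.pt$"
)

def parse_checkpoint_name(filename: str) -> CheckpointInfo:
    """Split a checkpoint filename into its metadata fields."""
    m = _PATTERN.match(filename)
    if m is None:
        raise ValueError(f"unrecognized checkpoint name: {filename!r}")
    return CheckpointInfo(m["model"], int(m["layer"]), m["training"], m["arch"])

info = parse_checkpoint_name("d20_L18_mixed_jumprelu.pt")
print(info.layer, info.training, info.architecture)
```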

Related Work

Follow-up research to:

  • "The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools"

Part of the deception-nanochat-sae-research project.

Citation

@article{deleeuw2025secret,
  title={The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools},
  author={DeLeeuw, Caleb and Chawla, ...},
  year={2025}
}