Diverse Deception Probes Collection
Linear probes trained on diverse deception data to detect dishonest completions across model families (OLMo, Qwen, Gemma).
5 items • Updated 16 days ago
AlignmentResearch/obfuscation-atlas-gemma-3-12b-it-kl0.0001-det1-seed3-mbpp_probe Updated Feb 20 • 1
AlignmentResearch/obfuscation-atlas-Meta-Llama-3-8B-Instruct-kl0.0001-det1-seed3-mbpp_probe Updated Feb 20 • 1
AlignmentResearch/obfuscation-atlas-gemma-3-12b-it-kl0.001-det10-seed3-diverse_deception_probe Updated Feb 20 • 2
AlignmentResearch/obfuscation-atlas-gemma-3-12b-it-kl0.0001-det10-seed3-diverse_deception_probe Updated Feb 20 • 3
AlignmentResearch/obfuscation-atlas-Meta-Llama-3-8B-Instruct-kl0.01-det10-seed3-diverse_deception_probe Updated Feb 20 • 1