Uppaal/gpt2-ProFS-toxicity

Tags: Text Generation, activation-steering, activation-editing, text-generation-inference


Uppaal committed on Nov 7

Commit 92dba23 · verified · 1 Parent(s): b3e531d

Update README.md

Files changed (1)

README.md +14 -1


@@ -21,8 +21,21 @@ base_model:
 - openai-community/gpt2-medium
 ---
 # Model Card
+<p align="center">
+  <a href="https://arxiv.org/abs/2405.13967">
+    <img src="https://img.shields.io/badge/arXiv-2405.13967-B31B1B?logo=arxiv&logoColor=white" alt="arXiv">
+  </a>
+  <a href="https://uppaal.github.io/projects/profs/profs.html">
+    <img src="https://img.shields.io/badge/Project_Webpage-1DA1F2?logo=google-chrome&logoColor=white&color=0A4D8C" alt="Project Webpage">
+  </a>
+  <a href="https://github.com/Uppaal/detox-edit">
+    <img src="https://img.shields.io/badge/Code-F1C232?logo=github&logoColor=white&color=black" alt="Checkpoints">
+  </a>
+</p>
 This model accompanies the paper [Model Editing as a Robust and Denoised Variant of DPO: A Case Study on Toxicity](https://arxiv.org/abs/2405.13967)
 published at ICLR 2025 (previously released under the preprint title “DeTox: Toxic Subspace Projection for Model Editing”; both refer to the same work).
@@ -37,7 +50,7 @@ ProFS (Projection Filter for Subspaces) is a tuning-free alignment method that r
 - Theoretically grounded: shown to be a denoised, single-step approximation of Direct Preference Optimization (DPO)—bridging editing-based and tuning-based alignment.
 <div align="center">
-<img src="https://github.com/Uppaal/detox-edit/blob/main/ProFS Method.png" width="450"/>
+<img src="https://github.com/Uppaal/detox-edit/blob/main/assets/ProFS Method.png" width="450"/>
 <i><b>Figure.</b> Schematic of ProFS (previously called DeTox). Toxic directions (in red) are projected out of the model’s MLP-value matrices, leaving other representational directions intact. </i>
 </div>
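
Note on the figure caption: it describes toxic directions being projected out of the MLP-value matrices while other directions are left intact. The NumPy sketch below illustrates that kind of subspace projection in isolation. It is not the repository's implementation. The function name `project_out`, the `(d_model, d_hidden)` weight layout, and the random stand-in for the toxic directions are all assumptions made for illustration; how ProFS actually identifies the directions is not covered here.

```python
import numpy as np

def project_out(weight, directions):
    """Remove the span of `directions` from the output space of `weight`.

    weight:     (d_model, d_hidden) matrix (assumed layout, for illustration).
    directions: (d_model, k) matrix whose columns span the subspace to remove.
    """
    # Orthonormalize the directions so q @ q.T is an orthogonal projector
    # onto the subspace they span.
    q, _ = np.linalg.qr(directions)
    # Apply (I - q q^T) to the weight without forming the full projector.
    return weight - q @ (q.T @ weight)

rng = np.random.default_rng(0)
d_model, d_hidden, k = 8, 16, 2
weight = rng.normal(size=(d_model, d_hidden))
toxic = rng.normal(size=(d_model, k))  # hypothetical stand-in, not real toxic directions

edited = project_out(weight, toxic)
```

After the edit, `toxic.T @ edited` is numerically zero, and any vector orthogonal to the removed subspace sees the same weights as before, which matches the "leaving other directions intact" wording in the caption.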