Update README.md
README.md
CHANGED
@@ -21,8 +21,21 @@ base_model:
 - openai-community/gpt2-medium
 ---
 
+
 # Model Card
 
+<p align="center">
+<a href="https://arxiv.org/abs/2405.13967">
+<img src="https://img.shields.io/badge/arXiv-2405.13967-B31B1B?logo=arxiv&logoColor=white" alt="arXiv">
+</a>
+<a href="https://uppaal.github.io/projects/profs/profs.html">
+<img src="https://img.shields.io/badge/Project_Webpage-1DA1F2?logo=google-chrome&logoColor=white&color=0A4D8C" alt="Project Webpage">
+</a>
+<a href="https://github.com/Uppaal/detox-edit">
+<img src="https://img.shields.io/badge/Code-F1C232?logo=github&logoColor=white&color=black" alt="Checkpoints">
+</a>
+</p>
+
 This model accompanies the paper [Model Editing as a Robust and Denoised Variant of DPO: A Case Study on Toxicity](https://arxiv.org/abs/2405.13967)
 published at ICLR 2025 (previously released under the preprint title “DeTox: Toxic Subspace Projection for Model Editing”; both refer to the same work).
 
@@ -37,7 +50,7 @@ ProFS (Projection Filter for Subspaces) is a tuning-free alignment method that r
 - Theoretically grounded: shown to be a denoised, single-step approximation of Direct Preference Optimization (DPO)—bridging editing-based and tuning-based alignment.
 
 <div align="center">
-<img src="https://github.com/Uppaal/detox-edit/blob/main/ProFS Method.png" width="450"/>
+<img src="https://github.com/Uppaal/detox-edit/blob/main/assets/ProFS Method.png" width="450"/>
 <i><b>Figure.</b> Schematic of ProFS (previously called DeTox). Toxic directions (in red) are projected out of the model’s MLP-value matrices, leaving other representational directions intact. </i>
 </div>
 
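The figure caption in this diff says that toxic directions are projected out of the model's MLP-value matrices while other directions are left intact. As a rough illustration of that idea only, here is a minimal pure-Python sketch of removing a subspace from a set of vectors by orthogonal projection. The helper names (`dot`, `orthonormalize`, `project_out`) and the toy numbers are hypothetical; how ProFS actually identifies toxic directions and which layers it edits are not described in this diff and are not modeled here.

```python
# Illustrative sketch only: this is NOT the ProFS implementation.
# All function names and the toy data below are hypothetical.

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def orthonormalize(directions, eps=1e-12):
    """Gram-Schmidt: turn a list of vectors into an orthonormal basis."""
    basis = []
    for v in directions:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > eps:  # skip directions already spanned by the basis
            basis.append([wi / norm for wi in w])
    return basis

def project_out(rows, directions):
    """Remove the span of `directions` from every row vector in `rows`."""
    basis = orthonormalize(directions)
    edited = []
    for r in rows:
        w = list(r)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        edited.append(w)
    return edited

# Toy 3-dimensional example: one unwanted direction along the first axis.
value_vectors = [[1.0, 2.0, 3.0], [4.0, 0.0, -1.0]]
toxic_directions = [[2.0, 0.0, 0.0]]
edited = project_out(value_vectors, toxic_directions)
```

In this toy case each edited vector has no component along the first axis, and its components along the other two axes are unchanged, which mirrors the "leaving other representational directions intact" wording of the caption.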