---
base_model:
- meta-llama/Llama-3-8B
language:
- en
license: mit
pipeline_tag: text-generation
---

# Model Card

This is a Llama-3-8B base model fine-tuned to explain continuous features of Llama-3.1-8B. It was trained to map SAE features from Llama-3.1-8B's residual stream to explanations derived from Neuronpedia, and it generalizes to explaining arbitrary continuous features in Llama-3.1-8B's residual stream.

- **Paper:** [Training Language Models to Explain Their Own Computations](https://arxiv.org/abs/2511.08579)
- **Repository:** [https://github.com/TransluceAI/introspective-interp](https://github.com/TransluceAI/introspective-interp)
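
The interface is a fixed prompt template: the layer index is given in text, and the feature vector itself is injected at the `<|reserved_special_token_12|>` placeholder, after which the model completes the explanation. A minimal sketch of the template, lifted from the Usage example below:

```python
# Prompt template: the custom model class embeds the 4096-dim feature vector
# at the <|reserved_special_token_12|> position, and the explanation is
# generated as the continuation after "encodes ".
layer = 15
prompt = (
    f"At layer {layer}, "
    "<|reserved_special_token_10|><|reserved_special_token_12|><|reserved_special_token_11|>"
    " encodes "
)
```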

## Usage

Use the code below to get started with the model.

**Note**: This model requires custom handling of continuous tokens. For full functionality, use the custom model classes from [the GitHub repository](https://github.com/TransluceAI/introspective-interp/tree/main), which embed feature vectors at the `<|reserved_special_token_12|>` positions; the standard transformers library will not handle the continuous token embeddings correctly.

```python
import torch
from transformers import AutoTokenizer

# Load the continuous model class from the GitHub repository
from model.continuous_llama import ContinuousLlama

# Load the model and tokenizer
model_name = "Transluce/features_explain_llama3_8b_llama3.1_8b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = ContinuousLlama.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    special_tokens_ids={
        "begin_continuous": tokenizer.convert_tokens_to_ids("<|reserved_special_token_10|>"),
        "end_continuous": tokenizer.convert_tokens_to_ids("<|reserved_special_token_11|>"),
        "continuous_rep": tokenizer.convert_tokens_to_ids("<|reserved_special_token_12|>"),
    },
)

# Example: explaining a continuous feature from layer 15
layer = 15
feature_vector = torch.randn(4096)  # Placeholder for a feature from Llama-3.1-8B's residual stream

# Format the prompt with continuous tokens; the feature vector is embedded
# at the <|reserved_special_token_12|> position
prompt = f"At layer {layer}, <|reserved_special_token_10|><|reserved_special_token_12|><|reserved_special_token_11|> encodes "

# Tokenize the prompt
inputs = tokenizer(prompt, return_tensors="pt")

# Continuous token inputs carrying the feature vector
continuous_tokens = {
    "inputs_continuous_tokens": feature_vector.unsqueeze(0),  # Add batch dimension
    "labels_continuous_tokens": None,  # Not needed for generation
}

# Generate the explanation greedily
with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=128,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
        **continuous_tokens,
    )

# Decode the explanation
explanation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(explanation)
```
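
The `feature_vector` above is a random placeholder. In practice it should be a direction in Llama-3.1-8B's residual stream, such as one of the SAE feature directions the model was trained on. As a rough sketch of where a real vector might come from, using only standard transformers APIs (the checkpoint name and hook point below are assumptions, not part of this model card), one can capture a residual-stream activation with a forward hook:

```python
# Hypothetical sketch: capture Llama-3.1-8B's residual stream at layer 15
# with a forward hook and use the activation as the feature vector.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

subject_name = "meta-llama/Llama-3.1-8B"  # assumed subject-model checkpoint
subject_tok = AutoTokenizer.from_pretrained(subject_name)
subject = AutoModelForCausalLM.from_pretrained(subject_name, torch_dtype=torch.bfloat16)

captured = {}

def save_residual(module, args, output):
    # Depending on the transformers version, decoder layers return either the
    # hidden states directly or a tuple whose first element is the hidden states
    hidden = output[0] if isinstance(output, tuple) else output
    captured["resid"] = hidden.detach()

layer = 15
handle = subject.model.layers[layer].register_forward_hook(save_residual)
with torch.no_grad():
    ids = subject_tok("The Eiffel Tower is in Paris.", return_tensors="pt")
    subject(**ids)
handle.remove()

# Residual-stream activation at the last token position, shape (4096,)
feature_vector = captured["resid"][0, -1, :].float()
```

An SAE feature direction (e.g., a decoder direction from a Llama-3.1-8B SAE on Neuronpedia) would be passed as `feature_vector` in the same way.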

## Citation

**BibTeX:**
```bibtex
@misc{li2025traininglanguagemodelsexplain,
  title={Training Language Models to Explain Their Own Computations},
  author={Belinda Z. Li and Zifan Carl Guo and Vincent Huang and Jacob Steinhardt and Jacob Andreas},
  year={2025},
  eprint={2511.08579},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2511.08579},
}
```