---
library_name: transformers
license: cc0-1.0
base_model: google-t5/t5-large
tags:
- generated_from_trainer
metrics:
- rouge
model-index:
- name: PreDA_t5-large
  results: []
language:
- en
---


![framework](preda_architecture_digram.png)

# PreDA-large (Prefix-Based Dream Reports Annotation)

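This model is a fine-tuned version of [google-t5/t5-large](https://huggingface.co/google-t5/t5-large) on the annotated [Dreambank.net](https://dreambank.net/) dataset. It achieves the following results on the evaluation set after the final (20th) training epoch:
- Loss: 1.6073
- Rouge1: 0.6587
- Rouge2: 0.4979
- RougeL: 0.6269
- RougeLsum: 0.6270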

## Intended uses & limitations

This model is designed for research purposes. See the disclaimer for more details.

## Training procedure

The overall idea of our approach is to disentangle each dream report from its annotation as a whole and to create an augmented set of (dream report; single 
feature annotation) pairs. To make sure that, given the same report, the model produces a specific HVDC feature, we simply prepend 
to each report a string of the form "HVDC-Feature:", in a manner that closely mimics T5 task-specific prefix fine-tuning. 
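
The augmentation described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the function name `build_prefixed_pairs`, the dictionary layout of the annotations, and the placeholder annotation strings are all assumptions. The `" : "` separator mirrors the format used in the Usage snippet of this card.

```python
# Hypothetical helper: the record layout (a report plus a dict that maps
# each HVDC feature name to its annotation) is assumed for illustration.
def build_prefixed_pairs(report, annotations, separator=" : "):
    """Turn one (report, full annotation) record into one pair per feature."""
    pairs = []
    for feature, annotation in annotations.items():
        # Prepend the feature name so the model knows which feature to produce.
        source = f"{feature}{separator}{report}"
        pairs.append((source, annotation))
    return pairs


example = build_prefixed_pairs(
    "I was talking with my brother about my birthday dinner. I was feeling sad.",
    # Placeholder targets; real HVDC annotations are not reproduced here.
    {"Characters": "<characters annotation>", "Emotion": "<emotion annotation>"},
)

for source, target in example:
    print(source, "->", target)
```

Each report with k annotated features yields k training pairs, which is why the augmented set is larger than the original set of reports.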

Applying this procedure to the original dataset (\~1.8K reports) yields approximately 6.6K items. In the present study, we focused on a subset of six HVDC features: 
Characters, Activities, Emotion, Friendliness, Misfortune, and Good Fortune. 
This selection excludes features that represent less than 10% of the total instances. Notably, Good Fortune would have been excluded under this 
criterion, but we intentionally retained it to control against potential 
memorisation effects and to provide a counterbalance to the Misfortune feature. After filtering out instances whose annotation 
feature is not one of the six selected features, we are left with \~5.3K 
dream reports. We then generate a random 80%-20% split into training (i.e., 4,311 reports) and testing (i.e., 1,078 reports) sets.
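
A rough sketch of the filtering and splitting step is shown below. It is an assumption-laden illustration: the `feature` key, the `filter_and_split` helper, the use of Python's `random` module, and the reuse of seed 42 for the split are not confirmed by this card. The synthetic data only reproduces the split sizes reported above (4,311 + 1,078 = 5,389 items).

```python
import random

SELECTED_FEATURES = {
    "Characters", "Activities", "Emotion",
    "Friendliness", "Misfortune", "Good Fortune",
}


def filter_and_split(items, train_fraction=0.8, seed=42):
    """Keep items whose feature is selected, then shuffle and split them."""
    kept = [item for item in items if item["feature"] in SELECTED_FEATURES]
    rng = random.Random(seed)  # seed choice is an assumption
    rng.shuffle(kept)
    cut = round(train_fraction * len(kept))
    return kept[:cut], kept[cut:]


# Synthetic stand-in data: 5,389 retained items plus a few excluded ones.
items = [{"feature": "Emotion", "report": f"report {i}"} for i in range(5389)]
items += [{"feature": "Other", "report": "dropped"} for _ in range(10)]

train, test = filter_and_split(items)
print(len(train), len(test))
```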

### Training

#### Hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.001
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 20
- label_smoothing_factor: 0.1
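
To make the `linear` scheduler concrete, the sketch below computes the learning rate at a given optimisation step. It assumes zero warmup steps, a single device, and no gradient accumulation, none of which is stated in this card; `linear_lr` is a hypothetical helper, not a library function. Under those assumptions, 4,311 training items with a batch size of 8 give 539 steps per epoch, which agrees with the Step column of the training results.

```python
import math


def linear_lr(step, total_steps, base_lr=1e-3, warmup_steps=0):
    """Linear warmup (if any) followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Assumes one device and no gradient accumulation.
steps_per_epoch = math.ceil(4311 / 8)
total_steps = steps_per_epoch * 20

for step in (0, total_steps // 2, total_steps):
    print(step, linear_lr(step, total_steps))
```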

### Training results

| Training Loss | Epoch | Step  | Validation Loss | Rouge1 | Rouge2 | RougeL | RougeLsum |
|:-------------:|:-----:|:-----:|:---------------:|:------:|:------:|:------:|:---------:|
| 1.9478        | 1.0   | 539   | 1.9524          | 0.3298 | 0.1797 | 0.3121 | 0.3113    |
| 1.9141        | 2.0   | 1078  | 1.9039          | 0.3665 | 0.1942 | 0.3495 | 0.3489    |
| 1.914         | 3.0   | 1617  | 1.8993          | 0.4076 | 0.2223 | 0.3873 | 0.3870    |
| 1.9264        | 4.0   | 2156  | 1.8725          | 0.3454 | 0.1843 | 0.3306 | 0.3302    |
| 1.9018        | 5.0   | 2695  | 1.8669          | 0.3494 | 0.1814 | 0.3345 | 0.3347    |
| 1.889         | 6.0   | 3234  | 1.8872          | 0.3387 | 0.1609 | 0.3211 | 0.3208    |
| 1.8511        | 7.0   | 3773  | 1.8412          | 0.4200 | 0.2403 | 0.4065 | 0.4065    |
| 1.8756        | 8.0   | 4312  | 1.8191          | 0.4735 | 0.2705 | 0.4467 | 0.4469    |
| 1.8483        | 9.0   | 4851  | 1.7966          | 0.4915 | 0.2996 | 0.4662 | 0.4665    |
| 1.8182        | 10.0  | 5390  | 1.7787          | 0.5071 | 0.3169 | 0.4857 | 0.4860    |
| 1.7715        | 11.0  | 5929  | 1.7709          | 0.5017 | 0.3182 | 0.4767 | 0.4767    |
| 1.7955        | 12.0  | 6468  | 1.7557          | 0.4772 | 0.3015 | 0.4544 | 0.4549    |
| 1.7391        | 13.0  | 7007  | 1.7279          | 0.5644 | 0.3693 | 0.5270 | 0.5281    |
| 1.7013        | 14.0  | 7546  | 1.7054          | 0.5484 | 0.3694 | 0.5222 | 0.5221    |
| 1.7364        | 15.0  | 8085  | 1.6900          | 0.5607 | 0.3778 | 0.5349 | 0.5350    |
| 1.6592        | 16.0  | 8624  | 1.6643          | 0.6010 | 0.4191 | 0.5691 | 0.5688    |
| 1.645         | 17.0  | 9163  | 1.6448          | 0.6160 | 0.4440 | 0.5854 | 0.5863    |
| 1.6245        | 18.0  | 9702  | 1.6264          | 0.6301 | 0.4640 | 0.6015 | 0.6018    |
| 1.616         | 19.0  | 10241 | 1.6145          | 0.6578 | 0.4933 | 0.6253 | 0.6251    |
| 1.5914        | 20.0  | 10780 | 1.6073          | 0.6587 | 0.4979 | 0.6269 | 0.6270    |
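
For readers unfamiliar with the ROUGE columns, the following simplified sketch shows what a ROUGE-1 F1 score measures: unigram overlap between a generated annotation and its reference. It is not the scorer used to produce the table; standard ROUGE implementations apply their own tokenisation and optional stemming, and the two strings below are made-up examples rather than model outputs.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 F1 on lowercased, whitespace-split tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped unigram overlap: each token counts at most as often as it
    # appears in both sequences.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the dreamer felt sad", "the dreamer felt sad and angry")
print(round(score, 4))
```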


### Framework versions

- Transformers 4.44.2
- PyTorch 2.1.0+cu118
- Datasets 3.0.1
- Tokenizers 0.19.1

# Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "jrc-ai/PreDA-large"
device = "cpu"
encoder_max_length = 100
decoder_max_length = 50

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Keep the model on the same device as the inputs and use inference mode.
model = AutoModelForSeq2SeqLM.from_pretrained(model_id).to(device)
model.eval()

dream = "I was talking with my brother about my birthday dinner. I was feeling sad."
# One input per HVDC feature: the prefix selects which feature is annotated.
prefixes = ["Emotion", "Activities", "Characters"]
text_inputs = ["{} : {}".format(p, dream) for p in prefixes]

# Tokenise the prefixed reports as a single padded batch.
inputs = tokenizer(
    text_inputs,
    max_length=encoder_max_length,
    truncation=True,
    padding=True,
    return_tensors="pt"
)

# Greedy decoding (no sampling), capped at decoder_max_length tokens.
output = model.generate(
    **inputs.to(device),
    do_sample=False,
    max_length=decoder_max_length,
)

# Print one decoded annotation per prefix, in the order of `prefixes`.
for decode_dream in output:
    print(tokenizer.decode(decode_dream, skip_special_tokens=True))
```

# Dual-Use Implication
Upon evaluation, we identified no dual-use implications for the present model. The model parameters, including the weights, are available under the CC0 1.0 Public Domain Dedication.

# Cite
Please note that the paper referring to this model, titled *PreDA: Prefix-Based Dream Reports Annotation with Generative Language Models*, has been accepted for publication at the LOD 2025 conference and will appear in the conference proceedings.