---
model_name: Led_sgd_summarizer_250_sp
base_model: allenai/led-large-16384
language:
- es
license: cc-by-3.0
tags:
- summarization
- abstractive-summarization
- long-text
- spanish
- transformers
- led
datasets:
- custom_parquet_dataset
metrics:
- loss
- gen_len
model-index:
- name: Led_sgd_summarizer_250_sp
  results:
  - task:
      type: summarization
      name: Abstractive Summarization
    dataset:
      name: Custom Spanish Summarization Dataset
      type: custom_parquet_dataset
      split: validation
    metrics:
    - name: Eval Loss
      type: loss
      value: 1.0227
    - name: Generated Length
      type: gen_len
      value: 60.66
pipeline_tag: summarization
---

# Model Card for Led_sgd_summarizer_250_sp

## Model Description

**Overview**

`Led_sgd_summarizer_250_sp` is a fine-tuned version of `allenai/led-large-16384` for abstractive summarization of long Spanish texts (up to 16,384 tokens). It generates concise summaries (~350 characters) for documents such as governmental petitions and legal correspondence. Built with the Hugging Face Transformers library, it leverages the Longformer Encoder-Decoder (LED) architecture to process extended sequences efficiently.

**Intended Use**

- **Primary Use**: Summarizing long Spanish texts, particularly in governmental or legal domains.
- **Users**: Researchers, developers, and organizations working with Spanish text summarization.
- **Out-of-Scope**: Real-time applications, non-Spanish languages, or very short texts.

**Ethical Considerations**

- **Bias**: The model may reflect biases present in the training dataset. Summaries should be reviewed for fairness.
- **Misinformation**: Summaries may omit key details. Verify outputs for critical applications.
- **Environmental Impact**: Training required significant computational resources, mitigated by FP16 and gradient checkpointing.

## Model Details

- **Model Name**: `Led_sgd_summarizer_250_sp`
- **Base Model**: `allenai/led-large-16384`
- **Architecture**: Sequence-to-Sequence (Seq2Seq) Transformer (Longformer Encoder-Decoder, LED)
- **Parameters**: ~460M
- **Tokenizer**: `AutoTokenizer` from `allenai/led-large-16384` (BART-based, fast)
- **Framework**: PyTorch
- **Hardware**: NVIDIA A100-SXM4-80GB GPU
- **Developed By**: excribe.co
- **Author**: excribe.co
- **Date**: 2025-04-26
- **License**: [CC-BY-3.0](https://creativecommons.org/licenses/by/3.0/)
- **Training Details** (a configuration sketch follows this list):
  - **Script**: Custom Python script using Hugging Face `transformers`, `datasets`, `evaluate`, and `accelerate`.
  - **Hyperparameters**:
    - Learning Rate: 5e-6
    - Batch Size: 1 (effective batch size 32 with gradient accumulation)
    - Gradient Accumulation Steps: 32
    - Epochs: 5
    - Optimizer: AdamW (weight decay: 0.01)
    - Precision: FP16
    - Gradient Clipping: max_norm=1.0
    - Early Stopping: Patience of 2, based on `eval_rougeL`
    - Generation Parameters:
      - Num Beams: 4
      - Max Length: 250
      - Min Length: 50
      - Length Penalty: 1.0
      - No Repeat N-gram Size: 3
  - **Optimization**:
    - Gradient checkpointing to reduce memory usage.
    - Model cache disabled during training to save VRAM.
  - **Training Metrics**:
    - Train Loss: 0.8424
    - Train Runtime: 14,992.38 seconds (~4 hours, 10 minutes)
    - Train Samples per Second: 2.477
    - Train Steps per Second: 0.077
    - Epochs Completed: 3.2197
    - Total FLOPs: 6.849736835009741e+16
    - Train Samples: 7,428
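The hyperparameters above correspond roughly to the following `Seq2SeqTrainingArguments` configuration. This is a minimal sketch, not the original training script: the output directory is a placeholder, and the per-epoch evaluation/save cadence is an assumption.

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of a training configuration matching the listed hyperparameters.
# output_dir and the evaluation/save cadence are assumptions.
training_args = Seq2SeqTrainingArguments(
    output_dir="led_sgd_summarizer_250_sp",  # placeholder path
    learning_rate=5e-6,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=32,          # effective batch size 32
    num_train_epochs=5,
    weight_decay=0.01,                       # AdamW weight decay
    max_grad_norm=1.0,                       # gradient clipping
    fp16=True,                               # mixed-precision training
    gradient_checkpointing=True,             # trade compute for VRAM
    predict_with_generate=True,
    generation_max_length=250,
    generation_num_beams=4,
    eval_strategy="epoch",                   # `evaluation_strategy` in older releases
    save_strategy="epoch",
    load_best_model_at_end=True,             # required for early stopping
    metric_for_best_model="eval_rougeL",
)
```

Early stopping with patience 2 would then be attached via `transformers.EarlyStoppingCallback(early_stopping_patience=2)` when constructing the `Seq2SeqTrainer`.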
## Training Data

- **Name**: Custom Spanish Summarization Dataset
- **Source**: Proprietary `.parquet` file
- **Size**: 8,254 records
- **Columns**:
  - `texto_entrada`: Input text (long Spanish texts, e.g., petitions, official correspondence)
  - `asunto`: Target summary (~350 characters)
- **Preprocessing**:
  - Removed HTML tags using `BeautifulSoup`.
  - Normalized text (extra whitespace removed) using regex.
  - Filtered invalid records (empty texts, summaries shorter than 5 characters, texts shorter than 10 characters).
- **Split**:
  - Train: 7,428 records (90%)
  - Validation: 826 records (10%)
- **Language**: Spanish
- **Access**: Contact [excribe.co](mailto:info@excribe.co) for inquiries.

## Model Usage

### Installation

Install the required dependencies:

```bash
pip install transformers datasets evaluate rouge_score nltk torch accelerate sentencepiece beautifulsoup4
```

### Using the Pipeline

For easy inference:

```python
import torch
from transformers import pipeline

summarizer = pipeline(
    "summarization",
    model="excribe-co/Led_sgd_summarizer_250_sp",
    tokenizer="excribe-co/Led_sgd_summarizer_250_sp",
    device=0 if torch.cuda.is_available() else -1,
)

text = "Radicador de correo electronico Orfeo *20242200099852* Rad. No. 20242200099852 ..."

summary = summarizer(
    text,
    max_length=250,
    min_length=50,
    num_beams=4,
    length_penalty=1.0,
    no_repeat_ngram_size=3,
    truncation=True,
)
print("Generated Summary:", summary[0]["summary_text"])
```

### Manual Inference

For more control:

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("excribe-co/Led_sgd_summarizer_250_sp")
tokenizer = AutoTokenizer.from_pretrained("excribe-co/Led_sgd_summarizer_250_sp")
model.eval()

text = "Your long text here..."
inputs = tokenizer(
    "summarize: " + text,
    max_length=16384,
    truncation=True,
    return_tensors="pt",
    padding=True,
)

# LED requires global attention on at least the first token.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}
global_attention_mask = global_attention_mask.to(device)

outputs = model.generate(
    **inputs,
    global_attention_mask=global_attention_mask,
    max_length=250,
    min_length=50,
    num_beams=4,
    length_penalty=1.0,
    no_repeat_ngram_size=3,
)
summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Generated Summary:", summary)
```

### Evaluation Results

- **Metrics**:
  - Eval Loss: 1.0227
  - Generated Length (gen_len): 60.66
  - Eval Runtime: 1,807.29 seconds
  - Eval Samples per Second: 0.457
  - Eval Steps per Second: 0.229
  - Eval Samples: 826
- **Note**: ROUGE metrics were computed but reported as zero, likely due to an issue in the metric computation or data preprocessing. Generated summaries are coherent, but the ROUGE scoring requires further investigation.

## Additional Information

### Limitations

- **Computational Requirements**: Requires ~48 GB of VRAM for training and ~16 GB for FP16 inference (see the FP16 loading sketch after this list).
- **Language**: Optimized for Spanish; untested on other languages.
- **Domain Specificity**: Best suited to governmental and legal documents; may underperform on out-of-domain texts.
- **Inference Speed**: Slow for very long texts due to the model size and sequence length.
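To stay within the ~16 GB FP16 inference budget noted above, the model can be loaded directly in half precision. A minimal sketch using the standard `torch_dtype` argument (assumes a CUDA device is available):

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Loading the weights in FP16 roughly halves inference VRAM versus FP32.
model = AutoModelForSeq2SeqLM.from_pretrained(
    "excribe-co/Led_sgd_summarizer_250_sp",
    torch_dtype=torch.float16,
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("excribe-co/Led_sgd_summarizer_250_sp")
model.eval()
```

Generation then proceeds exactly as in the manual inference example above.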
### Citation

```bibtex
@misc{excribe2025ledsgdsummarizer,
  author    = {excribe.co},
  title     = {Led_sgd_summarizer_250_sp: Fine-Tuned LED for Long-Text Summarization in Spanish},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/excribe-co/Led_sgd_summarizer_250_sp}
}
```

### Contact

For questions, contact [excribe.co](mailto:info@excribe.co) or open an issue at [https://huggingface.co/excribe-co/Led_sgd_summarizer_250_sp](https://huggingface.co/excribe-co/Led_sgd_summarizer_250_sp).

### Acknowledgments

- Built upon `allenai/led-large-16384` from Hugging Face.
- Thanks to Hugging Face for the `transformers`, `datasets`, and `evaluate` libraries.
- Training supported by excribe.co.