---
model_name: Led_sgd_summarizer_250_sp
base_model: allenai/led-large-16384
language:
- es
license: cc-by-3.0
tags:
- summarization
- abstractive-summarization
- long-text
- spanish
- transformers
- led
datasets:
- custom_parquet_dataset
metrics:
- loss
- gen_len
model-index:
- name: Led_sgd_summarizer_250_sp
  results:
  - task:
      type: summarization
      name: Abstractive Summarization
    dataset:
      name: Custom Spanish Summarization Dataset
      type: custom_parquet_dataset
      split: validation
    metrics:
    - name: Eval Loss
      type: loss
      value: 1.0227
    - name: Generated Length
      type: gen_len
      value: 60.66
pipeline_tag: summarization
---

# Model Card for Led_sgd_summarizer_250_sp

## Model Description

**Overview**

`Led_sgd_summarizer_250_sp` is a fine-tuned version of `allenai/led-large-16384` for abstractive summarization of long Spanish texts (up to 16,384 tokens). It generates concise summaries (~350 characters) for documents such as governmental petitions and legal correspondence. Built with the Hugging Face Transformers library, it leverages the Longformer Encoder-Decoder (LED) architecture to process extended sequences efficiently.

**Intended Use**

- **Primary Use**: Summarizing long Spanish texts, particularly in governmental or legal domains.
- **Users**: Researchers, developers, and organizations working with Spanish text summarization.
- **Out-of-Scope**: Real-time applications, non-Spanish languages, or very short texts.

**Ethical Considerations**

- **Bias**: The model may reflect biases present in the training dataset. Summaries should be reviewed for fairness.
- **Misinformation**: Summaries may omit key details. Verify outputs for critical applications.
- **Environmental Impact**: Training required significant computational resources, mitigated by FP16 and gradient checkpointing.

## Model Details

- **Model Name**: `Led_sgd_summarizer_250_sp`
- **Base Model**: `allenai/led-large-16384`
- **Architecture**: Sequence-to-Sequence (Seq2Seq) Transformer (Longformer Encoder-Decoder, LED)
- **Parameters**: ~460M
- **Tokenizer**: `AutoTokenizer` from `allenai/led-large-16384` (BART-based, fast)
- **Framework**: PyTorch
- **Hardware**: NVIDIA A100-SXM4-80GB GPU
- **Developed By**: excribe.co
- **Author**: excribe.co
- **Date**: 2025-04-26
- **License**: [CC-BY-3.0](https://creativecommons.org/licenses/by/3.0/)
- **Training Details** (a configuration sketch follows this list):
  - **Script**: Custom Python script using Hugging Face `transformers`, `datasets`, `evaluate`, and `accelerate`.
  - **Hyperparameters**:
    - Learning Rate: 5e-6
    - Batch Size: 1 (effective batch size 32 with gradient accumulation)
    - Gradient Accumulation Steps: 32
    - Epochs: 5
    - Optimizer: AdamW (weight decay: 0.01)
    - Precision: FP16
    - Gradient Clipping: max_norm=1.0
    - Early Stopping: Patience of 2, based on `eval_rougeL`
    - Generation Parameters:
      - Num Beams: 4
      - Max Length: 250
      - Min Length: 50
      - Length Penalty: 1.0
      - No Repeat N-gram Size: 3
  - **Optimization**:
    - Gradient checkpointing to reduce memory usage.
    - Model cache disabled during training to save VRAM.
  - **Training Metrics**:
    - Train Loss: 0.8424
    - Train Runtime: 14,992.38 seconds (~4 hours, 10 minutes)
    - Train Samples per Second: 2.477
    - Train Steps per Second: 0.077
    - Epochs Completed: 3.2197
    - Total FLOPs: 6.849736835009741e+16
    - Train Samples: 7,428
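The hyperparameters above correspond roughly to the following `Seq2SeqTrainingArguments` configuration. This is a minimal sketch, not the original training script: the output directory is a placeholder, and the per-epoch evaluation/save cadence is an assumption.

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of a training configuration matching the listed hyperparameters.
# output_dir and the evaluation/save cadence are assumptions.
training_args = Seq2SeqTrainingArguments(
    output_dir="led_sgd_summarizer_250_sp",  # placeholder path
    learning_rate=5e-6,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=32,          # effective batch size 32
    num_train_epochs=5,
    weight_decay=0.01,                       # AdamW weight decay
    max_grad_norm=1.0,                       # gradient clipping
    fp16=True,                               # mixed-precision training
    gradient_checkpointing=True,             # trade compute for VRAM
    predict_with_generate=True,
    generation_max_length=250,
    generation_num_beams=4,
    eval_strategy="epoch",                   # `evaluation_strategy` in older releases
    save_strategy="epoch",
    load_best_model_at_end=True,             # required for early stopping
    metric_for_best_model="eval_rougeL",
)
```

Early stopping with patience 2 would then be attached via `transformers.EarlyStoppingCallback(early_stopping_patience=2)` when constructing the `Seq2SeqTrainer`.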
## Training Data

- **Name**: Custom Spanish Summarization Dataset
- **Source**: Proprietary `.parquet` file
- **Size**: 8,254 records
- **Columns**:
  - `texto_entrada`: Input text (long Spanish texts, e.g., petitions, official correspondence)
  - `asunto`: Target summary (~350 characters)
- **Preprocessing**:
  - Removed HTML tags using `BeautifulSoup`.
  - Normalized text (extra whitespace removed) using regex.
  - Filtered invalid records (empty texts, summaries shorter than 5 characters, texts shorter than 10 characters).
- **Split**:
  - Train: 7,428 records (90%)
  - Validation: 826 records (10%)
- **Language**: Spanish
- **Access**: Contact [excribe.co](mailto:info@excribe.co) for inquiries.

## Model Usage

### Installation

Install the required dependencies:

```bash
pip install transformers datasets evaluate rouge_score nltk torch accelerate sentencepiece beautifulsoup4
```

### Using the Pipeline

For easy inference:

```python
import torch
from transformers import pipeline

summarizer = pipeline(
    "summarization",
    model="excribe-co/Led_sgd_summarizer_250_sp",
    tokenizer="excribe-co/Led_sgd_summarizer_250_sp",
    device=0 if torch.cuda.is_available() else -1,
)

text = "Radicador de correo electronico Orfeo *20242200099852* Rad. No. 20242200099852 ..."

summary = summarizer(
    text,
    max_length=250,
    min_length=50,
    num_beams=4,
    length_penalty=1.0,
    no_repeat_ngram_size=3,
    truncation=True,
)
print("Generated Summary:", summary[0]["summary_text"])
```

### Manual Inference

For more control:

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("excribe-co/Led_sgd_summarizer_250_sp")
tokenizer = AutoTokenizer.from_pretrained("excribe-co/Led_sgd_summarizer_250_sp")
model.eval()

text = "Your long text here..."
inputs = tokenizer(
    "summarize: " + text,
    max_length=16384,
    truncation=True,
    return_tensors="pt",
    padding=True,
)

# LED requires global attention on at least the first token.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}
global_attention_mask = global_attention_mask.to(device)

outputs = model.generate(
    **inputs,
    global_attention_mask=global_attention_mask,
    max_length=250,
    min_length=50,
    num_beams=4,
    length_penalty=1.0,
    no_repeat_ngram_size=3,
)
summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Generated Summary:", summary)
```

### Evaluation Results

- **Metrics**:
  - Eval Loss: 1.0227
  - Generated Length (gen_len): 60.66
  - Eval Runtime: 1,807.29 seconds
  - Eval Samples per Second: 0.457
  - Eval Steps per Second: 0.229
  - Eval Samples: 826
- **Note**: ROUGE metrics were computed but reported as zero, likely due to an issue in the metric computation or data preprocessing. Generated summaries are coherent, but the ROUGE scoring requires further investigation.

## Additional Information

### Limitations

- **Computational Requirements**: Requires ~48 GB of VRAM for training and ~16 GB for FP16 inference (see the FP16 loading sketch after this list).
- **Language**: Optimized for Spanish; untested on other languages.
- **Domain Specificity**: Best suited to governmental and legal documents; may underperform on out-of-domain texts.
- **Inference Speed**: Slow for very long texts due to the model size and sequence length.
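To stay within the ~16 GB FP16 inference budget noted above, the model can be loaded directly in half precision. A minimal sketch using the standard `torch_dtype` argument (assumes a CUDA device is available):

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Loading the weights in FP16 roughly halves inference VRAM versus FP32.
model = AutoModelForSeq2SeqLM.from_pretrained(
    "excribe-co/Led_sgd_summarizer_250_sp",
    torch_dtype=torch.float16,
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("excribe-co/Led_sgd_summarizer_250_sp")
model.eval()
```

Generation then proceeds exactly as in the manual inference example above.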
### Citation

```bibtex
@misc{excribe2025ledsgdsummarizer,
  author    = {excribe.co},
  title     = {Led_sgd_summarizer_250_sp: Fine-Tuned LED for Long-Text Summarization in Spanish},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/excribe-co/Led_sgd_summarizer_250_sp}
}
```

### Contact

For questions, contact [excribe.co](mailto:info@excribe.co) or open an issue at [https://huggingface.co/excribe-co/Led_sgd_summarizer_250_sp](https://huggingface.co/excribe-co/Led_sgd_summarizer_250_sp).

### Acknowledgments

- Built upon `allenai/led-large-16384` from Hugging Face.
- Thanks to Hugging Face for the `transformers`, `datasets`, and `evaluate` libraries.
- Training supported by excribe.co.