Upload model
Browse files
- .gitattributes +1 -0
- README.md +204 -0
- arxiv_logo (1).svg +1 -0
- config.json +124 -0
- generation_config.json +8 -0
- model.safetensors +3 -0
- modeling_gemma3_punctuation.py +273 -0
- special_tokens_map.json +33 -0
- tokenizer.json +3 -0
- tokenizer_config.json +0 -0
.gitattributes
CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md
ADDED
@@ -0,0 +1,204 @@
---
language:
- en
- as
- bn
- brx
- doi
- gu
- hi
- kn
- ks
- kok
- mai
- ml
- mni
- mr
- ne
- or
- pa
- sa
- sat
- sd
- ta
- te
- ur
license: mit
tags:
- punctuation-restoration
- multilingual
- indic-languages
- ai4bharat
datasets:
- ai4bharat/sangraha
- HuggingFaceFW/fineweb-2
- ai4bharat/indicvoices_r
- ai4bharat/IndicCorpV2
metrics:
- f1
pipeline_tag: token-classification
library_name: cadence-punctuation
base_model:
- google/gemma-3-270m
widget:
- text: hello world how are you today
  example_title: English Punctuation
- text: यह एक हिंदी वाक्य है
  example_title: Hindi Punctuation
- text: cadence is a great model for punctuation
  example_title: Another English Example
---

# Cadence-Fast

This is a multilingual punctuation restoration model based on Gemma-3-270M, fine-tuned for punctuation prediction in English and 22 Indic languages.

<a href="https://arxiv.org/abs/2506.03793v1" target="_blank" rel="noopener noreferrer" style="text-decoration: none; color: inherit;">
  <span style="display: inline-flex; align-items: center; gap: 0.3em;">
    <img src="https://huggingface.co/ai4bharat/Cadence/resolve/main/arxiv_logo.svg" alt="arXiv" style="height: 1em;">
    <span>Mark My Words: A Robust Multilingual Model for Punctuation in Text and Speech Transcripts</span>
  </span>
</a>

## Model Description

- **Model Type**: Token Classification (Punctuation Prediction)
- **Base Model**: Gemma-3-270M
- **Languages**: English + 22 Indic Languages
- **Task**: Automatic punctuation restoration

## Installation (Optional)

The `cadence-punctuation` Python package adds sliding-window decoding, rule-based capitalisation of English text, and rule-based corrections for some common model errors.

```bash
pip install cadence-punctuation
```

## Usage

```python
from Cadence import PunctuationModel

# Load the model (weights are downloaded to model_path, or to the default Hugging Face cache if it is omitted)
model = PunctuationModel(model="Cadence-Fast", model_path="path/to/model")

# Punctuate a single text
text = "hello world how are you today"
result = model.punctuate([text])
print(result[0])  # "Hello world, how are you today?"

# Punctuate multiple texts
texts = [
    "hello world how are you",
    "this is another test sentence",
    "यह एक हिंदी वाक्य है"  # Hindi example
]
results = model.punctuate(texts, batch_size=8)
for original, punctuated in zip(texts, results):
    print(f"Original: {original}")
    print(f"Punctuated: {punctuated}")
    print()
```

### Using AutoModel

```python
from transformers import AutoTokenizer, AutoModel
import torch

# Load model and tokenizer
model_name = "ai4bharat/Cadence-Fast"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
id2label = model.config.id2label

text = "यह एक वाक्य है इसका क्या मतलब है"
# text = "this is a test sentence what do you think"

# Tokenize input and prepare for model
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
input_ids = inputs['input_ids'][0]  # Get input_ids for the first (and only) sentence

with torch.no_grad():
    outputs = model(**inputs)

predictions_for_sentence = torch.argmax(outputs.logits, dim=-1)[0]

result_tokens_and_punctuation = []
all_token_strings = tokenizer.convert_ids_to_tokens(input_ids.tolist())  # Get all token strings

for i, token_id_value in enumerate(input_ids.tolist()):
    # Process only non-padding tokens based on the attention mask
    if inputs['attention_mask'][0][i] == 0:
        continue

    current_token_string = all_token_strings[i]
    is_special_token = token_id_value in tokenizer.all_special_ids

    if not is_special_token:
        result_tokens_and_punctuation.append(current_token_string)

    predicted_punctuation_id = predictions_for_sentence[i].item()
    punctuation_character = id2label[predicted_punctuation_id]
    if punctuation_character != "O" and not is_special_token:
        result_tokens_and_punctuation.append(punctuation_character)

punctuated_text = tokenizer.convert_tokens_to_string(result_tokens_and_punctuation)
print(f"Original Text: {text}")
print(f"Punctuated Text: {punctuated_text}")
```
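The loop above decodes one sentence at a time. The sketch below is not part of the original card; it reuses the `tokenizer`, `model`, and `id2label` objects defined in the block above and applies the same decoding to a padded batch, skipping padding and special tokens per sentence.

```python
import torch

# A sketch of the same decoding loop for a batch of sentences (reuses tokenizer, model, id2label from above).
sentences = [
    "this is a test sentence what do you think",
    "यह एक वाक्य है इसका क्या मतलब है",
]
batch = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)

with torch.no_grad():
    batch_logits = model(**batch).logits
batch_predictions = torch.argmax(batch_logits, dim=-1)

for b, sentence in enumerate(sentences):
    pieces = []
    for i, token_id in enumerate(batch["input_ids"][b].tolist()):
        # Skip padding positions and special tokens
        if batch["attention_mask"][b][i] == 0 or token_id in tokenizer.all_special_ids:
            continue
        pieces.append(tokenizer.convert_ids_to_tokens(token_id))
        label = id2label[batch_predictions[b][i].item()]
        if label != "O":
            pieces.append(label)
    print(f"Original:   {sentence}")
    print(f"Punctuated: {tokenizer.convert_tokens_to_string(pieces)}")
```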

## Officially Supported Languages

- English, Assamese, Bengali, Bodo, Dogri, Gujarati, Hindi, Kannada, Kashmiri, Konkani, Maithili, Malayalam, Manipuri, Marathi, Nepali, Odia, Punjabi, Sanskrit, Santali, Sindhi, Tamil, Telugu, Urdu

The tokenizer does not handle Manipuri's Meitei Mayek script well; the model can still punctuate Manipuri text if it is transliterated into the Bengali script (see the tokenizer check sketched below).

You can also try this model on languages not listed above; performance may vary.
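To see why transliteration helps, the short sketch below (not from the original card; the sample phrases are only illustrative and assumed to be rough renderings of the same Manipuri words) tokenizes Meetei Mayek text and Bengali-script text and compares how finely the tokenizer splits them. A script with poor coverage tends to fall apart into many byte-level pieces.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ai4bharat/Cadence-Fast")

# Illustrative samples only: the same Manipuri phrase written in Meetei Mayek and in Bengali script.
samples = {
    "Meetei Mayek": "ꯃꯤꯇꯩ ꯂꯣꯟ",
    "Bengali script": "মৈতৈ লোন",
}

for script, text in samples.items():
    tokens = tokenizer.tokenize(text)
    # A long run of byte-fallback pieces signals poor tokenizer coverage for that script.
    print(f"{script}: {len(tokens)} tokens -> {tokens}")
```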

## Supported Punctuation

The model can predict the following punctuation marks (the full label inventory can also be read from the model config, as sketched after this list):
- Period (.)
- Comma (,)
- Question mark (?)
- Exclamation mark (!)
- Semicolon (;)
- Colon (:)
- Hyphen (-)
- Quotes (" and ')
- Ellipsis (...)
- Parentheses ( and )
- Hindi Danda (।)
- Urdu punctuation (۔، ؟)
- Arabic punctuation (٬ ،)
- Santali punctuation (᱾ ᱾।)
- Sanskrit punctuation (॥)
- And various combinations of the above
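The full inventory, including the combined marks, is stored in the model's `config.json` as `id2label`. A minimal sketch to list every label the classification head can emit, reusing the `model` object loaded in the AutoModel example above ("O" means no punctuation after the token):

```python
# Reuses the `model` loaded via AutoModel in the usage example above.
for label_id in sorted(model.config.id2label, key=int):
    print(f"{label_id}: {model.config.id2label[label_id]!r}")
```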
## Configuration Options for cadence-punctuation

### PunctuationModel Parameters

All parameters are optional.
- `model`: Choose between "Cadence" (based on Gemma-3-1B) and "Cadence-Fast" (based on Gemma-3-270M). (default: "Cadence")
- `model_path`: Path to a local directory where model weights will be downloaded and cached, or from which pre-downloaded weights will be loaded. If None, weights are downloaded to the default Hugging Face cache location.
- `gpu_id`: Specific GPU device ID to use (e.g., 0, 1). If None, the model attempts to auto-detect and use an available GPU. Ignored if `cpu` is True. (default: None)
- `cpu`: If True, forces the model to run on the CPU even if a GPU is available. (default: False)
- `max_length`: Maximum sequence length the model processes at once. If `sliding_window` is True, this value is used as the width of each sliding window; if False, texts longer than `max_length` are truncated. (default: 300)
- `attn_implementation`: The attention implementation to use. (default: "eager")
- `sliding_window`: If True, enables the sliding-window mechanism to process texts longer than `max_length` by splitting them into overlapping chunks of `max_length`. If False, longer texts are truncated. (default: True)
- `verbose`: Enable verbose logging. (default: False)
- `d_type`: Precision with which weights are loaded. (default: "bfloat16")
- `batch_size` (for the `punctuate()` method): Batch size to use. (default: 8)

```python
# Custom configuration
model = PunctuationModel(
    model="Cadence",
    model_path="path/to/download/weights",
    gpu_id=0,                  # Use a specific GPU
    max_length=512,            # length for truncation; also used as window size when sliding_window=True
    attn_implementation="flash_attention_2",
    sliding_window=True,       # Handle long texts
    verbose=False,             # Quiet mode
    d_type="bfloat16"
)
batch_size = 32

# Process long texts with the sliding window
long_text = "Your very long text here..." * 100
short_text = "a short text"
result = model.punctuate([long_text, short_text], batch_size=batch_size)
```

## License

MIT License
arxiv_logo (1).svg
ADDED
config.json
ADDED
@@ -0,0 +1,124 @@
{
  "_sliding_window_pattern": 6,
  "architectures": [
    "Gemma3ForTokenClassification"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "attn_logit_softcapping": null,
  "bos_token_id": 2,
  "cache_implementation": "hybrid",
  "classifier_dropout_prob": 0.0,
  "dtype": "float32",
  "eos_token_id": 1,
  "final_logit_softcapping": null,
  "head_dim": 256,
  "hidden_activation": "gelu_pytorch_tanh",
  "hidden_size": 640,
  "id2label": {
    "0": "O",
    "1": ".",
    "10": "\"",
    "11": "\u0964",
    "12": "(",
    "13": ")",
    "14": ":",
    "15": "\u066c",
    "16": "\u06d4",
    "17": "\u061f",
    "18": ".\"",
    "19": ").",
    "2": ",",
    "20": "),",
    "21": "\",",
    "22": "\".",
    "23": "?\"",
    "24": "\"?",
    "25": "\u0964\"",
    "26": "\"\u0964",
    "27": "\u060c",
    "28": "\u1c7e",
    "29": "\u0965",
    "3": "?",
    "30": "\u1c7e\u0964",
    "4": "-",
    "5": ";",
    "6": "_",
    "7": "!",
    "8": "'",
    "9": "..."
  },
  "initializer_range": 0.02,
  "intermediate_size": 2048,
  "label2id": {
    "!": 7,
    "\"": 10,
    "\",": 21,
    "\".": 22,
    "\"?": 24,
    "\"\u0964": 26,
    "'": 8,
    "(": 12,
    ")": 13,
    "),": 20,
    ").": 19,
    ",": 2,
    "-": 4,
    ".": 1,
    ".\"": 18,
    "...": 9,
    ":": 14,
    ";": 5,
    "?": 3,
    "?\"": 23,
    "O": 0,
    "_": 6,
    "\u060c": 27,
    "\u061f": 17,
    "\u066c": 15,
    "\u06d4": 16,
    "\u0964": 11,
    "\u0964\"": 25,
    "\u0965": 29,
    "\u1c7e": 28,
    "\u1c7e\u0964": 30
  },
  "layer_types": [
    "sliding_attention",
    "sliding_attention",
    "sliding_attention",
    "sliding_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "sliding_attention",
    "sliding_attention",
    "sliding_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "sliding_attention",
    "sliding_attention",
    "sliding_attention",
    "sliding_attention",
    "full_attention"
  ],
  "max_position_embeddings": 32768,
  "model_type": "cadence_punctuation",
  "num_attention_heads": 4,
  "num_hidden_layers": 18,
  "num_key_value_heads": 1,
  "pad_token_id": 0,
  "query_pre_attn_scalar": 256,
  "rms_norm_eps": 1e-06,
  "rope_local_base_freq": 10000.0,
  "rope_scaling": null,
  "rope_theta": 1000000.0,
  "sliding_window": 512,
  "sliding_window_pattern": 6,
  "transformers_version": "4.57.1",
  "use_bidirectional_attention": false,
  "use_cache": false,
  "use_non_causal_attention": true,
  "vocab_size": 262144
}
generation_config.json
ADDED
@@ -0,0 +1,8 @@
{
  "_from_model_config": true,
  "bos_token_id": 2,
  "cache_implementation": "hybrid",
  "eos_token_id": 1,
  "pad_token_id": 0,
  "transformers_version": "4.57.1"
}
model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b60302bc4aeb9ca487e3733819aabf9bcf970792d2d8f10c8ee0a5f144d41552
size 1072498892
modeling_gemma3_punctuation.py
ADDED
@@ -0,0 +1,273 @@
"""
Change the attention of Gemma3 to be bidirectional.
"""

import torch
import torch.nn as nn
from typing import Optional, List, Dict, Any
from functools import partial

from transformers import PretrainedConfig, PreTrainedModel
from transformers import Gemma3ForCausalLM, Gemma3TextConfig
from transformers.models.gemma3.modeling_gemma3 import (
    Gemma3Attention,
    Gemma3DecoderLayer,
    Gemma3TextModel,
)

from transformers.modeling_outputs import TokenClassifierOutput
from transformers.utils import logging

logger = logging.get_logger(__name__)


class Gemma3PunctuationConfig(Gemma3TextConfig):
    """
    Configuration class for Gemma3 punctuation model.
    """
    model_type = "cadence_punctuation"

    def __init__(
        self,
        num_labels: int = 31,
        classifier_dropout_prob: float = 0.0,
        use_non_causal_attention: bool = True,
        **kwargs
    ):
        super().__init__(**kwargs)
        self.classifier_dropout_prob = classifier_dropout_prob
        self.use_non_causal_attention = use_non_causal_attention
        self.num_labels = num_labels


# ============ Token Classification Model Components ============

class NonCausalGemma3Attention(Gemma3Attention):
    """Gemma3Attention configured for non-causal token classification."""

    def __init__(self, config, layer_idx: int):
        super().__init__(config, layer_idx)
        self.is_causal = False
        self.sliding_window = None


class NonCausalGemma3DecoderLayer(Gemma3DecoderLayer):
    """Decoder layer with non-causal attention for token classification."""

    def __init__(self, config, layer_idx: int):
        super().__init__(config, layer_idx)
        self.self_attn = NonCausalGemma3Attention(config, layer_idx)


class Gemma3TokenClassificationModel(Gemma3TextModel):
    """Gemma3 base model configured for token classification."""
    _no_split_modules = ["NonCausalGemma3DecoderLayer"]

    def __init__(self, config):
        super().__init__(config)
        if getattr(config, 'use_non_causal_attention', True):
            # Replace layers with non-causal versions
            self.layers = nn.ModuleList(
                [
                    NonCausalGemma3DecoderLayer(config, layer_idx)
                    for layer_idx in range(config.num_hidden_layers)
                ]
            )

    def _update_causal_mask(
        self,
        attention_mask: torch.Tensor,
        input_tensor: torch.Tensor,
        cache_position: torch.Tensor,
        past_key_values=None,
        output_attentions: bool = False,
    ):
        """Override to create bidirectional attention mask (no causal masking)."""
        if self.config._attn_implementation == "flash_attention_2":
            if attention_mask is not None and 0.0 in attention_mask:
                return attention_mask
            return None

        past_seen_tokens = (
            past_key_values.get_seq_length() if past_key_values is not None else 0
        )
        using_static_cache = isinstance(past_key_values, type(None)) is False and hasattr(past_key_values, 'get_max_length')

        dtype, device = input_tensor.dtype, input_tensor.device
        min_dtype = torch.finfo(dtype).min
        sequence_length = input_tensor.shape[1]

        if using_static_cache:
            target_length = past_key_values.get_max_length()
        else:
            target_length = (
                attention_mask.shape[-1]
                if isinstance(attention_mask, torch.Tensor)
                else past_seen_tokens + sequence_length + 1
            )

        if attention_mask is not None and attention_mask.dim() == 4:
            # in this case we assume that the mask comes already in inverted form and requires no inversion or slicing
            if attention_mask.max() != 0:
                raise ValueError(
                    "Custom 4D attention mask should be passed in inverted form with max==0`"
                )
            causal_mask = attention_mask
        else:
            # KEY CHANGE: Start with zeros (attend to all) instead of min_dtype (mask all)
            causal_mask = torch.zeros(
                (sequence_length, target_length), dtype=dtype, device=device
            )
            # REMOVED: Causal masking lines that would make it lower triangular
            # if sequence_length != 1:
            #     causal_mask = torch.triu(causal_mask, diagonal=1)

            causal_mask *= torch.arange(
                target_length, device=device
            ) > cache_position.reshape(-1, 1)
            causal_mask = causal_mask[None, None, :, :].expand(
                input_tensor.shape[0], 1, -1, -1
            )

            if attention_mask is not None:
                causal_mask = causal_mask.clone()  # copy to contiguous memory for in-place edit
                mask_length = attention_mask.shape[-1]
                padding_mask = (
                    causal_mask[:, :, :, :mask_length]
                    + attention_mask[:, None, None, :]
                )
                padding_mask = padding_mask == 0
                causal_mask[:, :, :, :mask_length] = causal_mask[
                    :, :, :, :mask_length
                ].masked_fill(padding_mask, min_dtype)

        # Handle SDPA-specific optimizations if needed
        if (
            self.config._attn_implementation == "sdpa"
            and attention_mask is not None
            and attention_mask.device.type == "cuda"
            and not output_attentions
        ):
            try:
                from transformers.modeling_attn_mask_utils import AttentionMaskConverter
                causal_mask = AttentionMaskConverter._unmask_unattended(
                    causal_mask, min_dtype
                )
            except ImportError:
                pass  # Fallback for older transformers versions

        return causal_mask


class Gemma3ForTokenClassification(Gemma3ForCausalLM):
    """
    Gemma3 model for token classification (punctuation prediction).
    Uses class-based architecture without monkey patching.
    """

    config_class = Gemma3PunctuationConfig

    def __init__(self, config):
        # Initialize with base Gemma3ForCausalLM structure
        super().__init__(config)
        self.num_labels = config.num_labels

        # Replace the base model with token classification version
        if getattr(config, 'use_non_causal_attention', True):
            self.model = Gemma3TokenClassificationModel(config)

        # Replace the lm_head with classification head
        classifier_dropout_prob = getattr(config, 'classifier_dropout_prob', 0.0)
        self.lm_head = nn.Sequential(
            nn.Dropout(classifier_dropout_prob),
            nn.Linear(config.hidden_size, config.num_labels)
        )

        # Update config for classification
        self.config.num_labels = config.num_labels

        # Initialize weights for the new head
        self.post_init()

    def forward(
        self,
        input_ids: Optional[torch.LongTensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_values: Optional[List[torch.FloatTensor]] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        labels: Optional[torch.LongTensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
        cache_position: Optional[torch.LongTensor] = None,
    ) -> TokenClassifierOutput:
        """Forward pass for token classification."""
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        # Get hidden states from the model
        outputs = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
            cache_position=cache_position,
        )

        # Get the hidden states from the model output
        sequence_output = outputs[0]

        # Apply the classification head (which is now self.lm_head)
        logits = self.lm_head(sequence_output)

        loss = None
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss(ignore_index=-100)
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))

        if not return_dict:
            output = (logits,) + outputs[2:]
            return ((loss,) + output) if loss is not None else output

        return TokenClassifierOutput(
            loss=loss,
            logits=logits,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )


# ============ Model Registration ============

from transformers import AutoConfig, AutoModel

# Register the punctuation config and model
AutoConfig.register("cadence_punctuation", Gemma3PunctuationConfig)
AutoModel.register(Gemma3PunctuationConfig, Gemma3ForTokenClassification)


# ============ Utility Functions ============


def create_token_classification_model(config: Gemma3PunctuationConfig):
    """Create a token classification model with non-causal attention."""
    return Gemma3ForTokenClassification(config)


def load_from_pretrained_with_config_detection(model_path: str, **kwargs):
    """
    Load model and auto-detect whether it's for token classification or bidirectional tasks
    based on the config.
    """
    from transformers import AutoConfig

    config = AutoConfig.from_pretrained(model_path)

    if hasattr(config, 'model_type') and config.model_type == "cadence_punctuation":
        # Token classification model
        return Gemma3ForTokenClassification.from_pretrained(model_path, config=config, **kwargs)
special_tokens_map.json
ADDED
@@ -0,0 +1,33 @@
{
  "boi_token": "<start_of_image>",
  "bos_token": {
    "content": "<bos>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eoi_token": "<end_of_image>",
  "eos_token": {
    "content": "<eos>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "image_token": "<image_soft_token>",
  "pad_token": {
    "content": "<pad>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<unk>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer.json
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:4667f2089529e8e7657cfb6d1c19910ae71ff5f28aa7ab2ff2763330affad795
size 33384568
tokenizer_config.json
ADDED
The diff for this file is too large to render. See raw diff.