
Model Card for chronotope-v2-shakespeare_char

Model Details

Model Description

The Chronotope Transformer is a 22-million parameter causal language model, developed and trained ab initio on the "Tiny Shakespeare" dataset. The model represents an experimental deviation from standard Transformer architectures. It functions as a conceptual artifact, or "thinking object," originating from the "NanoGPT Spirit" research group. This research investigates the possibility of imbuing neural architectures with distinct rhythms and temporalities, drawing inspiration from Mikhail Bakhtin's literary concept of the chronotope—the intrinsic fusion of time and space in narrative.

The base architecture, inspired by Andrej Karpathy's NanoGPT, was intentionally modified to explore a more textured relationship with time and memory during text processing. The primary objective was not performance optimization, but rather a critical and creative intervention to ascertain whether abstract concepts of rhythm and memory could be effectively materialized within a neural network architecture. The two primary architectural interventions are:

  • DualPathwayModule: Whereas standard Transformers process information through a uniform series of identical operations, this model features two parallel pathways within each block, compelling the concurrent processing of information at two distinct temporal scales (a minimal sketch of this module follows the list):
    • Fast Path: This pathway is designed to capture local, high-frequency syntactic patterns. It utilizes a GELU activation function, which provides a sharp, non-linear response well-suited for modeling the immediate relationships between adjacent tokens. This path represents the "syntactic present."
    • Slow Path: This pathway utilizes a 1D convolution, which functions as a temporal smoothing filter, in conjunction with a smoother Tanh activation function. This structure creates an inductive bias toward learning more stable, long-term features such as tone, style, and thematic context. This path represents the "narrative chronotope."
    • An Adaptive Gate, a small, learned neural network, dynamically combines the outputs of both pathways. It learns to arbitrate, deciding moment by moment the optimal weighting between immediate syntax and overarching context, thereby allowing the model to adjust its processing "rhythm" according to the text's demands.
  • TemporalMemory: A compact, external, and persistent memory bank from which the model can read and to which it can write. This is not engineered as a solution for infinite context but is rather a conceptual intervention to provide the model with a connection to its own processing history. While a larger context window offers a longer but still linear memory, this module functions as an attentional buffer. It allows echoes of distant, salient information (far exceeding the 256-token context window) to inform the present generation, thereby simulating a mechanism for long-term textual memory.
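
To make the DualPathwayModule concrete, here is a minimal PyTorch sketch of how such a block could be structured. The class layout, layer sizes, and the symmetric convolution padding are illustrative assumptions and do not claim to reproduce the repository's exact implementation.

import torch
import torch.nn as nn

class DualPathwayModule(nn.Module):
    """Illustrative sketch: two temporal pathways mixed by a learned gate."""
    def __init__(self, d_model: int, kernel_size: int = 3):
        super().__init__()
        # Fast path: sharp, token-local non-linearity (the "syntactic present").
        self.fast = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU())
        # Slow path: 1D convolution as a temporal smoothing (FIR) filter,
        # followed by a smoother Tanh non-linearity (the "narrative chronotope").
        self.slow_conv = nn.Conv1d(d_model, d_model, kernel_size, padding=kernel_size // 2)
        # Adaptive gate: a small MLP producing a per-position mixing weight alpha_t in (0, 1).
        self.gate = nn.Sequential(nn.Linear(d_model, d_model // 4), nn.GELU(),
                                  nn.Linear(d_model // 4, 1), nn.Sigmoid())

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        fast_out = self.fast(x)
        slow_out = torch.tanh(self.slow_conv(x.transpose(1, 2)).transpose(1, 2))
        alpha = self.gate(x)                   # (batch, seq_len, 1)
        # alpha_t weights the fast path; (1 - alpha_t) weights the slow path.
        return alpha * fast_out + (1.0 - alpha) * slow_out

# Example usage (d_model=384 is an illustrative value, not the model's documented width).
block = DualPathwayModule(d_model=384)
out = block(torch.randn(2, 256, 384))          # -> (2, 256, 384)

Note that with symmetric padding the convolution at position t sees x_{t−1}, x_t, x_{t+1}, matching the formalization given under Technical Specifications; a strictly causal variant would left-pad the sequence instead.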

The training process is guided not only by the standard language modeling loss but also by dynamics enforcement loss functions. These auxiliary objectives serve as a form of architectural regularization, penalizing operational stagnation and compelling the model to actively utilize its dynamic components. This ensures the experimental pathways do not converge to a static, conventional Transformer configuration.

  • Developed by: Douglas Moura
  • Model type: chronotope_transformer_v2 (Custom decoder-only Transformer architecture)
  • Language(s) (NLP): English
  • License: MIT
  • Finetuned from model: None; the model was trained from scratch. Conceptually, it is a modified implementation of the NanoGPT architecture.

Direct Use

The model is intended for character-level text generation in the stylistic vein of William Shakespeare. It can be employed for creative applications to generate passages of text, dialogue, or poetry that imitate the specific cadence, vocabulary, and dramatic structure of the training material.

from transformers import pipeline

# The argument 'trust_remote_code=True' is required to load the custom architecture.
generator = pipeline('text-generation', model='Dumoura/chronotope-v2-shakespeare_char', trust_remote_code=True)

prompt = "JULIET:\nO Romeo, Romeo! wherefore art thou Romeo?"

result = generator(prompt, max_length=150, num_return_sequences=1)
print(result[0]['generated_text'])

# Example output:
# JULIET:
# O Romeo, Romeo! wherefore art thou Romeo?
# Deny thy father and refuse thy name;
# Or, if thou wilt not, be but sworn my love,
# And I'll no longer be a Capulet.
#
# ROMEO:
# Shall I hear more, or shall I speak at this?

Downstream Use

This model can serve as a robust foundation for fine-tuning on other highly stylistic or historical text corpora. Its dual-pathway architecture may exhibit a particular aptitude for capturing the unique rhythms of different forms of writing. For instance, it could be adapted to:

  • Generate text in the style of other Elizabethan-era authors, such as Christopher Marlowe.
  • Learn specific poetic forms, such as sonnets or haikus, through fine-tuning on a curated dataset.
  • Capture the authorial voice of other distinctive writers from different historical periods.

Out-of-Scope Use

This model is explicitly not designed for conversational or instruction-following tasks. It possesses no real-world knowledge and is unsuitable for factual question-answering, summarization, or translation. Its designated function is stylistic generation based on patterns learned from its training data. The use of this model to generate hateful, unethical, offensive, or misleading content is strictly prohibited. The model has not been trained to understand or adhere to safety guidelines.

Bias, Risks, and Limitations

The model was trained exclusively on the collected works of William Shakespeare. Consequently, it will inherit and reproduce the language, style, and inherent biases (social, cultural, gender, religious) present in its 16th and 17th-century source texts. The model possesses no awareness of contemporary social contexts and may generate text that reflects outdated and potentially offensive norms, patriarchal structures, and prejudices. It is not equipped with any mechanism for ethical reasoning and cannot distinguish between appropriate and inappropriate content. Prospective users are advised to be cognizant of these limitations and to apply critical judgment when utilizing the model's outputs.

Training Data

The model was trained on the Tiny Shakespeare dataset, which was also used in the original NanoGPT project to facilitate a direct comparison and analysis of the architectural modifications. This dataset comprises a single text file (input.txt) containing a concatenation of several works by Shakespeare, totaling approximately 1.1 million characters. A subset of 10% was reserved for validation.

Training Procedure

Preprocessing

The preprocessing pipeline involved character-level tokenization. A vocabulary was constructed from all unique characters in the training corpus (65 characters in total), with each character mapped to a unique integer index. The model's input consists of a sequence of these indices. No further preprocessing was performed.
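
For illustration, a NanoGPT-style way to build such a character-level vocabulary is sketched below. The file name input.txt matches the dataset description above, while the stoi/itos variable names are conventional assumptions rather than the repository's exact code.

# Build a character-level vocabulary from the raw corpus (NanoGPT-style sketch).
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

chars = sorted(set(text))                      # 65 unique characters for Tiny Shakespeare
stoi = {ch: i for i, ch in enumerate(chars)}   # character -> integer index
itos = {i: ch for ch, i in stoi.items()}       # integer index -> character

def encode(s: str) -> list[int]:
    return [stoi[c] for c in s]

def decode(ids: list[int]) -> str:
    return ''.join(itos[i] for i in ids)

# Example: round-trip a short string through the vocabulary.
print(decode(encode("JULIET:")))               # -> "JULIET:"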

Training Hyperparameters

  • Framework: PyTorch
  • Optimizer: AdamW (selected for its demonstrated effectiveness in training Transformer models, combining momentum with adaptive learning rates and decoupled weight decay)
  • Max Iterations: 5000
  • Learning Rate: 3e-5
  • Batch Size: 8
  • Block Size (Context Window): 256 tokens
  • Weight Decay: 0.15
  • Gradient Clipping: 0.5
  • Dropout: 0.35
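
To show how these values fit together, the sketch below wires them into a standard PyTorch training step. The stand-in model and random batches are placeholders for the actual chronotope_transformer_v2 and the encoded Tiny Shakespeare data; only the listed hyperparameters are taken from the card.

import torch
import torch.nn as nn

# Stand-in model and random data so the loop itself is runnable; the real
# chronotope_transformer_v2 and its combined loss would be used in practice.
vocab_size, block_size, batch_size = 65, 256, 8
model = nn.Sequential(nn.Embedding(vocab_size, 128), nn.Linear(128, vocab_size))

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5, weight_decay=0.15)
criterion = nn.CrossEntropyLoss()
max_iters, grad_clip = 5000, 0.5

for step in range(max_iters):
    # Random indices stand in for character-encoded Tiny Shakespeare batches.
    xb = torch.randint(0, vocab_size, (batch_size, block_size))
    yb = torch.randint(0, vocab_size, (batch_size, block_size))
    logits = model(xb)                                        # (batch, block, vocab)
    loss = criterion(logits.view(-1, vocab_size), yb.view(-1))
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
    optimizer.step()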

Evaluation

Testing Data, Factors & Metrics

  • Testing Data: A 10% validation set held out from the Tiny Shakespeare dataset.
  • Metrics: The primary evaluation metric is Cross-Entropy Loss on the validation set, which quantifies the model's ability to predict the subsequent character. Secondary, qualitative metrics include the analysis of internal model dynamics (pathway weights, memory utilization) to validate the experimental design.

Results

The model achieved a minimum validation loss of 0.0825, which indicates a high degree of learning efficacy and a strong capacity for generalization to unseen data. Crucially, the custom architectural interventions demonstrated the desired dynamic behavior, confirming that the conceptual objectives were successfully translated into learnable mechanisms.

  • Loss Curves: The training and validation loss curves track each other closely throughout the training regimen. The absence of a significant divergence between the two curves demonstrates that the model's regularization techniques (Dropout, Weight Decay) and architectural design were effective at mitigating overfitting.
  • Pathway Dynamics: The training diagnostics indicate that the model learned to actively utilize both the fast and slow pathways. Initially, the model assigns greater weight to the slow path to establish a broad context. As training progresses, it learns to incorporate the fast path for finer-grained syntactic details, ultimately settling into a dynamic equilibrium. The slight final preference for the slow path appears consistent with the structured, rhythmic nature of Shakespearean language.
  • Memory Utilization: The utilization of the external temporal memory exhibits a clear learning curve. Initially, the model assigns minimal importance to the memory module, but as it improves on its primary objective, it progressively learns the utility of querying this long-term memory bank, increasing its usage to nearly 40%. This suggests it developed a mechanism to retrieve relevant historical context to improve present predictions.
  • Pathway Balance Variance: The variance of the adaptive gate's weights remains consistently above the enforced target threshold throughout training. This result serves as a key validation metric, demonstrating that the dynamics enforcement losses were effective. The model did not collapse into a static state but was instead compelled to remain flexible, continuously adjusting the balance between its fast and slow processing rhythms.

Technical Specifications

Model Architecture and Objective

The primary objective is Causal Language Modeling (predicting the subsequent token), optimized through the minimization of Cross-Entropy Loss. This objective is augmented by auxiliary loss functions designed to enforce specific dynamic behaviors.

The architecture is a modified decoder-only Transformer with 22 million parameters, composed of 6 blocks. Each block contains the following interventions:

  1. DualPathwayModule:
    • Formalization: The Slow Path employs an nn.Conv1d with kernel_size=3. The output of this path at time t, z_t, is a function of the inputs x_{t−1}, x_t, x_{t+1}. Mathematically, this operation is a Finite Impulse Response (FIR) filter, which acts as a low-pass filter, smoothing the representation over time. By learning weights where W_{−1} ≈ W_0 ≈ W_1, the optimizer can minimize the temporal difference Δz_t = z_t − z_{t−1}, thus creating an inductive bias toward learning slowly changing features, a principle analogous to the objective of Slow Feature Analysis (SFA).
    • Adaptive Gate: A small MLP that produces weights α_t and (1 − α_t) to linearly combine the outputs of the fast and slow paths.
  2. TemporalMemory:
    • A trainable memory matrix M ∈ R^(S×D), where S is the memory size and D is the embedding dimension.
    • Access is performed via a standard attention mechanism, Attention(Q, K, V) = softmax(QKᵀ / √d_k) V. The query Q is generated from the Transformer's current hidden state, while the keys K and values V are linear projections of the memory matrix M. The resulting context vector is integrated back into the processing stream.
  3. Forced Dynamics Losses:
    • Auxiliary loss functions that penalize low variance in the adaptive gate over time (stagnation penalty) and compel the gate's balance to oscillate according to a sinusoidal target, T_t = 0.5 + A · sin(2πt / P), thereby preventing convergence to a static equilibrium.
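
As a rough illustration of points 2 and 3, the sketch below implements a memory read via standard attention together with the two forced-dynamics terms. The memory size, projection layers, amplitude A, period P, and loss weighting are assumed values chosen for clarity, not the repository's implementation.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalMemory(nn.Module):
    """Sketch: a trainable memory matrix M (S x D) read via standard attention."""
    def __init__(self, d_model: int, memory_slots: int = 64):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(memory_slots, d_model) * 0.02)  # M ∈ R^(S×D)
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)

    def forward(self, hidden):                          # hidden: (batch, seq_len, d_model)
        q = self.q_proj(hidden)                         # queries from the current hidden state
        k = self.k_proj(self.memory)                    # keys and values from the memory matrix
        v = self.v_proj(self.memory)
        scores = q @ k.T / math.sqrt(q.size(-1))        # softmax(Q Kᵀ / √d_k) V
        return F.softmax(scores, dim=-1) @ v            # context vector read from memory


def dynamics_losses(alpha, step, amplitude=0.1, period=500, min_var=1e-3):
    """Sketch of the forced-dynamics terms; alpha: gate weights, shape (batch, seq_len)."""
    # Stagnation penalty: discourage low variance of the gate across time.
    stagnation = F.relu(min_var - alpha.var(dim=1)).mean()
    # Sinusoidal target: T_t = 0.5 + A * sin(2*pi*t / P) for the mean gate balance.
    target = 0.5 + amplitude * math.sin(2 * math.pi * step / period)
    oscillation = (alpha.mean() - target) ** 2
    return stagnation + oscillation

# Example usage with illustrative shapes.
mem = TemporalMemory(d_model=384)
context = mem(torch.randn(2, 256, 384))                 # -> (2, 256, 384)
aux = dynamics_losses(torch.rand(2, 256), step=100)     # scalar auxiliary loss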

Citation

@misc{moura2025chronotope,
  author       = {Eduardo de Moura},
  title        = {Chronotope Transformer v2},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/Dumoura/chronotope-v2-shakespeare_char}},
}

Moura, E. (2025). Chronotope Transformer v2. Hugging Face repository. Retrieved from https://huggingface.co/Dumoura/chronotope-v2-shakespeare_char
