MiniGPT-30M-Wikipedia-Var1

A 30M-parameter decoder-only Transformer trained from scratch on WikiText-103. The architecture features RMSNorm, Rotary Positional Embeddings (RoPE), and the SwiGLU activation.

Architecture

  • Parameters: ~29.9M
  • Layers: 6
  • Heads: 8
  • Embedding dim: 384
  • Context: 512 tokens
  • Features: RMSNorm, RoPE, SwiGLU, weight tying (see the configuration sketch below)
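
As an illustration of how these hyperparameters fit together, here is a minimal configuration sketch. The class and field names (MiniGPTConfig, n_layers, d_model, etc.) are assumptions for illustration only; the repo's actual Config class in model.py may use different names and additional fields (e.g. the SwiGLU hidden size).

# Hypothetical config mirroring the hyperparameters listed above.
# Names are illustrative; the real Config lives in model.py.
from dataclasses import dataclass

@dataclass
class MiniGPTConfig:
    vocab_size: int = 50257   # GPT-2 tokenizer vocabulary
    n_layers: int = 6         # Transformer blocks
    n_heads: int = 8          # attention heads (head dim = 384 / 8 = 48)
    d_model: int = 384        # embedding dimension
    max_seq_len: int = 512    # context window
    tie_weights: bool = True  # share input embedding and output projection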

Training

  • Dataset: WikiText-103 (GPT-2 tokenized)
  • Epochs: 3
  • Hardware: 2× NVIDIA T4 (Kaggle)
  • Optimizer: AdamW (lr=3e-4); a training-loop sketch follows below
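
For reference, a minimal sketch of the training setup described above (AdamW, lr=3e-4, 3 epochs). The data pipeline (train_loader), batch size, weight decay, and any learning-rate schedule are assumptions, as the card does not specify them; the forward call assumes the (logits, ...) return signature shown in the Usage section.

# Illustrative training loop; not the exact script used for this model.
import torch
import torch.nn.functional as F
from model import MiniGPT, Config

model = MiniGPT(Config())
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

for epoch in range(3):
    for batch in train_loader:            # assumed: (batch, 513) GPT-2 token chunks from WikiText-103
        input_ids, targets = batch[:, :-1], batch[:, 1:]
        logits, _ = model(input_ids)      # model returns (logits, ...) per the usage example
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()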

Usage

This model uses a custom architecture and cannot be loaded with AutoModelForCausalLM. Load manually:

# Download these files from the repo:
# - model.py (contains MiniGPT class definition)
# - model.safetensors (weights)
# - tokenizer.json (GPT-2 tokenizer)

from model import MiniGPT, Config
import torch
from safetensors.torch import load_file
from transformers import GPT2Tokenizer

config = Config()
model = MiniGPT(config)
model.load_state_dict(load_file("model.safetensors"))  # safetensors weights require load_file, not torch.load
model.eval()

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Greedy next-token prediction (single step)
prompt = "The world is a"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    logits, _ = model(input_ids)
    next_token = torch.argmax(logits[:, -1, :], dim=-1)
    print(tokenizer.decode(next_token))
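
To generate more than one token, a simple autoregressive loop along these lines should work. It reuses the prompt and the (logits, ...) return signature from the example above and uses plain greedy decoding; max_new_tokens is an arbitrary illustrative value, and temperature or top-k sampling would be a straightforward extension.

# Greedy autoregressive generation sketch
max_new_tokens = 50
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(max_new_tokens):
        context = input_ids[:, -512:]  # stay within the 512-token context window
        logits, _ = model(context)
        next_token = torch.argmax(logits[:, -1, :], dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=1)

print(tokenizer.decode(input_ids[0]))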

⚠️ Warning: This is an experimental 30M-parameter model. Outputs are grammatically plausible but factually unreliable. Intended for architectural study and education — not for production use.
