---
license: mit
datasets:
- HuggingFaceFW/fineweb-edu
- common-pile/arxiv_papers_filtered
- tiiuae/falcon-refinedweb
- manu/project_gutenberg
- nampdn-ai/tiny-textbooks
- SciPhi/textbooks-are-all-you-need-lite
- abehandlerorg/ccnews
base_model:
- openai-community/gpt2
pipeline_tag: text-generation
---

# GPT-2 from Scratch

This model implements the GPT-2 architecture (125M parameters), trained from scratch.

## Model Description

- **Model type:** GPT-2 (125M parameters)
- **Architecture:** Transformer-based autoregressive language model following the original GPT-2 design
- **Training data:** A mix of datasets (see the dataset tags above), totaling roughly 18 billion tokens
- **Language:** English
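
As a sanity check on the size, 125M matches the standard GPT-2 small configuration (12 layers, 12 heads, 768-dim embeddings). A minimal sketch with the `transformers` library, assuming this model uses that standard configuration:

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Standard GPT-2 small configuration (assumed to match this model)
config = GPT2Config(
    vocab_size=50257,
    n_positions=1024,  # matches the 1024-token context reported below
    n_embd=768,
    n_layer=12,
    n_head=12,
)
model = GPT2LMHeadModel(config)

# Count parameters; GPT-2 ties lm_head to the token embedding,
# so the tied weight is counted once
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # ~124M
```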

## Performance and Evaluation

| Dataset        | Metric | thecr7guy/gpt2-pretrain | GPT-2 (baseline) |
|----------------|--------|-------------------------|------------------|
| HellaSwag      | acc    | **0.291**               | 0.289            |
| SciQ           | acc    | **0.754**               | 0.752            |
| Winogrande     | acc    | 0.491                   | **0.516**        |
| TruthfulQA MC1 | acc    | **0.236**               | 0.228            |
| MMLU (overall) | acc    | **0.230**               | 0.229            |
| - Humanities   | acc    | 0.242                   | 0.242            |
| - Social Sci.  | acc    | 0.217                   | 0.217            |
| - STEM         | acc    | 0.213                   | 0.213            |
| - Other        | acc    | **0.239**               | 0.238            |

## Training Details

- **Training corpus:** Approximately 18B tokens (120 GB)
- **Training duration:** 1 epoch (approximately 8 hours total)
- **Hardware:** 8× NVIDIA A100 PCIe GPUs via runpod.io
- **Estimated cost:** ~$108 (8 × $13.52) for complete training
- **Token context:** 1024 tokens
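
The batch-size figures in the hyperparameters imply how the per-GPU batches combine into one optimizer step, and how many steps one epoch takes. A back-of-the-envelope check, assuming `batch_size` is the per-GPU micro-batch:

```python
# Back-of-the-envelope training arithmetic
# (assumes batch_size is sequences per GPU per micro-step)
context_len = 1024              # tokens per sequence
batch_size = 64                 # sequences per GPU
n_gpus = 8
total_batch_size = 524288       # tokens per optimizer step
total_tokens = 18_000_000_000   # ~18B-token corpus

tokens_per_micro_step = batch_size * context_len * n_gpus
grad_accum_steps = total_batch_size // tokens_per_micro_step
steps_per_epoch = total_tokens // total_batch_size

print(grad_accum_steps)  # 1: the 8 GPUs already fill 524288 tokens per step
print(steps_per_epoch)   # 34332 optimizer steps per epoch
```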

### Hyperparameters

- context_len: 1024
- seed: 42
- epochs: 2
- batch_size: 64
- total_batch_size: 524288 tokens
- grad_clip: 1.0
- optimizer: "adamw"
- max_lr: 6.0e-4
- min_lr: 6.0e-5
- beta1: 0.9
- beta2: 0.95
- weight_decay: 0.1
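
The card gives `max_lr` and `min_lr` but not the schedule connecting them. GPT-2 reproductions typically use linear warmup followed by cosine decay; a sketch under that assumption, with a hypothetical `warmup_steps` (not stated in this card):

```python
import math

MAX_LR = 6.0e-4
MIN_LR = 6.0e-5

def get_lr(step, warmup_steps, max_steps):
    """Linear warmup to MAX_LR, then cosine decay to MIN_LR.

    The schedule shape and warmup length are assumptions, not taken
    from this model card.
    """
    if step < warmup_steps:
        # Linear warmup: reaches MAX_LR on the last warmup step
        return MAX_LR * (step + 1) / warmup_steps
    if step >= max_steps:
        return MIN_LR
    # Cosine decay from MAX_LR down to MIN_LR
    ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))
    return MIN_LR + coeff * (MAX_LR - MIN_LR)
```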

## Setup and Training Commands

```bash
pip install wandb
pip install tiktoken
pip install --upgrade huggingface_hub
pip install torchinfo
pip install datasets
sudo apt update && sudo apt install tmux
tmux new -s training
wandb login
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 NCCL_P2P_DISABLE=1 \
    torchrun --standalone --nproc_per_node=8 train.py
```

## Contact

GitHub: [thecr7guy2](https://github.com/thecr7guy2)