---
language: en
tags:
- pytorch
- tensorflow
- text-generation
- language-model
- moe
- transformer
- causal-lm
license: mit
datasets:
- custom
metrics:
- perplexity
- accuracy
model-index:
- name: MiniGPT-MoE
  results:
  - task:
      type: text-generation
    dataset:
      type: custom
      name: Custom Corpus
    metrics:
    - type: perplexity
      value: 134
    - type: accuracy
      value: 0.85
pipeline_tag: text-generation
---

# MiniGPT-MoE: Lightweight Language Model with Mixture of Experts

A lightweight implementation of a GPT-style language model using TensorFlow, featuring a Mixture of Experts (MoE) architecture for efficient computation.

## Model Details

- **Architecture**: Transformer with Mixture of Experts (MoE)
- **Total Parameters**: 52.8M
- **Framework**: TensorFlow 2.x
- **Training**: Custom dataset with ByteLevel BPE tokenization
- **Model Type**: Causal Language Model

### Architecture Specifications

- **Embedding Dimension**: 512
- **Number of Layers**: 8 Transformer blocks
- **Attention Heads**: 8
- **Feed-forward Dimension**: 2048
- **Number of Experts**: 4 (in MoE layers; see the sketch after this list)
- **MoE Layers**: Layers 2, 4, 6
- **Vocabulary Size**: 10,000
- **Max Sequence Length**: 256
- **Positional Embeddings**: Rotary Positional Embeddings (RoPE)
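
The MoE blocks replace the dense feed-forward sublayer in layers 2, 4, and 6. The reference implementation lives in `minigpt_transformer.py`; the snippet below is only a minimal sketch of how a top-1-routed (switch-style) MoE feed-forward block can be written in TensorFlow, with class and variable names that are illustrative assumptions rather than the repo's API:

```python
import tensorflow as tf

class Top1MoEFFN(tf.keras.layers.Layer):
    """Sketch of a top-1 (switch-style) mixture-of-experts FFN block."""

    def __init__(self, embed_dim=512, ffn_dim=2048, num_experts=4, **kwargs):
        super().__init__(**kwargs)
        self.num_experts = num_experts
        self.router = tf.keras.layers.Dense(num_experts)  # per-token gating logits
        self.experts = [
            tf.keras.Sequential([
                tf.keras.layers.Dense(ffn_dim, activation="gelu"),
                tf.keras.layers.Dense(embed_dim),
            ])
            for _ in range(num_experts)
        ]

    def call(self, x):
        # x: (batch, seq, embed_dim); route each token to its best expert.
        gate_probs = tf.nn.softmax(self.router(x), axis=-1)           # (B, S, E)
        top_prob = tf.reduce_max(gate_probs, axis=-1, keepdims=True)  # (B, S, 1)
        top_idx = tf.argmax(gate_probs, axis=-1)                      # (B, S)
        # Dense formulation for clarity: run every expert, keep the chosen one.
        # A production kernel would dispatch tokens sparsely instead.
        all_out = tf.stack([expert(x) for expert in self.experts], axis=2)
        mask = tf.one_hot(top_idx, self.num_experts)[..., tf.newaxis]
        return top_prob * tf.reduce_sum(all_out * mask, axis=2)
```

With `top_k_experts=1`, each token activates a single expert's FFN, so the layer adds capacity (4 experts) without multiplying per-token compute.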

## Usage

### Loading the Model

```python
from minigpt_transformer import MoEMiniGPT, MoEConfig

# Load configuration
config = MoEConfig(
    vocab_size=10000,
    max_seq_len=256,
    embed_dim=512,
    num_heads=8,
    num_layers=8,
    ffn_dim=2048,
    num_experts=4,
    top_k_experts=1,
    use_moe_layers=[2, 4, 6]
)

# Create model
model = MoEMiniGPT(config, tokenizer_path="my-10k-bpe-tokenizer")

# Load trained weights
model.load_weights("moe_minigpt.weights.h5")
```
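
Depending on how `MoEMiniGPT` creates its variables, Keras may refuse to load HDF5 weights into a subclassed model that has never been called. If `load_weights` raises an error about unbuilt layers, a dummy forward pass builds the variables first; this is a generic Keras workaround, not a step the repo documents:

```python
import tensorflow as tf

# Build the model's variables with one dummy batch, then restore the checkpoint.
dummy_ids = tf.zeros((1, config.max_seq_len), dtype=tf.int32)
_ = model(dummy_ids)
model.load_weights("moe_minigpt.weights.h5")
```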

### Text Generation

```python
# Generate text
response = model.generate_text("Hello, how are you?", max_length=50)
print(response)
```
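
`generate_text` is the repository's helper; for intuition, the simplest greedy decoding loop behind an API like this looks roughly as follows, assuming the model returns per-token logits and the tokenizer is a HuggingFace `tokenizers` object (both assumptions, since the README does not pin these details down):

```python
import tensorflow as tf

def greedy_generate(model, tokenizer, prompt, max_new_tokens=50, max_seq_len=256):
    # Encode the prompt, then repeatedly append the most likely next token.
    ids = tokenizer.encode(prompt).ids
    for _ in range(max_new_tokens):
        window = tf.constant([ids[-max_seq_len:]])  # crop to the context window
        logits = model(window)                      # assumed shape (1, seq, vocab)
        ids.append(int(tf.argmax(logits[0, -1])))   # greedy: take the argmax token
    return tokenizer.decode(ids)
```

Practical generators usually replace the argmax with temperature or top-k sampling to avoid repetitive output.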

### Training

```bash
# Train the model
python train_minigpt.py
```

## Training Details

- **Dataset**: Custom corpus drawn from Project Gutenberg books
- **Tokenization**: ByteLevel BPE with a 10k vocabulary
- **Batch Size**: 48
- **Learning Rate**: 2e-4
- **Optimizer**: Adam
- **Loss**: Sparse categorical cross-entropy plus auxiliary MoE losses (see the sketch below)
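
The exact auxiliary terms live in the training script; purely as an illustration, a switch-style load-balancing penalty added to the language-modeling loss could look like the following, where the function name, tensor layout, and `aux_weight` are assumptions rather than the repo's code:

```python
import tensorflow as tf

def moe_train_loss(labels, logits, gate_probs, expert_mask, aux_weight=0.01):
    """Hypothetical combined loss: LM cross-entropy + load-balancing penalty."""
    # Standard next-token prediction loss over the vocabulary.
    lm_loss = tf.reduce_mean(
        tf.keras.losses.sparse_categorical_crossentropy(
            labels, logits, from_logits=True
        )
    )
    # Switch-Transformer-style balance term: fraction of tokens actually routed
    # to each expert times the mean gate probability for that expert.
    # gate_probs, expert_mask: (num_tokens, num_experts)
    density = tf.reduce_mean(expert_mask, axis=0)       # hard routing share
    density_proxy = tf.reduce_mean(gate_probs, axis=0)  # differentiable share
    num_experts = tf.cast(tf.shape(gate_probs)[-1], tf.float32)
    aux_loss = num_experts * tf.reduce_sum(density * density_proxy)
    return lm_loss + aux_weight * aux_loss
```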

## Model Performance

- **Perplexity**: ~134, reached after 1.1 epochs (see the note below)
- **Training Tokens**: 2M+
- **Expert Utilization**: Balanced across the 4 experts
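
As a sanity check on these numbers: perplexity is the exponential of the mean per-token cross-entropy (in nats), so a perplexity of ~134 corresponds to a training loss of about 4.9 nats per token:

```python
import math

loss_nats = math.log(134)         # ≈ 4.90 nats per token
perplexity = math.exp(loss_nats)  # ≈ 134, the reported value
```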

## Files

- `moe_minigpt.weights.h5`: Trained model weights
- `minigpt_transformer.py`: Model architecture implementation
- `train_minigpt.py`: Training script
- `train_tokenizer.py`: Tokenizer training script (see the sketch after this list)
- `my-10k-bpe-tokenizer/`: Pre-trained tokenizer files
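
`train_tokenizer.py` is the authoritative script for this step; a minimal equivalent using the HuggingFace `tokenizers` library would look roughly like this, where the corpus file name and `min_frequency` are illustrative assumptions:

```python
from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer with a 10,000-token vocabulary.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=["corpus.txt"], vocab_size=10000, min_frequency=2)

# Writes vocab.json and merges.txt into the tokenizer directory.
tokenizer.save_model("my-10k-bpe-tokenizer")
```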

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{minigpt-moe,
  title={MiniGPT-MoE: Lightweight Language Model with Mixture of Experts},
  author={Devansh0711},
  year={2024},
  url={https://github.com/Devansh070/Language_model}
}
```

## License

This model is released under the MIT License.

## Acknowledgments

- Built with TensorFlow and Keras
- Uses HuggingFace tokenizers
- Inspired by modern transformer architectures with MoE