---
title: Aetheris Hybrid Mamba MoE
emoji: ☂
colorFrom: indigo
colorTo: purple
sdk: docker
pinned: false
app_port: 7860
license: mit
---
# Aetheris: Hybrid Mamba-MoE Experiment
<p align="center">
<img src="https://img.shields.io/badge/Status-Experimental-yellow.svg" alt="Status">
<img src="https://img.shields.io/badge/License-MIT-green.svg" alt="License">
<img src="https://img.shields.io/badge/Python-3.10+-blue.svg" alt="Python">
<img src="https://img.shields.io/badge/PyTorch-2.0+-orange.svg" alt="PyTorch">
<img src="https://img.shields.io/badge/API-FastAPI-009688.svg" alt="FastAPI">
</p>
**Aetheris** is a hobbyist research project and experimental implementation exploring the intersection of **State Space Models (Mamba)** and **Mixture of Experts (MoE)**.
The goal of this project was to learn by doing: attempting to combine the linear-time inference of Mamba with the sparse scaling capacity of MoE from scratch in PyTorch. It is designed as a playground for understanding these modern architectures, not as a published academic paper or production-ready foundation model.
## 🧪 The Experiment
Current LLM architectures are evolving rapidly. I built Aetheris to investigate a specific question:
> *Can we successfully interleave Mamba blocks (for long context) with sparse MoE layers (for capacity) to train an efficient model on consumer hardware?*
This project implements a hybrid architecture that attempts to:
1. **Replace Attention:** Use Mamba (SSM) blocks to achieve $O(N)$ sequence scaling.
2. **Scale Parameters Sparsely:** Use MoE layers to increase model size without exploding the computational cost per token.
3. **Run Locally:** Optimize the implementation for single-GPU training (gradient checkpointing, efficient routing).
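For the third point, the main single-GPU trick is standard PyTorch gradient checkpointing, which trades extra compute in the backward pass for a much smaller activation memory footprint. The snippet below is a generic illustration of the mechanism, not this repo's actual training loop:
```python
# Generic illustration of gradient checkpointing (not this repo's training loop).
import torch
from torch.utils.checkpoint import checkpoint

# A stand-in block: any nn.Module whose activations are expensive to keep around.
block = torch.nn.Sequential(
    torch.nn.Linear(512, 2048),
    torch.nn.GELU(),
    torch.nn.Linear(2048, 512),
)
x = torch.randn(8, 1024, 512, requires_grad=True)

# Intermediate activations inside `block` are discarded after the forward pass
# and recomputed during backward, cutting peak memory at the cost of extra compute.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```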
## 🏗️ Architecture Implementation
Aetheris alternates between custom implementations of two core modules:
* **SSMBlock (The Backbone):** Implements the selective scan mechanism described in the [Mamba paper](https://arxiv.org/abs/2312.00752). This handles the sequence mixing and "memory" of the model.
* **SparseMoELayer (The Scaling):** A router-based layer that dispatches tokens to Top-K experts (Feed-Forward Networks). This allows the model to "specialize" parts of its parameters for different types of tokens.
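To make the alternation concrete, here is a minimal, self-contained PyTorch sketch of the pattern: a top-k router dispatching tokens to small feed-forward experts, stacked with a linear-time sequence mixer. The class names and the GRU sequence mixer are purely illustrative (a real selective scan is considerably more involved); this is not the repository's actual `SSMBlock` / `SparseMoELayer` code.
```python
# Illustrative sketch only -- not the repository's actual SSMBlock / SparseMoELayer code.
import torch
import torch.nn as nn


class ToyMoELayer(nn.Module):
    """Top-k token routing over a set of small feed-forward experts."""

    def __init__(self, d_model: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = x.reshape(-1, x.shape[-1])                 # (batch * seq, d_model)
        probs = self.router(tokens).softmax(dim=-1)         # (tokens, num_experts)
        weights, indices = probs.topk(self.top_k, dim=-1)   # each token picks top_k experts
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (indices == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue  # no tokens routed to this expert in this batch
            out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(tokens[token_ids])
        return out.reshape_as(x)


class ToyHybridBlock(nn.Module):
    """Linear-time sequence mixing followed by a sparse MoE feed-forward layer."""

    def __init__(self, d_model: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        # A GRU stands in for the selective-scan SSM purely for illustration.
        self.seq_mixer = nn.GRU(d_model, d_model, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.moe = ToyMoELayer(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.seq_mixer(self.norm1(x))[0]   # sequence mixing ("memory")
        x = x + self.moe(self.norm2(x))            # sparse capacity
        return x


if __name__ == "__main__":
    model = nn.Sequential(*[ToyHybridBlock(64) for _ in range(4)])
    print(model(torch.randn(2, 16, 64)).shape)     # torch.Size([2, 16, 64])
```
The point of the routing is that each token only passes through `top_k` experts, so total parameters grow with `num_experts` while per-token compute stays roughly constant.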
## 🚀 Quick Start
This code is provided for educational purposes and for others who want to experiment with hybrid architectures.
### Installation
**Option 1: Local Python Environment**
```bash
git clone https://github.com/Pomilon/Aetheris.git
cd Aetheris
pip install -r requirements.txt
```
**Option 2: Docker**
We provide Dockerfiles for both CPU (slim) and GPU (NVIDIA) environments.
```bash
# CPU Version
docker build -t aetheris-cpu -f Dockerfile .
docker run -p 7860:7860 aetheris-cpu
# GPU Version (Requires NVIDIA Container Toolkit)
docker build -t aetheris-gpu -f Dockerfile-nvidia .
docker run --gpus all -p 7860:7860 aetheris-gpu
```
### Usage (CLI)
Aetheris includes a CLI for training, running inference, and serving the model.
**1. Training (From Scratch)**
```bash
# Trains a small model defined in configs/default.yaml
python -m aetheris.cli.main train --config configs/default.yaml
```
**2. Generation (CLI)**
```bash
python -m aetheris.cli.main generate --prompt "The quick brown fox" --checkpoint_dir checkpoints
```
**3. API Server (OpenAI-Compatible)**
Start a local API server that exposes an OpenAI-compatible chat completions endpoint.
```bash
python -m aetheris.cli.main serve --host 0.0.0.0 --port 8000
```
You can then interact with it using standard tools:
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "aetheris-hybrid",
  "messages": [{"role": "user", "content": "Hello!"}],
  "stream": true
}'
```
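Since the endpoint follows OpenAI's chat completions format, the official `openai` Python client can also be pointed at the local server. A minimal sketch (the `api_key` value is a dummy placeholder):
```python
# Sketch: querying the local server with the official openai client (pip install openai).
# The api_key is a dummy placeholder; the server address matches the `serve` command above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="aetheris-hybrid",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```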
### Development & Testing
To run the test suite:
```bash
pytest tests/
```
## ⚙️ Configuration
You can tweak the hyperparameters in `configs/`. I've included a "Debug" config that is small enough to train on a laptop CPU for testing the code flow.
| Config File | Description |
| :--- | :--- |
| `configs/default.yaml` | Standard experimental setup (requires GPU). |
| `configs/debug.yaml` | Tiny model (2 layers) for code debugging. |
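The configs are plain YAML, so if you prefer to tweak hyperparameters programmatically instead of editing files by hand, something like the following works (a sketch; the exact keys depend on the config file you load):
```python
# Sketch: load, tweak, and save a config programmatically (pip install pyyaml).
# The exact keys depend on the YAML file, so inspect `cfg` before changing anything.
import yaml

with open("configs/debug.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg)  # see which hyperparameters the config exposes

with open("configs/my_experiment.yaml", "w") as f:
    yaml.safe_dump(cfg, f)
```
You can then point the training command at the new file with `--config configs/my_experiment.yaml`.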
## 📚 Acknowledgements & References
This project is an implementation study and relies heavily on the brilliant theoretical work of others. It is not an original invention of the Mamba or MoE concepts.
* **Mamba Architecture:** Gu, A., & Dao, T. (2023). *Mamba: Linear-Time Sequence Modeling with Selective State Spaces*. [arXiv:2312.00752](https://arxiv.org/abs/2312.00752)
* **Mixture of Experts:** Shazeer, N., et al. (2017). *Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer*. [arXiv:1701.06538](https://arxiv.org/abs/1701.06538)
* **Inspiration:** Jamba (AI21 Labs) and OpenMoE.
## 🧠 Model Weights & Checkpoints
All pre-trained checkpoints are hosted on the [Hugging Face Hub](https://huggingface.co/Pomilon).
| Model Artifact | Step | Description | Download |
| :--- | :--- | :--- | :--- |
| **Aetheris-Base** | 10k | Early convergence checkpoint (Loss ~3.66). Good for analyzing router behavior. | [🤗 Hugging Face](https://huggingface.co/Pomilon/Aetheris-MoE-300M-A125M-base) |
| **Aetheris-Chat** | -- | *Coming Soon (Post-SFT)* | -- |
> **⚠️ Important:** Aetheris uses a custom Hybrid Mamba-MoE architecture. You **cannot** load it directly with `transformers.AutoModel`. You must use the interface provided in this repository.
### 🐍 How to Load
```bash
# Rename the checkpoint file inside the folder to checkpoint_current.pth first
python -m aetheris.cli.main generate --prompt "The quick brown fox" --checkpoint_dir path/to/checkpoints_folder
```
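To fetch the checkpoint files from the Hub in the first place, `huggingface_hub.snapshot_download` is one convenient option (a sketch; the exact filenames inside the model repo may differ):
```python
# Sketch: pull the Aetheris-Base checkpoint from the Hub (pip install huggingface_hub).
# Exact filenames inside the model repo may differ; rename the checkpoint to
# checkpoint_current.pth before pointing the CLI at the folder.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="Pomilon/Aetheris-MoE-300M-A125M-base")
print(f"Checkpoint files downloaded to: {local_dir}")
```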
> **Note:** A cleaner Python inference API is planned; for now, generation goes through this admittedly scuffed CLI path. :D
> **Note:** These weights are from an experimental run. They demonstrate that the architecture works, but do not expect GPT-5 (or even Google Bard) levels of coherence. :D
> This project was made for learning and fun!
## License
MIT