---
license: apache-2.0
---

<h1 align="center">Metis-HOME: Hybrid Optimized Mixture-of-Experts for Multimodal Reasoning</h1>

<h5 align="center">

[![arXiv](https://img.shields.io/badge/Arxiv-2510.20519-b31b1b.svg?logo=arXiv)](https://arxiv.org/pdf/2510.20519)&ensp;<a href='https://huggingface.co/mmthinking/Metis-HOME'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face%20-models-blue'></a>&ensp;[![Code License](https://img.shields.io/badge/License-Apache_2.0-green.svg)](https://github.com/tatsu-lab/stanford_alpaca/blob/main/LICENSE)

</h5>


## 💡 Overview
Current multimodal reasoning models face a critical dilemma: they often "overthink" simple tasks (inefficiency) and suffer general-capability degradation when optimized for reasoning.

We introduce **Metis-HOME** (**H**ybrid **O**ptimized **M**ixture-of-**E**xperts), a novel framework that enables a "Hybrid Thinking" paradigm. By structuring the original dense model (Qwen2.5-VL-7B) into two distinct expert branches, a Thinking Expert for complex reasoning and a Non-Thinking Expert for rapid inference, controlled by a lightweight router, Metis-HOME effectively resolves the reasoning-vs-generalization trade-off.

<div style="display: flex; justify-content: center; gap: 20px; flex-wrap: wrap;">
  <img src="https://raw.githubusercontent.com/MM-Thinking/Metis-HOME/main/assets/framework.png" alt="Metis-HOME framework overview" style="width:400px; max-width:100%;">
  <img src="https://raw.githubusercontent.com/MM-Thinking/Metis-HOME/main/assets/radar_chart.png" alt="Metis-HOME benchmark radar chart" style="width:400px; max-width:100%;">
</div>

## ✨ Highlights

- 🧠 Hybrid Thinking Paradigm: explicitly decouples "System 1" (fast, intuitive) and "System 2" (slow, deliberative) reasoning within a unified multimodal MoE architecture.
- 🔄 Router Mechanism: a lightweight, trainable router dynamically allocates queries by complexity, avoiding computational waste on simple tasks such as OCR or captioning.
- 🚀 Performance:
  - +6.9% improvement on reasoning benchmarks (MathVista, etc.) compared to the baseline.
  - ~1% gain on general benchmarks, reversing the degradation trend observed in other reasoning-specialized models.
- 🛠️ Efficient Training: a multi-stage strategy combining reinforcement learning (RL) for reasoning enhancement and mixed supervised fine-tuning (SFT) for expert specialization.

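The routing idea above can be sketched as a tiny binary gate. Everything in this snippet (the scalar complexity feature, the weights, the threshold, and the function names) is an invented placeholder for illustration; the actual router is a trainable module operating on the model's hidden states, not on a hand-crafted score.

```python
import math

# Hypothetical sketch of a lightweight router: a logistic gate that sends
# each query either to the thinking expert (slow, deliberative) or to the
# non-thinking expert (fast, intuitive). Weights and threshold are made up.

def route(complexity_score: float, weight: float = 4.0,
          bias: float = -2.0, threshold: float = 0.5) -> str:
    """Return which expert branch handles a query of given complexity."""
    p_think = 1.0 / (1.0 + math.exp(-(weight * complexity_score + bias)))
    return "thinking" if p_think >= threshold else "non-thinking"

# A multi-step math problem (high complexity) takes the deliberative path,
# while an OCR-style query takes the fast path.
print(route(0.9))  # thinking
print(route(0.1))  # non-thinking
```

The real benefit is the same as in this toy: simple queries never pay the token cost of a long chain-of-thought.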

## 📊 Results

### Thinking Ratio
As shown in the figure below, the **thinking ratio** of Metis-HOME reveals adaptive routing behavior:
- **High ratios (78%–98%)** on reasoning-heavy benchmarks (*WeMath*, *MathVision*, etc.), indicating effective use of the *thinking expert* for multi-step inference.
- **Low ratios (2%–5%)** on general benchmarks (*MMBench*, *OCRBench*), showing a preference for the *non-thinking expert*.

This aligns with our design: **deliberate reasoning for complex tasks** and **fast inference for simple ones**, optimizing computational efficiency.

<img src="https://raw.githubusercontent.com/MM-Thinking/Metis-HOME/main/assets/thinking_ratio_chart.png" alt="Thinking ratio of Metis-HOME across benchmarks" style="width:850px; max-width:100%;">

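Concretely, the thinking ratio is the fraction of a benchmark's queries that the router sends to the thinking expert. A minimal sketch (the per-query routing decisions below are made-up placeholders, not real benchmark data):

```python
# Thinking ratio = (#queries routed to the thinking expert) / (#queries).

def thinking_ratio(decisions: list[str]) -> float:
    """decisions: routing outcomes, each 'thinking' or 'non-thinking'."""
    if not decisions:
        return 0.0
    return sum(d == "thinking" for d in decisions) / len(decisions)

# e.g. a reasoning-heavy benchmark vs. a perception-style benchmark
math_bench = ["thinking"] * 9 + ["non-thinking"] * 1
ocr_bench = ["thinking"] * 1 + ["non-thinking"] * 24

print(f"{thinking_ratio(math_bench):.0%}")  # 90%
print(f"{thinking_ratio(ocr_bench):.0%}")   # 4%
```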

### Benchmarks
<table>
<thead>
<tr>
<th rowspan="2" style="text-align:left; vertical-align:bottom;">Model</th>
<th colspan="7" style="text-align:center; border-bottom:1px solid #ccc;">Reasoning</th>
<th style="text-align:center; border-bottom:1px solid #ccc;">General</th>
</tr>
<tr>
<th>MathVista</th>
<th>MathVision</th>
<th>MathVerse</th>
<th>DynaMath</th>
<th>WeMath</th>
<th>LogicVista</th>
<th>Avg.</th>
<th>Avg.</th>
</tr>
</thead>
<tbody>

<tr style="background-color: #e0e0e0;">
<td colspan="9" align="center"><strong><em>Proprietary Models</em></strong></td>
</tr>

<tr>
<td>Gemini-2.0-Pro</td>
<td>71.3</td>
<td>48.1</td>
<td>67.3</td>
<td>43.3</td>
<td>56.5</td>
<td>53.2</td>
<td>56.6</td>
<td>73.3</td>
</tr>
<tr>
<td>Gemini-2.0-Flash</td>
<td>70.4</td>
<td>43.6</td>
<td>47.8</td>
<td>42.1</td>
<td>47.4</td>
<td>52.3</td>
<td>50.6</td>
<td>72.6</td>
</tr>
<tr>
<td>Claude 3.7 Sonnet</td>
<td>66.8</td>
<td>41.9</td>
<td>46.7</td>
<td>39.7</td>
<td>49.3</td>
<td>58.2</td>
<td>50.4</td>
<td>70.1</td>
</tr>
<tr>
<td>ChatGPT-4o</td>
<td>60.0</td>
<td>31.2</td>
<td>40.6</td>
<td>34.5</td>
<td>45.8</td>
<td>52.8</td>
<td>44.2</td>
<td>72.0</td>
</tr>

<tr style="background-color: #e0e0e0;">
<td colspan="9" align="center"><strong><em>Open-source Models</em></strong></td>
</tr>

<tr>
<td>LLaVA-OneVision-72B</td>
<td>67.1</td>
<td>25.3</td>
<td>27.2</td>
<td>15.6</td>
<td>32.0</td>
<td>40.9</td>
<td>34.7</td>
<td>68.0</td>
</tr>
<tr>
<td>Kimi-VL-A3B-Instruct</td>
<td>66.0</td>
<td>21.8</td>
<td>34.1</td>
<td>18.0</td>
<td>32.3</td>
<td>42.7</td>
<td>35.8</td>
<td>69.1</td>
</tr>
<tr>
<td>InternVL3-8B</td>
<td>70.5</td>
<td>30.0</td>
<td>38.5</td>
<td>25.7</td>
<td>39.5</td>
<td>44.5</td>
<td>41.4</td>
<td>73.6</td>
</tr>
<tr>
<td>VL-Rethinker-7B</td>
<td>75.5</td>
<td>29.3</td>
<td>47.2</td>
<td>25.4</td>
<td>37.8</td>
<td>47.0</td>
<td>43.7</td>
<td>68.3</td>
</tr>
<tr>
<td>Metis-RISE-7B</td>
<td>75.8</td>
<td>28.7</td>
<td>51.0</td>
<td>27.7</td>
<td>45.2</td>
<td>49.7</td>
<td>46.4</td>
<td>68.4</td>
</tr>

<tr>
<td style="border-top: 1px solid #000;">Baseline</td>
<td style="border-top: 1px solid #000;">67.4</td>
<td style="border-top: 1px solid #000;">26.2</td>
<td style="border-top: 1px solid #000;">41.1</td>
<td style="border-top: 1px solid #000;">20.2</td>
<td style="border-top: 1px solid #000;">34.5</td>
<td style="border-top: 1px solid #000;">45.6</td>
<td style="border-top: 1px solid #000; background-color: #fff2cc;">39.2</td>
<td style="border-top: 1px solid #000;">70.3</td>
</tr>
<tr>
<td>Baseline+RL</td>
<td>72.8</td>
<td>28.7</td>
<td>46.8</td>
<td>26.2</td>
<td>43.3</td>
<td>46.5</td>
<td>44.0</td>
<td style="background-color: #e1d5e7;">67.2</td>
</tr>
<tr>
<td><b>Metis-HOME</b></td>
<td>76.0</td>
<td>29.5</td>
<td>47.7</td>
<td>26.4</td>
<td>45.6</td>
<td>51.5</td>
<td style="background-color: #fff2cc;"><b>46.1</b></td>
<td style="background-color: #e1d5e7;"><b>71.2</b></td>
</tr>
</tbody>
</table>
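As a sanity check, the reasoning "Avg." column appears to be the plain arithmetic mean of the six reasoning benchmarks. A small sketch, with scores copied from the table above:

```python
# Reasoning-benchmark scores in the order MathVista, MathVision, MathVerse,
# DynaMath, WeMath, LogicVista (values taken from the table above).
metis_home = [76.0, 29.5, 47.7, 26.4, 45.6, 51.5]
baseline = [67.4, 26.2, 41.1, 20.2, 34.5, 45.6]

def reasoning_avg(scores: list[float]) -> float:
    """Arithmetic mean, rounded to one decimal as in the table."""
    return round(sum(scores) / len(scores), 1)

print(reasoning_avg(metis_home))  # 46.1, as reported
print(reasoning_avg(baseline))    # 39.2, as reported
print(round(reasoning_avg(metis_home) - reasoning_avg(baseline), 1))  # 6.9
```

The 6.9-point gap is the "+6.9% improvement on reasoning benchmarks" headline number.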

## 🔍 Usage Example

You can run the demo inference script in the `examples` folder:

```bash
python examples/demo_inference.py
```

## 📌 Acknowledgement
We sincerely appreciate [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) and [MM-EUREKA](https://github.com/ModalMinds/MM-EUREKA) for providing the reference training frameworks.

## 📖 Citation

```bibtex
@article{lan2025metis,
  title={Metis-HOME: Hybrid Optimized Mixture-of-Experts for Multimodal Reasoning},
  author={Lan, Xiaohan and Liu, Fanfan and Qiu, Haibo and Yang, Siqi and Ruan, Delian and Shi, Peng and Ma, Lin},
  journal={arXiv preprint arXiv:2510.20519},
  year={2025}
}
```