---
license: apache-2.0
---

<h1 align="center">Metis-HOME: Hybrid Optimized Mixture-of-Experts for Multimodal Reasoning</h1>

<h5 align="center">

[![arXiv](https://img.shields.io/badge/Arxiv-2510.20519-b31b1b.svg?logo=arXiv)](https://arxiv.org/pdf/2510.20519)&ensp;<a href='https://huggingface.co/mmthinking/Metis-HOME'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face%20-models-blue'></a>&ensp;[![Code License](https://img.shields.io/badge/License-Apache_2.0-green.svg)](https://github.com/tatsu-lab/stanford_alpaca/blob/main/LICENSE)

</h5>


## 💡 Overview
Current multimodal reasoning models face a critical dilemma: they often "overthink" simple tasks (inefficiency) and suffer general-capability degradation when optimized for reasoning.

We introduce **Metis-HOME** (**H**ybrid **O**ptimized **M**ixture-of-**E**xperts), a novel framework that enables a "Hybrid Thinking" paradigm. By structuring the original dense model (Qwen2.5-VL-7B) into two distinct expert branches, a Thinking Expert for complex reasoning and a Non-Thinking Expert for rapid inference, controlled by a lightweight router, Metis-HOME effectively resolves the reasoning-vs-generalization trade-off.

<div style="display: flex; justify-content: center; gap: 20px; flex-wrap: wrap;">
  <img src="https://raw.githubusercontent.com/MM-Thinking/Metis-HOME/main/assets/framework.png" alt="Metis-HOME framework overview" style="width:400px; max-width:100%;">
  <img src="https://raw.githubusercontent.com/MM-Thinking/Metis-HOME/main/assets/radar_chart.png" alt="Metis-HOME benchmark radar chart" style="width:400px; max-width:100%;">
</div>

## ✨ Highlights

- 🧠 Hybrid Thinking Paradigm: explicitly decouples "System 1" (fast, intuitive) and "System 2" (slow, deliberative) reasoning within a unified multimodal MoE architecture.
- 🔄 Router Mechanism: a lightweight, trainable router dynamically allocates queries by complexity, avoiding computational waste on simple tasks such as OCR or captioning.
- 🚀 Performance:
  - +6.9% improvement on reasoning benchmarks (MathVista, etc.) compared to the baseline.
  - ~1% gain on general benchmarks, reversing the degradation trend observed in other reasoning-specialized models.
- 🛠️ Efficient Training: a multi-stage strategy combining reinforcement learning (RL) for reasoning enhancement and mixed supervised fine-tuning (SFT) for expert specialization.

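The routing idea above can be sketched as a tiny binary gate. Everything in this snippet (the scalar complexity feature, the weights, the threshold, and the function names) is an invented placeholder for illustration; the actual router is a trainable module operating on the model's hidden states, not on a hand-crafted score.

```python
import math

# Hypothetical sketch of a lightweight router: a logistic gate that sends
# each query either to the thinking expert (slow, deliberative) or to the
# non-thinking expert (fast, intuitive). Weights and threshold are made up.

def route(complexity_score: float, weight: float = 4.0,
          bias: float = -2.0, threshold: float = 0.5) -> str:
    """Return which expert branch handles a query of given complexity."""
    p_think = 1.0 / (1.0 + math.exp(-(weight * complexity_score + bias)))
    return "thinking" if p_think >= threshold else "non-thinking"

# A multi-step math problem (high complexity) takes the deliberative path,
# while an OCR-style query takes the fast path.
print(route(0.9))  # thinking
print(route(0.1))  # non-thinking
```

The real benefit is the same as in this toy: simple queries never pay the token cost of a long chain-of-thought.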

## 📊 Results

### Thinking Ratio
As shown in the figure below, the **thinking ratio** of Metis-HOME reveals adaptive routing behavior:
- **High ratios (78%–98%)** on reasoning-heavy benchmarks (*WeMath*, *MathVision*, etc.), indicating effective use of the *thinking expert* for multi-step inference.
- **Low ratios (2%–5%)** on general benchmarks (*MMBench*, *OCRBench*), showing a preference for the *non-thinking expert*.

This aligns with our design: **deliberate reasoning for complex tasks** and **fast inference for simple ones**, optimizing computational efficiency.

<img src="https://raw.githubusercontent.com/MM-Thinking/Metis-HOME/main/assets/thinking_ratio_chart.png" alt="Thinking ratio of Metis-HOME across benchmarks" style="width:850px; max-width:100%;">

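Concretely, the thinking ratio is the fraction of a benchmark's queries that the router sends to the thinking expert. A minimal sketch (the per-query routing decisions below are made-up placeholders, not real benchmark data):

```python
# Thinking ratio = (#queries routed to the thinking expert) / (#queries).

def thinking_ratio(decisions: list[str]) -> float:
    """decisions: routing outcomes, each 'thinking' or 'non-thinking'."""
    if not decisions:
        return 0.0
    return sum(d == "thinking" for d in decisions) / len(decisions)

# e.g. a reasoning-heavy benchmark vs. a perception-style benchmark
math_bench = ["thinking"] * 9 + ["non-thinking"] * 1
ocr_bench = ["thinking"] * 1 + ["non-thinking"] * 24

print(f"{thinking_ratio(math_bench):.0%}")  # 90%
print(f"{thinking_ratio(ocr_bench):.0%}")   # 4%
```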

### Benchmarks
<table>
<thead>
<tr>
<th rowspan="2" style="text-align:left; vertical-align:bottom;">Model</th>
<th colspan="7" style="text-align:center; border-bottom:1px solid #ccc;">Reasoning</th>
<th style="text-align:center; border-bottom:1px solid #ccc;">General</th>
</tr>
<tr>
<th>MathVista</th>
<th>MathVision</th>
<th>MathVerse</th>
<th>DynaMath</th>
<th>WeMath</th>
<th>LogicVista</th>
<th>Avg.</th>
<th>Avg.</th>
</tr>
</thead>
<tbody>

<tr style="background-color: #e0e0e0;">
<td colspan="9" align="center"><strong><em>Proprietary Models</em></strong></td>
</tr>

<tr>
<td>Gemini-2.0-Pro</td>
<td>71.3</td>
<td>48.1</td>
<td>67.3</td>
<td>43.3</td>
<td>56.5</td>
<td>53.2</td>
<td>56.6</td>
<td>73.3</td>
</tr>
<tr>
<td>Gemini-2.0-Flash</td>
<td>70.4</td>
<td>43.6</td>
<td>47.8</td>
<td>42.1</td>
<td>47.4</td>
<td>52.3</td>
<td>50.6</td>
<td>72.6</td>
</tr>
<tr>
<td>Claude 3.7 Sonnet</td>
<td>66.8</td>
<td>41.9</td>
<td>46.7</td>
<td>39.7</td>
<td>49.3</td>
<td>58.2</td>
<td>50.4</td>
<td>70.1</td>
</tr>
<tr>
<td>ChatGPT-4o</td>
<td>60.0</td>
<td>31.2</td>
<td>40.6</td>
<td>34.5</td>
<td>45.8</td>
<td>52.8</td>
<td>44.2</td>
<td>72.0</td>
</tr>

<tr style="background-color: #e0e0e0;">
<td colspan="9" align="center"><strong><em>Open-source Models</em></strong></td>
</tr>

<tr>
<td>LLaVA-OneVision-72B</td>
<td>67.1</td>
<td>25.3</td>
<td>27.2</td>
<td>15.6</td>
<td>32.0</td>
<td>40.9</td>
<td>34.7</td>
<td>68.0</td>
</tr>
<tr>
<td>Kimi-VL-A3B-Instruct</td>
<td>66.0</td>
<td>21.8</td>
<td>34.1</td>
<td>18.0</td>
<td>32.3</td>
<td>42.7</td>
<td>35.8</td>
<td>69.1</td>
</tr>
<tr>
<td>InternVL3-8B</td>
<td>70.5</td>
<td>30.0</td>
<td>38.5</td>
<td>25.7</td>
<td>39.5</td>
<td>44.5</td>
<td>41.4</td>
<td>73.6</td>
</tr>
<tr>
<td>VL-Rethinker-7B</td>
<td>75.5</td>
<td>29.3</td>
<td>47.2</td>
<td>25.4</td>
<td>37.8</td>
<td>47.0</td>
<td>43.7</td>
<td>68.3</td>
</tr>
<tr>
<td>Metis-RISE-7B</td>
<td>75.8</td>
<td>28.7</td>
<td>51.0</td>
<td>27.7</td>
<td>45.2</td>
<td>49.7</td>
<td>46.4</td>
<td>68.4</td>
</tr>

<tr>
<td style="border-top: 1px solid #000;">Baseline</td>
<td style="border-top: 1px solid #000;">67.4</td>
<td style="border-top: 1px solid #000;">26.2</td>
<td style="border-top: 1px solid #000;">41.1</td>
<td style="border-top: 1px solid #000;">20.2</td>
<td style="border-top: 1px solid #000;">34.5</td>
<td style="border-top: 1px solid #000;">45.6</td>
<td style="border-top: 1px solid #000; background-color: #fff2cc;">39.2</td>
<td style="border-top: 1px solid #000;">70.3</td>
</tr>
<tr>
<td>Baseline+RL</td>
<td>72.8</td>
<td>28.7</td>
<td>46.8</td>
<td>26.2</td>
<td>43.3</td>
<td>46.5</td>
<td>44.0</td>
<td style="background-color: #e1d5e7;">67.2</td>
</tr>
<tr>
<td><b>Metis-HOME</b></td>
<td>76.0</td>
<td>29.5</td>
<td>47.7</td>
<td>26.4</td>
<td>45.6</td>
<td>51.5</td>
<td style="background-color: #fff2cc;"><b>46.1</b></td>
<td style="background-color: #e1d5e7;"><b>71.2</b></td>
</tr>
</tbody>
</table>
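As a sanity check, the reasoning "Avg." column appears to be the plain arithmetic mean of the six reasoning benchmarks. A small sketch, with scores copied from the table above:

```python
# Reasoning-benchmark scores in the order MathVista, MathVision, MathVerse,
# DynaMath, WeMath, LogicVista (values taken from the table above).
metis_home = [76.0, 29.5, 47.7, 26.4, 45.6, 51.5]
baseline = [67.4, 26.2, 41.1, 20.2, 34.5, 45.6]

def reasoning_avg(scores: list[float]) -> float:
    """Arithmetic mean, rounded to one decimal as in the table."""
    return round(sum(scores) / len(scores), 1)

print(reasoning_avg(metis_home))  # 46.1, as reported
print(reasoning_avg(baseline))    # 39.2, as reported
print(round(reasoning_avg(metis_home) - reasoning_avg(baseline), 1))  # 6.9
```

The 6.9-point gap is the "+6.9% improvement on reasoning benchmarks" headline number.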

## 🔍 Usage Example

You can run the demo inference script in the `examples` folder:

```bash
python examples/demo_inference.py
```

## 📌 Acknowledgement
We sincerely appreciate [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) and [MM-EUREKA](https://github.com/ModalMinds/MM-EUREKA) for providing the reference training frameworks.

## 📖 Citation

```bibtex
@article{lan2025metis,
  title={Metis-HOME: Hybrid Optimized Mixture-of-Experts for Multimodal Reasoning},
  author={Lan, Xiaohan and Liu, Fanfan and Qiu, Haibo and Yang, Siqi and Ruan, Delian and Shi, Peng and Ma, Lin},
  journal={arXiv preprint arXiv:2510.20519},
  year={2025}
}
```