Improve model card: Add abstract, update title, and align content with GitHub README

#17
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +1537 -202
README.md CHANGED
@@ -1,10 +1,11 @@
1
  ---
2
- pipeline_tag: image-text-to-text
3
  datasets:
4
  - openbmb/RLAIF-V-Dataset
5
- library_name: transformers
6
  language:
7
  - multilingual
8
  tags:
9
  - minicpm-v
10
  - vision
@@ -12,13 +13,117 @@ tags:
12
  - multi-image
13
  - video
14
  - custom_code
15
- license: apache-2.0
16
  ---
17
 
18
- <h1>A GPT-4o Level MLLM for Single Image, Multi Image and High-FPS Video Understanding on Your Phone</h1>
19
 
20
- [GitHub](https://github.com/OpenBMB/MiniCPM-o) | [CookBook](https://github.com/OpenSQZ/MiniCPM-V-CookBook) | [Technical Report](https://huggingface.co/papers/2509.18154) | [Demo](http://101.126.42.235:30910/) </a>
21
 
22
 
23
 
24
  ## MiniCPM-V 4.5
@@ -33,17 +138,18 @@ license: apache-2.0
33
  - ⚙️ **Controllable Hybrid Fast/Deep Thinking.** MiniCPM-V 4.5 supports both fast thinking for efficient routine use with competitive performance, and deep thinking for more complex problem solving. The fast/deep thinking mode can be switched in a highly controllable fashion to cover the efficiency/performance trade-offs of different user scenarios.
34
 
35
  - 💪 **Strong OCR, Document Parsing and Others.**
36
- Based on [LLaVA-UHD](https://arxiv.org/pdf/2403.11703) architecture, MiniCPM-V 4.5 can process high-resolution images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344), using 4x less visual tokens than most MLLMs. The model achieves **leading performance on OCRBench, surpassing proprietary models such as GPT-4o-latest and Gemini 2.5**. It also achieves state-of-the-art performance for PDF document parsing capability on OmniDocBench among general MLLMs. Based on the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it features **trustworthy behaviors**, outperforming GPT-4o-latest on MMHal-Bench, and supports **multilingual capabilities** in more than 30 languages.
 
37
 
38
  - 💫 **Easy Usage.**
39
- MiniCPM-V 4.5 can be easily used in various ways: (1) [llama.cpp](https://github.com/tc-mb/llama.cpp/blob/Support-MiniCPM-V-4.5/docs/multimodal/minicpmv4.5.md) and [ollama](https://github.com/tc-mb/ollama/tree/MIniCPM-V) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-V-4_5-int4), [GGUF](https://huggingface.co/openbmb/MiniCPM-V-4_5-gguf) and [AWQ](https://github.com/tc-mb/AutoAWQ) format quantized models in 16 sizes, (3) [SGLang](https://github.com/tc-mb/sglang/tree/main) and [vLLM](#efficient-inference-with-llamacpp-ollama-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks with [Transformers](https://github.com/tc-mb/transformers/tree/main) and [LLaMA-Factory](./docs/llamafactory_train_and_infer.md), (5) quick [local WebUI demo](#chat-with-our-demo-on-gradio), (6) optimized [local iOS app](https://github.com/tc-mb/MiniCPM-o-demo-iOS) on iPhone and iPad, and (7) online web demo on [server](http://101.126.42.235:30910/). See our [Cookbook](https://github.com/OpenSQZ/MiniCPM-V-CookBook) for full usages!
40
 
41
 
42
  ### Key Techniques
43
 
44
 
45
  <div align="center">
46
- <img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpm-v-4dot5-framework.png" , width=100%>
47
  </div>
48
 
49
  - **Architecture: Unified 3D-Resampler for High-density Video Compression.** MiniCPM-V 4.5 introduces a 3D-Resampler that overcomes the performance-efficiency trade-off in video understanding. By grouping and jointly compressing up to 6 consecutive video frames into just 64 tokens (the same token count used for a single image in the MiniCPM-V series), MiniCPM-V 4.5 achieves a 96× compression rate for video tokens. This allows the model to process more video frames without additional LLM computational cost, enabling high-FPS video and long video understanding. The architecture supports unified encoding for images, multi-image inputs, and videos, ensuring seamless capability and knowledge transfer.
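To make the 96× figure concrete, here is a back-of-the-envelope sketch, assuming each video frame is encoded as a single 448×448 slice that would yield 1024 ViT patch tokens before resampling (both numbers are illustrative assumptions, not figures stated in this card):

```python
# Back-of-the-envelope check of the 96x video-token compression rate.
# Assumption (illustrative only): one video frame corresponds to 1024
# pre-resampler ViT patch tokens (e.g., a 448x448 slice with 14x14 patches).
patch_tokens_per_frame = 32 * 32   # = 1024, assumed pre-resampler tokens
frames_per_group = 6               # up to 6 consecutive frames are grouped
tokens_per_group = 64              # 3D-Resampler output per frame group

compression_rate = frames_per_group * patch_tokens_per_frame / tokens_per_group
print(compression_rate)  # 96.0 -> the quoted 96x compression rate
```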
@@ -55,13 +161,15 @@ MiniCPM-V 4.5 can be easily used in various ways: (1) [llama.cpp](https://github
55
  ### Evaluation
56
 
57
  <div align="center">
58
- <img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/radar_minicpm_v45.png", width=60%>
59
  </div>
60
  <div align="center">
61
- <img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpmv_4_5_evaluation_result.png" , width=100%>
62
  </div>
63
 
64
- ### Inference Efficiency
65
 
66
  **OpenCompass**
67
  <div align="left">
@@ -134,125 +242,1123 @@ MiniCPM-V 4.5 can be easily used in various ways: (1) [llama.cpp](https://github
134
  </tr>
135
  </tbody>
136
  </table>
137
- </div>
138
 
139
- Both Video-MME and OpenCompass were evaluated using 8×A100 GPUs for inference. The reported inference time of Video-MME includes full model-side computation, and excludes the external cost of video frame extraction (dependent on specific frame extraction tools) for fair comparison.
140
 
141
  ### Examples
142
 
143
  <div align="center">
144
- <a href="https://www.youtube.com/watch?v=Cn23FujYMMU"><img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpmv4_5/MiniCPM-V%204.5-8.26_img.jpeg", width=70%></a>
145
  </div>
146
 
147
  <div style="display: flex; flex-direction: column; align-items: center;">
148
- <img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpmv4_5/en_case1.png" alt="en_case1" style="margin-bottom: 5px;">
149
- <img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpmv4_5/en_case2.png" alt="en_case2" style="margin-bottom: 5px;">
150
- <img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpmv4_5/en_case3.jpeg" alt="en_case3" style="margin-bottom: 5px;">
151
  </div>
152
 
153
- We deploy MiniCPM-V 4.5 on iPad M4 with [iOS demo](https://github.com/tc-mb/MiniCPM-o-demo-iOS). The demo video is the raw screen recording without editing.
154
 
155
- <div align="center">
156
- <img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpmv4_5/v45_en_handwriting.gif" width="45%" style="display: inline-block; margin: 0 10px;"/>
157
- <img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpmv4_5/v45_en_cot.gif" width="45%" style="display: inline-block; margin: 0 10px;"/>
158
- </div>
159
 
160
- <div align="center">
161
- <img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpmv4_5/v45_cn_handwriting.gif" width="45%" style="display: inline-block; margin: 0 10px;"/>
162
- <img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpmv4_5/v45_cn_travel.gif" width="45%" style="display: inline-block; margin: 0 10px;"/>
163
- </div>
164
-
165
- ## Framework Support Matrix
166
- <table>
167
- <thead>
168
- <tr>
169
- <th>Category</th>
170
- <th>Framework</th>
171
- <th>Cookbook Link</th>
172
- <th>Upstream PR</th>
173
- <th>Supported since(branch)</th>
174
- <th>Supported since(release)</th>
175
- </tr>
176
- </thead>
177
- <tbody>
178
- <tr>
179
- <td rowspan="2">Edge(On-device)</td>
180
- <td>Llama.cpp</td>
181
- <td><a href="https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/llama.cpp/minicpm-v4_5_llamacpp.md">Llama.cpp Doc</a></td>
182
- <td><a href="https://github.com/ggml-org/llama.cpp/pull/15575">#15575</a>(2025-08-26)</td>
183
- <td>master(2025-08-26)</td>
184
- <td><a href="https://github.com/ggml-org/llama.cpp/releases/tag/b6282">b6282</a></td>
185
- </tr>
186
- <tr>
187
- <td>Ollama</td>
188
- <td><a href="https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/ollama/minicpm-v4_5_ollama.md">Ollama Doc</a></td>
189
- <td><a href="https://github.com/ollama/ollama/pull/12078">#12078</a>(2025-08-26)</td>
190
- <td>Merging</td>
191
- <td>Waiting for official release</td>
192
- </tr>
193
- <tr>
194
- <td rowspan="2">Serving(Cloud)</td>
195
- <td>vLLM</td>
196
- <td><a href="https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/vllm/minicpm-v4_5_vllm.md">vLLM Doc</a></td>
197
- <td><a href="https://github.com/vllm-project/vllm/pull/23586">#23586</a>(2025-08-26)</td>
198
- <td>main(2025-08-27)</td>
199
- <td><a href="https://github.com/vllm-project/vllm/releases/tag/v0.10.2">v0.10.2</td>
200
- </tr>
201
- <tr>
202
- <td>SGLang</td>
203
- <td><a href="https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/sglang/MiniCPM-v4_5_sglang.md">SGLang Doc</a></td>
204
- <td><a href="https://github.com/sgl-project/sglang/pull/9610">#9610</a>(2025-08-26)</td>
205
- <td>Merging</td>
206
- <td>Waiting for official release</td>
207
- </tr>
208
- <tr>
209
- <td>Finetuning</td>
210
- <td>LLaMA-Factory</td>
211
- <td><a href="https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/finetune/finetune_llamafactory.md">LLaMA-Factory Doc</a></td>
212
- <td><a href="https://github.com/hiyouga/LLaMA-Factory/pull/9022">#9022</a>(2025-08-26)</td>
213
- <td>main(2025-08-26)</td>
214
- <td>Waiting for official release</td>
215
- </tr>
216
- <tr>
217
- <td rowspan="3">Quantization</td>
218
- <td>GGUF</td>
219
- <td><a href="https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/quantization/gguf/minicpm-v4_5_gguf_quantize.md">GGUF Doc</a></td>
220
- <td>—</td>
221
- <td>—</td>
222
- <td>—</td>
223
- </tr>
224
- <tr>
225
- <td>BNB</td>
226
- <td><a href="https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/quantization/bnb/minicpm-v4_5_bnb_quantize.md">BNB Doc</a></td>
227
- <td>—</td>
228
- <td>—</td>
229
- <td>—</td>
230
- </tr>
231
- <tr>
232
- <td>AWQ</td>
233
- <td><a href="https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/quantization/awq/minicpm-v4_5_awq_quantize.md">AWQ Doc</a></td>
234
- <td>—</td>
235
- <td>—</td>
236
- <td>—</td>
237
- </tr>
238
- <tr>
239
- <td>Demos</td>
240
- <td>Gradio Demo</td>
241
- <td><a href="https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/demo/web_demo/gradio/README.md">Gradio Demo Doc</a></td>
242
- <td>—</td>
243
- <td>—</td>
244
- <td>—</td>
245
- </tr>
246
- </tbody>
247
- </table>
248
-
249
- > Note: If you'd like us to prioritize support for another open-source framework, please let us know via this [short form](https://docs.google.com/forms/d/e/1FAIpQLSdyTUrOPBgWqPexs3ORrg47ZcZ1r4vFQaA4ve2iA7L9sMfMWw/viewform).
250
 
251
- ## Usage
252
 
253
- If you wish to enable thinking mode, provide the argument `enable_thinking=True` to the chat function.
254
 
255
- #### Chat with Image
256
  ```python
257
  import torch
258
  from PIL import Image
@@ -267,8 +1373,7 @@ tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_
267
 
268
  image = Image.open('./assets/minicpmo2_6/show_demo.jpg').convert('RGB')
269
 
270
- enable_thinking=False # If `enable_thinking=True`, the thinking mode is enabled.
271
- stream=True # If `stream=True`, the answer is string
272
 
273
  # First round chat
274
  question = "What is the landform in the picture?"
@@ -277,29 +1382,19 @@ msgs = [{'role': 'user', 'content': [image, question]}]
277
  answer = model.chat(
278
  msgs=msgs,
279
  tokenizer=tokenizer,
280
- enable_thinking=enable_thinking,
281
- stream=True
282
  )
283
-
284
- generated_text = ""
285
- for new_text in answer:
286
- generated_text += new_text
287
- print(new_text, flush=True, end='')
288
 
289
  # Second round chat, pass history context of multi-turn conversation
290
- msgs.append({"role": "assistant", "content": [generated_text]})
291
  msgs.append({"role": "user", "content": ["What should I pay attention to when traveling here?"]})
292
 
293
  answer = model.chat(
294
  msgs=msgs,
295
- tokenizer=tokenizer,
296
- stream=True
297
  )
298
-
299
- generated_text = ""
300
- for new_text in answer:
301
- generated_text += new_text
302
- print(new_text, flush=True, end='')
303
  ```
304
 
305
  You will get the following output:
@@ -321,8 +1416,72 @@ When traveling to a karst landscape like this, here are some important tips:
321
  By following these guidelines, you'll have a safe and enjoyable trip while appreciating the stunning natural beauty of places such as Guilin’s karst mountains.
322
  ```
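The snippet above uses the default fast mode and a plain string return. A minimal sketch of deep-thinking mode with streamed output follows, assuming `model`, `tokenizer`, and `image` are created as in the snippet above and that `chat` accepts the `enable_thinking` and `stream` keyword arguments referenced elsewhere in this card:

```python
# Minimal sketch: deep thinking + streaming for the same question.
question = "What is the landform in the picture?"
msgs = [{'role': 'user', 'content': [image, question]}]

answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    enable_thinking=True,  # switch to deep thinking for harder problems
    stream=True            # the answer becomes a generator of text chunks
)

generated_text = ""
for new_text in answer:
    generated_text += new_text
    print(new_text, flush=True, end='')

# To continue the conversation, append the streamed answer to the history:
msgs.append({"role": "assistant", "content": [generated_text]})
```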
323
 
324
 
325
  #### Chat with Video
326
 
327
  ```python
328
  ## The 3d-resampler compresses multiple frames into 64 tokens by introducing temporal_ids.
@@ -422,107 +1581,283 @@ answer = model.chat(
422
  )
423
  print(answer)
424
  ```
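The body of the video example is truncated above. For orientation, a minimal frame-sampling sketch; `decord`, the FPS cap, and the commented keyword arguments are assumptions for illustration, not requirements stated in this card:

```python
# Minimal sketch: uniformly sample frames from a video for chat().
# Assumes the `decord` package is installed; FPS cap and frame limit are arbitrary.
from PIL import Image
from decord import VideoReader, cpu

def sample_frames(video_path, target_fps=5, max_frames=180):
    vr = VideoReader(video_path, ctx=cpu(0))
    step = max(1, round(vr.get_avg_fps() / target_fps))
    idx = list(range(0, len(vr), step))[:max_frames]
    frames = vr.get_batch(idx).asnumpy()          # (N, H, W, 3) uint8
    return [Image.fromarray(f) for f in frames]

frames = sample_frames('video.mp4')
msgs = [{'role': 'user', 'content': frames + ["Describe the video."]}]
# answer = model.chat(msgs=msgs, tokenizer=tokenizer)  # video-specific kwargs
#                                                      # (e.g., temporal_ids) are
#                                                      # shown in the full example
```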
425
 
426
- #### Chat with multiple images
427
- <details>
428
- <summary> Click to show Python code running MiniCPM-V 4.5 with multiple images input. </summary>
429
-
430
  ```python
431
  import torch
432
- from PIL import Image
433
  from transformers import AutoModel, AutoTokenizer
434
 
435
- model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True,
436
- attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2
437
  model = model.eval().cuda()
438
- tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True)
439
 
440
- image1 = Image.open('image1.jpg').convert('RGB')
441
- image2 = Image.open('image2.jpg').convert('RGB')
442
- question = 'Compare image 1 and image 2, tell me about the differences between image 1 and image 2.'
443
 
444
- msgs = [{'role': 'user', 'content': [image1, image2, question]}]
445
 
446
- answer = model.chat(
447
  msgs=msgs,
448
- tokenizer=tokenizer
449
  )
450
- print(answer)
451
  ```
452
- </details>
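Collected into one runnable sketch for convenience (the same model ID, loading options, and `chat` interface as the single-image example above):

```python
# Minimal multi-image sketch, assembled from the fragments shown above.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True,
                                  attn_implementation='sdpa', torch_dtype=torch.bfloat16)
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True)

image1 = Image.open('image1.jpg').convert('RGB')
image2 = Image.open('image2.jpg').convert('RGB')
question = 'Compare image 1 and image 2, tell me about the differences between image 1 and image 2.'

# Multiple images simply appear side by side in the user turn's content list.
msgs = [{'role': 'user', 'content': [image1, image2, question]}]

answer = model.chat(msgs=msgs, tokenizer=tokenizer)
print(answer)
```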
453
 
454
 
455
- #### In-context few-shot learning
456
- <details>
457
- <summary> Click to view Python code running MiniCPM-V 4.5 with few-shot input. </summary>
458
 
459
  ```python
460
- import torch
461
- from PIL import Image
462
- from transformers import AutoModel, AutoTokenizer
463
 
464
- model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True,
465
- attn_implementation='sdpa', torch_dtype=torch.bfloat16)
466
- model = model.eval().cuda()
467
- tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True)
 
469
- question = "production date"
470
- image1 = Image.open('example1.jpg').convert('RGB')
471
- answer1 = "2023.08.04"
472
- image2 = Image.open('example2.jpg').convert('RGB')
473
- answer2 = "2007.04.24"
474
- image_test = Image.open('test.jpg').convert('RGB')
 
476
- msgs = [
477
- {'role': 'user', 'content': [image1, question]}, {'role': 'assistant', 'content': [answer1]},
478
- {'role': 'user', 'content': [image2, question]}, {'role': 'assistant', 'content': [answer2]},
479
- {'role': 'user', 'content': [image_test, question]}
480
- ]
481
 
482
- answer = model.chat(
  msgs=msgs,
484
- tokenizer=tokenizer
  )
486
- print(answer)
  ```
488
- </details>
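Collected into one runnable sketch for convenience (same model ID and `chat` interface as above):

```python
# Minimal in-context few-shot sketch, assembled from the fragments shown above.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True,
                                  attn_implementation='sdpa', torch_dtype=torch.bfloat16)
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True)

question = "production date"
image1 = Image.open('example1.jpg').convert('RGB')
answer1 = "2023.08.04"
image2 = Image.open('example2.jpg').convert('RGB')
answer2 = "2007.04.24"
image_test = Image.open('test.jpg').convert('RGB')

# Few-shot prompting: alternate user/assistant demonstration turns, then the test query.
msgs = [
    {'role': 'user', 'content': [image1, question]}, {'role': 'assistant', 'content': [answer1]},
    {'role': 'user', 'content': [image2, question]}, {'role': 'assistant', 'content': [answer2]},
    {'role': 'user', 'content': [image_test, question]}
]

answer = model.chat(msgs=msgs, tokenizer=tokenizer)
print(answer)
```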
489
 
490
 
491
- ## License
492
- #### Model License
493
- * The MiniCPM-o/V model weights and code are open-sourced under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM-V/blob/main/LICENSE) license.
494
- * To help us better understand and support our users, we would deeply appreciate it if you could consider optionally filling out a brief registration ["questionnaire"](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g).
495
 
496
- #### Statement
497
- * As an LMM, MiniCPM-V 4.5 generates contents by learning a large amount of multimodal corpora, but it cannot comprehend, express personal opinions or make value judgement. Anything generated by MiniCPM-V 4.5 does not represent the views and positions of the model developers
498
- * We will not be liable for any problems arising from the use of the MinCPM-V models, including but not limited to data security issues, risk of public opinion, or any risks and problems arising from the misdirection, misuse, dissemination or misuse of the model.
499
 
500
- ## Key Techniques and Other Multimodal Projects
501
 
502
- 👏 Welcome to explore key techniques of MiniCPM-V 4.5 and other multimodal projects of our team:
503
 
504
- [VisCPM](https://github.com/OpenBMB/VisCPM/tree/main) | [RLPR](https://github.com/OpenBMB/RLPR) | [RLHF-V](https://github.com/RLHF-V/RLHF-V) | [LLaVA-UHD](https://github.com/thunlp/LLaVA-UHD) | [RLAIF-V](https://github.com/RLHF-V/RLAIF-V)
505
 
506
- ## Citation
 
508
- If you find our work helpful, please consider citing our papers 📝 and liking this project ❤️!
509
 
510
- ```bib
511
- @misc{yu2025minicpmv45cookingefficient,
512
- title={MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe},
513
- author={Tianyu Yu and Zefan Wang and Chongyi Wang and Fuwei Huang and Wenshuo Ma and Zhihui He and Tianchi Cai and Weize Chen and Yuxiang Huang and Yuanqian Zhao and Bokai Xu and Junbo Cui and Yingjing Xu and Liqing Ruan and Luoyuan Zhang and Hanyu Liu and Jingkun Tang and Hongyuan Liu and Qining Guo and Wenhao Hu and Bingxiang He and Jie Zhou and Jie Cai and Ji Qi and Zonghao Guo and Chi Chen and Guoyang Zeng and Yuxuan Li and Ganqu Cui and Ning Ding and Xu Han and Yuan Yao and Zhiyuan Liu and Maosong Sun},
514
- year={2025},
515
- eprint={2509.18154},
516
- archivePrefix={arXiv},
517
- primaryClass={cs.LG},
518
- url={https://arxiv.org/abs/2509.18154},
519
- }
520
 
521
- @article{yao2024minicpm,
522
- title={MiniCPM-V: A GPT-4V Level MLLM on Your Phone},
523
- author={Yao, Yuan and Yu, Tianyu and Zhang, Ao and Wang, Chongyi and Cui, Junbo and Zhu, Hongji and Cai, Tianchi and Li, Haoyu and Zhao, Weilin and He, Zhihui and others},
524
- journal={Nat Commun 16, 5509 (2025)},
525
- year={2025}
526
- }
 
527
 
528
- ```
1
  ---
 
2
  datasets:
3
  - openbmb/RLAIF-V-Dataset
 
4
  language:
5
  - multilingual
6
+ library_name: transformers
7
+ license: apache-2.0
8
+ pipeline_tag: image-text-to-text
9
  tags:
10
  - minicpm-v
11
  - vision
 
13
  - multi-image
14
  - video
15
  - custom_code
 
16
  ---
17
 
18
+ <div align="center">
19
+
20
+ <img src="https://raw.githubusercontent.com/OpenBMB/MiniCPM-o/main/assets/minicpm_v_and_minicpm_o_title.png" width="500em" ></img>
21
+
22
+ # MiniCPM-V 4.5: A GPT-4o Level MLLM for Single Image, Multi Image and High-FPS Video Understanding on Your Phone
23
+
24
+ **[中文](./README_zh.md) |
25
+ English**
26
+
27
+ <span style="display: inline-flex; align-items: center; margin-right: 2px;">
28
+ <img src="https://raw.githubusercontent.com/OpenBMB/MiniCPM-o/main/assets/wechat.png" alt="WeChat" style="margin-right: 4px;">
29
+ <a href="docs/wechat.md" target="_blank"> WeChat</a> &nbsp;|
30
+ </span>
31
+ &nbsp;
32
+ <span style="display: inline-flex; align-items: center; margin-left: -8px;">
33
+ <img src="https://raw.githubusercontent.com/OpenBMB/MiniCPM-o/main/assets/discord.png" alt="Discord" style="margin-right: 4px;">
34
+ <a href="https://discord.gg/rftuRMbqzf" target="_blank"> Discord</a> &nbsp;
35
+ </span>
36
+
37
+ <p align="center">
38
+ **[GitHub](https://github.com/OpenBMB/MiniCPM-V)** | MiniCPM-V 4.5 <a href="https://huggingface.co/openbmb/MiniCPM-V-4_5">🤗</a> <a href="http://101.126.42.235:30910/">🤖</a> | MiniCPM-o 2.6 <a href="https://huggingface.co/openbmb/MiniCPM-o-2_6">🤗</a> <a href="https://minicpm-omni-webdemo-us.modelbest.cn/"> 🤖</a> | <a href="https://github.com/OpenSQZ/MiniCPM-V-Cookbook">🍳 Cookbook</a> |
39
+ <a href="https://huggingface.co/papers/2509.18154">📄 Technical Report</a>
40
+ </p>
41
+
42
+ </div>
43
+
44
+ ## Paper Abstract
45
+
46
+ Multimodal Large Language Models (MLLMs) are undergoing rapid progress and represent the frontier of AI development. However, their training and inference efficiency have emerged as a core bottleneck in making MLLMs more accessible and scalable. To address the challenges, we present MiniCPM-V 4.5, an 8B parameter model designed for high efficiency and strong performance. We introduce three core improvements in model architecture, data strategy and training method: a unified 3D-Resampler model architecture for highly compact encoding over images and videos, a unified learning paradigm for document knowledge and text recognition without heavy data engineering, and a hybrid reinforcement learning strategy for proficiency in both short and long reasoning modes. Comprehensive experimental results in OpenCompass evaluation show that MiniCPM-V 4.5 surpasses widely used proprietary models such as GPT-4o-latest, and significantly larger open-source models such as Qwen2.5-VL 72B. Notably, the strong performance is achieved with remarkable efficiency. For example, on the widely adopted VideoMME benchmark, MiniCPM-V 4.5 achieves state-of-the-art performance among models under 30B size, using just 46.7% GPU memory cost and 8.7% inference time of Qwen2.5-VL 7B.
47
+
48
+ **MiniCPM-V** is a series of efficient end-side multimodal LLMs (MLLMs), which accept images, videos and text as inputs and deliver high-quality text outputs. **MiniCPM-o** additionally takes audio as inputs and provides high-quality speech outputs in an end-to-end fashion. Since February 2024, we have released 7 versions of the model, aiming to achieve **strong performance and efficient deployment**. The most notable models in the series currently include:
49
+
50
+
51
+ - **MiniCPM-V 4.5**: 🔥🔥🔥 The latest and most capable model in the MiniCPM-V series. With a total of 8B parameters, this model **outperforms GPT-4o-latest, Gemini-2.0 Pro, and Qwen2.5-VL 72B** in vision-language capabilities, making it the most performant on-device multimodal model in the open-source community. This version brings **new features including efficient high-FPS and long video understanding (up to 96x compression rate for video tokens), controllable hybrid fast/deep thinking, strong handwritten OCR and complex table/document parsing**. It also advances MiniCPM-V's popular features such as trustworthy behavior, multilingual support and end-side deployability.
52
+
53
+ - **MiniCPM-o 2.6**: ⭐⭐⭐ The most capable model in the MiniCPM-o series. With a total of 8B parameters, this end-to-end model **achieves comparable performance to GPT-4o-202405 in vision, speech, and multimodal live streaming**, making it one of the most versatile and performant models in the open-source community. For the new voice mode, MiniCPM-o 2.6 **supports bilingual real-time speech conversation with configurable voices**, and also allows for fun capabilities such as emotion/speed/style control, end-to-end voice cloning, role play, etc. Due to its superior token density, MiniCPM-o 2.6 can for the first time **support multimodal live streaming on end-side devices** such as iPad.
54
+
55
+
56
+
57
+
58
+ ## News
59
+
60
+ #### 📌 Pinned
61
+
62
+ * [2025.09.18] 📢📢📢 MiniCPM-V 4.5 technical report is now released! See [here](./docs/MiniCPM_V_4_5_Technical_Report.pdf).
63
+
64
+ * [2025.09.01] ⭐⭐⭐ MiniCPM-V 4.5 has been officially supported by [llama.cpp](https://github.com/ggml-org/llama.cpp/pull/15575), [vLLM](https://github.com/vllm-project/vllm/pull/23586), and [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory/pull/9022). You are welcome to use it directly through these official channels! Support for additional frameworks such as [Ollama](https://github.com/ollama/ollama/pull/12078) and [SGLang](https://github.com/sgl-project/sglang/pull/9610) is actively in progress.
65
+
66
+ * [2025.08.26] 🔥🔥🔥 We open-source MiniCPM-V 4.5, which outperforms GPT-4o-latest, Gemini-2.0 Pro, and Qwen2.5-VL 72B. It advances popular capabilities of MiniCPM-V, and brings useful new features. Try it now!
67
+
68
+ * [2025.08.01] ⭐⭐⭐ We open-sourced the [MiniCPM-V & o Cookbook](https://github.com/OpenSQZ/MiniCPM-V-CookBook)! It provides comprehensive guides for diverse user scenarios, paired with our new [Docs Site](https://minicpm-o.readthedocs.io/en/latest/index.html) for smoother onboarding.
69
+
70
+ * [2025.06.20] ⭐⭐⭐ Our official [Ollama repository](https://ollama.com/openbmb) is released. Try our latest models with [one click](https://ollama.com/openbmb/minicpm-o2.6)!
71
+
72
+ * [2025.03.01] 🚀🚀🚀 RLAIF-V, the alignment technique of MiniCPM-o, is accepted as a CVPR 2025 Highlight! The [code](https://github.com/RLHF-V/RLAIF-V), [dataset](https://huggingface.co/datasets/openbmb/RLAIF-V-Dataset), and [paper](https://arxiv.org/abs/2405.17220) are open-sourced!
73
+
74
+ * [2025.01.24] 📢📢📢 MiniCPM-o 2.6 technical report is released! See [here](https://openbmb.notion.site/MiniCPM-o-2-6-A-GPT-4o-Level-MLLM-for-Vision-Speech-and-Multimodal-Live-Streaming-on-Your-Phone-185ede1b7a558042b5d5e45e6b237da9).
75
+
76
+ * [2025.01.19] 📢 **ATTENTION!** We are currently working on merging MiniCPM-o 2.6 into the official repositories of llama.cpp, Ollama, and vllm. Until the merge is complete, please USE OUR LOCAL FORKS of [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md), [Ollama](https://github.com/OpenBMB/ollama/blob/minicpm-v2.6/examples/minicpm-v2.6/README.md), and [vllm](https://github.com/OpenBMB/MiniCPM-o?tab=readme-ov-file#efficient-inference-with-llamacpp-ollama-vllm). **Using the official repositories before the merge may lead to unexpected issues**.
77
+
78
+ * [2025.01.19] ⭐⭐⭐ MiniCPM-o tops GitHub Trending and reaches top-2 on Hugging Face Trending!
79
+
80
+ * [2025.01.17] We have updated the usage of the MiniCPM-o 2.6 int4 quantized version and resolved the model initialization error. Click [here](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4) and try it now!
81
+
82
+ * [2025.01.13] 🔥🔥🔥 We open-source MiniCPM-o 2.6, which matches GPT-4o-202405 on vision, speech and multimodal live streaming. It advances popular capabilities of MiniCPM-V 2.6, and supports various new fun features. Try it now!
83
+
84
+ * [2024.08.17] 🚀🚀🚀 MiniCPM-V 2.6 is now fully supported by [official](https://github.com/ggerganov/llama.cpp) llama.cpp! GGUF models of various sizes are available [here](https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf).
85
+
86
+ * [2024.08.06] 🔥🔥🔥 We open-source MiniCPM-V 2.6, which outperforms GPT-4V on single image, multi-image and video understanding. It advances popular features of MiniCPM-Llama3-V 2.5, and can support real-time video understanding on iPad. Try it now!
87
 
88
+ * [2024.08.03] MiniCPM-Llama3-V 2.5 technical report is released! See [here](https://arxiv.org/abs/2408.01800).
89
 
90
+ * [2024.05.23] 🔥🔥🔥 MiniCPM-V tops GitHub Trending and Hugging Face Trending! Our demo, recommended by Hugging Face Gradio’s official account, is available [here](https://huggingface.co/spaces/openbmb/MiniCPM-Llama3-V-2_5). Come and try it out!
91
+
92
+ <br>
93
+
94
+ <details>
95
+ <summary>Click to view more news.</summary>
96
+
97
+ * [2025.08.02] 🚀🚀🚀 We open-source MiniCPM-V 4.0, which outperforms GPT-4.1-mini-20250414 in image understanding. It advances popular features of MiniCPM-V 2.6 and greatly improves efficiency. We also open-source an iOS app for iPhone and iPad. Try it now!
98
+
99
+ * [2025.01.23] 💡💡💡 MiniCPM-o 2.6 is now supported by [Align-Anything](https://github.com/PKU-Alignment/align-anything), a framework by PKU-Alignment Team for aligning any-to-any modality large models with human intentions. It supports DPO and SFT fine-tuning on both vision and audio. Try it now!
100
+
101
+ * [2024.08.15] We now also support multi-image SFT. For more details, please refer to the [document](https://github.com/OpenBMB/MiniCPM-V/tree/main/finetune).
102
+ * [2024.08.14] MiniCPM-V 2.6 now also supports [fine-tuning](https://github.com/modelscope/ms-swift/issues/1613) with the SWIFT framework!
103
+ * [2024.08.10] 🚀🚀🚀 MiniCPM-Llama3-V 2.5 is now fully supported by [official](https://github.com/ggerganov/llama.cpp) llama.cpp! GGUF models of various sizes are available [here](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf).
104
+
105
+ * [2024.07.19] MiniCPM-Llama3-V 2.5 supports vLLM now! See [here](#inference-with-vllm).
106
+
107
+ * [2024.06.03] Now, you can run MiniCPM-Llama3-V 2.5 on multiple low VRAM GPUs (12 GB or 16 GB) by distributing the model's layers across multiple GPUs. For more details, check this [link](https://github.com/OpenBMB/MiniCPM-V/blob/main/docs/inference_on_multiple_gpus.md).
108
+ * [2024.05.28] 🚀🚀🚀 MiniCPM-Llama3-V 2.5 is now fully supported in llama.cpp and Ollama! Please pull the latest code **of our provided forks** ([llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-v2.5/examples/minicpmv/README.md), [Ollama](https://github.com/OpenBMB/ollama/tree/minicpm-v2.5/examples/minicpm-v2.5)). GGUF models in various sizes are available [here](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf/tree/main). The MiniCPM-Llama3-V 2.5 series is **not supported by the official repositories yet**, and we are working hard to merge PRs. Please stay tuned!
109
+
110
+ * [2024.05.28] 💫 We now support LoRA fine-tuning for MiniCPM-Llama3-V 2.5, using only 2 V100 GPUs! See more statistics [here](https://github.com/OpenBMB/MiniCPM-V/tree/main/finetune#model-fine-tuning-memory-usage-statistics).
111
+
112
+ * [2024.05.25] MiniCPM-Llama3-V 2.5 now supports streaming outputs and customized system prompts. Try it [here](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5#usage)!
113
+ * [2024.05.24] We release the MiniCPM-Llama3-V 2.5 [gguf](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf), which supports [llama.cpp](#inference-with-llamacpp) inference and provides smooth decoding at 6~8 tokens/s on mobile phones. Try it now!
114
+
115
+ * [2024.05.23] 🔍 We've released a comprehensive comparison between Phi-3-vision-128k-instruct and MiniCPM-Llama3-V 2.5, including benchmark evaluations, multilingual capabilities, and inference efficiency 🌟📊🌍🚀. Click [here](./docs/compare_with_phi-3_vision.md) to view more details.
116
+
117
+ * [2024.05.20] We open-source MiniCPM-Llama3-V 2.5. It has improved OCR capability and supports 30+ languages, making it the first end-side MLLM to achieve GPT-4V-level performance! We provide [efficient inference](#deployment-on-mobile-phone) and [simple fine-tuning](./finetune/readme.md). Try it now!
118
+ * [2024.04.23] MiniCPM-V-2.0 supports vLLM now! Click [here](#inference-with-vllm) to view more details.
119
+ * [2024.04.18] We create a HuggingFace Space to host the demo of MiniCPM-V 2.0 at [here](https://huggingface.co/spaces/openbmb/MiniCPM-V-2)!
120
+ * [2024.04.17] MiniCPM-V-2.0 supports deploying [WebUI Demo](#webui-demo) now!
121
+ * [2024.04.15] MiniCPM-V-2.0 now also supports [fine-tuning](https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/minicpm-v-2最佳实践.md) with the SWIFT framework!
122
+ * [2024.04.12] We open-source MiniCPM-V 2.0, which achieves comparable performance with Gemini Pro in understanding scene text and outperforms strong Qwen-VL-Chat 9.6B and Yi-VL 34B on <a href="https://rank.opencompass.org.cn/leaderboard-multimodal">OpenCompass</a>, a comprehensive evaluation over 11 popular benchmarks. Click <a href="https://openbmb.vercel.app/minicpm-v-2">here</a> to view the MiniCPM-V 2.0 technical blog.
123
+ * [2024.03.14] MiniCPM-V now supports [fine-tuning](https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/minicpm-v最佳实践.md) with the SWIFT framework. Thanks to [Jintao](https://github.com/Jintao-Huang) for the contribution!
124
+ * [2024.03.01] MiniCPM-V can now be deployed on Mac!
125
+ * [2024.02.01] We open-source MiniCPM-V and OmniLMM-12B, which support efficient end-side deployment and powerful multimodal capabilities, respectively.
126
+ </details>
127
 
128
 
129
  ## MiniCPM-V 4.5
 
138
  - ⚙️ **Controllable Hybrid Fast/Deep Thinking.** MiniCPM-V 4.5 supports both fast thinking for efficient routine use with competitive performance, and deep thinking for more complex problem solving. The fast/deep thinking mode can be switched in a highly controllable fashion to cover the efficiency/performance trade-offs of different user scenarios.
139
 
140
  - 💪 **Strong OCR, Document Parsing and Others.**
141
+ Based on [LLaVA-UHD](https://arxiv.org/pdf/2403.11703) architecture, MiniCPM-V 4.5 can process high-resolution images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344), using 4x fewer visual tokens than most MLLMs. The model achieves **leading performance on OCRBench, surpassing proprietary models such as GPT-4o-latest and Gemini 2.5**. It also achieves state-of-the-art performance for PDF document parsing capability on OmniDocBench among general MLLMs. Based on the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it features **trustworthy behaviors**, outperforming GPT-4o-latest on MMHal-Bench, and supports **multilingual capabilities** in more than 30 languages.
142
+
143
 
144
  - 💫 **Easy Usage.**
145
+ MiniCPM-V 4.5 can be easily used in various ways: (1) [llama.cpp](https://github.com/tc-mb/llama.cpp/blob/Support-MiniCPM-V-4.5/docs/multimodal/minicpmv4.5.md) and [ollama](https://github.com/tc-mb/ollama/tree/MIniCPM-V) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-V-4_5-int4), [GGUF](https://huggingface.co/openbmb/MiniCPM-V-4_5-gguf) and [AWQ](https://github.com/tc-mb/AutoAWQ) format quantized models in 16 sizes, (3) [SGLang](https://github.com/sgl-project/sglang/tree/main) and [vLLM](#efficient-inference-with-llamacpp-ollama-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks with [Transformers](https://github.com/tc-mb/transformers/tree/main) and [LLaMA-Factory](./docs/llamafactory_train_and_infer.md), (5) quick [local WebUI demo](#chat-with-our-demo-on-gradio), (6) optimized [local iOS app](https://github.com/tc-mb/MiniCPM-o-demo-iOS) on iPhone and iPad, and (7) online web demo on [server](http://101.126.42.235:30910/). See our [Cookbook](https://github.com/OpenSQZ/MiniCPM-V-CookBook) for full usage!
146
 
147
 
148
  ### Key Techniques
149
 
150
 
151
  <div align="center">
152
+ <img src="https://raw.githubusercontent.com/OpenBMB/MiniCPM-o/main/assets/minicpm-v-4dot5-framework.png" , width=100%>
153
  </div>
154
 
155
  - **Architecture: Unified 3D-Resampler for High-density Video Compression.** MiniCPM-V 4.5 introduces a 3D-Resampler that overcomes the performance-efficiency trade-off in video understanding. By grouping and jointly compressing up to 6 consecutive video frames into just 64 tokens (the same token count used for a single image in the MiniCPM-V series), MiniCPM-V 4.5 achieves a 96× compression rate for video tokens. This allows the model to process more video frames without additional LLM computational cost, enabling high-FPS video and long video understanding. The architecture supports unified encoding for images, multi-image inputs, and videos, ensuring seamless capability and knowledge transfer.
 
161
  ### Evaluation
162
 
163
  <div align="center">
164
+ <img src="https://raw.githubusercontent.com/OpenBMB/MiniCPM-o/main/assets/radar_minicpm_v45.png", width=60%>
165
  </div>
166
  <div align="center">
167
+ <img src="https://raw.githubusercontent.com/OpenBMB/MiniCPM-o/main/assets/minicpmv_4_5_evaluation_result.png" , width=80%>
168
  </div>
169
 
170
+
171
+ ### Inference Efficiency
172
+
173
 
174
  **OpenCompass**
175
  <div align="left">
 
242
  </tr>
243
  </tbody>
244
  </table>
245
+ </div>
246
+
247
+ Both Video-MME and OpenCompass were evaluated using 8×A100 GPUs for inference. The reported inference time of Video-MME includes full model-side computation, and excludes the external cost of video frame extraction (dependent on specific frame extraction tools) for fair comparison.
248
+
249
+
250
+ ### Examples
251
+
252
+ <div align="center">
253
+ <a href="https://www.youtube.com/watch?v=Cn23FujYMMU"><img src="https://raw.githubusercontent.com/OpenBMB/MiniCPM-o/main/assets/minicpmv4_5/MiniCPM-V%204.5-8.26_img.jpeg", width=70%></a>
254
+ </div>
255
+
256
+ <div style="display: flex; flex-direction: column; align-items: center;">
257
+ <img src="https://raw.githubusercontent.com/OpenBMB/MiniCPM-o/main/assets/minicpmv4_5/en_case1.png" alt="en_case1" style="margin-bottom: 5px;">
258
+ <img src="https://raw.githubusercontent.com/OpenBMB/MiniCPM-o/main/assets/minicpmv4_5/en_case2.png" alt="en_case2" style="margin-bottom: 5px;">
259
+ <img src="https://raw.githubusercontent.com/OpenBMB/MiniCPM-o/main/assets/minicpmv4_5/en_case3.jpeg" alt="en_case3" style="margin-bottom: 5px;">
260
+ </div>
261
+
262
+ <details>
263
+ <summary>Click to view more cases.</summary>
264
+ <div style="display: flex; flex-direction: column; align-items: center;">
265
+ <img src="https://raw.githubusercontent.com/OpenBMB/MiniCPM-o/main/assets/minicpmv4_5/zh_extra.jpeg" alt="zh_extra" style="margin-bottom: 5px;">
266
+ </div>
267
+
268
+ </details>
269
+
270
+ We deploy MiniCPM-V 4.5 on an iPad M4 with the [iOS demo](https://github.com/tc-mb/MiniCPM-o-demo-iOS). The demo video is a raw screen recording without editing.
271
+
272
+ <table align="center">
273
+ <p align="center">
274
+ <img src="https://raw.githubusercontent.com/OpenBMB/MiniCPM-o/main/assets/minicpmv4_5/v45_en_handwriting.gif" width=45%/>
275
+ &nbsp;&nbsp;&nbsp;&nbsp;
276
+ <img src="https://raw.githubusercontent.com/OpenBMB/MiniCPM-o/main/assets/minicpmv4_5/v45_en_cot.gif" width=45%/>
277
+ </p>
278
+ <p align="center">
279
+ <img src="https://raw.githubusercontent.com/OpenBMB/MiniCPM-o/main/assets/minicpmv4_5/v45_cn_handwriting.gif" width=45%/>
280
+ &nbsp;&nbsp;&nbsp;&nbsp;
281
+ <img src="https://raw.githubusercontent.com/OpenBMB/MiniCPM-o/main/assets/minicpmv4_5/v45_cn_travel.gif" width=45%/>
282
+ </p>
283
+ </table>
284
+
285
+ ## MiniCPM-o 2.6
286
+
287
+ **MiniCPM-o 2.6** is the latest and most capable model in the MiniCPM-o series. The model is built in an end-to-end fashion based on SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.6, and introduces new features for real-time speech conversation and multimodal live streaming. Notable features of MiniCPM-o 2.6 include:
288
+
289
+ - 🔥 **Leading Visual Capability.**
290
+ MiniCPM-o 2.6 achieves an average score of 70.2 on OpenCompass, a comprehensive evaluation of 8 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4o-202405, Gemini 1.5 Pro, and Claude 3.5 Sonnet** for single image understanding. It also **outperforms GPT-4V and Claude 3.5 Sonnet** in multi-image and video understanding, and shows promising in-context learning capability.
291
+
292
+ - 🎙 **State-of-the-art Speech Capability.** MiniCPM-o 2.6 supports **bilingual real-time speech conversation with configurable voices** in English and Chinese. It **outperforms GPT-4o-realtime on audio understanding tasks** such as ASR and STT translation, and shows **state-of-the-art performance on speech conversation in both semantic and acoustic evaluations in the open-source community**. It also allows for fun features such as emotion/speed/style control, end-to-end voice cloning, role play, etc.
293
+
294
+ - 🎬 **Strong Multimodal Live Streaming Capability.** As a new feature, MiniCPM-o 2.6 can **accept continuous video and audio streams independent of user queries, and support real-time speech interaction**. It **outperforms GPT-4o-202408 and Claude 3.5 Sonnet and shows state-of-the-art performance in the open-source community on StreamingBench**, a comprehensive benchmark for real-time video understanding, omni-source (video & audio) understanding, and multimodal contextual understanding.
295
+
296
+ - 💪 **Strong OCR Capability and Others.**
297
+ Advancing popular visual capabilities from the MiniCPM-V series, MiniCPM-o 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). It achieves **state-of-the-art performance on OCRBench for models under 25B, surpassing proprietary models such as GPT-4o-202405**.
298
+ Based on the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it features **trustworthy behaviors**, outperforming GPT-4o and Claude 3.5 Sonnet on MMHal-Bench, and supports **multilingual capabilities** on more than 30 languages.
299
+
300
+
301
+ - 🚀 **Superior Efficiency.**
302
+ In addition to its friendly size, MiniCPM-o 2.6 also shows **state-of-the-art token density** (i.e., the number of pixels encoded into each visual token). **It produces only 640 tokens when processing a 1.8M pixel image, which is 75% fewer than most models**. This directly improves the inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-o 2.6 can efficiently support **multimodal live streaming** on end-side devices such as iPads.
303
+
304
+ - 💫 **Easy Usage.**
305
+ MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-o-2_6-gguf) format quantized models in 16 sizes, (3) [vLLM](#efficient-inference-with-llamacpp-ollama-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks with [LLaMA-Factory](./docs/llamafactory_train_and_infer.md), (5) quick [local WebUI demo](#chat-with-our-demo-on-gradio), and (6) online web demo on [server](https://minicpm-omni-webdemo-us.modelbest.cn/).
306
+
307
+ **Model Architecture.**
308
+
309
+ - **End-to-end Omni-modal Architecture.** Different modality encoders/decoders are connected and trained in an **end-to-end** fashion to fully exploit rich multimodal knowledge, using only the CE loss.
310
+ - **Omni-modal Live Streaming Mechanism.** (1) We change the offline modality encoder/decoders into online ones for **streaming inputs/outputs.** (2) We devise a **time-division multiplexing (TDM) mechanism** for omni-modality streaming processing in the LLM backbone. It divides parallel omni-modality streams into sequential info within small periodic time slices.
311
+ - **Configurable Speech Modeling Design.** We devise a multimodal system prompt, including a traditional text system prompt and **a new audio system prompt to determine the assistant voice**. This enables flexible voice configuration at inference time, and also facilitates end-to-end voice cloning and description-based voice creation.
312
+
313
+ <div align="center">
314
+ <img src="https://raw.githubusercontent.com/OpenBMB/MiniCPM-o/main/assets/minicpm-o-26-framework-v2.png" , width=80%>
315
+ </div>
316
+
317
+
318
+ ### Evaluation
319
+
320
+ <div align="center">
321
+ <img src="https://raw.githubusercontent.com/OpenBMB/MiniCPM-o/main/assets/radar.jpg", width=80%>
322
+ </div>
323
+
324
+ <details>
325
+ <summary>Click to view visual understanding results.</summary>
326
+
327
+ **Image Understanding**
328
+
329
+ <div align="center">
330
+ <table style="margin: 0px auto;">
331
+ <thead>
332
+ <tr>
333
+ <th align="left">Model</th>
334
+ <th>Size</th>
335
+ <th>Token Density<sup>+</sup></th>
336
+ <th>OpenCompass</th>
337
+ <th>OCRBench</th>
338
+ <th>MathVista mini</th>
339
+ <th>ChartQA</th>
340
+ <th>MMVet</th>
341
+ <th>MMStar</th>
342
+ <th>MME</th>
343
+ <th>MMB1.1 test</th>
344
+ <th>AI2D</th>
345
+ <th>MMMU val</th>
346
+ <th>HallusionBench</th>
347
+ <th>TextVQA val</th>
348
+ <th>DocVQA test</th>
349
+ <th>MathVerse mini</th>
350
+ <th>MathVision</th>
351
+ <th>MMHal Score</th>
352
+ </tr>
353
+ </thead>
354
+ <tbody align="center">
355
+ <tr>
356
+ <td colspan="19" align="left"><strong>Proprietary</strong></td>
357
+ </tr>
358
+ <tr>
359
+ <td nowrap="nowrap" align="left">GPT-4o-20240513</td>
360
+ <td>-</td>
361
+ <td>1088</td>
362
+ <td><u>69.9</u></td>
363
+ <td>736</td>
364
+ <td>61.3</td>
365
+ <td>85.7</td>
366
+ <td><strong>69.1</strong></td>
367
+ <td>63.9</td>
368
+ <td>2328.7</td>
369
+ <td>82.2</td>
370
+ <td>84.6</td>
371
+ <td><strong>69.2</strong></td>
372
+ <td><strong>55.0</strong></td>
373
+ <td>-</td>
374
+ <td>92.8</td>
375
+ <td><strong>50.2</strong></td>
376
+ <td><strong>30.4</strong></td>
377
+ <td><u>3.6</u></td>
378
+ </tr>
379
+ <tr>
380
+ <td nowrap="nowrap" align="left">Claude3.5-Sonnet</td>
381
+ <td>-</td>
382
+ <td>750</td>
383
+ <td>67.9</td>
384
+ <td>788</td>
385
+ <td>61.6</td>
386
+ <td><strong>90.8</strong></td>
387
+ <td>66.0</td>
388
+ <td>62.2</td>
389
+ <td>1920.0</td>
390
+ <td>78.5</td>
391
+ <td>80.2</td>
392
+ <td><u>65.9</u></td>
393
+ <td>49.9</td>
394
+ <td>-</td>
395
+ <td><strong>95.2</strong></td>
396
+ <td>-</td>
397
+ <td>-</td>
398
+ <td>3.4</td>
399
+ </tr>
400
+ <tr>
401
+ <td nowrap="nowrap" align="left">Gemini 1.5 Pro</td>
402
+ <td>-</td>
403
+ <td>-</td>
404
+ <td>64.4</td>
405
+ <td>754</td>
406
+ <td>57.7</td>
407
+ <td>81.3</td>
408
+ <td>64.0</td>
409
+ <td>59.1</td>
410
+ <td>2110.6</td>
411
+ <td>73.9</td>
412
+ <td>79.1</td>
413
+ <td>60.6</td>
414
+ <td>45.6</td>
415
+ <td>73.5</td>
416
+ <td>86.5</td>
417
+ <td>-</td>
418
+ <td>19.2</td>
419
+ <td>-</td>
420
+ </tr>
421
+ <tr>
422
+ <td nowrap="nowrap" align="left">GPT-4o-mini-20240718</td>
423
+ <td>-</td>
424
+ <td>1088</td>
425
+ <td>64.1</td>
426
+ <td>785</td>
427
+ <td>52.4</td>
428
+ <td>-</td>
429
+ <td>66.9</td>
430
+ <td>54.8</td>
431
+ <td>2003.4</td>
432
+ <td>76.0</td>
433
+ <td>77.8</td>
434
+ <td>60.0</td>
435
+ <td>46.1</td>
436
+ <td>-</td>
437
+ <td>-</td>
438
+ <td>-</td>
439
+ <td>-</td>
440
+ <td>3.3</td>
441
+ </tr>
442
+ <tr>
443
+ <td colspan="19" align="left"><strong>Open Source</strong></td>
444
+ </tr>
445
+ <tr>
446
+ <td nowrap="nowrap" align="left">Cambrian-34B</td>
447
+ <td>34B</td>
448
+ <td><u>1820</u></td>
449
+ <td>58.3</td>
450
+ <td>591</td>
451
+ <td>50.3</td>
452
+ <td>75.6</td>
453
+ <td>53.2</td>
454
+ <td>54.2</td>
455
+ <td>2049.9</td>
456
+ <td>77.8</td>
457
+ <td>79.5</td>
458
+ <td>50.4</td>
459
+ <td>41.6</td>
460
+ <td>76.7</td>
461
+ <td>75.5</td>
462
+ <td>-</td>
463
+ <td>-</td>
464
+ <td>-</td>
465
+ </tr>
466
+ <tr>
467
+ <td nowrap="nowrap" align="left">GLM-4V-9B</td>
468
+ <td>13B</td>
469
+ <td>784</td>
470
+ <td>59.1</td>
471
+ <td>776</td>
472
+ <td>51.1</td>
473
+ <td>-</td>
474
+ <td>58.0</td>
475
+ <td>54.8</td>
476
+ <td>2018.8</td>
477
+ <td>67.9</td>
478
+ <td>71.2</td>
479
+ <td>46.9</td>
480
+ <td>45.0</td>
481
+ <td>-</td>
482
+ <td>-</td>
483
+ <td>-</td>
484
+ <td>-</td>
485
+ <td>-</td>
486
+ </tr>
487
+ <tr>
488
+ <td nowrap="nowrap" align="left">Pixtral-12B</td>
489
+ <td>12B</td>
490
+ <td>256</td>
491
+ <td>61.0</td>
492
+ <td>685</td>
493
+ <td>56.9</td>
494
+ <td>81.8</td>
495
+ <td>58.5</td>
496
+ <td>54.5</td>
497
+ <td>-</td>
498
+ <td>72.7</td>
499
+ <td>79.0</td>
500
+ <td>51.1</td>
501
+ <td>47.0</td>
502
+ <td>75.7</td>
503
+ <td>90.7</td>
504
+ <td>-</td>
505
+ <td>-</td>
506
+ <td>-</td>
507
+ </tr>
508
+ <tr>
509
+ <td nowrap="nowrap" align="left">VITA-1.5</td>
510
+ <td>8B</td>
511
+ <td>784</td>
512
+ <td>63.3</td>
513
+ <td>741</td>
514
+ <td>66.2</td>
515
+ <td>-</td>
516
+ <td>52.7</td>
517
+ <td>60.2</td>
518
+ <td>2328.1</td>
519
+ <td>76.8</td>
520
+ <td>79.2</td>
521
+ <td>52.6</td>
522
+ <td>44.6</td>
523
+ <td>-</td>
524
+ <td>-</td>
525
+ <td>-</td>
526
+ <td>-</td>
527
+ <td>-</td>
528
+ </tr>
529
+ <tr>
530
+ <td nowrap="nowrap" align="left">DeepSeek-VL2-27B (4B)</td>
531
+ <td>27B</td>
532
+ <td>672</td>
533
+ <td>66.4</td>
534
+ <td>809</td>
535
+ <td>63.9</td>
536
+ <td>86.0</td>
537
+ <td>60.0</td>
538
+ <td>61.9</td>
539
+ <td>2253.0</td>
540
+ <td>81.2</td>
541
+ <td>83.8</td>
542
+ <td>54.0</td>
543
+ <td>45.3</td>
544
+ <td><u>84.2</u></td>
545
+ <td>93.3</td>
546
+ <td>-</td>
547
+ <td>-</td>
548
+ <td>3.0</td>
549
+ </tr>
550
+ <tr>
551
+ <td nowrap="nowrap" align="left">Qwen2-VL-7B</td>
552
+ <td>8B</td>
553
+ <td>784</td>
554
+ <td>67.1</td>
555
+ <td><u>866</u></td>
556
+ <td>58.2</td>
557
+ <td>83.0</td>
558
+ <td>62.0</td>
559
+ <td>60.7</td>
560
+ <td>2326.0</td>
561
+ <td>81.8</td>
562
+ <td>83.0</td>
563
+ <td>54.1</td>
564
+ <td>50.6</td>
565
+ <td><strong>84.3</strong></td>
566
+ <td><u>94.5</u></td>
567
+ <td>31.9</td>
568
+ <td>16.3</td>
569
+ <td>3.2</td>
570
+ </tr>
571
+ <tr>
572
+ <td nowrap="nowrap" align="left">LLaVA-OneVision-72B</td>
573
+ <td>72B</td>
574
+ <td>182</td>
575
+ <td>68.1</td>
576
+ <td>741</td>
577
+ <td>67.5</td>
578
+ <td>83.7</td>
579
+ <td>60.6</td>
580
+ <td><strong>65.8</strong></td>
581
+ <td>2261.0</td>
582
+ <td><strong>85.0</strong></td>
583
+ <td><u>85.6</u></td>
584
+ <td>56.8</td>
585
+ <td>49.0</td>
586
+ <td>80.5</td>
587
+ <td>91.3</td>
588
+ <td>39.1</td>
589
+ <td>-</td>
590
+ <td>3.5</td>
591
+ </tr>
592
+ <tr>
593
+ <td nowrap="nowrap" align="left">InternVL2.5-8B</td>
594
+ <td>8B</td>
595
+ <td>706</td>
596
+ <td>68.3</td>
597
+ <td>822</td>
598
+ <td><u>64.4</u></td>
599
+ <td>84.8</td>
600
+ <td>62.8</td>
601
+ <td>62.8</td>
602
+ <td>2344.0</td>
603
+ <td><u>83.6</u></td>
604
+ <td>84.5</td>
605
+ <td>56.0</td>
606
+ <td>50.1</td>
607
+ <td>79.1</td>
608
+ <td>93.0</td>
609
+ <td>39.5</td>
610
+ <td>19.7</td>
611
+ <td>3.4</td>
612
+ </tr>
613
+ <tr>
614
+ <td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
615
+ <td>8B</td>
616
+ <td><strong>2822</strong></td>
617
+ <td>65.2</td>
618
+ <td>852*</td>
619
+ <td>60.6</td>
620
+ <td>79.4</td>
621
+ <td>60.0</td>
622
+ <td>57.5</td>
623
+ <td><u>2348.4*</u></td>
624
+ <td>78.0</td>
625
+ <td>82.1</td>
626
+ <td>49.8*</td>
627
+ <td>48.1*</td>
628
+ <td>80.1</td>
629
+ <td>90.8</td>
630
+ <td>25.7</td>
631
+ <td>18.3</td>
632
+ <td>3.6</td>
633
+ </tr>
634
+ <tr>
635
+ <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
636
+ <td>8B</td>
637
+ <td><strong>2822</strong></td>
638
+ <td><strong>70.2</strong></td>
639
+ <td><strong>897*</strong></td>
640
+ <td><strong>71.9*</strong></td>
641
+ <td><u>86.9*</u></td>
642
+ <td><u>67.5</u></td>
643
+ <td><u>64.0</u></td>
644
+ <td><strong>2372.0*</strong></td>
645
+ <td>80.5</td>
646
+ <td><strong>85.8</strong></td>
647
+ <td>50.4*</td>
648
+ <td><u>51.9</u></td>
649
+ <td>82.0</td>
650
+ <td>93.5</td>
651
+ <td><u>41.4*</u></td>
652
+ <td><u>23.1*</u></td>
653
+ <td><strong>3.8</strong></td>
654
+ </tr>
655
+ </tbody>
656
+ </table>
657
+ </div>
658
+ * We evaluate this benchmark using chain-of-thought prompting. Specifically, for MME, we used this technique only for the Cognition set.
659
+
660
+
661
+ <sup>+</sup> Token Density: number of pixels encoded into each visual token at maximum resolution, i.e., # pixels at maximum resolution / # visual tokens.
662
+
663
+ Note: For proprietary models, we calculate token density based on the image encoding charging strategy defined in the official API documentation, which provides an upper-bound estimation.
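As a quick worked example of the definition above, using numbers that appear in this card (a 1344x1344 maximum resolution and 640 visual tokens for MiniCPM-o 2.6):

```python
# Token density = (# pixels at maximum resolution) / (# visual tokens).
max_pixels = 1344 * 1344      # ~1.8M pixels, the maximum resolution cited above
visual_tokens = 640           # tokens produced for such an image (per this card)
print(round(max_pixels / visual_tokens))  # 2822, matching the table entry for MiniCPM-o 2.6
```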
664
+
665
+
666
+ **Multi-image and Video Understanding**
667
+
668
+ <div align="center">
669
+
670
+ <table style="margin: 0px auto;">
671
+ <thead>
672
+ <tr>
673
+ <th align="left">Model</th>
674
+ <th>Size</th>
675
+ <th>BLINK val</th>
676
+ <th>Mantis Eval</th>
677
+ <th>MIRB</th>
678
+ <th>Video-MME (wo / w subs)</th>
679
+ </tr>
680
+ </thead>
681
+ <tbody align="center">
682
+ <tr>
683
+ <td colspan="6" align="left"><strong>Proprietary</strong></td>
684
+ </tr>
685
+ <tr>
686
+ <td nowrap="nowrap" align="left">GPT-4o-20240513</td>
687
+ <td>-</td>
688
+ <td><strong>68.0</strong></td>
689
+ <td>-</td>
690
+ <td>-</td>
691
+ <td><strong>71.9/77.2<strong></td>
692
+ </tr>
693
+ <tr>
694
+ <td nowrap="nowrap" align="left">GPT4V</td>
695
+ <td>-</td>
696
+ <td>54.6</td>
697
+ <td>62.7</td>
698
+ <td>53.1</td>
699
+ <td>59.9/63.3</td>
700
+ </tr>
701
+ <tr>
702
+ <td colspan="6" align="left"><strong>Open-source</strong></td>
703
+ </tr>
704
+ <tr>
705
+ <td nowrap="nowrap" align="left">VITA-1.5</td>
706
+ <td>8B</td>
707
+ <td>45.0</td>
708
+ <td>-</td>
709
+ <td>-</td>
710
+ <td>56.1/58.7</td>
711
+ </tr>
712
+ <tr>
713
+ <td nowrap="nowrap" align="left">LLaVA-NeXT-Interleave 14B</td>
714
+ <td>14B</td>
715
+ <td>52.6</td>
716
+ <td>66.4</td>
717
+ <td>30.2</td>
718
+ <td>-</td>
719
+ </tr>
720
+ <tr>
721
+ <td nowrap="nowrap" align="left">LLaVA-OneVision-72B</td>
722
+ <td>72B</td>
723
+ <td>55.4</td>
724
+ <td><strong>77.6</strong></td>
725
+ <td>-</td>
726
+ <td><u>66.2/69.5</u></td>
727
+ </tr>
728
+ <tr>
729
+ <td nowrap="nowrap" align="left">MANTIS 8B</td>
730
+ <td>8B</td>
731
+ <td>49.1</td>
732
+ <td>59.5</td>
733
+ <td>34.8</td>
734
+ <td>-</td>
735
+ </tr>
736
+ <tr>
737
+ <td nowrap="nowrap" align="left">Qwen2-VL-7B</td>
738
+ <td>8B</td>
739
+ <td>53.2</td>
740
+ <td>69.6*</td>
741
+ <td><strong>67.6*</strong></td>
742
+ <td>63.3/69.0</td>
743
+ </tr>
744
+ <tr>
745
+ <td nowrap="nowrap" align="left">InternVL2.5-8B</td>
746
+ <td>8B</td>
747
+ <td>54.8</td>
748
+ <td>67.7</td>
749
+ <td>52.5</td>
750
+ <td>64.2/66.9</td>
751
+ </tr>
752
+ <tr>
753
+ <td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
754
+ <td>8B</td>
755
+ <td>53.0</td>
756
+ <td>69.1</td>
757
+ <td>53.8</td>
758
+ <td>60.9/63.6</td>
759
+ </tr>
760
+ <tr>
761
+ <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
762
+ <td>8B</td>
763
+ <td><u>56.7</u></td>
764
+ <td><u>71.9</u></td>
765
+ <td><u>58.6</u></td>
766
+ <td>63.9/67.9</td>
767
+ </tr>
768
+ </tbody>
769
+ </table>
770
+
771
+ </div>
772
+ * We evaluate officially released checkpoints by ourselves.<br><br>
773
+
774
+ **Audio Understanding**
775
+
776
+ <div align="center">
777
+ <table style="margin: 0px auto;">
778
+ <thead>
779
+ <tr>
780
+ <th align="left">Task</th>
781
+ <th>Size</th>
782
+ <th colspan="3">ASR (zh)</th>
783
+ <th colspan="3">ASR (en)</th>
784
+ <th colspan="2">AST</th>
785
+ <th>Emotion</th>
786
+ </tr>
787
+ <tr>
788
+ <th align="left">Metric</th>
789
+ <td></td>
790
+ <th colspan="3">CER↓</th>
791
+ <th colspan="3">WER↓</th>
792
+ <th colspan="2">BLEU↑</th>
793
+ <th>ACC↑</th>
794
+ </tr>
795
+ <tr>
796
+ <th align="left">Dataset</th>
797
+ <td></td>
798
+ <th>AISHELL-1</th>
799
+ <th>Fleurs zh</th>
800
+ <th>WenetSpeech test-net</th>
801
+ <th>LibriSpeech test-clean</th>
802
+ <th>GigaSpeech</th>
803
+ <th>TED-LIUM</th>
804
+ <th>CoVoST en2zh</th>
805
+ <th>CoVoST zh2en</th>
806
+ <th>MELD emotion</th>
807
+ </tr>
808
+ </thead>
809
+ <tbody align="center">
810
+ <tr>
811
+ <td colspan="11" align="left"><strong>Proprietary</strong></td>
812
+ </tr>
813
+ <tr>
814
+ <td nowrap="nowrap" align="left">GPT-4o-Realtime</td>
815
+ <td>-</td>
816
+ <td>7.3*</td>
817
+ <td><u>5.4*</u></td>
818
+ <td>28.9*</td>
819
+ <td>2.6*</td>
820
+ <td>12.9*</td>
821
+ <td>4.8*</td>
822
+ <td>37.1*</td>
823
+ <td>15.7*</td>
824
+ <td>33.2*</td>
825
+ </tr>
826
+ <tr>
827
+ <td nowrap="nowrap" align="left">Gemini 1.5 Pro</td>
828
+ <td>-</td>
829
+ <td>4.5*</td>
830
+ <td>5.9*</td>
831
+ <td>14.3*</td>
832
+ <td>2.9*</td>
833
+ <td>10.6*</td>
834
+ <td><strong>3.0*</strong></td>
835
+ <td><u>47.3*</u></td>
836
+ <td>22.6*</td>
837
+ <td>48.4*</td>
838
+ </tr>
839
+ <tr>
840
+ <td colspan="11" align="left"><strong>Open-Source</strong></td>
841
+ </tr>
842
+ <tr>
843
+ <td nowrap="nowrap" align="left">Qwen2-Audio-7B</td>
844
+ <td>8B</td>
845
+ <td>-</td>
846
+ <td>7.5</td>
847
+ <td>-</td>
848
+ <td><strong>1.6</strong></td>
849
+ <td>-</td>
850
+ <td>-</td>
851
+ <td>45.2</td>
852
+ <td><u>24.4</u></td>
853
+ <td><strong>55.3</strong></td>
854
+ </tr>
855
+ <tr>
856
+ <td nowrap="nowrap" align="left">Qwen2-Audio-7B-Instruct</td>
857
+ <td>8B</td>
858
+ <td>2.6*</td>
859
+ <td>6.9*</td>
860
+ <td><u>10.3*</u></td>
861
+ <td>3.1*</td>
862
+ <td><u>9.7</u>*</td>
863
+ <td>5.9*</td>
864
+ <td>39.5*</td>
865
+ <td>22.9*</td>
866
+ <td>17.4*</td>
867
+ </tr>
868
+ <tr>
869
+ <td nowrap="nowrap" align="left">VITA-1.5</td>
870
+ <td>8B</td>
871
+ <td>2.16</td>
872
+ <td>-</td>
873
+ <td>8.4</td>
874
+ <td>3.4</td>
875
+ <td>-</td>
876
+ <td>-</td>
877
+ <td>-</td>
878
+ <td>-</td>
879
+ <td>-</td>
880
+ </tr>
881
+ <tr>
882
+ <td nowrap="nowrap" align="left">GLM-4-Voice-Base</td>
883
+ <td>9B</td>
884
+ <td><u>2.5</u></td>
885
+ <td>-</td>
886
+ <td>-</td>
887
+ <td>2.8</td>
888
+ <td>-</td>
889
+ <td>-</td>
890
+ <td>-</td>
891
+ <td>-</td>
+ <td>-</td>
892
+ </tr>
893
+ <tr>
894
+ <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
895
+ <td>8B</td>
896
+ <td><strong>1.6</strong></td>
897
+ <td><strong>4.4</strong></td>
898
+ <td><strong>6.9</strong></td>
899
+ <td><u>1.7</u></td>
900
+ <td><strong>8.7</strong></td>
901
+ <td><strong>3.0</strong></td>
902
+ <td><strong>48.2</strong></td>
903
+ <td><strong>27.2</strong></td>
904
+ <td><u>52.4</u></td>
905
+ </tr>
906
+ </tbody>
907
+ </table>
908
+ </div>
909
+ * We evaluate the officially released checkpoints ourselves.<br><br>
910
+
911
+ **Speech Generation**
912
+
913
+ <div align="center">
914
+ <table style="margin: 0px auto;">
915
+ <thead>
916
+ <tr>
917
+ <th align="left">Task</th>
918
+ <th>Size</th>
919
+ <th colspan="9">SpeechQA</th>
920
+ </tr>
921
+ <tr>
922
+ <th align="left">Metric</th>
923
+ <th></th>
924
+ <th colspan="3">ACC↑</th>
925
+ <th>G-Eval (10 point)↑</th>
926
+ <th>Semantic ELO score↑</th>
927
+ <th>Acoustic ELO score↑</th>
928
+ <th>Overall ELO score↑</th>
929
+ <th>UTMOS↑</th>
930
+ <th>ASR-WER↓</th>
931
+ </tr>
932
+ <tr>
933
+ <th align="left">Dataset</th>
934
+ <th></th>
935
+ <th>Speech Llama Q.</th>
936
+ <th>Speech Web Q.</th>
937
+ <th>Speech Trivia QA</th>
938
+ <th>Speech AlpacaEval</th>
939
+ <th colspan="5">AudioArena</th>
940
+ </tr>
941
+ </thead>
942
+ <tbody align="center">
943
+ <tr>
944
+ <td colspan="11" align="left"><strong>Proprietary</strong></td>
945
+ </tr>
946
+ <tr>
947
+ <td nowrap="nowrap" align="left">GPT-4o-Realtime</td>
948
+ <td></td>
949
+ <td><strong>71.7</strong></td>
950
+ <td><strong>51.6</strong></td>
951
+ <td><strong>69.7</strong></td>
952
+ <td><strong>7.4</strong></td>
953
+ <td><strong>1157</strong></td>
954
+ <td><strong>1203</strong></td>
955
+ <td><strong>1200</strong></td>
956
+ <td><strong>4.2</strong></td>
957
+ <td><strong>2.3</strong></td>
958
+ </tr>
959
+ <tr>
960
+ <td colspan="11" align="left"><strong>Open-Source</strong></td>
961
+ </tr>
962
+ <tr>
963
+ <td nowrap="nowrap" align="left">GLM-4-Voice</td>
964
+ <td>9B</td>
965
+ <td>50.0</td>
966
+ <td>32.0</td>
967
+ <td>36.4</td>
968
+ <td><u>5.1</u></td>
969
+ <td>999</td>
970
+ <td>1147</td>
971
+ <td>1035</td>
972
+ <td><u>4.1</u></td>
973
+ <td><u>11.7</u></td>
974
+ </tr>
975
+ <tr>
976
+ <td nowrap="nowrap" align="left">Llama-Omni</td>
977
+ <td>8B</td>
978
+ <td>45.3</td>
979
+ <td>22.9</td>
980
+ <td>10.7</td>
981
+ <td>3.9</td>
982
+ <td>960</td>
983
+ <td>878</td>
984
+ <td>897</td>
985
+ <td>3.2</td>
986
+ <td>24.3</td>
987
+ </tr>
988
+ <tr>
989
+ <td nowrap="nowrap" align="left">VITA-1.5</td>
990
+ <td>8B</td>
991
+ <td>46.7</td>
992
+ <td>28.1</td>
993
+ <td>23.3</td>
994
+ <td>2.0</td>
995
+ <td>-</td>
996
+ <td>-</td>
997
+ <td>-</td>
998
+ <td>-</td>
999
+ <td>-</td>
1000
+ </tr>
1001
+ <tr>
1002
+ <td nowrap="nowrap" align="left">Moshi</td>
1003
+ <td>7B</td>
1004
+ <td>43.7</td>
1005
+ <td>23.8</td>
1006
+ <td>16.7</td>
1007
+ <td>2.4</td>
1008
+ <td>871</td>
1009
+ <td>808</td>
1010
+ <td>875</td>
1011
+ <td>2.8</td>
1012
+ <td>8.2</td>
1013
+ </tr>
1014
+ <tr>
1015
+ <td nowrap="nowrap" align="left">Mini-Omni</td>
1016
+ <td>1B</td>
1017
+ <td>22.0</td>
1018
+ <td>12.8</td>
1019
+ <td>6.9</td>
1020
+ <td>2.5</td>
1021
+ <td>926</td>
1022
+ <td>803</td>
1023
+ <td>865</td>
1024
+ <td>3.4</td>
1025
+ <td>10.0</td>
1026
+ </tr>
1027
+ <tr>
1028
+ <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
1029
+ <td>8B</td>
1030
+ <td><u>61.0</u></td>
1031
+ <td><u>40.0</u></td>
1032
+ <td><u>40.2</u></td>
1033
+ <td><u>5.1</u></td>
1034
+ <td><u>1088</u></td>
1035
+ <td><u>1163</u></td>
1036
+ <td><u>1131</u></td>
1037
+ <td><strong>4.2</strong></td>
1038
+ <td>9.8</td>
1039
+ </tr>
1040
+ </tbody>
1041
+ </table>
1042
+ </div>
1043
+ All results are from AudioEvals; the evaluation methods and further details can be found in the [AudioEvals](https://github.com/OpenBMB/UltraEval-Audio) repository.<br><br>
1044
+
1045
+ **End-to-end Voice Cloning**
1046
+
1047
+ <div align="center">
1048
+ <table style="margin: 0px auto;">
1049
+ <thead>
1050
+ <tr>
1051
+ <th align="left">Task</th>
1052
+ <th colspan="2">Voice cloning</th>
1053
+ </tr>
1054
+ <tr>
1055
+ <th align="left">Metric</th>
1056
+ <th>SIMO↑</th>
1057
+ <th>SIMO↑</th>
1058
+ </tr>
1059
+ <tr>
1060
+ <th align="left">Dataset</th>
1061
+ <th>Seed-TTS test-zh</th>
1062
+ <th>Seed-TTS test-en</th>
1063
+ </tr>
1064
+ </thead>
1065
+ <tbody align="center">
1066
+ <tr>
1067
+ <td nowrap="nowrap" align="left">F5-TTS</td>
1068
+ <td><strong>76</strong></td>
1069
+ <td><strong>67</strong></td>
1070
+ </tr>
1071
+ <tr>
1072
+ <td nowrap="nowrap" align="left">CosyVoice</td>
1073
+ <td><u>75</u></td>
1074
+ <td><u>64</u></td>
1075
+ </tr>
1076
+ <tr>
1077
+ <td nowrap="nowrap" align="left">FireRedTTS</td>
1078
+ <td>63</td>
1079
+ <td>46</td>
1080
+ </tr>
1081
+ <tr>
1082
+ <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
1083
+ <td>57</td>
1084
+ <td>47</td>
1085
+ </tr>
1086
+ </tbody>
1087
+ </table>
1088
+ </div>
1089
+
1090
+ </details>
1091
+
1092
+
1093
+ <details>
1094
+ <summary>Click to view multimodal live streaming results.</summary>
1095
+
1096
+ **Multimodal Live Streaming**: results on StreamingBench
1097
+
1098
+ <table style="margin: 0px auto;">
1099
+ <thead>
1100
+ <tr>
1101
+ <th align="left">Model</th>
1102
+ <th>Size</th>
1103
+ <th>Real-Time Video Understanding</th>
1104
+ <th>Omni-Source Understanding</th>
1105
+ <th>Contextual Understanding</th>
1106
+ <th>Overall</th>
1107
+ </tr>
1108
+ </thead>
1109
+ <tbody align="center">
1110
+ <tr>
1111
+ <td colspan="7" align="left"><strong>Proprietary</strong></td>
1112
+ </tr>
1113
+ <tr>
1114
+ <td nowrap="nowrap" align="left">Gemini 1.5 Pro</td>
1115
+ <td>-</td>
1116
+ <td><u>77.4</u></td>
1117
+ <td><strong>67.8</strong></td>
1118
+ <td><strong>51.1</strong></td>
1119
+ <td><strong>70.3</strong></td>
1120
+ </tr>
1121
+ <tr>
1122
+ <td nowrap="nowrap" align="left">GPT-4o-202408</td>
1123
+ <td>-</td>
1124
+ <td>74.5</td>
1125
+ <td>51.0</td>
1126
+ <td><u>48.0</u></td>
1127
+ <td>64.1</td>
1128
+ </tr>
1129
+ <tr>
1130
+ <td nowrap="nowrap" align="left">Claude-3.5-Sonnet</td>
1131
+ <td>-</td>
1132
+ <td>74.0</td>
1133
+ <td>41.4</td>
1134
+ <td>37.8</td>
1135
+ <td>59.7</td>
1136
+ </tr>
1137
+ <tr>
1138
+ <td colspan="9" align="left"><strong>Open-source</strong></td>
1139
+ </tr>
1140
+ <tr>
1141
+ <td nowrap="nowrap" align="left">VILA-1.5</td>
1142
+ <td>8B</td>
1143
+ <td>61.5</td>
1144
+ <td>37.5</td>
1145
+ <td>26.7</td>
1146
+ <td>49.5</td>
1147
+ </tr>
1148
+ <tr>
1149
+ <td nowrap="nowrap" align="left">LongVA</td>
1150
+ <td>7B</td>
1151
+ <td>63.1</td>
1152
+ <td>35.9</td>
1153
+ <td>30.2</td>
1154
+ <td>50.7</td>
1155
+ </tr>
1156
+ <tr>
1157
+ <td nowrap="nowrap" align="left">LLaVA-Next-Video-34B</td>
1158
+ <td>34B</td>
1159
+ <td>69.8</td>
1160
+ <td>41.7</td>
1161
+ <td>34.3</td>
1162
+ <td>56.7</td>
1163
+ </tr>
1164
+ <tr>
1165
+ <td nowrap="nowrap" align="left">Qwen2-VL-7B</td>
1166
+ <td>8B</td>
1167
+ <td>71.2</td>
1168
+ <td>40.7</td>
1169
+ <td>33.1</td>
1170
+ <td>57.0</td>
1171
+ </tr>
1172
+ <tr>
1173
+ <td nowrap="nowrap" align="left">InternVL2-8B</td>
1174
+ <td>8B</td>
1175
+ <td>70.1</td>
1176
+ <td>42.7</td>
1177
+ <td>34.1</td>
1178
+ <td>57.0</td>
1179
+ </tr>
1180
+ <tr>
1181
+ <td nowrap="nowrap" align="left">VITA-1.5</td>
1182
+ <td>8B</td>
1183
+ <td>70.9</td>
1184
+ <td>40.8</td>
1185
+ <td>35.8</td>
1186
+ <td>57.4</td>
1187
+ </tr>
1188
+ <tr>
1189
+ <td nowrap="nowrap" align="left">LLaVA-OneVision-7B</td>
1190
+ <td>8B</td>
1191
+ <td>74.3</td>
1192
+ <td>40.8</td>
1193
+ <td>31.0</td>
1194
+ <td>58.4</td>
1195
+ </tr>
1196
+ <tr>
1197
+ <td nowrap="nowrap" align="left">InternLM-XC2.5-OL-7B</td>
1198
+ <td>8B</td>
1199
+ <td>75.4</td>
1200
+ <td>46.2</td>
1201
+ <td>33.6</td>
1202
+ <td>60.8</td>
1203
+ </tr>
1204
+ <tr>
1205
+ <td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
1206
+ <td>8B</td>
1207
+ <td>72.4</td>
1208
+ <td>40.2</td>
1209
+ <td>33.4</td>
1210
+ <td>57.7</td>
1211
+ </tr>
1212
+ <tr>
1213
+ <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
1214
+ <td>8B</td>
1215
+ <td><strong>79.9</strong></td>
1216
+ <td><u>53.4</u></td>
1217
+ <td>38.5</td>
1218
+ <td><u>66.0</u></td>
1219
+ </tr>
1220
+ </tbody>
1221
+ </table>
1222
+
1223
+ </details>
1224
 
 
1225
 
1226
  ### Examples
1227
 
1228
+ We deploy MiniCPM-o 2.6 on end devices. The demo video is a raw-speed recording on an iPad Pro and a web demo.
1229
+
1230
  <div align="center">
1231
+ <a href="https://www.youtube.com/watch?v=vRIMbxJzStY&t=2s"><img src="https://raw.githubusercontent.com/OpenBMB/MiniCPM-o/main/assets/minicpmo2_6/2dot6_o_demo_video_img.png", width=70%></a>
1232
  </div>
1233
 
1234
+ <br>
1235
+
1236
  <div style="display: flex; flex-direction: column; align-items: center;">
1237
+ <img src="https://raw.githubusercontent.com/OpenBMB/MiniCPM-o/main/assets/minicpmo2_6/minicpmo2_6_math_intersect.png" alt="math" style="margin-bottom: 5px;">
1238
+ <img src="https://raw.githubusercontent.com/OpenBMB/MiniCPM-o/main/assets/minicpmo2_6/minicpmo2_6_diagram_train_NN.png" alt="diagram" style="margin-bottom: 5px;">
1239
+ <img src="https://raw.githubusercontent.com/OpenBMB/MiniCPM-o/main/assets/minicpmo2_6/minicpmo2_6_multi-image_bike.png" alt="bike" style="margin-bottom: 5px;">
1240
  </div>
1241
 
 
1242
 
1243
+ ## Legacy Models
 
 
 
1244
 
1245
+ | Model | Introduction and Guidance |
1246
+ |:----------------------|:-------------------:|
1247
+ | MiniCPM-V 4.0 | [Document](./docs/minicpm_v4_en.md) |
1248
+ | MiniCPM-V 2.6 | [Document](./docs/minicpm_v2dot6_en.md) |
1249
+ | MiniCPM-Llama3-V 2.5 | [Document](./docs/minicpm_llama3_v2dot5.md) |
1250
+ | MiniCPM-V 2.0 | [Document](./docs/minicpm_v2.md) |
1251
+ | MiniCPM-V 1.0 | [Document](./docs/minicpm_v1.md) |
1252
+ | OmniLMM-12B | [Document](./docs/omnilmm_en.md) |
1253
+
1254
+
1255
+ ## MiniCPM-V & o Cookbook
1256
+
1257
+ Discover comprehensive, ready-to-deploy solutions for the MiniCPM-V and MiniCPM-o model series in our structured [cookbook](https://github.com/OpenSQZ/MiniCPM-V-CookBook), which empowers developers to rapidly implement multimodal AI applications with integrated vision, speech, and live-streaming capabilities. Key features include:
1258
+
1259
+ **Easy Usage Documentation**
1260
+
1261
+ Our comprehensive [documentation website](https://minicpm-o.readthedocs.io/en/latest/index.html) presents every recipe in a clear, well-organized manner.
1262
+ All features are displayed at a glance, making it easy for you to quickly find exactly what you need.
1263
+
1264
+ **Broad User Spectrum**
1265
+
1266
+ We support a wide range of users, from individuals to enterprises and researchers.
1267
+
1268
+ * **Individuals**: Enjoy effortless inference using [Ollama](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/ollama/minicpm-v4_ollama.md) and [Llama.cpp](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/llama.cpp/minicpm-v4_llamacpp.md) with minimal setup.
1269
+ * **Enterprises**: Achieve high-throughput, scalable performance with [vLLM](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/vllm/minicpm-v4_vllm.md) and [SGLang](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/sglang/MiniCPM-v4_sglang.md).
1270
+ * **Researchers**: Leverage advanced frameworks including [Transformers](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/finetune/finetune_full.md), [LLaMA-Factory](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/finetune/finetune_llamafactory.md), [SWIFT](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/finetune/swift.md), and [Align-anything](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/finetune/align_anything.md) to enable flexible model development and cutting-edge experimentation.
1271
+
1272
+ **Versatile Deployment Scenarios**
1273
+
1274
+ Our ecosystem delivers optimal solutions for a variety of hardware environments and deployment demands.
1275
+
1276
+ * **Web demo**: Launch an interactive multimodal AI web demo with [FastAPI](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/demo/README.md).
1277
+ * **Quantized deployment**: Maximize efficiency and minimize resource consumption using [GGUF](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/quantization/gguf/minicpm-v4_gguf_quantize.md) and [BNB](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/quantization/bnb/minicpm-v4_bnb_quantize.md).
1278
+ * **End devices**: Bring powerful AI experiences to [iPhone and iPad](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/demo/ios_demo/ios.md), supporting offline and privacy-sensitive applications.
1279
+
1280
+
1281
+ ## Chat with Our Demo on Gradio 🤗
1282
+
1283
+ We provide online and local demos powered by Hugging Face Gradio <a href='https://github.com/gradio-app/gradio'><img src='https://img.shields.io/github/stars/gradio-app/gradio'></a>, a widely used framework for building model demos. It supports streaming outputs, progress bars, queuing, alerts, and other useful features.
1284
+
1285
+
1286
+ ### Online Demo
1287
+
1288
+ Click here to try out the online demo of [MiniCPM-o 2.6](https://minicpm-omni-webdemo-us.modelbest.cn/) | [MiniCPM-V 2.6](http://120.92.209.146:8887/) | [MiniCPM-Llama3-V 2.5](https://huggingface.co/spaces/openbmb/MiniCPM-Llama3-V-2_5) | [MiniCPM-V 2.0](https://huggingface.co/spaces/openbmb/MiniCPM-V-2).
1289
+
1290
+ ### Local WebUI Demo
1291
+
1292
+ You can easily build your own local WebUI demo using the following commands.
1293
+
1294
+ Please ensure that `transformers==4.44.2` is installed, as other versions may have compatibility issues.
1295
+
1296
+ If you are using an older version of PyTorch, you might encounter the error `"weight_norm_fwd_first_dim_kernel" not implemented for 'BFloat16'`. In that case, add `self.minicpmo_model.tts.float()` during model initialization.
1297
+
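+ For reference, the same workaround in plain inference code looks like the snippet below (a minimal sketch mirroring the speech initialization shown later in this card; inside the demo's `model_server.py` the model is wrapped in a class, hence the `self.minicpmo_model.tts` attribute path):
+
+ ```python
+ import torch
+ from transformers import AutoModel
+
+ # Load MiniCPM-o 2.6 and cast the TTS sub-module to float32, so that weight_norm
+ # does not hit the missing BFloat16 kernel on older PyTorch versions.
+ model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True,
+     attn_implementation='sdpa', torch_dtype=torch.bfloat16)
+ model = model.eval().cuda()
+ model.init_tts()
+ model.tts.float()  # equivalent to `self.minicpmo_model.tts.float()` in the demo code
+ ```
+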
1298
+ **For real-time voice/video call demo:**
1299
+ 1. Launch the model server:
1300
+ ```shell
1301
+ pip install -r requirements_o2.6.txt
1302
+
1303
+ python web_demos/minicpm-o_2.6/model_server.py
1304
+ ```
1305
+
1306
+ 2. Launch the web server:
1307
+
1308
+ ```shell
1309
+ # Make sure Node.js and pnpm are installed.
1310
+ sudo apt-get update
1311
+ sudo apt-get install nodejs npm
1312
+ npm install -g pnpm
1313
+
1314
+
1315
+ cd web_demos/minicpm-o_2.6/web_server
1316
+ # Create an SSL cert for HTTPS; HTTPS is required to request camera and microphone permissions.
1317
+ bash ./make_ssl_cert.sh # output key.pem and cert.pem
1318
+
1319
+ pnpm install # install requirements
1320
+ pnpm run dev # start server
1321
+ ```
1322
+ Open `https://localhost:8088/` in your browser and enjoy the real-time voice/video call.
1323
+
1324
+ **For chatbot demo:**
1325
+ ```shell
1326
+ pip install -r requirements_o2.6.txt
 
 
 
 
 
 
 
 
1327
 
1328
+ python web_demos/minicpm-o_2.6/chatbot_web_demo_o2.6.py
1329
+ ```
1330
+ Open `http://localhost:8000/` in your browser and enjoy the vision-mode chatbot.
1331
+
1332
+ ## Inference
1333
+
1334
+
1335
+ ### Model Zoo
1336
+
1337
+ | Model | Device | Memory | &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; Description | Download |
1338
+ |:-----------|:--:|:-----------:|:-------------------|:---------------:|
1339
+ | MiniCPM-V 4.5| GPU | 18 GB | The latest version, strong end-side multimodal performance for single image, multi-image and video understanding. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-4_5) &nbsp;&nbsp; [<img src="https://raw.githubusercontent.com/OpenBMB/MiniCPM-o/main/assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V-4_5) |
1340
+ | MiniCPM-V 4.5 gguf | CPU | 8 GB | The gguf version, lower memory usage and faster inference. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-4_5-gguf) &nbsp;&nbsp; [<img src="https://raw.githubusercontent.com/OpenBMB/MiniCPM-o/main/assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V-4_5-gguf) |
1341
+ | MiniCPM-V 4.5 int4 | GPU | 9 GB | The int4 quantized version, lower GPU memory usage. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-4_5-int4) &nbsp;&nbsp; [<img src="https://raw.githubusercontent.com/OpenBMB/MiniCPM-o/main/assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V-4_5-int4) |
1342
+ | MiniCPM-V 4.5 AWQ | GPU | 9 GB | The AWQ quantized version, lower GPU memory usage. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-4_5-AWQ) &nbsp;&nbsp; [<img src="https://raw.githubusercontent.com/OpenBMB/MiniCPM-o/main/assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V-4_5-AWQ) |
1343
+ | MiniCPM-o 2.6 | GPU | 18 GB | The latest omni-modal version, achieving GPT-4o-level performance for vision, speech and multimodal live streaming on end-side devices. | [🤗](https://huggingface.co/openbmb/MiniCPM-o-2_6) &nbsp;&nbsp; [<img src="https://raw.githubusercontent.com/OpenBMB/MiniCPM-o/main/assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-o-2_6) |
1344
+ | MiniCPM-o 2.6 gguf | CPU | 8 GB | The gguf version, lower memory usage and faster inference. | [🤗](https://huggingface.co/openbmb/MiniCPM-o-2_6-gguf) &nbsp;&nbsp; [<img src="https://raw.githubusercontent.com/OpenBMB/MiniCPM-o/main/assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-o-2_6-gguf) |
1345
+ | MiniCPM-o 2.6 int4 | GPU | 9 GB | The int4 quantized version, lower GPU memory usage. | [🤗](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4) &nbsp;&nbsp; [<img src="https://raw.githubusercontent.com/OpenBMB/MiniCPM-o/main/assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-o-2_6-int4) |
1346
+
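+ For instance, the int4 checkpoint listed above can be loaded through the same `AutoModel` interface as the full-precision model (a minimal sketch under that assumption; see the linked int4 model card for the authoritative loading code):
+
+ ```python
+ from transformers import AutoModel, AutoTokenizer
+
+ # The quantization config ships with the int4 checkpoint, so no torch_dtype is passed here.
+ model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5-int4', trust_remote_code=True)
+ model = model.eval()
+ tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5-int4', trust_remote_code=True)
+ ```
+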
1347
+ ### Multi-turn Conversation
1348
+
1349
+ If you wish to enable long-thinking mode, provide the argument `enable_thinking=True` to the chat function.
1350
+
1351
+ ```shell
1352
+ pip install -r requirements_o2.6.txt
1353
+ ```
1354
+
1355
+ Please refer to the following code to run the model.
1356
+
1357
+ <div align="center">
1358
+ <img src="https://raw.githubusercontent.com/OpenBMB/MiniCPM-o/main/assets/minicpmo2_6/show_demo.jpg" width="500px">
1359
+ </div>
1360
 
 
1361
 
 
1362
  ```python
1363
  import torch
1364
  from PIL import Image
 
1373
 
1374
  image = Image.open('./assets/minicpmo2_6/show_demo.jpg').convert('RGB')
1375
 
1376
+ enable_thinking=False # If `enable_thinking=True`, the long-thinking mode is enabled.
 
1377
 
1378
  # First round chat
1379
  question = "What is the landform in the picture?"
 
1382
  answer = model.chat(
1383
  msgs=msgs,
1384
  tokenizer=tokenizer,
1385
+ enable_thinking=enable_thinking
 
1386
  )
1387
+ print(answer)
 
 
 
 
1388
 
1389
  # Second round chat, pass history context of multi-turn conversation
1390
+ msgs.append({"role": "assistant", "content": [answer]})
1391
  msgs.append({"role": "user", "content": ["What should I pay attention to when traveling here?"]})
1392
 
1393
  answer = model.chat(
1394
  msgs=msgs,
1395
+ tokenizer=tokenizer
 
1396
  )
1397
+ print(answer)
 
 
 
 
1398
  ```
1399
 
1400
  You will get the following output:
 
1416
  By following these guidelines, you'll have a safe and enjoyable trip while appreciating the stunning natural beauty of places such as Guilin’s karst mountains.
1417
  ```
1418
 
1419
+ #### Chat with Multiple Images
1420
+ <details>
1421
+ <summary> Click to view Python code running MiniCPM-V-4_5 with multi-image input. </summary>
1422
+
1423
+ ```python
1424
+ import torch
1425
+ from PIL import Image
1426
+ from transformers import AutoModel, AutoTokenizer
1427
+
1428
+ model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True, # or openbmb/MiniCPM-o-2_6
1429
+ attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
1430
+ model = model.eval().cuda()
1431
+ tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True) # or openbmb/MiniCPM-o-2_6
1432
+
1433
+ image1 = Image.open('image1.jpg').convert('RGB')
1434
+ image2 = Image.open('image2.jpg').convert('RGB')
1435
+ question = 'Compare image 1 and image 2, tell me about the differences between image 1 and image 2.'
1436
+
1437
+ msgs = [{'role': 'user', 'content': [image1, image2, question]}]
1438
+
1439
+ answer = model.chat(
1440
+ msgs=msgs,
1441
+ tokenizer=tokenizer
1442
+ )
1443
+ print(answer)
1444
+ ```
1445
+ </details>
1446
+
1447
+ #### In-context Few-shot Learning
1448
+ <details>
1449
+ <summary> Click to view Python code running MiniCPM-V-4_5 with few-shot input. </summary>
1450
+
1451
+ ```python
1452
+ import torch
1453
+ from PIL import Image
1454
+ from transformers import AutoModel, AutoTokenizer
1455
+
1456
+ model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True, # or openbmb/MiniCPM-o-2_6
1457
+ attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
1458
+ model = model.eval().cuda()
1459
+ tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True) # or openbmb/MiniCPM-o-2_6
1460
+
1461
+ question = "production date"
1462
+ image1 = Image.open('example1.jpg').convert('RGB')
1463
+ answer1 = "2023.08.04"
1464
+ image2 = Image.open('example2.jpg').convert('RGB')
1465
+ answer2 = "2007.04.24"
1466
+ image_test = Image.open('test.jpg').convert('RGB')
1467
+
1468
+ msgs = [
1469
+ {'role': 'user', 'content': [image1, question]}, {'role': 'assistant', 'content': [answer1]},
1470
+ {'role': 'user', 'content': [image2, question]}, {'role': 'assistant', 'content': [answer2]},
1471
+ {'role': 'user', 'content': [image_test, question]}
1472
+ ]
1473
+
1474
+ answer = model.chat(
1475
+ msgs=msgs,
1476
+ tokenizer=tokenizer
1477
+ )
1478
+ print(answer)
1479
+ ```
1480
+ </details>
1481
 
1482
  #### Chat with Video
1483
+ <details>
1484
+ <summary> Click to view Python code running MiniCPM-V-4_5 with video input and the 3D-Resampler. </summary>
1485
 
1486
  ```python
1487
  ## The 3d-resampler compresses multiple frames into 64 tokens by introducing temporal_ids.
 
1581
  )
1582
  print(answer)
1583
  ```
1584
+ </details>
1585
+
1586
+
1587
+ #### Speech and Audio Mode
1588
+
1589
+ Model initialization
1590
 
 
 
 
 
1591
  ```python
1592
  import torch
1593
+ import librosa
1594
  from transformers import AutoModel, AutoTokenizer
1595
 
1596
+ model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True,
1597
+ attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
1598
  model = model.eval().cuda()
1599
+ tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)
1600
 
1601
+ model.init_tts()
1602
+ model.tts.float()
1603
+ ```
1604
 
1605
+ <hr/>
1606
 
1607
+ ##### Mimick
1608
+
1609
+ The `Mimick` task reflects a model's end-to-end speech modeling capability. The model takes audio input, outputs an ASR transcription, and then reconstructs the original audio with high similarity. The higher the similarity between the reconstructed audio and the original, the stronger the model's foundational capability in end-to-end speech modeling.
1610
+
1611
+ ```python
1612
+ mimick_prompt = "Please repeat each user's speech, including voice style and speech content."
1613
+ audio_input, _ = librosa.load('./assets/input_examples/Trump_WEF_2018_10s.mp3', sr=16000, mono=True) # load the audio to be mimicked
1614
+
1615
+ # Other example inputs that cover different speech-centric features:
+ # `./assets/input_examples/fast-pace.wav`
+ # `./assets/input_examples/chi-english-1.wav`
+ # `./assets/input_examples/exciting-emotion.wav`
1619
+
1620
+ msgs = [{'role': 'user', 'content': [mimick_prompt, audio_input]}]
1621
+ res = model.chat(
1622
  msgs=msgs,
1623
+ tokenizer=tokenizer,
1624
+ sampling=True,
1625
+ max_new_tokens=128,
1626
+ use_tts_template=True,
1627
+ temperature=0.3,
1628
+ generate_audio=True,
1629
+ output_audio_path='output_mimick.wav', # save the tts result to output_audio_path
1630
  )
 
1631
  ```
 
1632
 
1633
+ <hr/>
1634
+
1635
+ ##### General Speech Conversation with Configurable Voices
1636
+
1637
+ A typical usage scenario of `MiniCPM-o-2.6` is role-playing a specific character based on an audio prompt. The model mimics the character's voice to some extent and acts like the character in text, including language style. In this mode, `MiniCPM-o-2.6` sounds **more natural and human-like**. Self-defined audio prompts can be used to customize the character's voice in an end-to-end manner.
1638
 
 
 
 
1639
 
1640
  ```python
1641
+ ref_audio, _ = librosa.load('./assets/input_examples/icl_20.wav', sr=16000, mono=True) # load the reference audio
1642
+ sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_roleplay', language='en')
 
1643
 
1644
+ # round one
1645
+ user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]}
1646
+ msgs = [sys_prompt, user_question]
1647
+ res = model.chat(
1648
+ msgs=msgs,
1649
+ tokenizer=tokenizer,
1650
+ sampling=True,
1651
+ max_new_tokens=128,
1652
+ use_tts_template=True,
1653
+ generate_audio=True,
1654
+ temperature=0.3,
1655
+ output_audio_path='result_roleplay_round_1.wav',
1656
+ )
1657
 
1658
+ # round two
1659
+ msgs.append({'role': 'assistant', 'content': res})
1660
+ user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]}
1661
+ msgs.append(user_question)
1662
+ res = model.chat(
1663
+ msgs=msgs,
1664
+ tokenizer=tokenizer,
1665
+ sampling=True,
1666
+ max_new_tokens=128,
1667
+ use_tts_template=True,
1668
+ generate_audio=True,
1669
+ temperature=0.3,
1670
+ output_audio_path='result_roleplay_round_2.wav',
1671
+ )
1672
+ print(res)
1673
+ ```
1674
 
1675
+ <hr/>
 
 
 
 
1676
 
1677
+ ##### Speech Conversation as an AI Assistant
1678
+
1679
+ An enhanced feature of `MiniCPM-o-2.6` is acting as an AI assistant, with a limited choice of voices. In this mode, `MiniCPM-o-2.6` is **less human-like and more like a voice assistant**, and it follows instructions more closely. For the demo, we suggest using `assistant_female_voice`, `assistant_male_voice`, or `assistant_default_female_voice`. Other voices may work, but are not as stable as these default voices.
1680
+
1681
+ *Please note that `assistant_female_voice` and `assistant_male_voice` are more stable but sound robotic, while `assistant_default_female_voice` is more human-like but less stable; its voice often changes across turns. We suggest trying the stable voices `assistant_female_voice` and `assistant_male_voice`.*
1682
+
1683
+ ```python
1684
+ ref_audio, _ = librosa.load('./assets/input_examples/assistant_female_voice.wav', sr=16000, mono=True) # or use `./assets/input_examples/assistant_male_voice.wav`
1685
+ sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_assistant', language='en')
1686
+ user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]} # load the user's audio question
1687
+
1688
+ # round one
1689
+ msgs = [sys_prompt, user_question]
1690
+ res = model.chat(
1691
  msgs=msgs,
1692
+ tokenizer=tokenizer,
1693
+ sampling=True,
1694
+ max_new_tokens=128,
1695
+ use_tts_template=True,
1696
+ generate_audio=True,
1697
+ temperature=0.3,
1698
+ output_audio_path='result_assistant_round_1.wav',
1699
  )
1700
+
1701
+ # round two
1702
+ msgs.append({'role': 'assistant', 'content': res})
1703
+ user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]}
1704
+ msgs.append(user_question)
1705
+ res = model.chat(
1706
+ msgs=msgs,
1707
+ tokenizer=tokenizer,
1708
+ sampling=True,
1709
+ max_new_tokens=128,
1710
+ use_tts_template=True,
1711
+ generate_audio=True,
1712
+ temperature=0.3,
1713
+ output_audio_path='result_assistant_round_2.wav',
1714
+ )
1715
+ print(res)
1716
+ ```
1717
+
1718
+ <hr/>
1719
+
1720
+ ##### Instruction-to-Speech
1721
+
1722
+ `MiniCPM-o-2.6` can also perform Instruction-to-Speech, a.k.a. **Voice Creation**. You can describe a voice in detail, and the model will generate a voice that matches the description. For more Instruction-to-Speech sample instructions, refer to https://voxinstruct.github.io/VoxInstruct/.
1723
+
1724
+ ```python
1725
+ instruction = 'Speak like a male charming superstar, radiating confidence and style in every word.'
1726
+
1727
+ msgs = [{'role': 'user', 'content': [instruction]}]
1728
+
1729
+ res = model.chat(
1730
+ msgs=msgs,
1731
+ tokenizer=tokenizer,
1732
+ sampling=True,
1733
+ max_new_tokens=128,
1734
+ use_tts_template=True,
1735
+ generate_audio=True,
1736
+ temperature=0.3,
1737
+ output_audio_path='result_voice_creation.wav',
1738
+ )
1739
+ ```
1740
+
1741
+ <hr/>
1742
+
1743
+ ##### Voice Cloning
1744
+
1745
+ `MiniCPM-o-2.6` can also do zero-shot text-to-speech, a.k.a. **Voice Cloning**. In this mode, the model acts as a TTS model.
1746
+
1747
+
1748
+ ```python
1749
+ ref_audio, _ = librosa.load('./assets/input_examples/icl_20.wav', sr=16000, mono=True) # load the reference audio
1750
+ sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='voice_cloning', language='en')
1751
+ text_prompt = f"Please read the text below."
1752
+ user_question = {'role': 'user', 'content': [text_prompt, "content that you want to read"]}
1753
+
1754
+ msgs = [sys_prompt, user_question]
1755
+ res = model.chat(
1756
+ msgs=msgs,
1757
+ tokenizer=tokenizer,
1758
+ sampling=True,
1759
+ max_new_tokens=128,
1760
+ use_tts_template=True,
1761
+ generate_audio=True,
1762
+ temperature=0.3,
1763
+ output_audio_path='result_voice_cloning.wav',
1764
+ )
1765
+
1766
+ ```
1767
+
1768
+ <hr/>
1769
+
1770
+ ##### Addressing Various Audio Understanding Tasks
1771
+
1772
+ `MiniCPM-o-2.6` can also be used to address various audio understanding tasks, such as ASR, speaker analysis, general audio captioning, and sound scene tagging.
1773
+
1774
+ For audio-to-text tasks, you can use the following prompts:
1775
+
1776
+ - ASR with ZH (same as AST en2zh): `请仔细听这段音频片段,并将其内容逐字记录。` (i.e., "Please listen to this audio clip carefully and transcribe its content verbatim.")
1777
+ - ASR with EN (same as AST zh2en): `Please listen to the audio snippet carefully and transcribe the content.`
1778
+ - Speaker Analysis: `Based on the speaker's content, speculate on their gender, condition, age range, and health status.`
1779
+ - General Audio Caption: `Summarize the main content of the audio.`
1780
+ - General Sound Scene Tagging: `Utilize one keyword to convey the audio's content or the associated scene.`
1781
+
1782
+ ```python
1783
+ task_prompt = "Please listen to the audio snippet carefully and transcribe the content." + "
1784
+ " # can change to other prompts.
1785
+ audio_input, _ = librosa.load('./assets/input_examples/audio_understanding.mp3', sr=16000, mono=True) # load the audio to be captioned
1786
+
1787
+ msgs = [{'role': 'user', 'content': [task_prompt, audio_input]}]
1788
+
1789
+ res = model.chat(
1790
+ msgs=msgs,
1791
+ tokenizer=tokenizer,
1792
+ sampling=True,
1793
+ max_new_tokens=128,
1794
+ use_tts_template=True,
1795
+ generate_audio=True,
1796
+ temperature=0.3,
1797
+ output_audio_path='result_audio_understanding.wav',
1798
+ )
1799
+ print(res)
1800
  ```
 
1801
 
1802
 
 
 
 
 
1803
 
 
 
 
1804
 
1805
+ #### Multimodal Live Streaming
1806
+ <details>
1807
+ <summary> Click to view Python code running MiniCPM-o 2.6 with chat inference. </summary>
1808
+
1809
+ ```python
1810
+ import math
1811
+ import numpy as np
1812
+ from PIL import Image
1813
+ from moviepy.editor import VideoFileClip
1814
+ import tempfile
1815
+ import librosa
1816
+ import soundfile as sf
1817
+ import torch
1818
+ from transformers import AutoModel, AutoTokenizer
1819
 
1820
+ def get_video_chunk_content(video_path, flatten=True):
1821
+ video = VideoFileClip(video_path)
1822
+ print('video_duration:', video.duration)
1823
+
1824
+ with tempfile.NamedTemporaryFile(suffix=".wav", delete=True) as temp_audio_file:
1825
+ temp_audio_file_path = temp_audio_file.name
1826
+ video.audio.write_audiofile(temp_audio_file_path, codec="pcm_s16le", fps=16000)
1827
+ audio_np, sr = librosa.load(temp_audio_file_path, sr=16000, mono=True)
1828
+ num_units = math.ceil(video.duration)
1829
+
1830
+ # 1 frame + 1s audio chunk
1831
+ contents= []
1832
+ for i in range(num_units):
1833
+ frame = video.get_frame(i+1)
1834
+ image = Image.fromarray((frame).astype(np.uint8))
1835
+ audio = audio_np[sr*i:sr*(i+1)]
1836
+ if flatten:
1837
+ contents.extend(["<unit>", image, audio])
1838
+ else:
1839
+ contents.append(["<unit>", image, audio])
1840
+
1841
+ return contents
1842
 
 
1843
 
1844
+ model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True,
1845
+ attn_implementation='sdpa', torch_dtype=torch.bfloat16)
1846
+ model = model.eval().cuda()
1847
+ tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)
1848
 
1849
+ model.init_tts()
1850
 
1851
+ # If you are using an older version of PyTorch, you might encounter the error "weight_norm_fwd_first_dim_kernel" not implemented for 'BFloat16'. If so, convert the TTS module to float32:
1852
+ # model.tts.float()
 
 
 
 
 
 
 
 
1853
 
1854
+ # https://huggingface.co/openbmb/MiniCPM-o-2_6/blob/main/assets/Skiing.mp4
1855
+ video_path="assets/Skiing.mp4"
1856
+ sys_msg = model.get_sys_prompt(mode='omni', language='en')
1857
+ # if use voice clone prompt, please set ref_audio
1858
+ # ref_audio_path = '/path/to/ref_audio'
1859
+ # ref_audio, _ = librosa.load(ref_audio_path, sr=16000, mono=True)
1860
+ # sys_msg = model.get_sys_prompt(ref_audio=ref_audio, mode='omni', language='en')
1861
 
1862
+ contents = get_video_chunk_content(video_path)
1863
+ msg =