---
license: apple-amlr
pipeline_tag: text-generation
library_name: litert-lm
tags:
- ml-fastvlm
- litert
- litertlm
base_model:
- apple/FastVLM-0.5B
---

# litert-community/FastVLM-0.5B

*Main Model Card*: [apple/FastVLM-0.5B](https://huggingface.co/apple/FastVLM-0.5B)

This model card provides *FastVLM-0.5B converted for LiteRT*, ready for on-device use, subject to license.

FastVLM was introduced in [FastVLM: Efficient Vision Encoding for Vision Language Models](https://www.arxiv.org/abs/2412.13303) *(CVPR 2025)*. The model demonstrates a significant improvement in time-to-first-token (TTFT) while maintaining strong performance, making it suitable for edge-device deployment.

The model is supported on CPU, GPU, and Qualcomm NPUs. For more details on the Qualcomm integration, see this [blog post](https://developers.googleblog.com/unlocking-peak-performance-on-qualcomm-npu-with-litert/).

*Disclaimer*: This model converted for LiteRT is licensed under the [Apple Machine Learning Research Model License Agreement](https://huggingface.co/apple/deeplabv3-mobilevit-small/blob/main/LICENSE). The model is converted and quantized from the PyTorch model weights into the LiteRT/TensorFlow Lite format (no retraining or further customization).

# How to Use

## Android (Google AI Edge Gallery)

You can either install Google AI Edge Gallery through the [Open Beta in the Play Store](https://play.google.com/store/apps/details?id=com.google.ai.edge.gallery) or install the [APK](https://github.com/google-ai-edge/gallery/releases/latest/download/ai-edge-gallery.apk) from GitHub.

To build the demo app from source, please follow the [instructions](https://github.com/google-ai-edge/gallery/blob/main/README.md) from the GitHub repository.

## Android (LiteRT-LM)

### 1. Add the dependency

Make sure you have the necessary dependency in your Gradle build file.

```kotlin
dependencies {
    implementation("com.google.ai.edge.litertlm:litertlm:<LATEST_VERSION>")
}
```
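
Replace `<LATEST_VERSION>` with the current release of the `litertlm` artifact.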

### 2. Inference with the LiteRT-LM API

```kotlin
import com.google.ai.edge.litertlm.*

suspend fun main() {
  Engine.setNativeMinLogSeverity(LogSeverity.ERROR) // Hide native logs for a console app.
  val engineConfig = EngineConfig(
      modelPath = "/path/to/your/model.litertlm", // Replace with your model path.
      backend = Backend.CPU, // Or Backend.GPU.
      visionBackend = Backend.GPU,
  )

  // See the Content class for other variants.
  val multiModalMessage = Message.of(
      Content.ImageFile("/path/to/image"),
      Content.Text("Describe this image."),
  )

  Engine(engineConfig).use { engine ->
    engine.initialize()

    engine.createConversation().use { conversation ->
      // Send the image and text prompt, streaming the response as it arrives.
      conversation.sendMessageAsync(multiModalMessage).collect { print(it) }

      // Continue the conversation with free-form text input.
      while (true) {
        print("\n>>> ")
        conversation.sendMessageAsync(Message.of(readln())).collect { print(it) }
      }
    }
  }
}
```

Try running this model on the NPU by using the corresponding `litertlm` file and setting your `EngineConfig`'s `backend` and `visionBackend` to NPU. To check whether your phone's NPU is supported, see this [guide](https://ai.google.dev/edge/litert/next/npu).
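
For example, a minimal sketch of an NPU configuration, assuming the same `EngineConfig` API as above; the model path is a placeholder for the NPU-specific file, such as the `FastVLM-0.5B.sm8850.litertlm` variant linked in the performance table below:

```kotlin
// A minimal sketch, assuming the EngineConfig API shown above.
val npuEngineConfig = EngineConfig(
    modelPath = "/path/to/your/FastVLM-0.5B.sm8850.litertlm", // NPU-specific model file
    backend = Backend.NPU,
    visionBackend = Backend.NPU,
)
```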

## Desktop 

To build a desktop application, C++ is currently the recommended path. See the following code sample.

```cpp
// Create the engine settings with the proper multimodal backends.
// `model_assets` is assumed to have been loaded beforehand.
auto engine_settings = EngineSettings::CreateDefault(
    model_assets,
    /*backend=*/litert::lm::Backend::CPU,
    /*vision_backend=*/litert::lm::Backend::GPU);

// Send a message with image data to the LLM. `conversation` is assumed to
// have been created from the engine beforehand.
absl::StatusOr<Message> model_message = (*conversation)->SendMessage(
    JsonMessage{
        {"role", "user"},
        {"content", {  // For multimodal input, content must be an array.
          {{"type", "text"}, {"text", "Describe the following image: "}},
          {{"type", "image"}, {"path", "/file/path/to/image.jpg"}}
        }},
    });
CHECK_OK(model_message);

// Print the model's response.
std::cout << *model_message << std::endl;
```
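
In this snippet, `model_assets` and `conversation` are assumed to have been created through the standard LiteRT-LM engine setup; see the [LiteRT-LM repository](https://github.com/google-ai-edge/LiteRT-LM) for the complete engine and conversation creation flow.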

# Performance

## Android

Benchmarked on a Xiaomi 17 Pro Max.


<table border="1">
  <tr>
   <th style="text-align: left">Backend</th>
   <th style="text-align: left">Quantization scheme</th>
   <th style="text-align: left">Context length</th>
   <th style="text-align: left">Prefill (tokens/sec)</th>
   <th style="text-align: left">Decode (tokens/sec)</th>
   <th style="text-align: left">Time-to-first-token (sec)</th>
   <th style="text-align: left">Memory (RSS in MB)</th>
   <th style="text-align: left">Model size (MB)</th>
   <th style="text-align: left">Model File</th>
  </tr>
  <tr>
<td><p style="text-align: left">GPU</p></td>
<td><p style="text-align: left">dynamic_int8</p></td>
<td><p style="text-align: right">1280</p></td>
<td><p style="text-align: right">2,220 tk/s</p></td>
<td><p style="text-align: right">64 tk/s</p></td>
<td><p style="text-align: right">0.55 s</p></td>
<td><p style="text-align: right">1766 MB</p></td>
<td><p style="text-align: right">1103 MB</p></td>
<td><p style="text-align: left"><a style="text-decoration: none" href="https://huggingface.co/litert-community/FastVLM-0.5B/resolve/main/FastVLM-0.5B.litertlm">&#128279;</a></p></td>
</tr>

<tr>
<td><p style="text-align: left">NPU</p></td>
<td><p style="text-align: left">dynamic_int8</p></td>
<td><p style="text-align: right">1280</p></td>
<td><p style="text-align: right">11,272 tk/s</p></td>
<td><p style="text-align: right">106 tk/s</p></td>
<td><p style="text-align: right">0.12 s</p></td>
<td><p style="text-align: right">925 MB</p></td>
<td><p style="text-align: right">899 MB</p></td>
<td><p style="text-align: left"><a style="text-decoration: none" href="https://huggingface.co/litert-community/FastVLM-0.5B/resolve/main/FastVLM-0.5B.sm8850.litertlm">&#128279;</a></p></td>
</tr>

</table>



Notes:
* Model size: measured as the size of the model file on disk.
* TTFT includes the encoding time for one image and the corresponding text prompt.
* Benchmarks are run with the cache enabled and initialized; during the first run, latency and memory usage may differ.