---
license: apple-amlr
pipeline_tag: text-generation
library_name: litert-lm
tags:
- ml-fastvlm
- litert
- litertlm
base_model:
- apple/FastVLM-0.5B
---
# litert-community/FastVLM-0.5B
*Main Model Card*: [apple/FastVLM-0.5B](https://huggingface.co/apple/FastVLM-0.5B)
This model card provides *FastVLM-0.5B converted for LiteRT*, ready for on-device use, subject to license.
FastVLM was introduced in [FastVLM: Efficient Vision Encoding for Vision Language Models](https://www.arxiv.org/abs/2412.13303) *(CVPR 2025)*. The model delivers a marked improvement in time-to-first-token (TTFT) while maintaining competitive performance, making it well suited for edge-device deployment.
The model is supported on CPU, GPU, and Qualcomm NPUs. For details on the Qualcomm NPU integration, see this [blogpost](https://developers.googleblog.com/unlocking-peak-performance-on-qualcomm-npu-with-litert/).
*Disclaimer*: This model converted for LiteRT is licensed under the [Apple Machine Learning Research Model License Agreement](https://huggingface.co/apple/deeplabv3-mobilevit-small/blob/main/LICENSE). The model was converted and quantized from the PyTorch model weights into the LiteRT (TensorFlow Lite) format, with no retraining or further customization.
# How to Use
## Android (Google AI Edge Gallery)
You can install Google AI Edge Gallery either through the [Open Beta in the Play Store](https://play.google.com/store/apps/details?id=com.google.ai.edge.gallery) or via the [APK](https://github.com/google-ai-edge/gallery/releases/latest/download/ai-edge-gallery.apk) from GitHub.
To build the demo app from source, please follow the [instructions](https://github.com/google-ai-edge/gallery/blob/main/README.md) from the GitHub repository.
## Android (LiteRT-LM)
### 1. Add the dependency
Make sure you have the necessary dependency in your `Gradle` file.
```kotlin
dependencies {
    implementation("com.google.ai.edge.litertlm:litertlm:<LATEST_VERSION>")
}
```
### 2. Inference with the LiteRT-LM API
```kotlin
import com.google.ai.edge.litertlm.*

suspend fun main() {
    Engine.setNativeMinLogSeverity(LogSeverity.ERROR) // Hide native logs for a console app.

    val engineConfig = EngineConfig(
        modelPath = "/path/to/your/model.litertlm", // Replace with your model path.
        backend = Backend.CPU, // Or Backend.GPU.
        visionBackend = Backend.GPU,
    )

    // See the Content class for other variants.
    val multiModalMessage = Message.of(
        Content.ImageFile("/path/to/image"),
        Content.Text("Describe this image."),
    )

    Engine(engineConfig).use { engine ->
        engine.initialize()
        engine.createConversation().use { conversation ->
            // Send the image + text message and stream the response.
            conversation.sendMessageAsync(multiModalMessage).collect { print(it) }
            // Continue the conversation interactively.
            while (true) {
                print("\n>>> ")
                conversation.sendMessageAsync(Message.of(readln())).collect { print(it) }
            }
        }
    }
}
```
Try running this model on the NPU by using the corresponding `litertlm` file and setting your `EngineConfig`'s `backend` and `visionBackend` to `Backend.NPU`. To check whether your phone's NPU is supported, see this [guide](https://ai.google.dev/edge/litert/next/npu).
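A minimal sketch of that configuration, assuming the same `EngineConfig` API as the example above (the model path below points at the NPU file linked in the performance table):
```kotlin
import com.google.ai.edge.litertlm.Backend
import com.google.ai.edge.litertlm.EngineConfig

// NPU configuration sketch: use the NPU-specific .litertlm file and
// select the NPU backend for both the language and vision models.
val npuEngineConfig = EngineConfig(
    modelPath = "/path/to/FastVLM-0.5B.sm8850.litertlm", // NPU model file.
    backend = Backend.NPU,
    visionBackend = Backend.NPU,
)
```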
## Desktop
For desktop applications, C++ is currently the recommended path. See the following code sample.
```cpp
// Create the engine settings with the proper multimodality backends.
// `model_assets` is assumed to have been created from the .litertlm file
// path beforehand (see the LiteRT-LM documentation).
auto engine_settings = EngineSettings::CreateDefault(
    model_assets,
    /*backend=*/litert::lm::Backend::CPU,
    /*vision_backend=*/litert::lm::Backend::GPU);

// ... create the Engine and a Conversation from `engine_settings` ...

// Send a message with both text and image data to the model.
absl::StatusOr<Message> model_message = (*conversation)->SendMessage(
    JsonMessage{
        {"role", "user"},
        {"content", {  // For multimodal input, content is an array of parts.
            {{"type", "text"}, {"text", "Describe the following image: "}},
            {{"type", "image"}, {"path", "/file/path/to/image.jpg"}}
        }},
    });
CHECK_OK(model_message);

// Print the model's response.
std::cout << *model_message << std::endl;
```
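The full C++ setup, including creating the `Engine` and `Conversation` elided above, is documented in the [LiteRT-LM GitHub repository](https://github.com/google-ai-edge/LiteRT-LM).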
# Performance
## Android
Benchmarked on Xiaomi 17 Pro Max.
<table border="1">
<tr>
<th style="text-align: left">Backend</th>
<th style="text-align: left">Quantization scheme</th>
<th style="text-align: left">Context length</th>
<th style="text-align: left">Prefill (tokens/sec)</th>
<th style="text-align: left">Decode (tokens/sec)</th>
<th style="text-align: left">Time-to-first-token (sec)</th>
<th style="text-align: left">Memory (RSS in MB)</th>
<th style="text-align: left">Model size (MB)</th>
<th style="text-align: left">Model File</th>
</tr>
<tr>
<td><p style="text-align: left">GPU</p></td>
<td><p style="text-align: left">dynamic_int8</p></td>
<td><p style="text-align: right">1280</p></td>
<td><p style="text-align: right">2,220 tk/s</p></td>
<td><p style="text-align: right">64 tk/s</p></td>
<td><p style="text-align: right">0.55 s</p></td>
<td><p style="text-align: right">1766 MB</p></td>
<td><p style="text-align: right">1103 MB</p></td>
<td><p style="text-align: left"><a style="text-decoration: none" href="https://huggingface.co/litert-community/FastVLM-0.5B/resolve/main/FastVLM-0.5B.litertlm">🔗</a></p></td>
</tr>
<tr>
<td><p style="text-align: left">NPU</p></td>
<td><p style="text-align: left">dynamic_int8</p></td>
<td><p style="text-align: right">1280</p></td>
<td><p style="text-align: right">11,272 tk/s</p></td>
<td><p style="text-align: right">106 tk/s</p></td>
<td><p style="text-align: right">0.12 s</p></td>
<td><p style="text-align: right">925 MB</p></td>
<td><p style="text-align: right">899 MB</p></td>
<td><p style="text-align: left"><a style="text-decoration: none" href="https://huggingface.co/litert-community/FastVLM-0.5B/resolve/main/FastVLM-0.5B.sm8850.litertlm">🔗</a></p></td>
</tr>
</table>
Notes:
* Model size: measured as the size of the model file on disk.
* TTFT includes the encoding time for one image and the corresponding text prompt.
* Benchmarks are run with the cache enabled and initialized; latency and memory usage may differ during the first run.
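As an illustrative back-of-the-envelope check (not part of the benchmark harness, and assuming a prompt that fills most of the 1280-token context), TTFT is roughly the prompt length divided by prefill throughput, plus image-encoding overhead:
```kotlin
// Rough TTFT estimate: prompt tokens / prefill throughput.
// Hypothetical helper for illustration only; the measured TTFT above
// additionally includes the image-encoding time for one image.
fun estimateTtftSeconds(promptTokens: Int, prefillTokensPerSec: Double): Double =
    promptTokens / prefillTokensPerSec

fun main() {
    // GPU row: 1280 / 2,220 ≈ 0.58 s (measured: 0.55 s).
    println(estimateTtftSeconds(1280, 2220.0))
    // NPU row: 1280 / 11,272 ≈ 0.11 s (measured: 0.12 s).
    println(estimateTtftSeconds(1280, 11272.0))
}
```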