# Qwen 3 Thinking 4B - MediaPipe LiteRT

This repository contains a Qwen 3 4B Thinking model optimized for on-device inference with the MediaPipe LLM Inference API.
## Model Details
- **Base Model:** Qwen3-4B-Thinking-2507
- **Architecture:** Qwen 3 with thinking capabilities
- **Quantization:** Q4 block128
- **Format:** MediaPipe `.task` bundle (TFLite + SentencePiece tokenizer)
- **KV Cache Size:** 2048 tokens (`ekv2048`)
- **Model Size:** 2.0 GB
- **Framework:** ai-edge-torch → LiteRT → MediaPipe
## Features
- ✅ Thinking mode enabled (outputs reasoning in `<think>` tokens; see the parsing sketch after this list)
- ✅ GPU acceleration support (OpenCL/WebGPU)
- ✅ CPU fallback (XNNPACK)
- ✅ Optimized for Android/iOS devices
- ✅ No authentication required
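The reasoning arrives inline with the answer. Below is a minimal Kotlin sketch (not part of the MediaPipe API) for separating the two, assuming the reasoning block ends with a `</think>` tag; with Qwen3-Thinking chat templates the opening `<think>` may be injected by the template and thus absent from the output:

```kotlin
// Hypothetical helper: split generated text into (reasoning, answer).
// Assumes reasoning ends at "</think>"; the opening "<think>" may be
// absent when the chat template inserts it automatically.
fun splitThinking(output: String): Pair<String, String> {
    val end = output.indexOf("</think>")
    if (end == -1) return "" to output.trim() // no reasoning block found
    val reasoning = output.take(end).removePrefix("<think>").trim()
    val answer = output.substring(end + "</think>".length).trim()
    return reasoning to answer
}
```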
## Usage
### Android (MediaPipe LLM Inference API)
```kotlin
import com.google.mediapipe.tasks.genai.llminference.LlmInference
import com.google.mediapipe.tasks.genai.llminference.LlmInference.Backend
import com.google.mediapipe.tasks.genai.llminference.LlmInferenceSession
import com.google.mediapipe.tasks.genai.llminference.LlmInferenceSession.LlmInferenceSessionOptions

// Configure the inference engine: model path, context length, backend
val options = LlmInference.LlmInferenceOptions.builder()
    .setModelPath("qwen3_thinking_4b.task")
    .setMaxTokens(2048)
    .setPreferredBackend(Backend.GPU)
    .build()
val llm = LlmInference.createFromOptions(context, options)

// Create a session with generation parameters
val sessionOptions = LlmInferenceSessionOptions.builder()
    .setTemperature(0.7f)
    .setTopK(40)
    .setTopP(0.95f)
    .build()
val session = LlmInferenceSession.createFromOptions(llm, sessionOptions)

// Generate a streamed response
session.addQueryChunk("Explain quantum computing")
session.generateResponseAsync { partialResult, done ->
    println(partialResult)
}
```
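`generateResponseAsync` streams partial chunks, so if you need the complete text (for example, to strip the thinking block with `splitThinking` from the Features section), buffer the chunks until `done` is true. A minimal sketch, reusing the `session` above:

```kotlin
// Collect streamed chunks; `done` is true on the final callback.
val response = StringBuilder()
session.generateResponseAsync { partialResult, done ->
    response.append(partialResult)
    if (done) {
        val (reasoning, answer) = splitThinking(response.toString())
        println(answer)
    }
}
```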
### Web (MediaPipe Tasks GenAI)
```javascript
import { FilesetResolver, LlmInference } from '@mediapipe/tasks-genai';

const genaiFileset = await FilesetResolver.forGenAiTasks(
  'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai/wasm'
);

const llm = await LlmInference.createFromOptions(genaiFileset, {
  baseOptions: { modelAssetPath: 'qwen3_thinking_4b.task' },
  maxTokens: 2048,
  temperature: 0.7,
  topK: 40,
  randomSeed: 101
});

llm.generateResponse('What is 2+2?', (partialResult, done) => {
  console.log(partialResult);
});
```
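As on Android, the callback receives incremental text until `done` is `true`. Note that `modelAssetPath` is fetched over HTTP, so the 2.0 GB `.task` bundle must be hosted where the page can reach it (see Limitations below for first-load and caching behavior).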
## Recommended Parameters
| Parameter | Value | Description |
|---|---|---|
| `maxTokens` | 2048 | Must match KV cache size |
| `temperature` | 0.7 | Controls randomness |
| `topK` | 40 | Top-k sampling |
| `topP` | 0.95 | Nucleus sampling |
| `backend` | GPU | Use GPU for best performance |
## File Contents
### `qwen3_thinking_4b.task` (2.0 GB)
MediaPipe task bundle containing:
- TFLite model with custom GenAI ops (prefill + decode)
- SentencePiece tokenizer (151,669 tokens)
- Metadata (stop tokens: `<|im_end|>`)
### `convert_qwen3_thinking.py`

Conversion script using ai-edge-torch:

```bash
python convert_qwen3_thinking.py
```

Converts: HuggingFace PyTorch → ai-edge-torch → TFLite → MediaPipe `.task`
## Conversion Pipeline
- **Source:** `Qwen/Qwen3-4B-Thinking-2507` from HuggingFace
- **Tokenizer Conversion:** HuggingFace `tokenizer.json` → SentencePiece `.model`
- **Model Conversion:** PyTorch → TFLite (via ai-edge-torch)
- **Quantization:** Q4 block128 (4-bit weights)
- **Bundling:** TFLite + tokenizer → `.task` (via MediaPipe bundler)
## Performance
Tested on Pixel 7a (Android 14):
| Backend | Load Time | TTFT | Tokens/sec |
|---|---|---|---|
| GPU | ~8s | ~1.2s | ~6-8 |
| CPU | ~5s | ~2.5s | ~2-3 |
TTFT = Time To First Token
## Limitations
- Web API doesn't support the `topP` parameter (Android/iOS only)
- First load downloads the 2 GB model (subsequent loads are cached)
- Requires devices with:
  - Android 7+ (API 24+) / iOS 13+
  - 3GB+ RAM
  - GPU: OpenCL/Vulkan support recommended (see the fallback sketch after this list)
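On devices without adequate GPU support, engine creation with the GPU backend can fail. A minimal fallback sketch, assuming creation throws a `RuntimeException` when the preferred backend can't initialize and that `Backend.CPU` selects the XNNPACK path:

```kotlin
import android.content.Context
import com.google.mediapipe.tasks.genai.llminference.LlmInference
import com.google.mediapipe.tasks.genai.llminference.LlmInference.Backend

// Try the GPU backend first; fall back to CPU (XNNPACK) if creation fails.
fun createEngine(context: Context, modelPath: String): LlmInference {
    for (backend in listOf(Backend.GPU, Backend.CPU)) {
        try {
            val options = LlmInference.LlmInferenceOptions.builder()
                .setModelPath(modelPath)
                .setMaxTokens(2048) // must match the 2048-token KV cache
                .setPreferredBackend(backend)
                .build()
            return LlmInference.createFromOptions(context, options)
        } catch (e: RuntimeException) {
            // Backend unavailable on this device; try the next one.
        }
    }
    error("No usable backend for $modelPath")
}
```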
## License

Apache 2.0 (inherited from the base Qwen model)
## Citation

```bibtex
@software{qwen3_thinking_litert,
  title  = {Qwen 3 Thinking 4B - MediaPipe LiteRT},
  author = {harithoppil},
  year   = {2025},
  url    = {https://huggingface.co/harithoppil/qwen3-4b-thinking-litert}
}
```
## Acknowledgments
- **Base Model:** Qwen Team - Qwen3-4B-Thinking-2507
- **Conversion Tools:** ai-edge-torch
- **Inference Runtime:** MediaPipe LLM Inference API
- **Tokenizer:** Custom SentencePiece conversion from HuggingFace tokenizer
## Issues
Report issues at the GitHub repository or model discussion board.