Qwen 3 Thinking 4B - MediaPipe LiteRT

This repository contains a Qwen 3 4B Thinking model optimized for on-device inference with the MediaPipe LLM Inference API.

Model Details

  • Base Model: Qwen3-4B-Thinking-2507
  • Architecture: Qwen 3 with thinking capabilities
  • Quantization: Q4 block128
  • Format: MediaPipe .task bundle (TFLite + SentencePiece tokenizer)
  • KV Cache Size: 2048 tokens (ekv2048)
  • Model Size: 2.0 GB
  • Framework: ai-edge-torch → LiteRT → MediaPipe

Features

  • ✅ Thinking mode enabled (reasoning is emitted between <think> and </think> tags; see the parsing sketch after this list)
  • ✅ GPU acceleration support (OpenCL/WebGPU)
  • ✅ CPU fallback (XNNPACK)
  • ✅ Optimized for Android/iOS devices
  • ✅ No authentication required
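
Since the reasoning is emitted inline with the answer, client code usually separates the two before display. A minimal post-processing sketch in Python (split_thinking is a hypothetical helper; some chat templates inject the opening <think> tag themselves, so outputs containing only the closing tag are handled as well):

import re

def split_thinking(text: str) -> tuple[str, str]:
    """Split generated text into (reasoning, answer)."""
    # Case 1: full <think>...</think> block in the output
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match:
        reasoning = match.group(1).strip()
        answer = (text[:match.start()] + text[match.end():]).strip()
        return reasoning, answer
    # Case 2: the opening tag came from the prompt template; only </think> appears
    if "</think>" in text:
        reasoning, _, answer = text.partition("</think>")
        return reasoning.replace("<think>", "").strip(), answer.strip()
    return "", text.strip()

print(split_thinking("<think>2 + 2 is basic addition.</think>2 + 2 = 4."))
# ('2 + 2 is basic addition.', '2 + 2 = 4.')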

Usage

Android (MediaPipe LLM Inference API)

import com.google.mediapipe.tasks.genai.llminference.LlmInference
import com.google.mediapipe.tasks.genai.llminference.LlmInference.Backend
import com.google.mediapipe.tasks.genai.llminference.LlmInferenceSession

// Engine options: model location, context length, preferred backend.
// "context" below is your Android Context (Activity/Application).
val options = LlmInference.LlmInferenceOptions.builder()
    .setModelPath("qwen3_thinking_4b.task")
    .setMaxTokens(2048)
    .setPreferredBackend(Backend.GPU)
    .build()

val llm = LlmInference.createFromOptions(context, options)

// Create a session with generation (sampling) parameters
val sessionOptions = LlmInferenceSession.LlmInferenceSessionOptions.builder()
    .setTemperature(0.7f)
    .setTopK(40)
    .setTopP(0.95f)
    .build()

val session = LlmInferenceSession.createFromOptions(llm, sessionOptions)

// Generate a response; partial results stream in until done == true
session.addQueryChunk("Explain quantum computing")
session.generateResponseAsync { partialResult, done ->
    println(partialResult)
}
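
setModelPath expects a path to the .task file on the device file system; since the bundle is 2 GB it is typically downloaded on first launch or pushed with adb during development rather than shipped inside the APK.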

Web (MediaPipe Tasks GenAI)

import { FilesetResolver, LlmInference } from '@mediapipe/tasks-genai';

// Load the WASM assets for the GenAI tasks
const genaiFileset = await FilesetResolver.forGenAiTasks(
  'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai/wasm'
);

// Create the LLM with model location and sampling parameters
// (topP is not available in the Web API; see Limitations)
const llm = await LlmInference.createFromOptions(genaiFileset, {
  baseOptions: { modelAssetPath: 'qwen3_thinking_4b.task' },
  maxTokens: 2048,
  temperature: 0.7,
  topK: 40,
  randomSeed: 101
});

// Stream partial results until done is true
llm.generateResponse('What is 2+2?', (partialResult, done) => {
  console.log(partialResult);
});
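
modelAssetPath is fetched over HTTP when the task is created, so the .task file must be hosted somewhere the page can reach; see the caching note under Limitations.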

Recommended Parameters

Parameter    Value  Description
maxTokens    2048   Must match KV cache size
temperature  0.7    Controls randomness
topK         40     Top-k sampling
topP         0.95   Nucleus sampling
backend      GPU    Use GPU for best performance

File Contents

qwen3_thinking_4b.task (2.0 GB)

MediaPipe task bundle containing:

  • TFLite model with custom GenAI ops (prefill + decode)
  • SentencePiece tokenizer (151,669 tokens)
  • Metadata (stop tokens: <|im_end|>)
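
Because a .task bundle is a zip archive, its contents can be verified after conversion. A quick inspection sketch in Python (standard library only; the member names written by the bundler can vary between MediaPipe releases):

import zipfile

with zipfile.ZipFile("qwen3_thinking_4b.task") as bundle:
    # Expect a TFLite model plus the SentencePiece tokenizer and metadata
    for info in bundle.infolist():
        print(f"{info.filename}: {info.file_size / 1e6:.1f} MB")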

convert_qwen3_thinking.py

Conversion script using ai-edge-torch:

python convert_qwen3_thinking.py

Converts: HuggingFace PyTorch → ai-edge-torch → TFLite → MediaPipe .task

Conversion Pipeline

  1. Source: Qwen/Qwen3-4B-Thinking-2507 from HuggingFace
  2. Tokenizer Conversion: HuggingFace tokenizer.json → SentencePiece .model
  3. Model Conversion: PyTorch → TFLite (via ai-edge-torch)
  4. Quantization: Q4 block128 (4-bit weights)
  5. Bundling: TFLite + tokenizer → .task (via MediaPipe bundler)
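
The final bundling step is done with the MediaPipe GenAI bundler. A minimal sketch, assuming the intermediate file names below (the real names are produced by convert_qwen3_thinking.py) and the standard Qwen chat-template tokens:

from mediapipe.tasks.python.genai import bundler

config = bundler.BundleConfig(
    tflite_model="qwen3_4b_thinking_q4_ekv2048.tflite",  # assumed output of steps 3-4
    tokenizer_model="qwen3_tokenizer.model",             # assumed output of step 2
    start_token="<|im_start|>",
    stop_tokens=["<|im_end|>"],
    output_filename="qwen3_thinking_4b.task",
    enable_bytes_to_unicode_mapping=True,  # assumption; depends on the tokenizer conversion
)
bundler.create_bundle(config)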

Performance

Tested on Pixel 7a (Android 14):

Backend  Load Time  TTFT   Tokens/sec
GPU      ~8s        ~1.2s  ~6-8
CPU      ~5s        ~2.5s  ~2-3

TTFT = Time To First Token

Limitations

  • The Web API does not support topP (available on Android/iOS only)
  • First load downloads 2GB model (subsequent loads cached)
  • Requires devices with:
    • Android 7+ (API 24+) / iOS 13+
    • 3GB+ RAM
    • GPU: OpenCL/Vulkan support recommended

License

Apache 2.0 (inherits from base Qwen model)

Citation

@software{qwen3_thinking_litert,
  title = {Qwen 3 Thinking 4B - MediaPipe LiteRT},
  author = {harithoppil},
  year = {2025},
  url = {https://huggingface.co/harithoppil/qwen3-4b-thinking-litert}
}

Issues

Report issues at the GitHub repository or model discussion board.
