Qwen 3 Thinking 4B - MediaPipe LiteRT

This repository contains a Qwen 3 4B Thinking model optimized for on-device inference with the MediaPipe LLM Inference API.

Model Details

  • Base Model: Qwen3-4B-Thinking-2507
  • Architecture: Qwen 3 with thinking capabilities
  • Quantization: Q4 block128
  • Format: MediaPipe .task bundle (TFLite + SentencePiece tokenizer)
  • KV Cache Size: 2048 tokens (ekv2048)
  • Model Size: 2.0 GB
  • Framework: ai-edge-torch → LiteRT → MediaPipe

Features

  • ✅ Thinking mode enabled (reasoning is emitted between <think> and </think> tags; see the parsing sketch after this list)
  • ✅ GPU acceleration support (OpenCL/WebGPU)
  • ✅ CPU fallback (XNNPACK)
  • ✅ Optimized for Android/iOS devices
  • ✅ No authentication required
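
Since the reasoning is emitted inline with the answer, client code usually separates the two before display. A minimal post-processing sketch in Python (split_thinking is a hypothetical helper; some chat templates inject the opening <think> tag themselves, so outputs containing only the closing tag are handled as well):

import re

def split_thinking(text: str) -> tuple[str, str]:
    """Split generated text into (reasoning, answer)."""
    # Case 1: full <think>...</think> block in the output
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match:
        reasoning = match.group(1).strip()
        answer = (text[:match.start()] + text[match.end():]).strip()
        return reasoning, answer
    # Case 2: the opening tag came from the prompt template; only </think> appears
    if "</think>" in text:
        reasoning, _, answer = text.partition("</think>")
        return reasoning.replace("<think>", "").strip(), answer.strip()
    return "", text.strip()

print(split_thinking("<think>2 + 2 is basic addition.</think>2 + 2 = 4."))
# ('2 + 2 is basic addition.', '2 + 2 = 4.')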

Usage

Android (MediaPipe LLM Inference API)

import com.google.mediapipe.tasks.genai.llminference.LlmInference
import com.google.mediapipe.tasks.genai.llminference.LlmInference.Backend
import com.google.mediapipe.tasks.genai.llminference.LlmInferenceSession

// Engine options: model location, context length, preferred backend.
// "context" below is your Android Context (Activity/Application).
val options = LlmInference.LlmInferenceOptions.builder()
    .setModelPath("qwen3_thinking_4b.task")
    .setMaxTokens(2048)
    .setPreferredBackend(Backend.GPU)
    .build()

val llm = LlmInference.createFromOptions(context, options)

// Create a session with generation (sampling) parameters
val sessionOptions = LlmInferenceSession.LlmInferenceSessionOptions.builder()
    .setTemperature(0.7f)
    .setTopK(40)
    .setTopP(0.95f)
    .build()

val session = LlmInferenceSession.createFromOptions(llm, sessionOptions)

// Generate a response; partial results stream in until done == true
session.addQueryChunk("Explain quantum computing")
session.generateResponseAsync { partialResult, done ->
    println(partialResult)
}
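
setModelPath expects a path to the .task file on the device file system; since the bundle is 2 GB it is typically downloaded on first launch or pushed with adb during development rather than shipped inside the APK.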

Web (MediaPipe Tasks GenAI)

import { FilesetResolver, LlmInference } from '@mediapipe/tasks-genai';

// Load the WASM assets for the GenAI tasks
const genaiFileset = await FilesetResolver.forGenAiTasks(
  'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai/wasm'
);

// Create the LLM with model location and sampling parameters
// (topP is not available in the Web API; see Limitations)
const llm = await LlmInference.createFromOptions(genaiFileset, {
  baseOptions: { modelAssetPath: 'qwen3_thinking_4b.task' },
  maxTokens: 2048,
  temperature: 0.7,
  topK: 40,
  randomSeed: 101
});

// Stream partial results until done is true
llm.generateResponse('What is 2+2?', (partialResult, done) => {
  console.log(partialResult);
});
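
modelAssetPath is fetched over HTTP when the task is created, so the .task file must be hosted somewhere the page can reach; see the caching note under Limitations.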

Recommended Parameters

Parameter    Value  Description
maxTokens    2048   Must match KV cache size
temperature  0.7    Controls randomness
topK         40     Top-k sampling
topP         0.95   Nucleus sampling
backend      GPU    Use GPU for best performance

File Contents

qwen3_thinking_4b.task (2.0 GB)

MediaPipe task bundle containing:

  • TFLite model with custom GenAI ops (prefill + decode)
  • SentencePiece tokenizer (151,669 tokens)
  • Metadata (stop tokens: <|im_end|>)
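
Because a .task bundle is a zip archive, its contents can be verified after conversion. A quick inspection sketch in Python (standard library only; the member names written by the bundler can vary between MediaPipe releases):

import zipfile

with zipfile.ZipFile("qwen3_thinking_4b.task") as bundle:
    # Expect a TFLite model plus the SentencePiece tokenizer and metadata
    for info in bundle.infolist():
        print(f"{info.filename}: {info.file_size / 1e6:.1f} MB")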

convert_qwen3_thinking.py

Conversion script using ai-edge-torch:

python convert_qwen3_thinking.py

Converts: HuggingFace PyTorch → ai-edge-torch → TFLite → MediaPipe .task

Conversion Pipeline

  1. Source: Qwen/Qwen3-4B-Thinking-2507 from HuggingFace
  2. Tokenizer Conversion: HuggingFace tokenizer.json → SentencePiece .model
  3. Model Conversion: PyTorch → TFLite (via ai-edge-torch)
  4. Quantization: Q4 block128 (4-bit weights)
  5. Bundling: TFLite + tokenizer → .task (via MediaPipe bundler)
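
The final bundling step is done with the MediaPipe GenAI bundler. A minimal sketch, assuming the intermediate file names below (the real names are produced by convert_qwen3_thinking.py) and the standard Qwen chat-template tokens:

from mediapipe.tasks.python.genai import bundler

config = bundler.BundleConfig(
    tflite_model="qwen3_4b_thinking_q4_ekv2048.tflite",  # assumed output of steps 3-4
    tokenizer_model="qwen3_tokenizer.model",             # assumed output of step 2
    start_token="<|im_start|>",
    stop_tokens=["<|im_end|>"],
    output_filename="qwen3_thinking_4b.task",
    enable_bytes_to_unicode_mapping=True,  # assumption; depends on the tokenizer conversion
)
bundler.create_bundle(config)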

Performance

Tested on Pixel 7a (Android 14):

Backend  Load Time  TTFT   Tokens/sec
GPU      ~8s        ~1.2s  ~6-8
CPU      ~5s        ~2.5s  ~2-3

TTFT = Time To First Token

Limitations

  • The Web API does not support topP (available on Android/iOS only)
  • First load downloads 2GB model (subsequent loads cached)
  • Requires devices with:
    • Android 7+ (API 24+) / iOS 13+
    • 3GB+ RAM
    • GPU: OpenCL/Vulkan support recommended

License

Apache 2.0 (inherits from base Qwen model)

Citation

@software{qwen3_thinking_litert,
  title = {Qwen 3 Thinking 4B - MediaPipe LiteRT},
  author = {harithoppil},
  year = {2025},
  url = {https://huggingface.co/harithoppil/qwen3-4b-thinking-litert}
}

Issues

Report issues at the GitHub repository or model discussion board.
