# GPT-Distill-Qwen3-8B-Thinking
## 1. Model Overview

- Model Name: Jackrong/GPT-Distill-Qwen3-8B-Thinking
- Model Type: Instruction-tuned & reasoning-enhanced LLM
- Base Model: Qwen/Qwen3-8B
- Parameters: ~8B
- Context Length: Up to 16K tokens (`max_seq_length = 16384`)
- Supported Languages: Chinese, English, mixed inputs/outputs
- Training Method:
  - Supervised Fine-Tuning (SFT) using Unsloth (LoRA adapter merged to 16-bit)
  - Knowledge distillation from large-scale reasoning models (120B/235B class)
- Thinking/CoT Integration: Explicitly trained to generate internal reasoning chains wrapped in `<think>...</think>` tags.
### Description
This model is a specialized fine-tune of Qwen3-8B, designed to excel at complex reasoning and instruction following. It utilizes a 16k context window and has been distilled from high-intelligence teacher models (GPT-OSS-120B and Qwen3-235B).
A key feature is its "Thinking" capability: the model is trained to output a Chain-of-Thought (CoT) process before providing the final answer, significantly improving performance on math, logic, and scientific tasks.
## 2. Intended Use Cases

**✅ Recommended:**

- Complex Reasoning: Math problems, logical puzzles, and scientific derivations using the `<think>` mechanism.
- Long-Context Tasks: Processing documents or conversations up to 16k tokens.
- Instruction Following: High adherence to complex user constraints.
- Chinese/English NLP: Fluent generation in both languages, including cultural nuances.
- Knowledge Distillation: Acting as a lightweight student model that mimics the reasoning patterns of 100B+ parameter models.
**⚠️ Not Suitable For:**

- High-risk decision-making: Medical diagnosis or legal advice without professional oversight.
- Real-time Factual Updates: The model's knowledge is static, limited to its training data cutoff.
Note: The model may use `<think>` tags spontaneously; to trigger its reasoning capabilities more reliably, prompt it to "think step-by-step".
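Since downstream applications usually want to display or log the reasoning separately from the final answer, the `<think>...</think>` output described above can be post-processed. The helper below is a minimal sketch (the function name is an assumption for illustration, not part of this model's tooling):

```python
import re

def split_thinking(output: str) -> tuple[str, str]:
    """Split a model completion into (reasoning, answer).

    Assumes the chain-of-thought is wrapped in a single
    <think>...</think> block, as this model is trained to do;
    if no block is found, the reasoning part is returned empty.
    """
    match = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    if match is None:
        return "", output.strip()
    reasoning = match.group(1).strip()
    answer = output[match.end():].strip()
    return reasoning, answer

# Example on a mock completion:
completion = "<think>2 + 2 = 4, so the answer is 4.</think>The answer is 4."
reasoning, answer = split_thinking(completion)
```

In practice, `output` would be the decoded text returned by your inference stack (e.g. `transformers` `generate`) for a single assistant turn.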
## 3. Training Data & Distillation Process
The model was trained on a curated mix of ~88,000 high-quality examples, filtered for length and quality.
### Key Datasets
**(1) Reasoning & Thinking (CoT)**

- Sources: Jackrong/Natural-Reasoning-gpt-oss-120B-S1 & Jackrong/Chinese-Qwen3-235B-Thinking-2507-Distill-100k
- Format: Input -> `<think>` Reasoning Chain `</think>` -> Answer
- Purpose: Distills the "thinking process" of massive models (120B/235B) into the 8B architecture, teaching the model how to solve problems, not just the answer.
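As a sketch of how a sample in this Input -> `<think>` -> Answer layout might be assembled into chat-format SFT data (the function and field names here are illustrative assumptions, not the datasets' actual schema):

```python
def build_sample(question: str, reasoning: str, answer: str) -> dict:
    """Assemble one SFT sample in the layout described above:
    the assistant turn carries the teacher's reasoning chain in
    <think> tags, immediately followed by the final answer."""
    return {
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant",
             "content": f"<think>{reasoning}</think>{answer}"},
        ]
    }

sample = build_sample("What is 7 * 8?", "7 * 8 = 56.", "56")
```

Training on samples shaped like this is what lets the fine-tuned student reproduce the teacher's reasoning style at inference time.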
**(2) ShareGPT & Conversation**

- Source: Jackrong/ShareGPT-gpt-oss-120B-reasoning
- Purpose: Ensures natural multi-turn conversational flow and user-intent understanding.
**(3) Instruction Following**

- Source: Jackrong/gpt-oss-120b-Reasoning-Instruction
- Purpose: Enhances the ability to follow specific formatting and constraint instructions.
### Training Configuration

- Framework: Unsloth + TRL (`SFTTrainer`)
- Hardware: NVIDIA H100 80GB
- Batch Size: Global batch size of 32 (`per_device=4`, `grad_accum=8`)
- Optimization: AdamW 8-bit, learning rate `2e-5`
- LoRA Config: Rank `r=32`, `alpha=32`, target modules = all linear layers
- Data Strategy: Trained on responses only (`train_on_responses_only`) to strictly model assistant behavior.
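The configuration above can be sketched roughly as follows. This is not the actual training script: hyperparameter values mirror the card, but the dataset variable, the response-marker strings, and other details are assumptions.

```python
# Sketch of the SFT setup described above, using Unsloth + TRL.
from unsloth import FastLanguageModel
from unsloth.chat_templates import train_on_responses_only
from trl import SFTConfig, SFTTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    "Qwen/Qwen3-8B", max_seq_length=16384,
)
model = FastLanguageModel.get_peft_model(
    model, r=32, lora_alpha=32,
    # "All linear layers" of the transformer blocks:
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,  # assumed: pre-formatted chat dataset
    args=SFTConfig(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,  # global batch size 32
        learning_rate=2e-5,
        optim="adamw_8bit",
    ),
)
# Mask loss on everything except assistant responses
# (marker strings below assume the Qwen chat template).
trainer = train_on_responses_only(
    trainer,
    instruction_part="<|im_start|>user\n",
    response_part="<|im_start|>assistant\n",
)
trainer.train()
```

After training, the LoRA adapter would be merged back into 16-bit weights for release, as noted in the overview.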
## 4. Key Features Summary
| Feature | Description |
|---|---|
| Thinking Process | Embeds CoT reasoning in `<think>` tags for explainable outputs. |
| Distilled Intelligence | Inherits reasoning patterns from 120B+ parameter teacher models. |
| Efficient 8B Size | High performance with low VRAM usage (optimized via Unsloth). |
| Long Context | 16,384 token context window for extensive document processing. |
## 5. Acknowledgements
We thank:
- The Unsloth Team for their efficient fine-tuning library that made this training possible.
- Qwen Team for the powerful Qwen3-8B base model.
- Jackrong for the curation of the distillation datasets (ShareGPT-OSS, Natural-Reasoning, etc.).
This project is an open research effort aimed at democratizing high-level reasoning capabilities in smaller, accessible language models.