GPT-Distill-Qwen3-8B-Thinking


1. Model Overview

  • Model Name: Jackrong/GPT-Distill-Qwen3-8B-Thinking
  • Model Type: Instruction-tuned & Reasoning-enhanced LLM
  • Base Model: Qwen/Qwen3-8B
  • Parameters: ~8B
  • Context Length: Up to 16K tokens (max_seq_length = 16384)
  • Supported Languages: Chinese and English, including mixed-language inputs and outputs
  • Training Method:
    • Supervised Fine-Tuning (SFT) using Unsloth (LoRA adapter merged to 16-bit weights)
    • Knowledge Distillation from large-scale reasoning models (120B/235B class)
    • Thinking/CoT Integration: Explicitly trained to generate internal reasoning chains wrapped in <think>...</think> tags.

Description

This model is a specialized fine-tune of Qwen3-8B, designed for complex reasoning and instruction following. It supports a 16K context window and was distilled from larger reasoning-focused teacher models (GPT-OSS-120B and Qwen3-235B).

A key feature is its "Thinking" capability: the model is trained to output a Chain-of-Thought (CoT) process before providing the final answer, significantly improving performance on math, logic, and scientific tasks.
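
A minimal inference sketch with the Hugging Face transformers library is shown below; the prompt wording and sampling settings are illustrative assumptions rather than recommended defaults.

```python
# Minimal inference sketch; requires transformers, accelerate, and a recent PyTorch build.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Jackrong/GPT-Distill-Qwen3-8B-Thinking"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "user",
     "content": "A train travels 120 km in 1.5 hours. What is its average speed? Think step-by-step."}
]

# Render the chat template and generate; the completion is expected to contain a
# <think>...</think> block followed by the final answer.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=1024, do_sample=True, temperature=0.6, top_p=0.95)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```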


2. Intended Use Cases

✅ Recommended:

  • Complex Reasoning: Math problems, logical puzzles, and scientific derivations using the <think> mechanism.
  • Long-Context Tasks: Processing documents or conversations up to 16k tokens.
  • Instruction Following: High adherence to complex user constraints.
  • Chinese/English NLP: Fluent generation in both languages, including cultural nuances.
  • Knowledge Distillation: Acting as a lightweight student model that mimics the reasoning patterns of 100B+ parameter models.

⚠️ Not Suitable For:

  • High-risk decision-making: Medical diagnosis or legal advice without professional oversight.
  • Real-time Factual Updates: The model's knowledge is static and based on the training data cutoff.

Note: The model often emits <think> tags spontaneously; to trigger its reasoning reliably, prompt it to "think step-by-step".
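
In post-processing, the reasoning block can be separated from the final answer. The helper below is a hedged sketch based on the <think>...</think> convention described in this card; it is not part of the model's tooling.

```python
import re

def split_thinking(completion: str) -> tuple[str, str]:
    """Split a completion into (reasoning, answer) using the <think>...</think> tags.

    If no tags are found, the reasoning part is returned as an empty string.
    """
    match = re.search(r"<think>(.*?)</think>", completion, flags=re.DOTALL)
    if match is None:
        return "", completion.strip()
    return match.group(1).strip(), completion[match.end():].strip()

reasoning, answer = split_thinking("<think>120 km / 1.5 h = 80 km/h</think> The average speed is 80 km/h.")
print(reasoning)  # 120 km / 1.5 h = 80 km/h
print(answer)     # The average speed is 80 km/h.
```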


3. Training Data & Distillation Process

The model was trained on a curated mix of approximately 88,000 examples, filtered for length and quality.

Key Datasets

(1) Reasoning & Thinking (CoT)

  • Sources: Jackrong/Natural-Reasoning-gpt-oss-120B-S1 & Jackrong/Chinese-Qwen3-235B-Thinking-2507-Distill-100k
  • Format: Input -> <think> Reasoning Chain </think> -> Answer
  • Purpose: Distills the "thinking process" of massive models (120B/235B) into the 8B architecture, teaching the model how to solve problems rather than only the final answer (a sample record is sketched below).
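
For illustration only, a single record in this Input -> <think> -> Answer format might look roughly like the sketch below; the field names and wording are hypothetical, so refer to the dataset cards for the exact schema.

```python
# Hypothetical distillation record in Input -> <think> -> Answer form.
# Field names and content are illustrative; the actual dataset schema may differ.
record = {
    "instruction": "What is the sum of the first 10 positive integers?",
    "output": (
        "<think>Apply the formula n(n + 1) / 2 with n = 10: 10 * 11 / 2 = 55.</think>\n"
        "The sum of the first 10 positive integers is 55."
    ),
}
```

During SFT, such records are rendered through the chat template so that the <think> block and the final answer both appear inside the assistant turn.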

(2) ShareGPT & Conversation

  • Source: Jackrong/ShareGPT-gpt-oss-120B-reasoning
  • Purpose: Ensures natural multi-turn conversational flow and user intent understanding.

(3) Instruction Following

  • Source: Jackrong/gpt-oss-120b-Reasoning-Instruction
  • Purpose: Enhances the ability to follow specific formatting and constraint instructions.

Training Configuration

  • Framework: Unsloth + TRL (SFTTrainer)
  • Hardware: NVIDIA H100 80GB
  • Batch Size: Effective batch size of 32 (per_device_train_batch_size=4, gradient_accumulation_steps=8)
  • Optimization: AdamW 8-bit, Learning Rate 2e-5
  • LoRA Config: Rank r=32, Alpha 32, target modules = all linear layers
  • Data Strategy: Trained on responses only (train_on_responses_only) to strictly model assistant behavior; see the configuration sketch below.
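
A condensed sketch of this setup with Unsloth and TRL is given below. Dataset mixing and chat-template preprocessing are omitted, and several details (4-bit loading, epoch count, exact argument names across TRL versions) are assumptions rather than confirmed settings.

```python
# Sketch of the SFT setup described above (Unsloth + TRL). Argument names follow
# common Unsloth/TRL usage and may need adjusting for your installed versions.
from unsloth import FastLanguageModel
from unsloth.chat_templates import train_on_responses_only
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-8B",
    max_seq_length=16384,     # 16K training context
    load_in_4bit=True,        # assumption: 4-bit loading for memory efficiency
)

model = FastLanguageModel.get_peft_model(
    model,
    r=32,                     # LoRA rank
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # all linear layers
    lora_dropout=0.0,
)

# Assumption: records have already been rendered into a single "text" field via the
# Qwen chat template; that preprocessing (and mixing of the three datasets) is omitted.
train_dataset = load_dataset("Jackrong/Natural-Reasoning-gpt-oss-120B-S1", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    dataset_text_field="text",
    max_seq_length=16384,
    args=TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,   # effective batch size 32
        learning_rate=2e-5,
        optim="adamw_8bit",
        bf16=True,
        num_train_epochs=1,              # assumption: actual epoch count not stated
        logging_steps=10,
        output_dir="outputs",
    ),
)

# Compute the loss on assistant responses only, as noted in the Data Strategy above.
trainer = train_on_responses_only(
    trainer,
    instruction_part="<|im_start|>user\n",
    response_part="<|im_start|>assistant\n",
)
trainer.train()

# Merge the LoRA adapter back into 16-bit weights for release (Unsloth helper).
model.save_pretrained_merged("GPT-Distill-Qwen3-8B-Thinking", tokenizer, save_method="merged_16bit")
```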

4. Key Features Summary

  • Thinking Process: Embeds CoT reasoning in <think> tags for explainable outputs.
  • Distilled Intelligence: Inherits reasoning patterns from 120B+ parameter teacher models.
  • Efficient 8B Size: High performance with low VRAM usage (optimized via Unsloth).
  • Long Context: 16,384-token context window for extensive document processing.

5. Acknowledgements

We thank:

  • The Unsloth Team for their efficient fine-tuning library that made this training possible.
  • Qwen Team for the powerful Qwen3-8B base model.
  • Jackrong for the curation of the distillation datasets (ShareGPT-OSS, Natural-Reasoning, etc.).

This project is an open research effort aimed at democratizing high-level reasoning capabilities in smaller, accessible language models.
