GPT-Distill-Qwen3-8B-Thinking


1. Model Overview

  • Model Name: Jackrong/GPT-Distill-Qwen3-8B-Thinking
  • Model Type: Instruction-tuned & Reasoning-enhanced LLM
  • Base Model: Qwen/Qwen3-8B
  • Parameters: ~8B
  • Context Length: Up to 16K tokens (max_seq_length = 16384)
  • Supported Languages: Chinese and English, including mixed-language inputs and outputs
  • Training Method:
    • Supervised Fine-Tuning (SFT) using Unsloth (LoRA adapter merged to 16-bit weights)
    • Knowledge Distillation from large-scale reasoning models (120B/235B class)
    • Thinking/CoT Integration: Explicitly trained to generate internal reasoning chains wrapped in <think>...</think> tags.

Description

This model is a specialized fine-tune of Qwen3-8B, designed for complex reasoning and instruction following. It supports a 16K context window and was distilled from larger reasoning-focused teacher models (GPT-OSS-120B and Qwen3-235B).

A key feature is its "Thinking" capability: the model is trained to output a Chain-of-Thought (CoT) process before providing the final answer, significantly improving performance on math, logic, and scientific tasks.
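
A minimal inference sketch with the Hugging Face transformers library is shown below; the prompt wording and sampling settings are illustrative assumptions rather than recommended defaults.

```python
# Minimal inference sketch; requires transformers, accelerate, and a recent PyTorch build.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Jackrong/GPT-Distill-Qwen3-8B-Thinking"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "user",
     "content": "A train travels 120 km in 1.5 hours. What is its average speed? Think step-by-step."}
]

# Render the chat template and generate; the completion is expected to contain a
# <think>...</think> block followed by the final answer.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=1024, do_sample=True, temperature=0.6, top_p=0.95)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```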


2. Intended Use Cases

✅ Recommended:

  • Complex Reasoning: Math problems, logical puzzles, and scientific derivations using the <think> mechanism.
  • Long-Context Tasks: Processing documents or conversations up to 16k tokens.
  • Instruction Following: High adherence to complex user constraints.
  • Chinese/English NLP: Fluent generation in both languages, including cultural nuances.
  • Knowledge Distillation: Acting as a lightweight student model that mimics the reasoning patterns of 100B+ parameter models.

⚠️ Not Suitable For:

  • High-risk decision-making: Medical diagnosis or legal advice without professional oversight.
  • Real-time Factual Updates: The model's knowledge is static and based on the training data cutoff.

Note: The model often emits <think> tags spontaneously; to trigger its reasoning reliably, prompt it to "think step-by-step".
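
In post-processing, the reasoning block can be separated from the final answer. The helper below is a hedged sketch based on the <think>...</think> convention described in this card; it is not part of the model's tooling.

```python
import re

def split_thinking(completion: str) -> tuple[str, str]:
    """Split a completion into (reasoning, answer) using the <think>...</think> tags.

    If no tags are found, the reasoning part is returned as an empty string.
    """
    match = re.search(r"<think>(.*?)</think>", completion, flags=re.DOTALL)
    if match is None:
        return "", completion.strip()
    return match.group(1).strip(), completion[match.end():].strip()

reasoning, answer = split_thinking("<think>120 km / 1.5 h = 80 km/h</think> The average speed is 80 km/h.")
print(reasoning)  # 120 km / 1.5 h = 80 km/h
print(answer)     # The average speed is 80 km/h.
```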


3. Training Data & Distillation Process

The model was trained on a curated mix of approximately 88,000 examples, filtered for length and quality.

Key Datasets

(1) Reasoning & Thinking (CoT)

  • Sources: Jackrong/Natural-Reasoning-gpt-oss-120B-S1 & Jackrong/Chinese-Qwen3-235B-Thinking-2507-Distill-100k
  • Format: Input -> <think> Reasoning Chain </think> -> Answer
  • Purpose: Distills the "thinking process" of massive models (120B/235B) into the 8B architecture, teaching the model how to solve problems rather than only the final answer (a sample record is sketched below).
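
For illustration only, a single record in this Input -> <think> -> Answer format might look roughly like the sketch below; the field names and wording are hypothetical, so refer to the dataset cards for the exact schema.

```python
# Hypothetical distillation record in Input -> <think> -> Answer form.
# Field names and content are illustrative; the actual dataset schema may differ.
record = {
    "instruction": "What is the sum of the first 10 positive integers?",
    "output": (
        "<think>Apply the formula n(n + 1) / 2 with n = 10: 10 * 11 / 2 = 55.</think>\n"
        "The sum of the first 10 positive integers is 55."
    ),
}
```

During SFT, such records are rendered through the chat template so that the <think> block and the final answer both appear inside the assistant turn.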

(2) ShareGPT & Conversation

  • Source: Jackrong/ShareGPT-gpt-oss-120B-reasoning
  • Purpose: Ensures natural multi-turn conversational flow and user intent understanding.

(3) Instruction Following

  • Source: Jackrong/gpt-oss-120b-Reasoning-Instruction
  • Purpose: Enhances the ability to follow specific formatting and constraint instructions.

Training Configuration

  • Framework: Unsloth + TRL (SFTTrainer)
  • Hardware: NVIDIA H100 80GB
  • Batch Size: Effective batch size of 32 (per_device_train_batch_size=4, gradient_accumulation_steps=8)
  • Optimization: AdamW 8-bit, Learning Rate 2e-5
  • LoRA Config: Rank r=32, Alpha 32, target modules = all linear layers
  • Data Strategy: Trained on responses only (train_on_responses_only) to strictly model assistant behavior; see the configuration sketch below.
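
A condensed sketch of this setup with Unsloth and TRL is given below. Dataset mixing and chat-template preprocessing are omitted, and several details (4-bit loading, epoch count, exact argument names across TRL versions) are assumptions rather than confirmed settings.

```python
# Sketch of the SFT setup described above (Unsloth + TRL). Argument names follow
# common Unsloth/TRL usage and may need adjusting for your installed versions.
from unsloth import FastLanguageModel
from unsloth.chat_templates import train_on_responses_only
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-8B",
    max_seq_length=16384,     # 16K training context
    load_in_4bit=True,        # assumption: 4-bit loading for memory efficiency
)

model = FastLanguageModel.get_peft_model(
    model,
    r=32,                     # LoRA rank
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # all linear layers
    lora_dropout=0.0,
)

# Assumption: records have already been rendered into a single "text" field via the
# Qwen chat template; that preprocessing (and mixing of the three datasets) is omitted.
train_dataset = load_dataset("Jackrong/Natural-Reasoning-gpt-oss-120B-S1", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    dataset_text_field="text",
    max_seq_length=16384,
    args=TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,   # effective batch size 32
        learning_rate=2e-5,
        optim="adamw_8bit",
        bf16=True,
        num_train_epochs=1,              # assumption: actual epoch count not stated
        logging_steps=10,
        output_dir="outputs",
    ),
)

# Compute the loss on assistant responses only, as noted in the Data Strategy above.
trainer = train_on_responses_only(
    trainer,
    instruction_part="<|im_start|>user\n",
    response_part="<|im_start|>assistant\n",
)
trainer.train()

# Merge the LoRA adapter back into 16-bit weights for release (Unsloth helper).
model.save_pretrained_merged("GPT-Distill-Qwen3-8B-Thinking", tokenizer, save_method="merged_16bit")
```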

4. Key Features Summary

  • Thinking Process: Embeds CoT reasoning in <think> tags for explainable outputs.
  • Distilled Intelligence: Inherits reasoning patterns from 120B+ parameter teacher models.
  • Efficient 8B Size: High performance with low VRAM usage (optimized via Unsloth).
  • Long Context: 16,384-token context window for extensive document processing.

5. Acknowledgements

We thank:

  • The Unsloth Team for their efficient fine-tuning library that made this training possible.
  • Qwen Team for the powerful Qwen3-8B base model.
  • Jackrong for the curation of the distillation datasets (ShareGPT-OSS, Natural-Reasoning, etc.).

This project is an open research effort aimed at democratizing high-level reasoning capabilities in smaller, accessible language models.
