---
title: ZeroGPU-LLM-Inference
emoji: 🧠
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: apache-2.0
short_description: Streaming LLM chat with web search and controls
---

# 🧠 ZeroGPU LLM Inference

A modern, user-friendly Gradio interface for token-streaming, chat-style inference, purpose-built around the CourseGPT-Pro router checkpoints and powered by ZeroGPU for free GPU acceleration on Hugging Face Spaces.

## ✨ Key Features

### 🎨 Modern UI/UX

- Clean, intuitive interface with organized layout and visual hierarchy
- Collapsible advanced settings for both simple and power users
- Smooth animations and transitions for a better user experience
- Responsive design that works on all screen sizes
- Copy-to-clipboard functionality for easy sharing of responses

πŸ” Web Search Integration

- Real-time DuckDuckGo search running in a background thread (see the sketch below)
- Configurable timeout and result limits
- Automatic context injection into system prompts
- Smart toggle: search settings auto-hide when disabled
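
A minimal sketch of how a timeout-bounded background search can work, assuming the `duckduckgo_search` package; the helper name `fetch_snippets` and its defaults are illustrative, not the Space's actual code:

```python
# Hypothetical helper, not the app's exact implementation.
import threading
from duckduckgo_search import DDGS

def fetch_snippets(query: str, max_results: int = 4, max_chars: int = 50,
                   timeout: float = 5.0) -> list[str]:
    """Run a DuckDuckGo search in a background thread, bounded by `timeout`."""
    snippets: list[str] = []

    def worker() -> None:
        for hit in DDGS().text(query, max_results=max_results):
            snippets.append(hit["body"][:max_chars])  # trim each snippet

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)   # give up after `timeout` seconds; the UI never blocks
    return snippets        # possibly empty or partial if the search timed out
```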

### 💡 Smart Features

- Thought vs. answer streaming: `<think>…</think>` blocks shown separately as "💭 Thought" (see the parsing sketch below)
- Working cancel button that immediately stops generation without errors
- Debug panel for prompt-engineering insights
- Duration estimates based on model size and settings
- Example prompts to help users get started
- Dynamic system prompts with automatic date insertion
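
One way to split the streamed text into a thought and an answer, shown as a hedged sketch rather than the app's exact logic:

```python
# Illustrative only: separate <think>...</think> content from the final answer
# in a partially streamed response.
def split_thought(text: str) -> tuple[str, str]:
    """Return (thought, answer); either part may be empty mid-stream."""
    start = text.find("<think>")
    if start == -1:
        return "", text                          # no thinking block emitted
    end = text.find("</think>", start)
    if end == -1:                                # still inside the block
        return text[start + len("<think>"):], ""
    thought = text[start + len("<think>"):end]
    answer = text[end + len("</think>"):]
    return thought.strip(), answer.lstrip()
```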

### 🎯 Model Variety

- Purpose-built around CourseGPT-Pro router checkpoints
- Two curated options: Router-Qwen3-32B (8-bit) and Router-Gemma3-27B (8-bit)
- Both ship with the same JSON routing schema for math/code/general orchestration
- Efficient model loading: one model at a time, with automatic cache clearing (see the sketch below)
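
A hedged sketch of the one-model-at-a-time swap with cache clearing; `load_pipeline` is an assumed loader (one possible version is sketched under Technical Flow below):

```python
# Illustrative model swap: drop the old pipeline, reclaim GPU memory, load anew.
import gc
import torch

current_pipe = None

def swap_model(repo_id: str):
    global current_pipe
    current_pipe = None           # drop the only strong reference to the old model
    gc.collect()                  # collect it before allocating the new one
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # return cached CUDA memory to the allocator
    current_pipe = load_pipeline(repo_id)  # assumed loader, sketched below
    return current_pipe
```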

βš™οΈ Advanced Controls

- Generation parameters: max tokens, temperature, top-k, top-p, repetition penalty
- Web search settings: max results, chars per result, timeout
- Custom system prompts with dynamic date insertion
- Organized into collapsible sections to keep the interface clean

## 🔄 Supported Models

- Router-Qwen3-32B-8bit – Qwen3 32B base with the CourseGPT-Pro routing LoRA merged and quantized for ZeroGPU. Best overall accuracy with modest latency.
- Router-Gemma3-27B-8bit – Gemma3 27B base with the same router head, also in 8-bit. Slightly faster warm-up, with a Gemma inductive bias that sometimes helps math-first prompts.

## 🚀 How It Works

  1. Select Model - Choose one of the two curated router checkpoints
  2. Configure Settings - Adjust generation parameters or use defaults
  3. Enable Web Search (optional) - Get real-time information
  4. Start Chatting - Type your message or use example prompts
  5. Stream Response - Watch as tokens are generated in real-time
  6. Cancel Anytime - Stop generation mid-stream if needed

### Technical Flow

  1. User message enters chat history
  2. If search enabled, background thread fetches DuckDuckGo results
  3. Search snippets merge into system prompt (within timeout limit)
  4. Selected model pipeline loads on ZeroGPU (bf16 → f16 → f32 fallback; see the loading sketch below)
  5. Prompt formatted with thinking mode detection
  6. Tokens stream to UI with thought/answer separation
  7. Cancel button available for immediate interruption
  8. Memory cleared after generation for next request
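
A minimal sketch of the dtype fallback at load time, assuming a standard `transformers` text-generation pipeline; the function name and error handling are illustrative:

```python
# Illustrative loader: try bf16, then f16, then f32 until one succeeds.
import torch
from transformers import pipeline

def load_pipeline(repo_id: str):
    for dtype in (torch.bfloat16, torch.float16, torch.float32):
        try:
            return pipeline(
                "text-generation",
                model=repo_id,
                torch_dtype=dtype,    # unsupported dtypes fail here on some GPUs
                device_map="auto",
            )
        except (ValueError, RuntimeError):
            continue                  # fall back to the next, wider dtype
    raise RuntimeError(f"Could not load {repo_id} with any supported dtype")
```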

βš™οΈ Generation Parameters

| Parameter          | Range    | Default | Description                |
|--------------------|----------|---------|----------------------------|
| Max Tokens         | 64-16384 | 1024    | Maximum response length    |
| Temperature        | 0.1-2.0  | 0.7     | Creativity vs. focus       |
| Top-K              | 1-100    | 40      | Token sampling pool size   |
| Top-P              | 0.1-1.0  | 0.9     | Nucleus sampling threshold |
| Repetition Penalty | 1.0-2.0  | 1.2     | Reduce repetition          |
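
These sliders map directly onto Hugging Face generation kwargs; a hedged example call, where `pipe` is a text-generation pipeline like the one sketched above:

```python
# Illustrative: the table's defaults passed as generation kwargs.
prompt = "Explain nucleus sampling in one sentence."
output = pipe(
    prompt,
    max_new_tokens=1024,       # Max Tokens
    do_sample=True,
    temperature=0.7,           # Temperature
    top_k=40,                  # Top-K
    top_p=0.9,                 # Top-P
    repetition_penalty=1.2,    # Repetition Penalty
)[0]["generated_text"]
```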

## 🌐 Web Search Settings

| Setting          | Range   | Default | Description                |
|------------------|---------|---------|----------------------------|
| Max Results      | integer | 4       | Number of search results   |
| Max Chars/Result | integer | 50      | Character limit per result |
| Search Timeout   | 0-30 s  | 5 s     | Maximum wait time          |
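
Applied to the earlier `fetch_snippets` sketch, the defaults look like this:

```python
# The table's defaults fed into the illustrative helper from above.
snippets = fetch_snippets(
    "latest Gradio release",   # example query
    max_results=4,             # Max Results
    max_chars=50,              # Max Chars/Result
    timeout=5.0,               # Search Timeout, in seconds
)
```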

## 💻 Local Development

```bash
# Clone the repository
git clone https://huggingface.co/spaces/Alovestocode/ZeroGPU-LLM-Inference
cd ZeroGPU-LLM-Inference

# Install dependencies
pip install -r requirements.txt

# Run the app
python app.py
```

## 🎨 UI Design Philosophy

The interface follows these principles:

  1. Simplicity First - Core features immediately visible
  2. Progressive Disclosure - Advanced options hidden but accessible
  3. Visual Hierarchy - Clear organization with groups and sections
  4. Feedback - Status indicators and helpful messages
  5. Accessibility - Responsive, keyboard-friendly, with tooltips

## 🔧 Customization

### Adding New Models

Edit the `MODELS` dictionary in `app.py`:

"Your-Model-Name": {
    "repo_id": "org/model-name",
    "description": "Model description",
    "params_b": 7.0  # Size in billions
}
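
The `params_b` value is presumably what drives the duration estimates mentioned under Smart Features, so an approximate size is fine.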

### Modifying UI Theme

Adjust the theme parameters in `gr.Blocks()`:

```python
theme=gr.themes.Soft(
    primary_hue="indigo",
    secondary_hue="purple",
    # ... more options
)
```

## 📊 Performance

- Token streaming for a responsive feel
- Background search doesn't block the UI
- Efficient memory management with cache clearing
- ZeroGPU acceleration for fast inference
- Optimized loading with dtype fallbacks

## 🤝 Contributing

Contributions welcome! Areas for improvement:

- Additional model integrations
- UI/UX enhancements
- Performance optimizations
- Bug fixes and testing
- Documentation improvements

πŸ“ License

Apache 2.0 - see the `LICENSE` file for details

πŸ™ Acknowledgments


Made with ❤️ for the open source community