---
title: ZeroGPU-LLM-Inference
emoji: 🧠
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: apache-2.0
short_description: Streaming LLM chat with web search and controls
---

# 🧠 ZeroGPU LLM Inference

A modern, user-friendly Gradio interface for token-streaming, chat-style inference, purpose-built around the CourseGPT-Pro router checkpoints and powered by ZeroGPU for free GPU acceleration on Hugging Face Spaces.

## ✨ Key Features

### 🎨 Modern UI/UX

- Clean, intuitive interface with organized layout and visual hierarchy
- Collapsible advanced settings for both simple and power users
- Smooth animations and transitions for a better user experience
- Responsive design that works on all screen sizes
- Copy-to-clipboard functionality for easy sharing of responses

πŸ” Web Search Integration

- Real-time DuckDuckGo search running in a background thread (see the sketch below)
- Configurable timeout and result limits
- Automatic context injection into system prompts
- Smart toggle: search settings auto-hide when disabled
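
A minimal sketch of how a timeout-bounded background search can work, assuming the `duckduckgo_search` package; the helper name `fetch_snippets` and its defaults are illustrative, not the Space's actual code:

```python
# Hypothetical helper, not the app's exact implementation.
import threading
from duckduckgo_search import DDGS

def fetch_snippets(query: str, max_results: int = 4, max_chars: int = 50,
                   timeout: float = 5.0) -> list[str]:
    """Run a DuckDuckGo search in a background thread, bounded by `timeout`."""
    snippets: list[str] = []

    def worker() -> None:
        for hit in DDGS().text(query, max_results=max_results):
            snippets.append(hit["body"][:max_chars])  # trim each snippet

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)   # give up after `timeout` seconds; the UI never blocks
    return snippets        # possibly empty or partial if the search timed out
```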

### 💡 Smart Features

- Thought vs. answer streaming: `<think>…</think>` blocks shown separately as "💭 Thought" (see the parsing sketch below)
- Working cancel button that immediately stops generation without errors
- Debug panel for prompt-engineering insights
- Duration estimates based on model size and settings
- Example prompts to help users get started
- Dynamic system prompts with automatic date insertion
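
One way to split the streamed text into a thought and an answer, shown as a hedged sketch rather than the app's exact logic:

```python
# Illustrative only: separate <think>...</think> content from the final answer
# in a partially streamed response.
def split_thought(text: str) -> tuple[str, str]:
    """Return (thought, answer); either part may be empty mid-stream."""
    start = text.find("<think>")
    if start == -1:
        return "", text                          # no thinking block emitted
    end = text.find("</think>", start)
    if end == -1:                                # still inside the block
        return text[start + len("<think>"):], ""
    thought = text[start + len("<think>"):end]
    answer = text[end + len("</think>"):]
    return thought.strip(), answer.lstrip()
```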

### 🎯 Model Variety

- Purpose-built around CourseGPT-Pro router checkpoints
- Two curated options: Router-Qwen3-32B (8-bit) and Router-Gemma3-27B (8-bit)
- Both ship with the same JSON routing schema for math/code/general orchestration
- Efficient model loading: one model at a time, with automatic cache clearing (see the sketch below)
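
A hedged sketch of the one-model-at-a-time swap with cache clearing; `load_pipeline` is an assumed loader (one possible version is sketched under Technical Flow below):

```python
# Illustrative model swap: drop the old pipeline, reclaim GPU memory, load anew.
import gc
import torch

current_pipe = None

def swap_model(repo_id: str):
    global current_pipe
    current_pipe = None           # drop the only strong reference to the old model
    gc.collect()                  # collect it before allocating the new one
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # return cached CUDA memory to the allocator
    current_pipe = load_pipeline(repo_id)  # assumed loader, sketched below
    return current_pipe
```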

βš™οΈ Advanced Controls

- Generation parameters: max tokens, temperature, top-k, top-p, repetition penalty
- Web search settings: max results, chars per result, timeout
- Custom system prompts with dynamic date insertion
- Organized into collapsible sections to keep the interface clean

## 🔄 Supported Models

- Router-Qwen3-32B-8bit – Qwen3 32B base with the CourseGPT-Pro routing LoRA merged and quantized for ZeroGPU. Best overall accuracy with modest latency.
- Router-Gemma3-27B-8bit – Gemma3 27B base with the same router head, also in 8-bit. Slightly faster warm-up, with a Gemma inductive bias that sometimes helps math-first prompts.

## 🚀 How It Works

  1. Select Model - Choose one of the two curated router checkpoints
  2. Configure Settings - Adjust generation parameters or use defaults
  3. Enable Web Search (optional) - Get real-time information
  4. Start Chatting - Type your message or use example prompts
  5. Stream Response - Watch as tokens are generated in real-time
  6. Cancel Anytime - Stop generation mid-stream if needed

### Technical Flow

  1. User message enters chat history
  2. If search enabled, background thread fetches DuckDuckGo results
  3. Search snippets merge into system prompt (within timeout limit)
  4. Selected model pipeline loads on ZeroGPU (bf16 → f16 → f32 fallback; see the loading sketch below)
  5. Prompt formatted with thinking mode detection
  6. Tokens stream to UI with thought/answer separation
  7. Cancel button available for immediate interruption
  8. Memory cleared after generation for next request
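
A minimal sketch of the dtype fallback at load time, assuming a standard `transformers` text-generation pipeline; the function name and error handling are illustrative:

```python
# Illustrative loader: try bf16, then f16, then f32 until one succeeds.
import torch
from transformers import pipeline

def load_pipeline(repo_id: str):
    for dtype in (torch.bfloat16, torch.float16, torch.float32):
        try:
            return pipeline(
                "text-generation",
                model=repo_id,
                torch_dtype=dtype,    # unsupported dtypes fail here on some GPUs
                device_map="auto",
            )
        except (ValueError, RuntimeError):
            continue                  # fall back to the next, wider dtype
    raise RuntimeError(f"Could not load {repo_id} with any supported dtype")
```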

βš™οΈ Generation Parameters

| Parameter          | Range    | Default | Description                |
|--------------------|----------|---------|----------------------------|
| Max Tokens         | 64-16384 | 1024    | Maximum response length    |
| Temperature        | 0.1-2.0  | 0.7     | Creativity vs. focus       |
| Top-K              | 1-100    | 40      | Token sampling pool size   |
| Top-P              | 0.1-1.0  | 0.9     | Nucleus sampling threshold |
| Repetition Penalty | 1.0-2.0  | 1.2     | Reduce repetition          |
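
These sliders map directly onto Hugging Face generation kwargs; a hedged example call, where `pipe` is a text-generation pipeline like the one sketched above:

```python
# Illustrative: the table's defaults passed as generation kwargs.
prompt = "Explain nucleus sampling in one sentence."
output = pipe(
    prompt,
    max_new_tokens=1024,       # Max Tokens
    do_sample=True,
    temperature=0.7,           # Temperature
    top_k=40,                  # Top-K
    top_p=0.9,                 # Top-P
    repetition_penalty=1.2,    # Repetition Penalty
)[0]["generated_text"]
```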

## 🌐 Web Search Settings

| Setting          | Range   | Default | Description                |
|------------------|---------|---------|----------------------------|
| Max Results      | integer | 4       | Number of search results   |
| Max Chars/Result | integer | 50      | Character limit per result |
| Search Timeout   | 0-30 s  | 5 s     | Maximum wait time          |
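
Applied to the earlier `fetch_snippets` sketch, the defaults look like this:

```python
# The table's defaults fed into the illustrative helper from above.
snippets = fetch_snippets(
    "latest Gradio release",   # example query
    max_results=4,             # Max Results
    max_chars=50,              # Max Chars/Result
    timeout=5.0,               # Search Timeout, in seconds
)
```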

## 💻 Local Development

```bash
# Clone the repository
git clone https://huggingface.co/spaces/Alovestocode/ZeroGPU-LLM-Inference
cd ZeroGPU-LLM-Inference

# Install dependencies
pip install -r requirements.txt

# Run the app
python app.py
```

## 🎨 UI Design Philosophy

The interface follows these principles:

  1. Simplicity First - Core features immediately visible
  2. Progressive Disclosure - Advanced options hidden but accessible
  3. Visual Hierarchy - Clear organization with groups and sections
  4. Feedback - Status indicators and helpful messages
  5. Accessibility - Responsive, keyboard-friendly, with tooltips

## 🔧 Customization

### Adding New Models

Edit the `MODELS` dictionary in `app.py`:

"Your-Model-Name": {
    "repo_id": "org/model-name",
    "description": "Model description",
    "params_b": 7.0  # Size in billions
}
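
The `params_b` value is presumably what drives the duration estimates mentioned under Smart Features, so an approximate size is fine.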

### Modifying UI Theme

Adjust the theme parameters in `gr.Blocks()`:

```python
theme=gr.themes.Soft(
    primary_hue="indigo",
    secondary_hue="purple",
    # ... more options
)
```

## 📊 Performance

- Token streaming for a responsive feel
- Background search doesn't block the UI
- Efficient memory management with cache clearing
- ZeroGPU acceleration for fast inference
- Optimized loading with dtype fallbacks

## 🤝 Contributing

Contributions welcome! Areas for improvement:

- Additional model integrations
- UI/UX enhancements
- Performance optimizations
- Bug fixes and testing
- Documentation improvements

πŸ“ License

Apache 2.0 - see the `LICENSE` file for details

πŸ™ Acknowledgments


Made with ❤️ for the open source community