---
title: ZeroGPU-LLM-Inference
emoji: 🧠
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: apache-2.0
short_description: Streaming LLM chat with web search and controls
---
# 🧠 ZeroGPU LLM Inference

A modern, user-friendly Gradio interface for **token-streaming, chat-style inference** across the supported Transformer models, powered by ZeroGPU for free GPU acceleration on Hugging Face Spaces.
## ✨ Key Features

### 🎨 Modern UI/UX

- **Clean, intuitive interface** with organized layout and visual hierarchy
- **Collapsible advanced settings** for both simple and power users
- **Smooth animations and transitions** for better user experience
- **Responsive design** that works on all screen sizes
- **Copy-to-clipboard** functionality for easy sharing of responses
### 🔍 Web Search Integration

- **Real-time DuckDuckGo search** with background threading
- **Configurable timeout** and result limits
- **Automatic context injection** into system prompts
- **Smart toggle** - search settings auto-hide when disabled
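The background-threading pattern above can be sketched with the standard library alone. This is a minimal sketch, not the app's actual implementation; `fetch_results` is a hypothetical stand-in for the real DuckDuckGo call in `app.py`:

```python
import threading
import queue

def fetch_results(query, max_results=4):
    """Hypothetical stand-in for the real DuckDuckGo search call."""
    return [f"snippet {i} for {query!r}" for i in range(max_results)]

def search_with_timeout(query, timeout=5.0, max_results=4):
    """Run the search in a background thread; give up after `timeout` seconds."""
    out = queue.Queue()
    worker = threading.Thread(
        target=lambda: out.put(fetch_results(query, max_results)),
        daemon=True,  # don't block interpreter shutdown if the search hangs
    )
    worker.start()
    try:
        return out.get(timeout=timeout)  # results arrived in time
    except queue.Empty:
        return []  # timed out: proceed with no search context

print(search_with_timeout("zerogpu", timeout=2.0))
```

Because the worker is a daemon thread, a hung search never blocks the UI; the caller simply proceeds without context once the timeout elapses.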
### 💡 Smart Features

- **Thought vs. Answer streaming**: `<think>…</think>` blocks shown separately as "💭 Thought"
- **Working cancel button** - immediately stops generation without errors
- **Debug panel** for prompt engineering insights
- **Duration estimates** based on model size and settings
- **Example prompts** to help users get started
- **Dynamic system prompts** with automatic date insertion
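The thought/answer separation can be illustrated with a small parser. This is a batch-mode sketch, assuming the `<think>…</think>` convention named above; the app's streaming version would track tag state token by token instead:

```python
import re

# Non-greedy match so multiple <think> blocks are captured separately.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_thought_and_answer(text):
    """Separate <think>...</think> blocks from the visible answer text."""
    thoughts = [m.strip() for m in THINK_RE.findall(text)]
    answer = THINK_RE.sub("", text).strip()
    return thoughts, answer

raw = "<think>User wants a haiku.</think>Code flows like water."
thoughts, answer = split_thought_and_answer(raw)
print(thoughts)  # ['User wants a haiku.']
print(answer)    # 'Code flows like water.'
```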
### 🎯 Model Variety

- Purpose-built around **CourseGPT-Pro router checkpoints**
- Two curated options: **Router-Qwen3-32B (8-bit)** and **Router-Gemma3-27B (8-bit)**
- Both ship with the same JSON routing schema for math/code/general orchestration
- **Efficient model loading** - one at a time with automatic cache clearing
### ⚙️ Advanced Controls

- **Generation parameters**: max tokens, temperature, top-k, top-p, repetition penalty
- **Web search settings**: max results, chars per result, timeout
- **Custom system prompts** with dynamic date insertion
- **Organized in collapsible sections** to keep interface clean
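As one illustration of dynamic date insertion, a minimal renderer might look like this. The `{date}` placeholder name is an assumption for the sketch; `app.py` may use a different token:

```python
from datetime import date

def render_system_prompt(template):
    """Substitute today's date into a system-prompt template.

    The `{date}` placeholder is hypothetical, chosen for this sketch.
    """
    return template.format(date=date.today().strftime("%Y-%m-%d"))

prompt = render_system_prompt("You are a helpful assistant. Today is {date}.")
print(prompt)
```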
## 📋 Supported Models

- **Router-Qwen3-32B-8bit** - Qwen3 32B base with CourseGPT-Pro routing LoRA merged and quantized for ZeroGPU. Best overall accuracy with modest latency.
- **Router-Gemma3-27B-8bit** - Gemma3 27B base with the same router head, also in 8-bit. Slightly faster warm-up with a Gemma inductive bias that sometimes helps math-first prompts.
## 🚀 How It Works

1. **Select Model** - Choose between the two curated router checkpoints
2. **Configure Settings** - Adjust generation parameters or use defaults
3. **Enable Web Search** (optional) - Get real-time information
4. **Start Chatting** - Type your message or use example prompts
5. **Stream Response** - Watch as tokens are generated in real-time
6. **Cancel Anytime** - Stop generation mid-stream if needed
### Technical Flow

1. User message enters chat history
2. If search enabled, background thread fetches DuckDuckGo results
3. Search snippets merge into system prompt (within timeout limit)
4. Selected model pipeline loads on ZeroGPU (bf16 → f16 → f32 fallback)
5. Prompt formatted with thinking mode detection
6. Tokens stream to UI with thought/answer separation
7. Cancel button available for immediate interruption
8. Memory cleared after generation for next request
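The dtype fallback in step 4 can be sketched as a generic try-in-order loop. `loader` here is any callable taking a dtype name; in the real app it would wrap the `transformers` pipeline construction, and `demo_loader` is a made-up example of a GPU that rejects bf16:

```python
def load_with_dtype_fallback(loader, dtypes=("bfloat16", "float16", "float32")):
    """Try each dtype in order; return the first pipeline that loads."""
    last_err = None
    for dtype in dtypes:
        try:
            return loader(dtype), dtype
        except (ValueError, RuntimeError) as err:  # e.g. bf16 unsupported here
            last_err = err
    raise RuntimeError("no supported dtype") from last_err

# Hypothetical loader standing in for transformers.pipeline(...):
def demo_loader(dtype):
    if dtype == "bfloat16":
        raise ValueError("bf16 not supported on this device")
    return f"pipeline[{dtype}]"

print(load_with_dtype_fallback(demo_loader))  # ('pipeline[float16]', 'float16')
```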
## ⚙️ Generation Parameters

| Parameter | Range | Default | Description |
|-----------|-------|---------|-------------|
| Max Tokens | 64-16384 | 1024 | Maximum response length |
| Temperature | 0.1-2.0 | 0.7 | Creativity vs focus |
| Top-K | 1-100 | 40 | Token sampling pool size |
| Top-P | 0.1-1.0 | 0.9 | Nucleus sampling threshold |
| Repetition Penalty | 1.0-2.0 | 1.2 | Reduce repetition |
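To make Top-K and Top-P concrete, here is a pure-Python sketch of the combined filter: keep the `top_k` highest-probability tokens, then keep only the smallest prefix whose cumulative probability reaches `top_p`. This illustrates the sampling semantics, not the app's actual code path:

```python
import math

def top_k_top_p_filter(logits, top_k=40, top_p=0.9):
    """Return the token indices that survive top-k then top-p (nucleus) filtering."""
    # Softmax over the raw logits.
    exps = [math.exp(l) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Top-k: the k most probable token indices, best first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Top-p: smallest prefix of that ranking with cumulative mass >= top_p.
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return kept

print(top_k_top_p_filter([3.0, 1.0, 0.2, 0.1], top_k=3, top_p=0.9))  # [0, 1]
```

With the defaults above (Top-K 40, Top-P 0.9), sampling is restricted to a small, high-probability nucleus, which is why responses stay coherent at Temperature 0.7.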
## 🔍 Web Search Settings

| Setting | Range | Default | Description |
|---------|-------|---------|-------------|
| Max Results | Integer | 4 | Number of search results |
| Max Chars/Result | Integer | 50 | Character limit per result |
| Search Timeout | 0-30s | 5s | Maximum wait time |
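The first two settings together bound how much search text reaches the model. A minimal sketch of that truncate-and-merge step, assuming the exact header wording is up to `app.py`:

```python
def build_search_context(results, max_results=4, max_chars=50):
    """Trim each snippet to `max_chars` and merge into one system-prompt block."""
    lines = [r[:max_chars] for r in results[:max_results]]
    return "Web search results:\n" + "\n".join(f"- {line}" for line in lines)

snippets = ["alpha " * 20, "beta", "gamma", "delta", "omega"]
print(build_search_context(snippets, max_results=4, max_chars=10))
```

Bounding the context this way keeps prompt length (and therefore latency) predictable regardless of how verbose the search results are.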
## 💻 Local Development

```bash
# Clone the repository
git clone https://huggingface.co/spaces/Alovestocode/ZeroGPU-LLM-Inference
cd ZeroGPU-LLM-Inference

# Install dependencies
pip install -r requirements.txt

# Run the app
python app.py
```
## 🎨 UI Design Philosophy

The interface follows these principles:

1. **Simplicity First** - Core features immediately visible
2. **Progressive Disclosure** - Advanced options hidden but accessible
3. **Visual Hierarchy** - Clear organization with groups and sections
4. **Feedback** - Status indicators and helpful messages
5. **Accessibility** - Responsive, keyboard-friendly, with tooltips
## 🔧 Customization

### Adding New Models

Edit the `MODELS` dictionary in `app.py`:

```python
"Your-Model-Name": {
    "repo_id": "org/model-name",
    "description": "Model description",
    "params_b": 7.0,  # Size in billions
}
```
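The `params_b` field can also feed the duration estimates mentioned under Smart Features. A plausible sketch, assuming throughput falls roughly in proportion to model size; the constant is a made-up placeholder, not a measured figure:

```python
def estimate_duration(params_b, max_tokens, tokens_per_sec_per_b=60.0):
    """Rough ETA in seconds, assuming throughput ~ 1 / model size.

    `tokens_per_sec_per_b` is an illustrative placeholder constant.
    """
    tokens_per_sec = tokens_per_sec_per_b / max(params_b, 1.0)
    return max_tokens / tokens_per_sec

print(round(estimate_duration(7.0, 1024), 1))  # seconds for a 7B model
```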
### Modifying UI Theme

Adjust theme parameters in `gr.Blocks()`:

```python
theme=gr.themes.Soft(
    primary_hue="indigo",
    secondary_hue="purple",
    # ... more options
)
```
## 📊 Performance

- **Token streaming** for responsive feel
- **Background search** doesn't block the UI
- **Efficient memory management** with cache clearing
- **ZeroGPU acceleration** for fast inference
- **Optimized loading** with dtype fallbacks
## 🤝 Contributing

Contributions welcome! Areas for improvement:

- Additional model integrations
- UI/UX enhancements
- Performance optimizations
- Bug fixes and testing
- Documentation improvements
## 📄 License

Apache 2.0 - see the LICENSE file for details.
## 🙏 Acknowledgments

- Built with [Gradio](https://gradio.app)
- Powered by [Hugging Face Transformers](https://huggingface.co/transformers)
- Uses [ZeroGPU](https://huggingface.co/zero-gpu-explorers) for acceleration
- Search via [DuckDuckGo](https://duckduckgo.com)
---

**Made with ❤️ for the open source community**