
---
title: Router Control Room (ZeroGPU)
emoji: 🛰️
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
short_description: ZeroGPU UI for CourseGPT-Pro router checkpoints
---

πŸ›°οΈ Router Control Room β€” ZeroGPU

This Space exposes the CourseGPT-Pro router checkpoints (Gemma3 27B + Qwen3 32B) with an opinionated Gradio UI. It runs entirely on ZeroGPU hardware using AWQ 4-bit quantization and FlashAttention-2 for optimized inference, with fallback to 8-bit BitsAndBytes if AWQ is unavailable.

✨ What's Included

  • Router-specific prompt builder – inject difficulty, tags, context, acceptance criteria, and additional guidance into the canonical router system prompt.
  • Two curated checkpoints – Router-Qwen3-32B-AWQ and Router-Gemma3-27B-AWQ, both merged and optimized with AWQ quantization and FlashAttention-2.
  • JSON extraction + validation – output is parsed automatically and checked for the required router fields (route_plan, todo_list, metrics, etc.).
  • Raw output + prompt debug – inspect the verbatim generation and the exact prompt string sent to the checkpoint.
  • One-click clear – reset the UI between experiments without reloading models.
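The prompt-builder idea above can be sketched as a small helper. This is an illustrative assumption, not the Space's actual code: the real canonical system prompt and field names live in app.py.

```python
def build_router_prompt(task, difficulty="medium", tags=(),
                        context="", acceptance_criteria="", guidance=""):
    """Assemble a single-turn router prompt (illustrative sketch;
    the canonical system prompt in app.py may differ)."""
    lines = [
        "You are the CourseGPT-Pro router. Respond with JSON only.",
        f"Difficulty: {difficulty}",
        f"Tags: {', '.join(tags) if tags else 'none'}",
    ]
    if context:
        lines.append(f"Context:\n{context}")
    if acceptance_criteria:
        lines.append(f"Acceptance criteria:\n{acceptance_criteria}")
    if guidance:
        lines.append(f"Additional guidance:\n{guidance}")
    # The user task always comes last so the model answers it directly.
    lines.append(f"Task:\n{task}")
    return "\n\n".join(lines)
```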

🔄 Workflow

  1. Describe the user task / homework prompt in the main textbox.
  2. Optionally provide context, acceptance criteria, and extra guidance.
  3. Choose the difficulty tier, tags, model, and decoding parameters.
  4. Click Generate Router Plan.
  5. Review:
    • Raw Model Output – plain text returned by the LLM.
    • Parsed Router Plan – JSON tree extracted from the output.
    • Validation Panel – confirms whether all required fields are present.
    • Full Prompt – copy/paste for repro or benchmarking.

If JSON parsing fails, the validation panel will surface the error so you can tweak decoding parameters or the prompt.
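A minimal sketch of the extraction-plus-validation step, assuming the model emits a single JSON object somewhere in its output (the required-field list here is the subset named above; the Space may check more):

```python
import json
import re

REQUIRED_FIELDS = ("route_plan", "todo_list", "metrics")  # assumed subset

def extract_router_json(raw: str):
    """Pull the first-to-last-brace span out of the raw generation,
    parse it, and report any missing required fields."""
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)  # greedy: outermost braces
    if not match:
        return None, ["no JSON object found in output"]
    try:
        plan = json.loads(match.group(0))
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc}"]
    missing = [f for f in REQUIRED_FIELDS if f not in plan]
    return plan, [f"missing field: {f}" for f in missing]
```

When parsing fails, the returned error list is what a validation panel would surface to the user.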

🧠 Supported Models

| Name | Base | Notes |
| --- | --- | --- |
| Router-Qwen3-32B-AWQ | Qwen3 32B | Best overall acceptance on CourseGPT-Pro benchmarks. Optimized with AWQ 4-bit quantization and FlashAttention-2. |
| Router-Gemma3-27B-AWQ | Gemma3 27B | Slightly smaller; tends to favour math-first plans. Optimized with AWQ 4-bit quantization and FlashAttention-2. |

Performance Optimizations

  • AWQ (Activation-Aware Weight Quantization): 4-bit quantization for faster inference and lower memory usage
  • FlashAttention-2: Optimized attention mechanism for better throughput
  • TF32 Math: Enabled for Ampere+ GPUs for faster matrix operations
  • Kernel Warmup: Automatic CUDA kernel JIT compilation on startup
  • Fast Tokenization: Uses fast tokenizers with CPU preprocessing
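The load order behind these optimizations (AWQ first, then 8-bit BitsAndBytes, then unquantized dtypes) can be sketched as a strategy list. The function and strategy names are illustrative, not the Space's actual identifiers:

```python
def quantization_attempts(awq_available: bool, bnb_available: bool):
    """Return load strategies to try in order, mirroring the fallback
    chain described in this README (names are illustrative)."""
    attempts = []
    if awq_available:
        attempts.append("awq-4bit")      # fastest, lowest memory
    if bnb_available:
        attempts.append("bnb-8bit")      # fallback quantization
    attempts.extend(["bf16", "fp16", "fp32"])  # unquantized last resorts
    return attempts
```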

Both checkpoints are merged + quantized in the Alovestocode namespace and require HF_TOKEN with read access.

βš™οΈ Local Development

cd Milestone-6/router-agent/zero-gpu-space
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
export HF_TOKEN=hf_xxx
python app.py

πŸ“ Notes

  • The app attempts AWQ 4-bit quantization first (if available), then falls back to 8-bit BitsAndBytes, and finally to bf16/fp16/fp32 if quantization fails.
  • FlashAttention-2 is automatically enabled when available for improved performance.
  • CUDA kernels are warmed up on startup to reduce first-token latency.
  • The UI enforces single-turn router generations; conversation history and web search are intentionally omitted to match the Milestone 6 deliverable.
  • If you need to re-enable web search or more checkpoints, extend MODELS and adjust the prompt builder accordingly.
  • Benchmarking: run python Milestone-6/router-agent/tests/run_router_space_benchmark.py --space Alovestocode/ZeroGPU-LLM-Inference --limit 32 (requires pip install gradio_client) to call the Space, dump predictions, and evaluate against the Milestone 5 hard suite + thresholds.
  • Set ROUTER_PREFETCH_MODEL (single value) or ROUTER_PREFETCH_MODELS=Router-Qwen3-32B-AWQ,Router-Gemma3-27B-AWQ (comma-separated, ALL for every checkpoint) to warm-load weights during startup. Disable background warming by setting ROUTER_WARM_REMAINING=0.
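The prefetch environment variables described above resolve to a list of checkpoints to warm-load; a minimal sketch of that resolution logic (the real app.py implementation may differ):

```python
import os

def prefetch_targets(available_models):
    """Resolve ROUTER_PREFETCH_MODEL / ROUTER_PREFETCH_MODELS into the
    checkpoint names to warm-load at startup (illustrative sketch)."""
    multi = os.environ.get("ROUTER_PREFETCH_MODELS")
    if multi:
        if multi.strip().upper() == "ALL":
            return list(available_models)  # warm every checkpoint
        return [name.strip() for name in multi.split(",") if name.strip()]
    single = os.environ.get("ROUTER_PREFETCH_MODEL")
    if single:
        return [single.strip()]
    return []  # no prefetch requested
```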