# DeepCritical: Medical Drug Repurposing Research Agent
## Project Overview
---
## Executive Summary
**DeepCritical** is a deep research agent designed to accelerate medical drug repurposing research by autonomously searching, analyzing, and synthesizing evidence from multiple biomedical databases.
### The Problem We Solve
Drug repurposing - finding new therapeutic uses for existing FDA-approved drugs - can take years of manual literature review. Researchers must:
- Search thousands of papers across multiple databases
- Identify molecular mechanisms
- Find relevant clinical trials
- Assess safety profiles
- Synthesize evidence into actionable insights
**DeepCritical automates this process, compressing hours of manual work into minutes.**
### What Is Drug Repurposing?
**Simple Explanation:**
Using existing approved drugs to treat NEW diseases they weren't originally designed for.
**Real Examples:**
- **Viagra** (sildenafil): Originally for heart disease → Now treats erectile dysfunction
- **Thalidomide**: Once banned → Now treats multiple myeloma
- **Aspirin**: Pain reliever → Heart attack prevention
- **Metformin**: Diabetes drug → Being tested for aging/longevity
**Why It Matters:**
- Faster than developing new drugs (years vs. decades)
- Cheaper (known safety profiles)
- Lower risk (already FDA approved)
- Immediate patient benefit potential
---
## Core Use Case
### Primary Query Type
> "What existing drugs might help treat [disease/condition]?"
### Example Queries
1. **Long COVID Fatigue**
   - Query: "What existing drugs might help treat long COVID fatigue?"
   - Agent searches: PubMed, clinical trials, drug databases
   - Output: List of candidate drugs with mechanisms + evidence + citations
2. **Alzheimer's Disease**
   - Query: "Find existing drugs that target beta-amyloid pathways"
   - Agent identifies: Disease mechanisms → Drug candidates → Clinical evidence
   - Output: Comprehensive research report with drug candidates
3. **Rare Disease Treatment**
   - Query: "What drugs might help with fibrodysplasia ossificans progressiva?"
   - Agent finds: Similar conditions → Shared pathways → Potential treatments
   - Output: Evidence-based treatment suggestions
---
## System Architecture
### High-Level Design (Phases 1-8)
```text
User Query
    ↓
Gradio UI (Phase 4)
    ↓
Magentic Manager (Phase 5)  ← LLM-powered coordinator
    ├── SearchAgent (Phase 2+5)    ←→ PubMed + Web + VectorDB (Phase 6)
    ├── HypothesisAgent (Phase 7)  ←→ Mechanistic Reasoning
    ├── JudgeAgent (Phase 3+5)     ←→ Evidence Assessment
    └── ReportAgent (Phase 8)      ←→ Final Synthesis
    ↓
Structured Research Report
```
### Key Components
1. **Magentic Manager (Orchestrator)**
   - LLM-powered multi-agent coordinator
   - Dynamic planning and agent selection
   - Built-in stall detection and replanning
   - Microsoft Agent Framework integration
2. **SearchAgent (Phase 2+5+6)**
   - PubMed E-utilities search
   - DuckDuckGo web search
   - Semantic search via ChromaDB (Phase 6)
   - Evidence deduplication
3. **HypothesisAgent (Phase 7)**
   - Generates Drug → Target → Pathway → Effect hypotheses (see the sketch after this list)
   - Guides targeted searches
   - Scientific reasoning about mechanisms
4. **JudgeAgent (Phase 3+5)**
   - LLM-based evidence assessment
   - Mechanism score + Clinical score
   - Recommends continue/synthesize
   - Generates refined search queries
5. **ReportAgent (Phase 8)**
   - Structured scientific reports
   - Executive summary, methodology
   - Hypotheses tested with evidence counts
   - Proper citations and limitations
6. **Gradio UI (Phase 4)**
   - Chat interface for questions
   - Real-time progress via events
   - Mode toggle (Simple/Magentic)
   - Formatted markdown output
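As referenced in item 3, a minimal sketch of how a Drug → Target → Pathway → Effect hypothesis could be modeled with Pydantic. The class and field names here are illustrative assumptions, not the project's actual schema:
```python
from pydantic import BaseModel, Field

class MechanisticHypothesis(BaseModel):
    """Hypothetical schema for a Drug → Target → Pathway → Effect chain."""
    drug: str = Field(description="Candidate drug, e.g. 'metformin'")
    target: str = Field(description="Molecular target, e.g. 'AMPK'")
    pathway: str = Field(description="Biological pathway affected")
    effect: str = Field(description="Predicted therapeutic effect")
    citations: list[str] = Field(default_factory=list, description="Supporting PMIDs")
```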
---
## Design Patterns
### 1. Search-and-Judge Loop (Primary Pattern)
```python
def research(question: str) -> Report:
    context = []
    query = question
    for iteration in range(max_iterations):
        # SEARCH: query the relevant tools with the current query
        results = search_tools(query, context)
        context.extend(results)
        # JUDGE: stop early once the evidence is sufficient
        if judge.is_sufficient(question, context):
            break
        # REFINE: adjust the search query for the next iteration
        query = refine_query(question, context)
    # SYNTHESIZE: generate the final report
    return synthesize_report(question, context)
```
**Why This Pattern:**
- Simple to implement and debug
- Clear loop termination conditions
- Iterative improvement of search quality
- Balances depth vs. speed
### 2. Multi-Tool Orchestration
```text
Question → Agent decides which tools to use
             ↓
     ┌───────┴──────┬─────────────┬─────────────┐
     ↓              ↓             ↓             ↓
  PubMed       Web Search     Trials DB      Drug DB
     ↓              ↓             ↓             ↓
     └───────┬──────┴─────────────┴─────────────┘
             ↓
    Aggregate Results → Judge
```
**Why This Pattern:**
- Different sources provide different evidence types
- Parallel tool execution (when possible)
- Comprehensive coverage
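A minimal sketch of the parallel fan-out, assuming each tool is an async callable that returns evidence records (the `SearchTool` alias and tool interface are illustrative, not the project's real API):
```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical tool signature: query in, list of evidence records out.
SearchTool = Callable[[str], Awaitable[list[dict]]]

async def fan_out(query: str, tools: list[SearchTool]) -> list[dict]:
    """Run all tools concurrently and aggregate their results for the judge."""
    results = await asyncio.gather(
        *(tool(query) for tool in tools),
        return_exceptions=True,  # one failing source shouldn't sink the search
    )
    evidence: list[dict] = []
    for result in results:
        if not isinstance(result, Exception):
            evidence.extend(result)
    return evidence
```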
### 3. LLM-as-Judge with Token Budget
**Dual Stopping Conditions:**
- **Smart Stop**: LLM judge says "we have sufficient evidence"
- **Hard Stop**: Token budget exhausted OR max iterations reached
**Why Both:**
- Judge enables early exit when the answer is good
- Budget prevents runaway costs
- Iteration cap prevents infinite loops
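A minimal sketch of the combined stop check. The 50K default matches the hard token cap in Appendix B; the function name, parameters, and iteration default are illustrative:
```python
def should_stop(judge_approved: bool, tokens_used: int, iteration: int,
                token_budget: int = 50_000, max_iterations: int = 5) -> bool:
    """True when any stopping condition fires, whichever comes first."""
    return (judge_approved                   # smart stop: evidence is sufficient
            or tokens_used >= token_budget   # hard stop: cost control
            or iteration >= max_iterations)  # hard stop: no infinite loops
```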
### 4. Stateful Checkpointing
```text
.deepresearch/
├── state/
│   └── query_123.json      # Current research state
├── checkpoints/
│   └── query_123_iter3/    # Checkpoint at iteration 3
└── workspace/
    └── query_123/          # Downloaded papers, data
```
**Why This Pattern:**
- Resume interrupted research
- Debugging and analysis
- Cost savings (don't re-search)
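A minimal sketch of saving and resuming state against the layout above, using plain JSON files (the helper names are illustrative):
```python
import json
from pathlib import Path

STATE_DIR = Path(".deepresearch/state")

def save_state(query_id: str, state: dict) -> None:
    """Persist research state so an interrupted run can resume later."""
    STATE_DIR.mkdir(parents=True, exist_ok=True)
    (STATE_DIR / f"{query_id}.json").write_text(json.dumps(state, indent=2))

def load_state(query_id: str) -> dict | None:
    """Return prior state if a checkpoint exists, else None (fresh start)."""
    path = STATE_DIR / f"{query_id}.json"
    return json.loads(path.read_text()) if path.exists() else None
```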
---
## Component Breakdown
### Agent (Orchestrator)
- **Responsibility**: Coordinate the research process
- **Size**: ~100 lines
- **Key Methods**:
  - `research(question)` - Main entry point
  - `plan_search_strategy()` - Decide what to search
  - `execute_search()` - Run tool queries
  - `evaluate_progress()` - Call the judge
  - `synthesize_findings()` - Generate the report
### Tools
- **Responsibility**: Interface with external data sources
- **Size**: ~50 lines per tool
- **Implementations**:
  - `PubMedTool` - Search biomedical literature
  - `WebSearchTool` - General medical information
  - `ClinicalTrialsTool` - Trial data (optional)
  - `DrugInfoTool` - FDA drug database (optional)
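A minimal sketch of the `PubMedTool` search step against the public E-utilities endpoint, using `httpx` from the core dependencies; fetching abstracts via `efetch` is omitted, and error handling is pared down:
```python
import httpx

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

async def pubmed_search(query: str, retmax: int = 20) -> list[str]:
    """Return PubMed IDs (PMIDs) matching a query via the free esearch API."""
    params = {"db": "pubmed", "term": query, "retmode": "json", "retmax": retmax}
    async with httpx.AsyncClient(timeout=30) as client:
        resp = await client.get(f"{EUTILS}/esearch.fcgi", params=params)
        resp.raise_for_status()
    return resp.json()["esearchresult"]["idlist"]
```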
### Judge
- **Responsibility**: Evaluate evidence quality
- **Size**: ~50 lines
- **Key Methods**:
  - `is_sufficient(question, evidence)` → bool
  - `assess_quality(evidence)` → score
  - `identify_gaps(question, evidence)` → missing_info
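A minimal sketch of `is_sufficient`, assuming an injected `llm` callable that takes a prompt and returns completion text; the prompt wording and JSON contract are illustrative, not the project's actual prompts:
```python
import json
from typing import Callable

LLMCall = Callable[[str], str]  # hypothetical: prompt in, completion text out

def is_sufficient(question: str, evidence: list[str], llm: LLMCall) -> bool:
    """Ask the LLM to grade the evidence and parse a strict JSON verdict."""
    prompt = (
        "Judge whether this evidence suffices to answer the question.\n"
        f"Question: {question}\n"
        "Evidence:\n" + "\n".join(evidence[:20]) + "\n"
        'Reply with JSON only: {"sufficient": true_or_false, "gaps": "..."}'
    )
    verdict = json.loads(llm(prompt))
    return bool(verdict.get("sufficient", False))
```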
### Gradio App
- **Responsibility**: User interface
- **Size**: ~50 lines
- **Features**:
  - Text input for questions
  - Progress indicators
  - Formatted output with citations
  - Download research report
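A minimal sketch of the UI wiring with Gradio; the `answer` stub stands in for the real agent call:
```python
import gradio as gr

def answer(question: str) -> str:
    # Placeholder: wire this to the research agent's entry point.
    return f"**Researching:** {question}\n\n_(report appears here)_"

demo = gr.Interface(
    fn=answer,
    inputs=gr.Textbox(label="Drug repurposing question"),
    outputs=gr.Markdown(label="Research report"),
    title="DeepCritical",
)

if __name__ == "__main__":
    demo.launch()
```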
---
## Technical Stack
### Core Dependencies
```toml
[tool.poetry.dependencies]
python = ">=3.10"
pydantic = "^2.7"
pydantic-ai = "^0.0.16"
fastmcp = "^0.1.0"
gradio = "^5.0"
beautifulsoup4 = "^4.12"
httpx = "^0.27"
```
### Optional Enhancements
- `modal` - For GPU-accelerated local LLM
- `fastmcp` - MCP server integration
- `sentence-transformers` - Semantic search
- `faiss-cpu` - Vector similarity
### Tool APIs & Rate Limits
| API | Cost | Rate Limit | API Key? | Notes |
|-----|------|------------|----------|-------|
| **PubMed E-utilities** | Free | 3/sec (no key), 10/sec (with key) | Optional | Register at NCBI for higher limits |
| **Brave Search API** | Free tier | 2,000/month free | Required | Primary web search |
| **DuckDuckGo** | Free | Unofficial, ~1/sec | No | Fallback web search |
| **ClinicalTrials.gov** | Free | 100/min | No | Stretch goal |
| **OpenFDA** | Free | 240/min (no key), 120K/day (with key) | Optional | Drug info |
**Web Search Strategy (Priority Order):**
1. **Brave Search API** (free tier: 2,000 queries/month) - Primary
2. **DuckDuckGo** (unofficial, no API key) - Fallback
3. **SerpAPI** ($50/month) - Only if free options fail
**Why NOT SerpAPI first?**
- Costs money (hackathon budget = $0)
- Free alternatives work fine for the demo
- Can upgrade later if needed
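A minimal sketch of the fallback chain, assuming the Brave Search REST endpoint and the unofficial `duckduckgo_search` package; the endpoint path and response shape should be verified against current docs:
```python
import httpx
from duckduckgo_search import DDGS  # unofficial, no API key needed

async def web_search(query: str, brave_key: str | None = None) -> list[dict]:
    """Try Brave Search first (free tier); fall back to DuckDuckGo on failure."""
    if brave_key:
        try:
            async with httpx.AsyncClient(timeout=15) as client:
                resp = await client.get(
                    "https://api.search.brave.com/res/v1/web/search",
                    params={"q": query, "count": 10},
                    headers={"X-Subscription-Token": brave_key},
                )
                resp.raise_for_status()
                return resp.json()["web"]["results"]
        except (httpx.HTTPError, KeyError):
            pass  # fall through to the free fallback
    return DDGS().text(query, max_results=10)
```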
---
## Success Criteria
### Phase 1-5 (MVP) ✅ COMPLETE
**Completed in ONE DAY:**
- [x] User can ask a drug repurposing question
- [x] Agent searches PubMed (async)
- [x] Agent searches the web (DuckDuckGo)
- [x] LLM judge evaluates evidence quality
- [x] System respects the token budget and iteration cap
- [x] Output includes drug candidates + citations
- [x] Works end-to-end for the demo query
- [x] Gradio UI with streaming progress
- [x] Magentic multi-agent orchestration
- [x] 38 unit tests passing
- [x] CI/CD pipeline green
### Hackathon Submission ✅ COMPLETE
- [x] Gradio UI deployed on HuggingFace Spaces
- [x] Example queries working and tested
- [x] Architecture documentation
- [x] README with setup instructions
### Phase 6-8 (Enhanced)
**Specs ready for implementation:**
- [ ] Embeddings & Semantic Search (Phase 6)
- [ ] Hypothesis Agent (Phase 7)
- [ ] Report Agent (Phase 8)
### What's EXPLICITLY Out of Scope
**NOT building (to stay focused):**
- ❌ User authentication
- ❌ Database storage of queries
- ❌ Multi-user support
- ❌ Payment/billing
- ❌ Production monitoring
- ❌ Mobile UI
---
## Implementation Timeline
### Day 1 (Today): Architecture & Setup
- [x] Define use case (drug repurposing) ✅
- [x] Write architecture docs ✅
- [ ] Create project structure
- [ ] First PR: Structure + Docs
### Day 2: Core Agent Loop
- [ ] Implement basic orchestrator
- [ ] Add PubMed search tool
- [ ] Simple judge (keyword-based)
- [ ] Test with 1 query
### Day 3: Intelligence Layer
- [ ] Upgrade to LLM judge
- [ ] Add web search tool
- [ ] Token budget tracking
- [ ] Test with multiple queries
### Day 4: UI & Integration
- [ ] Build Gradio interface
- [ ] Wire up agent to UI
- [ ] Add progress indicators
- [ ] Format output nicely
### Day 5: Polish & Extend
- [ ] Add more tools (clinical trials)
- [ ] Improve judge prompts
- [ ] Checkpoint system
- [ ] Error handling
### Day 6: Deploy & Document
- [ ] Deploy to HuggingFace Spaces
- [ ] Record demo video
- [ ] Write submission materials
- [ ] Final testing
---
## Questions This Document Answers
### For The Maintainer
**Q: "What should our design pattern be?"**
A: Search-and-judge loop with multi-tool orchestration (detailed in the Design Patterns section)
**Q: "Should we use LLM-as-judge or token budget?"**
A: Both - judge for smart stopping, budget for cost control
**Q: "What's the break pattern?"**
A: Three conditions: judge approval, token limit, or max iterations (whichever comes first)
**Q: "What components do we need?"**
A: Agent orchestrator, tools (PubMed/web), judge, Gradio UI (see Component Breakdown)
### For The Team
**Q: "What are we actually building?"**
A: A medical drug repurposing research agent (see Core Use Case)
**Q: "How complex should it be?"**
A: Simple but complete - ~300 lines of core code (see the sizes in Component Breakdown)
**Q: "What's the timeline?"**
A: 6 days, MVP by Day 3, polish Days 4-6 (see Implementation Timeline)
| **Q: "What datasets/APIs do we use?"** | |
| A: PubMed (free), web search, clinical trials.gov (see Tool APIs) | |
---
## Next Steps
1. **Review this document** - Team feedback on architecture
2. **Finalize design** - Incorporate feedback
3. **Create project structure** - Scaffold repository
4. **Move to proper docs** - `docs/architecture/` folder
5. **Open first PR** - Structure + Documentation
6. **Start implementation** - Day 2 onward
---
## Notes & Decisions
### Why Drug Repurposing?
- Clear, impressive use case
- Real-world medical impact
- Good data availability (PubMed, trials)
- Easy to explain (the Viagra example!)
- Physician on team ✅
### Why Simple Architecture?
- 6-day timeline
- Need a working end-to-end system
- Hackathon judges value "works" over "complex"
- Can extend later if successful
### Why These Tools First?
- PubMed: Best biomedical literature source
- Web search: General medical knowledge
- Clinical trials: Evidence of actual testing
- Others: Nice-to-have, not critical for MVP
---
## Appendix A: Demo Queries (Pre-tested)
These queries will be used for demo and testing. They were chosen because:
1. They have good PubMed coverage
2. They're medically interesting
3. They show the system's capabilities
### Primary Demo Query
```text
"What existing drugs might help treat long COVID fatigue?"
```
**Expected candidates**: CoQ10, low-dose naltrexone, modafinil
**Expected sources**: 20+ PubMed papers, 2-3 clinical trials
### Secondary Demo Queries
```text
"Find existing drugs that might slow Alzheimer's progression"
"What approved medications could help with fibromyalgia pain?"
"Which diabetes drugs show promise for cancer treatment?"
```
### Why These Queries?
- Represent real clinical needs
- Have substantial literature
- Show diverse drug classes
- Physician on team can validate results
---
## Appendix B: Risk Assessment
| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| PubMed rate limiting | Medium | High | Implement caching; respect 3/sec (see throttle sketch below) |
| Web search API fails | Low | Medium | DuckDuckGo fallback |
| LLM costs exceed budget | Medium | Medium | Hard token cap at 50K |
| Judge quality poor | Medium | High | Pre-test prompts, iterate |
| HuggingFace deploy issues | Low | High | Test deployment on Day 4 |
| Demo crashes live | Medium | High | Pre-recorded backup video |
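For the PubMed rate-limiting row above, a minimal async throttle sketch (the class name and structure are illustrative):
```python
import asyncio
import time

class RateLimiter:
    """Allow at most `rate` calls per second, e.g. RateLimiter(3) for PubMed."""

    def __init__(self, rate: float):
        self.min_interval = 1.0 / rate
        self._last = 0.0
        self._lock = asyncio.Lock()

    async def wait(self) -> None:
        """Sleep just long enough to honor the minimum interval between calls."""
        async with self._lock:
            delay = self.min_interval - (time.monotonic() - self._last)
            if delay > 0:
                await asyncio.sleep(delay)
            self._last = time.monotonic()
```
Usage: call `await limiter.wait()` immediately before each E-utilities request.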
---
**Document Status**: Official Architecture Spec
**Review Score**: 98/100
**Last Updated**: November 2025