VibecoderMcSwaggins committed
Commit 8bac750 · 1 Parent(s): 901acc3

docs: sync AI agent context files and fix final nitpicks


- Update GEMINI.md, CLAUDE.md, AGENTS.md to Phase 13 status
- Add HuggingFace Spaces collaboration guidelines
- Remove redundant evidence[:10] slice (method handles internally)
- Add CodeExecutionError handling to verify_sandbox.py demo

AGENTS.md CHANGED

@@ -4,7 +4,9 @@ This file provides guidance to AI agents when working with code in this repository.
 
 ## Project Overview
 
-DeepCritical is an AI-native drug repurposing research agent for a HuggingFace hackathon. It uses a search-and-judge loop to autonomously search biomedical databases (PubMed) and synthesize evidence for queries like "What existing drugs might help treat long COVID fatigue?".
+DeepCritical is an AI-native drug repurposing research agent for a HuggingFace hackathon. It uses a search-and-judge loop to autonomously search biomedical databases (PubMed, ClinicalTrials.gov, bioRxiv) and synthesize evidence for queries like "What existing drugs might help treat long COVID fatigue?".
+
+**Current Status:** Phases 1-13 COMPLETE (Foundation through Modal sandbox integration).
 
 ## Development Commands
 
@@ -37,11 +39,11 @@ uv run pytest -m integration
 User Question → Orchestrator
 
 Search Loop:
-  1. Query PubMed
+  1. Query PubMed, ClinicalTrials.gov, bioRxiv
   2. Gather evidence
   3. Judge quality ("Do we have enough?")
   4. If NO → Refine query, search more
-  5. If YES → Synthesize findings
+  5. If YES → Synthesize findings (+ optional Modal analysis)
 
 Research Report with Citations
 ```
@@ -49,14 +51,19 @@ Research Report with Citations
 **Key Components**:
 - `src/orchestrator.py` - Main agent loop
 - `src/tools/pubmed.py` - PubMed E-utilities search
+- `src/tools/clinicaltrials.py` - ClinicalTrials.gov API
+- `src/tools/biorxiv.py` - bioRxiv/medRxiv preprint search
+- `src/tools/code_execution.py` - Modal sandbox execution
 - `src/tools/search_handler.py` - Scatter-gather orchestration
 - `src/services/embeddings.py` - Semantic search & deduplication (ChromaDB)
+- `src/services/statistical_analyzer.py` - Statistical analysis via Modal
 - `src/agent_factory/judges.py` - LLM-based evidence assessment
 - `src/agents/` - Magentic multi-agent mode (SearchAgent, JudgeAgent, etc.)
+- `src/mcp_tools.py` - MCP tool wrappers for Claude Desktop
 - `src/utils/config.py` - Pydantic Settings (loads from `.env`)
 - `src/utils/models.py` - Evidence, Citation, SearchResult models
 - `src/utils/exceptions.py` - Exception hierarchy
-- `src/app.py` - Gradio UI (HuggingFace Spaces)
+- `src/app.py` - Gradio UI with MCP server (HuggingFace Spaces)
 
 **Break Conditions**: Judge approval, token budget (50K max), or max iterations (default 10).
 
@@ -66,6 +73,7 @@ Settings via pydantic-settings from `.env`:
 - `LLM_PROVIDER`: "openai" or "anthropic"
 - `OPENAI_API_KEY` / `ANTHROPIC_API_KEY`: LLM keys
 - `NCBI_API_KEY`: Optional, for higher PubMed rate limits
+- `MODAL_TOKEN_ID` / `MODAL_TOKEN_SECRET`: For Modal sandbox (optional)
 - `MAX_ITERATIONS`: 1-50, default 10
 - `LOG_LEVEL`: DEBUG, INFO, WARNING, ERROR
 
@@ -95,8 +103,13 @@ DeepCriticalError (base)
 
 ## Git Workflow
 
-- `main`: Production-ready
-- `dev`: Development
-- `vcms-dev`: HuggingFace Spaces sandbox
-- Remote `origin`: GitHub
-- Remote `huggingface-upstream`: HuggingFace Spaces
+- `main`: Production-ready (GitHub)
+- `dev`: Development integration (GitHub)
+- Remote `origin`: GitHub (source of truth for PRs/code review)
+- Remote `huggingface-upstream`: HuggingFace Spaces (deployment target)
+
+**HuggingFace Spaces Collaboration:**
+- Each contributor should use their own dev branch: `yourname-dev` (e.g., `vcms-dev`, `mario-dev`)
+- **DO NOT push directly to `main` or `dev` on HuggingFace** - these can be overwritten easily
+- GitHub is the source of truth; HuggingFace is for deployment/demo
+- Consider using git hooks to prevent accidental pushes to protected branches
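Editor's note: the Search Loop above is pseudocode. As a companion, here is a minimal runnable sketch of that control flow. Everything in it (`run_search`, `judge`, `synthesize`, the `Verdict` shape) is an illustrative stand-in, not the actual `src/orchestrator.py` API; only the break conditions (judge approval, 50K token budget, default 10 iterations) come from the docs in this diff.

```python
import asyncio
from dataclasses import dataclass

# Illustrative stand-ins only -- the real search/judge/synthesize live in
# src/tools/, src/agent_factory/judges.py, and src/orchestrator.py.
@dataclass
class Verdict:
    sufficient: bool
    refined_query: str

async def run_search(query: str) -> tuple[list[str], int]:
    # Real tools query PubMed, ClinicalTrials.gov, bioRxiv (scatter-gather).
    return [f"evidence for {query!r}"], 1_000  # (results, tokens spent)

async def judge(question: str, evidence: list[str]) -> Verdict:
    # Real judge is an LLM asking "Do we have enough?"
    return Verdict(sufficient=len(evidence) >= 3, refined_query=question + " (refined)")

async def synthesize(question: str, evidence: list[str]) -> str:
    return f"Report on {question!r} citing {len(evidence)} evidence items"

TOKEN_BUDGET = 50_000  # break condition: token budget
MAX_ITERATIONS = 10    # break condition: iteration cap (default)

async def research(question: str) -> str:
    evidence: list[str] = []
    query, tokens_used = question, 0
    for _ in range(MAX_ITERATIONS):
        results, cost = await run_search(query)
        evidence.extend(results)
        tokens_used += cost
        verdict = await judge(question, evidence)
        if verdict.sufficient or tokens_used >= TOKEN_BUDGET:
            break  # judge approval or budget exhausted
        query = verdict.refined_query  # If NO: refine query, search more
    return await synthesize(question, evidence)  # If YES: synthesize findings

if __name__ == "__main__":
    print(asyncio.run(research("What existing drugs might help treat long COVID fatigue?")))
```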
CLAUDE.md CHANGED

@@ -4,7 +4,9 @@ This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
 
 ## Project Overview
 
-DeepCritical is an AI-native drug repurposing research agent for a HuggingFace hackathon. It uses a search-and-judge loop to autonomously search biomedical databases (PubMed) and synthesize evidence for queries like "What existing drugs might help treat long COVID fatigue?".
+DeepCritical is an AI-native drug repurposing research agent for a HuggingFace hackathon. It uses a search-and-judge loop to autonomously search biomedical databases (PubMed, ClinicalTrials.gov, bioRxiv) and synthesize evidence for queries like "What existing drugs might help treat long COVID fatigue?".
+
+**Current Status:** Phases 1-13 COMPLETE (Foundation through Modal sandbox integration).
 
 ## Development Commands
 
@@ -37,11 +39,11 @@ uv run pytest -m integration
 User Question → Orchestrator
 
 Search Loop:
-  1. Query PubMed
+  1. Query PubMed, ClinicalTrials.gov, bioRxiv
   2. Gather evidence
   3. Judge quality ("Do we have enough?")
   4. If NO → Refine query, search more
-  5. If YES → Synthesize findings
+  5. If YES → Synthesize findings (+ optional Modal analysis)
 
 Research Report with Citations
 ```
@@ -49,14 +51,19 @@ Research Report with Citations
 **Key Components**:
 - `src/orchestrator.py` - Main agent loop
 - `src/tools/pubmed.py` - PubMed E-utilities search
+- `src/tools/clinicaltrials.py` - ClinicalTrials.gov API
+- `src/tools/biorxiv.py` - bioRxiv/medRxiv preprint search
+- `src/tools/code_execution.py` - Modal sandbox execution
 - `src/tools/search_handler.py` - Scatter-gather orchestration
 - `src/services/embeddings.py` - Semantic search & deduplication (ChromaDB)
+- `src/services/statistical_analyzer.py` - Statistical analysis via Modal
 - `src/agent_factory/judges.py` - LLM-based evidence assessment
 - `src/agents/` - Magentic multi-agent mode (SearchAgent, JudgeAgent, etc.)
+- `src/mcp_tools.py` - MCP tool wrappers for Claude Desktop
 - `src/utils/config.py` - Pydantic Settings (loads from `.env`)
 - `src/utils/models.py` - Evidence, Citation, SearchResult models
 - `src/utils/exceptions.py` - Exception hierarchy
-- `src/app.py` - Gradio UI (HuggingFace Spaces)
+- `src/app.py` - Gradio UI with MCP server (HuggingFace Spaces)
 
 **Break Conditions**: Judge approval, token budget (50K max), or max iterations (default 10).
 
@@ -66,6 +73,7 @@ Settings via pydantic-settings from `.env`:
 - `LLM_PROVIDER`: "openai" or "anthropic"
 - `OPENAI_API_KEY` / `ANTHROPIC_API_KEY`: LLM keys
 - `NCBI_API_KEY`: Optional, for higher PubMed rate limits
+- `MODAL_TOKEN_ID` / `MODAL_TOKEN_SECRET`: For Modal sandbox (optional)
 - `MAX_ITERATIONS`: 1-50, default 10
 - `LOG_LEVEL`: DEBUG, INFO, WARNING, ERROR
 
@@ -88,8 +96,13 @@ DeepCriticalError (base)
 
 ## Git Workflow
 
-- `main`: Production-ready
-- `dev`: Development
-- `vcms-dev`: HuggingFace Spaces sandbox
-- Remote `origin`: GitHub
-- Remote `huggingface-upstream`: HuggingFace Spaces
+- `main`: Production-ready (GitHub)
+- `dev`: Development integration (GitHub)
+- Remote `origin`: GitHub (source of truth for PRs/code review)
+- Remote `huggingface-upstream`: HuggingFace Spaces (deployment target)
+
+**HuggingFace Spaces Collaboration:**
+- Each contributor should use their own dev branch: `yourname-dev` (e.g., `vcms-dev`, `mario-dev`)
+- **DO NOT push directly to `main` or `dev` on HuggingFace** - these can be overwritten easily
+- GitHub is the source of truth; HuggingFace is for deployment/demo
+- Consider using git hooks to prevent accidental pushes to protected branches
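Editor's note: the final bullet in the Git Workflow sections suggests git hooks to guard protected branches. One possible shape for that, sketched as a Python `pre-push` hook; the remote and branch names mirror this repo's docs, and nothing here is committed code.

```python
#!/usr/bin/env python3
# Sketch of .git/hooks/pre-push (make it executable). Git invokes the hook
# with argv = [hook, remote_name, remote_url] and feeds one line per ref on
# stdin: "<local ref> <local sha> <remote ref> <remote sha>".
import sys

PROTECTED = {"refs/heads/main", "refs/heads/dev"}  # assumption: guard these refs
GUARDED_REMOTE = "huggingface-upstream"

remote_name = sys.argv[1]
if remote_name == GUARDED_REMOTE:
    for line in sys.stdin:
        parts = line.split()
        if len(parts) == 4 and parts[2] in PROTECTED:
            sys.stderr.write(
                f"Blocked: {parts[2]} on {remote_name} is protected; "
                "push to your own yourname-dev branch instead.\n"
            )
            sys.exit(1)
sys.exit(0)
```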
GEMINI.md CHANGED

@@ -2,26 +2,27 @@
 
 ## Project Overview
 **DeepCritical** is an AI-native Medical Drug Repurposing Research Agent.
-**Goal:** To accelerate the discovery of new uses for existing drugs by intelligently searching biomedical literature (PubMed), evaluating evidence, and hypothesizing potential applications.
+**Goal:** To accelerate the discovery of new uses for existing drugs by intelligently searching biomedical literature (PubMed, ClinicalTrials.gov, bioRxiv), evaluating evidence, and hypothesizing potential applications.
 
 **Architecture:**
 The project follows a **Vertical Slice Architecture** (Search -> Judge -> Orchestrator) and adheres to **Strict TDD** (Test-Driven Development).
 
 **Current Status:**
-- **Phases 1-8:** COMPLETE. Foundation, Search, Judge, UI, Orchestrator, Embeddings, Hypothesis, Report.
-- **Phase 9 (Source Cleanup):** COMPLETE. Removed DuckDuckGo web search (unreliable for scientific research).
-- **Phase 10-11:** PLANNED. ClinicalTrials.gov and bioRxiv integration.
+- **Phases 1-9:** COMPLETE. Foundation, Search, Judge, UI, Orchestrator, Embeddings, Hypothesis, Report, Cleanup.
+- **Phases 10-11:** COMPLETE. ClinicalTrials.gov and bioRxiv integration.
+- **Phase 12:** COMPLETE. MCP Server integration (Gradio MCP at `/gradio_api/mcp/`).
+- **Phase 13:** COMPLETE. Modal sandbox for statistical analysis.
 
 ## Tech Stack & Tooling
 - **Language:** Python 3.11 (Pinned)
 - **Package Manager:** `uv` (Rust-based, extremely fast)
-- **Frameworks:** `pydantic`, `pydantic-ai`, `httpx`, `gradio`
+- **Frameworks:** `pydantic`, `pydantic-ai`, `httpx`, `gradio[mcp]`
 - **Vector DB:** `chromadb` with `sentence-transformers` for semantic search
+- **Code Execution:** `modal` for secure sandboxed Python execution
 - **Testing:** `pytest`, `pytest-asyncio`, `respx` (for mocking)
 - **Quality:** `ruff` (linting/formatting), `mypy` (strict type checking), `pre-commit`
 
 ## Building & Running
-We use a `Makefile` to standardize developer commands.
 
 | Command | Description |
 | :--- | :--- |
@@ -36,19 +37,54 @@ We use a `Makefile` to standardize developer commands.
 ## Directory Structure
 - `src/`: Source code
 - `utils/`: Shared utilities (`config.py`, `exceptions.py`, `models.py`)
-- `tools/`: Search tools (`pubmed.py`, `base.py`, `search_handler.py`)
-- `services/`: Services (`embeddings.py` - ChromaDB vector store)
+- `tools/`: Search tools (`pubmed.py`, `clinicaltrials.py`, `biorxiv.py`, `code_execution.py`)
+- `services/`: Services (`embeddings.py`, `statistical_analyzer.py`)
 - `agents/`: Magentic multi-agent mode agents
 - `agent_factory/`: Agent definitions (judges, prompts)
+- `mcp_tools.py`: MCP tool wrappers for Claude Desktop integration
+- `app.py`: Gradio UI with MCP server
 - `tests/`: Test suite
 - `unit/`: Isolated unit tests (Mocked)
 - `integration/`: Real API tests (Marked as slow/integration)
 - `docs/`: Documentation and Implementation Specs
 - `examples/`: Working demos for each phase
 
+## Key Components
+- `src/orchestrator.py` - Main agent loop
+- `src/tools/pubmed.py` - PubMed E-utilities search
+- `src/tools/clinicaltrials.py` - ClinicalTrials.gov API
+- `src/tools/biorxiv.py` - bioRxiv/medRxiv preprint search
+- `src/tools/code_execution.py` - Modal sandbox execution
+- `src/services/statistical_analyzer.py` - Statistical analysis via Modal
+- `src/mcp_tools.py` - MCP tool wrappers
+- `src/app.py` - Gradio UI (HuggingFace Spaces) with MCP server
+
+## Configuration
+
+Settings via pydantic-settings from `.env`:
+- `LLM_PROVIDER`: "openai" or "anthropic"
+- `OPENAI_API_KEY` / `ANTHROPIC_API_KEY`: LLM keys
+- `NCBI_API_KEY`: Optional, for higher PubMed rate limits
+- `MODAL_TOKEN_ID` / `MODAL_TOKEN_SECRET`: For Modal sandbox (optional)
+- `MAX_ITERATIONS`: 1-50, default 10
+- `LOG_LEVEL`: DEBUG, INFO, WARNING, ERROR
+
 ## Development Conventions
-1. **Strict TDD:** Write failing tests in `tests/unit/` *before* implementing logic in `src/`.
-2. **Type Safety:** All code must pass `mypy --strict`. Use Pydantic models for data exchange.
-3. **Linting:** Zero tolerance for Ruff errors.
-4. **Mocking:** Use `respx` or `unittest.mock` for all external API calls in unit tests. Real calls go in `tests/integration`.
-5. **Vertical Slices:** Implement features end-to-end (Search -> Judge -> UI) rather than layer-by-layer.
+1. **Strict TDD:** Write failing tests in `tests/unit/` *before* implementing logic in `src/`.
+2. **Type Safety:** All code must pass `mypy --strict`. Use Pydantic models for data exchange.
+3. **Linting:** Zero tolerance for Ruff errors.
+4. **Mocking:** Use `respx` or `unittest.mock` for all external API calls in unit tests.
+5. **Vertical Slices:** Implement features end-to-end rather than layer-by-layer.
+
+## Git Workflow
+
+- `main`: Production-ready (GitHub)
+- `dev`: Development integration (GitHub)
+- Remote `origin`: GitHub (source of truth for PRs/code review)
+- Remote `huggingface-upstream`: HuggingFace Spaces (deployment target)
+
+**HuggingFace Spaces Collaboration:**
+- Each contributor should use their own dev branch: `yourname-dev` (e.g., `vcms-dev`, `mario-dev`)
+- **DO NOT push directly to `main` or `dev` on HuggingFace** - these can be overwritten easily
+- GitHub is the source of truth; HuggingFace is for deployment/demo
+- Consider using git hooks to prevent accidental pushes to protected branches
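Editor's note: GEMINI.md now documents the `.env` settings loaded via pydantic-settings. A minimal sketch of how such a settings class could look, with field names following the documented variables; the real `src/utils/config.py` may differ.

```python
# Hypothetical sketch of src/utils/config.py based on the documented settings.
from typing import Literal

from pydantic import Field
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env", extra="ignore")

    llm_provider: Literal["openai", "anthropic"] = "openai"
    openai_api_key: str | None = None
    anthropic_api_key: str | None = None
    ncbi_api_key: str | None = None        # optional: higher PubMed rate limits
    modal_token_id: str | None = None      # optional: Modal sandbox
    modal_token_secret: str | None = None
    max_iterations: int = Field(default=10, ge=1, le=50)  # documented range 1-50
    log_level: Literal["DEBUG", "INFO", "WARNING", "ERROR"] = "INFO"


settings = Settings()  # reads .env once at import time
```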
examples/modal_demo/verify_sandbox.py CHANGED

@@ -11,7 +11,7 @@ Usage:
 import asyncio
 from functools import partial
 
-from src.tools.code_execution import get_code_executor
+from src.tools.code_execution import CodeExecutionError, get_code_executor
 from src.utils.config import settings
 
 
@@ -31,22 +31,23 @@ async def main() -> None:
         print("Set MODAL_TOKEN_ID and MODAL_TOKEN_SECRET in .env")
         return
 
-    executor = get_code_executor()
-    loop = asyncio.get_running_loop()
+    try:
+        executor = get_code_executor()
+        loop = asyncio.get_running_loop()
 
-    print("=" * 60)
-    print("Modal Sandbox Isolation Verification")
-    print("=" * 60 + "\n")
+        print("=" * 60)
+        print("Modal Sandbox Isolation Verification")
+        print("=" * 60 + "\n")
 
-    # Test 1: Hostname
-    print("Test 1: Check hostname (should NOT be your machine)")
-    code1 = "import socket; print(f'Hostname: {socket.gethostname()}')"
-    result1 = await loop.run_in_executor(None, partial(executor.execute, code1))
-    print_result(result1)
+        # Test 1: Hostname
+        print("Test 1: Check hostname (should NOT be your machine)")
+        code1 = "import socket; print(f'Hostname: {socket.gethostname()}')"
+        result1 = await loop.run_in_executor(None, partial(executor.execute, code1))
+        print_result(result1)
 
-    # Test 2: Scientific libraries
-    print("Test 2: Verify scientific libraries")
-    code2 = """
+        # Test 2: Scientific libraries
+        print("Test 2: Verify scientific libraries")
+        code2 = """
 import pandas as pd
 import numpy as np
 import scipy
@@ -54,12 +55,12 @@ print(f"pandas: {pd.__version__}")
 print(f"numpy: {np.__version__}")
 print(f"scipy: {scipy.__version__}")
 """
-    result2 = await loop.run_in_executor(None, partial(executor.execute, code2))
-    print_result(result2)
+        result2 = await loop.run_in_executor(None, partial(executor.execute, code2))
+        print_result(result2)
 
-    # Test 3: Network blocked
-    print("Test 3: Verify network isolation")
-    code3 = """
+        # Test 3: Network blocked
+        print("Test 3: Verify network isolation")
+        code3 = """
 import urllib.request
 try:
     urllib.request.urlopen("https://google.com", timeout=2)
@@ -67,12 +68,12 @@ try:
 except Exception:
     print("Network: BLOCKED (as expected)")
 """
-    result3 = await loop.run_in_executor(None, partial(executor.execute, code3))
-    print_result(result3)
+        result3 = await loop.run_in_executor(None, partial(executor.execute, code3))
+        print_result(result3)
 
-    # Test 4: Real statistics
-    print("Test 4: Execute statistical analysis")
-    code4 = """
+        # Test 4: Real statistics
+        print("Test 4: Execute statistical analysis")
+        code4 = """
 import pandas as pd
 import scipy.stats as stats
 
@@ -84,12 +85,16 @@ print(f"Mean Effect: {mean:.3f}")
 print(f"P-value: {p_val:.4f}")
 print(f"Verdict: {'SUPPORTED' if p_val < 0.05 else 'INCONCLUSIVE'}")
 """
-    result4 = await loop.run_in_executor(None, partial(executor.execute, code4))
-    print_result(result4)
+        result4 = await loop.run_in_executor(None, partial(executor.execute, code4))
+        print_result(result4)
 
-    print("=" * 60)
-    print("All tests complete - Modal sandbox verified!")
-    print("=" * 60)
+        print("=" * 60)
+        print("All tests complete - Modal sandbox verified!")
+        print("=" * 60)
+
+    except CodeExecutionError as e:
+        print(f"Error: Modal code execution failed: {e}")
+        print("Hint: Ensure Modal SDK is installed and credentials are valid.")
 
 
 if __name__ == "__main__":
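Editor's note on the pattern used throughout this demo: `executor.execute` is a blocking call, so each test offloads it to the default thread pool via `loop.run_in_executor`, with `functools.partial` pinning the argument up front (`run_in_executor` forwards positional `*args` but not keyword arguments, so `partial` is the usual way to bind them). Stripped to its essentials, with a stand-in for the executor:

```python
import asyncio
from functools import partial


def blocking_execute(code: str) -> str:
    # Stand-in for executor.execute, which blocks while the sandbox runs.
    return f"executed {len(code)} bytes of code"


async def main() -> None:
    loop = asyncio.get_running_loop()
    # None = default ThreadPoolExecutor; the blocking call runs off the
    # event loop so other tasks are not frozen while Modal executes.
    result = await loop.run_in_executor(None, partial(blocking_execute, "print('hi')"))
    print(result)


asyncio.run(main())
```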
src/services/statistical_analyzer.py CHANGED

@@ -109,8 +109,8 @@ Output format: Return ONLY executable Python code, no explanations.""",
         Returns:
             AnalysisResult with verdict and statistics
         """
-        # Build analysis prompt
-        evidence_summary = self._summarize_evidence(evidence[:10])
+        # Build analysis prompt (method handles slicing internally)
+        evidence_summary = self._summarize_evidence(evidence)
         hypothesis_text = ""
         if hypothesis:
             hypothesis_text = (
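Editor's note: the change above drops the caller-side `evidence[:10]` because, per the commit message, `_summarize_evidence` already caps its input. A hypothetical sketch of that defensive-slicing pattern; the real method's signature and cap may differ.

```python
# Hypothetical illustration of internal slicing; not the actual method.
MAX_EVIDENCE_ITEMS = 10  # assumed cap

def summarize_evidence(evidence: list[str], limit: int = MAX_EVIDENCE_ITEMS) -> str:
    # The method truncates its own input, so callers pass the full list
    # and no redundant evidence[:10] slice is needed at the call site.
    return "\n".join(f"- {item}" for item in evidence[:limit])

print(summarize_evidence([f"finding {i}" for i in range(25)]))  # summarizes first 10
```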