Spaces: DataQuests/DeepCritical

VibecoderMcSwaggins committed 13 days ago

Commit

b4aa4ad

1 Parent(s): c7584c1

docs: update guides and add testing strategy documentation

- Added links to the Setup and Deployment Guides in the index.
- Introduced a new Testing Strategy document outlining unit, integration, and E2E testing approaches.
- Updated the Development section to include a link to the Contributing guide.

Files changed (3)

docs/development/testing.md +139 -0
docs/guides/deployment.md +142 -0
docs/index.md +5 -4

docs/development/testing.md ADDED

	@@ -0,0 +1,139 @@

+# Testing Strategy
+## Ensuring DeepCritical Is Ironclad
+---
+## Overview
+Our testing strategy follows a strict **Pyramid of Reliability**:
+1. **Unit Tests**: Fast, isolated logic checks (60% of tests)
+2. **Integration Tests**: Tool interactions & Agent loops (30% of tests)
+3. **E2E / Regression Tests**: Full research workflows (10% of tests)
+**Goal**: Ship a research agent that doesn't hallucinate, crash on API limits, or burn $100 in tokens by accident.
+---
+## 1. Unit Tests (Fast & Cheap)
+**Location**: `tests/unit/`
+Focus on individual components without external network calls. Mock everything.
+### Key Test Cases
+#### Agent Logic
+- **Initialization**: Verify default config loads correctly.
+- **State Updates**: Ensure `ResearchState` updates correctly (e.g., token counts increment).
+- **Budget Checks**: Test that `should_continue()` returns `False` when the budget is exceeded.
+- **Error Handling**: Test partial failure recovery (e.g., one tool fails, agent continues).
+#### Tools (Mocked)
+- **Parser Logic**: Feed raw XML/JSON to tool parsers and verify `Evidence` objects.
+- **Validation**: Ensure tools reject invalid queries (empty strings, etc.).
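The parser and validation cases above could be exercised along these lines. This is a minimal sketch under stated assumptions: the `Evidence` fields, the `parse_pubmed_xml` and `validate_query` helpers, and the simplified XML shape are all hypothetical and are not taken from the DeepCritical codebase.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Evidence:
    """Hypothetical evidence record; the real model may carry more fields."""
    pmid: str
    title: str


def parse_pubmed_xml(raw_xml: str) -> list[Evidence]:
    """Turn a simplified PubMed-style XML payload into Evidence objects."""
    root = ET.fromstring(raw_xml)
    results = []
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext(".//PMID", default="").strip()
        title = article.findtext(".//ArticleTitle", default="").strip()
        if pmid:  # skip records that have no identifier
            results.append(Evidence(pmid=pmid, title=title))
    return results


def validate_query(query: str) -> str:
    """Reject empty or whitespace-only queries before any network call."""
    cleaned = query.strip()
    if not cleaned:
        raise ValueError("query must not be empty")
    return cleaned


# Placeholder fixture, not a real PubMed record.
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>12345</PMID>
      <Article><ArticleTitle>Example title</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
""".strip()
```

Keeping the fixture inline (or under `tests/fixtures/`) means the parser test never touches the network.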
+#### Judge Prompts
+- **Schema Compliance**: Verify prompt templates generate valid JSON structure instructions.
+- **Variable Injection**: Ensure `{question}` and `{context}` are injected correctly into prompts.
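A variable-injection test might look like the sketch below. The template text, the `render_prompt` helper, and the JSON fields `sufficient` and `reason` are illustrative assumptions; only the `{question}` and `{context}` placeholders come from the list above.

```python
import string

# Hypothetical judge template; doubled braces keep the JSON example literal.
JUDGE_TEMPLATE = (
    "Question: {question}\n"
    "Context: {context}\n"
    'Respond with JSON of the form {{"sufficient": true, "reason": "..."}}'
)


def render_prompt(template: str, **values: str) -> str:
    """Fill a template, failing loudly if any placeholder is left unset."""
    expected = {
        name for _, name, _, _ in string.Formatter().parse(template) if name
    }
    missing = expected - set(values)
    if missing:
        raise KeyError(f"missing prompt variables: {sorted(missing)}")
    return template.format(**values)
```

Checking for missing variables up front turns a silent malformed prompt into an immediate test failure.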
+```python
+# Example: Testing State Logic
+def test_budget_stop():
+    state = ResearchState(tokens_used=50001, max_tokens=50000)
+    assert should_continue(state) is False
+```
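For the example above to run in isolation, something like the following stand-ins would be needed. This is only a sketch: the real `ResearchState` and `should_continue` are not shown in this commit, and treating a state that sits exactly at the limit as still within budget is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ResearchState:
    """Hypothetical stand-in with only the two fields used in the example."""
    tokens_used: int = 0
    max_tokens: int = 50_000

    def add_tokens(self, count: int) -> None:
        """Record tokens consumed by one LLM call."""
        self.tokens_used += count


def should_continue(state: ResearchState) -> bool:
    """Continue only while the token budget has not been exceeded."""
    return state.tokens_used <= state.max_tokens
```

A boundary test (exactly at the limit, one token over) is worth adding alongside `test_budget_stop`, since off-by-one errors here translate directly into overspend.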
+---
+## 2. Integration Tests (Realistic & Mocked I/O)
+**Location**: `tests/integration/`
+Focus on the interaction between the Orchestrator, Tools, and LLM Judge. Use **VCR.py** or a similar record/replay pattern so that API calls are recorded once and replayed on later runs, which saves money and time.
+### Key Test Cases
+#### Search Loop
+- **Iteration Flow**: Verify agent performs Search -> Judge -> Search loop.
+- **Tool Selection**: Verify correct tools are called based on judge output (mocked judge).
+- **Context Accumulation**: Ensure findings from Iteration 1 are passed to Iteration 2.
+#### MCP Server Integration
+- **Server Startup**: Verify MCP server starts and exposes tools.
+- **Client Connection**: Verify agent can call tools via MCP protocol.
+```python
+# Example: Testing Search Loop with Mocked Tools
+async def test_search_loop_flow():
+    agent = ResearchAgent(tools=[MockPubMed(), MockWeb()])
+    report = await agent.run("test query")
+    assert agent.state.iterations > 0
+    assert len(report.sources) > 0
+```
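The Search -> Judge -> Search flow and context accumulation can be checked without any project code, as in this sketch. `MockTool`, `MockJudge`, and `run_loop` are hypothetical stand-ins; the real `ResearchAgent` interface is not shown in this commit.

```python
import asyncio


class MockTool:
    """Hypothetical tool that returns canned findings without network I/O."""

    def __init__(self, name: str):
        self.name = name
        self.queries = []

    async def search(self, query: str) -> list[str]:
        self.queries.append(query)
        return [f"{self.name} finding for {query!r}"]


class MockJudge:
    """Asks for one more round, then declares the evidence sufficient."""

    async def assess(self, findings: list[str], iteration: int) -> bool:
        return iteration >= 2


async def run_loop(query, tools, judge, max_iterations=3):
    findings, history = [], []
    for iteration in range(1, max_iterations + 1):
        for tool in tools:
            findings.extend(await tool.search(query))
        # Record how much accumulated context the judge sees in each round.
        history.append(len(findings))
        if await judge.assess(findings, iteration):
            break
    return findings, history


findings, history = asyncio.run(
    run_loop("test query", [MockTool("pubmed"), MockTool("web")], MockJudge())
)
```

Because the judge sees a growing `findings` list, the `history` values show directly whether results from iteration 1 were carried into iteration 2.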
+---
+## 3. End-to-End (E2E) Tests (The "Real Deal")
+**Location**: `tests/e2e/`
+Run against **real APIs** (with strict rate limits) to verify system integrity. Run these **on demand** or **nightly**, not on every commit.
+### Key Test Cases
+#### The "Golden Query"
+Run the primary demo query: *"What existing drugs might help treat long COVID fatigue?"*
+- **Success Criteria**:
+  - Returns at least 2 valid drug candidates (e.g., CoQ10, LDN).
+  - Includes citations from PubMed.
+  - Completes within 3 iterations.
+  - JSON output matches schema.
+#### Deployment Smoke Test
+- **Gradio UI**: Verify UI launches and accepts input.
+- **Streaming**: Verify generator yields chunks (first chunk within 2s).
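The Golden Query success criteria could be encoded as a reusable checker. This is a sketch only: the report shape (`candidates`, `pmids`, `iterations`) is assumed for illustration and is not the project's actual schema.

```python
# Hypothetical minimal schema for the final report.
REQUIRED_KEYS = {"candidates", "iterations"}


def check_golden_report(
    report: dict, min_candidates: int = 2, max_iterations: int = 3
) -> list[str]:
    """Return the failed criteria; an empty list means the run passed."""
    missing = REQUIRED_KEYS - set(report)
    if missing:
        return [f"missing keys: {sorted(missing)}"]
    failures = []
    candidates = report["candidates"]
    if len(candidates) < min_candidates:
        failures.append(f"expected at least {min_candidates} drug candidates")
    if not any(c.get("pmids") for c in candidates):
        failures.append("no PubMed citations")
    if report["iterations"] > max_iterations:
        failures.append(f"took more than {max_iterations} iterations")
    return failures
```

Returning every failed criterion, instead of stopping at the first, makes a nightly E2E failure easier to diagnose from the log alone.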
+---
+## 4. Tools & Config
+### Pytest Configuration
+```toml
+# pyproject.toml
+[tool.pytest.ini_options]
+markers = [
+    "unit: fast, isolated tests",
+    "integration: mocked network tests",
+    "e2e: real network tests (slow, expensive)"
+]
+asyncio_mode = "auto"
+```
+### CI/CD Pipeline (GitHub Actions)
+1. **Lint**: `ruff check .`
+2. **Type Check**: `mypy .`
+3. **Unit**: `pytest -m unit`
+4. **Integration**: `pytest -m integration`
+5. **E2E**: (Manual trigger only)
+---
+## 5. Anti-Hallucination Validation
+How do we test if the agent is lying?
+1. **Citation Check**:
+   - Use a regex to verify that every `[PMID: 12345]` citation in the report exists in the `Evidence` list.
+   - Fail if a citation is "orphaned" (hallucinated ID).
+2. **Negative Constraints**:
+   - Test queries for fake diseases ("Ligma syndrome") -> Agent should return "No evidence found".
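The citation check can be sketched as follows, assuming citations appear in the `[PMID: 12345]` form shown above and that the PMIDs of the collected evidence are available as a set of strings. The function name is hypothetical.

```python
import re

# Matches citations of the form [PMID: 12345], with optional space after the colon.
PMID_PATTERN = re.compile(r"\[PMID:\s*(\d+)\]")


def find_orphaned_citations(report_text: str, evidence_pmids: set[str]) -> set[str]:
    """Return PMIDs cited in the report that are absent from the evidence."""
    cited = set(PMID_PATTERN.findall(report_text))
    return cited - evidence_pmids
```

A test would then assert that `find_orphaned_citations(report, pmids)` is empty for every generated report and fail otherwise.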
+---
+## Checklist for Implementation
+- [ ] Set up `tests/` directory structure
+- [ ] Configure `pytest` and `vcrpy`
+- [ ] Create `tests/fixtures/` for mock data (PubMed XMLs)
+- [ ] Write first unit test for `ResearchState`

docs/guides/deployment.md ADDED

	@@ -0,0 +1,142 @@

+# Deployment Guide
+## Launching DeepCritical: Gradio, MCP, & Modal
+---
+## Overview
+DeepCritical is designed for a multi-platform deployment strategy to maximize hackathon impact:
+1. **HuggingFace Spaces**: Host the Gradio UI (User Interface).
+2. **MCP Server**: Expose research tools to Claude Desktop/Agents.
+3. **Modal (Optional)**: Run heavy inference or self-hosted LLMs if API costs are prohibitive.
+---
+## 1. HuggingFace Spaces (Gradio UI)
+**Goal**: A public URL where judges/users can try the research agent.
+### Prerequisites
+- HuggingFace Account
+- `gradio` installed (`uv add gradio`)
+### Steps
+1. **Create Space**:
+   - Go to HF Spaces -> Create New Space.
+   - SDK: **Gradio**.
+   - Hardware: **CPU Basic** (Free) is sufficient (since we use APIs).
+2. **Prepare Files**:
+   - Ensure `app.py` contains the Gradio interface construction.
+   - Ensure `requirements.txt` or `pyproject.toml` lists all dependencies.
+3. **Secrets**:
+   - Go to Space Settings -> **Repository secrets**.
+   - Add `ANTHROPIC_API_KEY` (or your chosen LLM provider key).
+   - Add `BRAVE_API_KEY` (for web search).
+4. **Deploy**:
+   - Push code to the Space's git repo.
+   - Watch "Build" logs.
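Secrets added in the Space settings reach the app as environment variables, so `app.py` can fail fast when one is missing. A minimal sketch, assuming the two key names from step 3; the `load_secrets` helper is hypothetical.

```python
import os

REQUIRED_SECRETS = ("ANTHROPIC_API_KEY", "BRAVE_API_KEY")


def load_secrets(env=os.environ) -> dict:
    """Read required keys from the environment, raising if any is unset."""
    missing = [name for name in REQUIRED_SECRETS if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing secrets: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED_SECRETS}
```

Calling this once at startup surfaces a misconfigured Space in the build logs instead of on the first user query.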
+### Streaming Optimization
+Ensure `app.py` uses generator functions for the chat interface to prevent timeouts:
+```python
+# app.py
+def predict(message, history):
+    agent = ResearchAgent()
+    for update in agent.research_stream(message):
+        yield update
+```
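In Gradio's chat interface, each value yielded by a generator replaces the message shown so far. If `research_stream` yields incremental fragments (an assumption; its behaviour is not shown in this commit), the handler needs to accumulate them. A sketch with a hypothetical stand-in stream:

```python
from typing import Iterator


def fake_research_stream(message: str) -> Iterator[str]:
    """Hypothetical stand-in for `agent.research_stream`: yields fragments."""
    for step in ("Searching...", "Judging evidence...", "Writing report..."):
        yield step


def predict(message: str, history: list) -> Iterator[str]:
    """Yield the accumulated transcript so each update extends the last one."""
    transcript = ""
    for update in fake_research_stream(message):
        transcript += update + "\n"
        yield transcript
```

If the real stream already yields the full text each time, the original pass-through version is sufficient.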
+---
+## 2. MCP Server Deployment
+**Goal**: Allow other agents (like Claude Desktop) to use our PubMed/Research tools directly.
+### Local Usage (Claude Desktop)
+1. **Install**:
+   ```bash
+   uv sync
+   ```
+2. **Configure Claude Desktop**:
+   Edit `~/Library/Application Support/Claude/claude_desktop_config.json` (macOS path):
+   ```json
+   {
+     "mcpServers": {
+       "deepcritical": {
+         "command": "uv",
+         "args": ["run", "fastmcp", "run", "src/mcp_servers/pubmed_server.py"],
+         "cwd": "/absolute/path/to/DeepCritical"
+       }
+     }
+   }
+   ```
+3. **Restart Claude**: You should see a 🔌 icon indicating connected tools.
+### Remote Deployment (Smithery/Glama)
+*Target for "MCP Track" bonus points.*
+1. **Dockerize**: Create a `Dockerfile` for the MCP server.
+   ```dockerfile
+   FROM python:3.11-slim
+   # Run from /app so the relative server path in CMD resolves
+   WORKDIR /app
+   COPY . .
+   RUN pip install fastmcp httpx
+   CMD ["fastmcp", "run", "src/mcp_servers/pubmed_server.py", "--transport", "sse"]
+   ```
+   *Note: Use SSE transport for remote/HTTP servers.*
+2. **Deploy**: Host on Fly.io or Railway.
+---
+## 3. Modal (GPU/Heavy Compute)
+**Goal**: Run a self-hosted LLM (e.g., Llama-3-70B) or handle massive parallel searches if APIs are too slow/expensive.
+### Setup
+1. **Install**: `uv add modal`
+2. **Auth**: `modal token new`
+### Logic
+Instead of calling the Anthropic API, we call a Modal function:
+```python
+# src/llm/modal_client.py
+import modal
+# Recent Modal releases use `modal.App`; `modal.Stub` is the older, deprecated name.
+app = modal.App("deepcritical-inference")
+@app.function(gpu="A100")
+def generate_text(prompt: str):
+    # Load vLLM or similar
+    ...
+```
+### When to use?
+- **Hackathon Demo**: Stick to Anthropic/OpenAI APIs for speed/reliability.
+- **Production/Stretch**: Use Modal if you hit rate limits or want to show off "Open Source Models" capability.
+---
+## Deployment Checklist
+### Pre-Flight
+- [ ] Run `pytest -m unit` to ensure logic is sound.
+- [ ] Run `pytest -m e2e` (one pass) to verify APIs connect.
+- [ ] Check that `requirements.txt` matches `pyproject.toml`.
+### Secrets Management
+- [ ] **NEVER** commit `.env` files.
+- [ ] Verify keys are added to HF Space settings.
+### Post-Launch
+- [ ] Test the live URL.
+- [ ] Verify "Stop" button in Gradio works (interrupts the agent).
+- [ ] Record a walkthrough video (crucial for hackathon submission).

docs/index.md CHANGED

@@ -13,12 +13,13 @@ AI-powered deep research system for accelerating drug repurposing discovery.
 - **[Design Patterns](architecture/design-patterns.md)** - 13 technical patterns, judge prompts, data models
 ### Guides
-- Setup Guide (coming soon)
-- User Guide (coming soon)
+- [Setup Guide](guides/setup.md) (coming soon)
+- **[Deployment Guide](guides/deployment.md)** - Gradio, MCP, and Modal launch steps
 ### Development
-- Contributing (coming soon)
-- API Reference (coming soon)
+- **[Testing Strategy](development/testing.md)** - Unit, Integration, and E2E testing patterns
+- [Contributing](development/contributing.md) (coming soon)
 ---