VibecoderMcSwaggins committed on
Commit b4aa4ad · 1 Parent(s): c7584c1

docs: update guides and add testing strategy documentation


- Added links to the Setup and Deployment Guides in the index.
- Introduced a new Testing Strategy document outlining unit, integration, and E2E testing approaches.
- Updated the Development section to include a link to the Contributing guide.

docs/development/testing.md ADDED
@@ -0,0 +1,139 @@
# Testing Strategy
## Ensuring DeepCritical Is Ironclad

---

## Overview

Our testing strategy follows a strict **Pyramid of Reliability**:

1. **Unit Tests**: Fast, isolated logic checks (60% of tests)
2. **Integration Tests**: Tool interactions & agent loops (30% of tests)
3. **E2E / Regression Tests**: Full research workflows (10% of tests)

**Goal**: Ship a research agent that doesn't hallucinate, crash on API limits, or burn $100 in tokens by accident.

---

## 1. Unit Tests (Fast & Cheap)

**Location**: `tests/unit/`

Focus on individual components without external network calls. Mock everything.

### Key Test Cases

#### Agent Logic
- **Initialization**: Verify the default config loads correctly.
- **State Updates**: Ensure `ResearchState` updates correctly (e.g., token counts increment).
- **Budget Checks**: Test that `should_continue()` returns `False` when the budget is exceeded.
- **Error Handling**: Test partial-failure recovery (e.g., one tool fails, the agent continues).

#### Tools (Mocked)
- **Parser Logic**: Feed raw XML/JSON to tool parsers and verify the resulting `Evidence` objects.
- **Validation**: Ensure tools reject invalid queries (empty strings, etc.).
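A minimal sketch of the parser checks above. `parse_pubmed_xml`, the XML shape, and the `Evidence` fields here are illustrative assumptions, not the project's real API:

```python
# Stand-in parser test: the real tool would parse the full PubMed schema.
import xml.etree.ElementTree as ET
from dataclasses import dataclass

@dataclass
class Evidence:
    pmid: str
    title: str

def parse_pubmed_xml(raw: str) -> list[Evidence]:
    # Validation: reject empty/blank input outright.
    if not raw.strip():
        raise ValueError("empty query result")
    root = ET.fromstring(raw)
    return [
        Evidence(pmid=a.findtext("PMID", ""), title=a.findtext("Title", ""))
        for a in root.iter("Article")
    ]

def test_parser_builds_evidence():
    raw = "<Set><Article><PMID>12345</PMID><Title>CoQ10 trial</Title></Article></Set>"
    assert parse_pubmed_xml(raw) == [Evidence(pmid="12345", title="CoQ10 trial")]
```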

#### Judge Prompts
- **Schema Compliance**: Verify prompt templates generate valid JSON-structure instructions.
- **Variable Injection**: Ensure `{question}` and `{context}` are injected correctly into prompts.
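These two checks can be sketched as follows; the template text and the `render_judge_prompt` helper are assumptions for illustration:

```python
# Hypothetical judge prompt template with {question}/{context} placeholders.
JUDGE_PROMPT = (
    "You are a research judge.\n"
    "Question: {question}\n"
    "Context: {context}\n"
    "Respond only with valid JSON."
)

def render_judge_prompt(question: str, context: str) -> str:
    return JUDGE_PROMPT.format(question=question, context=context)

def test_variable_injection():
    rendered = render_judge_prompt("What treats long COVID fatigue?", "no evidence yet")
    assert "What treats long COVID fatigue?" in rendered
    # No unfilled placeholders may survive rendering.
    assert "{question}" not in rendered and "{context}" not in rendered

def test_schema_instruction_present():
    assert "JSON" in JUDGE_PROMPT
```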

```python
# Example: Testing State Logic
def test_budget_stop():
    state = ResearchState(tokens_used=50001, max_tokens=50000)
    assert should_continue(state) is False
```

---

## 2. Integration Tests (Realistic & Mocked I/O)

**Location**: `tests/integration/`

Focus on the interaction between the Orchestrator, Tools, and LLM Judge. Use **VCR.py** or **replay** patterns to record and replay API calls, saving money and time.

### Key Test Cases

#### Search Loop
- **Iteration Flow**: Verify the agent performs the Search -> Judge -> Search loop.
- **Tool Selection**: Verify the correct tools are called based on judge output (mocked judge).
- **Context Accumulation**: Ensure findings from Iteration 1 are passed to Iteration 2.

#### MCP Server Integration
- **Server Startup**: Verify the MCP server starts and exposes tools.
- **Client Connection**: Verify the agent can call tools via the MCP protocol.

```python
# Example: Testing Search Loop with Mocked Tools
async def test_search_loop_flow():
    agent = ResearchAgent(tools=[MockPubMed(), MockWeb()])
    report = await agent.run("test query")
    assert agent.state.iterations > 0
    assert len(report.sources) > 0
```

---

## 3. End-to-End (E2E) Tests (The "Real Deal")

**Location**: `tests/e2e/`

Run against **real APIs** (with strict rate limits) to verify system integrity. Run these **on demand** or **nightly**, not on every commit.

### Key Test Cases

#### The "Golden Query"
Run the primary demo query: *"What existing drugs might help treat long COVID fatigue?"*
- **Success Criteria**:
  - Returns at least 2 valid drug candidates (e.g., CoQ10, LDN).
  - Includes citations from PubMed.
  - Completes within 3 iterations.
  - JSON output matches the schema.
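The success criteria above can be expressed as a reusable check run at the end of the E2E test; the report layout (a plain dict with `drug_candidates` and `sources`) is an assumption:

```python
# Evaluate golden-query success criteria against a finished report.
def check_golden_query(report: dict, iterations: int) -> list[str]:
    """Return a list of failed criteria (empty means the run passed)."""
    failures = []
    if len(report.get("drug_candidates", [])) < 2:
        failures.append("needs >= 2 drug candidates")
    if not any(src.get("pmid") for src in report.get("sources", [])):
        failures.append("needs at least one PubMed citation")
    if iterations > 3:
        failures.append("too many iterations")
    return failures
```

An E2E test then just asserts `check_golden_query(report, agent.state.iterations) == []`, so a failing run reports every missed criterion at once.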

#### Deployment Smoke Test
- **Gradio UI**: Verify UI launches and accepts input.
- **Streaming**: Verify generator yields chunks (first chunk within 2s).
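The first-chunk latency check can be written generically over any generator; the fake stream below stands in for the real agent's output:

```python
import time

def first_chunk_latency(stream) -> float:
    """Seconds until the generator yields its first chunk."""
    start = time.monotonic()
    next(iter(stream))
    return time.monotonic() - start

def fake_stream():
    # Stand-in for agent.research_stream(query); yields immediately.
    yield "Searching PubMed..."

def test_first_chunk_within_2s():
    assert first_chunk_latency(fake_stream()) < 2.0
```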

---

## 4. Tools & Config

### Pytest Configuration
```toml
# pyproject.toml
[tool.pytest.ini_options]
markers = [
    "unit: fast, isolated tests",
    "integration: mocked network tests",
    "e2e: real network tests (slow, expensive)",
]
asyncio_mode = "auto"
```

### CI/CD Pipeline (GitHub Actions)
1. **Lint**: `ruff check .`
2. **Type Check**: `mypy .`
3. **Unit**: `pytest -m unit`
4. **Integration**: `pytest -m integration`
5. **E2E**: (Manual trigger only)
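The five pipeline steps above could map to a workflow like this minimal sketch; the file path and job layout are assumptions:

```yaml
# .github/workflows/ci.yml
name: CI
on: [push, pull_request]

jobs:
  checks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install uv && uv sync
      - run: uv run ruff check .
      - run: uv run mypy .
      - run: uv run pytest -m unit
      - run: uv run pytest -m integration
  # E2E runs go in a separate job gated on workflow_dispatch (manual trigger).
```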

---

## 5. Anti-Hallucination Validation

How do we test whether the agent is lying?

1. **Citation Check**:
   - Regex-verify that every `[PMID: 12345]` in the report exists in the `Evidence` list.
   - Fail if a citation is "orphaned" (a hallucinated ID).

2. **Negative Constraints**:
   - Test queries for fake diseases ("Ligma syndrome") -> the agent should return "No evidence found".
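The citation check reduces to one regex pass over the report; the function name and the `[PMID: 12345]` marker format are assumptions based on the example above:

```python
import re

# Matches inline citation markers such as [PMID: 12345].
PMID_PATTERN = re.compile(r"\[PMID:\s*(\d+)\]")

def find_orphan_citations(report_text: str, evidence_pmids: set[str]) -> list[str]:
    """Return PMIDs cited in the report that have no backing Evidence."""
    cited = PMID_PATTERN.findall(report_text)
    return [pmid for pmid in cited if pmid not in evidence_pmids]

def test_orphan_citation_detected():
    report = "CoQ10 shows benefit [PMID: 12345]. LDN may help [PMID: 99999]."
    # Only PMID 12345 is backed by collected Evidence; 99999 is hallucinated.
    assert find_orphan_citations(report, {"12345"}) == ["99999"]
```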

---

## Checklist for Implementation

- [ ] Set up the `tests/` directory structure
- [ ] Configure `pytest` and `vcrpy`
- [ ] Create `tests/fixtures/` for mock data (PubMed XMLs)
- [ ] Write the first unit test for `ResearchState`
docs/guides/deployment.md ADDED
@@ -0,0 +1,142 @@
# Deployment Guide
## Launching DeepCritical: Gradio, MCP, & Modal

---

## Overview

DeepCritical is designed for a multi-platform deployment strategy to maximize hackathon impact:

1. **HuggingFace Spaces**: Host the Gradio UI (user interface).
2. **MCP Server**: Expose research tools to Claude Desktop/agents.
3. **Modal (Optional)**: Run heavy inference or local LLMs if API costs are prohibitive.

---

## 1. HuggingFace Spaces (Gradio UI)

**Goal**: A public URL where judges/users can try the research agent.

### Prerequisites
- HuggingFace account
- `gradio` installed (`uv add gradio`)

### Steps

1. **Create Space**:
   - Go to HF Spaces -> Create New Space.
   - SDK: **Gradio**.
   - Hardware: **CPU Basic** (free) is sufficient, since we use APIs.

2. **Prepare Files**:
   - Ensure `app.py` contains the Gradio interface construction.
   - Ensure `requirements.txt` or `pyproject.toml` lists all dependencies.

3. **Secrets**:
   - Go to Space Settings -> **Repository secrets**.
   - Add `ANTHROPIC_API_KEY` (or your chosen LLM provider key).
   - Add `BRAVE_API_KEY` (for web search).

4. **Deploy**:
   - Push code to the Space's git repo.
   - Watch the "Build" logs.
+
44
+ ### Streaming Optimization
45
+ Ensure `app.py` uses generator functions for the chat interface to prevent timeouts:
46
+ ```python
47
+ # app.py
48
+ def predict(message, history):
49
+ agent = ResearchAgent()
50
+ for update in agent.research_stream(message):
51
+ yield update
52
+ ```
53
+
54
+ ---
55
+
56
+ ## 2. MCP Server Deployment
57
+
58
+ **Goal**: Allow other agents (like Claude Desktop) to use our PubMed/Research tools directly.
59
+
60
+ ### Local Usage (Claude Desktop)
61
+
62
+ 1. **Install**:
63
+ ```bash
64
+ uv sync
65
+ ```
66
+
67
+ 2. **Configure Claude Desktop**:
68
+ Edit `~/Library/Application Support/Claude/claude_desktop_config.json`:
69
+ ```json
70
+ {
71
+ "mcpServers": {
72
+ "deepcritical": {
73
+ "command": "uv",
74
+ "args": ["run", "fastmcp", "run", "src/mcp_servers/pubmed_server.py"],
75
+ "cwd": "/absolute/path/to/DeepCritical"
76
+ }
77
+ }
78
+ }
79
+ ```
80
+
81
+ 3. **Restart Claude**: You should see a πŸ”Œ icon indicating connected tools.
82
+
83
+ ### Remote Deployment (Smithery/Glama)
84
+ *Target for "MCP Track" bonus points.*
85
+
86
+ 1. **Dockerize**: Create a `Dockerfile` for the MCP server.
87
+ ```dockerfile
88
+ FROM python:3.11-slim
89
+ COPY . /app
90
+ RUN pip install fastmcp httpx
91
+ CMD ["fastmcp", "run", "src/mcp_servers/pubmed_server.py", "--transport", "sse"]
92
+ ```
93
+ *Note: Use SSE transport for remote/HTTP servers.*
94
+
95
+ 2. **Deploy**: Host on Fly.io or Railway.
96
+
97
+ ---
98
+
99
+ ## 3. Modal (GPU/Heavy Compute)
100
+
101
+ **Goal**: Run a local LLM (e.g., Llama-3-70B) or handle massive parallel searches if APIs are too slow/expensive.
102
+
103
+ ### Setup
104
+ 1. **Install**: `uv add modal`
105
+ 2. **Auth**: `modal token new`
106
+
107
+ ### Logic
108
+ Instead of calling Anthropic API, we call a Modal function:
109
+
110
+ ```python
111
+ # src/llm/modal_client.py
112
+ import modal
113
+
114
+ stub = modal.Stub("deepcritical-inference")
115
+
116
+ @stub.function(gpu="A100")
117
+ def generate_text(prompt: str):
118
+ # Load vLLM or similar
119
+ ...
120
+ ```
121
+
122
+ ### When to use?
123
+ - **Hackathon Demo**: Stick to Anthropic/OpenAI APIs for speed/reliability.
124
+ - **Production/Stretch**: Use Modal if you hit rate limits or want to show off "Open Source Models" capability.
125
+
126
+ ---
127
+
128
+ ## Deployment Checklist
129
+
130
+ ### Pre-Flight
131
+ - [ ] Run `pytest -m unit` to ensure logic is sound.
132
+ - [ ] Run `pytest -m e2e` (one pass) to verify APIs connect.
133
+ - [ ] Check `requirements.txt` matches `pyproject.toml`.
134
+
135
+ ### Secrets Management
136
+ - [ ] **NEVER** commit `.env` files.
137
+ - [ ] Verify keys are added to HF Space settings.
138
+
139
+ ### Post-Launch
140
+ - [ ] Test the live URL.
141
+ - [ ] Verify "Stop" button in Gradio works (interrupts the agent).
142
+ - [ ] Record a walkthrough video (crucial for hackathon submission).
docs/index.md CHANGED
@@ -13,12 +13,13 @@ AI-powered deep research system for accelerating drug repurposing discovery.
 - **[Design Patterns](architecture/design-patterns.md)** - 13 technical patterns, judge prompts, data models
 
 ### Guides
-- Setup Guide (coming soon)
-- User Guide (coming soon)
+- [Setup Guide](guides/setup.md) (coming soon)
+- **[Deployment Guide](guides/deployment.md)** - Gradio, MCP, and Modal launch steps
 
 ### Development
-- Contributing (coming soon)
-- API Reference (coming soon)
+- **[Testing Strategy](development/testing.md)** - Unit, Integration, and E2E testing patterns
+- [Contributing](development/contributing.md) (coming soon)
+
 
 ---