VibecoderMcSwaggins committed
Commit 8bac750 · 1 Parent(s): 901acc3

docs: sync AI agent context files and fix final nitpicks


- Update GEMINI.md, CLAUDE.md, AGENTS.md to Phase 13 status
- Add HuggingFace Spaces collaboration guidelines
- Remove redundant evidence[:10] slice (method handles internally)
- Add CodeExecutionError handling to verify_sandbox.py demo

AGENTS.md CHANGED

@@ -4,7 +4,9 @@ This file provides guidance to AI agents when working with code in this repository.
 
 ## Project Overview
 
-DeepCritical is an AI-native drug repurposing research agent for a HuggingFace hackathon. It uses a search-and-judge loop to autonomously search biomedical databases (PubMed) and synthesize evidence for queries like "What existing drugs might help treat long COVID fatigue?".
+DeepCritical is an AI-native drug repurposing research agent for a HuggingFace hackathon. It uses a search-and-judge loop to autonomously search biomedical databases (PubMed, ClinicalTrials.gov, bioRxiv) and synthesize evidence for queries like "What existing drugs might help treat long COVID fatigue?".
+
+**Current Status:** Phases 1-13 COMPLETE (Foundation through Modal sandbox integration).
 
 ## Development Commands
 
@@ -37,11 +39,11 @@ uv run pytest -m integration
 User Question → Orchestrator
 
 Search Loop:
-  1. Query PubMed
+  1. Query PubMed, ClinicalTrials.gov, bioRxiv
   2. Gather evidence
   3. Judge quality ("Do we have enough?")
   4. If NO → Refine query, search more
-  5. If YES → Synthesize findings
+  5. If YES → Synthesize findings (+ optional Modal analysis)
 
 Research Report with Citations
 ```
@@ -49,14 +51,19 @@ Research Report with Citations
 **Key Components**:
 - `src/orchestrator.py` - Main agent loop
 - `src/tools/pubmed.py` - PubMed E-utilities search
+- `src/tools/clinicaltrials.py` - ClinicalTrials.gov API
+- `src/tools/biorxiv.py` - bioRxiv/medRxiv preprint search
+- `src/tools/code_execution.py` - Modal sandbox execution
 - `src/tools/search_handler.py` - Scatter-gather orchestration
 - `src/services/embeddings.py` - Semantic search & deduplication (ChromaDB)
+- `src/services/statistical_analyzer.py` - Statistical analysis via Modal
 - `src/agent_factory/judges.py` - LLM-based evidence assessment
 - `src/agents/` - Magentic multi-agent mode (SearchAgent, JudgeAgent, etc.)
+- `src/mcp_tools.py` - MCP tool wrappers for Claude Desktop
 - `src/utils/config.py` - Pydantic Settings (loads from `.env`)
 - `src/utils/models.py` - Evidence, Citation, SearchResult models
 - `src/utils/exceptions.py` - Exception hierarchy
-- `src/app.py` - Gradio UI (HuggingFace Spaces)
+- `src/app.py` - Gradio UI with MCP server (HuggingFace Spaces)
 
 **Break Conditions**: Judge approval, token budget (50K max), or max iterations (default 10).
 
@@ -66,6 +73,7 @@ Settings via pydantic-settings from `.env`:
 - `LLM_PROVIDER`: "openai" or "anthropic"
 - `OPENAI_API_KEY` / `ANTHROPIC_API_KEY`: LLM keys
 - `NCBI_API_KEY`: Optional, for higher PubMed rate limits
+- `MODAL_TOKEN_ID` / `MODAL_TOKEN_SECRET`: For Modal sandbox (optional)
 - `MAX_ITERATIONS`: 1-50, default 10
 - `LOG_LEVEL`: DEBUG, INFO, WARNING, ERROR
 
@@ -95,8 +103,13 @@ DeepCriticalError (base)
 
 ## Git Workflow
 
-- `main`: Production-ready
-- `dev`: Development
-- `vcms-dev`: HuggingFace Spaces sandbox
-- Remote `origin`: GitHub
-- Remote `huggingface-upstream`: HuggingFace Spaces
+- `main`: Production-ready (GitHub)
+- `dev`: Development integration (GitHub)
+- Remote `origin`: GitHub (source of truth for PRs/code review)
+- Remote `huggingface-upstream`: HuggingFace Spaces (deployment target)
+
+**HuggingFace Spaces Collaboration:**
+- Each contributor should use their own dev branch: `yourname-dev` (e.g., `vcms-dev`, `mario-dev`)
+- **DO NOT push directly to `main` or `dev` on HuggingFace** - these can be overwritten easily
+- GitHub is the source of truth; HuggingFace is for deployment/demo
+- Consider using git hooks to prevent accidental pushes to protected branches
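Editor's note: the Search Loop above is pseudocode. As a companion, here is a minimal runnable sketch of that control flow. Everything in it (`run_search`, `judge`, `synthesize`, the `Verdict` shape) is an illustrative stand-in, not the actual `src/orchestrator.py` API; only the break conditions (judge approval, 50K token budget, default 10 iterations) come from the docs in this diff.

```python
import asyncio
from dataclasses import dataclass

# Illustrative stand-ins only -- the real search/judge/synthesize live in
# src/tools/, src/agent_factory/judges.py, and src/orchestrator.py.
@dataclass
class Verdict:
    sufficient: bool
    refined_query: str

async def run_search(query: str) -> tuple[list[str], int]:
    # Real tools query PubMed, ClinicalTrials.gov, bioRxiv (scatter-gather).
    return [f"evidence for {query!r}"], 1_000  # (results, tokens spent)

async def judge(question: str, evidence: list[str]) -> Verdict:
    # Real judge is an LLM asking "Do we have enough?"
    return Verdict(sufficient=len(evidence) >= 3, refined_query=question + " (refined)")

async def synthesize(question: str, evidence: list[str]) -> str:
    return f"Report on {question!r} citing {len(evidence)} evidence items"

TOKEN_BUDGET = 50_000  # break condition: token budget
MAX_ITERATIONS = 10    # break condition: iteration cap (default)

async def research(question: str) -> str:
    evidence: list[str] = []
    query, tokens_used = question, 0
    for _ in range(MAX_ITERATIONS):
        results, cost = await run_search(query)
        evidence.extend(results)
        tokens_used += cost
        verdict = await judge(question, evidence)
        if verdict.sufficient or tokens_used >= TOKEN_BUDGET:
            break  # judge approval or budget exhausted
        query = verdict.refined_query  # If NO: refine query, search more
    return await synthesize(question, evidence)  # If YES: synthesize findings

if __name__ == "__main__":
    print(asyncio.run(research("What existing drugs might help treat long COVID fatigue?")))
```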
CLAUDE.md CHANGED

@@ -4,7 +4,9 @@ This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
 
 ## Project Overview
 
-DeepCritical is an AI-native drug repurposing research agent for a HuggingFace hackathon. It uses a search-and-judge loop to autonomously search biomedical databases (PubMed) and synthesize evidence for queries like "What existing drugs might help treat long COVID fatigue?".
+DeepCritical is an AI-native drug repurposing research agent for a HuggingFace hackathon. It uses a search-and-judge loop to autonomously search biomedical databases (PubMed, ClinicalTrials.gov, bioRxiv) and synthesize evidence for queries like "What existing drugs might help treat long COVID fatigue?".
+
+**Current Status:** Phases 1-13 COMPLETE (Foundation through Modal sandbox integration).
 
 ## Development Commands
 
@@ -37,11 +39,11 @@ uv run pytest -m integration
 User Question → Orchestrator
 
 Search Loop:
-  1. Query PubMed
+  1. Query PubMed, ClinicalTrials.gov, bioRxiv
   2. Gather evidence
   3. Judge quality ("Do we have enough?")
   4. If NO → Refine query, search more
-  5. If YES → Synthesize findings
+  5. If YES → Synthesize findings (+ optional Modal analysis)
 
 Research Report with Citations
 ```
@@ -49,14 +51,19 @@ Research Report with Citations
 **Key Components**:
 - `src/orchestrator.py` - Main agent loop
 - `src/tools/pubmed.py` - PubMed E-utilities search
+- `src/tools/clinicaltrials.py` - ClinicalTrials.gov API
+- `src/tools/biorxiv.py` - bioRxiv/medRxiv preprint search
+- `src/tools/code_execution.py` - Modal sandbox execution
 - `src/tools/search_handler.py` - Scatter-gather orchestration
 - `src/services/embeddings.py` - Semantic search & deduplication (ChromaDB)
+- `src/services/statistical_analyzer.py` - Statistical analysis via Modal
 - `src/agent_factory/judges.py` - LLM-based evidence assessment
 - `src/agents/` - Magentic multi-agent mode (SearchAgent, JudgeAgent, etc.)
+- `src/mcp_tools.py` - MCP tool wrappers for Claude Desktop
 - `src/utils/config.py` - Pydantic Settings (loads from `.env`)
 - `src/utils/models.py` - Evidence, Citation, SearchResult models
 - `src/utils/exceptions.py` - Exception hierarchy
-- `src/app.py` - Gradio UI (HuggingFace Spaces)
+- `src/app.py` - Gradio UI with MCP server (HuggingFace Spaces)
 
 **Break Conditions**: Judge approval, token budget (50K max), or max iterations (default 10).
 
@@ -66,6 +73,7 @@ Settings via pydantic-settings from `.env`:
 - `LLM_PROVIDER`: "openai" or "anthropic"
 - `OPENAI_API_KEY` / `ANTHROPIC_API_KEY`: LLM keys
 - `NCBI_API_KEY`: Optional, for higher PubMed rate limits
+- `MODAL_TOKEN_ID` / `MODAL_TOKEN_SECRET`: For Modal sandbox (optional)
 - `MAX_ITERATIONS`: 1-50, default 10
 - `LOG_LEVEL`: DEBUG, INFO, WARNING, ERROR
 
@@ -88,8 +96,13 @@ DeepCriticalError (base)
 
 ## Git Workflow
 
-- `main`: Production-ready
-- `dev`: Development
-- `vcms-dev`: HuggingFace Spaces sandbox
-- Remote `origin`: GitHub
-- Remote `huggingface-upstream`: HuggingFace Spaces
+- `main`: Production-ready (GitHub)
+- `dev`: Development integration (GitHub)
+- Remote `origin`: GitHub (source of truth for PRs/code review)
+- Remote `huggingface-upstream`: HuggingFace Spaces (deployment target)
+
+**HuggingFace Spaces Collaboration:**
+- Each contributor should use their own dev branch: `yourname-dev` (e.g., `vcms-dev`, `mario-dev`)
+- **DO NOT push directly to `main` or `dev` on HuggingFace** - these can be overwritten easily
+- GitHub is the source of truth; HuggingFace is for deployment/demo
+- Consider using git hooks to prevent accidental pushes to protected branches
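Editor's note: the final bullet in the Git Workflow sections suggests git hooks to guard protected branches. One possible shape for that, sketched as a Python `pre-push` hook; the remote and branch names mirror this repo's docs, and nothing here is committed code.

```python
#!/usr/bin/env python3
# Sketch of .git/hooks/pre-push (make it executable). Git invokes the hook
# with argv = [hook, remote_name, remote_url] and feeds one line per ref on
# stdin: "<local ref> <local sha> <remote ref> <remote sha>".
import sys

PROTECTED = {"refs/heads/main", "refs/heads/dev"}  # assumption: guard these refs
GUARDED_REMOTE = "huggingface-upstream"

remote_name = sys.argv[1]
if remote_name == GUARDED_REMOTE:
    for line in sys.stdin:
        parts = line.split()
        if len(parts) == 4 and parts[2] in PROTECTED:
            sys.stderr.write(
                f"Blocked: {parts[2]} on {remote_name} is protected; "
                "push to your own yourname-dev branch instead.\n"
            )
            sys.exit(1)
sys.exit(0)
```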
GEMINI.md CHANGED

@@ -2,26 +2,27 @@
 
 ## Project Overview
 **DeepCritical** is an AI-native Medical Drug Repurposing Research Agent.
-**Goal:** To accelerate the discovery of new uses for existing drugs by intelligently searching biomedical literature (PubMed), evaluating evidence, and hypothesizing potential applications.
+**Goal:** To accelerate the discovery of new uses for existing drugs by intelligently searching biomedical literature (PubMed, ClinicalTrials.gov, bioRxiv), evaluating evidence, and hypothesizing potential applications.
 
 **Architecture:**
 The project follows a **Vertical Slice Architecture** (Search -> Judge -> Orchestrator) and adheres to **Strict TDD** (Test-Driven Development).
 
 **Current Status:**
-- **Phases 1-8:** COMPLETE. Foundation, Search, Judge, UI, Orchestrator, Embeddings, Hypothesis, Report.
-- **Phase 9 (Source Cleanup):** COMPLETE. Removed DuckDuckGo web search (unreliable for scientific research).
-- **Phase 10-11:** PLANNED. ClinicalTrials.gov and bioRxiv integration.
+- **Phases 1-9:** COMPLETE. Foundation, Search, Judge, UI, Orchestrator, Embeddings, Hypothesis, Report, Cleanup.
+- **Phases 10-11:** COMPLETE. ClinicalTrials.gov and bioRxiv integration.
+- **Phase 12:** COMPLETE. MCP Server integration (Gradio MCP at `/gradio_api/mcp/`).
+- **Phase 13:** COMPLETE. Modal sandbox for statistical analysis.
 
 ## Tech Stack & Tooling
 - **Language:** Python 3.11 (Pinned)
 - **Package Manager:** `uv` (Rust-based, extremely fast)
-- **Frameworks:** `pydantic`, `pydantic-ai`, `httpx`, `gradio`
+- **Frameworks:** `pydantic`, `pydantic-ai`, `httpx`, `gradio[mcp]`
 - **Vector DB:** `chromadb` with `sentence-transformers` for semantic search
+- **Code Execution:** `modal` for secure sandboxed Python execution
 - **Testing:** `pytest`, `pytest-asyncio`, `respx` (for mocking)
 - **Quality:** `ruff` (linting/formatting), `mypy` (strict type checking), `pre-commit`
 
 ## Building & Running
-We use a `Makefile` to standardize developer commands.
 
 | Command | Description |
 | :--- | :--- |
@@ -36,19 +37,54 @@ We use a `Makefile` to standardize developer commands.
 ## Directory Structure
 - `src/`: Source code
 - `utils/`: Shared utilities (`config.py`, `exceptions.py`, `models.py`)
-- `tools/`: Search tools (`pubmed.py`, `base.py`, `search_handler.py`)
-- `services/`: Services (`embeddings.py` - ChromaDB vector store)
+- `tools/`: Search tools (`pubmed.py`, `clinicaltrials.py`, `biorxiv.py`, `code_execution.py`)
+- `services/`: Services (`embeddings.py`, `statistical_analyzer.py`)
 - `agents/`: Magentic multi-agent mode agents
 - `agent_factory/`: Agent definitions (judges, prompts)
+- `mcp_tools.py`: MCP tool wrappers for Claude Desktop integration
+- `app.py`: Gradio UI with MCP server
 - `tests/`: Test suite
 - `unit/`: Isolated unit tests (Mocked)
 - `integration/`: Real API tests (Marked as slow/integration)
 - `docs/`: Documentation and Implementation Specs
 - `examples/`: Working demos for each phase
 
+## Key Components
+- `src/orchestrator.py` - Main agent loop
+- `src/tools/pubmed.py` - PubMed E-utilities search
+- `src/tools/clinicaltrials.py` - ClinicalTrials.gov API
+- `src/tools/biorxiv.py` - bioRxiv/medRxiv preprint search
+- `src/tools/code_execution.py` - Modal sandbox execution
+- `src/services/statistical_analyzer.py` - Statistical analysis via Modal
+- `src/mcp_tools.py` - MCP tool wrappers
+- `src/app.py` - Gradio UI (HuggingFace Spaces) with MCP server
+
+## Configuration
+
+Settings via pydantic-settings from `.env`:
+- `LLM_PROVIDER`: "openai" or "anthropic"
+- `OPENAI_API_KEY` / `ANTHROPIC_API_KEY`: LLM keys
+- `NCBI_API_KEY`: Optional, for higher PubMed rate limits
+- `MODAL_TOKEN_ID` / `MODAL_TOKEN_SECRET`: For Modal sandbox (optional)
+- `MAX_ITERATIONS`: 1-50, default 10
+- `LOG_LEVEL`: DEBUG, INFO, WARNING, ERROR
+
 ## Development Conventions
-1. **Strict TDD:** Write failing tests in `tests/unit/` *before* implementing logic in `src/`.
-2. **Type Safety:** All code must pass `mypy --strict`. Use Pydantic models for data exchange.
-3. **Linting:** Zero tolerance for Ruff errors.
-4. **Mocking:** Use `respx` or `unittest.mock` for all external API calls in unit tests. Real calls go in `tests/integration`.
-5. **Vertical Slices:** Implement features end-to-end (Search -> Judge -> UI) rather than layer-by-layer.
+1. **Strict TDD:** Write failing tests in `tests/unit/` *before* implementing logic in `src/`.
+2. **Type Safety:** All code must pass `mypy --strict`. Use Pydantic models for data exchange.
+3. **Linting:** Zero tolerance for Ruff errors.
+4. **Mocking:** Use `respx` or `unittest.mock` for all external API calls in unit tests.
+5. **Vertical Slices:** Implement features end-to-end rather than layer-by-layer.
+
+## Git Workflow
+
+- `main`: Production-ready (GitHub)
+- `dev`: Development integration (GitHub)
+- Remote `origin`: GitHub (source of truth for PRs/code review)
+- Remote `huggingface-upstream`: HuggingFace Spaces (deployment target)
+
+**HuggingFace Spaces Collaboration:**
+- Each contributor should use their own dev branch: `yourname-dev` (e.g., `vcms-dev`, `mario-dev`)
+- **DO NOT push directly to `main` or `dev` on HuggingFace** - these can be overwritten easily
+- GitHub is the source of truth; HuggingFace is for deployment/demo
+- Consider using git hooks to prevent accidental pushes to protected branches
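Editor's note: GEMINI.md now documents the `.env` settings loaded via pydantic-settings. A minimal sketch of how such a settings class could look, with field names following the documented variables; the real `src/utils/config.py` may differ.

```python
# Hypothetical sketch of src/utils/config.py based on the documented settings.
from typing import Literal

from pydantic import Field
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env", extra="ignore")

    llm_provider: Literal["openai", "anthropic"] = "openai"
    openai_api_key: str | None = None
    anthropic_api_key: str | None = None
    ncbi_api_key: str | None = None        # optional: higher PubMed rate limits
    modal_token_id: str | None = None      # optional: Modal sandbox
    modal_token_secret: str | None = None
    max_iterations: int = Field(default=10, ge=1, le=50)  # documented range 1-50
    log_level: Literal["DEBUG", "INFO", "WARNING", "ERROR"] = "INFO"


settings = Settings()  # reads .env once at import time
```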
examples/modal_demo/verify_sandbox.py CHANGED

@@ -11,7 +11,7 @@ Usage:
 import asyncio
 from functools import partial
 
-from src.tools.code_execution import get_code_executor
+from src.tools.code_execution import CodeExecutionError, get_code_executor
 from src.utils.config import settings
 
 
@@ -31,22 +31,23 @@ async def main() -> None:
         print("Set MODAL_TOKEN_ID and MODAL_TOKEN_SECRET in .env")
         return
 
-    executor = get_code_executor()
-    loop = asyncio.get_running_loop()
+    try:
+        executor = get_code_executor()
+        loop = asyncio.get_running_loop()
 
-    print("=" * 60)
-    print("Modal Sandbox Isolation Verification")
-    print("=" * 60 + "\n")
+        print("=" * 60)
+        print("Modal Sandbox Isolation Verification")
+        print("=" * 60 + "\n")
 
-    # Test 1: Hostname
-    print("Test 1: Check hostname (should NOT be your machine)")
-    code1 = "import socket; print(f'Hostname: {socket.gethostname()}')"
-    result1 = await loop.run_in_executor(None, partial(executor.execute, code1))
-    print_result(result1)
+        # Test 1: Hostname
+        print("Test 1: Check hostname (should NOT be your machine)")
+        code1 = "import socket; print(f'Hostname: {socket.gethostname()}')"
+        result1 = await loop.run_in_executor(None, partial(executor.execute, code1))
+        print_result(result1)
 
-    # Test 2: Scientific libraries
-    print("Test 2: Verify scientific libraries")
-    code2 = """
+        # Test 2: Scientific libraries
+        print("Test 2: Verify scientific libraries")
+        code2 = """
 import pandas as pd
 import numpy as np
 import scipy
@@ -54,12 +55,12 @@ print(f"pandas: {pd.__version__}")
 print(f"numpy: {np.__version__}")
 print(f"scipy: {scipy.__version__}")
 """
-    result2 = await loop.run_in_executor(None, partial(executor.execute, code2))
-    print_result(result2)
+        result2 = await loop.run_in_executor(None, partial(executor.execute, code2))
+        print_result(result2)
 
-    # Test 3: Network blocked
-    print("Test 3: Verify network isolation")
-    code3 = """
+        # Test 3: Network blocked
+        print("Test 3: Verify network isolation")
+        code3 = """
 import urllib.request
 try:
     urllib.request.urlopen("https://google.com", timeout=2)
@@ -67,12 +68,12 @@ try:
 except Exception:
     print("Network: BLOCKED (as expected)")
 """
-    result3 = await loop.run_in_executor(None, partial(executor.execute, code3))
-    print_result(result3)
+        result3 = await loop.run_in_executor(None, partial(executor.execute, code3))
+        print_result(result3)
 
-    # Test 4: Real statistics
-    print("Test 4: Execute statistical analysis")
-    code4 = """
+        # Test 4: Real statistics
+        print("Test 4: Execute statistical analysis")
+        code4 = """
 import pandas as pd
 import scipy.stats as stats
 
@@ -84,12 +85,16 @@ print(f"Mean Effect: {mean:.3f}")
 print(f"P-value: {p_val:.4f}")
 print(f"Verdict: {'SUPPORTED' if p_val < 0.05 else 'INCONCLUSIVE'}")
 """
-    result4 = await loop.run_in_executor(None, partial(executor.execute, code4))
-    print_result(result4)
+        result4 = await loop.run_in_executor(None, partial(executor.execute, code4))
+        print_result(result4)
 
-    print("=" * 60)
-    print("All tests complete - Modal sandbox verified!")
-    print("=" * 60)
+        print("=" * 60)
+        print("All tests complete - Modal sandbox verified!")
+        print("=" * 60)
+
+    except CodeExecutionError as e:
+        print(f"Error: Modal code execution failed: {e}")
+        print("Hint: Ensure Modal SDK is installed and credentials are valid.")
 
 
 if __name__ == "__main__":
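Editor's note on the pattern used throughout this demo: `executor.execute` is a blocking call, so each test offloads it to the default thread pool via `loop.run_in_executor`, with `functools.partial` pinning the argument up front (`run_in_executor` forwards positional `*args` but not keyword arguments, so `partial` is the usual way to bind them). Stripped to its essentials, with a stand-in for the executor:

```python
import asyncio
from functools import partial


def blocking_execute(code: str) -> str:
    # Stand-in for executor.execute, which blocks while the sandbox runs.
    return f"executed {len(code)} bytes of code"


async def main() -> None:
    loop = asyncio.get_running_loop()
    # None = default ThreadPoolExecutor; the blocking call runs off the
    # event loop so other tasks are not frozen while Modal executes.
    result = await loop.run_in_executor(None, partial(blocking_execute, "print('hi')"))
    print(result)


asyncio.run(main())
```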
src/services/statistical_analyzer.py CHANGED

@@ -109,8 +109,8 @@ Output format: Return ONLY executable Python code, no explanations.""",
         Returns:
             AnalysisResult with verdict and statistics
         """
-        # Build analysis prompt
-        evidence_summary = self._summarize_evidence(evidence[:10])
+        # Build analysis prompt (method handles slicing internally)
+        evidence_summary = self._summarize_evidence(evidence)
         hypothesis_text = ""
         if hypothesis:
             hypothesis_text = (
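Editor's note: the change above drops the caller-side `evidence[:10]` because, per the commit message, `_summarize_evidence` already caps its input. A hypothetical sketch of that defensive-slicing pattern; the real method's signature and cap may differ.

```python
# Hypothetical illustration of internal slicing; not the actual method.
MAX_EVIDENCE_ITEMS = 10  # assumed cap

def summarize_evidence(evidence: list[str], limit: int = MAX_EVIDENCE_ITEMS) -> str:
    # The method truncates its own input, so callers pass the full list
    # and no redundant evidence[:10] slice is needed at the call site.
    return "\n".join(f"- {item}" for item in evidence[:limit])

print(summarize_evidence([f"finding {i}" for i in range(25)]))  # summarizes first 10
```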