
Phase 13 Implementation Spec: Modal Pipeline Integration

Goal: Wire existing Modal code execution into the agent pipeline.
Philosophy: "Sandboxed execution makes AI-generated code trustworthy."
Prerequisite: Phase 12 complete (MCP server working)
Priority: P1 - HIGH VALUE ($2,500 Modal Innovation Award)
Estimated Time: 2-3 hours


1. Why Modal Integration?

Current State Analysis

Mario already implemented src/tools/code_execution.py:

| Component | Status | Notes |
| --- | --- | --- |
| ModalCodeExecutor class | Built | Executes Python in Modal sandbox |
| SANDBOX_LIBRARIES | Defined | pandas, numpy, scipy, etc. |
| execute() method | Implemented | Stdout/stderr capture |
| execute_with_return() | Implemented | Returns result variable |
| AnalysisAgent | Built | Uses Modal for statistical analysis |
| Pipeline Integration | MISSING | Not wired into main orchestrator |

What's Missing

Current Flow:
  User Query β†’ Orchestrator β†’ Search β†’ Judge β†’ [Report] β†’ Done

With Modal:
  User Query β†’ Orchestrator β†’ Search β†’ Judge β†’ [Analysis*] β†’ Report β†’ Done
                                                    ↓
                                          Modal Sandbox Execution

*The AnalysisAgent exists but is NOT called by either orchestrator.


2. Critical Dependency Analysis

The Problem (Senior Feedback)

# src/agents/analysis_agent.py - Line 8
from agent_framework import (
    AgentRunResponse,
    BaseAgent,
    ...
)
# pyproject.toml - agent-framework is OPTIONAL
[project.optional-dependencies]
magentic = [
    "agent-framework-core",
]

If we import AnalysisAgent in the simple orchestrator without the magentic extra installed, the app CRASHES on startup.
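
To make the failure mode concrete, here is a minimal sketch of the standard guard (the load_analysis_agent helper is hypothetical, not part of this spec):

# A module-level import crashes at startup when the extra is missing:
#   from src.agents.analysis_agent import AnalysisAgent  # ModuleNotFoundError
#
# Deferring the import to call time keeps startup safe:
def load_analysis_agent() -> type:
    try:
        from src.agents.analysis_agent import AnalysisAgent
    except ModuleNotFoundError as exc:  # agent-framework-core not installed
        raise RuntimeError(
            "AnalysisAgent requires the magentic extra: uv sync --extra magentic"
        ) from exc
    return AnalysisAgent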

The SOLID Solution

Single Responsibility Principle: Decouple Modal execution logic from agent_framework.

BEFORE (Coupled):
  AnalysisAgent (requires agent_framework)
       ↓
  ModalCodeExecutor

AFTER (Decoupled):
  StatisticalAnalyzer (no agent_framework dependency)  ← Simple mode uses this
       ↓
  ModalCodeExecutor
       ↑
  AnalysisAgent (wraps StatisticalAnalyzer)  ← Magentic mode uses this

Key insight: Create src/services/statistical_analyzer.py with ZERO agent_framework imports.
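
A quick way to prove the decoupling holds (a sketch using only the standard library; assumes nothing else has imported agent_framework first):

import sys

import src.services.statistical_analyzer  # noqa: F401

# If the service is truly decoupled, agent_framework never enters the process.
assert not any(name.startswith("agent_framework") for name in sys.modules)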


3. Prize Opportunity

Modal Innovation Award: $2,500

Judging Criteria:

  1. Sandbox Isolation - Code runs in container, not local
  2. Scientific Computing - Real pandas/scipy analysis
  3. Safety - Can't access local filesystem
  4. Speed - Modal's fast cold starts

What We Need to Show

# LLM generates analysis code
code = """
import pandas as pd
import scipy.stats as stats

data = pd.DataFrame({
    'study': ['Study1', 'Study2', 'Study3'],
    'effect_size': [0.45, 0.52, 0.38],
    'sample_size': [120, 85, 200]
})

weighted_mean = (data['effect_size'] * data['sample_size']).sum() / data['sample_size'].sum()
t_stat, p_value = stats.ttest_1samp(data['effect_size'], 0)

print(f"Weighted Effect Size: {weighted_mean:.3f}")
print(f"P-value: {p_value:.4f}")

result = "SUPPORTED" if p_value < 0.05 else "INCONCLUSIVE"
"""

# Executed SAFELY in Modal sandbox
executor = get_code_executor()
output = executor.execute(code)  # Runs in isolated container!

4. Technical Specification

4.1 Dependencies

# pyproject.toml - NO CHANGES to dependencies
# StatisticalAnalyzer uses only:
#   - pydantic-ai (already in main deps)
#   - modal (already in main deps)
#   - src.tools.code_execution (no agent_framework)

4.2 Environment Variables

# .env
MODAL_TOKEN_ID=your-token-id
MODAL_TOKEN_SECRET=your-token-secret
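
The Modal client picks these up from the environment. A quick sanity check (a sketch using only the standard library) before running any demo:

import os

# Fail fast if Modal credentials are missing from the environment.
for var in ("MODAL_TOKEN_ID", "MODAL_TOKEN_SECRET"):
    assert os.environ.get(var), f"{var} is not set - check your .env"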

4.3 Integration Points

| Integration Point | File | Change Required |
| --- | --- | --- |
| New Service | src/services/statistical_analyzer.py | CREATE (no agent_framework) |
| Simple Orchestrator | src/orchestrator.py | Use StatisticalAnalyzer |
| Config | src/utils/config.py | Add enable_modal_analysis setting |
| AnalysisAgent | src/agents/analysis_agent.py | Refactor to wrap StatisticalAnalyzer |
| MCP Tool | src/mcp_tools.py | Add analyze_hypothesis tool |

5. Implementation

5.1 Configuration Update (src/utils/config.py)

class Settings(BaseSettings):
    # ... existing settings ...

    # Modal Configuration
    modal_token_id: str | None = None
    modal_token_secret: str | None = None
    enable_modal_analysis: bool = False  # Opt-in for hackathon demo

    @property
    def modal_available(self) -> bool:
        """Check if Modal credentials are configured."""
        return bool(self.modal_token_id and self.modal_token_secret)
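
Downstream code should gate on both the opt-in flag and the credential check. A minimal usage sketch:

from src.utils.config import settings

# Analysis runs only when explicitly enabled AND credentials are present.
if settings.enable_modal_analysis and settings.modal_available:
    ...  # wire in StatisticalAnalyzer (see 5.2 and 5.3)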

5.2 StatisticalAnalyzer Service (src/services/statistical_analyzer.py)

This is the key fix - NO agent_framework imports.

"""Statistical analysis service using Modal code execution.

This module provides Modal-based statistical analysis WITHOUT depending on
agent_framework. This allows it to be used in the simple orchestrator mode
without requiring the magentic optional dependency.

The AnalysisAgent (in src/agents/) wraps this service for magentic mode.
"""

import asyncio
import re
from functools import partial
from typing import Any

from pydantic import BaseModel, Field
from pydantic_ai import Agent

from src.agent_factory.judges import get_model
from src.tools.code_execution import (
    CodeExecutionError,
    get_code_executor,
    get_sandbox_library_prompt,
)
from src.utils.models import Evidence


class AnalysisResult(BaseModel):
    """Result of statistical analysis."""

    verdict: str = Field(
        description="SUPPORTED, REFUTED, or INCONCLUSIVE",
    )
    confidence: float = Field(ge=0.0, le=1.0, description="Confidence in verdict (0-1)")
    statistical_evidence: str = Field(
        description="Summary of statistical findings from code execution"
    )
    code_generated: str = Field(description="Python code that was executed")
    execution_output: str = Field(description="Output from code execution")
    key_findings: list[str] = Field(default_factory=list, description="Key takeaways")
    limitations: list[str] = Field(default_factory=list, description="Limitations")


class StatisticalAnalyzer:
    """Performs statistical analysis using Modal code execution.

    This service:
    1. Generates Python code for statistical analysis using LLM
    2. Executes code in Modal sandbox
    3. Interprets results
    4. Returns verdict (SUPPORTED/REFUTED/INCONCLUSIVE)

    Note: This class has NO agent_framework dependency, making it safe
    to use in the simple orchestrator without the magentic extra.
    """

    def __init__(self) -> None:
        """Initialize the analyzer."""
        self._code_executor: Any = None
        self._agent: Agent[None, str] | None = None

    def _get_code_executor(self) -> Any:
        """Lazy initialization of code executor."""
        if self._code_executor is None:
            self._code_executor = get_code_executor()
        return self._code_executor

    def _get_agent(self) -> Agent[None, str]:
        """Lazy initialization of LLM agent for code generation."""
        if self._agent is None:
            library_versions = get_sandbox_library_prompt()
            self._agent = Agent(
                model=get_model(),
                output_type=str,
                system_prompt=f"""You are a biomedical data scientist.

Generate Python code to analyze research evidence and test hypotheses.

Guidelines:
1. Use pandas, numpy, scipy.stats for analysis
2. Print clear, interpretable results
3. Include statistical tests (t-tests, chi-square, etc.)
4. Calculate effect sizes and confidence intervals
5. Keep code concise (<50 lines)
6. Set 'result' variable to SUPPORTED, REFUTED, or INCONCLUSIVE

Available libraries:
{library_versions}

Output format: Return ONLY executable Python code, no explanations.""",
            )
        return self._agent

    async def analyze(
        self,
        query: str,
        evidence: list[Evidence],
        hypothesis: dict[str, Any] | None = None,
    ) -> AnalysisResult:
        """Run statistical analysis on evidence.

        Args:
            query: The research question
            evidence: List of Evidence objects to analyze
            hypothesis: Optional hypothesis dict with drug, target, pathway, effect

        Returns:
            AnalysisResult with verdict and statistics
        """
        # Build analysis prompt
        evidence_summary = self._summarize_evidence(evidence[:10])
        hypothesis_text = ""
        if hypothesis:
            hypothesis_text = f"""
Hypothesis: {hypothesis.get('drug', 'Unknown')} β†’ {hypothesis.get('target', '?')} β†’ {hypothesis.get('pathway', '?')} β†’ {hypothesis.get('effect', '?')}
Confidence: {hypothesis.get('confidence', 0.5):.0%}
"""

        prompt = f"""Generate Python code to statistically analyze:

**Research Question**: {query}
{hypothesis_text}

**Evidence Summary**:
{evidence_summary}

Generate executable Python code to analyze this evidence."""

        try:
            # Generate code
            agent = self._get_agent()
            code_result = await agent.run(prompt)
            generated_code = code_result.output

            # Execute in Modal sandbox
            loop = asyncio.get_running_loop()
            executor = self._get_code_executor()
            execution = await loop.run_in_executor(
                None, partial(executor.execute, generated_code, timeout=120)
            )

            if not execution["success"]:
                return AnalysisResult(
                    verdict="INCONCLUSIVE",
                    confidence=0.0,
                    statistical_evidence=f"Execution failed: {execution['error']}",
                    code_generated=generated_code,
                    execution_output=execution.get("stderr", ""),
                    key_findings=[],
                    limitations=["Code execution failed"],
                )

            # Interpret results
            return self._interpret_results(generated_code, execution)

        except CodeExecutionError as e:
            return AnalysisResult(
                verdict="INCONCLUSIVE",
                confidence=0.0,
                statistical_evidence=str(e),
                code_generated="",
                execution_output="",
                key_findings=[],
                limitations=[f"Analysis error: {e}"],
            )

    def _summarize_evidence(self, evidence: list[Evidence]) -> str:
        """Summarize evidence for code generation prompt."""
        if not evidence:
            return "No evidence available."

        lines = []
        for i, ev in enumerate(evidence[:5], 1):
            lines.append(f"{i}. {ev.content[:200]}...")
            lines.append(f"   Source: {ev.citation.title}")
            lines.append(f"   Relevance: {ev.relevance:.0%}\n")

        return "\n".join(lines)

    def _interpret_results(
        self,
        code: str,
        execution: dict[str, Any],
    ) -> AnalysisResult:
        """Interpret code execution results."""
        stdout = execution["stdout"]
        stdout_upper = stdout.upper()

        # Extract verdict with robust word-boundary matching
        verdict = "INCONCLUSIVE"
        if re.search(r"\bSUPPORTED\b", stdout_upper) and not re.search(
            r"\b(?:NOT|UN)SUPPORTED\b", stdout_upper
        ):
            verdict = "SUPPORTED"
        elif re.search(r"\bREFUTED\b", stdout_upper):
            verdict = "REFUTED"

        # Extract key findings
        key_findings = []
        for line in stdout.split("\n"):
            line_lower = line.lower()
            if any(kw in line_lower for kw in ["p-value", "significant", "effect", "mean"]):
                key_findings.append(line.strip())

        # Calculate confidence from p-values
        confidence = self._calculate_confidence(stdout)

        return AnalysisResult(
            verdict=verdict,
            confidence=confidence,
            statistical_evidence=stdout.strip(),
            code_generated=code,
            execution_output=stdout,
            key_findings=key_findings[:5],
            limitations=[
                "Analysis based on summary data only",
                "Limited to available evidence",
                "Statistical tests assume data independence",
            ],
        )

    def _calculate_confidence(self, output: str) -> float:
        """Calculate confidence based on statistical results."""
        p_values = re.findall(r"p[-\s]?value[:\s]+(\d+\.?\d*)", output.lower())

        if p_values:
            try:
                min_p = min(float(p) for p in p_values)
                if min_p < 0.001:
                    return 0.95
                elif min_p < 0.01:
                    return 0.90
                elif min_p < 0.05:
                    return 0.80
                else:
                    return 0.60
            except ValueError:
                pass

        return 0.70  # Default


# Singleton for reuse
_analyzer: StatisticalAnalyzer | None = None


def get_statistical_analyzer() -> StatisticalAnalyzer:
    """Get or create singleton StatisticalAnalyzer instance."""
    global _analyzer
    if _analyzer is None:
        _analyzer = StatisticalAnalyzer()
    return _analyzer
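
For reference, a standalone usage sketch with hypothetical evidence values (assumes Modal credentials and an LLM API key are configured):

import asyncio

from src.services.statistical_analyzer import get_statistical_analyzer
from src.utils.models import Citation, Evidence


async def main() -> None:
    evidence = [
        Evidence(
            content="Metformin reduced dementia incidence by 30% in a cohort study.",
            citation=Citation(
                source="pubmed",
                title="Example Cohort Study",  # hypothetical citation
                url="https://pubmed.ncbi.nlm.nih.gov/0000000/",
                date="2024-01-01",
                authors=["Example A"],
            ),
            relevance=0.8,
        )
    ]
    result = await get_statistical_analyzer().analyze(
        "Can metformin prevent Alzheimer's disease?", evidence
    )
    print(result.verdict, f"({result.confidence:.0%})")


asyncio.run(main())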

5.3 Simple Orchestrator Update (src/orchestrator.py)

Uses StatisticalAnalyzer directly - NO agent_framework import.

"""Main orchestrator with optional Modal analysis."""

from src.utils.config import settings

# ... existing imports ...


class Orchestrator:
    """Search-Judge-Analyze orchestration loop."""

    def __init__(
        self,
        search_handler: SearchHandlerProtocol,
        judge_handler: JudgeHandlerProtocol,
        config: OrchestratorConfig | None = None,
        enable_analysis: bool = False,  # New parameter
    ) -> None:
        self.search = search_handler
        self.judge = judge_handler
        self.config = config or OrchestratorConfig()
        self.history: list[dict[str, Any]] = []
        self._enable_analysis = enable_analysis and settings.modal_available

        # Lazy-load analysis (NO agent_framework dependency!)
        self._analyzer: Any = None

    def _get_analyzer(self) -> Any:
        """Lazy initialization of StatisticalAnalyzer.

        Note: This imports from src.services, NOT src.agents,
        so it works without the magentic optional dependency.
        """
        if self._analyzer is None:
            from src.services.statistical_analyzer import get_statistical_analyzer

            self._analyzer = get_statistical_analyzer()
        return self._analyzer

    async def run(self, query: str) -> AsyncGenerator[AgentEvent, None]:
        """Main orchestration loop with optional Modal analysis."""
        # ... existing search/judge loop ...

        # After judge says "synthesize", optionally run analysis
        if self._enable_analysis and assessment.recommendation == "synthesize":
            yield AgentEvent(
                type="analyzing",
                message="Running statistical analysis in Modal sandbox...",
                data={},
                iteration=iteration,
            )

            try:
                analyzer = self._get_analyzer()

                # Run Modal analysis (no agent_framework needed!)
                analysis_result = await analyzer.analyze(
                    query=query,
                    evidence=all_evidence,
                    hypothesis=None,  # Could add hypothesis generation later
                )

                yield AgentEvent(
                    type="analysis_complete",
                    message=f"Analysis verdict: {analysis_result.verdict}",
                    data=analysis_result.model_dump(),
                    iteration=iteration,
                )

            except Exception as e:
                yield AgentEvent(
                    type="error",
                    message=f"Modal analysis failed: {e}",
                    data={"error": str(e)},
                    iteration=iteration,
                )

        # Continue to synthesis...
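
Enabling analysis is one constructor argument. A wiring sketch (handler construction elided, as elsewhere in this spec):

from src.utils.config import settings

# search_handler and judge_handler construction elided; names per this spec.
orchestrator = Orchestrator(
    search_handler=search_handler,
    judge_handler=judge_handler,
    enable_analysis=settings.enable_modal_analysis,
)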

5.4 Refactor AnalysisAgent (src/agents/analysis_agent.py)

Wrap StatisticalAnalyzer for magentic mode.

"""Analysis agent for statistical analysis using Modal code execution.

This agent wraps StatisticalAnalyzer for use in magentic multi-agent mode.
The core logic is in src/services/statistical_analyzer.py to avoid
coupling agent_framework to the simple orchestrator.
"""

from collections.abc import AsyncIterable
from typing import TYPE_CHECKING, Any

from agent_framework import (
    AgentRunResponse,
    AgentRunResponseUpdate,
    AgentThread,
    BaseAgent,
    ChatMessage,
    Role,
)

from src.services.statistical_analyzer import (
    AnalysisResult,
    get_statistical_analyzer,
)
from src.utils.models import Evidence

if TYPE_CHECKING:
    from src.services.embeddings import EmbeddingService


class AnalysisAgent(BaseAgent):  # type: ignore[misc]
    """Wraps StatisticalAnalyzer for magentic multi-agent mode."""

    def __init__(
        self,
        evidence_store: dict[str, Any],
        embedding_service: "EmbeddingService | None" = None,
    ) -> None:
        super().__init__(
            name="AnalysisAgent",
            description="Performs statistical analysis using Modal sandbox",
        )
        self._evidence_store = evidence_store
        self._embeddings = embedding_service
        self._analyzer = get_statistical_analyzer()

    async def run(
        self,
        messages: str | ChatMessage | list[str] | list[ChatMessage] | None = None,
        *,
        thread: AgentThread | None = None,
        **kwargs: Any,
    ) -> AgentRunResponse:
        """Analyze evidence and return verdict."""
        query = self._extract_query(messages)
        hypotheses = self._evidence_store.get("hypotheses", [])
        evidence = self._evidence_store.get("current", [])

        if not evidence:
            return self._error_response("No evidence available.")

        # Get primary hypothesis if available
        hypothesis_dict = None
        if hypotheses:
            h = hypotheses[0]
            hypothesis_dict = {
                "drug": getattr(h, "drug", "Unknown"),
                "target": getattr(h, "target", "?"),
                "pathway": getattr(h, "pathway", "?"),
                "effect": getattr(h, "effect", "?"),
                "confidence": getattr(h, "confidence", 0.5),
            }

        # Delegate to StatisticalAnalyzer
        result = await self._analyzer.analyze(
            query=query,
            evidence=evidence,
            hypothesis=hypothesis_dict,
        )

        # Store in shared context
        self._evidence_store["analysis"] = result.model_dump()

        # Format response
        response_text = self._format_response(result)

        return AgentRunResponse(
            messages=[ChatMessage(role=Role.ASSISTANT, text=response_text)],
            response_id=f"analysis-{result.verdict.lower()}",
            additional_properties={"analysis": result.model_dump()},
        )

    def _format_response(self, result: AnalysisResult) -> str:
        """Format analysis result as markdown."""
        lines = [
            "## Statistical Analysis Complete\n",
            f"### Verdict: **{result.verdict}**",
            f"**Confidence**: {result.confidence:.0%}\n",
            "### Key Findings",
        ]
        for finding in result.key_findings:
            lines.append(f"- {finding}")

        lines.extend([
            "\n### Statistical Evidence",
            "```",
            result.statistical_evidence,
            "```",
        ])
        return "\n".join(lines)

    def _error_response(self, message: str) -> AgentRunResponse:
        """Create error response."""
        return AgentRunResponse(
            messages=[ChatMessage(role=Role.ASSISTANT, text=f"**Error**: {message}")],
            response_id="analysis-error",
        )

    def _extract_query(
        self, messages: str | ChatMessage | list[str] | list[ChatMessage] | None
    ) -> str:
        """Extract query from messages."""
        if isinstance(messages, str):
            return messages
        elif isinstance(messages, ChatMessage):
            return messages.text or ""
        elif isinstance(messages, list):
            for msg in reversed(messages):
                if isinstance(msg, ChatMessage) and msg.role == Role.USER:
                    return msg.text or ""
                elif isinstance(msg, str):
                    return msg
        return ""

    async def run_stream(
        self,
        messages: str | ChatMessage | list[str] | list[ChatMessage] | None = None,
        *,
        thread: AgentThread | None = None,
        **kwargs: Any,
    ) -> AsyncIterable[AgentRunResponseUpdate]:
        """Streaming wrapper."""
        result = await self.run(messages, thread=thread, **kwargs)
        yield AgentRunResponseUpdate(messages=result.messages, response_id=result.response_id)

5.5 MCP Tool for Modal Analysis (src/mcp_tools.py)

Add to existing MCP tools:

async def analyze_hypothesis(
    drug: str,
    condition: str,
    evidence_summary: str,
) -> str:
    """Perform statistical analysis of drug repurposing hypothesis using Modal.

    Executes AI-generated Python code in a secure Modal sandbox to analyze
    the statistical evidence for a drug repurposing hypothesis.

    Args:
        drug: The drug being evaluated (e.g., "metformin")
        condition: The target condition (e.g., "Alzheimer's disease")
        evidence_summary: Summary of evidence to analyze

    Returns:
        Analysis result with verdict (SUPPORTED/REFUTED/INCONCLUSIVE) and statistics
    """
    from src.services.statistical_analyzer import get_statistical_analyzer
    from src.utils.config import settings
    from src.utils.models import Citation, Evidence

    if not settings.modal_available:
        return "Error: Modal credentials not configured. Set MODAL_TOKEN_ID and MODAL_TOKEN_SECRET."

    # Create evidence from summary
    evidence = [
        Evidence(
            content=evidence_summary,
            citation=Citation(
                source="pubmed",
                title=f"Evidence for {drug} in {condition}",
                url="https://example.com",
                date="2024-01-01",
                authors=["User Provided"],
            ),
            relevance=0.9,
        )
    ]

    analyzer = get_statistical_analyzer()
    result = await analyzer.analyze(
        query=f"Can {drug} treat {condition}?",
        evidence=evidence,
        hypothesis={"drug": drug, "target": "unknown", "pathway": "unknown", "effect": condition},
    )

    return f"""## Statistical Analysis: {drug} for {condition}

### Verdict: **{result.verdict}**
**Confidence**: {result.confidence:.0%}

### Key Findings
{chr(10).join(f"- {f}" for f in result.key_findings) or "- No specific findings extracted"}

### Execution Output
```
{result.execution_output}
```

### Generated Code
```python
{result.code_generated}
```

Executed in Modal Sandbox - Isolated, secure, reproducible.
"""
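
A quick smoke test for the tool (a sketch; assumes Modal credentials and an LLM API key are configured):

import asyncio

from src.mcp_tools import analyze_hypothesis

# Hypothetical inputs; the evidence summary would normally come from search.
report = asyncio.run(
    analyze_hypothesis(
        drug="metformin",
        condition="Alzheimer's disease",
        evidence_summary="Three cohort studies report reduced dementia incidence.",
    )
)
print(report)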


5.6 Demo Scripts

examples/modal_demo/verify_sandbox.py
#!/usr/bin/env python3
"""Verify that Modal sandbox is properly isolated.

This script proves to judges that code runs in Modal, not locally.
NO agent_framework dependency - uses only src.tools.code_execution.

Usage:
    uv run python examples/modal_demo/verify_sandbox.py
"""

import asyncio
from functools import partial

from src.tools.code_execution import get_code_executor
from src.utils.config import settings


async def main() -> None:
    """Verify Modal sandbox isolation."""
    if not settings.modal_available:
        print("Error: Modal credentials not configured.")
        print("Set MODAL_TOKEN_ID and MODAL_TOKEN_SECRET in .env")
        return

    executor = get_code_executor()
    loop = asyncio.get_running_loop()

    print("=" * 60)
    print("Modal Sandbox Isolation Verification")
    print("=" * 60 + "\n")

    # Test 1: Hostname
    print("Test 1: Check hostname (should NOT be your machine)")
    code1 = "import socket; print(f'Hostname: {socket.gethostname()}')"
    result1 = await loop.run_in_executor(None, partial(executor.execute, code1))
    print(f"  {result1['stdout'].strip()}\n")

    # Test 2: Scientific libraries
    print("Test 2: Verify scientific libraries")
    code2 = """
import pandas as pd
import numpy as np
import scipy
print(f"pandas: {pd.__version__}")
print(f"numpy: {np.__version__}")
print(f"scipy: {scipy.__version__}")
"""
    result2 = await loop.run_in_executor(None, partial(executor.execute, code2))
    print(f"  {result2['stdout'].strip()}\n")

    # Test 3: Network blocked
    print("Test 3: Verify network isolation")
    code3 = """
import urllib.request
try:
    urllib.request.urlopen("https://google.com", timeout=2)
    print("Network: ALLOWED (unexpected!)")
except Exception:
    print("Network: BLOCKED (as expected)")
"""
    result3 = await loop.run_in_executor(None, partial(executor.execute, code3))
    print(f"  {result3['stdout'].strip()}\n")

    # Test 4: Real statistics
    print("Test 4: Execute statistical analysis")
    code4 = """
import pandas as pd
import scipy.stats as stats

data = pd.DataFrame({'effect': [0.42, 0.38, 0.51]})
mean = data['effect'].mean()
t_stat, p_val = stats.ttest_1samp(data['effect'], 0)

print(f"Mean Effect: {mean:.3f}")
print(f"P-value: {p_val:.4f}")
print(f"Verdict: {'SUPPORTED' if p_val < 0.05 else 'INCONCLUSIVE'}")
"""
    result4 = await loop.run_in_executor(None, partial(executor.execute, code4))
    print(f"  {result4['stdout'].strip()}\n")

    print("=" * 60)
    print("All tests complete - Modal sandbox verified!")
    print("=" * 60)


if __name__ == "__main__":
    asyncio.run(main())

examples/modal_demo/run_analysis.py

#!/usr/bin/env python3
"""Demo: Modal-powered statistical analysis.

This script uses StatisticalAnalyzer directly (NO agent_framework dependency).

Usage:
    uv run python examples/modal_demo/run_analysis.py "metformin alzheimer"
"""

import argparse
import asyncio
import os
import sys

from src.services.statistical_analyzer import get_statistical_analyzer
from src.tools.pubmed import PubMedTool
from src.utils.config import settings


async def main() -> None:
    """Run the Modal analysis demo."""
    parser = argparse.ArgumentParser(description="Modal Analysis Demo")
    parser.add_argument("query", help="Research query")
    args = parser.parse_args()

    if not settings.modal_available:
        print("Error: Modal credentials not configured.")
        sys.exit(1)

    if not (os.getenv("OPENAI_API_KEY") or os.getenv("ANTHROPIC_API_KEY")):
        print("Error: No LLM API key found.")
        sys.exit(1)

    print(f"\n{'=' * 60}")
    print("DeepCritical Modal Analysis Demo")
    print(f"Query: {args.query}")
    print(f"{'=' * 60}\n")

    # Step 1: Gather Evidence
    print("Step 1: Gathering evidence from PubMed...")
    pubmed = PubMedTool()
    evidence = await pubmed.search(args.query, max_results=5)
    print(f"  Found {len(evidence)} papers\n")

    # Step 2: Run Modal Analysis
    print("Step 2: Running statistical analysis in Modal sandbox...")
    analyzer = get_statistical_analyzer()
    result = await analyzer.analyze(query=args.query, evidence=evidence)

    # Step 3: Display Results
    print("\n" + "=" * 60)
    print("ANALYSIS RESULTS")
    print("=" * 60)
    print(f"\nVerdict: {result.verdict}")
    print(f"Confidence: {result.confidence:.0%}")
    print("\nKey Findings:")
    for finding in result.key_findings:
        print(f"  - {finding}")

    print("\n[Demo Complete - Code executed in Modal, not locally]")


if __name__ == "__main__":
    asyncio.run(main())

6. TDD Test Suite

6.1 Unit Tests (tests/unit/services/test_statistical_analyzer.py)

"""Unit tests for StatisticalAnalyzer service."""

from unittest.mock import AsyncMock, MagicMock, patch

import pytest

from src.services.statistical_analyzer import (
    AnalysisResult,
    StatisticalAnalyzer,
    get_statistical_analyzer,
)
from src.utils.models import Citation, Evidence


@pytest.fixture
def sample_evidence() -> list[Evidence]:
    """Sample evidence for testing."""
    return [
        Evidence(
            content="Metformin shows effect size of 0.45.",
            citation=Citation(
                source="pubmed",
                title="Metformin Study",
                url="https://pubmed.ncbi.nlm.nih.gov/12345/",
                date="2024-01-15",
                authors=["Smith J"],
            ),
            relevance=0.9,
        )
    ]


class TestStatisticalAnalyzer:
    """Tests for StatisticalAnalyzer (no agent_framework dependency)."""

    def test_no_agent_framework_import(self) -> None:
        """StatisticalAnalyzer must NOT import agent_framework."""
        from pathlib import Path

        import src.services.statistical_analyzer as module

        # Check import statements only: the module docstring legitimately
        # mentions agent_framework, so a raw substring check would fail.
        source = Path(module.__file__).read_text()
        imports = [
            line
            for line in source.splitlines()
            if line.lstrip().startswith(("import ", "from "))
        ]
        assert not any("agent_framework" in line for line in imports)
        assert not any("BaseAgent" in line for line in imports)

    @pytest.mark.asyncio
    async def test_analyze_returns_result(
        self, sample_evidence: list[Evidence]
    ) -> None:
        """analyze() should return AnalysisResult."""
        analyzer = StatisticalAnalyzer()

        with patch.object(analyzer, "_get_agent") as mock_agent, \
             patch.object(analyzer, "_get_code_executor") as mock_executor:

            # Mock LLM
            mock_agent.return_value.run = AsyncMock(
                return_value=MagicMock(output="print('SUPPORTED')")
            )

            # Mock Modal
            mock_executor.return_value.execute.return_value = {
                "stdout": "SUPPORTED\np-value: 0.01",
                "stderr": "",
                "success": True,
            }

            result = await analyzer.analyze("test query", sample_evidence)

            assert isinstance(result, AnalysisResult)
            assert result.verdict == "SUPPORTED"

    def test_singleton(self) -> None:
        """get_statistical_analyzer should return singleton."""
        a1 = get_statistical_analyzer()
        a2 = get_statistical_analyzer()
        assert a1 is a2


class TestAnalysisResult:
    """Tests for AnalysisResult model."""

    def test_verdict_values(self) -> None:
        """Verdict should be one of the expected values."""
        for verdict in ["SUPPORTED", "REFUTED", "INCONCLUSIVE"]:
            result = AnalysisResult(
                verdict=verdict,
                confidence=0.8,
                statistical_evidence="test",
                code_generated="print('test')",
                execution_output="test",
            )
            assert result.verdict == verdict

    def test_confidence_bounds(self) -> None:
        """Confidence must be 0.0-1.0."""
        with pytest.raises(ValueError):
            AnalysisResult(
                verdict="SUPPORTED",
                confidence=1.5,  # Invalid
                statistical_evidence="test",
                code_generated="test",
                execution_output="test",
            )

6.2 Integration Test (tests/integration/test_modal.py)

"""Integration tests for Modal (requires credentials)."""

import pytest

from src.utils.config import settings


@pytest.mark.integration
@pytest.mark.skipif(not settings.modal_available, reason="Modal not configured")
class TestModalIntegration:
    """Integration tests requiring Modal credentials."""

    @pytest.mark.asyncio
    async def test_sandbox_executes_code(self) -> None:
        """Modal sandbox should execute Python code."""
        import asyncio
        from functools import partial

        from src.tools.code_execution import get_code_executor

        executor = get_code_executor()
        code = "import pandas as pd; print(pd.DataFrame({'a': [1,2,3]})['a'].sum())"

        loop = asyncio.get_running_loop()
        result = await loop.run_in_executor(
            None, partial(executor.execute, code, timeout=30)
        )

        assert result["success"]
        assert "6" in result["stdout"]

    @pytest.mark.asyncio
    async def test_statistical_analyzer_works(self) -> None:
        """StatisticalAnalyzer should work end-to-end."""
        from src.services.statistical_analyzer import get_statistical_analyzer
        from src.utils.models import Citation, Evidence

        evidence = [
            Evidence(
                content="Drug shows 40% improvement in trial.",
                citation=Citation(
                    source="pubmed",
                    title="Test",
                    url="https://test.com",
                    date="2024-01-01",
                    authors=["Test"],
                ),
                relevance=0.9,
            )
        ]

        analyzer = get_statistical_analyzer()
        result = await analyzer.analyze("test drug efficacy", evidence)

        assert result.verdict in ["SUPPORTED", "REFUTED", "INCONCLUSIVE"]
        assert 0.0 <= result.confidence <= 1.0

7. Verification Commands

# 1. Verify NO agent_framework in StatisticalAnalyzer
grep -r "agent_framework" src/services/statistical_analyzer.py
# Should return nothing!

# 2. Run unit tests (no Modal needed)
uv run pytest tests/unit/services/test_statistical_analyzer.py -v

# 3. Run verification script (requires Modal)
uv run python examples/modal_demo/verify_sandbox.py

# 4. Run analysis demo (requires Modal + LLM)
uv run python examples/modal_demo/run_analysis.py "metformin alzheimer"

# 5. Run integration tests
uv run pytest tests/integration/test_modal.py -v -m integration

# 6. Full test suite
make check

8. Definition of Done

Phase 13 is COMPLETE when:

  • src/services/statistical_analyzer.py created (NO agent_framework)
  • src/utils/config.py has enable_modal_analysis setting
  • src/orchestrator.py uses StatisticalAnalyzer directly
  • src/agents/analysis_agent.py refactored to wrap StatisticalAnalyzer
  • src/mcp_tools.py has analyze_hypothesis tool
  • examples/modal_demo/verify_sandbox.py working
  • examples/modal_demo/run_analysis.py working
  • Unit tests pass WITHOUT magentic extra installed
  • Integration tests pass WITH Modal credentials
  • All lints pass

9. Architecture After Phase 13

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                        MCP Clients                              β”‚
β”‚              (Claude Desktop, Cursor, etc.)                     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                            β”‚ MCP Protocol
                            β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                     Gradio App + MCP Server                     β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚  MCP Tools: search_pubmed, search_trials, search_biorxiv β”‚   β”‚
β”‚  β”‚             search_all, analyze_hypothesis               β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                            β”‚
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
        β”‚                                       β”‚
        β–Ό                                       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”            β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Simple Orchestrator β”‚            β”‚   Magentic Orchestrator   β”‚
β”‚  (no agent_framework) β”‚            β”‚   (with agent_framework)  β”‚
β”‚                       β”‚            β”‚                           β”‚
β”‚  SearchHandler        β”‚            β”‚  SearchAgent              β”‚
β”‚  JudgeHandler         β”‚            β”‚  JudgeAgent               β”‚
β”‚  StatisticalAnalyzer ─┼────────────┼→ AnalysisAgent ────────────
β”‚                       β”‚            β”‚  (wraps StatisticalAnalyzer)
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜            β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
            β”‚
            β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    StatisticalAnalyzer                           β”‚
β”‚              (src/services/statistical_analyzer.py)              β”‚
β”‚                    NO agent_framework dependency                 β”‚
β”‚                                                                  β”‚
β”‚  1. Generate code with pydantic-ai                               β”‚
β”‚  2. Execute in Modal sandbox                                     β”‚
β”‚  3. Return AnalysisResult                                        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                            β”‚
                            β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                       Modal Sandbox                             β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚  - pandas, numpy, scipy, sklearn, statsmodels           β”‚    β”‚
β”‚  β”‚  - Network: BLOCKED                                     β”‚    β”‚
β”‚  β”‚  - Filesystem: ISOLATED                                 β”‚    β”‚
β”‚  β”‚  - Timeout: ENFORCED                                    β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

This is the dependency-safe Modal stack.


10. Files Summary

| File | Action | Purpose |
| --- | --- | --- |
| src/services/statistical_analyzer.py | CREATE | Core analysis (no agent_framework) |
| src/utils/config.py | MODIFY | Add enable_modal_analysis |
| src/orchestrator.py | MODIFY | Use StatisticalAnalyzer |
| src/agents/analysis_agent.py | MODIFY | Wrap StatisticalAnalyzer |
| src/mcp_tools.py | MODIFY | Add analyze_hypothesis |
| examples/modal_demo/verify_sandbox.py | CREATE | Sandbox verification |
| examples/modal_demo/run_analysis.py | CREATE | Demo script |
| tests/unit/services/test_statistical_analyzer.py | CREATE | Unit tests |
| tests/integration/test_modal.py | CREATE | Integration tests |

Key Fix: StatisticalAnalyzer has ZERO agent_framework imports, making it safe for the simple orchestrator.