# Phase 3 Implementation Spec: Judge Vertical Slice

**Goal**: Implement the "Brain" of the agent - evaluating evidence quality.
**Philosophy**: "Structured Output or Bust."
**Prerequisite**: Phase 2 complete (all search tests passing)

---

## 1. The Slice Definition

This slice covers:

1. **Input**: A user question + a list of `Evidence` (from Phase 2).
2. **Process**:
   - Construct a prompt with the evidence.
   - Call the LLM (PydanticAI / OpenAI / Anthropic).
   - Force JSON structured output.
3. **Output**: A `JudgeAssessment` object.

**Files to Create**:

- `src/utils/models.py` - Add `JudgeAssessment` models (extend from Phase 2)
- `src/prompts/judge.py` - Judge prompt templates
- `src/agent_factory/judges.py` - `JudgeHandler` with PydanticAI
- `tests/unit/agent_factory/test_judges.py` - Unit tests

---

## 2. Models (Add to `src/utils/models.py`)

The output schema must be strict for reliable structured output.

```python
"""Add these models to src/utils/models.py (after Evidence models from Phase 2)."""

from typing import List, Literal

from pydantic import BaseModel, Field


class AssessmentDetails(BaseModel):
    """Detailed assessment of evidence quality."""

    mechanism_score: int = Field(
        ..., ge=0, le=10, description="How well does the evidence explain the mechanism? 0-10"
    )
    mechanism_reasoning: str = Field(
        ..., min_length=10, description="Explanation of mechanism score"
    )
    clinical_evidence_score: int = Field(
        ..., ge=0, le=10, description="Strength of clinical/preclinical evidence. 0-10"
    )
    clinical_reasoning: str = Field(
        ..., min_length=10, description="Explanation of clinical evidence score"
    )
    drug_candidates: List[str] = Field(
        default_factory=list, description="List of specific drug candidates mentioned"
    )
    key_findings: List[str] = Field(
        default_factory=list, description="Key findings from the evidence"
    )


class JudgeAssessment(BaseModel):
    """Complete assessment from the Judge."""

    details: AssessmentDetails
    sufficient: bool = Field(
        ..., description="Is evidence sufficient to provide a recommendation?"
    )
    confidence: float = Field(
        ..., ge=0.0, le=1.0, description="Confidence in the assessment (0-1)"
    )
    recommendation: Literal["continue", "synthesize"] = Field(
        ..., description="continue = need more evidence, synthesize = ready to answer"
    )
    next_search_queries: List[str] = Field(
        default_factory=list, description="If continue, what queries to search next"
    )
    reasoning: str = Field(
        ..., min_length=20, description="Overall reasoning for the recommendation"
    )
```
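Why the strict constraints matter: anything the LLM returns that drifts from this schema fails fast at validation time, which is what the `retries` setting in section 4 is intended to guard against. A minimal sketch with a hypothetical malformed payload (not part of the spec):

```python
# Illustrative only: a hypothetical malformed LLM payload is rejected by the schema.
from pydantic import ValidationError

from src.utils.models import JudgeAssessment

bad_payload = {
    "details": {
        "mechanism_score": 11,  # violates le=10
        "mechanism_reasoning": "too terse",  # 9 chars, violates min_length=10
        "clinical_evidence_score": 5,
        "clinical_reasoning": "Some preclinical data in rodent models.",
    },
    "sufficient": True,
    "confidence": 0.9,
    "recommendation": "synthesize",
    "reasoning": "short",  # violates min_length=20
}

try:
    JudgeAssessment.model_validate(bad_payload)
except ValidationError as exc:
    print(f"{exc.error_count()} validation errors")  # each constraint violation is reported
```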
---

## 3. Prompt Engineering (`src/prompts/judge.py`)

We treat prompts as code. They should be versioned and clean.

```python
"""Judge prompts for evidence assessment."""

from typing import List

from src.utils.models import Evidence

SYSTEM_PROMPT = """You are an expert drug repurposing research judge.

Your task is to evaluate evidence from biomedical literature and determine
if it's sufficient to recommend drug candidates for a given condition.

## Evaluation Criteria

1. **Mechanism Score (0-10)**: How well does the evidence explain the biological mechanism?
   - 0-3: No clear mechanism, speculative
   - 4-6: Some mechanistic insight, but gaps exist
   - 7-10: Clear, well-supported mechanism of action

2. **Clinical Evidence Score (0-10)**: Strength of clinical/preclinical support?
   - 0-3: No clinical data, only theoretical
   - 4-6: Preclinical or early clinical data
   - 7-10: Strong clinical evidence (trials, meta-analyses)

3. **Sufficiency**: Evidence is sufficient when:
   - Combined scores >= 12 AND
   - At least one specific drug candidate identified AND
   - Clear mechanistic rationale exists

## Output Rules

- Always output valid JSON matching the schema
- Be conservative: only recommend "synthesize" when truly confident
- If continuing, suggest specific, actionable search queries
- Never hallucinate drug names or findings not in the evidence
"""


def format_user_prompt(question: str, evidence: List[Evidence]) -> str:
    """
    Format the user prompt with question and evidence.

    Args:
        question: The user's research question
        evidence: List of Evidence objects from search

    Returns:
        Formatted prompt string
    """
    evidence_text = "\n\n".join(
        f"### Evidence {i + 1}\n"
        f"**Source**: {e.citation.source.upper()} - {e.citation.title}\n"
        f"**URL**: {e.citation.url}\n"
        f"**Date**: {e.citation.date}\n"
        f"**Content**:\n"
        + (e.content[:1500] + "..." if len(e.content) > 1500 else e.content)
        for i, e in enumerate(evidence)
    )

    return f"""## Research Question

{question}

## Available Evidence ({len(evidence)} sources)

{evidence_text}

## Your Task

Evaluate this evidence and determine if it's sufficient to recommend drug repurposing candidates.

Respond with a JSON object matching the JudgeAssessment schema.
"""


def format_empty_evidence_prompt(question: str) -> str:
    """
    Format prompt when no evidence was found.

    Args:
        question: The user's research question

    Returns:
        Formatted prompt string
    """
    return f"""## Research Question

{question}

## Available Evidence

No evidence was found from the search.

## Your Task

Since no evidence was found, recommend search queries that might yield better results.
Set sufficient=False and recommendation="continue".
Suggest 3-5 specific search queries.
"""
```
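A quick way to eyeball the formatter's output (the question and citation values here are made up; `Evidence`/`Citation` are the Phase 2 models used in the tests below):

```python
# Illustrative only: exercise the formatter with a made-up question and citation.
from src.prompts.judge import format_user_prompt
from src.utils.models import Citation, Evidence

evidence = [
    Evidence(
        content="Metformin activates AMPK and reduces neuroinflammation in AD models.",
        citation=Citation(
            source="pubmed",
            title="Metformin and neurodegeneration",
            url="https://pubmed.ncbi.nlm.nih.gov/12345/",
            date="2024-01-01",
        ),
    )
]

prompt = format_user_prompt("Can metformin be repurposed for Alzheimer's disease?", evidence)
print(prompt)  # one "### Evidence 1" block; content over 1500 chars is truncated with "..."
```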
---

## 4. JudgeHandler Implementation (`src/agent_factory/judges.py`)

Using PydanticAI for structured output with retry logic.

```python
"""Judge handler for evidence assessment using PydanticAI."""

import os
from typing import List

import structlog
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIModel
from pydantic_ai.models.anthropic import AnthropicModel

from src.utils.models import Evidence, JudgeAssessment, AssessmentDetails
from src.utils.config import settings
from src.prompts.judge import SYSTEM_PROMPT, format_user_prompt, format_empty_evidence_prompt

logger = structlog.get_logger()


def get_model():
    """Get the LLM model based on configuration."""
    provider = getattr(settings, "llm_provider", "openai")

    if provider == "anthropic":
        return AnthropicModel(
            model_name=getattr(settings, "anthropic_model", "claude-3-5-sonnet-20241022"),
            api_key=os.getenv("ANTHROPIC_API_KEY"),
        )
    else:
        return OpenAIModel(
            model_name=getattr(settings, "openai_model", "gpt-4o"),
            api_key=os.getenv("OPENAI_API_KEY"),
        )


class JudgeHandler:
    """
    Handles evidence assessment using an LLM with structured output.

    Uses PydanticAI to ensure responses match the JudgeAssessment schema.
    """

    def __init__(self, model=None):
        """
        Initialize the JudgeHandler.

        Args:
            model: Optional PydanticAI model. If None, uses config default.
        """
        self.model = model or get_model()
        self.agent = Agent(
            model=self.model,
            result_type=JudgeAssessment,
            system_prompt=SYSTEM_PROMPT,
            retries=3,
        )

    async def assess(
        self,
        question: str,
        evidence: List[Evidence],
    ) -> JudgeAssessment:
        """
        Assess evidence and determine if it's sufficient.

        Args:
            question: The user's research question
            evidence: List of Evidence objects from search

        Returns:
            JudgeAssessment with evaluation results. If the LLM call fails
            after retries, a conservative fallback assessment is returned
            instead of raising.
        """
        logger.info(
            "Starting evidence assessment",
            question=question[:100],
            evidence_count=len(evidence),
        )

        # Format the prompt based on whether we have evidence
        if evidence:
            user_prompt = format_user_prompt(question, evidence)
        else:
            user_prompt = format_empty_evidence_prompt(question)

        try:
            # Run the agent with structured output
            result = await self.agent.run(user_prompt)
            assessment = result.data

            logger.info(
                "Assessment complete",
                sufficient=assessment.sufficient,
                recommendation=assessment.recommendation,
                confidence=assessment.confidence,
            )
            return assessment

        except Exception as e:
            logger.error("Assessment failed", error=str(e))
            # Return a safe default assessment on failure
            return self._create_fallback_assessment(question, str(e))

    def _create_fallback_assessment(
        self,
        question: str,
        error: str,
    ) -> JudgeAssessment:
        """
        Create a fallback assessment when the LLM fails.

        Args:
            question: The original question
            error: The error message

        Returns:
            Safe fallback JudgeAssessment
        """
        return JudgeAssessment(
            details=AssessmentDetails(
                mechanism_score=0,
                mechanism_reasoning="Assessment failed due to LLM error",
                clinical_evidence_score=0,
                clinical_reasoning="Assessment failed due to LLM error",
                drug_candidates=[],
                key_findings=[],
            ),
            sufficient=False,
            confidence=0.0,
            recommendation="continue",
            next_search_queries=[
                f"{question} mechanism",
                f"{question} clinical trials",
                f"{question} drug candidates",
            ],
            reasoning=f"Assessment failed: {error}. Recommend retrying with refined queries.",
        )


class MockJudgeHandler:
    """
    Mock JudgeHandler for testing without LLM calls.

    Use this in unit tests to avoid API calls.
    """

    def __init__(self, mock_response: JudgeAssessment | None = None):
        """
        Initialize with optional mock response.

        Args:
            mock_response: The assessment to return. If None, uses default.
        """
        self.mock_response = mock_response
        self.call_count = 0
        self.last_question = None
        self.last_evidence = None

    async def assess(
        self,
        question: str,
        evidence: List[Evidence],
    ) -> JudgeAssessment:
        """Return the mock response."""
        self.call_count += 1
        self.last_question = question
        self.last_evidence = evidence

        if self.mock_response:
            return self.mock_response

        # Default mock response
        return JudgeAssessment(
            details=AssessmentDetails(
                mechanism_score=7,
                mechanism_reasoning="Mock assessment - good mechanism evidence",
                clinical_evidence_score=6,
                clinical_reasoning="Mock assessment - moderate clinical evidence",
                drug_candidates=["Drug A", "Drug B"],
                key_findings=["Finding 1", "Finding 2"],
            ),
            sufficient=len(evidence) >= 3,
            confidence=0.75,
            recommendation="synthesize" if len(evidence) >= 3 else "continue",
            next_search_queries=["query 1", "query 2"] if len(evidence) < 3 else [],
            reasoning="Mock assessment for testing purposes",
        )
```
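The checklist in section 8 asks for `src/agent_factory/__init__.py` with exports. A minimal sketch, assuming the package only needs to expose the two handlers defined above:

```python
"""src/agent_factory/__init__.py - package exports (minimal sketch)."""

from src.agent_factory.judges import JudgeHandler, MockJudgeHandler

__all__ = ["JudgeHandler", "MockJudgeHandler"]
```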
---

## 5. TDD Workflow

### Test File: `tests/unit/agent_factory/test_judges.py`

```python
"""Unit tests for JudgeHandler."""

from unittest.mock import AsyncMock, MagicMock, patch

import pytest

from src.utils.models import (
    Evidence,
    Citation,
    JudgeAssessment,
    AssessmentDetails,
)


class TestJudgeHandler:
    """Tests for JudgeHandler."""

    @pytest.mark.asyncio
    async def test_assess_returns_assessment(self):
        """JudgeHandler should return JudgeAssessment from LLM."""
        from src.agent_factory.judges import JudgeHandler

        # Create mock assessment
        mock_assessment = JudgeAssessment(
            details=AssessmentDetails(
                mechanism_score=8,
                mechanism_reasoning="Strong mechanistic evidence",
                clinical_evidence_score=7,
                clinical_reasoning="Good clinical support",
                drug_candidates=["Metformin"],
                key_findings=["Neuroprotective effects"],
            ),
            sufficient=True,
            confidence=0.85,
            recommendation="synthesize",
            next_search_queries=[],
            reasoning="Evidence is sufficient for synthesis",
        )

        # Mock the PydanticAI agent
        mock_result = MagicMock()
        mock_result.data = mock_assessment

        with patch("src.agent_factory.judges.Agent") as mock_agent_class:
            mock_agent = AsyncMock()
            mock_agent.run = AsyncMock(return_value=mock_result)
            mock_agent_class.return_value = mock_agent

            handler = JudgeHandler()
            # Replace the agent with our mock
            handler.agent = mock_agent

            evidence = [
                Evidence(
                    content="Metformin shows neuroprotective properties...",
                    citation=Citation(
                        source="pubmed",
                        title="Metformin in AD",
                        url="https://pubmed.ncbi.nlm.nih.gov/12345/",
                        date="2024-01-01",
                    ),
                )
            ]

            result = await handler.assess("metformin alzheimer", evidence)

            assert result.sufficient is True
            assert result.recommendation == "synthesize"
            assert result.confidence == 0.85
            assert "Metformin" in result.details.drug_candidates

    @pytest.mark.asyncio
    async def test_assess_empty_evidence(self):
        """JudgeHandler should handle empty evidence gracefully."""
        from src.agent_factory.judges import JudgeHandler

        mock_assessment = JudgeAssessment(
            details=AssessmentDetails(
                mechanism_score=0,
                mechanism_reasoning="No evidence to assess",
                clinical_evidence_score=0,
                clinical_reasoning="No evidence to assess",
                drug_candidates=[],
                key_findings=[],
            ),
            sufficient=False,
            confidence=0.0,
            recommendation="continue",
            next_search_queries=["metformin alzheimer mechanism"],
            reasoning="No evidence found, need to search more",
        )

        mock_result = MagicMock()
        mock_result.data = mock_assessment

        with patch("src.agent_factory.judges.Agent") as mock_agent_class:
            mock_agent = AsyncMock()
            mock_agent.run = AsyncMock(return_value=mock_result)
            mock_agent_class.return_value = mock_agent

            handler = JudgeHandler()
            handler.agent = mock_agent

            result = await handler.assess("metformin alzheimer", [])

            assert result.sufficient is False
            assert result.recommendation == "continue"
            assert len(result.next_search_queries) > 0

    @pytest.mark.asyncio
    async def test_assess_handles_llm_failure(self):
        """JudgeHandler should return fallback on LLM failure."""
        from src.agent_factory.judges import JudgeHandler

        with patch("src.agent_factory.judges.Agent") as mock_agent_class:
            mock_agent = AsyncMock()
            mock_agent.run = AsyncMock(side_effect=Exception("API Error"))
            mock_agent_class.return_value = mock_agent

            handler = JudgeHandler()
            handler.agent = mock_agent

            evidence = [
                Evidence(
                    content="Some content",
                    citation=Citation(
                        source="pubmed",
                        title="Title",
                        url="url",
                        date="2024",
                    ),
                )
            ]

            result = await handler.assess("test question", evidence)

            # Should return fallback, not raise
            assert result.sufficient is False
            assert result.recommendation == "continue"
            assert "failed" in result.reasoning.lower()

class TestMockJudgeHandler:
    """Tests for MockJudgeHandler."""

    @pytest.mark.asyncio
    async def test_mock_handler_returns_default(self):
        """MockJudgeHandler should return default assessment."""
        from src.agent_factory.judges import MockJudgeHandler

        handler = MockJudgeHandler()

        evidence = [
            Evidence(
                content="Content 1",
                citation=Citation(source="pubmed", title="T1", url="u1", date="2024"),
            ),
            Evidence(
                content="Content 2",
                citation=Citation(source="web", title="T2", url="u2", date="2024"),
            ),
        ]

        result = await handler.assess("test", evidence)

        assert handler.call_count == 1
        assert handler.last_question == "test"
        assert len(handler.last_evidence) == 2
        assert result.details.mechanism_score == 7

    @pytest.mark.asyncio
    async def test_mock_handler_custom_response(self):
        """MockJudgeHandler should return custom response when provided."""
        from src.agent_factory.judges import MockJudgeHandler

        custom_assessment = JudgeAssessment(
            details=AssessmentDetails(
                mechanism_score=10,
                mechanism_reasoning="Custom reasoning",
                clinical_evidence_score=10,
                clinical_reasoning="Custom clinical",
                drug_candidates=["CustomDrug"],
                key_findings=["Custom finding"],
            ),
            sufficient=True,
            confidence=1.0,
            recommendation="synthesize",
            next_search_queries=[],
            # Must satisfy the reasoning field's min_length=20 constraint
            reasoning="Custom assessment for testing",
        )

        handler = MockJudgeHandler(mock_response=custom_assessment)

        result = await handler.assess("test", [])

        assert result.details.mechanism_score == 10
        assert result.details.drug_candidates == ["CustomDrug"]

    @pytest.mark.asyncio
    async def test_mock_handler_insufficient_with_few_evidence(self):
        """MockJudgeHandler should recommend continue with < 3 evidence."""
        from src.agent_factory.judges import MockJudgeHandler

        handler = MockJudgeHandler()

        # Only 2 pieces of evidence
        evidence = [
            Evidence(
                content="Content",
                citation=Citation(source="pubmed", title="T", url="u", date="2024"),
            ),
            Evidence(
                content="Content 2",
                citation=Citation(source="web", title="T2", url="u2", date="2024"),
            ),
        ]

        result = await handler.assess("test", evidence)

        assert result.sufficient is False
        assert result.recommendation == "continue"
        assert len(result.next_search_queries) > 0
```

---

## 6. Dependencies

Add to `pyproject.toml`:

```toml
[project]
dependencies = [
    # ... existing deps ...
    "pydantic-ai>=0.0.16",
    "openai>=1.0.0",
    "anthropic>=0.18.0",
]
```

---

## 7. Configuration (`src/utils/config.py`)

Add LLM configuration:

```python
"""Add to src/utils/config.py."""

from typing import Literal

from pydantic_settings import BaseSettings


class Settings(BaseSettings):
    """Application settings."""

    # LLM Configuration
    llm_provider: Literal["openai", "anthropic"] = "openai"
    openai_model: str = "gpt-4o"
    anthropic_model: str = "claude-3-5-sonnet-20241022"

    # API Keys (loaded from environment)
    openai_api_key: str | None = None
    anthropic_api_key: str | None = None
    ncbi_api_key: str | None = None

    class Config:
        env_file = ".env"
        env_file_encoding = "utf-8"


settings = Settings()
```
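A small sketch of how these settings are expected to resolve at runtime. The environment values below are assumptions; `pydantic-settings` maps each field to an environment variable of the same name (case-insensitive) or to an entry in `.env`:

```python
# Illustrative only: select the provider via the environment, as get_model() expects.
import os

os.environ["LLM_PROVIDER"] = "anthropic"  # or put LLM_PROVIDER=anthropic in .env
os.environ["ANTHROPIC_API_KEY"] = "your-anthropic-key"  # placeholder value

from src.utils.config import Settings

settings = Settings()
assert settings.llm_provider == "anthropic"
assert settings.anthropic_api_key == "your-anthropic-key"
# get_model() in src/agent_factory/judges.py would now return an AnthropicModel.
```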
---

## 8. Implementation Checklist

- [ ] Add `AssessmentDetails` and `JudgeAssessment` models to `src/utils/models.py`
- [ ] Create `src/prompts/__init__.py` (empty, for package)
- [ ] Create `src/prompts/judge.py` with prompt templates
- [ ] Create `src/agent_factory/__init__.py` with exports
- [ ] Implement `src/agent_factory/judges.py` with JudgeHandler
- [ ] Update `src/utils/config.py` with LLM settings
- [ ] Create `tests/unit/agent_factory/__init__.py`
- [ ] Write tests in `tests/unit/agent_factory/test_judges.py`
- [ ] Run `uv run pytest tests/unit/agent_factory/ -v` - **ALL TESTS MUST PASS**
- [ ] Commit: `git commit -m "feat: phase 3 judge slice complete"`

---

## 9. Definition of Done

Phase 3 is **COMPLETE** when:

1. All unit tests pass: `uv run pytest tests/unit/agent_factory/ -v`
2. `JudgeHandler` can assess evidence and return structured output
3. Graceful degradation: if the LLM fails, a safe fallback is returned
4. `MockJudgeHandler` works for testing without API calls
5. This can be run in a Python REPL:

```python
import asyncio
import os

from src.utils.models import Evidence, Citation
from src.agent_factory.judges import JudgeHandler, MockJudgeHandler


# Test with mock (no API key needed)
async def test_mock():
    handler = MockJudgeHandler()
    evidence = [
        Evidence(
            content="Metformin shows neuroprotective effects in AD models",
            citation=Citation(
                source="pubmed",
                title="Metformin and Alzheimer's",
                url="https://pubmed.ncbi.nlm.nih.gov/12345/",
                date="2024-01-01",
            ),
        ),
    ]
    result = await handler.assess("metformin alzheimer", evidence)
    print(f"Sufficient: {result.sufficient}")
    print(f"Recommendation: {result.recommendation}")
    print(f"Drug candidates: {result.details.drug_candidates}")


asyncio.run(test_mock())


# Test with real LLM (requires API key)
async def test_real():
    os.environ["OPENAI_API_KEY"] = "your-key-here"  # Or set in .env
    handler = JudgeHandler()
    evidence = [
        Evidence(
            content="Metformin shows neuroprotective effects in AD models...",
            citation=Citation(
                source="pubmed",
                title="Metformin and Alzheimer's",
                url="https://pubmed.ncbi.nlm.nih.gov/12345/",
                date="2024-01-01",
            ),
        ),
    ]
    result = await handler.assess("metformin alzheimer", evidence)
    print(f"Sufficient: {result.sufficient}")
    print(f"Confidence: {result.confidence}")
    print(f"Reasoning: {result.reasoning}")


# asyncio.run(test_real())  # Uncomment with valid API key
```

**Proceed to Phase 4 ONLY after all checkboxes are complete.**