feat: HFInferenceJudgeHandler - Free AI analysis for hackathon judges (#36)
* feat: implement HFInferenceJudgeHandler for free-tier AI analysis
Replace MockJudgeHandler with real AI analysis using HuggingFace Inference API:
- Add HFInferenceJudgeHandler with chat_completion API
- Model fallback chain: Llama 3.1 → Mistral → Zephyr (ungated)
- Robust JSON extraction (handles markdown blocks, nested braces)
- Tenacity retry with exponential backoff for rate limits
- Fix app.py to use HF Inference when no paid API keys present
Priority: User API key → Env API key → HF Inference (free)
Hackathon judges now get real AI analysis without needing API keys.
Set HF_TOKEN as Space secret for best model (Llama 3.1).
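For orientation, a minimal sketch of the free-tier call path this commit adds (the full handler, with tenacity retries, JSON-schema prompting, and structured parsing, appears in the `03_phase_judge.md` diff below; the `ask` helper here is illustrative only):

```python
# Sketch: try the best free model first, degrade to ungated models on failure.
from huggingface_hub import InferenceClient

FALLBACK_MODELS = [
    "meta-llama/Llama-3.1-8B-Instruct",    # gated: needs HF_TOKEN
    "mistralai/Mistral-7B-Instruct-v0.3",  # may need a token
    "HuggingFaceH4/zephyr-7b-beta",        # ungated fallback
]

client = InferenceClient()  # picks up HF_TOKEN from the environment if set


def ask(prompt: str) -> str:
    last_error: Exception | None = None
    for model in FALLBACK_MODELS:
        try:
            resp = client.chat_completion(
                messages=[{"role": "user", "content": prompt}],
                model=model,
                max_tokens=512,
                temperature=0.1,
            )
            return resp.choices[0].message.content or ""
        except Exception as err:  # 401/403 (gated), 429/503 (rate limits) -> next model
            last_error = err
    raise RuntimeError(f"All free-tier models failed: {last_error}")
```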
* feat: add documentation for Magentic mode bug and implementation spec
- Introduced a new bug report for Magentic mode, detailing its non-functionality and root causes.
- Updated the implementation specification for Magentic integration, emphasizing the architecture, critical insights, and necessary changes for agent coordination.
- Enhanced clarity on the roles of various agents and their interactions within the Magentic workflow.
- Provided recommendations for fixing or abandoning the Magentic mode based on observed issues.
This commit aims to improve understanding and troubleshooting of the Magentic mode within the project.
* feat: implement Magentic ChatAgent pattern with semantic state management
- Add src/agents/state.py: Thread-safe MagenticState with contextvars
- Evidence store for structured citation access
- EmbeddingService integration for semantic deduplication
- Add src/agents/tools.py: AIFunction tools that update shared state
- search_pubmed, search_clinical_trials, search_preprints
- get_bibliography for ReportAgent citations
- Tools return strings to LLM AND update state
- Add src/agents/magentic_agents.py: ChatAgent factories
- SearchAgent with search tools
- JudgeAgent, HypothesisAgent, ReportAgent
- Each agent has internal OpenAIChatClient
- Update src/orchestrator_magentic.py: Use ChatAgent pattern
- Initialize MagenticState at workflow start
- Properly stream events from MagenticBuilder
- Fix type errors for pre-commit mypy compatibility
Implements Phase 5 spec for correct Microsoft Agent Framework integration.
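The core idea in this commit - tools that return plain text to the LLM while also recording evidence in shared, per-workflow state - can be sketched like this. All names below (`MagenticState.evidence`, `init_state`, the placeholder `search_pubmed` body) are illustrative assumptions, not the actual `src/agents/state.py` / `src/agents/tools.py` API:

```python
# Illustrative sketch only - not the project's real implementation.
from contextvars import ContextVar
from dataclasses import dataclass, field


@dataclass
class MagenticState:
    """Per-workflow store that tools write to and the ReportAgent reads for citations."""
    evidence: list[dict] = field(default_factory=list)


# A ContextVar scopes the state to the current workflow run (no module-level globals).
_state: ContextVar[MagenticState] = ContextVar("magentic_state")


def init_state() -> MagenticState:
    """Call once at workflow start so tools have a state object to update."""
    state = MagenticState()
    _state.set(state)
    return state


def search_pubmed(query: str) -> str:
    """Tool: returns a text summary to the LLM AND appends structured evidence to state."""
    results = [{"title": f"Placeholder result for {query}", "source": "pubmed"}]
    state = _state.get()  # raises LookupError if init_state() was never called
    state.evidence.extend(results)
    return "\n".join(r["title"] for r in results)
```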
* docs: add P0 blockers documentation for Magentic mode implementation
- Introduced a new markdown document outlining critical blockers in the Magentic mode implementation.
- Highlighted issues such as hardcoded OpenAI models, dependency source ambiguity, and the lack of a "Free Tier" for users.
- Provided detailed impacts and required fixes for each identified issue to ensure a stable deployment.
This documentation aims to facilitate resolution of critical issues and improve the overall user experience in Magentic mode.
* fix: address CodeRabbit feedback and P0 blockers
Code Fixes (HIGH priority):
- Add API key/provider validation to prevent silent auth failures
- Fix hardcoded manager model in orchestrator_magentic.py (now uses settings.openai_model)
- Add bounds checking to JSON extraction in judges.py (prevents IndexError)
- Fix fragile test assertion in test_judges_hf.py
Code Quality (MEDIUM priority):
- Add explicit type annotation for models_to_try: list[str]
- Fix structured logging (f-string → structured params)
- Align fallback query count (3 queries) between handlers
Test Improvements:
- Add @pytest.mark.unit decorator to TestHFInferenceJudgeHandler
Documentation Sync:
- Update Phase 3 docs to match actual implementation:
- __init__ signature (simplified, no inline imports)
- _extract_json (string split with bounds checking)
- _call_with_retry (tenacity decorator, asyncio.get_running_loop())
- assess method (simplified model loop)
- Update Phase 4 docs with ChatInterface additional_inputs for BYOK
All 104 tests pass.
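The structured-logging item above, as a before/after sketch (structlog-style keyword parameters, matching the `logger.warning("Model failed", model=..., error=...)` calls in the judges diff below):

```python
import structlog

logger = structlog.get_logger()

model, err = "HuggingFaceH4/zephyr-7b-beta", "429 Too Many Requests"

# Before: interpolated f-string - the model name and error are lost inside the message
logger.warning(f"Model {model} failed: {err}")

# After: structured params - fields stay machine-readable in the log event
logger.warning("Model failed", model=model, error=err)
```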
* fix: pin agent-framework-core and remove resolved bug doc
- Pin agent-framework-core>=1.0.0b251120,<2.0.0 to prevent breaking changes
- Remove docs/bugs/007_magentic_p0_blockers.md - all issues resolved:
- Issue 1 (hardcoded models): Already fixed in previous commit
- Issue 2 (dependency unpinned): Fixed in this commit
- Issue 3 (no free tier): Working as Designed
* chore: remove resolved bug documentation
- Delete 005_services_not_integrated.md - embeddings now wired to simple orchestrator
(enable_embeddings=True is the default in orchestrator.py)
- Delete 006_magentic_mode_broken.md - magentic mode is experimental/optional,
documented as requiring OpenAI (not a bug)
- .env.example +14 -0
- docs/bugs/005_services_not_integrated.md +0 -142
- docs/implementation/03_phase_judge.md +309 -14
- docs/implementation/04_phase_ui.md +118 -28
- docs/implementation/05_phase_magentic.md +885 -432
- pyproject.toml +4 -3
- src/agent_factory/judges.py +206 -1
- src/agents/magentic_agents.py +184 -0
- src/agents/state.py +90 -0
- src/agents/tools.py +175 -0
- src/app.py +53 -34
- src/orchestrator_factory.py +14 -16
- src/orchestrator_magentic.py +90 -146
- src/prompts/report.py +4 -4
- tests/unit/agent_factory/test_judges_hf.py +138 -0
- uv.lock +3 -1
.env.example (+14 -0) — add the HuggingFace free-tier section:

@@ -11,6 +11,20 @@ ANTHROPIC_API_KEY=sk-ant-your-key-here
 OPENAI_MODEL=gpt-5.1
 ANTHROPIC_MODEL=claude-sonnet-4-5-20250929

+# ============== HUGGINGFACE (FREE TIER) ==============
+
+# HuggingFace Token - enables Llama 3.1 (best quality free model)
+# Get yours at: https://huggingface.co/settings/tokens
+#
+# WITHOUT HF_TOKEN: Falls back to ungated models (zephyr-7b-beta)
+# WITH HF_TOKEN: Uses Llama 3.1 8B Instruct (requires accepting license)
+#
+# For HuggingFace Spaces deployment:
+# Set this as a "Secret" in Space Settings → Variables and secrets
+# Users/judges don't need their own token - the Space secret is used
+#
+HF_TOKEN=hf_your-token-here
+
 # ============== AGENT CONFIGURATION ==============

 MAX_ITERATIONS=10
docs/bugs/005_services_not_integrated.md (-142, deleted) — the removed bug report read:

# Bug 005: Embedding Services Built But Not Wired to Default Orchestrator

**Date:** November 26, 2025
**Severity:** CRITICAL
**Status:** Open

## 1. The Problem

Two complete semantic search services exist but are **NOT USED** by the default orchestrator:

| Service | Location | Status |
| ------- | -------- | ------ |
| EmbeddingService | `src/services/embeddings.py` | BUILT, not wired to simple mode |
| LlamaIndexRAGService | `src/services/llamaindex_rag.py` | BUILT, not wired to simple mode |

## 2. Root Cause: Two Orchestrators

```
orchestrator.py (SIMPLE MODE - DEFAULT)
  - Basic search → judge → loop
  - NO embeddings
  - NO semantic search
  - Hand-rolled keyword matching

orchestrator_magentic.py (MAGENTIC MODE)
  - Multi-agent architecture
  - USES EmbeddingService
  - USES semantic search
  - Requires agent-framework (optional dep)
  - OpenAI only
```

**The UI defaults to simple mode**, which bypasses all the semantic search infrastructure.

## 3. What's Built (Not Wired)

### EmbeddingService (NO API KEY NEEDED)

```python
# src/services/embeddings.py
class EmbeddingService:
    async def embed(text) -> list[float]
    async def search_similar(query) -> list[dict]  # SEMANTIC SEARCH
    async def deduplicate(evidence) -> list        # DEDUPLICATION
```

- Uses local sentence-transformers
- ChromaDB vector store
- **Works without API keys**

### LlamaIndexRAGService

```python
# src/services/llamaindex_rag.py
class LlamaIndexRAGService:
    def ingest_evidence(evidence_list)
    def retrieve(query) -> list[dict]  # Semantic retrieval
    def query(query_str) -> str        # Synthesized response
```

## 4. Where Services ARE Used

```
src/orchestrator_magentic.py    ← Uses EmbeddingService
src/agents/search_agent.py      ← Uses EmbeddingService
src/agents/report_agent.py      ← Uses EmbeddingService
src/agents/hypothesis_agent.py  ← Uses EmbeddingService
src/agents/analysis_agent.py    ← Uses EmbeddingService
```

All in magentic mode agents, NOT in simple orchestrator.

## 5. The Fix Options

### Option A: Add Embeddings to Simple Orchestrator (RECOMMENDED)

Modify `src/orchestrator.py` to optionally use EmbeddingService:

```python
class Orchestrator:
    def __init__(self, ..., use_embeddings: bool = True):
        if use_embeddings:
            from src.services.embeddings import get_embedding_service
            self.embeddings = get_embedding_service()
        else:
            self.embeddings = None

    async def run(self, query):
        # ... search phase ...

        if self.embeddings:
            # Semantic ranking
            all_evidence = await self._rank_by_relevance(all_evidence, query)
            # Deduplication
            all_evidence = await self.embeddings.deduplicate(all_evidence)
```

### Option B: Make Magentic Mode Default

Change app.py to default to "magentic" mode when deps available.

### Option C: Merge Best of Both

Create a new orchestrator that:
- Has the simplicity of simple mode
- Uses embeddings for ranking/dedup
- Doesn't require agent-framework

## 6. Implementation Plan

### Phase 1: Wire EmbeddingService to Simple Orchestrator

1. Import EmbeddingService in orchestrator.py
2. Add semantic ranking after search
3. Add deduplication before judge
4. Test end-to-end

### Phase 2: Add Relevance to Evidence

1. Use embedding similarity as relevance score
2. Sort evidence by relevance
3. Only send top-K to judge

## 7. Files to Modify

```
src/orchestrator.py          ← Add embedding integration
src/orchestrator_factory.py  ← Pass embeddings flag
src/app.py                   ← Enable embeddings by default
```

## 8. Success Criteria

- [ ] Default mode uses semantic search
- [ ] Evidence ranked by relevance
- [ ] Duplicates removed
- [ ] No new API keys required (sentence-transformers is local)
- [ ] Magentic mode still works as before
docs/implementation/03_phase_judge.md (+309 -14) — document the new `HFInferenceJudgeHandler` and demote `MockJudgeHandler` to a test-only helper. The handler added after the `JudgeHandler` example:

```python
class HFInferenceJudgeHandler:
    """
    JudgeHandler using HuggingFace Inference API for FREE LLM calls.

    This is the DEFAULT for demo mode - provides real AI analysis without
    requiring users to have OpenAI/Anthropic API keys.

    Model Fallback Chain (handles gated models and rate limits):
    1. meta-llama/Llama-3.1-8B-Instruct (best quality, requires HF_TOKEN)
    2. mistralai/Mistral-7B-Instruct-v0.3 (good quality, may require token)
    3. HuggingFaceH4/zephyr-7b-beta (ungated, always works)

    Rate Limit Handling:
    - Exponential backoff with 3 retries
    - Falls back to next model on persistent 429/503 errors
    """

    # Model fallback chain: gated (best) -> ungated (fallback)
    FALLBACK_MODELS = [
        "meta-llama/Llama-3.1-8B-Instruct",    # Best quality (gated)
        "mistralai/Mistral-7B-Instruct-v0.3",  # Good quality
        "HuggingFaceH4/zephyr-7b-beta",        # Ungated fallback
    ]

    def __init__(self, model_id: str | None = None) -> None:
        """
        Initialize with HF Inference client.

        Args:
            model_id: Optional specific model ID. If None, uses FALLBACK_MODELS chain.
        """
        self.model_id = model_id
        # Will automatically use HF_TOKEN from env if available
        self.client = InferenceClient()
        self.call_count = 0
        self.last_question: str | None = None
        self.last_evidence: list[Evidence] | None = None

    def _extract_json(self, text: str) -> dict[str, Any] | None:
        """Robust JSON extraction that handles markdown blocks and nested braces."""
        text = text.strip()

        # Remove markdown code blocks if present (with bounds checking)
        if "```json" in text:
            parts = text.split("```json", 1)
            if len(parts) > 1:
                inner_parts = parts[1].split("```", 1)
                text = inner_parts[0]
        elif "```" in text:
            parts = text.split("```", 1)
            if len(parts) > 1:
                inner_parts = parts[1].split("```", 1)
                text = inner_parts[0]

        text = text.strip()

        # Find first '{'
        start_idx = text.find("{")
        if start_idx == -1:
            return None

        # Stack-based parsing ignoring chars in strings
        count = 0
        in_string = False
        escape = False

        for i, char in enumerate(text[start_idx:], start=start_idx):
            if in_string:
                if escape:
                    escape = False
                elif char == "\\":
                    escape = True
                elif char == '"':
                    in_string = False
            elif char == '"':
                in_string = True
            elif char == "{":
                count += 1
            elif char == "}":
                count -= 1
                if count == 0:
                    try:
                        result = json.loads(text[start_idx : i + 1])
                        if isinstance(result, dict):
                            return result
                        return None
                    except json.JSONDecodeError:
                        return None

        return None

    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=1, max=4),
        retry=retry_if_exception_type(Exception),
        reraise=True,
    )
    async def _call_with_retry(self, model: str, prompt: str, question: str) -> JudgeAssessment:
        """Make API call with retry logic using chat_completion."""
        loop = asyncio.get_running_loop()

        # Build messages for chat_completion (model-agnostic)
        messages = [
            {
                "role": "system",
                "content": f"""{SYSTEM_PROMPT}

IMPORTANT: Respond with ONLY valid JSON matching this schema:
{{
  "details": {{
    "mechanism_score": <int 0-10>,
    "mechanism_reasoning": "<string>",
    "clinical_evidence_score": <int 0-10>,
    "clinical_reasoning": "<string>",
    "drug_candidates": ["<string>", ...],
    "key_findings": ["<string>", ...]
  }},
  "sufficient": <bool>,
  "confidence": <float 0-1>,
  "recommendation": "continue" | "synthesize",
  "next_search_queries": ["<string>", ...],
  "reasoning": "<string>"
}}""",
            },
            {"role": "user", "content": prompt},
        ]

        # Use chat_completion (conversational task - supported by all models)
        response = await loop.run_in_executor(
            None,
            lambda: self.client.chat_completion(
                messages=messages,
                model=model,
                max_tokens=1024,
                temperature=0.1,
            ),
        )

        # Extract content from response
        content = response.choices[0].message.content
        if not content:
            raise ValueError("Empty response from model")

        # Extract and parse JSON
        json_data = self._extract_json(content)
        if not json_data:
            raise ValueError("No valid JSON found in response")

        return JudgeAssessment(**json_data)

    async def assess(
        self,
        question: str,
        evidence: list[Evidence],
    ) -> JudgeAssessment:
        """
        Assess evidence using HuggingFace Inference API.
        Attempts models in order until one succeeds.
        """
        self.call_count += 1
        self.last_question = question
        self.last_evidence = evidence

        # Format the user prompt
        if evidence:
            user_prompt = format_user_prompt(question, evidence)
        else:
            user_prompt = format_empty_evidence_prompt(question)

        models_to_try: list[str] = [self.model_id] if self.model_id else self.FALLBACK_MODELS
        last_error: Exception | None = None

        for model in models_to_try:
            try:
                return await self._call_with_retry(model, user_prompt, question)
            except Exception as e:
                logger.warning("Model failed", model=model, error=str(e))
                last_error = e
                continue

        # All models failed
        logger.error("All HF models failed", error=str(last_error))
        return self._create_fallback_assessment(question, str(last_error))

    def _create_fallback_assessment(
        self,
        question: str,
        error: str,
    ) -> JudgeAssessment:
        """Create a fallback assessment when inference fails."""
        return JudgeAssessment(
            details=AssessmentDetails(
                mechanism_score=0,
                mechanism_reasoning=f"Assessment failed: {error}",
                clinical_evidence_score=0,
                clinical_reasoning=f"Assessment failed: {error}",
                drug_candidates=[],
                key_findings=[],
            ),
            sufficient=False,
            confidence=0.0,
            recommendation="continue",
            next_search_queries=[
                f"{question} mechanism",
                f"{question} clinical trials",
                f"{question} drug candidates",
            ],
            reasoning=f"HF Inference failed: {error}. Recommend retrying.",
        )
```

`MockJudgeHandler` is re-documented as a unit-testing-only helper ("Mock JudgeHandler for UNIT TESTING ONLY. NOT for production use. Use HFInferenceJudgeHandler for demo mode."), and its default response gains concrete placeholder values ("TestDrug", "Test finding", "Mock assessment for unit testing only").

The testing section gains a `TestHFInferenceJudgeHandler` suite covering raw JSON extraction, markdown code blocks, preamble text, nested braces, the fallback model chain (Zephyr last), and fallback to an ungated model on auth errors; `TestMockJudgeHandler` is relabeled "UNIT TESTING ONLY".

The dependency list adds `huggingface-hub>=0.20.0` (for `HFInferenceJudgeHandler`), with a note that it provides `InferenceClient`, auto-reads `HF_TOKEN` from the environment (optional, for gated models), and works without any token for ungated models like `zephyr-7b-beta`.
docs/implementation/04_phase_ui.md (+118 -28) — wire the free tier and BYOK into the Gradio app. `create_orchestrator()` now selects the judge by key priority and returns the backend name:

```python
from src.orchestrator import Orchestrator
from src.tools.pubmed import PubMedTool
from src.tools.clinicaltrials import ClinicalTrialsTool
from src.tools.biorxiv import BioRxivTool
from src.tools.search_handler import SearchHandler
from src.agent_factory.judges import JudgeHandler, HFInferenceJudgeHandler
from src.utils.models import OrchestratorConfig, AgentEvent


def create_orchestrator(
    user_api_key: str | None = None,
    api_provider: str = "openai",
) -> tuple[Orchestrator, str]:
    """
    Create an orchestrator instance.

    Args:
        user_api_key: Optional user-provided API key (BYOK)
        api_provider: API provider ("openai" or "anthropic")

    Returns:
        Tuple of (Configured Orchestrator instance, backend_name)

    Priority:
        1. User-provided API key -> JudgeHandler (OpenAI/Anthropic)
        2. Environment API key -> JudgeHandler (OpenAI/Anthropic)
        3. No key -> HFInferenceJudgeHandler (FREE, automatic fallback chain)

    HF Inference Fallback Chain:
        1. Llama 3.1 8B (requires HF_TOKEN for gated model)
        2. Mistral 7B (may require token)
        3. Zephyr 7B (ungated, always works)
    """
    import os

    # Create search tools
    search_handler = SearchHandler(
        tools=[PubMedTool(), ClinicalTrialsTool(), BioRxivTool()],
        timeout=30.0,
    )

    # Determine which judge to use
    has_env_key = bool(os.getenv("OPENAI_API_KEY") or os.getenv("ANTHROPIC_API_KEY"))
    has_user_key = bool(user_api_key)
    has_hf_token = bool(os.getenv("HF_TOKEN"))

    if has_user_key:
        # User provided their own key
        judge_handler = JudgeHandler(model=None)
        backend_name = f"your {api_provider.upper()} API key"
    elif has_env_key:
        # Environment has API key configured
        judge_handler = JudgeHandler(model=None)
        backend_name = "configured API key"
    else:
        # Use FREE HuggingFace Inference with automatic fallback
        judge_handler = HFInferenceJudgeHandler()
        if has_hf_token:
            backend_name = "HuggingFace Inference (Llama 3.1)"
        else:
            backend_name = "HuggingFace Inference (free tier)"

    # Create orchestrator
    config = OrchestratorConfig(...)
    return Orchestrator(
        search_handler=search_handler,
        judge_handler=judge_handler,
        config=config,
    ), backend_name
```

`research_agent()` gains `api_key: str = ""` and `api_provider: str = "openai"` parameters (BYOK), strips the user-provided key, builds the orchestrator via `create_orchestrator(...)`, and first yields a line telling the user which backend is in use (🤗 for HuggingFace Inference, 🔑 for an API key, plus a note suggesting an OpenAI/Anthropic key for premium analysis when no `HF_TOKEN` is set).

`gr.ChatInterface` gains `additional_inputs` (rendered in an accordion below the chat input): a Radio for orchestrator mode ("simple" / "magentic"), a password Textbox for an optional BYOK API key ("Enter your own API key for full AI analysis. Never stored."), and a Radio for the API provider ("openai" / "anthropic"); the examples are extended to match the new input signature. The Gradio-specific `retry_btn` / `undo_btn` / `clear_btn` arguments are removed.

The end-to-end test snippet adds `ClinicalTrialsTool` and `BioRxivTool` to the `SearchHandler` and uses `HFInferenceJudgeHandler()` by default, with `MockJudgeHandler` left commented out and a note added: `MockJudgeHandler` is for **unit testing only**; for actual demo/production use, always use `HFInferenceJudgeHandler` (free) or `JudgeHandler` (with API key).
docs/implementation/05_phase_magentic.md (+885 -432) — the Magentic integration spec was largely rewritten. The framing is kept ("Goal: Upgrade orchestrator to use Microsoft Agent Framework's Magentic-One pattern"; "Philosophy: 'Same API, Better Engine.'"; event streaming for real-time UI updates, multi-agent coordination with round limits and reset logic). The removed sections covered the original plan: an "only implement if time permits after Phase 4" framing, a protocol/facade design that wrapped the existing `SearchHandler` and `JudgeHandler` in custom `AgentProtocol` classes (`SearchAgent`, `JudgeAgent`) sharing a plain `evidence_store` dict, a `MagenticOrchestrator` built via `MagenticBuilder().participants(searcher=..., judge=...).with_standard_manager(chat_client=..., max_round_count=..., max_stall_count=3, max_reset_count=2)` that translated `MagenticOrchestratorMessageEvent`, `MagenticAgentMessageEvent`, `MagenticAgentDeltaEvent`, `MagenticFinalResultEvent`, and `WorkflowOutputEvent` into `AgentEvent`s, a factory `create_orchestrator(search_handler, judge_handler, config, mode)` with an ImportError fallback to the simple orchestrator, and success criteria such as "`MagenticOrchestrator` has same API as `Orchestrator`", mode switching via `create_orchestrator(search, judge, mode="magentic")` with `async for event in orchestrator.run("metformin alzheimer")`, and graceful fallback when `agent-framework` is not installed. The replacement spec follows the ChatAgent pattern summarized in the commit messages above: ChatAgent factories with internal `OpenAIChatClient` instances, AIFunction search/bibliography tools, and contextvar-based `MagenticState`.
|
| 607 |
|
| 608 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 609 |
|
| 610 |
-
|
| 611 |
|
| 612 |
-
|
| 613 |
|
| 614 |
-
|
| 615 |
-
1. ✅ Phase 1: Foundation
|
| 616 |
-
2. ✅ Phase 2: Search
|
| 617 |
-
3. ✅ Phase 3: Judge
|
| 618 |
-
4. ✅ Phase 4: Orchestrator + UI (MVP SHIPPED)
|
| 619 |
|
| 620 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 621 |
|
| 622 |
---
|
| 623 |
|
| 624 |
-
## 9.
|
| 625 |
|
| 626 |
-
|
| 627 |
-
|
| 628 |
-
|
| 629 |
-
|
| 630 |
-
|
| 631 |
|
| 632 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 633 |
|
| 634 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 635 |
|
| 636 |
-
|
| 637 |
-
|
| 638 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
# Phase 5 Implementation Spec: Magentic Integration

**Goal**: Upgrade orchestrator to use Microsoft Agent Framework's Magentic-One pattern.
**Philosophy**: "Same API, Better Engine."

- **Event streaming** for real-time UI updates
- **Multi-agent coordination** with round limits and reset logic

---
## 2. Critical Architecture Understanding

### 2.1 How Magentic Actually Works

```
┌──────────────────────────────────────────────────────────────────────┐
│                       MagenticBuilder Workflow                        │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  User Task: "Research drug repurposing for metformin alzheimer"     │
│        ↓                                                             │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │ StandardMagenticManager                                        │ │
│  │                                                                 │ │
│  │ 1. plan() → LLM generates facts & plan                         │ │
│  │ 2. create_progress_ledger() → LLM decides:                     │ │
│  │    - is_request_satisfied?                                     │ │
│  │    - next_speaker: "searcher"                                  │ │
│  │    - instruction_or_question: "Search for clinical trials..."  │ │
│  └────────────────────────────────────────────────────────────────┘ │
│        ↓                                                             │
│  NATURAL LANGUAGE INSTRUCTION sent to agent                          │
│  "Search for clinical trials about metformin..."                     │
│        ↓                                                             │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │ ChatAgent (searcher)                                            │ │
│  │                                                                 │ │
│  │ chat_client (INTERNAL LLM) ← understands instruction           │ │
│  │        ↓                                                        │ │
│  │ "I'll search for metformin alzheimer clinical trials"          │ │
│  │        ↓                                                        │ │
│  │ tools=[search_pubmed, search_clinicaltrials] ← calls tools     │ │
│  │        ↓                                                        │ │
│  │ Returns natural language response to manager                    │ │
│  └────────────────────────────────────────────────────────────────┘ │
│        ↓                                                             │
│  Manager evaluates response                                          │
│  Decides next agent or completion                                    │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
```
### 2.2 The Critical Insight

**Microsoft's ChatAgent has an INTERNAL LLM (`chat_client`) that:**
1. Receives natural language instructions from the manager
2. Understands what action to take
3. Calls attached tools (functions)
4. Returns natural language responses

**Our previous implementation was WRONG because:**
- We wrapped handlers as bare `BaseAgent` subclasses
- There was no internal LLM to understand instructions
- Raw instruction text was passed directly to APIs (PubMed doesn't understand "Search for clinical trials..."), as sketched below
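A minimal sketch of that broken shape, for contrast with the correct pattern in 2.3. The class and handler names here are illustrative only, not the actual old source:

```python
# ILLUSTRATIVE ONLY: the anti-pattern described above, not the real old code.
# A thin wrapper with no internal LLM forwards the manager's English
# instruction straight to a keyword-based search API.
class NaiveSearchAgent:
    def __init__(self, pubmed_tool):
        self._pubmed = pubmed_tool  # expects keyword queries, not prose

    async def run(self, instruction: str) -> str:
        # "Search for clinical trials about metformin..." is sent verbatim,
        # so PubMed receives an English sentence instead of search keywords.
        results = await self._pubmed.search(instruction, max_results=10)
        return f"Found {len(results)} results"
```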

### 2.3 Correct Pattern: ChatAgent with Tools

```python
# CORRECT: Agent backed by LLM that calls tools
from agent_framework import ChatAgent, AIFunction
from agent_framework.openai import OpenAIChatClient

# Define tool that ChatAgent can call
@AIFunction
async def search_pubmed(query: str, max_results: int = 10) -> str:
    """Search PubMed for biomedical literature.

    Args:
        query: Search keywords (e.g., "metformin alzheimer mechanism")
        max_results: Maximum number of results to return
    """
    result = await pubmed_tool.search(query, max_results)
    return format_results(result)

# ChatAgent with internal LLM + tools
search_agent = ChatAgent(
    name="SearchAgent",
    description="Searches biomedical databases for drug repurposing evidence",
    instructions="You search PubMed, ClinicalTrials.gov, and bioRxiv for evidence.",
    chat_client=OpenAIChatClient(model_id="gpt-4o-mini"),  # INTERNAL LLM
    tools=[search_pubmed, search_clinicaltrials, search_biorxiv],  # TOOLS
)
```

---

## 3. Correct Implementation

### 3.1 Shared State Module (`src/agents/state.py`)

**CRITICAL**: Tools must update shared state so:
1. EmbeddingService can deduplicate across searches
2. ReportAgent can access structured Evidence objects for citations
```python
"""Shared state for Magentic agents.

This module provides global state that tools update as a side effect.
ChatAgent tools return strings to the LLM, but also update this state
for semantic deduplication and structured citation access.
"""
from __future__ import annotations

from typing import TYPE_CHECKING

import structlog

if TYPE_CHECKING:
    from src.services.embeddings import EmbeddingService

from src.utils.models import Evidence

logger = structlog.get_logger()


class MagenticState:
    """Shared state container for Magentic workflow.

    Maintains:
    - evidence_store: All collected Evidence objects (for citations)
    - embedding_service: Optional semantic search (for deduplication)
    """

    def __init__(self) -> None:
        self.evidence_store: list[Evidence] = []
        self.embedding_service: EmbeddingService | None = None
        self._seen_urls: set[str] = set()

    def init_embedding_service(self) -> None:
        """Lazy-initialize embedding service if available."""
        if self.embedding_service is not None:
            return
        try:
            from src.services.embeddings import get_embedding_service
            self.embedding_service = get_embedding_service()
            logger.info("Embedding service enabled for Magentic mode")
        except Exception as e:
            logger.warning("Embedding service unavailable", error=str(e))

    async def add_evidence(self, evidence_list: list[Evidence]) -> list[Evidence]:
        """Add evidence with semantic deduplication.

        Args:
            evidence_list: New evidence from search

        Returns:
            List of unique evidence (not duplicates)
        """
        if not evidence_list:
            return []

        # URL-based deduplication first (fast)
        url_unique = [
            e for e in evidence_list
            if e.citation.url not in self._seen_urls
        ]

        # Semantic deduplication if available
        if self.embedding_service and url_unique:
            try:
                unique = await self.embedding_service.deduplicate(url_unique, threshold=0.85)
                logger.info(
                    "Semantic deduplication",
                    before=len(url_unique),
                    after=len(unique),
                )
            except Exception as e:
                logger.warning("Deduplication failed, using URL-based", error=str(e))
                unique = url_unique
        else:
            unique = url_unique

        # Update state
        for e in unique:
            self._seen_urls.add(e.citation.url)
            self.evidence_store.append(e)

        return unique

    async def search_related(self, query: str, n_results: int = 5) -> list[Evidence]:
        """Find semantically related evidence from vector store.

        Args:
            query: Search query
            n_results: Number of related items

        Returns:
            Related Evidence objects (reconstructed from vector store)
        """
        if not self.embedding_service:
            return []

        try:
            from src.utils.models import Citation

            related = await self.embedding_service.search_similar(query, n_results)
            evidence = []

            for item in related:
                if item["id"] in self._seen_urls:
                    continue  # Already in results

                meta = item.get("metadata", {})
                authors_str = meta.get("authors", "")
                authors = [a.strip() for a in authors_str.split(",") if a.strip()]

                ev = Evidence(
                    content=item["content"],
                    citation=Citation(
                        title=meta.get("title", "Related Evidence"),
                        url=item["id"],
                        source=meta.get("source", "pubmed"),
                        date=meta.get("date", "n.d."),
                        authors=authors,
                    ),
                    relevance=max(0.0, 1.0 - item.get("distance", 0.5)),
                )
                evidence.append(ev)

            return evidence
        except Exception as e:
            logger.warning("Related search failed", error=str(e))
            return []

    def reset(self) -> None:
        """Reset state for new workflow run."""
        self.evidence_store.clear()
        self._seen_urls.clear()


# Global singleton for workflow
_state: MagenticState | None = None


def get_magentic_state() -> MagenticState:
    """Get or create the global Magentic state."""
    global _state
    if _state is None:
        _state = MagenticState()
    return _state


def reset_magentic_state() -> None:
    """Reset state for a fresh workflow run."""
    global _state
    if _state is not None:
        _state.reset()
    else:
        _state = MagenticState()
```
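
A rough usage sketch of the deduplication flow above. The `Evidence` and `Citation` constructor arguments are assumed from how they are used elsewhere in this spec, not taken from `src/utils/models.py`, and the URL is a placeholder:

```python
# Illustrative only: two hits with the same URL collapse to one stored item.
import asyncio

from src.agents.state import get_magentic_state, reset_magentic_state
from src.utils.models import Citation, Evidence  # fields assumed from usage above


async def demo() -> None:
    reset_magentic_state()
    state = get_magentic_state()
    state.init_embedding_service()  # optional; URL dedup still works without it

    hit = Evidence(
        content="Metformin activates AMPK...",
        citation=Citation(
            title="Metformin and AMPK",
            url="https://pubmed.ncbi.nlm.nih.gov/12345/",  # placeholder
            source="pubmed",
            date="2021-01-01",
            authors=["Doe J"],
        ),
        relevance=0.9,
    )

    first = await state.add_evidence([hit])
    second = await state.add_evidence([hit])  # same URL -> filtered out
    print(len(first), len(second), len(state.evidence_store))  # 1 0 1


asyncio.run(demo())
```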

### 3.2 Tool Functions (`src/agents/tools.py`)

Tools call the APIs AND update shared state: they return strings to the LLM, but also store structured Evidence.

```python
"""Tool functions for Magentic agents.

IMPORTANT: These tools do TWO things:
1. Return formatted strings to the ChatAgent's internal LLM
2. Update shared state (evidence_store, embeddings) as a side effect

This preserves semantic deduplication and structured citation access.
"""
from agent_framework import AIFunction

from src.agents.state import get_magentic_state
from src.tools.biorxiv import BioRxivTool
from src.tools.clinicaltrials import ClinicalTrialsTool
from src.tools.pubmed import PubMedTool

# Singleton tool instances
_pubmed = PubMedTool()
_clinicaltrials = ClinicalTrialsTool()
_biorxiv = BioRxivTool()


def _format_results(results: list, source_name: str, query: str) -> str:
    """Format search results for LLM consumption."""
    if not results:
        return f"No {source_name} results found for: {query}"

    output = [f"Found {len(results)} {source_name} results:\n"]
    for i, r in enumerate(results[:10], 1):
        output.append(f"{i}. **{r.citation.title}**")
        output.append(f"   Source: {r.citation.source} | Date: {r.citation.date}")
        output.append(f"   {r.content[:300]}...")
        output.append(f"   URL: {r.citation.url}\n")

    return "\n".join(output)


@AIFunction
async def search_pubmed(query: str, max_results: int = 10) -> str:
    """Search PubMed for biomedical research papers.

    Use this tool to find peer-reviewed scientific literature about
    drugs, diseases, mechanisms of action, and clinical studies.

    Args:
        query: Search keywords (e.g., "metformin alzheimer mechanism")
        max_results: Maximum results to return (default 10)

    Returns:
        Formatted list of papers with titles, abstracts, and citations
    """
    # 1. Execute search
    results = await _pubmed.search(query, max_results)

    # 2. Update shared state (semantic dedup + evidence store)
    state = get_magentic_state()
    unique = await state.add_evidence(results)

    # 3. Also get related evidence from vector store
    related = await state.search_related(query, n_results=3)
    if related:
        await state.add_evidence(related)

    # 4. Return formatted string for LLM
    total_new = len(unique)
    total_stored = len(state.evidence_store)

    output = _format_results(results, "PubMed", query)
    output += f"\n[State: {total_new} new, {total_stored} total in evidence store]"

    if related:
        output += f"\n[Also found {len(related)} semantically related items from previous searches]"

    return output


@AIFunction
async def search_clinical_trials(query: str, max_results: int = 10) -> str:
    """Search ClinicalTrials.gov for clinical studies.

    Use this tool to find ongoing and completed clinical trials
    for drug repurposing candidates.

    Args:
        query: Search terms (e.g., "metformin cancer phase 3")
        max_results: Maximum results to return (default 10)

    Returns:
        Formatted list of clinical trials with status and details
    """
    # 1. Execute search
    results = await _clinicaltrials.search(query, max_results)

    # 2. Update shared state
    state = get_magentic_state()
    unique = await state.add_evidence(results)

    # 3. Return formatted string
    total_new = len(unique)
    total_stored = len(state.evidence_store)

    output = _format_results(results, "ClinicalTrials.gov", query)
    output += f"\n[State: {total_new} new, {total_stored} total in evidence store]"

    return output


@AIFunction
async def search_preprints(query: str, max_results: int = 10) -> str:
    """Search bioRxiv/medRxiv for preprint papers.

    Use this tool to find the latest research that hasn't been
    peer-reviewed yet. Good for cutting-edge findings.

    Args:
        query: Search terms (e.g., "long covid treatment")
        max_results: Maximum results to return (default 10)

    Returns:
        Formatted list of preprints with abstracts and links
    """
    # 1. Execute search
    results = await _biorxiv.search(query, max_results)

    # 2. Update shared state
    state = get_magentic_state()
    unique = await state.add_evidence(results)

    # 3. Return formatted string
    total_new = len(unique)
    total_stored = len(state.evidence_store)

    output = _format_results(results, "bioRxiv/medRxiv", query)
    output += f"\n[State: {total_new} new, {total_stored} total in evidence store]"

    return output


@AIFunction
async def get_evidence_summary() -> str:
    """Get summary of all collected evidence.

    Use this tool when you need to review what evidence has been collected
    before making an assessment or generating a report.

    Returns:
        Summary of evidence store with counts and key citations
    """
    state = get_magentic_state()
    evidence = state.evidence_store

    if not evidence:
        return "No evidence collected yet."

    # Group by source
    by_source: dict[str, list] = {}
    for e in evidence:
        src = e.citation.source
        if src not in by_source:
            by_source[src] = []
        by_source[src].append(e)

    output = [f"**Evidence Store Summary** ({len(evidence)} total items)\n"]

    for source, items in by_source.items():
        output.append(f"\n### {source.upper()} ({len(items)} items)")
        for e in items[:5]:  # First 5 per source
            output.append(f"- {e.citation.title[:80]}...")

    return "\n".join(output)


@AIFunction
async def get_bibliography() -> str:
    """Get full bibliography of all collected evidence.

    Use this tool when generating a final report to get properly
    formatted citations for all evidence.

    Returns:
        Numbered bibliography with full citation details
    """
    state = get_magentic_state()
    evidence = state.evidence_store

    if not evidence:
        return "No evidence collected for bibliography."

    output = ["## References\n"]

    for i, e in enumerate(evidence, 1):
        # Format: Authors (Year). Title. Source. URL
        authors = ", ".join(e.citation.authors[:3]) if e.citation.authors else "Unknown"
        if e.citation.authors and len(e.citation.authors) > 3:
            authors += " et al."

        year = e.citation.date[:4] if e.citation.date else "n.d."

        output.append(
            f"{i}. {authors} ({year}). {e.citation.title}. "
            f"*{e.citation.source.upper()}*. [{e.citation.url}]({e.citation.url})"
        )

    return "\n".join(output)
```

### 3.3 ChatAgent-Based Agents (`src/agents/magentic_agents.py`)

```python
"""Magentic-compatible agents using ChatAgent pattern."""
from agent_framework import ChatAgent
from agent_framework.openai import OpenAIChatClient

from src.agents.tools import (
    get_bibliography,
    get_evidence_summary,
    search_clinical_trials,
    search_preprints,
    search_pubmed,
)
from src.utils.config import settings


def create_search_agent(chat_client: OpenAIChatClient | None = None) -> ChatAgent:
    """Create a search agent with internal LLM and search tools.

    Args:
        chat_client: Optional custom chat client. If None, uses default.

    Returns:
        ChatAgent configured for biomedical search
    """
    client = chat_client or OpenAIChatClient(
        model_id="gpt-4o-mini",  # Fast, cheap for tool orchestration
        api_key=settings.openai_api_key,
    )

    return ChatAgent(
        name="SearchAgent",
        description="Searches biomedical databases (PubMed, ClinicalTrials.gov, bioRxiv) for drug repurposing evidence",
        instructions="""You are a biomedical search specialist. When asked to find evidence:

1. Analyze the request to determine what to search for
2. Extract key search terms (drug names, disease names, mechanisms)
3. Use the appropriate search tools:
   - search_pubmed for peer-reviewed papers
   - search_clinical_trials for clinical studies
   - search_preprints for cutting-edge findings
4. Summarize what you found and highlight key evidence

Be thorough - search multiple databases when appropriate.
Focus on finding: mechanisms of action, clinical evidence, and specific drug candidates.""",
        chat_client=client,
        tools=[search_pubmed, search_clinical_trials, search_preprints],
        temperature=0.3,  # More deterministic for tool use
    )


def create_judge_agent(chat_client: OpenAIChatClient | None = None) -> ChatAgent:
    """Create a judge agent that evaluates evidence quality.

    Args:
        chat_client: Optional custom chat client. If None, uses default.

    Returns:
        ChatAgent configured for evidence assessment
    """
    client = chat_client or OpenAIChatClient(
        model_id="gpt-4o",  # Better model for nuanced judgment
        api_key=settings.openai_api_key,
    )

    return ChatAgent(
        name="JudgeAgent",
        description="Evaluates evidence quality and determines if sufficient for synthesis",
        instructions="""You are an evidence quality assessor. When asked to evaluate:

1. First, call get_evidence_summary() to see all collected evidence
2. Score on two dimensions (0-10 each):
   - Mechanism Score: How well is the biological mechanism explained?
   - Clinical Score: How strong is the clinical/preclinical evidence?
3. Determine if evidence is SUFFICIENT for a final report:
   - Sufficient: Clear mechanism + supporting clinical data
   - Insufficient: Gaps in mechanism OR weak clinical evidence
4. If insufficient, suggest specific search queries to fill gaps

Be rigorous but fair. Look for:
- Molecular targets and pathways
- Animal model studies
- Human clinical trials
- Safety data
- Drug-drug interactions""",
        chat_client=client,
        tools=[get_evidence_summary],  # Can review collected evidence
        temperature=0.2,  # Consistent judgments
    )


def create_hypothesis_agent(chat_client: OpenAIChatClient | None = None) -> ChatAgent:
    """Create a hypothesis generation agent.

    Args:
        chat_client: Optional custom chat client. If None, uses default.

    Returns:
        ChatAgent configured for hypothesis generation
    """
    client = chat_client or OpenAIChatClient(
        model_id="gpt-4o",
        api_key=settings.openai_api_key,
    )

    return ChatAgent(
        name="HypothesisAgent",
        description="Generates mechanistic hypotheses for drug repurposing",
        instructions="""You are a biomedical hypothesis generator. Based on evidence:

1. Identify the key molecular targets involved
2. Map the biological pathways affected
3. Generate testable hypotheses in this format:

   DRUG → TARGET → PATHWAY → THERAPEUTIC EFFECT

   Example:
   Metformin → AMPK activation → mTOR inhibition → Reduced tau phosphorylation

4. Explain the rationale for each hypothesis
5. Suggest what additional evidence would support or refute it

Focus on mechanistic plausibility and existing evidence.""",
        chat_client=client,
        temperature=0.5,  # Some creativity for hypothesis generation
    )


def create_report_agent(chat_client: OpenAIChatClient | None = None) -> ChatAgent:
    """Create a report synthesis agent.

    Args:
        chat_client: Optional custom chat client. If None, uses default.

    Returns:
        ChatAgent configured for report generation
    """
    client = chat_client or OpenAIChatClient(
        model_id="gpt-4o",
        api_key=settings.openai_api_key,
    )

    return ChatAgent(
        name="ReportAgent",
        description="Synthesizes research findings into structured reports",
        instructions="""You are a scientific report writer. When asked to synthesize:

1. First, call get_evidence_summary() to review all collected evidence
2. Then call get_bibliography() to get properly formatted citations

Generate a structured report with these sections:

## Executive Summary
Brief overview of findings and recommendation

## Methodology
Databases searched, queries used, evidence reviewed

## Key Findings
### Mechanism of Action
- Molecular targets
- Biological pathways
- Proposed mechanism

### Clinical Evidence
- Preclinical studies
- Clinical trials
- Safety profile

## Drug Candidates
List specific drugs with repurposing potential

## Limitations
Gaps in evidence, conflicting data, caveats

## Conclusion
Final recommendation with confidence level

## References
Use the output from get_bibliography() - do not make up citations!

Be comprehensive but concise. Cite evidence for all claims.""",
        chat_client=client,
        tools=[get_evidence_summary, get_bibliography],  # Access to collected evidence
        temperature=0.3,
    )
```
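
Because every factory accepts an optional `chat_client`, a single client can be shared across all four agents instead of each one building its own default; a small sketch (the model choice here is illustrative):

```python
# Illustrative: reuse one OpenAI client for all agents.
from agent_framework.openai import OpenAIChatClient

from src.agents.magentic_agents import (
    create_hypothesis_agent,
    create_judge_agent,
    create_report_agent,
    create_search_agent,
)
from src.utils.config import settings

shared_client = OpenAIChatClient(model_id="gpt-4o-mini", api_key=settings.openai_api_key)

agents = {
    "searcher": create_search_agent(shared_client),
    "hypothesizer": create_hypothesis_agent(shared_client),
    "judge": create_judge_agent(shared_client),
    "reporter": create_report_agent(shared_client),
}
```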

### 3.4 Magentic Orchestrator (`src/orchestrator_magentic.py`)

```python
"""Magentic-based orchestrator using ChatAgent pattern."""
from collections.abc import AsyncGenerator
from typing import Any

import structlog
from agent_framework import (
    MagenticAgentDeltaEvent,
    MagenticAgentMessageEvent,
    MagenticBuilder,
    MagenticFinalResultEvent,
    MagenticOrchestratorMessageEvent,
    WorkflowOutputEvent,
)
from agent_framework.openai import OpenAIChatClient

from src.agents.magentic_agents import (
    create_hypothesis_agent,
    create_judge_agent,
    create_report_agent,
    create_search_agent,
)
from src.agents.state import get_magentic_state, reset_magentic_state
from src.utils.config import settings
from src.utils.exceptions import ConfigurationError
from src.utils.models import AgentEvent

logger = structlog.get_logger()


class MagenticOrchestrator:
    """
    Magentic-based orchestrator using ChatAgent pattern.

    Each agent has an internal LLM that understands natural language
    instructions from the manager and can call tools appropriately.
    """

    def __init__(
        self,
        max_rounds: int = 10,
        chat_client: OpenAIChatClient | None = None,
    ) -> None:
        """Initialize orchestrator.

        Args:
            max_rounds: Maximum coordination rounds
            chat_client: Optional shared chat client for agents
        """
        if not settings.openai_api_key:
            raise ConfigurationError(
                "Magentic mode requires OPENAI_API_KEY. "
                "Set the key or use mode='simple'."
            )

        self._max_rounds = max_rounds
        self._chat_client = chat_client

    def _build_workflow(self) -> Any:
        """Build the Magentic workflow with ChatAgent participants."""
        # Create agents with internal LLMs
        search_agent = create_search_agent(self._chat_client)
        judge_agent = create_judge_agent(self._chat_client)
        hypothesis_agent = create_hypothesis_agent(self._chat_client)
        report_agent = create_report_agent(self._chat_client)

        # Manager chat client (orchestrates the agents)
        manager_client = OpenAIChatClient(
            model_id="gpt-4o",  # Good model for planning/coordination
            api_key=settings.openai_api_key,
        )

        return (
            MagenticBuilder()
            .participants(
                searcher=search_agent,
                hypothesizer=hypothesis_agent,
                judge=judge_agent,
                reporter=report_agent,
            )
            .with_standard_manager(
                chat_client=manager_client,
                max_round_count=self._max_rounds,
                max_stall_count=3,
                max_reset_count=2,
            )
            .build()
        )

    async def run(self, query: str) -> AsyncGenerator[AgentEvent, None]:
        """
        Run the Magentic workflow.

        Args:
            query: User's research question

        Yields:
            AgentEvent objects for real-time UI updates
        """
        logger.info("Starting Magentic orchestrator", query=query)

        # CRITICAL: Reset state for fresh workflow run
        reset_magentic_state()

        # Initialize embedding service if available
        state = get_magentic_state()
        state.init_embedding_service()

        yield AgentEvent(
            type="started",
            message=f"Starting research (Magentic mode): {query}",
            iteration=0,
        )

        workflow = self._build_workflow()

        task = f"""Research drug repurposing opportunities for: {query}

Workflow:
1. SearchAgent: Find evidence from PubMed, ClinicalTrials.gov, and bioRxiv
2. HypothesisAgent: Generate mechanistic hypotheses (Drug → Target → Pathway → Effect)
3. JudgeAgent: Evaluate if evidence is sufficient
4. If insufficient → SearchAgent refines search based on gaps
5. If sufficient → ReportAgent synthesizes final report

Focus on:
- Identifying specific molecular targets
- Understanding mechanism of action
- Finding clinical evidence supporting hypotheses

The final output should be a structured research report."""

        iteration = 0
        try:
            async for event in workflow.run_stream(task):
                agent_event = self._process_event(event, iteration)
                if agent_event:
                    if isinstance(event, MagenticAgentMessageEvent):
                        iteration += 1
                    yield agent_event

        except Exception as e:
            logger.error("Magentic workflow failed", error=str(e))
            yield AgentEvent(
                type="error",
                message=f"Workflow error: {e!s}",
                iteration=iteration,
            )

    def _process_event(self, event: Any, iteration: int) -> AgentEvent | None:
        """Process workflow event into AgentEvent."""
        if isinstance(event, MagenticOrchestratorMessageEvent):
            text = event.message.text if event.message else ""
            if text:
                return AgentEvent(
                    type="judging",
                    message=f"Manager ({event.kind}): {text[:200]}...",
                    iteration=iteration,
                )

        elif isinstance(event, MagenticAgentMessageEvent):
            agent_name = event.agent_id or "unknown"
            text = event.message.text if event.message else ""

            event_type = "judging"
            if "search" in agent_name.lower():
                event_type = "search_complete"
            elif "judge" in agent_name.lower():
                event_type = "judge_complete"
            elif "hypothes" in agent_name.lower():
                event_type = "hypothesizing"
            elif "report" in agent_name.lower():
                event_type = "synthesizing"

            return AgentEvent(
                type=event_type,
                message=f"{agent_name}: {text[:200]}...",
                iteration=iteration + 1,
            )

        elif isinstance(event, MagenticFinalResultEvent):
            text = event.message.text if event.message else "No result"
            return AgentEvent(
                type="complete",
                message=text,
                data={"iterations": iteration},
                iteration=iteration,
            )

        elif isinstance(event, MagenticAgentDeltaEvent):
            if event.text:
                return AgentEvent(
                    type="streaming",
                    message=event.text,
                    data={"agent_id": event.agent_id},
                    iteration=iteration,
                )

        elif isinstance(event, WorkflowOutputEvent):
            if event.data:
                return AgentEvent(
                    type="complete",
                    message=str(event.data),
                    iteration=iteration,
                )

        return None
```
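
Given the `run()` signature above, consuming the stream directly (without the factory) might look like the snippet below; it assumes `OPENAI_API_KEY` is set, since `__init__` raises `ConfigurationError` otherwise:

```python
# Illustrative direct usage of MagenticOrchestrator.run()
import asyncio

from src.orchestrator_magentic import MagenticOrchestrator


async def main() -> None:
    orchestrator = MagenticOrchestrator(max_rounds=5)
    async for event in orchestrator.run("metformin alzheimer"):
        # AgentEvent fields used here (type, message) appear throughout this spec
        print(f"[{event.type}] {event.message[:120]}")


asyncio.run(main())
```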

### 3.5 Updated Factory (`src/orchestrator_factory.py`)

```python
"""Factory for creating orchestrators."""
from typing import Any, Literal

from src.orchestrator import JudgeHandlerProtocol, Orchestrator, SearchHandlerProtocol
from src.utils.models import OrchestratorConfig


def create_orchestrator(
    search_handler: SearchHandlerProtocol | None = None,
    judge_handler: JudgeHandlerProtocol | None = None,
    config: OrchestratorConfig | None = None,
    mode: Literal["simple", "magentic"] = "simple",
) -> Any:
    """
    Create an orchestrator instance.

    Args:
        search_handler: The search handler (required for simple mode)
        judge_handler: The judge handler (required for simple mode)
        config: Optional configuration
        mode: "simple" for Phase 4 loop, "magentic" for ChatAgent-based multi-agent

    Returns:
        Orchestrator instance

    Note:
        Magentic mode does NOT use search_handler/judge_handler.
        It creates ChatAgent instances with internal LLMs that call tools directly.
    """
    if mode == "magentic":
        try:
            from src.orchestrator_magentic import MagenticOrchestrator

            return MagenticOrchestrator(
                max_rounds=config.max_iterations if config else 10,
            )
        except ImportError:
            # Fallback to simple if agent-framework not installed
            pass

    # Simple mode requires handlers
    if search_handler is None or judge_handler is None:
        raise ValueError("Simple mode requires search_handler and judge_handler")

    return Orchestrator(
        search_handler=search_handler,
        judge_handler=judge_handler,
```

---

## 4. Why This Works

### 4.1 The Manager → Agent Communication

```
Manager LLM decides: "Tell SearchAgent to find clinical trials for metformin"
    ↓
Sends instruction: "Search for clinical trials about metformin and cancer"
    ↓
SearchAgent's INTERNAL LLM receives this
    ↓
Internal LLM understands: "I should call search_clinical_trials('metformin cancer')"
    ↓
Tool executes: ClinicalTrials.gov API
    ↓
Internal LLM formats response: "I found 15 trials. Here are the key ones..."
    ↓
Manager receives natural language response
```

### 4.2 Why Our Old Implementation Failed

```
Manager sends: "Search for clinical trials about metformin..."
    ↓
OLD SearchAgent.run() extracts: query = "Search for clinical trials about metformin..."
    ↓
Passes to PubMed: pubmed.search("Search for clinical trials about metformin...")
    ↓
PubMed doesn't understand English instructions → garbage results or error
```

---
## 5. Directory Structure

```text
src/
├── agents/
│   ├── __init__.py
│   ├── state.py               # MagenticState (evidence_store + embeddings)
│   ├── tools.py               # AIFunction tool definitions (update state)
│   └── magentic_agents.py     # ChatAgent factory functions
├── services/
│   └── embeddings.py          # EmbeddingService (semantic dedup)
├── orchestrator.py            # Simple mode (unchanged)
├── orchestrator_magentic.py   # Magentic mode with ChatAgents
└── orchestrator_factory.py    # Mode selection
```

---

## 6. Dependencies

```toml
[project.optional-dependencies]
magentic = [
    "agent-framework-core>=1.0.0b",
    "agent-framework-openai>=1.0.0b",  # For OpenAIChatClient
]
embeddings = [
    "chromadb>=0.4.0",
    "sentence-transformers>=2.2.0",
]
```
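
Assuming these extras live in this repo's `pyproject.toml`, one way to pull them in locally is the standard extras syntax; the exact commands below are illustrative, not taken from the repo docs:

```bash
# Editable install with the optional extras defined above
uv pip install -e ".[magentic,embeddings]"
# or, with plain pip
pip install -e ".[magentic,embeddings]"
```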

**IMPORTANT: Magentic mode REQUIRES an OpenAI API key.**

The Microsoft Agent Framework's standard manager and ChatAgent use OpenAIChatClient internally.
There is no AnthropicChatClient in the framework. If only `ANTHROPIC_API_KEY` is set:
- `mode="simple"` works fine
- `mode="magentic"` throws `ConfigurationError`

This is enforced in `MagenticOrchestrator.__init__`.

---

## 7. Implementation Checklist

- [ ] Create `src/agents/state.py` with the MagenticState class
- [ ] Create `src/agents/tools.py` with AIFunction search tools + state updates
- [ ] Create `src/agents/magentic_agents.py` with ChatAgent factories
- [ ] Rewrite `src/orchestrator_magentic.py` to use the ChatAgent pattern
- [ ] Update `src/orchestrator_factory.py` for the new signature
- [ ] Test with the real OpenAI API
- [ ] Verify the manager properly coordinates agents
- [ ] Ensure tools are called with correct parameters
- [ ] Verify semantic deduplication works (evidence_store populates)
- [ ] Verify bibliography generation in final reports

---

## 8. Definition of Done

Phase 5 is **COMPLETE** when:

1. Magentic mode runs without hanging
2. The manager successfully coordinates agents via natural language
3. SearchAgent calls tools with proper search keywords (not raw instructions)
4. JudgeAgent evaluates evidence from conversation history
5. ReportAgent generates a structured final report
6. Events stream to the UI correctly

---

## 9. Testing Magentic Mode

```bash
# Test with real API
OPENAI_API_KEY=sk-... uv run python -c "
import asyncio
from src.orchestrator_factory import create_orchestrator

async def test():
    orch = create_orchestrator(mode='magentic')
    async for event in orch.run('metformin alzheimer'):
        print(f'[{event.type}] {event.message[:100]}')

asyncio.run(test())
"
```

Expected output:
```
[started] Starting research (Magentic mode): metformin alzheimer
[judging] Manager (plan): I will coordinate the agents to research...
[search_complete] SearchAgent: Found 25 PubMed results for metformin alzheimer...
[hypothesizing] HypothesisAgent: Based on the evidence, I propose...
[judge_complete] JudgeAgent: Mechanism Score: 7/10, Clinical Score: 6/10...
[synthesizing] ReportAgent: ## Executive Summary...
[complete] <full research report>
```

---

## 10. Key Differences from Old Spec

| Aspect | OLD (Wrong) | NEW (Correct) |
|--------|-------------|---------------|
| Agent type | `BaseAgent` subclass | `ChatAgent` with `chat_client` |
| Internal LLM | None | OpenAIChatClient |
| How tools work | Handler.execute(raw_instruction) | LLM understands instruction, calls AIFunction |
| Message handling | Extract text → pass to API | LLM interprets → extracts keywords → calls tool |
| State management | Passed to agent constructors | Global MagenticState singleton |
| Evidence storage | In agent instance | In MagenticState.evidence_store |
| Semantic search | Coupled to agents | Tools call state.add_evidence() |
| Citations for report | From agent's store | Via get_bibliography() tool |

**Key Insights:**
1. Magentic agents must have internal LLMs to understand natural language instructions
2. Tools must update shared state as a side effect (return strings, but also store Evidence)
3. ReportAgent uses the `get_bibliography()` tool to access structured citations
4. State is reset at the start of each workflow run via `reset_magentic_state()`

```diff
@@ -16,6 +16,7 @@ dependencies = [
     "httpx>=0.27",             # Async HTTP client (PubMed)
     "beautifulsoup4>=4.12",    # HTML parsing
     "xmltodict>=0.13",         # PubMed XML -> dict
+    "huggingface-hub>=0.20.0", # Hugging Face Inference API
     # UI
     "gradio[mcp]>=6.0.0",      # Chat interface with MCP server support (6.0 required for css in launch())
     # Utils
@@ -42,7 +43,7 @@ dev = [
     "pre-commit>=3.7",
 ]
 magentic = [
-    "agent-framework-core",
+    "agent-framework-core>=1.0.0b251120,<2.0.0",  # Pin to avoid breaking changes
 ]
 embeddings = [
     "chromadb>=0.4.0",
@@ -132,5 +133,5 @@ exclude_lines = [
     "raise NotImplementedError",
 ]
 
-# Note: agent-framework-core is optional
-# CI skips tests
+# Note: agent-framework-core is optional for magentic mode (multi-agent orchestration)
+# Version pinned to 1.0.0b* to avoid breaking changes. CI skips tests via pytest.importorskip
```

```diff
@@ -1,13 +1,17 @@
 """Judge handler for evidence assessment using PydanticAI."""
 
+import asyncio
+import json
+from typing import Any, ClassVar
 
 import structlog
+from huggingface_hub import InferenceClient
 from pydantic_ai import Agent
 from pydantic_ai.models.anthropic import AnthropicModel
 from pydantic_ai.models.openai import OpenAIModel
 from pydantic_ai.providers.anthropic import AnthropicProvider
 from pydantic_ai.providers.openai import OpenAIProvider
+from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential
 
 from src.prompts.judge import (
     SYSTEM_PROMPT,
@@ -146,6 +150,207 @@ class JudgeHandler:
     )
 
 
+class HFInferenceJudgeHandler:
+    """
+    JudgeHandler using HuggingFace Inference API for FREE LLM calls.
+    Defaults to Llama-3.1-8B-Instruct (requires HF_TOKEN) or falls back to public models.
+    """
+
+    FALLBACK_MODELS: ClassVar[list[str]] = [
+        "meta-llama/Llama-3.1-8B-Instruct",    # Primary (Gated)
+        "mistralai/Mistral-7B-Instruct-v0.3",  # Secondary
+        "HuggingFaceH4/zephyr-7b-beta",        # Fallback (Ungated)
+    ]
+
+    def __init__(self, model_id: str | None = None) -> None:
+        """
+        Initialize with HF Inference client.
+
+        Args:
+            model_id: Optional specific model ID. If None, uses FALLBACK_MODELS chain.
+        """
+        self.model_id = model_id
+        # Will automatically use HF_TOKEN from env if available
+        self.client = InferenceClient()
+        self.call_count = 0
+        self.last_question: str | None = None
+        self.last_evidence: list[Evidence] | None = None
+
+    async def assess(
+        self,
+        question: str,
+        evidence: list[Evidence],
+    ) -> JudgeAssessment:
+        """
+        Assess evidence using HuggingFace Inference API.
+        Attempts models in order until one succeeds.
+        """
+        self.call_count += 1
+        self.last_question = question
+        self.last_evidence = evidence
+
+        # Format the user prompt
+        if evidence:
+            user_prompt = format_user_prompt(question, evidence)
+        else:
+            user_prompt = format_empty_evidence_prompt(question)
+
+        models_to_try: list[str] = [self.model_id] if self.model_id else self.FALLBACK_MODELS
+        last_error: Exception | None = None
+
+        for model in models_to_try:
+            try:
+                return await self._call_with_retry(model, user_prompt, question)
+            except Exception as e:
+                logger.warning("Model failed", model=model, error=str(e))
+                last_error = e
+                continue
+
+        # All models failed
+        logger.error("All HF models failed", error=str(last_error))
+        return self._create_fallback_assessment(question, str(last_error))
+
+    @retry(
+        stop=stop_after_attempt(3),
+        wait=wait_exponential(multiplier=1, min=1, max=4),
+        retry=retry_if_exception_type(Exception),
+        reraise=True,
+    )
+    async def _call_with_retry(self, model: str, prompt: str, question: str) -> JudgeAssessment:
+        """Make API call with retry logic using chat_completion."""
+        loop = asyncio.get_running_loop()
+
+        # Build messages for chat_completion (model-agnostic)
+        messages = [
+            {
+                "role": "system",
+                "content": f"""{SYSTEM_PROMPT}
+
+IMPORTANT: Respond with ONLY valid JSON matching this schema:
+{{
+  "details": {{
+    "mechanism_score": <int 0-10>,
+    "mechanism_reasoning": "<string>",
+    "clinical_evidence_score": <int 0-10>,
+    "clinical_reasoning": "<string>",
+    "drug_candidates": ["<string>", ...],
+    "key_findings": ["<string>", ...]
+  }},
+  "sufficient": <bool>,
+  "confidence": <float 0-1>,
+  "recommendation": "continue" | "synthesize",
+  "next_search_queries": ["<string>", ...],
+  "reasoning": "<string>"
+}}""",
+            },
+            {"role": "user", "content": prompt},
+        ]
+
+        # Use chat_completion (conversational task - supported by all models)
+        response = await loop.run_in_executor(
+            None,
+            lambda: self.client.chat_completion(
+                messages=messages,
+                model=model,
+                max_tokens=1024,
+                temperature=0.1,
+            ),
+        )
+
+        # Extract content from response
+        content = response.choices[0].message.content
+        if not content:
+            raise ValueError("Empty response from model")
+
+        # Extract and parse JSON
+        json_data = self._extract_json(content)
+        if not json_data:
+            raise ValueError("No valid JSON found in response")
+
+        return JudgeAssessment(**json_data)
+
+    def _extract_json(self, text: str) -> dict[str, Any] | None:
+        """
+        Robust JSON extraction that handles markdown blocks and nested braces.
+        """
+        text = text.strip()
+
+        # Remove markdown code blocks if present (with bounds checking)
+        if "```json" in text:
+            parts = text.split("```json", 1)
+            if len(parts) > 1:
+                inner_parts = parts[1].split("```", 1)
+                text = inner_parts[0]
+        elif "```" in text:
+            parts = text.split("```", 1)
+            if len(parts) > 1:
+                inner_parts = parts[1].split("```", 1)
+                text = inner_parts[0]
+
+        text = text.strip()
+
+        # Find first '{'
+        start_idx = text.find("{")
+        if start_idx == -1:
+            return None
+
+        # Stack-based parsing ignoring chars in strings
+        count = 0
+        in_string = False
+        escape = False
+
+        for i, char in enumerate(text[start_idx:], start=start_idx):
+            if in_string:
+                if escape:
+                    escape = False
+                elif char == "\\":
+                    escape = True
+                elif char == '"':
+                    in_string = False
+            elif char == '"':
+                in_string = True
+            elif char == "{":
+                count += 1
+            elif char == "}":
+                count -= 1
+                if count == 0:
+                    try:
+                        result = json.loads(text[start_idx : i + 1])
+                        if isinstance(result, dict):
+                            return result
+                        return None
+                    except json.JSONDecodeError:
+                        return None
+
+        return None
+
+    def _create_fallback_assessment(
+        self,
+        question: str,
+        error: str,
+    ) -> JudgeAssessment:
+        """Create a fallback assessment when inference fails."""
+        return JudgeAssessment(
+            details=AssessmentDetails(
+                mechanism_score=0,
+                mechanism_reasoning=f"Assessment failed: {error}",
+                clinical_evidence_score=0,
+                clinical_reasoning=f"Assessment failed: {error}",
+                drug_candidates=[],
+                key_findings=[],
+            ),
+            sufficient=False,
+            confidence=0.0,
+            recommendation="continue",
+            next_search_queries=[
+                f"{question} mechanism",
+                f"{question} clinical trials",
+                f"{question} drug candidates",
+            ],
+            reasoning=f"HF Inference failed: {error}. Recommend configuring OpenAI/Anthropic key.",
+        )
+
+
 class MockJudgeHandler:
     """
     Mock JudgeHandler for demo mode without LLM calls.
```
|
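Because the fallback chain above needs no paid key, the handler can be exercised end to end on the free tier. A minimal usage sketch (the question, evidence content, and URL are illustrative; it assumes HF_TOKEN is exported for the gated Llama model and that src/ is importable):

import asyncio

from src.agent_factory.judges import HFInferenceJudgeHandler
from src.utils.models import Citation, Evidence


async def main() -> None:
    handler = HFInferenceJudgeHandler()  # no model_id -> walks FALLBACK_MODELS in order
    evidence = [
        Evidence(
            content="Metformin activates AMPK and downregulates mTOR signalling.",  # illustrative
            citation=Citation(source="pubmed", title="Illustrative abstract", url="https://example.org/pmid", date="2024"),
        )
    ]
    assessment = await handler.assess("Can metformin be repurposed for Alzheimer's disease?", evidence)
    print(assessment.recommendation, assessment.confidence, assessment.details.drug_candidates)


asyncio.run(main())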
@@ -0,0 +1,184 @@
"""Magentic-compatible agents using ChatAgent pattern."""

from agent_framework import ChatAgent
from agent_framework.openai import OpenAIChatClient

from src.agents.tools import (
    get_bibliography,
    search_clinical_trials,
    search_preprints,
    search_pubmed,
)
from src.utils.config import settings


def create_search_agent(chat_client: OpenAIChatClient | None = None) -> ChatAgent:
    """Create a search agent with internal LLM and search tools.

    Args:
        chat_client: Optional custom chat client. If None, uses default.

    Returns:
        ChatAgent configured for biomedical search
    """
    client = chat_client or OpenAIChatClient(
        model_id=settings.openai_model,  # Use configured model
        api_key=settings.openai_api_key,
    )

    return ChatAgent(
        name="SearchAgent",
        description=(
            "Searches biomedical databases (PubMed, ClinicalTrials.gov, bioRxiv) "
            "for drug repurposing evidence"
        ),
        instructions="""You are a biomedical search specialist. When asked to find evidence:

1. Analyze the request to determine what to search for
2. Extract key search terms (drug names, disease names, mechanisms)
3. Use the appropriate search tools:
   - search_pubmed for peer-reviewed papers
   - search_clinical_trials for clinical studies
   - search_preprints for cutting-edge findings
4. Summarize what you found and highlight key evidence

Be thorough - search multiple databases when appropriate.
Focus on finding: mechanisms of action, clinical evidence, and specific drug candidates.""",
        chat_client=client,
        tools=[search_pubmed, search_clinical_trials, search_preprints],
        temperature=0.3,  # More deterministic for tool use
    )


def create_judge_agent(chat_client: OpenAIChatClient | None = None) -> ChatAgent:
    """Create a judge agent that evaluates evidence quality.

    Args:
        chat_client: Optional custom chat client. If None, uses default.

    Returns:
        ChatAgent configured for evidence assessment
    """
    client = chat_client or OpenAIChatClient(
        model_id=settings.openai_model,
        api_key=settings.openai_api_key,
    )

    return ChatAgent(
        name="JudgeAgent",
        description="Evaluates evidence quality and determines if sufficient for synthesis",
        instructions="""You are an evidence quality assessor. When asked to evaluate:

1. Review all evidence presented in the conversation
2. Score on two dimensions (0-10 each):
   - Mechanism Score: How well is the biological mechanism explained?
   - Clinical Score: How strong is the clinical/preclinical evidence?
3. Determine if evidence is SUFFICIENT for a final report:
   - Sufficient: Clear mechanism + supporting clinical data
   - Insufficient: Gaps in mechanism OR weak clinical evidence
4. If insufficient, suggest specific search queries to fill gaps

Be rigorous but fair. Look for:
- Molecular targets and pathways
- Animal model studies
- Human clinical trials
- Safety data
- Drug-drug interactions""",
        chat_client=client,
        temperature=0.2,  # Consistent judgments
    )


def create_hypothesis_agent(chat_client: OpenAIChatClient | None = None) -> ChatAgent:
    """Create a hypothesis generation agent.

    Args:
        chat_client: Optional custom chat client. If None, uses default.

    Returns:
        ChatAgent configured for hypothesis generation
    """
    client = chat_client or OpenAIChatClient(
        model_id=settings.openai_model,
        api_key=settings.openai_api_key,
    )

    return ChatAgent(
        name="HypothesisAgent",
        description="Generates mechanistic hypotheses for drug repurposing",
        instructions="""You are a biomedical hypothesis generator. Based on evidence:

1. Identify the key molecular targets involved
2. Map the biological pathways affected
3. Generate testable hypotheses in this format:

   DRUG -> TARGET -> PATHWAY -> THERAPEUTIC EFFECT

   Example:
   Metformin -> AMPK activation -> mTOR inhibition -> Reduced tau phosphorylation

4. Explain the rationale for each hypothesis
5. Suggest what additional evidence would support or refute it

Focus on mechanistic plausibility and existing evidence.""",
        chat_client=client,
        temperature=0.5,  # Some creativity for hypothesis generation
    )


def create_report_agent(chat_client: OpenAIChatClient | None = None) -> ChatAgent:
    """Create a report synthesis agent.

    Args:
        chat_client: Optional custom chat client. If None, uses default.

    Returns:
        ChatAgent configured for report generation
    """
    client = chat_client or OpenAIChatClient(
        model_id=settings.openai_model,
        api_key=settings.openai_api_key,
    )

    return ChatAgent(
        name="ReportAgent",
        description="Synthesizes research findings into structured reports",
        instructions="""You are a scientific report writer. When asked to synthesize:

Generate a structured report with these sections:

## Executive Summary
Brief overview of findings and recommendation

## Methodology
Databases searched, queries used, evidence reviewed

## Key Findings
### Mechanism of Action
- Molecular targets
- Biological pathways
- Proposed mechanism

### Clinical Evidence
- Preclinical studies
- Clinical trials
- Safety profile

## Drug Candidates
List specific drugs with repurposing potential

## Limitations
Gaps in evidence, conflicting data, caveats

## Conclusion
Final recommendation with confidence level

## References
Use the 'get_bibliography' tool to fetch the complete list of citations.
Format them as a numbered list.

Be comprehensive but concise. Cite evidence for all claims.""",
        chat_client=client,
        tools=[get_bibliography],
        temperature=0.3,
    )
@@ -0,0 +1,90 @@
"""Thread-safe state management for Magentic agents.

Uses contextvars to ensure isolation between concurrent requests (e.g., multiple users
searching simultaneously via Gradio).
"""

from contextvars import ContextVar
from typing import TYPE_CHECKING, Any

from pydantic import BaseModel, Field

from src.utils.models import Citation, Evidence

if TYPE_CHECKING:
    from src.services.embeddings import EmbeddingService


class MagenticState(BaseModel):
    """Mutable state for a Magentic workflow session."""

    evidence: list[Evidence] = Field(default_factory=list)
    # Type as Any to avoid circular imports/runtime resolution issues
    # The actual object injected will be an EmbeddingService instance
    embedding_service: Any = None

    model_config = {"arbitrary_types_allowed": True}

    def add_evidence(self, new_evidence: list[Evidence]) -> int:
        """Add new evidence, deduplicating by URL.

        Returns:
            Number of *new* items added.
        """
        existing_urls = {e.citation.url for e in self.evidence}
        count = 0
        for item in new_evidence:
            if item.citation.url not in existing_urls:
                self.evidence.append(item)
                existing_urls.add(item.citation.url)
                count += 1
        return count

    async def search_related(self, query: str, n_results: int = 5) -> list[Evidence]:
        """Search for semantically related evidence using the embedding service."""
        if not self.embedding_service:
            return []

        results = await self.embedding_service.search_similar(query, n_results=n_results)

        # Convert dict results back to Evidence objects
        evidence_list = []
        for item in results:
            meta = item.get("metadata", {})
            authors_str = meta.get("authors", "")
            authors = [a.strip() for a in authors_str.split(",") if a.strip()]

            ev = Evidence(
                content=item["content"],
                citation=Citation(
                    title=meta.get("title", "Related Evidence"),
                    url=item["id"],
                    source="pubmed",  # Defaulting to pubmed if unknown
                    date=meta.get("date", "n.d."),
                    authors=authors,
                ),
                relevance=max(0.0, 1.0 - item.get("distance", 0.5)),
            )
            evidence_list.append(ev)

        return evidence_list


# The ContextVar holds the MagenticState for the current execution context
_magentic_state_var: ContextVar[MagenticState | None] = ContextVar("magentic_state", default=None)


def init_magentic_state(embedding_service: "EmbeddingService | None" = None) -> MagenticState:
    """Initialize a new state for the current context."""
    state = MagenticState(embedding_service=embedding_service)
    _magentic_state_var.set(state)
    return state


def get_magentic_state() -> MagenticState:
    """Get the current state. Raises RuntimeError if not initialized."""
    state = _magentic_state_var.get()
    if state is None:
        # Auto-initialize if missing (e.g. during tests or simple scripts)
        return init_magentic_state()
    return state
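Because the state lives in a ContextVar, each asyncio task (and therefore each concurrent Gradio request) sees its own MagenticState. A small sketch of that isolation using only the functions defined above (evidence values are illustrative):

import asyncio

from src.agents.state import get_magentic_state, init_magentic_state
from src.utils.models import Citation, Evidence


async def session(label: str) -> int:
    init_magentic_state()  # fresh state bound to this task's context
    ev = Evidence(
        content=f"finding from {label}",  # illustrative content
        citation=Citation(source="pubmed", title=label, url=f"https://example.org/{label}", date="2024"),
    )
    get_magentic_state().add_evidence([ev])
    return len(get_magentic_state().evidence)


async def main() -> None:
    counts = await asyncio.gather(session("user-a"), session("user-b"))
    print(counts)  # [1, 1] - evidence does not leak between concurrent sessions


asyncio.run(main())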
@@ -0,0 +1,175 @@
"""Tool functions for Magentic agents.

These functions are decorated with @ai_function to be callable by the ChatAgent's internal LLM.
They also interact with the thread-safe MagenticState to persist evidence.
"""

from agent_framework import ai_function

from src.agents.state import get_magentic_state
from src.tools.biorxiv import BioRxivTool
from src.tools.clinicaltrials import ClinicalTrialsTool
from src.tools.pubmed import PubMedTool

# Singleton tool instances (stateless wrappers)
_pubmed = PubMedTool()
_clinicaltrials = ClinicalTrialsTool()
_biorxiv = BioRxivTool()


@ai_function  # type: ignore[arg-type, misc]
async def search_pubmed(query: str, max_results: int = 10) -> str:
    """Search PubMed for biomedical research papers.

    Use this tool to find peer-reviewed scientific literature about
    drugs, diseases, mechanisms of action, and clinical studies.

    Args:
        query: Search keywords (e.g., "metformin alzheimer mechanism")
        max_results: Maximum results to return (default 10)

    Returns:
        Formatted list of papers with titles, abstracts, and citations
    """
    state = get_magentic_state()

    # 1. Execute raw search
    results = await _pubmed.search(query, max_results)
    if not results:
        return f"No PubMed results found for: {query}"

    # 2. Semantic Deduplication & Expansion (The "Digital Twin" Brain)
    display_results = results
    if state.embedding_service:
        # Deduplicate against what we just found vs what's in the DB
        unique_results = await state.embedding_service.deduplicate(results)

        # Search for related context in the vector DB (previous searches)
        related = await state.search_related(query, n_results=3)

        # Combine unique new results + relevant historical results
        display_results = unique_results + related

    # 3. Update State (Persist for ReportAgent)
    # We add *all* found results to state, not just the displayed ones
    new_count = state.add_evidence(results)

    # 4. Format Output for LLM
    output = [f"Found {len(results)} results ({new_count} new stored):\n"]

    # Limit display to avoid context window overflow, but state has everything
    limit = min(len(display_results), max_results)

    for i, r in enumerate(display_results[:limit], 1):
        title = r.citation.title
        date = r.citation.date
        source = r.citation.source
        content_clean = r.content[:300].replace("\n", " ")
        url = r.citation.url

        output.append(f"{i}. **{title}** ({date})")
        output.append(f"   Source: {source} | {url}")
        output.append(f"   {content_clean}...")
        output.append("")

    return "\n".join(output)


@ai_function  # type: ignore[arg-type, misc]
async def search_clinical_trials(query: str, max_results: int = 10) -> str:
    """Search ClinicalTrials.gov for clinical studies.

    Use this tool to find ongoing and completed clinical trials
    for drug repurposing candidates.

    Args:
        query: Search terms (e.g., "metformin cancer phase 3")
        max_results: Maximum results to return (default 10)

    Returns:
        Formatted list of clinical trials with status and details
    """
    state = get_magentic_state()

    results = await _clinicaltrials.search(query, max_results)
    if not results:
        return f"No clinical trials found for: {query}"

    # Update state
    new_count = state.add_evidence(results)

    output = [f"Found {len(results)} clinical trials ({new_count} new stored):\n"]
    for i, r in enumerate(results[:max_results], 1):
        title = r.citation.title
        date = r.citation.date
        source = r.citation.source
        content_clean = r.content[:300].replace("\n", " ")
        url = r.citation.url

        output.append(f"{i}. **{title}**")
        output.append(f"   Status: {source} | Date: {date}")
        output.append(f"   {content_clean}...")
        output.append(f"   URL: {url}\n")

    return "\n".join(output)


@ai_function  # type: ignore[arg-type, misc]
async def search_preprints(query: str, max_results: int = 10) -> str:
    """Search bioRxiv/medRxiv for preprint papers.

    Use this tool to find the latest research that hasn't been
    peer-reviewed yet. Good for cutting-edge findings.

    Args:
        query: Search terms (e.g., "long covid treatment")
        max_results: Maximum results to return (default 10)

    Returns:
        Formatted list of preprints with abstracts and links
    """
    state = get_magentic_state()

    results = await _biorxiv.search(query, max_results)
    if not results:
        return f"No preprints found for: {query}"

    # Update state
    new_count = state.add_evidence(results)

    output = [f"Found {len(results)} preprints ({new_count} new stored):\n"]
    for i, r in enumerate(results[:max_results], 1):
        title = r.citation.title
        date = r.citation.date
        source = r.citation.source
        content_clean = r.content[:300].replace("\n", " ")
        url = r.citation.url

        output.append(f"{i}. **{title}**")
        output.append(f"   Server: {source} | Date: {date}")
        output.append(f"   {content_clean}...")
        output.append(f"   URL: {url}\n")

    return "\n".join(output)


@ai_function  # type: ignore[arg-type, misc]
async def get_bibliography() -> str:
    """Get the full list of collected evidence for the bibliography.

    Use this tool when generating the final report to get the complete
    list of references.

    Returns:
        Formatted bibliography string.
    """
    state = get_magentic_state()
    if not state.evidence:
        return "No evidence collected."

    output = ["## References"]
    for i, ev in enumerate(state.evidence, 1):
        output.append(f"{i}. {ev.citation.formatted}")
        output.append(f"   URL: {ev.citation.url}")

    return "\n".join(output)
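The "(N new stored)" suffix each tool reports comes straight from MagenticState.add_evidence, which deduplicates by citation URL. A quick sketch of that counting behaviour, with no network calls (URLs and content are illustrative):

from src.agents.state import init_magentic_state
from src.utils.models import Citation, Evidence


def make(url: str) -> Evidence:
    # Illustrative evidence item keyed by its URL.
    return Evidence(content="...", citation=Citation(source="pubmed", title=url, url=url, date="2024"))


state = init_magentic_state()
print(state.add_evidence([make("https://example.org/a"), make("https://example.org/b")]))  # 2 new stored
print(state.add_evidence([make("https://example.org/a"), make("https://example.org/c")]))  # 1 new stored
print(len(state.evidence))  # 3 - the duplicate URL is kept out of the bibliography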
@@ -10,7 +10,7 @@ from pydantic_ai.models.openai import OpenAIModel
 from pydantic_ai.providers.anthropic import AnthropicProvider
 from pydantic_ai.providers.openai import OpenAIProvider

-from src.agent_factory.judges import JudgeHandler, MockJudgeHandler
+from src.agent_factory.judges import HFInferenceJudgeHandler, JudgeHandler, MockJudgeHandler
 from src.mcp_tools import (
     analyze_hypothesis,
     search_all_sources,

@@ -32,7 +32,7 @@ def configure_orchestrator(
     mode: str = "simple",
     user_api_key: str | None = None,
     api_provider: str = "openai",
-) -> Any:
+) -> tuple[Any, str]:
     """
     Create an orchestrator instance.

@@ -43,7 +43,7 @@ def configure_orchestrator(
         api_provider: API provider ("openai" or "anthropic")

     Returns:
+        Tuple of (Orchestrator instance, backend_name)
     """
     # Create orchestrator config
     config = OrchestratorConfig(

@@ -57,31 +57,57 @@ def configure_orchestrator(
         timeout=config.search_timeout,
     )

-    judge_handler: JudgeHandler | MockJudgeHandler
+    # Create judge (mock, real, or free tier)
+    judge_handler: JudgeHandler | MockJudgeHandler | HFInferenceJudgeHandler
+    backend_info = "Unknown"
+
+    # 1. Forced Mock (Unit Testing)
     if use_mock:
         judge_handler = MockJudgeHandler()
+        backend_info = "Mock (Testing)"
+
+    # 2. Paid API Key (User provided or Env)
+    elif (
+        user_api_key
+        or (api_provider == "openai" and os.getenv("OPENAI_API_KEY"))
+        or (api_provider == "anthropic" and os.getenv("ANTHROPIC_API_KEY"))
+    ):
         model: AnthropicModel | OpenAIModel | None = None
         if user_api_key:
+            # Validate key/provider match to prevent silent auth failures
+            if api_provider == "openai" and user_api_key.startswith("sk-ant-"):
+                raise ValueError("Anthropic key provided but OpenAI provider selected")
+            is_openai_key = user_api_key.startswith("sk-") and not user_api_key.startswith(
+                "sk-ant-"
+            )
+            if api_provider == "anthropic" and is_openai_key:
+                raise ValueError("OpenAI key provided but Anthropic provider selected")
             if api_provider == "anthropic":
                 anthropic_provider = AnthropicProvider(api_key=user_api_key)
                 model = AnthropicModel(settings.anthropic_model, provider=anthropic_provider)
             elif api_provider == "openai":
                 openai_provider = OpenAIProvider(api_key=user_api_key)
                 model = OpenAIModel(settings.openai_model, provider=openai_provider)
+            backend_info = f"Paid API ({api_provider.upper()})"
+        else:
+            backend_info = "Paid API (Env Config)"
+
         judge_handler = JudgeHandler(model=model)

+    # 3. Free Tier (HuggingFace Inference)
+    else:
+        judge_handler = HFInferenceJudgeHandler()
+        backend_info = "Free Tier (Llama 3.1 / Mistral)"
+
+    orchestrator = create_orchestrator(
         search_handler=search_handler,
         judge_handler=judge_handler,
         config=config,
         mode=mode,  # type: ignore
     )

+    return orchestrator, backend_info
+

 async def research_agent(
     message: str,

@@ -110,54 +136,47 @@ async def research_agent(
     # Clean user-provided API key
     user_api_key = api_key.strip() if api_key else None

+    # Check available keys
     has_openai = bool(os.getenv("OPENAI_API_KEY"))
     has_anthropic = bool(os.getenv("ANTHROPIC_API_KEY"))
     has_user_key = bool(user_api_key)
+    has_paid_key = has_openai or has_anthropic or has_user_key

-        use_mock = not (has_openai or (has_user_key and api_provider == "openai"))
-    else:
-        # Simple mode can work with either provider
-        use_mock = not (has_openai or has_anthropic or has_user_key)
-
-    # If magentic mode requested but no OpenAI key, fallback/warn
-    if mode == "magentic" and use_mock:
+    # Magentic mode requires OpenAI specifically
+    if mode == "magentic" and not (has_openai or (has_user_key and api_provider == "openai")):
         yield (
-            "⚠️ **Warning**: Magentic mode requires OpenAI API key. "
-            "Falling back to demo mode.\n\n"
+            "⚠️ **Warning**: Magentic mode requires OpenAI API key. Falling back to simple mode.\n\n"
         )
         mode = "simple"

     # Inform user about their key being used
-    if has_user_key
+    if has_user_key:
         yield (
             f"🔑 **Using your {api_provider.upper()} API key** - "
             "Your key is used only for this session and is never stored.\n\n"
         )
-
-    if use_mock:
+    elif not has_paid_key:
+        # No paid keys - will use FREE HuggingFace Inference
         yield (
-            "**To unlock full AI analysis:**\n"
-            "- Enter your OpenAI or Anthropic API key below, OR\n"
-            "- Configure secrets in HuggingFace Space settings\n\n"
-            "---\n\n"
+            "🤗 **Free Tier**: Using HuggingFace Inference (Llama 3.1 / Mistral) for AI analysis.\n"
+            "For premium models, enter an OpenAI or Anthropic API key below.\n\n"
         )

     # Run the agent and stream events
     response_parts: list[str] = []

     try:
+        # use_mock=False - let configure_orchestrator decide based on available keys
+        # It will use: Paid API > HF Inference (free tier)
+        orchestrator, backend_name = configure_orchestrator(
+            use_mock=False,  # Never use mock in production - HF Inference is the free fallback
             mode=mode,
             user_api_key=user_api_key,
             api_provider=api_provider,
         )
+
+        yield f"🧠 **Backend**: {backend_name}\n\n"
+
         async for event in orchestrator.run(message):
             # Format event as markdown
             event_md = event.to_markdown()
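The selection order implemented above is: forced mock, then a paid key (user-supplied or environment), then free HF Inference. A condensed sketch of just that priority (pick_backend is a local helper written for this illustration, not a function in app.py):

import os


def pick_backend(use_mock: bool, user_api_key: str | None, api_provider: str = "openai") -> str:
    if use_mock:
        return "Mock (Testing)"
    env_key = os.getenv("OPENAI_API_KEY") if api_provider == "openai" else os.getenv("ANTHROPIC_API_KEY")
    if user_api_key or env_key:
        return f"Paid API ({api_provider.upper()})" if user_api_key else "Paid API (Env Config)"
    return "Free Tier (Llama 3.1 / Mistral)"


print(pick_backend(False, None))  # "Free Tier (Llama 3.1 / Mistral)" when no keys are configured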
@@ -5,18 +5,10 @@ from typing import Any, Literal
 from src.orchestrator import JudgeHandlerProtocol, Orchestrator, SearchHandlerProtocol
 from src.utils.models import OrchestratorConfig

-# Define protocols again or import if they were in a shared place.
-# Since they are in src/orchestrator.py, we can import them?
-# But SearchHandler and JudgeHandler in arguments are concrete classes in the type hint,
-# which satisfy the protocol.

 def create_orchestrator(
-    search_handler: SearchHandlerProtocol,
-    judge_handler: JudgeHandlerProtocol,
+    search_handler: SearchHandlerProtocol | None = None,
+    judge_handler: JudgeHandlerProtocol | None = None,
     config: OrchestratorConfig | None = None,
     mode: Literal["simple", "magentic"] = "simple",
 ) -> Any:

@@ -24,27 +16,33 @@ def create_orchestrator(
     Create an orchestrator instance.

     Args:
-        search_handler: The search handler
-        judge_handler: The judge handler
+        search_handler: The search handler (required for simple mode)
+        judge_handler: The judge handler (required for simple mode)
         config: Optional configuration
+        mode: "simple" for Phase 4 loop, "magentic" for ChatAgent-based multi-agent

     Returns:
-        Orchestrator instance
+        Orchestrator instance
+
+    Note:
+        Magentic mode does NOT use search_handler/judge_handler.
+        It creates ChatAgent instances with internal LLMs that call tools directly.
     """
     if mode == "magentic":
         try:
             from src.orchestrator_magentic import MagenticOrchestrator

             return MagenticOrchestrator(
-                search_handler=search_handler,
-                judge_handler=judge_handler,
                 max_rounds=config.max_iterations if config else 10,
             )
         except ImportError:
             # Fallback to simple if agent-framework not installed
             pass

+    # Simple mode requires handlers
+    if search_handler is None or judge_handler is None:
+        raise ValueError("Simple mode requires search_handler and judge_handler")
+
     return Orchestrator(
         search_handler=search_handler,
         judge_handler=judge_handler,
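A sketch of how the relaxed signature behaves from a caller's point of view; the import path here is assumed (adjust to wherever create_orchestrator lives in this repo), and only the fail-fast handler check is the point:

# Import path assumed for illustration.
from src.orchestrator_factory import create_orchestrator

try:
    create_orchestrator(mode="simple")  # simple mode with no handlers supplied
except ValueError as err:
    print(err)  # "Simple mode requires search_handler and judge_handler"

In magentic mode the handlers are ignored entirely; the workflow builds its own ChatAgents, so callers can omit them without hitting the ValueError.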
@@ -1,18 +1,9 @@
-"""Magentic-based orchestrator
-
-NOTE: Magentic mode currently requires OpenAI API keys. The MagenticBuilder's
-standard manager uses OpenAIChatClient. Anthropic support may be added when
-the agent_framework provides an AnthropicChatClient.
-"""
+"""Magentic-based orchestrator using ChatAgent pattern."""

 from collections.abc import AsyncGenerator
 from typing import TYPE_CHECKING, Any

 import structlog
-
-if TYPE_CHECKING:
-    from src.services.embeddings import EmbeddingService
-
 from agent_framework import (
     MagenticAgentDeltaEvent,
     MagenticAgentMessageEvent,

@@ -23,45 +14,49 @@ from agent_framework (
 )
 from agent_framework.openai import OpenAIChatClient

+from src.agents.magentic_agents import (
+    create_hypothesis_agent,
+    create_judge_agent,
+    create_report_agent,
+    create_search_agent,
+)
+from src.agents.state import init_magentic_state
 from src.utils.config import settings
 from src.utils.exceptions import ConfigurationError
-from src.utils.models import AgentEvent
+from src.utils.models import AgentEvent

-logger = structlog.get_logger()
+if TYPE_CHECKING:
+    from src.services.embeddings import EmbeddingService

+logger = structlog.get_logger()

-    """Truncate text with ellipsis only if needed."""
-    return f"{text[:max_len]}..." if len(text) > max_len else text


 class MagenticOrchestrator:
     """
-    Magentic-based orchestrator
-
-    Uses Microsoft Agent Framework's MagenticBuilder for multi-agent coordination.
-
-    manager currently only supports OpenAI. If you have only an Anthropic
-    key, use the "simple" orchestrator mode instead.
+    Magentic-based orchestrator using ChatAgent pattern.
+
+    Each agent has an internal LLM that understands natural language
+    instructions from the manager and can call tools appropriately.
     """

     def __init__(
         self,
-        search_handler: SearchHandlerProtocol,
-        judge_handler: JudgeHandlerProtocol,
         max_rounds: int = 10,
+        chat_client: OpenAIChatClient | None = None,
     ) -> None:
+        """Initialize orchestrator.
+
+        Args:
+            max_rounds: Maximum coordination rounds
+            chat_client: Optional shared chat client for agents
+        """
+        if not settings.openai_api_key:
+            raise ConfigurationError(
+                "Magentic mode requires OPENAI_API_KEY. " "Set the key or use mode='simple'."
+            )
+
         self._max_rounds = max_rounds
+        self._chat_client = chat_client

     def _init_embedding_service(self) -> "EmbeddingService | None":
         """Initialize embedding service if available."""

@@ -77,19 +72,19 @@ class MagenticOrchestrator:
             logger.warning("Failed to initialize embedding service", error=str(e))
             return None

+    def _build_workflow(self) -> Any:
+        """Build the Magentic workflow with ChatAgent participants."""
+        # Create agents with internal LLMs
+        search_agent = create_search_agent(self._chat_client)
+        judge_agent = create_judge_agent(self._chat_client)
+        hypothesis_agent = create_hypothesis_agent(self._chat_client)
+        report_agent = create_report_agent(self._chat_client)
+
+        # Manager chat client (orchestrates the agents)
+        manager_client = OpenAIChatClient(
+            model_id=settings.openai_model,  # Use configured model
+            api_key=settings.openai_api_key,
+        )

         return (
             MagenticBuilder()

@@ -100,9 +95,7 @@ class MagenticOrchestrator:
                 reporter=report_agent,
             )
             .with_standard_manager(
-                chat_client=OpenAIChatClient(
-                    model_id=settings.openai_model, api_key=settings.openai_api_key
-                ),
+                chat_client=manager_client,
                 max_round_count=self._max_rounds,
                 max_stall_count=3,
                 max_reset_count=2,

@@ -110,46 +103,15 @@ class MagenticOrchestrator:
             .build()
         )

-    def _format_task(self, query: str, has_embeddings: bool) -> str:
-        """Format the task instruction for the manager."""
-        semantic_note = ""
-        if has_embeddings:
-            semantic_note = """
-The system has semantic search enabled. When evidence is found:
-1. Related concepts will be automatically surfaced
-2. Duplicates are removed by meaning, not just URL
-3. Use the surfaced related concepts to refine searches
-"""
-        return f"""Research drug repurposing opportunities for: {query}
-{semantic_note}
-Workflow:
-1. SearcherAgent: Find initial evidence from PubMed and web. SEND ONLY A SIMPLE KEYWORD QUERY.
-2. HypothesisAgent: Generate mechanistic hypotheses (Drug -> Target -> Pathway -> Effect).
-3. SearcherAgent: Use hypothesis-suggested queries for targeted search.
-4. JudgeAgent: Evaluate if evidence supports hypotheses.
-5. If sufficient -> ReportAgent: Generate structured research report.
-6. If not sufficient -> Repeat from step 1 with refined queries.
-
-Focus on:
-- Identifying specific molecular targets
-- Understanding mechanism of action
-- Finding supporting/contradicting evidence for hypotheses
-
-The final output should be a complete research report with:
-- Executive summary
-- Methodology
-- Hypotheses tested
-- Mechanistic and clinical findings
-- Drug candidates
-- Limitations
-- Conclusion with references
-"""
-
     async def run(self, query: str) -> AsyncGenerator[AgentEvent, None]:
         """
-        Run the Magentic workflow
+        Run the Magentic workflow.
+
+        Args:
+            query: User's research question

-        Yields
+        Yields:
+            AgentEvent objects for real-time UI updates
         """
         logger.info("Starting Magentic orchestrator", query=query)

@@ -159,20 +121,27 @@ class MagenticOrchestrator:
             iteration=0,
         )

+        # Initialize context state
         embedding_service = self._init_embedding_service()
+        init_magentic_state(embedding_service)
-        report_agent = ReportAgent(self._evidence_store, embedding_service=embedding_service)
+
+        workflow = self._build_workflow()
+
+        task = f"""Research drug repurposing opportunities for: {query}

+Workflow:
+1. SearchAgent: Find evidence from PubMed, ClinicalTrials.gov, and bioRxiv
+2. HypothesisAgent: Generate mechanistic hypotheses (Drug -> Target -> Pathway -> Effect)
+3. JudgeAgent: Evaluate if evidence is sufficient
+4. If insufficient -> SearchAgent refines search based on gaps
+5. If sufficient -> ReportAgent synthesizes final report
+
+Focus on:
+- Identifying specific molecular targets
+- Understanding mechanism of action
+- Finding clinical evidence supporting hypotheses
+
+The final output should be a structured research report."""

         iteration = 0
         try:

@@ -182,6 +151,7 @@ class MagenticOrchestrator:
                     if isinstance(event, MagenticAgentMessageEvent):
                         iteration += 1
                     yield agent_event
+
         except Exception as e:
             logger.error("Magentic workflow failed", error=str(e))
             yield AgentEvent(

@@ -191,35 +161,41 @@ class MagenticOrchestrator:
             )

     def _process_event(self, event: Any, iteration: int) -> AgentEvent | None:
+        """Process workflow event into AgentEvent."""
         if isinstance(event, MagenticOrchestratorMessageEvent):
-            kind = getattr(event, "kind", "manager")
-            if message_text:
+            text = event.message.text if event.message else ""
+            if text:
                 return AgentEvent(
                     type="judging",
+                    message=f"Manager ({event.kind}): {text[:200]}...",
                     iteration=iteration,
                 )

         elif isinstance(event, MagenticAgentMessageEvent):
             agent_name = event.agent_id or "unknown"
-            return self._agent_message_event(agent_name, msg_text, iteration + 1)
+            text = event.message.text if event.message else ""
+
+            event_type = "judging"
+            if "search" in agent_name.lower():
+                event_type = "search_complete"
+            elif "judge" in agent_name.lower():
+                event_type = "judge_complete"
+            elif "hypothes" in agent_name.lower():
+                event_type = "hypothesizing"
+            elif "report" in agent_name.lower():
+                event_type = "synthesizing"
+
+            return AgentEvent(
+                type=event_type,  # type: ignore[arg-type]
+                message=f"{agent_name}: {text[:200]}...",
+                iteration=iteration + 1,
+            )

         elif isinstance(event, MagenticFinalResultEvent):
-                event.message.text
-                if event.message and hasattr(event.message, "text")
-                else "No result"
-            )
+            text = event.message.text if event.message else "No result"
             return AgentEvent(
                 type="complete",
+                message=text,
                 data={"iterations": iteration},
                 iteration=iteration,
             )

@@ -242,35 +218,3 @@ class MagenticOrchestrator:
             )

         return None
-
-    def _agent_message_event(self, agent_name: str, msg_text: str, iteration: int) -> AgentEvent:
-        """Create an AgentEvent for an agent message."""
-        if "search" in agent_name.lower():
-            return AgentEvent(
-                type="search_complete",
-                message=f"Search agent: {_truncate(msg_text)}",
-                iteration=iteration,
-            )
-        elif "hypothes" in agent_name.lower():
-            return AgentEvent(
-                type="hypothesizing",
-                message=f"Hypothesis agent: {_truncate(msg_text)}",
-                iteration=iteration,
-            )
-        elif "judge" in agent_name.lower():
-            return AgentEvent(
-                type="judge_complete",
-                message=f"Judge agent: {_truncate(msg_text)}",
-                iteration=iteration,
-            )
-        elif "report" in agent_name.lower():
-            return AgentEvent(
-                type="synthesizing",
-                message=f"Report agent: {_truncate(msg_text)}" if msg_text else "Report generated.",
-                iteration=iteration,
-            )
-        return AgentEvent(
-            type="judging",
-            message=f"{agent_name}: {_truncate(msg_text)}",
-            iteration=iteration,
-        )
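Consuming the orchestrator is unchanged from the caller's perspective: run() is still an async generator of AgentEvents. A minimal sketch (the query is illustrative; OPENAI_API_KEY must be configured or the constructor raises ConfigurationError):

import asyncio

from src.orchestrator_magentic import MagenticOrchestrator


async def main() -> None:
    orchestrator = MagenticOrchestrator(max_rounds=5)
    async for event in orchestrator.run("metformin repurposing for Alzheimer's disease"):
        print(event.type, event.message[:80])


asyncio.run(main())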
@@ -124,13 +124,13 @@ async def format_report_prompt(
 {hypotheses_summary}

 ## Assessment Scores
+- Mechanism Score: {assessment.get("mechanism_score", "N/A")}/10
+- Clinical Evidence Score: {assessment.get("clinical_score", "N/A")}/10
+- Overall Confidence: {assessment.get("confidence", 0):.0%}

 ## Metadata
 - Sources Searched: {sources}
+- Search Iterations: {metadata.get("iterations", 0)}

 Generate a complete ResearchReport with all sections filled in.
@@ -0,0 +1,138 @@
"""Unit tests for HFInferenceJudgeHandler."""

from unittest.mock import AsyncMock, MagicMock, patch

import pytest

from src.agent_factory.judges import HFInferenceJudgeHandler
from src.utils.models import Citation, Evidence


@pytest.mark.unit
class TestHFInferenceJudgeHandler:
    """Tests for HFInferenceJudgeHandler."""

    @pytest.fixture
    def mock_client(self):
        """Mock HuggingFace InferenceClient."""
        with patch("src.agent_factory.judges.InferenceClient") as mock:
            client_instance = MagicMock()
            mock.return_value = client_instance
            yield client_instance

    @pytest.fixture
    def handler(self, mock_client):
        """Create a handler instance with mocked client."""
        return HFInferenceJudgeHandler()

    @pytest.mark.asyncio
    async def test_assess_success(self, handler, mock_client):
        """Test successful assessment with primary model."""
        import json

        # Construct valid JSON payload
        data = {
            "details": {
                "mechanism_score": 8,
                "mechanism_reasoning": "Good mechanism",
                "clinical_evidence_score": 7,
                "clinical_reasoning": "Good clinical",
                "drug_candidates": ["Drug A"],
                "key_findings": ["Finding 1"],
            },
            "sufficient": True,
            "confidence": 0.85,
            "recommendation": "synthesize",
            "next_search_queries": [],
            "reasoning": (
                "Sufficient evidence provided to support the hypothesis with high confidence."
            ),
        }

        # Mock chat_completion response structure
        mock_message = MagicMock()
        mock_message.content = f"""Here is the analysis:
```json
{json.dumps(data)}
```"""
        mock_choice = MagicMock()
        mock_choice.message = mock_message
        mock_response = MagicMock()
        mock_response.choices = [mock_choice]

        # Setup async mock for run_in_executor
        with patch("asyncio.get_running_loop") as mock_loop:
            mock_loop.return_value.run_in_executor = AsyncMock(return_value=mock_response)

            evidence = [
                Evidence(
                    content="test", citation=Citation(source="pubmed", title="t", url="u", date="d")
                )
            ]
            result = await handler.assess("test question", evidence)

            assert result.sufficient is True
            assert result.confidence == 0.85
            assert result.details.drug_candidates == ["Drug A"]

    @pytest.mark.asyncio
    async def test_assess_fallback_logic(self, handler, mock_client):
        """Test fallback to secondary model when primary fails."""

        # Setup async mock to fail first, then succeed
        with patch("asyncio.get_running_loop"):
            # We need to mock the _call_with_retry method directly to test the loop in assess
            # but _call_with_retry is decorated with tenacity,
            # which makes it harder to mock partial failures easily
            # without triggering the tenacity retry loop first.
            # Instead, let's mock run_in_executor to raise exception on first call

            # This is tricky because assess loops over models,
            # and for each model _call_with_retry retries.
            # We want to simulate: Model 1 fails (retries exhausted) -> Model 2 succeeds.

            # Let's patch _call_with_retry to avoid waiting for real retries
            side_effect = [
                Exception("Model 1 failed"),
                Exception("Model 2 failed"),
                Exception("Model 3 failed"),
            ]
            with patch.object(handler, "_call_with_retry", side_effect=side_effect) as mock_call:
                evidence = []
                result = await handler.assess("test", evidence)

                # Should have tried all 3 fallback models
                assert mock_call.call_count == 3
                # Fallback assessment should indicate failure
                assert result.sufficient is False
                assert "failed" in result.reasoning.lower() or "error" in result.reasoning.lower()

    def test_extract_json_robustness(self, handler):
        """Test JSON extraction with various inputs."""

        # 1. Clean JSON
        assert handler._extract_json('{"a": 1}') == {"a": 1}

        # 2. Markdown block
        assert handler._extract_json('```json\n{"a": 1}\n```') == {"a": 1}

        # 3. Text preamble/postamble
        text = """
        Sure, here is the JSON:
        {
            "a": 1,
            "b": {
                "c": 2
            }
        }
        Hope that helps!
        """
        assert handler._extract_json(text) == {"a": 1, "b": {"c": 2}}

        # 4. Nested braces
        nested = '{"a": {"b": "}"}}'
        assert handler._extract_json(nested) == {"a": {"b": "}"}}

        # 5. Invalid JSON
        assert handler._extract_json("Not JSON") is None
        assert handler._extract_json("{Incomplete") is None
|
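The extraction cases above (bare JSON, fenced ```json blocks, prose around the payload, and braces inside string values) point to a decoder-based scan rather than a single regex. Below is a minimal sketch of one way an extractor could satisfy these cases; it is illustrative only, and the real `_extract_json` method in `src/agent_factory/judges.py` may be implemented differently:

import json
import re
from typing import Any


def extract_json(text: str) -> dict[str, Any] | None:
    """Pull the first JSON object out of an LLM reply, or return None."""
    # Prefer the contents of a fenced ```json block if one is present.
    fenced = re.search(r"```(?:json)?\s*(.*?)```", text, flags=re.DOTALL)
    if fenced:
        text = fenced.group(1)
    # Decode starting at the first "{"; raw_decode respects string literals,
    # so braces inside values (e.g. {"b": "}"}) do not confuse it.
    start = text.find("{")
    if start == -1:
        return None
    try:
        obj, _ = json.JSONDecoder().raw_decode(text[start:])
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) else None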
@@ -1065,6 +1065,7 @@ dependencies = [
     { name = "beautifulsoup4" },
     { name = "gradio", extra = ["mcp"] },
     { name = "httpx" },
+    { name = "huggingface-hub" },
     { name = "openai" },
     { name = "pydantic" },
     { name = "pydantic-ai" },
@@ -1107,13 +1108,14 @@ modal = [
 
 [package.metadata]
 requires-dist = [
-    { name = "agent-framework-core", marker = "extra == 'magentic'" },
+    { name = "agent-framework-core", marker = "extra == 'magentic'", specifier = ">=1.0.0b251120,<2.0.0" },
     { name = "anthropic", specifier = ">=0.18.0" },
     { name = "beautifulsoup4", specifier = ">=4.12" },
     { name = "chromadb", marker = "extra == 'embeddings'", specifier = ">=0.4.0" },
     { name = "chromadb", marker = "extra == 'modal'", specifier = ">=0.4.0" },
     { name = "gradio", extras = ["mcp"], specifier = ">=6.0.0" },
     { name = "httpx", specifier = ">=0.27" },
+    { name = "huggingface-hub", specifier = ">=0.20.0" },
     { name = "llama-index", marker = "extra == 'modal'", specifier = ">=0.11.0" },
     { name = "llama-index-embeddings-openai", marker = "extra == 'modal'" },
     { name = "llama-index-llms-openai", marker = "extra == 'modal'" },
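The lock changes above add huggingface-hub, which supplies the InferenceClient that the unit tests patch at src.agent_factory.judges.InferenceClient. As a rough orientation only, a call through that client might look like the sketch below; the model id is an example, and the real handler's fallback chain, retry policy, and response parsing live in judges.py and may differ:

import asyncio
import os

from huggingface_hub import InferenceClient  # provided by the new huggingface-hub pin


async def ask_model(prompt: str) -> str:
    """Illustrative call path; the real handler adds model fallback and tenacity retries."""
    # Example model id only; the handler's fallback chain is defined elsewhere.
    client = InferenceClient(model="meta-llama/Llama-3.1-8B-Instruct", token=os.getenv("HF_TOKEN"))
    loop = asyncio.get_running_loop()
    # chat_completion is a blocking call, so it is pushed onto the default
    # executor, the same run_in_executor hop the tests mock out.
    response = await loop.run_in_executor(
        None,
        lambda: client.chat_completion(
            messages=[{"role": "user", "content": prompt}],
            max_tokens=1024,
        ),
    )
    return response.choices[0].message.content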