Commit 8bac750
Parent(s): 901acc3
docs: sync AI agent context files and fix final nitpicks
- Update GEMINI.md, CLAUDE.md, AGENTS.md to Phase 13 status
- Add HuggingFace Spaces collaboration guidelines
- Remove redundant evidence[:10] slice (method handles internally)
- Add CodeExecutionError handling to verify_sandbox.py demo

Files changed:
- AGENTS.md +22 -9
- CLAUDE.md +22 -9
- GEMINI.md +49 -13
- examples/modal_demo/verify_sandbox.py +34 -29
- src/services/statistical_analyzer.py +2 -2
AGENTS.md
CHANGED
@@ -4,7 +4,9 @@ This file provides guidance to AI agents when working with code in this repository.
 
 ## Project Overview
 
-DeepCritical is an AI-native drug repurposing research agent for a HuggingFace hackathon. It uses a search-and-judge loop to autonomously search biomedical databases (PubMed) and synthesize evidence for queries like "What existing drugs might help treat long COVID fatigue?".
+DeepCritical is an AI-native drug repurposing research agent for a HuggingFace hackathon. It uses a search-and-judge loop to autonomously search biomedical databases (PubMed, ClinicalTrials.gov, bioRxiv) and synthesize evidence for queries like "What existing drugs might help treat long COVID fatigue?".
+
+**Current Status:** Phases 1-13 COMPLETE (Foundation through Modal sandbox integration).
 
 ## Development Commands
 
@@ -37,11 +39,11 @@ uv run pytest -m integration
 User Question → Orchestrator
         ↓
 Search Loop:
-  1. Query PubMed
+  1. Query PubMed, ClinicalTrials.gov, bioRxiv
   2. Gather evidence
   3. Judge quality ("Do we have enough?")
   4. If NO → Refine query, search more
-  5. If YES → Synthesize findings
+  5. If YES → Synthesize findings (+ optional Modal analysis)
         ↓
 Research Report with Citations
 ```
@@ -49,14 +51,19 @@ Research Report with Citations
 **Key Components**:
 - `src/orchestrator.py` - Main agent loop
 - `src/tools/pubmed.py` - PubMed E-utilities search
+- `src/tools/clinicaltrials.py` - ClinicalTrials.gov API
+- `src/tools/biorxiv.py` - bioRxiv/medRxiv preprint search
+- `src/tools/code_execution.py` - Modal sandbox execution
 - `src/tools/search_handler.py` - Scatter-gather orchestration
 - `src/services/embeddings.py` - Semantic search & deduplication (ChromaDB)
+- `src/services/statistical_analyzer.py` - Statistical analysis via Modal
 - `src/agent_factory/judges.py` - LLM-based evidence assessment
 - `src/agents/` - Magentic multi-agent mode (SearchAgent, JudgeAgent, etc.)
+- `src/mcp_tools.py` - MCP tool wrappers for Claude Desktop
 - `src/utils/config.py` - Pydantic Settings (loads from `.env`)
 - `src/utils/models.py` - Evidence, Citation, SearchResult models
 - `src/utils/exceptions.py` - Exception hierarchy
-- `src/app.py` - Gradio UI (HuggingFace Spaces)
+- `src/app.py` - Gradio UI with MCP server (HuggingFace Spaces)
 
 **Break Conditions**: Judge approval, token budget (50K max), or max iterations (default 10).
 
@@ -66,6 +73,7 @@ Settings via pydantic-settings from `.env`:
 - `LLM_PROVIDER`: "openai" or "anthropic"
 - `OPENAI_API_KEY` / `ANTHROPIC_API_KEY`: LLM keys
 - `NCBI_API_KEY`: Optional, for higher PubMed rate limits
+- `MODAL_TOKEN_ID` / `MODAL_TOKEN_SECRET`: For Modal sandbox (optional)
 - `MAX_ITERATIONS`: 1-50, default 10
 - `LOG_LEVEL`: DEBUG, INFO, WARNING, ERROR
 
@@ -95,8 +103,13 @@ DeepCriticalError (base)
 
 ## Git Workflow
 
-- `main`: Production-ready
-- `dev`: Development
-- `
-- Remote `
-
+- `main`: Production-ready (GitHub)
+- `dev`: Development integration (GitHub)
+- Remote `origin`: GitHub (source of truth for PRs/code review)
+- Remote `huggingface-upstream`: HuggingFace Spaces (deployment target)
+
+**HuggingFace Spaces Collaboration:**
+- Each contributor should use their own dev branch: `yourname-dev` (e.g., `vcms-dev`, `mario-dev`)
+- **DO NOT push directly to `main` or `dev` on HuggingFace** - these can be overwritten easily
+- GitHub is the source of truth; HuggingFace is for deployment/demo
+- Consider using git hooks to prevent accidental pushes to protected branches
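The last collaboration bullet above suggests git hooks to guard `main`/`dev` on the HuggingFace remote. As a rough illustration only (the hook path, remote name, and branch set are assumptions, not part of this commit), a `pre-push` hook along these lines would refuse such pushes:

```python
#!/usr/bin/env python3
# Hypothetical .git/hooks/pre-push guard (illustrative sketch, not project code).
# git invokes pre-push with <remote name> <remote url> and feeds ref updates on stdin
# as "<local ref> <local sha> <remote ref> <remote sha>" lines.
import sys

PROTECTED_REFS = {"refs/heads/main", "refs/heads/dev"}
HF_REMOTE = "huggingface-upstream"  # assumed remote name, per the workflow notes above

remote_name = sys.argv[1] if len(sys.argv) > 1 else ""

for line in sys.stdin:
    parts = line.split()
    if len(parts) != 4:
        continue
    _local_ref, _local_sha, remote_ref, _remote_sha = parts
    if remote_name == HF_REMOTE and remote_ref in PROTECTED_REFS:
        print(
            f"Blocked: pushing {remote_ref} to {remote_name} is not allowed; "
            "push to your own *-dev branch instead."
        )
        sys.exit(1)

sys.exit(0)
```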
CLAUDE.md
CHANGED
@@ -4,7 +4,9 @@ This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
 
 ## Project Overview
 
-DeepCritical is an AI-native drug repurposing research agent for a HuggingFace hackathon. It uses a search-and-judge loop to autonomously search biomedical databases (PubMed) and synthesize evidence for queries like "What existing drugs might help treat long COVID fatigue?".
+DeepCritical is an AI-native drug repurposing research agent for a HuggingFace hackathon. It uses a search-and-judge loop to autonomously search biomedical databases (PubMed, ClinicalTrials.gov, bioRxiv) and synthesize evidence for queries like "What existing drugs might help treat long COVID fatigue?".
+
+**Current Status:** Phases 1-13 COMPLETE (Foundation through Modal sandbox integration).
 
 ## Development Commands
 
@@ -37,11 +39,11 @@ uv run pytest -m integration
 User Question → Orchestrator
         ↓
 Search Loop:
-  1. Query PubMed
+  1. Query PubMed, ClinicalTrials.gov, bioRxiv
   2. Gather evidence
   3. Judge quality ("Do we have enough?")
   4. If NO → Refine query, search more
-  5. If YES → Synthesize findings
+  5. If YES → Synthesize findings (+ optional Modal analysis)
         ↓
 Research Report with Citations
 ```
@@ -49,14 +51,19 @@ Research Report with Citations
 **Key Components**:
 - `src/orchestrator.py` - Main agent loop
 - `src/tools/pubmed.py` - PubMed E-utilities search
+- `src/tools/clinicaltrials.py` - ClinicalTrials.gov API
+- `src/tools/biorxiv.py` - bioRxiv/medRxiv preprint search
+- `src/tools/code_execution.py` - Modal sandbox execution
 - `src/tools/search_handler.py` - Scatter-gather orchestration
 - `src/services/embeddings.py` - Semantic search & deduplication (ChromaDB)
+- `src/services/statistical_analyzer.py` - Statistical analysis via Modal
 - `src/agent_factory/judges.py` - LLM-based evidence assessment
 - `src/agents/` - Magentic multi-agent mode (SearchAgent, JudgeAgent, etc.)
+- `src/mcp_tools.py` - MCP tool wrappers for Claude Desktop
 - `src/utils/config.py` - Pydantic Settings (loads from `.env`)
 - `src/utils/models.py` - Evidence, Citation, SearchResult models
 - `src/utils/exceptions.py` - Exception hierarchy
-- `src/app.py` - Gradio UI (HuggingFace Spaces)
+- `src/app.py` - Gradio UI with MCP server (HuggingFace Spaces)
 
 **Break Conditions**: Judge approval, token budget (50K max), or max iterations (default 10).
 
@@ -66,6 +73,7 @@ Settings via pydantic-settings from `.env`:
 - `LLM_PROVIDER`: "openai" or "anthropic"
 - `OPENAI_API_KEY` / `ANTHROPIC_API_KEY`: LLM keys
 - `NCBI_API_KEY`: Optional, for higher PubMed rate limits
+- `MODAL_TOKEN_ID` / `MODAL_TOKEN_SECRET`: For Modal sandbox (optional)
 - `MAX_ITERATIONS`: 1-50, default 10
 - `LOG_LEVEL`: DEBUG, INFO, WARNING, ERROR
 
@@ -88,8 +96,13 @@ DeepCriticalError (base)
 
 ## Git Workflow
 
-- `main`: Production-ready
-- `dev`: Development
-- `
-- Remote `
-
+- `main`: Production-ready (GitHub)
+- `dev`: Development integration (GitHub)
+- Remote `origin`: GitHub (source of truth for PRs/code review)
+- Remote `huggingface-upstream`: HuggingFace Spaces (deployment target)
+
+**HuggingFace Spaces Collaboration:**
+- Each contributor should use their own dev branch: `yourname-dev` (e.g., `vcms-dev`, `mario-dev`)
+- **DO NOT push directly to `main` or `dev` on HuggingFace** - these can be overwritten easily
+- GitHub is the source of truth; HuggingFace is for deployment/demo
+- Consider using git hooks to prevent accidental pushes to protected branches
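Both context files now document `MODAL_TOKEN_ID` / `MODAL_TOKEN_SECRET` alongside the existing settings. A minimal pydantic-settings sketch of how such optional fields are typically declared (the class layout and any field not listed in the diff are assumptions, not the actual contents of `src/utils/config.py`):

```python
# Illustrative sketch only - the real src/utils/config.py may differ.
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    """Loads configuration from a .env file (names match the documented variables)."""

    model_config = SettingsConfigDict(env_file=".env", extra="ignore")

    llm_provider: str = "openai"           # "openai" or "anthropic"
    ncbi_api_key: str | None = None        # optional, raises PubMed rate limits
    modal_token_id: str | None = None      # optional Modal sandbox credentials
    modal_token_secret: str | None = None
    max_iterations: int = 10               # documented range: 1-50
    log_level: str = "INFO"


settings = Settings()
if settings.modal_token_id and settings.modal_token_secret:
    print("Modal sandbox credentials configured")
```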
GEMINI.md
CHANGED
@@ -2,26 +2,27 @@
 
 ## Project Overview
 **DeepCritical** is an AI-native Medical Drug Repurposing Research Agent.
-**Goal:** To accelerate the discovery of new uses for existing drugs by intelligently searching biomedical literature (PubMed), evaluating evidence, and hypothesizing potential applications.
+**Goal:** To accelerate the discovery of new uses for existing drugs by intelligently searching biomedical literature (PubMed, ClinicalTrials.gov, bioRxiv), evaluating evidence, and hypothesizing potential applications.
 
 **Architecture:**
 The project follows a **Vertical Slice Architecture** (Search -> Judge -> Orchestrator) and adheres to **Strict TDD** (Test-Driven Development).
 
 **Current Status:**
-- **Phases 1-
-- **
-- **Phase
+- **Phases 1-9:** COMPLETE. Foundation, Search, Judge, UI, Orchestrator, Embeddings, Hypothesis, Report, Cleanup.
+- **Phases 10-11:** COMPLETE. ClinicalTrials.gov and bioRxiv integration.
+- **Phase 12:** COMPLETE. MCP Server integration (Gradio MCP at `/gradio_api/mcp/`).
+- **Phase 13:** COMPLETE. Modal sandbox for statistical analysis.
 
 ## Tech Stack & Tooling
 - **Language:** Python 3.11 (Pinned)
 - **Package Manager:** `uv` (Rust-based, extremely fast)
-- **Frameworks:** `pydantic`, `pydantic-ai`, `httpx`, `gradio`
+- **Frameworks:** `pydantic`, `pydantic-ai`, `httpx`, `gradio[mcp]`
 - **Vector DB:** `chromadb` with `sentence-transformers` for semantic search
+- **Code Execution:** `modal` for secure sandboxed Python execution
 - **Testing:** `pytest`, `pytest-asyncio`, `respx` (for mocking)
 - **Quality:** `ruff` (linting/formatting), `mypy` (strict type checking), `pre-commit`
 
 ## Building & Running
-We use a `Makefile` to standardize developer commands.
 
 | Command | Description |
 | :--- | :--- |
@@ -36,19 +37,54 @@ We use a `Makefile` to standardize developer commands.
 ## Directory Structure
 - `src/`: Source code
   - `utils/`: Shared utilities (`config.py`, `exceptions.py`, `models.py`)
-  - `tools/`: Search tools (`pubmed.py`, `
-  - `services/`: Services (`embeddings.py`
+  - `tools/`: Search tools (`pubmed.py`, `clinicaltrials.py`, `biorxiv.py`, `code_execution.py`)
+  - `services/`: Services (`embeddings.py`, `statistical_analyzer.py`)
   - `agents/`: Magentic multi-agent mode agents
   - `agent_factory/`: Agent definitions (judges, prompts)
+  - `mcp_tools.py`: MCP tool wrappers for Claude Desktop integration
+  - `app.py`: Gradio UI with MCP server
 - `tests/`: Test suite
   - `unit/`: Isolated unit tests (Mocked)
   - `integration/`: Real API tests (Marked as slow/integration)
 - `docs/`: Documentation and Implementation Specs
 - `examples/`: Working demos for each phase
 
+## Key Components
+- `src/orchestrator.py` - Main agent loop
+- `src/tools/pubmed.py` - PubMed E-utilities search
+- `src/tools/clinicaltrials.py` - ClinicalTrials.gov API
+- `src/tools/biorxiv.py` - bioRxiv/medRxiv preprint search
+- `src/tools/code_execution.py` - Modal sandbox execution
+- `src/services/statistical_analyzer.py` - Statistical analysis via Modal
+- `src/mcp_tools.py` - MCP tool wrappers
+- `src/app.py` - Gradio UI (HuggingFace Spaces) with MCP server
+
+## Configuration
+
+Settings via pydantic-settings from `.env`:
+- `LLM_PROVIDER`: "openai" or "anthropic"
+- `OPENAI_API_KEY` / `ANTHROPIC_API_KEY`: LLM keys
+- `NCBI_API_KEY`: Optional, for higher PubMed rate limits
+- `MODAL_TOKEN_ID` / `MODAL_TOKEN_SECRET`: For Modal sandbox (optional)
+- `MAX_ITERATIONS`: 1-50, default 10
+- `LOG_LEVEL`: DEBUG, INFO, WARNING, ERROR
+
 ## Development Conventions
-1.
-2.
-3.
-4.
-5.
+1. **Strict TDD:** Write failing tests in `tests/unit/` *before* implementing logic in `src/`.
+2. **Type Safety:** All code must pass `mypy --strict`. Use Pydantic models for data exchange.
+3. **Linting:** Zero tolerance for Ruff errors.
+4. **Mocking:** Use `respx` or `unittest.mock` for all external API calls in unit tests.
+5. **Vertical Slices:** Implement features end-to-end rather than layer-by-layer.
+
+## Git Workflow
+
+- `main`: Production-ready (GitHub)
+- `dev`: Development integration (GitHub)
+- Remote `origin`: GitHub (source of truth for PRs/code review)
+- Remote `huggingface-upstream`: HuggingFace Spaces (deployment target)
+
+**HuggingFace Spaces Collaboration:**
+- Each contributor should use their own dev branch: `yourname-dev` (e.g., `vcms-dev`, `mario-dev`)
+- **DO NOT push directly to `main` or `dev` on HuggingFace** - these can be overwritten easily
+- GitHub is the source of truth; HuggingFace is for deployment/demo
+- Consider using git hooks to prevent accidental pushes to protected branches
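GEMINI.md now records Phase 12 as an MCP server exposed through Gradio at `/gradio_api/mcp/` and pins `gradio[mcp]`. For orientation, a minimal Gradio app launched as an MCP server looks roughly like this (the search function and its body are placeholders, not the project's actual `src/app.py`):

```python
# Minimal sketch of a Gradio app doubling as an MCP server (requires `gradio[mcp]`).
# The search function here is a stand-in; the real app wires up DeepCritical's tools.
import gradio as gr


def search_literature(query: str) -> str:
    """Return a placeholder answer for the given research question."""
    return f"(stub) evidence summary for: {query}"


demo = gr.Interface(fn=search_literature, inputs="text", outputs="text")

if __name__ == "__main__":
    # mcp_server=True additionally serves the app's functions as MCP tools
    # under the /gradio_api/mcp/ route mentioned in GEMINI.md.
    demo.launch(mcp_server=True)
```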
examples/modal_demo/verify_sandbox.py
CHANGED
@@ -11,7 +11,7 @@ Usage:
 import asyncio
 from functools import partial
 
-from src.tools.code_execution import get_code_executor
+from src.tools.code_execution import CodeExecutionError, get_code_executor
 from src.utils.config import settings
 
 
@@ -31,22 +31,23 @@ async def main() -> None:
         print("Set MODAL_TOKEN_ID and MODAL_TOKEN_SECRET in .env")
         return
 
-
-
+    try:
+        executor = get_code_executor()
+        loop = asyncio.get_running_loop()
 
-
-
-
+        print("=" * 60)
+        print("Modal Sandbox Isolation Verification")
+        print("=" * 60 + "\n")
 
-
-
-
-
-
+        # Test 1: Hostname
+        print("Test 1: Check hostname (should NOT be your machine)")
+        code1 = "import socket; print(f'Hostname: {socket.gethostname()}')"
+        result1 = await loop.run_in_executor(None, partial(executor.execute, code1))
+        print_result(result1)
 
-
-
-
+        # Test 2: Scientific libraries
+        print("Test 2: Verify scientific libraries")
+        code2 = """
 import pandas as pd
 import numpy as np
 import scipy
@@ -54,12 +55,12 @@ print(f"pandas: {pd.__version__}")
 print(f"numpy: {np.__version__}")
 print(f"scipy: {scipy.__version__}")
 """
-
-
+        result2 = await loop.run_in_executor(None, partial(executor.execute, code2))
+        print_result(result2)
 
-
-
-
+        # Test 3: Network blocked
+        print("Test 3: Verify network isolation")
+        code3 = """
 import urllib.request
 try:
     urllib.request.urlopen("https://google.com", timeout=2)
@@ -67,12 +68,12 @@ try:
 except Exception:
     print("Network: BLOCKED (as expected)")
 """
-
-
+        result3 = await loop.run_in_executor(None, partial(executor.execute, code3))
+        print_result(result3)
 
-
-
-
+        # Test 4: Real statistics
+        print("Test 4: Execute statistical analysis")
+        code4 = """
 import pandas as pd
 import scipy.stats as stats
 
@@ -84,12 +85,16 @@ print(f"Mean Effect: {mean:.3f}")
 print(f"P-value: {p_val:.4f}")
 print(f"Verdict: {'SUPPORTED' if p_val < 0.05 else 'INCONCLUSIVE'}")
 """
-
-
+        result4 = await loop.run_in_executor(None, partial(executor.execute, code4))
+        print_result(result4)
 
-
-
-
+        print("=" * 60)
+        print("All tests complete - Modal sandbox verified!")
+        print("=" * 60)
+
+    except CodeExecutionError as e:
+        print(f"Error: Modal code execution failed: {e}")
+        print("Hint: Ensure Modal SDK is installed and credentials are valid.")
 
 
 if __name__ == "__main__":
src/services/statistical_analyzer.py
CHANGED
@@ -109,8 +109,8 @@ Output format: Return ONLY executable Python code, no explanations.""",
         Returns:
             AnalysisResult with verdict and statistics
         """
-        # Build analysis prompt
-        evidence_summary = self._summarize_evidence(evidence[:10])
+        # Build analysis prompt (method handles slicing internally)
+        evidence_summary = self._summarize_evidence(evidence)
         hypothesis_text = ""
         if hypothesis:
             hypothesis_text = (