colorFrom: purple
colorTo: indigo
sdk: docker
pinned: false
---

# matrix-ai

**matrix-ai** is the AI planning microservice for the Matrix EcoSystem. It generates **short, low-risk, auditable remediation plans** from compact health context provided by **Matrix Guardian**, and also exposes a lightweight **RAG** Q&A over MatrixHub documents.

It is optimized for **Hugging Face Spaces / Inference Endpoints**, but also runs locally and in containers.

> **Endpoints**
>
> * `POST /v1/plan` – internal API for Matrix Guardian: returns a safe JSON plan.
> * `POST /v1/chat` – Q&A (RAG-assisted) over MatrixHub content; returns a single answer.
> * `GET /v1/chat/stream` – **SSE** token stream for interactive chat (production-hardened).
> * `POST /v1/chat/stream` – same as `GET` but with JSON payloads.

The service emphasizes **safety, performance, and auditability**:

* Strict, schema-validated JSON plans (bounded steps, risk label, rationale)
* PII redaction before calling upstream model endpoints
* **Multi-provider LLM cascade:** **GROQ → Gemini → HF Router (Zephyr → Mistral)** with automatic failover
* Production-safe **SSE** streaming & middleware (no body buffering, trace IDs, CORS, gzip)
* Exponential backoff, short timeouts, and structured JSON logs
* Per-IP rate limiting; optional `ADMIN_TOKEN` for private deployments
* RAG with SentenceTransformers (optional CrossEncoder re-ranker) over `data/kb.jsonl`
* ETag & response caching for non-mutating reads (where applicable)

*Last Updated: 2025-10-01 (UTC)*

---

```mermaid
flowchart LR
    subgraph Client [Matrix Operators / Observers]
    end

    Client -->|monitor| HubAPI[Matrix-Hub API]
    Guardian[Matrix-Guardian<br/>control plane] -->|/v1/plan| AI[matrix-ai<br/>FastAPI service]
    Guardian -->|/status,/apps,...| HubAPI
    HubAPI <-->|SQL| DB[MatrixDB<br/>Postgres]

    subgraph LLM [LLM Providers fallback cascade]
        GROQ[Groq<br/>llama-3.1-8b-instant]
        GEM[Google Gemini<br/>gemini-2.5-flash]
        HF[Hugging Face Router<br/>Zephyr → Mistral]
    end

    AI -->|primary| GROQ
    AI -->|fallback| GEM
    AI -->|final| HF

    classDef svc fill:#0ea5e9,stroke:#0b4,stroke-width:1,color:#fff
    classDef db fill:#f59e0b,stroke:#0b4,stroke-width:1,color:#fff
    class Guardian,AI,HubAPI svc
    class DB db
```

### Sequence: `POST /v1/plan` (planning)

```mermaid
sequenceDiagram
    participant G as Matrix-Guardian
    participant A as matrix-ai
    participant P as Provider Cascade

    G->>A: POST /v1/plan { context, constraints }
    A->>A: redact PII, validate payload (schema)
    A->>P: generate plan (timeouts, retries)
    alt Provider available
        P-->>A: model output text
    else Provider unavailable/limited
        P-->>A: fallback to next provider
    end
    A->>A: parse → strict JSON plan (safe defaults if needed)
    A-->>G: 200 { plan_id, steps[], risk, explanation }
```

### Sequence: `GET/POST /v1/chat/stream` (SSE chat)

```mermaid
sequenceDiagram
    participant C as Client (UI)
    participant A as matrix-ai (SSE-safe middleware)
    participant P as Provider Cascade

    C->>A: GET /v1/chat/stream?query=...
    A->>P: chat(messages, stream=True)
    loop token chunks
        P-->>A: delta (text)
        A-->>C: SSE data: {"delta": "..."}
    end
    A-->>C: SSE data: [DONE]
```

---

## Quick Start (Local Development)

```bash
# 1) Create venv
python3 -m venv .venv
source .venv/bin/activate

# 2) Install dependencies
pip install -r requirements.txt

# 3) Configure env (local only; use Space Secrets in prod)
cp configs/.env.example configs/.env
# Edit configs/.env with your keys (do NOT commit):
# GROQ_API_KEY=...
# GOOGLE_API_KEY=...
# HF_TOKEN=...

# 4) Run
uvicorn app.main:app --host 0.0.0.0 --port 7860
```

OpenAPI docs: [http://localhost:7860/docs](http://localhost:7860/docs)

---

## Provider Cascade (GROQ → Gemini → HF Router)

**matrix-ai** uses a production-ready multi-provider orchestrator:

1. **Groq** (`llama-3.1-8b-instant`) – free, fast, great latency
2. **Gemini** (`gemini-2.5-flash`) – free tier
3. **HF Router** – `HuggingFaceH4/zephyr-7b-beta` → `mistralai/Mistral-7B-Instruct-v0.2`

The order is configurable via `provider_order`. Providers are skipped automatically if they are misconfigured or their quotas/credits are exceeded.

**Streaming:** Groq streams true tokens; Gemini/HF may yield a single chunk (normalized to SSE).
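The failover behavior described above can be sketched in a few lines. This is an illustrative, self-contained sketch, not the service's real client code; the `cascade` function and the stub providers are hypothetical names:

```python
from typing import Callable, Sequence


def cascade(providers: Sequence[tuple[str, Callable[[str], str]]],
            prompt: str) -> tuple[str, str]:
    """Try each provider in order; return (name, text) from the first success."""
    last_err: Exception | None = None
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as err:   # quota, auth, or network failure
            last_err = err         # skip to the next provider in the order
    raise RuntimeError("all providers failed") from last_err


def rate_limited(_: str) -> str:   # stands in for an exhausted Groq quota
    raise TimeoutError("429: rate limited")


# Groq "fails", so the cascade falls through to Gemini.
name, text = cascade([("groq", rate_limited),
                      ("gemini", lambda p: "pong"),
                      ("router", lambda p: "pong")], "ping")
```

The real orchestrator layers timeouts, retries, and logging on top of this loop, but the control flow is the same: first success wins, errors only advance the cursor.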

---

## Configuration

All options can be set via environment variables (Space Secrets in HF), a `.env` file for local use, and/or `configs/settings.yaml`.

### `configs/settings.yaml` (excerpt)

```yaml
model:
  # HF Router defaults (used at the last step of the cascade)
  name: "HuggingFaceH4/zephyr-7b-beta"
  fallback: "mistralai/Mistral-7B-Instruct-v0.2"
  provider: "featherless-ai"
  max_new_tokens: 256
  temperature: 0.2

  # Provider-specific defaults (free-tier friendly)
  groq_model: "llama-3.1-8b-instant"
  gemini_model: "gemini-2.5-flash"

  # Try providers in this order
  provider_order:
    - groq
    - gemini
    - router

  # Switch to the multi-provider path
  chat_backend: "multi"
  chat_stream: true

limits:
  rate_per_min: 60
  cache_size: 256

rag:
  index_dataset: ""
  top_k: 4

matrixhub:
  base_url: "https://api.matrixhub.io"

security:
  admin_token: ""
```

### Environment variables

| Variable         | Default                              | Purpose                                       |
| ---------------- | -----------------------------------: | --------------------------------------------- |
| `GROQ_API_KEY`   | —                                    | API key for Groq (primary)                    |
| `GOOGLE_API_KEY` | —                                    | API key for Gemini                            |
| `HF_TOKEN`       | —                                    | Token for the Hugging Face Inference Router   |
| `GROQ_MODEL`     | `llama-3.1-8b-instant`               | Override the Groq model                       |
| `GEMINI_MODEL`   | `gemini-2.5-flash`                   | Override the Gemini model                     |
| `MODEL_NAME`     | `HuggingFaceH4/zephyr-7b-beta`       | HF Router primary model                       |
| `MODEL_FALLBACK` | `mistralai/Mistral-7B-Instruct-v0.2` | HF Router fallback                            |
| `MODEL_PROVIDER` | `featherless-ai`                     | HF provider tag (`model:provider`)            |
| `PROVIDER_ORDER` | `groq,gemini,router`                 | Comma-separated cascade order                 |
| `CHAT_STREAM`    | `true`                               | Enable streaming where available              |
| `RATE_LIMITS`    | `60`                                 | Per-IP requests/min (middleware)              |
| `ADMIN_TOKEN`    | —                                    | Gate `/v1/plan` & `/v1/chat*` (Bearer)        |
| `RAG_KB_PATH`    | `data/kb.jsonl`                      | Path to the KB (if present)                   |
| `RAG_RERANK`     | `true`                               | Enable the CrossEncoder re-ranker (GPU-aware) |
| `LOG_LEVEL`      | `INFO`                               | Structured JSON log level                     |

> Never commit real API keys. Use Space Secrets / Vault in production.
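Since `PROVIDER_ORDER` is a comma-separated list, resolving it is a one-liner worth getting right (whitespace, empty entries). A minimal sketch; the helper name is illustrative, not the service's actual settings code:

```python
import os


def provider_order(default: str = "groq,gemini,router") -> list[str]:
    """Resolve the cascade order from the PROVIDER_ORDER env var."""
    raw = os.environ.get("PROVIDER_ORDER", default)
    # Tolerate spaces and stray commas: "gemini, router," -> ["gemini", "router"]
    return [p.strip() for p in raw.split(",") if p.strip()]


os.environ["PROVIDER_ORDER"] = "gemini, router"   # e.g. skip Groq entirely
order = provider_order()
```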

---

### `POST /v1/plan`

**Description:** Generate a short, low-risk remediation plan from a compact app health context.

**Headers**

```
Content-Type: application/json
Authorization: Bearer <ADMIN_TOKEN>   # required if ADMIN_TOKEN set
```

**Request (example)**

```json
{
  "context": {
    "entity_uid": "matrix-ai",
    "health": {"score": 0.64, "status": "degraded", "last_checked": "2025-10-01T00:00:00Z"},
    "recent_checks": [
      {"check": "http", "result": "fail", "latency_ms": 900, "ts": "2025-10-01T00:00:00Z"}
    ]
  },
  "constraints": {"max_steps": 3, "risk": "low"}
}
```

**Status codes**

* `200` – plan generated
* `400` – invalid payload (schema)
* `401/403` – missing/invalid bearer (only if `ADMIN_TOKEN` configured)

### `POST /v1/chat`

Given a query about MatrixHub, returns an answer with citations **if** a local KB is configured at `RAG_KB_PATH`. Uses the same provider cascade.

### `GET /v1/chat/stream` & `POST /v1/chat/stream`

Server-Sent Events (SSE) streaming of token deltas. Production-safe middleware ensures no body buffering and proper headers (`Cache-Control: no-cache`, `X-Trace-Id`, `X-Process-Time-Ms`, `Server-Timing`).
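On the consumer side, reassembling the answer means filtering `data:` lines until the `[DONE]` sentinel. A minimal sketch assuming the event shape shown in the SSE sequence diagram (`{"delta": "..."}` events); a real client would read the HTTP response line by line with `httpx` or the browser's `EventSource`:

```python
import json


def collect_sse(lines: list[str]) -> str:
    """Reassemble token deltas from an SSE body of `data:` events."""
    parts: list[str] = []
    for line in lines:
        if not line.startswith("data:"):
            continue                          # skip comments / keep-alive blanks
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break                             # end-of-stream sentinel
        parts.append(json.loads(payload)["delta"])
    return "".join(parts)


# A stream shaped like the diagram above, already split into lines:
body = ['data: {"delta": "Hel"}', 'data: {"delta": "lo"}', "data: [DONE]"]
```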

---

## Safety & Reliability

* **PII redaction** – tokens/emails are removed from prompts as a pre-filter
* **Strict schema** – JSON plan parsing with safe defaults; rejects unsafe shapes
* **Time-boxed** – short timeouts and bounded retries to providers
* **Rate-limited** – per-IP fixed window (configurable)
* **Structured logs** – JSON logs with `trace_id` for correlation
* **SSE-safe middleware** – never consumes streaming bodies; avoids Starlette "No response returned" pitfalls

---

## RAG (Optional)

* **Embeddings:** `sentence-transformers/all-MiniLM-L6-v2` (GPU-aware)
* **Re-ranking:** optional `cross-encoder/ms-marco-MiniLM-L-2-v2` (GPU-aware)
* **KB:** `data/kb.jsonl` (one JSON object per line: `{ "text": "...", "source": "..." }`)
* **Tunable:** `rag.top_k`, `RAG_RERANK`, `RAG_KB_PATH`
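The KB format is plain JSONL, so loading it and ranking passages is easy to prototype. The sketch below uses toy word-overlap scoring in place of MiniLM embeddings so it runs without model downloads; the helper names are illustrative, not the service's RAG module:

```python
import json


def load_kb(path: str) -> list[dict]:
    """Read a kb.jsonl file: one {"text": ..., "source": ...} object per line."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def top_k(docs: list[dict], query: str, k: int = 4) -> list[dict]:
    """Toy lexical ranking by word overlap; the real service scores with
    sentence-transformers embeddings and optionally a CrossEncoder re-ranker."""
    q = set(query.lower().split())
    return sorted(docs,
                  key=lambda d: len(q & set(d["text"].lower().split())),
                  reverse=True)[:k]


docs = [
    {"text": "MatrixHub exposes status and apps endpoints", "source": "hub.md"},
    {"text": "Unrelated changelog entry", "source": "notes.md"},
]
```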
+
---
|
| 297 |
+
|
| 298 |
+
## Deployments
|
| 299 |
+
|
| 300 |
+
### Hugging Face Spaces (recommended for demo)
|
| 301 |
+
|
| 302 |
+
1. Push repo to a new **Space** (FastAPI).
|
| 303 |
+
2. **Settings → Secrets**:
|
| 304 |
+
|
| 305 |
+
* `GROQ_API_KEY`, `GOOGLE_API_KEY`, `HF_TOKEN` (as needed by cascade)
|
| 306 |
+
* `ADMIN_TOKEN` (optional; gates `/v1/plan` & `/v1/chat*`)
|
| 307 |
+
3. Choose hardware (CPU is fine; GPU improves RAG throughput and cross-encoder).
|
| 308 |
+
4. Space runs `uvicorn` and exposes all endpoints.
|
| 309 |
+
|
| 310 |
+
### Containers / Cloud
|
| 311 |
+
|
| 312 |
+
* Use a minimal Python base, install with `pip install -r requirements.txt`.
|
| 313 |
+
* Expose port `7860` (configurable).
|
| 314 |
+
* Set secrets via your orchestrator (Kubernetes Secrets, ECS, etc.).
|
| 315 |
+
* Scale with multiple Uvicorn workers; put behind an HTTP proxy that supports streaming (e.g., nginx with `proxy_buffering off` for SSE).
|
| 316 |
|
| 317 |
---
|
| 318 |
|
| 319 |
## Observability
|
| 320 |
|
| 321 |
+
* **Trace IDs** (`X-Trace-Id`) attached per request and logged
|
| 322 |
+
* **Timing headers**: `X-Process-Time-Ms`, `Server-Timing`
|
| 323 |
+
* Provider selection logs (e.g., `Provider 'groq' succeeded in 0.82s`)
|
| 324 |
+
* Metrics endpoints can be added behind an auth wall (Prometheus friendly)
|
| 325 |
+
|
| 326 |
+
---
|
| 327 |
+
|
| 328 |
|
| 329 |
---
|
| 330 |
|
|
|
|
| 332 |
|
| 333 |
* Keep `/v1/plan` **internal** behind a network boundary or `ADMIN_TOKEN`.
|
| 334 |
* Validate payloads rigorously (Pydantic) and write contract tests for the plan schema.
|
| 335 |
+
* If you switch models, re-run golden tests to guard against plan drift.
|
| 336 |
+
* Avoid logging sensitive data; logs are structured JSON only.
|
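A contract test for the plan schema can be as small as a shape check on the response fields shown in the planning sequence diagram (`plan_id`, `steps[]`, `risk`, `explanation`). The function below is a hypothetical sketch of such a check, not the service's real Pydantic model; the allowed `risk` values are an assumption extrapolated from the `"risk": "low"` constraint:

```python
def check_plan_contract(plan: dict, max_steps: int = 3) -> list[str]:
    """Return a list of contract violations (empty means the plan passes)."""
    errors: list[str] = []
    if not isinstance(plan.get("plan_id"), str):
        errors.append("plan_id must be a string")
    steps = plan.get("steps")
    if not isinstance(steps, list) or not 0 < len(steps) <= max_steps:
        errors.append(f"steps must be a non-empty list of <= {max_steps} items")
    if plan.get("risk") not in {"low", "medium", "high"}:
        errors.append("risk must be one of low/medium/high")
    if not isinstance(plan.get("explanation"), str):
        errors.append("explanation must be a string")
    return errors


good = {"plan_id": "p-1", "steps": ["restart service"], "risk": "low",
        "explanation": "HTTP check failing; restart is the lowest-risk fix."}
```

Running checks like this against golden responses after every model switch is what catches plan drift before Guardian does.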

---

## License

Apache-2.0

---

**Tip:** The cascade order is controlled by `provider_order` (`groq,gemini,router`). If Groq is rate-limited or missing, the service automatically falls back to Gemini, then the Hugging Face Router (Zephyr → Mistral). Streaming works out of the box and is middleware-safe.