Evaluation as a Goal Surface: Experiments, Learning Boundary, and ETH-Aware A/B
How to design experiments that respect goals, roles, and ethics in SI-Core
Draft v0.1 — Non-normative supplement to SI-Core / SI-NOS / EVAL / Learning Boundary / PoLB docs
Acronym note (art-60):
- PLB is reserved for Pattern-Learning-Bridge (art-60-016).
- This document uses Learning Boundary (LB) to mean the “policy learning boundary” concept.
- Rollout governance (modes/waves/backoff) is written as PoLB (Policy Load Balancer).
This document is non-normative. It describes how to think about and operationalize evaluation in a Structured Intelligence (SI) stack: experiments, A/B tests, shadow runs, and off-policy evaluation — all treated as goal surfaces with ETH and Role/Persona overlays.
Normative behavior lives in the SI-Core / SI-NOS / Jump / ETH / EVAL specs.
0. Conventions used in this draft (non-normative)
This draft follows the portability conventions used in 069/084+ when an artifact might be exported, hashed, or attested (EvalSurfaces, experiment contracts, assignment logs, EvalTrace summaries, approval packets):
- `created_at` is operational time (advisory unless time is attested).
- `as_of` carries markers only (time claim + optional revocation view markers) and SHOULD declare `clock_profile: "si/clock-profile/utc/v1"` when exported.
- `trust` carries digests only (trust anchors + optional revocation view digests). Never mix markers into `trust`.
- `bindings` pins meaning as `{id, digest}` (meaningful identities must not be digest-only).
- Avoid floats in policy-/digest-bound artifacts: prefer scaled integers (`*_bp`, `*_ppm`) and integer micro-/milliseconds.
- If you hash/attest legal/procedural artifacts, declare canonicalization explicitly: `canonicalization: "si/jcs-strict/v1"` and `canonicalization_profile_digest: "sha256:..."`. `digest_rule` strings (when present) are explanatory only; verifiers MUST compute digests using pinned schemas/profiles, not by parsing `digest_rule`.
Numeric conventions used in examples:
- For weights and ratios in `[0, 1]`, export as basis points: `x_bp = round(x * 10000)`.
- For probabilities in `[0, 1]`, export as basis points: `p_bp = round(p * 10000)`.
- For very small probabilities, ppm is acceptable: `p_ppm = round(p * 1_000_000)`.
- If a clause uses a legal threshold with decimals, either:
  - export `value_scaled_int` + explicit `scale`, or
  - keep the raw numeric payload in a referenced artifact and export only its digest (`*_ref` / `*_hash`) plus display-only summaries.
Internal computation may still use floats; the convention here is about exported/hashed representations.
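Where code has to move between internal floats and the exported integer forms, the conversion is mechanical. A minimal sketch, with illustrative helper names (`to_bp`, `to_ppm`) that are not part of any spec:

```python
def to_bp(x: float) -> int:
    """Export a weight, ratio, or probability in [0, 1] as basis points."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("expected a value in [0, 1]")
    return round(x * 10_000)

def to_ppm(p: float) -> int:
    """Export a very small probability in [0, 1] as parts per million."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("expected a probability in [0, 1]")
    return round(p * 1_000_000)
```

Round-tripping exact basis points back to a float (`x_bp / 10000.0`) is acceptable for internal computation; only the exported/hashed representation must be integer.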
1. Why “evaluation as a goal surface”
Most systems treat “evaluation” as something outside the system:
- metrics dashboards bolted onto production,
- A/B platforms glued onto the side,
- offline experiments that no one wires back into governance.
In SI-Core, evaluation is part of the same decision fabric:
- Evaluators see goal surfaces and Jumps, just like decision engines.
- Experiments are just another kind of Jump — with a different goal surface.
- ETH and ID apply just as much to “we’re only experimenting” as they do to “we’re shipping.”
This suggests a design principle:
EVAL is itself a Goal Surface, with its own ETH, ID, and PoLB constraints (within the Learning Boundary).
Concretely:
- We define what we are optimizing (metrics),
- for whom (principals / personas),
- on which agents / roles (what is being evaluated),
- under what safety constraints (ETH, PoLB).
Everything else (A/B tests, shadow runs, off-policy evaluation) is just a different way to explore that evaluation goal surface.
2. What EVAL is in SI-Core
We distinguish:
Metrics — raw measurements (loss, CTR, latency, RBL, SCI, fairness metrics…)
Goal surfaces — multi-objective functions over state/actions (GCS, constraints).
EVAL — the subsystem that:
- defines evaluation goal surfaces,
- designs experiments,
- measures outcomes,
- feeds back into model / Jump / policy updates.
At a high level:
OBS → Jumps / Policies → World Effects
 ▲                           │
 │                           ▼
EVAL ← Metrics ← Logs / Traces / Shadow Runs
The key twist: EVAL is not “outside” this diagram. It is itself goal-native: it runs Jumps, respects ETH, and writes to MEM and ID.
3. Who is being evaluated? (Agents, roles, principals)
Before we talk about “A/B testing”, we must answer three questions:
Who is the subject of evaluation?
- A specific SI agent? (e.g. `si:learning_companion:v2`)
- A specific role? (e.g. `role:teacher_delegate/reading_support`)
- A policy bundle behind a Jump? (e.g. `jump:learning.pick_next_exercise@v2.3.1`)
For whom is this evaluation being done (principal)?
- A learner, a patient, a city resident, a regulator?
From which persona’s vantage point is this reported?
- Learner view, teacher view, ops view, regulator view?
We can encode this as an EvaluationSubject and EvaluationView:
evaluation_subject:
kind: "jump_definition" # or "si_agent" / "role" / "policy_bundle"
id: "jump:learning.pick_next_exercise@v2.3.1"
evaluation_principal:
id: "learner:1234" # whose interests are primary
evaluation_view:
persona_id: "persona:teacher_view"
roles_involved:
- "role:teacher"
- "role:learning_companion"
EVAL should refuse vaguely specified experiments like “see which variant has higher engagement” with no subject and no principal. You must say:
“We are evaluating Jump version v2.3.1 vs v2.2.0 for learners of type X, under teacher/guardian/ops personas.”
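A refusal check of this kind can be a plain validator over the exported spec. A minimal sketch, assuming the `evaluation_subject` / `evaluation_principal` / `evaluation_view` shape shown above (the function name `validate_experiment_spec` is illustrative):

```python
def validate_experiment_spec(spec: dict) -> list[str]:
    """Collect the reasons a vaguely specified experiment should be refused."""
    problems = []
    subject = spec.get("evaluation_subject", {})
    if not subject.get("kind") or not subject.get("id"):
        problems.append("missing evaluation_subject (kind + id)")
    if not spec.get("evaluation_principal", {}).get("id"):
        problems.append("missing evaluation_principal")
    if not spec.get("evaluation_view", {}).get("persona_id"):
        problems.append("missing evaluation_view.persona_id")
    return problems
```

An empty problem list means the experiment at least names its subject, principal, and reporting persona; it says nothing yet about ETH or PoLB compliance.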
4. Evaluation goal surfaces
We model an EvalSurface as a structured goal surface with:
- objectives (metrics to maximize/minimize),
- constraints (ETH, safety, fairness),
- scope (who/where/when),
- subject (what is being varied).
eval_surface:
id: "eval:learning_exercise_selection/v1"
created_at: "2028-04-01T00:00:00Z" # operational time (advisory unless time is attested)
as_of:
time: "2028-04-01T00:00:00Z"
clock_profile: "si/clock-profile/utc/v1"
trust:
trust_anchor_set_id: "si/trust-anchors/example/v1"
trust_anchor_set_digest: "sha256:..."
canonicalization: "si/jcs-strict/v1"
canonicalization_profile_digest: "sha256:..."
bindings:
subject_jump:
id: "jump:learning.pick_next_exercise"
digest: "sha256:..."
subject: "jump:learning.pick_next_exercise"
scope:
domain: "learning"
population: "grade_5_reading_difficulties"
context: "school_hours"
objectives:
primary:
- name: "mastery_gain_7d_bp"
weight_bp: 6000 # 0.60
- name: "wellbeing_score_bp"
weight_bp: 4000 # 0.40
secondary:
- name: "ops_cost_per_session_usd_micros"
weight_bp: -1000 # -0.10 (minimize cost)
constraints:
hard:
- "wellbeing_score_bp >= 7000"
- "no_increase_in_flagged_distress_events == true"
- "no_EthViolation(type='protected_group_harm')"
soft:
- "ops_cost_ratio_bp <= 12000" # <= 1.20 * baseline
roles:
- "role:learning_companion"
- "role:teacher_delegate"
personas:
report_to:
- "persona:teacher_view"
- "persona:ops_view"
This makes evaluation a first-class object:
- EVAL Jumps optimize this eval surface when picking experiment designs.
- ETH overlays can inspect it: “are you treating wellbeing as a primary goal?”
- ID/MEM can trace: “who approved this eval surface?”
5. Experiments as Jumps (E-Jumps)
We treat experiments as a specific class of Jumps:
They are effectful w.r.t. world state (because they actually change assignments).
Their primary “action” is not a business action, but a policy decision:
- choose policy A vs B vs C for a given subject/context,
- decide assignment proportions, durations, PoLB boundaries.
5.1 Experiment Jump request
@dataclass
class ExperimentJumpRequest:
eval_surface: EvalSurface
subject: EvaluationSubject
candidate_policies: list[PolicyVariant] # e.g. Jump versions
population: PopulationDefinition
polb_config: PoLBConfig
eth_overlay: ETHConfig
role_persona: RolePersonaContext
5.2 Experiment Jump draft
@dataclass
class ExperimentJumpDraft:
assignment_scheme: AssignmentScheme
monitoring_plan: MonitoringPlan
stop_rules: StopRuleSet
eval_trace_contract: EvalTraceContract
learning_boundary: LearningBoundarySpec
The core idea:
Experiment design = plan for how to move within the Learning Boundary, while respecting ETH and delivering signal on the eval surface.
6. Learning Boundary (LB) recap
The Learning Boundary (LB) is the structural fence between:
- the “live world” (citizens, learners, patients, infrastructure…), and
- the experimentation / learning machinery.
Very roughly:
World ←→ LB ←→ Policies / Models / SI agents
Key non-normative LB patterns for experiments:
Sandbox / historical (LB)
- world is frozen as logs; we can replay counterfactuals.
- No new risk to principals; ETH is easier.
Shadow (LB)
- new policy runs in parallel (“shadow”), no real effects.
- We compare recommendations to production decisions.
Online (LB)
- new policy actually affects the world.
- Needs ETH-aware constraints, role-aware accountability, and alive stop-rules.
E-Jumps (“Experiment Jumps”) must declare which PoLB mode they operate in (PoLB being the operational interface of the Learning Boundary), and ETH should apply different scrutiny to each mode.
polb_config:
envelope_mode: "online" # sandbox | shadow | online | degraded (see art-60-038)
mode_name: "ONLINE_EXPERIMENTAL_STRATIFIED" # PoLB mode label (see art-60-038)
max_risk_level: "medium" # low | medium | high | emergency
rollout_strategy: "canary" # canary | random | stratified
# Exported form avoids floats: ratios in [0,1] use basis points (bp).
max_population_share_bp: 1000 # 0.10
guarded_by_eth: true
7. ETH-aware A/B and variant experiments
Now, given EvalSurface + PoLB, what does an ETH-aware A/B test look like?
7.1 ETH-aware experiment contract
experiment:
id: "exp:learning_pick_next_exercise_v2_vs_v1"
created_at: "2028-04-10T00:00:00Z" # operational time (advisory unless time is attested)
as_of:
time: "2028-04-10T00:00:00Z"
clock_profile: "si/clock-profile/utc/v1"
trust:
trust_anchor_set_id: "si/trust-anchors/example/v1"
trust_anchor_set_digest: "sha256:..."
canonicalization: "si/jcs-strict/v1"
canonicalization_profile_digest: "sha256:..."
# Pin meaning when exporting/hashing (avoid digest-only meaning).
bindings:
subject_jump:
id: "jump:learning.pick_next_exercise"
digest: "sha256:..."
eval_surface:
id: "eval:learning_exercise_selection/v1"
digest: "sha256:..."
subject: "jump:learning.pick_next_exercise"
eval_surface: "eval:learning_exercise_selection/v1"
polb_config:
envelope_mode: "online" # sandbox | shadow | online | degraded (see art-60-038)
mode_name: "ONLINE_EXPERIMENTAL_STRATIFIED" # can be domain-specific
max_risk_level: "medium" # low | medium | high | emergency
rollout_strategy: "stratified" # canary | random | stratified
max_population_share_bp: 2500 # 0.25
guarded_by_eth: true
variants:
control:
policy: "jump:learning.pick_next_exercise@v1.9.0"
traffic_share_bp: 7500 # 0.75
treatment:
policy: "jump:learning.pick_next_exercise@v2.0.0"
traffic_share_bp: 2500 # 0.25
eth_constraints:
forbid:
- "randomization_by_protected_attribute"
- "higher_exposure_to_risky_content_for_vulnerable_learners"
require:
- "treatment_never_worse_than_control_for_wellbeing_on_avg"
- "continuous_monitoring_of_distress_signals"
- "immediate_abort_on_wellbeing_regression_bp > 500" # 0.05
stop_rules:
min_duration_days: 14
max_duration_days: 60
early_stop_on:
- "clear_superiority (p < 0.01) on mastery_gain_7d_bp"
- "eth_violation_detected"
7.2 ETH + ID at assignment time
Assignment itself is a small Jump:
class VariantAssigner:
def assign(self, principal, context, experiment):
"""
experiment.variants is expected to match the exported contract form:
variants:
control: {policy: "...", traffic_share_bp: 7500}
treatment: {policy: "...", traffic_share_bp: 2500}
"""
# 1) ETH / PoLB gating
if not self.eth_overlay.permits_assignment(principal, context, experiment):
control = experiment.variants["control"]
return control["policy"], "eth_forced_control"
# 2) Build traffic shares from the exported contract (basis points, no floats).
shares_bp = {k: int(v["traffic_share_bp"]) for k, v in experiment.variants.items()}
# 3) Deterministic draw (portable): randomizer SHOULD use a stable digest (e.g. sha256).
variant_id = self.randomizer.draw_bp(
principal_id=principal.id,
experiment_id=experiment.id,
shares_bp=shares_bp,
)
policy = experiment.variants[variant_id]["policy"]
# 4) Log assignment with ID and MEM
self.eval_trace.log_assignment(
principal_id=principal.id,
experiment_id=experiment.id,
variant=variant_id,
role_context=context.role_persona,
)
return policy, "assigned"
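The `draw_bp` randomizer referenced in step 3 can be sketched as a sha256-keyed draw over basis-point shares. Non-normative; the digest construction and `|` field separator below are assumptions, not spec:

```python
import hashlib

def draw_bp(principal_id: str, experiment_id: str, shares_bp: dict[str, int]) -> str:
    """Deterministic, portable variant draw from basis-point traffic shares.

    Hashes (experiment_id, principal_id) with sha256 and maps the digest
    onto [0, 10000), walking shares_bp in insertion order.
    Assumes shares_bp sums to exactly 10_000.
    """
    if sum(shares_bp.values()) != 10_000:
        raise ValueError("traffic shares must sum to 10000 bp")
    digest = hashlib.sha256(f"{experiment_id}|{principal_id}".encode()).digest()
    point = int.from_bytes(digest[:8], "big") % 10_000
    cumulative = 0
    for variant_id, share in shares_bp.items():
        cumulative += share
        if point < cumulative:
            return variant_id
    raise AssertionError("unreachable: shares_bp sums to 10000")
```

Because the draw depends only on stable identifiers, the same principal always lands in the same variant, and an auditor can recompute every assignment from the EvalTrace.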
ETH-aware constraints might:
- forbid certain principals from being randomized (e.g. high-risk medical cases),
- force control for specific groups,
- require stratified randomization to preserve fairness.
8. Shadow evaluation and off-policy evaluation
Not all evaluation must touch the live world.
8.1 Shadow Jumps
A shadow Jump runs a candidate policy in parallel with the real one:
- uses the same OBS bundle, ID, Role/Persona context,
- runs a Jump with RML budget set to NONE (no effectful ops),
- compares decisions + GCS to production behavior.
Contract sketch:
shadow_eval:
id: "shadow:city_flood_policy_v3"
subject: "jump:city.adjust_flood_gates"
polb_config:
envelope_mode: "shadow" # sandbox | shadow | online | degraded (see art-60-038)
mode_name: "SHADOW_PROD" # illustrative PoLB mode label (see art-60-038)
rml_budget: "NONE" # shadow eval must be non-effectful
candidate_policy: "jump:city.adjust_flood_gates@v3.0.0"
baseline_policy: "jump:city.adjust_flood_gates@v2.5.1"
metrics:
- "GCS_delta_safety"
- "GCS_delta_cost"
- "policy_disagreement_rate_bp"
- "ETH_block_rate_bp"
Shadow evaluation is ideal for:
- domains where online A/B is too risky (floodgates, ICU),
- quick sanity checks (does ETH block the new policy constantly?),
- feeding into more precise off-policy estimators.
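One of the shadow metrics above, `policy_disagreement_rate_bp`, is easy to pin down concretely. A sketch, assuming the shadow log has been reduced to (baseline_action, candidate_action) pairs:

```python
def policy_disagreement_rate_bp(shadow_log: list[tuple[str, str]]) -> int:
    """Fraction of decisions where candidate and baseline disagree, in bp.

    shadow_log is assumed to be a list of (baseline_action, candidate_action)
    pairs collected during a non-effectful shadow run.
    """
    if not shadow_log:
        return 0
    disagreements = sum(1 for baseline, candidate in shadow_log if baseline != candidate)
    return round(disagreements * 10_000 / len(shadow_log))
```

A high disagreement rate does not by itself mean the candidate is worse; it means the shadow comparison carries real signal and deserves a closer look (including the `ETH_block_rate_bp` on the disagreeing subset).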
8.2 Off-policy evaluation
Off-policy evaluation (OPE) uses historical logs to estimate how a new policy would have performed, without deploying it.
We won’t specify a particular estimator (IPW, DR, etc.), but we can define a portable interface shape:
class OffPolicyEvaluator:
    def evaluate(self, logs, candidate_policy, eval_surface):
        """Estimate candidate policy performance on eval_surface from logs."""
        estimates = []
        for log in logs:
            context = log.context
            action_taken = log.action
            outcome = log.outcome
            # Importance weight: how likely the candidate is to take the
            # *logged* action, relative to the behavior policy that took it.
            w = self._importance_weight(
                log.behavior_policy_prob,
                candidate_policy.prob(context, action_taken),
            )
            # Value of the logged (action, outcome) pair on the eval surface
            contribution = self._eval_contribution(
                action_taken, outcome, eval_surface
            )
            estimates.append(w * contribution)
        return aggregate_estimates(estimates)
Important invariants:
- OPE must respect ID / MEM: you know which behavior policy generated each log.
- ETH may forbid using some logs for certain kinds of off-policy queries (GDPR, consent).
- OPE is a sandbox (LB) evaluation method — zero new live risk.
9. Role & Persona overlays for EVAL
We now combine EVAL with the Role/Persona overlays of art-60-036.
9.1 Role-aware evaluations
Different roles may be subjects or context for evaluation:
- Evaluate `role:learning_companion` vs `role:teacher_delegate` in a co-teaching scenario.
- Evaluate multi-agent protocols (e.g. a `city_ops` agent + a `flood_model` agent).
We can annotate eval surfaces with explicit roles:
eval_surface:
id: "eval:multi_agent_city_control/v1"
subject:
kind: "multi_agent_protocol"
id: "proto:city_ops+flood_model@v1"
roles_under_test:
- "role:city_operator_ai"
- "role:flood_model_ai"
roles_observing:
- "role:human_city_operator"
EVAL can then:
- attribute metrics to each role (e.g. who proposed what),
- reason about delegation chains: was a bad outcome due to mis-delegation or mis-execution?
9.2 Persona-aware reporting
The same experiment’s outcomes are rendered differently for different personas:
persona_views:
learner_view:
show_metrics:
- "mastery_gain_7d"
- "stress_load"
explanation_style: "simple"
teacher_view:
show_metrics:
- "mastery_gain_7d"
- "curriculum_coverage"
- "risk_flags"
explanation_style: "technical"
regulator_view:
show_metrics:
- "wellbeing_score"
- "fairness_gap_metrics"
- "policy_rollout_pattern"
explanation_style: "regulatory"
EVAL is responsible for:
- projecting metrics into persona views (using MetricProjector),
- adapting explanations (using ExplanationAdapter),
- ensuring that regulators see the structural evaluation setup (EvalSurface, PoLB, ETH), not just outcome metrics.
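The metric-projection half of this can be sketched directly against the `persona_views` shape above. The `MetricProjector` interface itself is only named in this doc, so the function below is an illustrative stand-in:

```python
def project_for_persona(metrics: dict, persona_views: dict, persona: str) -> dict:
    """Project raw experiment metrics into one persona's view.

    persona_views follows the YAML shape in this section:
    each view declares show_metrics and explanation_style.
    """
    view = persona_views[persona]
    return {
        "metrics": {k: metrics[k] for k in view["show_metrics"] if k in metrics},
        "explanation_style": view["explanation_style"],
    }
```

Note this is a projection, not an access-control mechanism: actually withholding metrics from a persona must be enforced by ID/ETH, not by the renderer.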
10. EvalTrace: auditing experiments and interventions
Just as Jumps produce JumpTrace, experiments produce EvalTrace.
eval_trace:
experiment_id: "exp:learning_pick_next_exercise_v2_vs_v1"
subject: "jump:learning.pick_next_exercise"
eval_surface_id: "eval:learning_exercise_selection/v1"
created_at: "2028-04-15T10:00:00Z" # operational time (advisory unless time is attested)
as_of:
time: "2028-04-15T10:00:00Z"
clock_profile: "si/clock-profile/utc/v1"
trust:
trust_anchor_set_id: "si/trust-anchors/example/v1"
trust_anchor_set_digest: "sha256:..."
canonicalization: "si/jcs-strict/v1"
canonicalization_profile_digest: "sha256:..."
bindings:
experiment:
id: "exp:learning_pick_next_exercise_v2_vs_v1"
digest: "sha256:..."
eval_surface:
id: "eval:learning_exercise_selection/v1"
digest: "sha256:..."
assignments:
- principal_id: "learner:1234"
variant: "treatment"
assigned_at: "2028-04-15T10:00:00Z"
role_context: "role:learning_companion"
delegation_chain: ["guardian:777", "teacher:42", "si:learning_companion"]
randomization_seed_digest: "sha256:..." # portable seed identity
reason: "assigned"
outcomes:
window: "7d"
metrics:
treatment:
mastery_gain_7d_bp: 2100
wellbeing_score_bp: 8100
control:
mastery_gain_7d_bp: 1800
wellbeing_score_bp: 8200
ethics:
eth_policy: "eth:learning_v4"
violations_detected: []
fairness_audit:
demographic_breakdown:
group_A: {...}
group_B: {...}
polb:
envelope_mode: "online"
mode_name: "ONLINE_EXPERIMENTAL_STRATIFIED"
canary_phase:
start: "2028-04-10"
end: "2028-04-14"
max_population_share_bp: 500 # 0.05
ramp_up_plan:
steps:
- {date: "2028-04-20", share_bp: 1000}
- {date: "2028-05-01", share_bp: 2500}
EvalTrace is what lets you answer:
- Who designed this experiment?
- On whose behalf?
- With what constraints?
- Did we actually follow our PoLB / ETH / stop-rules?
11. Testing strategies: “evaluate the evaluators”
EVAL itself can go wrong (p-hacking, unsafe randomization, mis-logged assignments). We need tests and governance.
11.1 Unit tests
EvalSurface construction:
- reject missing subject/principal,
- enforce ETH constraints present for high-risk domains.
Variant assignment:
- deterministic under fixed seeds,
- respects traffic shares,
- never randomizes on forbidden attributes.
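The “respects traffic shares” property can be checked exactly, without statistics, by enumerating the basis-point space a deterministic draw maps over. A sketch (the `variant_for_point` helper is illustrative):

```python
def variant_for_point(point: int, shares_bp: dict) -> str:
    """Map a point in [0, 10000) to a variant via cumulative bp shares."""
    cumulative = 0
    for variant, share in shares_bp.items():
        cumulative += share
        if point < cumulative:
            return variant
    raise ValueError("point outside [0, sum(shares_bp))")

def test_exact_share_partition():
    shares = {"control": 7500, "treatment": 2500}
    counts = {"control": 0, "treatment": 0}
    for point in range(10_000):
        counts[variant_for_point(point, shares)] += 1
    # bp shares are honored exactly over the whole point space
    assert counts == shares
```

Determinism under fixed seeds then reduces to the randomizer producing the same point for the same (principal, experiment) pair, which can be asserted separately.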
11.2 Integration tests
- Full pipeline: E-Jump → assignments → metrics → analysis.
- Shadow evaluation: check that RML=NONE truly produces no effects.
- PoLB: verify online experiments never exceed stated population share.
11.3 Property-based tests
Example: no ETH violation due to experiment:
from hypothesis import given
@given(context=gen_contexts(), principal=gen_principals())
def test_assignment_respects_eth(context, principal):
exp = make_test_experiment()
policy, reason = assigner.assign(principal, context, exp)
assert not eth_overlay.is_forbidden_assignment(
principal, context, exp, policy
)
Example: ID consistency — every assignment, outcome, and metric entry can be tied back to the same principal and experiment.
12. Summary: EVAL as a first-class goal surface
In a Structured Intelligence stack, “evaluation” is not an afterthought or an external dashboard. It is:
- a goal surface with explicit objectives and constraints,
- scoped to subjects (agents, roles, Jump definitions),
- defined for principals and rendered through personas,
- executed via E-Jumps inside the Learning Boundary, with PoLB-governed rollout,
- guarded by ETH and Role/Persona overlays,
- fully traceable in EvalTrace under MEM/ID.
This lets you say things like:
“We improved mastery gain by 3 points for learners like X, without harming wellbeing or fairness, via Jump v2.0.0, evaluated in a PoLB-constrained, ETH-approved experiment, and here is the trace that proves it.”
That is what “Evaluation as a Goal Surface” means in the context of SI-Core.
13. Experiment design algorithms
Challenge: An Experiment Jump (E-Jump) should not just say “do an A/B” — it should propose a statistically sound design: sample size, power, stopping rules, and monitoring, all consistent with the EvalSurface and PoLB.
13.1 Sample size calculation
A first layer is a SampleSizeCalculator that uses:
- the primary objective on the EvalSurface,
- a minimum detectable effect (MDE),
- desired power and alpha,
- variance estimates from historical data or simulations.
import numpy as np
from scipy.stats import norm
class SampleSizeCalculator:
def calculate(
self,
eval_surface,
effect_size, # absolute difference you want to detect
power: float = 0.8,
alpha: float = 0.05,
num_variants: int = 2,
):
"""Compute required sample size per variant for the primary metric (normal approx)."""
if effect_size <= 0:
raise ValueError("effect_size must be > 0")
primary_metric = eval_surface.objectives.primary[0]
variance = self._estimate_variance(
primary_metric.name,
eval_surface.scope.population,
)
if variance <= 0:
raise ValueError("variance must be > 0")
z_alpha = norm.ppf(1 - alpha / 2)
z_beta = norm.ppf(power)
# Two-arm z-approximation (common planning approximation)
n_per_variant = 2 * variance * ((z_alpha + z_beta) / effect_size) ** 2
return {
"n_per_variant": int(np.ceil(n_per_variant)),
"total_n": int(np.ceil(n_per_variant * num_variants)),
"assumptions": {
"effect_size": effect_size,
"variance": variance,
"power": power,
"alpha": alpha,
"primary_metric": primary_metric.name,
},
}
In SI-Core terms, this calculator is parameterized by the EvalSurface: the same infrastructure can be used for many experiments, but the primary metric, variance sources, and population are goal-surface–specific.
13.2 Power analysis
During or after an experiment, EVAL may want to ask:
- “Given what we’ve seen so far, how much power do we actually have?”
- “If we stop now, what’s the risk of a false negative?”
import numpy as np
from scipy.stats import t, nct
def power_analysis(eval_surface, observed_n, effect_size, alpha=0.05, variance=None):
"""
Approx power for a two-sided two-sample t-test via noncentral t.
If variance is not provided, this expects an external estimator.
"""
if observed_n <= 2:
return {"power": 0.0, "note": "observed_n too small for df"}
if variance is None:
primary_metric = eval_surface.objectives.primary[0].name
variance = estimate_variance(primary_metric, eval_surface.scope.population) # external hook
if variance <= 0 or effect_size == 0:
return {"power": 0.0, "note": "non-positive variance or zero effect_size"}
cohen_d = effect_size / np.sqrt(variance)
df = observed_n - 2
# Two-sample: ncp ~ d * sqrt(n/2)
ncp = cohen_d * np.sqrt(observed_n / 2)
tcrit = t.ppf(1 - alpha / 2, df=df)
# power = P(T > tcrit) + P(T < -tcrit) under noncentral t
power = (1 - nct.cdf(tcrit, df=df, nc=ncp)) + nct.cdf(-tcrit, df=df, nc=ncp)
return {"power": float(power), "df": int(df), "ncp": float(ncp), "tcrit": float(tcrit)}
This can be wired into monitoring and stop-rules: if power is clearly insufficient given the PoLB constraints, the system can recommend “stop for futility” or “extend duration if ETH permits.”
13.3 Early stopping: sequential testing and ETH
E-Jumps should be able to stop early:
- for efficacy (treatment clearly better),
- for futility (no realistic chance of detecting meaningful effects),
- for harm (ETH / safety issues).
class SequentialTestingEngine:
    def __init__(self, alpha=0.05, spending_function="obrien_fleming"):
        self.alpha = alpha
        self.spending_function = spending_function

    def check_stop(self, experiment, current_data, analysis_number, max_analyses):
        """Sequential testing with alpha spending."""
        import numpy as np
        from scipy.stats import norm

        # O'Brien–Fleming-style alpha spending (illustrative):
        # alpha(t) = 2 * (1 - Phi(z_{alpha/2} / sqrt(t))), t = information fraction,
        # which spends the full alpha exactly at t = 1.
        if self.spending_function == "obrien_fleming":
            z = norm.ppf(1 - self.alpha / 2)
            spent_alpha = 2 * (1 - norm.cdf(
                z / np.sqrt(analysis_number / max_analyses)
            ))
        else:
            spent_alpha = self.alpha / max_analyses  # Bonferroni-style fallback

        test_stat, p_value = self._compute_test_stat(
            current_data, experiment.eval_surface
        )

        # Stop for efficacy
        if p_value < spent_alpha:
            return StopDecision(
                stop=True,
                reason="efficacy",
                effect_estimate=self._estimate_effect(current_data),
                confidence_interval=self._ci(current_data, spent_alpha),
            )

        # Stop for futility
        if self._futility_check(current_data, experiment):
            return StopDecision(stop=True, reason="futility")

        # Stop for harm: ETH-aware check
        if self._harm_check(current_data, experiment.eth_constraints):
            return StopDecision(stop=True, reason="eth_violation_detected")

        return StopDecision(stop=False)
ETH constraints should be able to override purely statistical arguments: if harm is detected, you stop even if power is low or alpha is not yet “spent.”
13.4 E-Jump proposal algorithm
Putting it together, an E-Jump engine can propose a full design:
class ExperimentProposer:
def propose_experiment(self, eval_surface, candidates, polb_config):
"""Propose an E-Jump design consistent with EvalSurface + PoLB."""
# 1. Infer minimum detectable effect from domain norms
effect_size = self._minimum_detectable_effect(eval_surface)
# 2. Calculate required sample size
sample_size = self.sample_size_calc.calculate(
eval_surface,
effect_size,
power=0.8,
alpha=0.05,
num_variants=len(candidates)
)
# 3. Choose rollout strategy based on PoLB and risk
if polb_config.envelope_mode == "online":
if polb_config.max_risk_level in ("high", "emergency"):
rollout = "canary"
initial_share = 0.01
else:
rollout = "stratified"
initial_share = 0.10
else:
rollout = "sandbox" # naming aligned with envelope_mode taxonomy
initial_share = 0.0
assignment = self._design_assignment(
population=eval_surface.scope.population,
candidates=candidates,
sample_size=sample_size,
polb_config=polb_config,
initial_share=initial_share,
)
# 4. Monitoring plan: which metrics, how often, with which alerts
monitoring = MonitoringPlan(
metrics=(
[o.name for o in eval_surface.objectives.primary] +
[c for c in eval_surface.constraints.hard]
),
check_frequency="daily",
alert_thresholds=self._derive_alert_thresholds(eval_surface),
)
# 5. Stop rules (statistical + ETH)
stop_rules = self._design_stop_rules(eval_surface, sample_size)
return ExperimentJumpDraft(
assignment_scheme=assignment,
monitoring_plan=monitoring,
stop_rules=stop_rules,
eval_trace_contract=self._eval_trace_contract(eval_surface),
learning_boundary=polb_config.boundary,
expected_duration=self._estimate_duration(
sample_size, eval_surface.scope.population
),
statistical_guarantees={"power": 0.8, "alpha": 0.05},
)
E-Jumps, in this view, are planners over experiment space: they optimize not just “which variant,” but how to learn about which variant under PoLB/ETH constraints.
14. Multi-objective optimization in evaluation
Challenge: EvalSurfaces are multi-objective: mastery vs wellbeing, cost vs safety, etc. EVAL must reason about trade-offs both when designing experiments and when ranking outcomes.
14.1 Pareto-optimal experiment designs
Different experiment designs can trade off:
- information gain on different metrics,
- risk to participants,
- cost and duration.
class ParetoExperimentOptimizer:
def find_pareto_optimal_experiments(self, eval_surface, candidate_experiments):
"""Find Pareto-optimal experiment designs on multiple criteria."""
evaluations = []
for exp in candidate_experiments:
scores = {}
# Expected information gain per primary objective
for obj in eval_surface.objectives.primary:
scores[obj.name] = self._predict_info_gain(exp, obj)
# Treat risk and cost as additional objectives
scores["risk"] = self._assess_risk(exp, eval_surface)
scores["cost"] = self._estimate_cost(exp)
evaluations.append((exp, scores))
# Compute Pareto frontier
pareto_set = []
for i, (exp_i, scores_i) in enumerate(evaluations):
dominated = False
for j, (exp_j, scores_j) in enumerate(evaluations):
if i == j:
continue
if self._dominates(scores_j, scores_i, eval_surface):
dominated = True
break
if not dominated:
pareto_set.append((exp_i, scores_i))
return pareto_set
Governance can then choose among Pareto-optimal designs based on domain norms (e.g. “always prefer lower risk given similar information gain”).
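The `_dominates` hook used above is standard Pareto dominance. A sketch, assuming the EvalSurface tells us which criteria are maximized (e.g. info gain) and which minimized (risk, cost):

```python
def dominates(scores_a: dict, scores_b: dict, maximize: set, minimize: set) -> bool:
    """True if design A is at least as good as B everywhere and strictly better somewhere."""
    at_least_as_good = all(
        scores_a[k] >= scores_b[k] for k in maximize
    ) and all(
        scores_a[k] <= scores_b[k] for k in minimize
    )
    strictly_better = any(
        scores_a[k] > scores_b[k] for k in maximize
    ) or any(
        scores_a[k] < scores_b[k] for k in minimize
    )
    return at_least_as_good and strictly_better
```

A design survives into the Pareto set exactly when no other candidate dominates it under this predicate.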
14.2 Scalarization with constraint penalties
When a single scalar score is needed (e.g. for bandits or automated ranking), we can scalarize the multi-objective EvalSurface:
def weighted_scalarization(eval_surface, experiment_outcomes):
"""
Convert multi-objective outcomes into a single scalar score.
Portability note:
- Prefer weight_bp in exported EvalSurfaces.
- If only float weights exist (internal sketches), they are accepted.
"""
score = 0.0
# primary/secondary objectives (assumed structured)
for obj in eval_surface.objectives.primary:
w_bp = getattr(obj, "weight_bp", None)
w = (float(w_bp) / 10000.0) if w_bp is not None else float(getattr(obj, "weight", 0.0))
score += w * experiment_outcomes[obj.name]
for obj in eval_surface.objectives.secondary:
w_bp = getattr(obj, "weight_bp", None)
w = (float(w_bp) / 10000.0) if w_bp is not None else float(getattr(obj, "weight", 0.0))
score += w * experiment_outcomes[obj.name]
# hard constraints (assumed strings/expr)
for expr in eval_surface.constraints.hard:
if not check_constraint(expr, experiment_outcomes):
return -1e6 # hard fail
# soft constraints: allow either
# - "expr string"
# - {expr: "...", penalty_weight: 0.1} or {expr: "...", penalty_weight_bp: 1000}
for c in eval_surface.constraints.soft:
if isinstance(c, str):
expr = c
penalty_weight = 1.0
else:
expr = c.get("expr") or c.get("constraint")
if "penalty_weight_bp" in c:
penalty_weight = float(c.get("penalty_weight_bp")) / 10000.0
else:
penalty_weight = float(c.get("penalty_weight", 1.0))
if expr and (not check_constraint(expr, experiment_outcomes)):
score -= penalty_weight
return score
This is non-normative, but illustrates the shape: never silently turn hard constraints into soft preferences.
14.3 Multi-objective bandits
For continuous evaluation and adaptive experiments, we often need a bandit over policies that respects multi-objective EvalSurfaces.
class MultiObjectiveBandit:
"""Thompson sampling over scalarized multi-objective rewards."""
def __init__(self, eval_surface, candidates):
self.eval_surface = eval_surface
self.candidates = candidates
self.posteriors = {
c.id: self._init_posterior() for c in candidates
}
def select_arm(self):
"""Sample from posteriors, then scalarize."""
samples = {}
for cand in self.candidates:
objective_samples = {}
for obj in self.eval_surface.objectives.primary:
objective_samples[obj.name] = (
self.posteriors[cand.id][obj.name].sample()
)
samples[cand.id] = self._scalarize(
objective_samples, self.eval_surface
)
# Candidate with highest sampled scalarized reward
best_id = max(samples, key=samples.get)
return next(c for c in self.candidates if c.id == best_id)
def update(self, cand_id, outcomes):
"""Update per-objective posteriors from observed outcomes."""
for obj in self.eval_surface.objectives.primary:
self.posteriors[cand_id][obj.name].update(outcomes[obj.name])
14.4 Constraint handling strategies
Non-normative but useful patterns:
constraint_handling_strategies:
hard_constraints:
strategy: "Feasibility preservation"
implementation: "Reject candidates violating constraints"
soft_constraints:
strategy: "Penalty method"
implementation: "Add penalty term to scalarized objective"
chance_constraints:
strategy: "Probabilistic satisfaction"
implementation: "Require Pr(constraint satisfied) >= threshold"
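The chance-constraint row can be made concrete with a Monte Carlo check over posterior samples. A sketch (the function name and sample shape are illustrative):

```python
def chance_constraint_satisfied_bp(posterior_samples, constraint, threshold_bp=9500):
    """Check Pr(constraint satisfied) >= threshold from posterior samples.

    posterior_samples: list of sampled outcome dicts.
    constraint: predicate over a single sample.
    threshold_bp: required satisfaction probability in basis points (9500 = 0.95).
    Returns (satisfied, estimated probability in bp).
    """
    if not posterior_samples:
        raise ValueError("need at least one posterior sample")
    satisfied = sum(1 for s in posterior_samples if constraint(s))
    p_bp = round(satisfied * 10_000 / len(posterior_samples))
    return p_bp >= threshold_bp, p_bp
```

In practice the sample count bounds how finely the probability can be resolved, so the threshold should be chosen with the Monte Carlo error in mind.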
15. Continuous evaluation and adaptive experiments
Challenge: Static A/B tests are slow and wasteful when we could be doing continuous learning. But continuous bandits must still respect PoLB and ETH.
15.1 Multi-armed bandit integration
A simple BanditEvaluator wraps a MAB algorithm and plugs into EvalSurface:
class BanditEvaluator:
"""Continuous evaluation via multi-armed bandits."""
def __init__(self, eval_surface, candidates, algorithm="thompson_sampling"):
self.eval_surface = eval_surface
self.candidates = candidates
if algorithm == "thompson_sampling":
self.bandit = ThompsonSamplingBandit(candidates)
elif algorithm == "ucb":
self.bandit = UCBBandit(candidates)
else:
raise ValueError(f"Unknown algorithm: {algorithm}")
def run_episode(self, principal, context):
"""Single evaluation episode for one principal/context."""
candidate = self.bandit.select_arm()
# Execute Jump with selected candidate policy
result = self.execute_jump(principal, context, candidate)
# Measure outcome projected onto EvalSurface
outcome = self.measure_outcome(result, self.eval_surface)
# Update bandit posteriors
self.bandit.update(candidate.id, outcome)
return result
PoLB and ETH decide where bandits are allowed:
- they may be allowed in low-risk learning UX,
- but are forbidden in high-risk city or medical domains, or restricted to shadow mode.
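A deny-by-default gate for bandit episodes can be sketched as below. The domain names and the policy table are illustrative only; in practice the table would come from PoLB/ETH configuration, not code:

```python
# Illustrative policy table: which PoLB envelope modes permit bandit episodes per domain.
ALLOWED_BANDIT_ENVELOPES = {
    "learning_ux": {"online", "shadow", "sandbox"},
    "city_critical": {"shadow", "sandbox"},  # no online bandits
    "medical": {"sandbox"},                  # sandbox/offline only
}

def bandit_envelope_allowed(domain, envelope_mode):
    """Gate check before run_episode(); unknown domains are denied by default."""
    return envelope_mode in ALLOWED_BANDIT_ENVELOPES.get(domain, set())
```

Calling this at the top of `run_episode` keeps the "where bandits are allowed" decision out of the bandit itself.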
15.2 Contextual bandits (personalized evaluation)
For personalization, contextual bandits can choose variants based on features:
class ContextualBanditEvaluator:
"""Personalized evaluation with a contextual bandit."""
def __init__(self, eval_surface, candidates, feature_extractor, priors):
self.eval_surface = eval_surface
self.candidates = candidates
self.feature_extractor = feature_extractor
self.posteriors = priors # e.g., Bayesian linear models per candidate
    def select_candidate(self, context):
        import numpy as np
        features = self.feature_extractor.extract(context)
samples = {}
for cand in self.candidates:
theta_sample = self.posteriors[cand.id].sample()
samples[cand.id] = float(np.dot(theta_sample, features))
best_id = max(samples, key=samples.get)
return next(c for c in self.candidates if c.id == best_id)
def update(self, cand_id, context, outcome):
features = self.feature_extractor.extract(context)
self.posteriors[cand_id].update(features, outcome)
Again, ETH/PoLB overlay must control:
- which features are allowed (no sensitive attributes),
- where context-driven adaptation is permitted.
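Feature gating can be enforced before extraction ever reaches the bandit. A minimal allowlist filter, assuming features arrive as a flat dict (the field names below are made up for illustration):

```python
def filter_features(features, allowlist):
    """Drop any context feature not explicitly allowlisted.

    Returns (allowed_features, denied_names) so the denial set can be
    logged into EvalTrace for audit.
    """
    allowed = {k: v for k, v in features.items() if k in allowlist}
    denied = sorted(set(features) - set(allowlist))
    return allowed, denied
```

An allowlist (rather than a blocklist of sensitive attributes) fails closed: a newly added feature is excluded until someone affirmatively approves it.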
15.3 Adaptive experiment design
An AdaptiveExperimentDesigner can reallocate traffic as evidence accrues:
class AdaptiveExperimentDesigner:
"""Adapt experiment allocations based on accumulated evidence."""
def adapt_traffic_allocation(self, experiment, current_results):
"""Reallocate traffic to better-performing variants."""
posteriors = {}
for variant in experiment.variants:
posteriors[variant.id] = self._compute_posterior(
variant, current_results
)
# Probability each variant is best
prob_best = {}
for variant in experiment.variants:
prob_best[variant.id] = self._prob_best(
posteriors, variant.id
)
# Minimum allocation (e.g. 5%) to keep learning and avoid starvation
new_allocations = {}
for variant in experiment.variants:
new_allocations[variant.id] = max(
0.05,
prob_best[variant.id],
)
# Normalize to sum to 1
total = sum(new_allocations.values())
return {k: v / total for k, v in new_allocations.items()}
Regret bounds (TS, UCB, contextual bandits) can be part of EvalSurface commentary (“we accept at most X regret over T steps for this domain”).
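Such commentary can be backed by an order-of-magnitude check. A sketch using the generic O(&#8730;(K·T·log T)) worst-case shape shared by TS/UCB-style bandits (the constant `c` is problem-dependent; treat the result as a sanity check, not a guarantee):

```python
import math

def worst_case_regret_bound(num_arms, horizon, c=2.0):
    """Illustrative O(sqrt(K * T * log T)) regret bound for TS/UCB-style bandits."""
    return c * math.sqrt(num_arms * horizon * math.log(horizon))

def within_regret_budget(num_arms, horizon, budget):
    """Does the stated regret budget cover the worst-case bound for this run?"""
    return worst_case_regret_bound(num_arms, horizon) <= budget
```

A budget that fails this check suggests either shrinking the horizon, pruning candidates, or moving the experiment to shadow mode.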
16. Causal inference for evaluation
Challenge: Even with experiments, we face confounding, selection bias, and heterogeneous effects. For off-policy evaluation and quasi-experiments, EVAL should expose a causal layer.
16.1 Basic causal effect estimation
A simple IPW-style CausalEvaluator:
class CausalEvaluator:
    """Causal inference utilities for evaluation."""
    def estimate_treatment_effect(self, data, treatment_var, outcome_var, covariates):
        """Estimate a causal effect adjusting for confounders (self-normalized IPW)."""
        import numpy as np
        from sklearn.linear_model import LogisticRegression
        # Propensity score model
        ps_model = LogisticRegression()
        ps_model.fit(data[covariates], data[treatment_var])
        propensity = ps_model.predict_proba(data[covariates])[:, 1]
        treated = (data[treatment_var] == 1).to_numpy()
        weights = np.where(treated, 1 / propensity, 1 / (1 - propensity))
        y = data[outcome_var].to_numpy()
        # Hajek (self-normalized) IPW: weighted means within each arm,
        # not raw means of weighted outcomes (which mis-normalizes the estimate)
        ate = (
            np.sum(weights[treated] * y[treated]) / np.sum(weights[treated]) -
            np.sum(weights[~treated] * y[~treated]) / np.sum(weights[~treated])
        )
        return {
            "ate": ate,
            "std_error": self._bootstrap_se(
                data, treatment_var, outcome_var, weights
            ),
        }
This fits naturally with off-policy evaluation: logs already include behavior policy information (ID/MEM).
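Since each log entry carries the behavior policy's action probability, a self-normalized importance sampling (SNIPS) estimator for a candidate policy's value can be sketched directly over those logs. The entry field names below are assumptions for illustration, not a schema defined in this draft:

```python
def snips_estimate(logs, target_policy):
    """Self-normalized importance sampling (SNIPS) off-policy value estimate.

    logs: iterable of {"context", "action", "propensity", "reward"} entries,
          where "propensity" is the behavior policy's probability of the
          logged action (from ID/MEM).
    target_policy(context, action): the target policy's action probability.
    """
    num, den = 0.0, 0.0
    for entry in logs:
        w = target_policy(entry["context"], entry["action"]) / entry["propensity"]
        num += w * entry["reward"]
        den += w
    if den == 0.0:
        raise ValueError("no overlap between behavior and target policy")
    # Self-normalization trades a small bias for much lower variance than plain IPS
    return num / den
```

Weight clipping and effective-sample-size diagnostics are the usual next steps before trusting such an estimate.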
16.2 Heterogeneous treatment effects (HTE)
EVAL should be able to ask:
- “For whom does this policy help or hurt?”
class HTEEstimator:
"""Estimate conditional average treatment effects (CATE)."""
def estimate_cate(self, data, treatment, outcome, features):
"""Return a function mapping features → CATE estimate."""
from sklearn.ensemble import RandomForestRegressor
treated_data = data[data[treatment] == 1]
control_data = data[data[treatment] == 0]
model_treated = RandomForestRegressor().fit(
treated_data[features], treated_data[outcome]
)
model_control = RandomForestRegressor().fit(
control_data[features], control_data[outcome]
)
def cate(x):
return (
model_treated.predict([x])[0] -
model_control.predict([x])[0]
)
return cate
ETH overlays can then enforce constraints like “no group’s CATE is significantly negative on wellbeing.”
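A minimal sketch of such a group-level audit, reusing the `cate` function returned by `HTEEstimator` (the grouping structure is an assumption; a production check would also require statistical significance, e.g. a confidence interval per group, before flagging):

```python
def audit_group_cate(cate_fn, groups, threshold=0.0):
    """Flag groups whose average CATE falls below a threshold.

    groups: {group_name: [feature_vectors]} — e.g. members sampled per cohort.
    Returns {group_name: avg_cate} for every flagged group.
    """
    flagged = {}
    for name, members in groups.items():
        avg = sum(cate_fn(x) for x in members) / len(members)
        if avg < threshold:
            flagged[name] = avg
    return flagged
```

A non-empty result would trip the corresponding ETH hard constraint and feed the experiment's stop rules.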
16.3 Instrumental variables
For domains where randomization isn’t possible, instrumental variables provide another tool:
def instrumental_variable_estimation(data, instrument, treatment, outcome, controls):
"""Two-stage least squares (2SLS) estimation."""
import statsmodels.api as sm
import pandas as pd
# First stage: treatment ~ instrument + controls
first_stage = sm.OLS(
data[treatment],
sm.add_constant(data[[instrument] + controls])
).fit()
treatment_hat = first_stage.fittedvalues
# Second stage: outcome ~ treatment_hat + controls
regressors = pd.DataFrame(
{"treatment_hat": treatment_hat, **{c: data[c] for c in controls}}
)
second_stage = sm.OLS(
data[outcome],
sm.add_constant(regressors)
    ).fit()
    # Caveat: the second-stage SEs reported here are not the correct 2SLS SEs
    # (they use residuals with respect to treatment_hat, not the actual treatment);
    # prefer a dedicated IV estimator (e.g. linearmodels' IV2SLS) for inference.
    return {
        "effect": second_stage.params["treatment_hat"],
        "se": second_stage.bse["treatment_hat"],
        "first_stage_f": first_stage.fvalue,
    }
The point is not to prescribe a particular causal toolkit, but to make causal thinking a first-class part of EvalSurface design (e.g. fields like causal_assumptions, identification_strategy).
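Such fields might look like the following (non-normative; the field names and enum values are illustrative, not defined by this draft):

```yaml
eval_surface:
  causal:
    identification_strategy: "instrumental_variable"   # or "randomized", "ipw", "difference_in_differences"
    causal_assumptions:
      - "instrument affects outcome only through treatment (exclusion restriction)"
      - "no unmeasured confounding of the instrument"
    sensitivity_analysis: "bootstrap"   # how robustness to assumption violations is reported
```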
17. Performance and scalability of evaluation systems
Challenge: Evaluation infrastructure itself must scale: variant assignment, ETH checks, logging, and metric aggregation must fit tight latency budgets.
17.1 Scaling the assignment service
A ScalableAssignmentService for high QPS:
class ScalableAssignmentService:
"""Low-latency, high-throughput variant assignment."""
def __init__(self):
self.experiment_cache = ExperimentCache() # experiment configs
self.assignment_cache = AssignmentCache() # deterministic assignments
self.async_logger = AsyncLogger()
def assign(self, principal_id, experiment_id, context):
"""Sub-10ms p99 assignment path."""
# 1. Check deterministic assignment cache
cached = self.assignment_cache.get(principal_id, experiment_id)
if cached:
return cached
# 2. Load experiment config (usually from in-memory cache)
experiment = self.experiment_cache.get(experiment_id)
# 3. Fast, stateless assignment (no DB write on critical path)
variant = self._fast_assign(principal_id, experiment)
# 4. Async logging into EvalTrace / MEM
self.async_logger.log_assignment(
principal_id=principal_id,
experiment_id=experiment_id,
variant_id=variant.id,
context=context,
)
return variant
def _fast_assign(self, principal_id, experiment):
"""
Deterministic assignment honoring traffic shares.
Avoid language/runtime-dependent hash() (Python hash is salted per process).
Use a stable digest (sha256) and basis-point buckets for portability.
"""
import hashlib
key = f"{principal_id}:{experiment.id}:{experiment.salt}".encode("utf-8")
digest = hashlib.sha256(key).digest()
# 0..9999 bucket (basis points)
bucket = int.from_bytes(digest[:4], "big") % 10000
cumulative_bp = 0
for variant in experiment.variants:
share_bp = getattr(variant, "traffic_share_bp", None)
if share_bp is None:
# Non-normative fallback (local conversion): prefer explicit *_bp in exported contracts.
share = float(getattr(variant, "traffic_share", 0.0))
share_bp = int(round(share * 10000))
cumulative_bp += int(share_bp)
if bucket < cumulative_bp:
return variant
return experiment.control # fallback
17.2 Streaming metrics aggregation
For large experiments, metrics must be aggregated in streaming fashion:
class StreamingMetricsAggregator:
"""Real-time metrics aggregation with bounded memory."""
def __init__(self):
self.sketches = {} # keyed by (experiment, variant, metric)
def _key(self, experiment_id, variant_id, metric_name):
return (experiment_id, variant_id, metric_name)
def update(self, experiment_id, variant_id, metric_name, value):
key = self._key(experiment_id, variant_id, metric_name)
if key not in self.sketches:
self.sketches[key] = self._init_sketch(metric_name)
# e.g. t-digest or similar
self.sketches[key].update(value)
def query(self, experiment_id, variant_id, metric_name, quantile=0.5):
key = self._key(experiment_id, variant_id, metric_name)
return self.sketches[key].quantile(quantile)
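The `_init_sketch` helper is left abstract above; a t-digest would be the typical choice for quantiles. For means and variances, one concrete bounded-memory aggregate with the same `update`/query shape is Welford's algorithm (sketch, not tied to any particular library):

```python
class StreamingMeanVar:
    """Welford's online algorithm: bounded-memory mean/variance aggregate."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0   # sum of squared deviations from the running mean
    def update(self, value):
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)
    def variance(self):
        """Sample variance; numerically stable even for long streams."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0
```

Both this and quantile sketches merge cleanly across shards, which is what makes per-(experiment, variant, metric) keying viable at scale.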
17.3 Performance budgets
Non-normative but useful target budgets:
latency_budgets_p99:
assignment_service: 10ms
eth_check: 5ms
metrics_logging: 2ms # async
throughput_targets:
assignments_per_second: 100000
metrics_updates_per_second: 1000000
These should be treated as part of the EvalSurface for ops/persona: evaluation that is too slow can itself violate goals (latency, cost).
18. Experiment governance and approval processes
Challenge: Experiments — especially online and safety-critical ones — need structured governance, not ad-hoc “ship and watch the dashboards.”
18.1 Approval workflow
A non-normative experiment approval workflow:
experiment_approval_workflow:
stage_1_proposal:
required:
- eval_surface
- polb_config
- eth_constraints
- sample_size_justification
submitted_by: "experiment_designer"
stage_2_risk_assessment:
assessor: "domain_expert + ethics_board_representative"
criteria:
- "PoLB mode appropriate for domain risk"
- "Hard constraints cover key safety concerns"
- "Max population share within policy limits"
- "Stop rules adequate for ETH + statistics"
outputs:
- "risk_level: low|medium|high"
- "required_reviewers"
stage_3_review:
low_risk:
reviewers: ["technical_lead"]
turnaround: "2 business days"
medium_risk:
reviewers: ["technical_lead", "domain_expert"]
turnaround: "5 business days"
high_risk:
reviewers: ["technical_lead", "domain_expert", "ethics_board", "legal"]
turnaround: "10 business days"
requires: "formal_ethics_committee_approval"
stage_4_monitoring:
automated:
- "stop_rule_checks"
- "eth_violation_detection"
human:
- "weekly_review_for_high_risk"
- "monthly_review_for_medium_risk"
18.2 Risk rubric
A simple risk assessor for experiments:
class ExperimentRiskAssessor:
def assess_risk(self, experiment):
"""Coarse risk score for experiment governance (portable, bp-friendly)."""
score = 0
envelope_mode = getattr(experiment.polb_config, "envelope_mode", None)
# PoLB envelope: online > shadow > sandbox/offline
if envelope_mode == "online":
score += 3
elif envelope_mode == "shadow":
score += 1
elif envelope_mode == "sandbox":
score += 0
# Domain risk
domain = getattr(getattr(experiment, "subject", None), "domain", None)
if domain in ["medical", "city_critical", "finance"]:
score += 3
# Population share (export-friendly): basis points
max_share_bp = getattr(experiment.polb_config, "max_population_share_bp", None)
if max_share_bp is None:
# Missing explicit bp fields is itself a governance risk (be conservative).
score += 2
else:
            if int(max_share_bp) > 2500:  # > 25% of population
score += 2
# Hard constraints present?
hard = getattr(getattr(experiment, "eth_constraints", None), "hard", None)
if not hard:
score += 2
# Vulnerable populations
if self._involves_vulnerable_pop(experiment.population):
score += 3
if score <= 3:
return "low"
if score <= 7:
return "medium"
return "high"
Risk level can automatically drive:
- required reviewers,
- PoLB modes allowed,
- extra logging / monitoring requirements.
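That driving logic can be a plain lookup. The reviewer sets below mirror the workflow in 18.1; the envelope restrictions per risk level are an illustrative assumption:

```python
# Reviewer sets follow the stage_3_review table in 18.1; envelope limits are illustrative.
GOVERNANCE_BY_RISK = {
    "low": {
        "reviewers": ["technical_lead"],
        "allowed_envelopes": {"online", "shadow", "sandbox"},
    },
    "medium": {
        "reviewers": ["technical_lead", "domain_expert"],
        "allowed_envelopes": {"shadow", "sandbox"},
    },
    "high": {
        "reviewers": ["technical_lead", "domain_expert", "ethics_board", "legal"],
        "allowed_envelopes": {"sandbox"},
    },
}

def governance_for(risk_level):
    """Map an assessed risk level to required reviewers and permitted PoLB envelopes."""
    return GOVERNANCE_BY_RISK[risk_level]
```

Keeping this mapping declarative makes it exportable alongside the EvalSurface, so approval packets can pin exactly which governance applied.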
18.3 Ethics committee integration
For high-risk experiments, you typically require ethics committee involvement:
ethics_committee_review:
triggered_by:
- "risk_level == high"
- "involves_vulnerable_populations"
- "novel_experimental_design"
review_packet:
- "Experiment proposal (EvalSurface + PoLB)"
- "Risk assessment report"
- "Informed consent procedures (if applicable)"
- "Data handling and retention plan"
- "Monitoring and stop rules"
- "Post-experiment analysis and debrief plan"
committee_decision:
- "Approved as proposed"
- "Approved with modifications"
- "Deferred pending additional information"
- "Rejected"
Because EvalSurfaces and E-Jumps are structurally defined, the governance layer can reason over them directly — rather than reading informal design docs — and MEM/ID can keep a durable record of exactly what was approved.